Policy composition in reinforcement learning via multi-objective policy optimization

Computing Research Repository (CoRR), 2023

Abstract
We enable reinforcement learning agents to learn successful behavior policies by utilizing relevant pre-existing teacher policies. The teacher policies are introduced as objectives, in addition to the task objective, in a multi-objective policy optimization setting. Using the Multi-Objective Maximum a Posteriori Policy Optimization algorithm \citep{abdolmaleki2020distributional}, we show that teacher policies can help speed up learning, particularly in the absence of shaping rewards. In two domains with continuous observation and action spaces, our agents successfully compose teacher policies in sequence and in parallel, and are also able to further extend the policies of the teachers in order to solve the task. Depending on the specified combination of task and teacher(s), the teacher(s) may naturally act to limit the final performance of an agent. The extent to which agents are required to adhere to teacher policies is determined by hyperparameters that govern both the effect of the teachers on learning speed and the eventual performance of the agent on the task. In the {\tt humanoid} domain \citep{deepmindcontrolsuite2018}, we also equip agents with the ability to control the selection of teachers. With this ability, agents are able to meaningfully compose from the teacher policies to achieve a higher task reward on the {\tt walk} task than in cases without access to the teacher policies. We demonstrate the resemblance of the composed task policies to the corresponding teacher policies through videos.
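To make the composition mechanism concrete, here is a minimal, hedged sketch of an MO-MPO-style policy-improvement step in which one teacher policy is treated as an extra objective next to the task objective. This is not the authors' implementation; all names (mo_mpo_weights, fit_gaussian, the temperatures eta_task and eta_teacher) are our own illustration of the idea that a smaller per-objective temperature forces closer adherence to that objective.

```python
import numpy as np

def mo_mpo_weights(task_q, teacher_logp, eta_task=1.0, eta_teacher=1.0):
    # Each objective k reweights the sampled actions by exp(score_k / eta_k);
    # subtracting the max keeps the exponentials numerically stable.
    w_task = np.exp((task_q - task_q.max()) / eta_task)
    w_teacher = np.exp((teacher_logp - teacher_logp.max()) / eta_teacher)
    return w_task / w_task.sum(), w_teacher / w_teacher.sum()

def fit_gaussian(actions, weights):
    # Weighted maximum-likelihood fit of a diagonal Gaussian: the supervised
    # "policy fitting" step that MPO-style methods use after reweighting.
    mean = np.average(actions, axis=0, weights=weights)
    var = np.average((actions - mean) ** 2, axis=0, weights=weights)
    return mean, np.sqrt(var + 1e-8)

# Toy usage with 1-D actions: the task prefers a = 1.5, the teacher prefers
# a = -0.5; the fitted policy lands in between, i.e. the objectives compose.
rng = np.random.default_rng(0)
actions = rng.normal(0.0, 1.0, size=(256, 1))      # sampled from the current policy
task_q = -(actions[:, 0] - 1.5) ** 2               # task objective (e.g. a Q estimate)
teacher_logp = -0.5 * (actions[:, 0] + 0.5) ** 2   # teacher log-density (up to a constant)

w_task, w_teacher = mo_mpo_weights(task_q, teacher_logp)
# Fitting one policy to all per-objective targets at once reduces to fitting
# against the averaged weights when the targets share the same samples.
mean, std = fit_gaussian(actions, 0.5 * (w_task + w_teacher))
print(f"composed policy: mean={mean[0]:+.2f}, std={std[0]:.2f}")
```

Lowering eta_teacher pulls the composed mean toward the teacher's preferred action, mirroring the adherence hyperparameters the abstract describes: stronger adherence can speed up early learning but may cap final task reward.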
Key words
policy composition, reinforcement learning, optimization, multi-objective

Key points: This paper proposes a reinforcement learning method based on multi-objective policy optimization that uses pre-existing teacher policies as additional learning objectives, in order to speed up learning and improve task performance.

Methods: Teacher policies are introduced into a multi-objective policy optimization framework, so that teacher objectives are considered alongside the task objective; learning proceeds with the Multi-Objective Maximum a Posteriori Policy Optimization algorithm (a compact form of the corresponding per-objective update is sketched below).
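As a hedged sketch of the update behind this method, following the MO-MPO formulation of \citep{abdolmaleki2020distributional} (the notation below is illustrative, not taken from the paper): each objective $k$, whether a task value estimate or a teacher policy, induces its own reweighted target distribution, and the new policy is fit to all targets jointly:

$$q_k(a \mid s) \;\propto\; \pi_{\text{old}}(a \mid s)\,\exp\!\big(Q_k(s,a)/\eta_k\big), \qquad \pi_{\text{new}} \;=\; \arg\min_{\pi}\; \sum_k \mathbb{E}_{s}\!\left[\mathrm{KL}\!\big(q_k(\cdot \mid s)\,\big\|\,\pi(\cdot \mid s)\big)\right]$$

The per-objective temperatures $\eta_k$ play the role of the adherence hyperparameters described in the abstract: tightening a teacher's objective shapes learning more strongly, at the possible cost of final task reward.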

Experiments: Experiments are conducted in two domains with continuous observation and action spaces, including the {\tt humanoid} domain, and show that agents can successfully compose teacher policies in sequence and in parallel, and can extend the teacher policies to solve the task in the absence of shaping rewards. The results also show that, by controlling the selection of teachers, agents achieve a higher task reward on the {\tt walk} task than when they have no access to the teacher policies.