6 articles found
Boundary Data Augmentation for Offline Reinforcement Learning
1
Authors: SHEN Jiahao, JIANG Ke, TAN Xiaoyang. ZTE Communications, 2023, No. 3, pp. 29-36 (8 pages)
Offline reinforcement learning (ORL) aims to learn a rational agent purely from behavior data without any online interaction. One of the major challenges encountered in ORL is the problem of distribution shift, i.e., the mismatch between the knowledge of the learned policy and the reality of the underlying environment. Recent works usually handle this in an overly pessimistic manner, avoiding out-of-distribution (OOD) queries as much as possible, but this can affect the robustness of the agents at unseen states. In this paper, we propose a simple but effective method to address this issue. The key idea of our method is to enhance the robustness of the new policy learned offline by weakening its confidence in highly uncertain regions. We propose to find those regions by simulating them with modified Generative Adversarial Nets (GAN), such that the generated data not only follow the same distribution as the old experience but are also very difficult to deal with, with regard to the behavior policy or some other reference policy. We then use this information to regularize the ORL algorithm to penalize overconfident behavior in these regions. Extensive experiments on several publicly available offline RL benchmarks demonstrate the feasibility and effectiveness of the proposed method.
Keywords: offline reinforcement learning; out-of-distribution state; robustness; uncertainty
Offline Reinforcement Learning with Constrained Hybrid Action Implicit Representation Towards Wargaming Decision-Making
2
Authors: Liwei Dong, Ni Li, Guanghong Gong, Xin Lin. Tsinghua Science and Technology (indexed in SCIE, EI, CAS, CSCD), 2024, No. 5, pp. 1422-1440 (19 pages)
Reinforcement Learning (RL) has emerged as a promising data-driven solution for wargaming decision-making. However, two domain challenges still exist: (1) dealing with discrete-continuous hybrid wargaming control and (2) accelerating RL deployment with rich offline data. Existing RL methods fail to handle these two issues simultaneously, so we propose a novel offline RL method targeting the hybrid action space. A new constrained action representation technique is developed to build a bidirectional mapping between the original hybrid action space and a latent space in a semantically consistent way. This allows learning a continuous latent policy with offline RL, with better exploration feasibility and scalability, and reconstructing it back into the required hybrid policy. Critically, a novel offline RL optimization objective with adaptively adjusted constraints is designed to balance the alleviation and generalization of out-of-distribution actions. Our method demonstrates superior performance and generality across different tasks, particularly in typical realistic wargaming scenarios.
Keywords: offline reinforcement learning (RL); wargaming; decision-making; hybrid action space
OSCAR: OOD State-Conservative Offline Reinforcement Learning for Sequential Decision Making
3
Authors: Yi Ma, Chao Wang, Chen Chen, Jinyi Liu, Zhaopeng Meng, Yan Zheng, Jianye Hao. CAAI Artificial Intelligence Research, 2023, No. 1, pp. 91-101 (11 pages)
Offline reinforcement learning (RL) is a data-driven learning paradigm for sequential decision making. Mitigating the overestimation of values originating from out-of-distribution (OOD) states, induced by the distribution shift between the learning policy and the previously collected offline dataset, lies at the core of offline RL. To tackle this problem, some methods underestimate the values of states given by learned dynamics models, or of state-action pairs with actions sampled from policies different from the behavior policy. However, since these generated states or state-action pairs are not guaranteed to be OOD, staying conservative on them may adversely affect the in-distribution ones. In this paper, we propose an OOD state-conservative offline RL method (OSCAR), which aims to address this limitation by explicitly generating reliable OOD states that are located near the manifold of the offline dataset, and then design a conservative policy evaluation approach that combines the vanilla Bellman error with a regularization term that only underestimates the values of these generated OOD states. In this way, we can prevent the value errors of OOD states from propagating to in-distribution states through value bootstrapping and policy improvement. We also theoretically prove that the proposed conservative policy evaluation approach is guaranteed to underestimate the values of OOD states. OSCAR, along with several strong baselines, is evaluated on the offline decision-making benchmark D4RL and the autonomous driving benchmark SMARTS. Experimental results show that OSCAR outperforms the baselines on a large portion of the benchmarks and attains the highest average return, substantially outperforming existing offline RL methods.
Keywords: offline reinforcement learning; out-of-distribution; decision making
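The abstract above describes a policy evaluation objective that adds, to the vanilla Bellman error, a regularization term acting only on generated OOD states. The following is a minimal sketch of that general shape, not the OSCAR objective itself: the function name, the mean-value penalty, and the weight `beta` are all illustrative assumptions.

```python
import numpy as np

def conservative_value_loss(v_in, td_target, v_ood, beta=1.0):
    """Hypothetical loss: Bellman error on dataset states plus an OOD penalty.

    v_in      : value estimates at in-distribution (dataset) states
    td_target : bootstrapped targets for those same states
    v_ood     : value estimates at generated OOD states
    beta      : assumed weight of the conservative term
    """
    # Vanilla Bellman (TD) error, computed only on in-distribution states.
    bellman_error = np.mean((v_in - td_target) ** 2)
    # Minimizing this term pushes values down at OOD states only.
    ood_penalty = beta * np.mean(v_ood)
    return bellman_error + ood_penalty

loss = conservative_value_loss(
    v_in=np.array([1.0, 2.0]),
    td_target=np.array([1.0, 2.0]),
    v_ood=np.array([3.0]),
    beta=0.5,
)
```

Because the penalty touches only `v_ood`, in-distribution estimates are shaped by the Bellman term alone, which is the separation the abstract emphasizes.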
A Practical Reinforcement Learning Framework for Automatic Radar Detection
4
Authors: YU Junpeng, CHEN Yiyu. ZTE Communications, 2023, No. 3, pp. 22-28 (7 pages)
At present, the parameters of radar detection rely heavily on manual adjustment and empirical knowledge, resulting in low automation. Traditional manual adjustment methods cannot meet the requirements of modern radars for high efficiency, high precision, and high automation. Therefore, it is necessary to explore a new intelligent radar control learning framework and technology to improve the capability and automation of radar detection. Reinforcement learning is popular in decision task learning, but the shortage of samples in radar control tasks makes it difficult to meet the requirements of reinforcement learning. To address the above issues, we propose a practical radar operation reinforcement learning framework, and integrate offline reinforcement learning and meta-reinforcement learning methods to alleviate the sample requirements of reinforcement learning. Experimental results show that our method can automatically perform as humans do in radar detection with real-world settings, thereby promoting the practical application of reinforcement learning in radar operation.
Keywords: meta-reinforcement learning; radar detection; reinforcement learning; offline reinforcement learning
Robust Offline Actor-Critic With On-policy Regularized Policy Evaluation
5
Authors: Shuo Cao, Xuesong Wang, Yuhu Cheng. IEEE/CAA Journal of Automatica Sinica (indexed in CSCD), 2024, No. 12, pp. 2497-2511 (15 pages)
To alleviate the extrapolation error and instability inherent in a Q-function learned directly by off-policy Q-learning (QL-style) on static datasets, this article utilizes on-policy state-action-reward-state-action (SARSA-style) learning to develop an offline reinforcement learning (RL) method termed robust offline Actor-Critic with on-policy regularized policy evaluation (OPRAC). With the help of SARSA-style bootstrap actions, a conservative on-policy Q-function and a penalty term for matching the on-policy and off-policy actions are jointly constructed to regularize the optimal Q-function of off-policy QL-style. This naturally equips the off-policy QL-style policy evaluation with the intrinsic pessimistic conservatism of on-policy SARSA-style, thus facilitating the acquisition of a stable estimated Q-function. Even with limited data sampling errors, the convergence of the Q-function learned by OPRAC and the controllability of the upper bound on the bias between the learned Q-function and its true Q-value can be theoretically guaranteed. In addition, the sub-optimality of the learned optimal policy stems merely from sampling errors. Experiments on the well-known D4RL Gym-MuJoCo benchmark demonstrate that OPRAC can rapidly learn robust and effective task-solving policies owing to the stable estimate of the Q-value, outperforming state-of-the-art offline RL methods by at least 15%.
Keywords: offline reinforcement learning; off-policy QL-style; on-policy SARSA-style; policy evaluation (PE); Q-value estimation
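The contrast between QL-style and SARSA-style bootstrapping that the abstract above relies on can be illustrated with the two textbook tabular targets. This is a sketch of the standard definitions only, not of OPRAC; the table `q`, the function names, and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def ql_target(q, reward, next_state, gamma=0.99):
    # Off-policy Q-learning target: bootstraps from the greedy action,
    # which may be an action never observed in the static dataset.
    return reward + gamma * np.max(q[next_state])

def sarsa_target(q, reward, next_state, next_action, gamma=0.99):
    # On-policy SARSA target: bootstraps from the next action that was
    # actually recorded in the dataset transition.
    return reward + gamma * q[next_state, next_action]

# Toy Q-table with 2 states and 2 actions (hypothetical values).
q = np.array([[0.0, 0.0],
              [1.0, 5.0]])

t_ql = ql_target(q, reward=1.0, next_state=1, gamma=0.5)
t_sarsa = sarsa_target(q, reward=1.0, next_state=1, next_action=0, gamma=0.5)
```

For the same transition the SARSA-style target can never exceed the QL-style one, since the maximum over actions is at least the value of any single action. This is one way to read the "intrinsic pessimistic conservatism" mentioned in the abstract.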
Offline Pre-trained Multi-agent Decision Transformer (cited by: 3)
6
Authors: Linghui Meng, Muning Wen, Chenyang Le, Xiyun Li, Dengpeng Xing, Weinan Zhang, Ying Wen, Haifeng Zhang, Jun Wang, Yaodong Yang, Bo Xu. Machine Intelligence Research (indexed in EI, CSCD), 2023, No. 2, pp. 233-248 (16 pages)
Offline reinforcement learning leverages previously collected offline datasets to learn optimal policies with no necessity to access the real environment. Such a paradigm is also desirable for multi-agent reinforcement learning (MARL) tasks, given the combinatorially increased interactions among agents and with the environment. However, in MARL, the paradigm of offline pre-training with online fine-tuning has not been studied, nor are datasets or benchmarks for offline MARL research even available. In this paper, we facilitate the research by providing large-scale datasets and using them to examine the usage of the decision transformer in the context of MARL. We investigate the generalization of MARL offline pre-training in the following three aspects: 1) between single agents and multiple agents, 2) from offline pre-training to online fine-tuning, and 3) to multiple downstream tasks with few-shot and zero-shot capabilities. We start by introducing the first offline MARL dataset with diverse quality levels based on the StarCraft II environment, and then propose the novel architecture of the multi-agent decision transformer (MADT) for effective offline learning. MADT leverages the transformer's modelling ability for sequence modelling and integrates it seamlessly with both offline and online MARL tasks. A significant benefit of MADT is that it learns generalizable policies that can transfer between different types of agents under different task scenarios. On the StarCraft II offline dataset, MADT outperforms the state-of-the-art offline reinforcement learning (RL) baselines, including BCQ and CQL. When applied to online tasks, the pre-trained MADT significantly improves sample efficiency and enjoys strong performance in both few-shot and zero-shot cases. To the best of our knowledge, this is the first work that studies and demonstrates the effectiveness of offline pre-trained models in terms of sample efficiency and generalizability enhancements for MARL.
Keywords: pre-training model; multi-agent reinforcement learning (MARL); decision making; transformer; offline reinforcement learning