Funding: Supported by the National Key R&D Program of China under Grant No. 2021ZD0113203 and the National Science Foundation of China under Grant No. 61976115.
Abstract: Offline reinforcement learning (ORL) aims to learn a rational agent purely from behavior data without any online interaction. One of the major challenges in ORL is distribution shift, i.e., the mismatch between the knowledge of the learned policy and the reality of the underlying environment. Recent works usually handle this in an overly pessimistic manner, avoiding out-of-distribution (OOD) queries as much as possible, but this can harm the robustness of the agent at unseen states. In this paper, we propose a simple but effective method to address this issue. The key idea is to enhance the robustness of the new policy learned offline by weakening its confidence in highly uncertain regions. We propose to find those regions by simulating them with a modified Generative Adversarial Net (GAN), such that the generated data not only follow the same distribution as the old experience but are also difficult to handle with regard to the behavior policy or some other reference policy. We then use this information to regularize the ORL algorithm, penalizing overconfident behavior in these regions. Extensive experiments on several publicly available offline RL benchmarks demonstrate the feasibility and effectiveness of the proposed method.
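The abstract only outlines the mechanism, so the following is a minimal sketch of how such a scheme could look in PyTorch: a generator trained to produce states that both match the data distribution and are hard for a reference policy, plus a regularizer that penalizes critic overconfidence on those states. All module and parameter names (generator, discriminator, ref_policy, critic, difficulty_coef) are assumptions for illustration, not the authors' implementation.

```python
# Sketch only: GAN-generated "hard" regions plus an overconfidence penalty.
import torch


def generator_loss(generator, discriminator, ref_policy, critic, z, difficulty_coef=1.0):
    fake_states = generator(z)
    # Non-saturating GAN term: generated states should look like dataset states.
    adv = -torch.log(torch.sigmoid(discriminator(fake_states)) + 1e-8).mean()
    # "Difficulty" term: prefer states where the reference policy's action gets low value.
    ref_actions = ref_policy(fake_states)
    difficulty = critic(fake_states, ref_actions).mean()
    return adv + difficulty_coef * difficulty


def overconfidence_penalty(critic, policy, generator, z):
    # Penalize high Q-values under the learned policy on the generated uncertain states,
    # weakening the new policy's confidence in those regions.
    with torch.no_grad():
        hard_states = generator(z)
    return critic(hard_states, policy(hard_states)).mean()
```

In an actual training loop, difficulty_coef would trade off realism against difficulty, and the overconfidence penalty would be added to the ORL loss with its own coefficient.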
Abstract: Reinforcement Learning (RL) has emerged as a promising data-driven solution for wargaming decision-making. However, two domain challenges remain: (1) dealing with discrete-continuous hybrid wargaming control and (2) accelerating RL deployment with rich offline data. Existing RL methods fail to handle these two issues simultaneously, so we propose a novel offline RL method targeting hybrid action spaces. A new constrained action representation technique is developed to build a bidirectional mapping between the original hybrid action space and a latent space in a semantically consistent way. This allows learning a continuous latent policy with offline RL, with better exploration feasibility and scalability, and then reconstructing it back into the required hybrid policy. Critically, a novel offline RL optimization objective with adaptively adjusted constraints is designed to balance the alleviation and generalization of out-of-distribution actions. Our method demonstrates superior performance and generality across different tasks, particularly in typical realistic wargaming scenarios.
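As a rough illustration of the bidirectional mapping described above, the sketch below uses a simple action autoencoder that embeds a discrete-continuous hybrid action into a bounded continuous latent code and reconstructs it back; the paper's constrained action representation is more elaborate, and all names here are assumptions.

```python
# Sketch only: hybrid action <-> continuous latent mapping via an autoencoder.
import torch
import torch.nn as nn


class HybridActionAutoencoder(nn.Module):
    def __init__(self, n_discrete, cont_dim, latent_dim=8, hidden=128):
        super().__init__()
        self.n_discrete = n_discrete
        self.encoder = nn.Sequential(
            nn.Linear(n_discrete + cont_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim), nn.Tanh(),  # bounded latent eases constraints
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_discrete + cont_dim),
        )

    def encode(self, discrete_idx, cont):
        one_hot = nn.functional.one_hot(discrete_idx, self.n_discrete).float()
        return self.encoder(torch.cat([one_hot, cont], dim=-1))

    def decode(self, latent):
        out = self.decoder(latent)
        logits, cont = out[..., :self.n_discrete], out[..., self.n_discrete:]
        return logits.argmax(dim=-1), cont  # reconstructed hybrid action
```

A latent policy learned by any continuous-control offline RL algorithm can then act in the latent space, with decode() mapping its outputs back to executable hybrid actions.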
Funding: Supported by the National Key R&D Program of China (No. 2022ZD0116402) and the National Natural Science Foundation of China (No. 62106172).
Abstract: Offline reinforcement learning (RL) is a data-driven learning paradigm for sequential decision making. Mitigating the overestimation of values at out-of-distribution (OOD) states, induced by the distribution shift between the learning policy and the previously collected offline dataset, lies at the core of offline RL. To tackle this problem, some methods underestimate the values of states produced by learned dynamics models, or of state-action pairs whose actions are sampled from policies other than the behavior policy. However, since these generated states or state-action pairs are not guaranteed to be OOD, staying conservative on them may adversely affect the in-distribution ones. In this paper, we propose an OOD state-conservative offline RL method (OSCAR) that addresses this limitation by explicitly generating reliable OOD states located near the manifold of the offline dataset, and then designing a conservative policy evaluation approach that combines the vanilla Bellman error with a regularization term that underestimates the values of only these generated OOD states. In this way, we prevent the value errors of OOD states from propagating to in-distribution states through value bootstrapping and policy improvement. We also theoretically prove that the proposed conservative policy evaluation approach is guaranteed to underestimate the values of OOD states. OSCAR and several strong baselines are evaluated on the offline decision-making benchmark D4RL and the autonomous driving benchmark SMARTS. Experimental results show that OSCAR outperforms the baselines on a large portion of the benchmarks and attains the highest average return, substantially outperforming existing offline RL methods.
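A minimal sketch of the kind of conservative policy evaluation described above: the standard Bellman error on dataset transitions plus a term that only pushes down the values of the generated OOD states. The function and argument names are assumptions, not OSCAR's actual code.

```python
# Sketch only: Bellman error on in-distribution data + value penalty on generated OOD states.
import torch
import torch.nn.functional as F


def conservative_critic_loss(critic, target_critic, policy, batch, ood_states,
                             gamma=0.99, beta=1.0):
    s, a, r, s_next, done = batch
    with torch.no_grad():
        target_q = r + gamma * (1.0 - done) * target_critic(s_next, policy(s_next))
    bellman = F.mse_loss(critic(s, a), target_q)                 # in-distribution Bellman error
    ood_penalty = critic(ood_states, policy(ood_states)).mean()  # underestimate OOD values only
    return bellman + beta * ood_penalty
```

Because the penalty is applied only to explicitly generated OOD states, the in-distribution value estimates are left largely untouched, which is the point the abstract emphasizes.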
Funding: Supported by the Science and Technology Innovation 2030 New Generation Artificial Intelligence Major Project under Grant No. 2021ZD0113303, the National Natural Science Foundation of China under Grant Nos. 62192783 and 62276128, and in part by the Collaborative Innovation Center of Novel Software Technology and Industrialization.
Abstract: At present, the parameters of radar detection rely heavily on manual adjustment and empirical knowledge, resulting in low automation. Traditional manual adjustment methods cannot meet the requirements of modern radars for high efficiency, high precision, and high automation. It is therefore necessary to explore a new intelligent radar control learning framework to improve the capability and automation of radar detection. Reinforcement learning is popular for decision-making tasks, but the shortage of samples in radar control tasks makes it difficult to satisfy its data requirements. To address these issues, we propose a practical radar operation reinforcement learning framework that integrates offline reinforcement learning and meta-reinforcement learning to alleviate the sample requirements. Experimental results show that our method can automatically perform radar detection on par with human operators in real-world settings, thereby promoting the practical application of reinforcement learning in radar operation.
Funding: Supported in part by the National Natural Science Foundation of China (62176259, 62373364) and the Key Research and Development Program of Jiangsu Province (BE2022095).
Abstract: To alleviate the extrapolation error and instability inherent in the Q-function learned directly by off-policy Q-learning (QL-style) on static datasets, this article utilizes on-policy state-action-reward-state-action (SARSA-style) learning to develop an offline reinforcement learning (RL) method termed robust offline Actor-Critic with on-policy regularized policy evaluation (OPRAC). With the help of SARSA-style bootstrap actions, a conservative on-policy Q-function and a penalty term for matching the on-policy and off-policy actions are jointly constructed to regularize the optimal Q-function of off-policy QL-style learning. This naturally equips the off-policy QL-style policy evaluation with the intrinsic pessimistic conservatism of on-policy SARSA-style learning, thus facilitating the acquisition of a stable estimated Q-function. Even in the presence of limited-data sampling errors, the convergence of the Q-function learned by OPRAC and the controllability of the upper bound on the bias between the learned Q-function and its true Q-value can be theoretically guaranteed. In addition, the sub-optimality of the learned optimal policy stems merely from sampling errors. Experiments on the well-known D4RL Gym-MuJoCo benchmark demonstrate that OPRAC can rapidly learn robust and effective task-solving policies owing to the stable estimate of the Q-value, outperforming state-of-the-art offline RL methods by at least 15%.
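The following sketch illustrates one plausible form of such on-policy regularized policy evaluation: a QL-style TD target, a SARSA-style target bootstrapped from the logged next action, and a penalty keeping the two bootstrap values close. It is an assumed reading of the abstract, not OPRAC's published objective; all names and coefficients are placeholders.

```python
# Sketch only: off-policy (QL-style) evaluation regularized by an on-policy (SARSA-style) target.
import torch
import torch.nn.functional as F


def oprac_style_critic_loss(critic, target_critic, policy, batch,
                            gamma=0.99, alpha=1.0, lam=1.0):
    s, a, r, s_next, a_next, done = batch     # a_next: logged (SARSA-style) next action
    with torch.no_grad():
        pi_next = policy(s_next)              # QL-style bootstrap action
        q_ql = r + gamma * (1 - done) * target_critic(s_next, pi_next)
        q_sarsa = r + gamma * (1 - done) * target_critic(s_next, a_next)
    ql_term = F.mse_loss(critic(s, a), q_ql)
    sarsa_term = F.mse_loss(critic(s, a), q_sarsa)   # conservative on-policy regularizer
    # Penalty encouraging the off-policy bootstrap value to stay close to the on-policy one.
    action_match = F.mse_loss(critic(s_next, pi_next), critic(s_next, a_next))
    return ql_term + alpha * sarsa_term + lam * action_match
```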
Funding: Linghui Meng was supported in part by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA27030300); Haifeng Zhang was supported in part by the National Natural Science Foundation of China (No. 62206289).
Abstract: Offline reinforcement learning leverages previously collected offline datasets to learn optimal policies without any need to access the real environment. Such a paradigm is also desirable for multi-agent reinforcement learning (MARL) tasks, given the combinatorially increased interactions among agents and with the environment. However, in MARL the paradigm of offline pre-training with online fine-tuning has not been studied, nor are datasets or benchmarks for offline MARL research available. In this paper, we facilitate such research by providing large-scale datasets and using them to examine the decision transformer in the context of MARL. We investigate the generalization of MARL offline pre-training in three aspects: 1) between single agents and multiple agents, 2) from offline pre-training to online fine-tuning, and 3) to multiple downstream tasks with few-shot and zero-shot capabilities. We start by introducing the first offline MARL dataset with diverse quality levels based on the StarCraft II environment, and then propose the novel architecture of the multi-agent decision transformer (MADT) for effective offline learning. MADT leverages the transformer's ability for sequence modelling and integrates it seamlessly with both offline and online MARL tasks. A significant benefit of MADT is that it learns generalizable policies that can transfer between different types of agents under different task scenarios. On the StarCraft II offline dataset, MADT outperforms state-of-the-art offline reinforcement learning (RL) baselines, including BCQ and CQL. When applied to online tasks, the pre-trained MADT significantly improves sample efficiency and enjoys strong performance in both few-shot and zero-shot cases. To the best of our knowledge, this is the first work that studies and demonstrates the effectiveness of offline pre-trained models in terms of sample efficiency and generalizability enhancements for MARL.
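To make the sequence-modelling idea concrete, here is a toy decision-transformer-style model that interleaves (return-to-go, observation, action) tokens and predicts actions with a causally masked transformer. The actual MADT architecture, including how agents share or separate parameters and how StarCraft II observations are encoded, is not specified here; this layout is an assumption for illustration only.

```python
# Sketch only: a tiny decision-transformer-style model over (return-to-go, obs, action) tokens.
import torch
import torch.nn as nn


class TinyDecisionTransformer(nn.Module):
    def __init__(self, obs_dim, n_actions, embed_dim=64, n_layers=2, n_heads=4):
        super().__init__()
        self.rtg_embed = nn.Linear(1, embed_dim)
        self.obs_embed = nn.Linear(obs_dim, embed_dim)
        self.act_embed = nn.Embedding(n_actions, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(embed_dim, n_actions)

    def forward(self, rtg, obs, actions):
        # rtg: (B, T, 1), obs: (B, T, obs_dim), actions: (B, T) discrete ids
        tokens = torch.stack(
            [self.rtg_embed(rtg), self.obs_embed(obs), self.act_embed(actions)], dim=2
        ).flatten(1, 2)                                  # interleave -> (B, 3T, embed_dim)
        T = tokens.size(1)
        causal_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=tokens.device), diagonal=1
        )
        h = self.backbone(tokens, mask=causal_mask)
        return self.action_head(h[:, 1::3])              # predict actions from observation tokens
```

In a multi-agent setting, one simple option is to run such a model per agent (or with shared weights across agents) over each agent's own trajectory, which is consistent with the transfer-between-agents property the abstract describes.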