English title:
An effective maximum entropy exploration approach for deceptive game in reinforcement learning
ScienceDirect (Elsevier), Neurocomputing 403 (2020) 98-108. doi:10.1016/j.neucom.2020.04.068
Chunmao Li, Xuanguang Wei, Yinliang Zhao∗, Xupeng Geng
Deceptive games are games that exploit the reward structure to steer the agent away from the global optimum, and they have become a major challenge for intelligent exploration in deep reinforcement learning. Most cutting-edge exploration approaches, such as count-based and curiosity-driven methods, even with intrinsic motivation, achieve good performance in sparse-reward games but still easily fall into local-optimum traps in deceptive games. To address this shortfall, we introduce a further exploration approach called Maximum Entropy Exploration (MEE). Based on entropy rewards and an off-policy actor-critic reinforcement learning algorithm, we divide the agent's exploration policy into two independent parts: the target policy and the explorer policy. The explorer policy, which takes maximizing the entropy of the target policy as its optimization goal, interacts with the environment and generates trajectories for the target policy. The target policy takes maximizing the external reward as its optimization goal in order to reach the global solution. To alleviate the catastrophic forgetting problem, which destabilizes the agent's training during the off-policy exploration phase, optimal experience replay is applied. An on-policy mode-switch trick is used to effectively prevent the instability and divergence caused by the deadly triad. We conduct experiments comparing our approach with state-of-the-art deep reinforcement learning algorithms and exploration methods in grid-world and StarCraft II environments with deceptive rewards. The experiments indicate that the MEE approach proposed in this paper effectively avoids the deceptive reward trap and learns the globally optimal strategy.
Keywords: Deep reinforcement learning | Deceptive game | Maximum entropy explorer approach | Experience replay | On-policy switch
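The abstract's central idea, an explorer policy rewarded for the entropy of the target policy while the target policy optimizes the external reward, can be illustrated with a minimal sketch. This is not the paper's implementation; it only shows, for a discrete action space, how an entropy bonus on the target policy's action distribution could augment the explorer's reward (the function names and the `beta` weight are illustrative assumptions):

```python
import math

def policy_entropy(probs):
    """Shannon entropy of a discrete action distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def explorer_reward(ext_reward, target_action_probs, beta=1.0):
    """Illustrative explorer-policy reward: the external reward plus a
    bonus proportional to the entropy of the target policy's action
    distribution, so the explorer favors states where the target policy
    is still uncertain."""
    return ext_reward + beta * policy_entropy(target_action_probs)

# A deterministic target policy (zero entropy) yields no bonus,
# while a uniform one yields the maximum bonus log(n_actions).
r_certain = explorer_reward(1.0, [1.0, 0.0, 0.0, 0.0])
r_uniform = explorer_reward(1.0, [0.25, 0.25, 0.25, 0.25])
```

Under this sketch, the target policy would be trained on the explorer's trajectories using only the external reward, which is what lets it converge to the global solution while the explorer keeps searching.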