Multi-agent Reinforcement Learning Based on Quantile Regression

Abstract:

Multi-agent reinforcement learning (MARL) is a crucial part of multi-agent system research and has proved remarkably effective in complex collaborative tasks. However, in scenarios that require long-term decision-making, multi-agent systems often underperform: long-term returns are harder to estimate, and the uncertainty in the environment is difficult to model accurately. To address these problems, this study proposes a multi-agent memory reinforcement learning model based on quantile regression. The model not only selectively exploits historical decision-making experience to assist long-term decision-making, but also models the return distribution with quantile functions, thereby effectively capturing the uncertainty of returns. The model consists of three components: a memory indexing module, an implicit quantile decision network, and a value distribution decomposition module. The memory indexing module uses historical decision-making experience to generate intrinsic rewards, encouraging agents to make full use of existing experience. The implicit quantile decision network models the return distribution via quantile regression, providing strong support for long-term decision-making. The value distribution decomposition module decomposes the overall return distribution into per-agent return distributions, which assist the learning of each agent's individual policy. Extensive experiments in StarCraft II environments show that the proposed method improves agents' performance on long-term decision-making tasks and converges quickly.
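The abstract centers on fitting a return distribution at chosen quantile levels and then splitting that distribution across agents. As a rough illustrative sketch only (not the authors' code; the function names and the simple additive decomposition are assumptions), the pinball loss that quantile regression minimizes in distributional RL methods such as QR-DQN/IQN, together with a minimal per-agent decomposition, could look like:

```python
import numpy as np

def quantile_regression_loss(pred_quantiles, target_samples, taus):
    """Pinball (quantile regression) loss: fits predicted return quantiles
    `pred_quantiles` at quantile levels `taus` against sampled target returns."""
    # pairwise errors: target sample j minus predicted quantile i, shape (n_tau, n_target)
    u = target_samples[None, :] - pred_quantiles[:, None]
    # asymmetric weighting: an error is penalized with weight tau when the
    # target lies above the quantile, and (1 - tau) when it lies below
    loss = np.maximum(taus[:, None] * u, (taus[:, None] - 1.0) * u)
    return loss.mean()

def decompose_joint_quantiles(agent_quantiles):
    """Illustrative additive decomposition: model the joint return quantile at
    each level tau as the sum of per-agent quantiles at the same tau.
    (A stand-in for the paper's value distribution decomposition module.)"""
    return agent_quantiles.sum(axis=0)  # (n_agents, n_tau) -> (n_tau,)
```

Minimizing this loss drives each predicted value toward the corresponding quantile of the target samples: tau = 0.5 recovers the median, while tau = 0.9 is pulled toward the upper tail, which is how a set of taus captures the spread (uncertainty) of returns rather than only their mean.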

Cite this article:

张志文, 周长东, 马裕博, 张博. Multi-agent reinforcement learning based on quantile regression. 计算机系统应用, , (): 1–9

History:
  • Received: 2025-05-31
  • Last revised: 2025-07-07
  • Published online: 2025-11-17
Copyright: Institute of Software, Chinese Academy of Sciences
Address: No. 4 South Fourth Street, Zhongguancun, Haidian District, Beijing 100190
Tel: 010-62661041  Email: csa@iscas.ac.cn