Do a deep dive into Gymnasium

▼thought

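The user wants me to do a deep dive into "Gymnasium". This most likely refers to the Farama Foundation's Gymnasium library, a maintained fork of OpenAI Gym and a widely used standard API in reinforcement learning (RL). I'll research it from several angles: content indexed on this site, web search, and possibly deep page fetching. I'll start by running several lookups in parallel: an on-site search, a web search, and a call to get the current time.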

▼✓Current time time_zone: "Asia/Shanghai"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"unix_ms\": 1778230255171,\n  \"utc_iso\": \"2026-05-08T08:50:55.171Z\",\n  \"server_timezone\": \"UTC\",\n  \"server_local_calendar\": {\n    \"year\": 2026,\n    \"month\": 5,\n    \"day\": 8,\n    \"hour\": 8,\n    \"minute\": 50,\n    \"second\": 55\n  },\n  \"server_local_display_zh\": \"2026年5月8日星期五 08:50:55\",\n  \"requested_zone\": {\n    \"time_zone\": \"Asia/Shanghai\",\n    \"local_calendar\": {\n      \"year\": 2026,\n      \"month\": 5,\n      \"day\": 8,\n      \"hour\": 16,\n      \"minute\": 50,\n      \"second\": 55\n    },\n    \"local_display_zh\": \"2026年5月8日星期五 16:50:55\"\n  }\n}"
    }
  ],
  "details": {
    "unix_ms": 1778230255171,
    "utc_iso": "2026-05-08T08:50:55.171Z",
    "server_timezone": "UTC",
    "server_local_calendar": {
      "year": 2026,
      "month": 5,
      "day": 8,
      "hour": 8,
      "minute": 50,
      "second": 55
    },
    "server_local_display_zh": "2026年5月8日星期五 08:50:55",
    "requested_zone": {
      "time_zone": "Asia/Shanghai",
      "local_calendar": {
        "year": 2026,
        "month": 5,
        "day": 8,
        "hour": 16,
        "minute": 50,
        "second": 55
      },
      "local_display_zh": "2026年5月8日星期五 16:50:55"
    }
  }
}

▼✓Feed full-text search keywords: "Gymnasium Farama reinforcement learning", count: 30

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"items\": [\n    {\n      \"id\": \"https://www.machinebrief.com/news/revolutionizing-language-models-a-new-era-with-adaptive-powe-ais5\",\n      \"url\": \"https://www.machinebrief.com/news/revolutionizing-language-models-a-new-era-with-adaptive-powe-ais5\",\n      \"source_url\": \"https://www.machinebrief.com/rss.xml\",\n      \"title\": \"Revolutionizing Language Models: A New Era with Adaptive Power-Mean Optimization\",\n      \"author\": [\n        \"Felix Navarro\"\n      ],\n      \"summary\": \"Adaptive Power-Mean Policy Optimization (APMPO) transforms reinforcement learning for Large Language Models. By embracing dynamic policy strategies, it outperforms traditional static methods, enhancing reasoning capabilities dramatically.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T14:08:40.000Z\",\n      \"fetched_at\": \"2026-05-07T16:40:31.825Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04920v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04920\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization\",\n      \"author\": [\n        \"Xiyan Fu\",\n        \"Wei Liu\"\n      ],\n      \"summary\": \"arXiv:2605.04920v1 Announce Type: cross Abstract: Compositional generalization refers to correctly interpret novel combinations of known primitives, which remains a major challenge. Existing approaches often rely on supervised fine-tuning, which encourages models to imitate target outputs. This token-level training paradigm fails to capture the global compositional structure required for generalizing to unseen combinations. In this work, we investigate whether compositional generalization can instead be improved through outcome-level reinforcement learning. We adopt Group Relative Policy Optimization to optimize models based on feedback on their final outputs. 
Within this framework, we explore both a simple binary outcome reward and a composite reward that provides additional composition feedback. Experiments on multiple compositional benchmarks show that reinforcement learning improves compositional generalization compared to supervised fine-tuning. Further analysis reveals that supervised models tend to overfit frequent training compositions, whereas reinforcement learning improves compositional generalization by reshaping the output distribution, particularly for more complex composition types.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:22.293Z\"\n    },\n    {\n      \"id\": \"https://www.machinebrief.com/news/reinforcement-learning-gets-a-safety-upgrade-with-alam-67sn\",\n      \"url\": \"https://www.machinebrief.com/news/reinforcement-learning-gets-a-safety-upgrade-with-alam-67sn\",\n      \"source_url\": \"https://www.machinebrief.com/rss.xml\",\n      \"title\": \"Reinforcement Learning Gets a Safety Upgrade with ALaM\",\n      \"author\": [\n        \"Julian Voss\"\n      ],\n      \"summary\": \"The ALaM framework tackles safety in reinforcement learning by optimizing state-wise constraints. 
It's a big deal for stable training and risk management.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-05T15:53:58.000Z\",\n      \"fetched_at\": \"2026-05-05T16:10:22.708Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.02943v1\",\n      \"url\": \"https://arxiv.org/abs/2605.02943\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Healthcare AI GYM for Medical Agents\",\n      \"author\": [\n        \"Minbyul Jeong\"\n      ],\n      \"summary\": \"arXiv:2605.02943v1 Announce Type: new Abstract: Clinical reasoning demands multi-step interactions -- gathering patient history, ordering tests, interpreting results, and making safe treatment decisions -- yet a unified training environment provides the breadth of clinical domains and specialized tools to train generalizable medical AI agents through reinforcement learning remains elusive. We present a comprehensive empirical study of multi-turn agentic RL for medical AI, built on \\\\gym{}, a gymnasium-compatible environment spanning 10 clinical domains with 3.6K+ tasks, 135 domain-specific tools, and a knowledge base of 828K medical passages. Our analysis reveals that agentic multi-turn structure degrades into verbose single-turn monologues, characterized by monotonic length explosion and a simultaneous erosion of tool-use frequency. We characterize how this collapse, alongside distillation instability, stems from the misalignment of sparse terminal rewards with sequential clinical trajectories. We find that vanilla GRPO achieves strong final accuracy on some benchmarks but suffers from training instability, evidenced by significant oscillations in response length and prolonged convergence periods. 
To improve training efficiency and stability, we propose Turn-level Truncated On-Policy Distillation (TT-OPD), a self-distillation framework where a gradient-free EMA teacher leverages outcome-privileged information to provide dense, outcome-aware KL regularization at every conversation turn. TT-OPD achieves the best performance on 10 of 18 benchmarks with an average +3.9~pp improvement over the non-RL baseline with faster early convergence, controlled response length, and sustained multi-turn tool use.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-06T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-06T16:00:29.193Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2412.08893v3\",\n      \"url\": \"https://arxiv.org/abs/2412.08893\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Optimal Control with Natural Images: Efficient Reinforcement Learning using Overcomplete Sparse Codes\",\n      \"author\": [\n        \"Peter N. Loxley\"\n      ],\n      \"summary\": \"arXiv:2412.08893v3 Announce Type: replace Abstract: Optimal control and sequential decision making are widely used in many complex tasks. Optimal control over a sequence of natural images is a first step towards understanding the role of vision in control. Here, we formalize this problem as a reinforcement learning task, and derive general conditions under which an image includes enough information to implement an optimal policy. Reinforcement learning is shown to provide a computationally efficient method for finding optimal policies when natural images are encoded into \\\"efficient\\\" image representations. This is demonstrated by introducing a new reinforcement learning benchmark that easily scales to large numbers of states and long horizons. 
In particular, by representing each image as an overcomplete sparse code, we are able to efficiently solve an optimal control task that is orders of magnitude larger than those tasks solvable using complete codes. Theoretical justification for this behaviour is provided. This work also demonstrates that deep learning is not necessary for efficient optimal control with natural images.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.03327v1\",\n      \"url\": \"https://arxiv.org/abs/2605.03327\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment\",\n      \"author\": [\n        \"Hongbo Jin\",\n        \"Rongpeng Zhu\",\n        \"Zhongjing Du\",\n        \"Xu Jiang\",\n        \"Jingqi Tian\",\n        \"Qiaoman Zhang\",\n        \"Jiayu Ding\"\n      ],\n      \"summary\": \"arXiv:2605.03327v1 Announce Type: new Abstract: Reinforcement learning is crucial for aligning large language models to perform complex reasoning tasks. However, current algorithms such as Group Relative Policy Optimization suffer from coarse grained, sequence level credit assignment, which severely struggles to isolate pivotal reasoning steps within long Chain of Thought generations. Furthermore, the standard unbounded Kullback Leibler divergence penalty induces severe gradient instability and mode seeking conservatism, ultimately stifling the discovery of novel reasoning trajectories. 
To overcome these limitations, we introduce Distribution Guided Policy Optimization, a novel critic free reinforcement learning framework that reinterprets distribution deviation as a guiding signal rather than a rigid penalty.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-06T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-06T16:00:29.193Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04066v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04066\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning\",\n      \"author\": [\n        \"Yiming Huang\",\n        \"Zhenbo Shi\",\n        \"Shuzheng Gao\",\n        \"Cuiyun Gao\",\n        \"Peiyi Han\",\n        \"Chuanyi Liu\"\n      ],\n      \"summary\": \"arXiv:2605.04066v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is an essential paradigm that enhances the reasoning capabilities of Large Language Models (LLMs). However, existing methods typically rely on static policy optimization schemes that misalign with the model's evolving reasoning capabilities. To address this issue, we propose Adaptive Power-Mean Policy Optimization (APMPO), which comprises two main innovations: Power-Mean Policy Optimization (PMPO) and Feedback-Adaptive Clipping (FAC). Specifically, PMPO introduces a generalized power-mean objective. This enables the model to adaptively transition from the signal-amplifying behavior of the arithmetic mean to the consistency-enforcing behavior of the geometric mean. FAC adaptively adjusts clipping bounds based on real-time reward statistics to overcome the limitations of static mechanisms. Capitalizing on these innovations, APMPO improves learning dynamics and reasoning performance. 
Extensive experiments on nine datasets across three reasoning tasks showcase the superiority of APMPO over state-of-the-art RLVR-based baselines. For instance, APMPO boosts the average Pass@1 score on mathematical reasoning benchmarks by 3.0 points compared to GRPO when using Qwen2.5-3B-Instruct.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:22.293Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04065v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04065\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs\",\n      \"author\": [\n        \"Yiming Huang\",\n        \"Zhenbo Shi\",\n        \"Xin-Cheng Wen\",\n        \"Jichuan Zeng\",\n        \"Cuiyun Gao\",\n        \"Peiyi Han\",\n        \"Chuanyi Liu\"\n      ],\n      \"summary\": \"arXiv:2605.04065v1 Announce Type: new Abstract: Unsupervised reinforcement learning (RL) has emerged as a promising paradigm for enabling self-improvement in large language models (LLMs). However, existing unsupervised RL-based methods often lack the capacity to adapt to the model's evolving reasoning capabilities during training. Therefore, these methods can misdirect policy optimization in the absence of ground-truth supervision. To address this issue, we introduce FREIA, a novel RL-based algorithm built on two key innovations: (1) Free Energy-Driven Reward (FER) adapts rewards to balance consensus and exploration based on the Free Energy Principle. (2) Adaptive Advantage Shaping (AAS) adaptively adjusts learning signals based on the statistical characteristics of sampled rewards. Empirical evaluations on nine datasets across three reasoning tasks showcase that FREIA outperforms other unsupervised RL-based baselines. 
Notably, in mathematical reasoning tasks, FREIA surpasses other methods by an average of 0.5 to 3.5 points in Pass@1 using the DeepSeek-R1-Distill-Qwen-1.5B model.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:22.293Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2506.07548v2\",\n      \"url\": \"https://arxiv.org/abs/2506.07548\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"Overcoming Environmental Meta-Stationarity in MARL via Adaptive Curriculum and Counterfactual Group Advantage\",\n      \"author\": [\n        \"Weiqiang Jin\",\n        \"Yang Liu\",\n        \"Shixiang Tang\",\n        \"Jinhu Qi\",\n        \"Wentao Zhang\",\n        \"Junli Wang\",\n        \"Biao Zhao\",\n        \"Hongyang Du\"\n      ],\n      \"summary\": \"arXiv:2506.07548v2 Announce Type: replace Abstract: Multi-agent reinforcement learning (MARL) has reached competitive performance on cooperative tasks against scripted adversaries, yet most methods train agents at a single fixed difficulty throughout the entire run. We term this static-difficulty regime environmental meta-stationarity and show that it caps policy generalization and steers learning toward shallow local optima. To break this regime, we propose CL-MARL, a dynamic curriculum learning framework that adapts opponent strength online from win-rate signals, advancing or regressing the task as agents master it. Its scheduler, FlexDiff, fuses momentum-based trend estimation with sliding-window dual-curve monitoring of training and evaluation returns, yielding stable difficulty transitions without manual tuning. 
Because a moving curriculum amplifies non-stationarity and sparsifies global rewards, we introduce the Counterfactual Group Relative Policy Advantage (CGRPA), which extends GRPO-style group-relative optimization with counterfactual baselines to disentangle each agent's contribution under shifting team dynamics. On the StarCraft Multi-Agent Challenge (SMAC), CL-MARL attains a 40% mean win rate on the super-hard maps with an average episode return of 17.85, exceeding the QMIX, OW-QMIX, DER, EMC, and MARR baselines by +2.94 on average, while reaching its peak win rate roughly 1.28faster on 8m_vs_9m and 1.42 faster on 3s5z_vs_3s6z than the strongest baseline. The implementation is publicly available at https://github.com/NICE-HKU/CL2MARL-SMAC.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:36:24.058Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04266v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04266\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Explaining and Preventing Alignment Collapse in Iterative RLHF\",\n      \"author\": [\n        \"Etienne Gauthier\",\n        \"Francis Bach\",\n        \"Michael I. Jordan\"\n      ],\n      \"summary\": \"arXiv:2605.04266v1 Announce Type: new Abstract: Reinforcement learning from human feedback (RLHF) typically assumes a static or non-strategic reward model (RM). In iterative deployment, however, the policy generates the data on which the RM is retrained, creating a feedback loop. Building on the Stackelberg game formulation of this interaction, we derive an analytical decomposition of the policy's true optimization gradient into a standard policy gradient and a parameter-steering term that captures the policy's influence on the RM's future parameters. 
We show that standard iterative RLHF, which drops this steering term entirely, suffers from alignment collapse: the policy systematically exploits the RM's blind spots, producing low-quality, high-reward outputs whose feedback reinforces the very errors it exploits. To mitigate this, we propose foresighted policy optimization (FPO), a mechanism-design intervention that restores the missing steering term by regularizing the policy's parameter-steering effect on RM updates. We instantiate FPO via a scalable first-order approximation and demonstrate that it prevents alignment collapse on both controlled environments and an LLM alignment pipeline using Llama-3.2-1B.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2602.14872v2\",\n      \"url\": \"https://arxiv.org/abs/2602.14872\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"The Implicit Curriculum: Learning Dynamics in RL with Verifiable Rewards\",\n      \"author\": [\n        \"Yu Huang\",\n        \"Zixin Wen\",\n        \"Yuejie Chi\",\n        \"Yuting Wei\",\n        \"Aarti Singh\",\n        \"Yingbin Liang\",\n        \"Yuxin Chen\"\n      ],\n      \"summary\": \"arXiv:2602.14872v2 Announce Type: replace Abstract: Reinforcement learning with verifiable rewards (RLVR) has been a main driver of recent breakthroughs in large reasoning models. Yet it remains a mystery how rewards based solely on final outcomes can help overcome the long-horizon barrier to extended reasoning. To understand this, we develop a theory of the training dynamics of RLVR for transformers on compositional reasoning tasks. 
Our theory shows that mixed-difficulty training naturally follows an implicit curriculum: without any explicit schedule, easier problems become learnable first and shape the frontier for harder ones, creating a learning progression from easy to hard during optimization. The effectiveness of this curriculum is governed by the smoothness of the difficulty spectrum. When the spectrum is smooth, training dynamics enters a well-behaved relay regime, in which persistent gradient signals on easier problems make slightly harder ones tractable and keep training at the edge of competence. When the spectrum contains abrupt discontinuities, training undergoes grokking-type phase transitions with prolonged plateaus before progress recurs. As a technical contribution, our analysis develops and adapts techniques from Fourier analysis on finite groups to our setting. We validate the predicted mechanisms empirically via synthetic experiments.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2601.07389v2\",\n      \"url\": \"https://arxiv.org/abs/2601.07389\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training\",\n      \"author\": [\n        \"Xueyan Niu\",\n        \"Bo Bai\",\n        \"Wei Han\",\n        \"Weixi Zhang\"\n      ],\n      \"summary\": \"arXiv:2601.07389v2 Announce Type: replace Abstract: Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). These two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs and expert responses, while RL maximizes reward signals derived from human preferences or rule-based verifiers. 
Modern reasoning models have widely adopted the practice of alternating SFT and RL training. However, there is no theoretical account of whether they can be decoupled. We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: RL increases SFT loss under both distributional (KL-based) and landscape (PL-based) analyses; and (2) RL-then-SFT coupling: SFT lowers the reward achieved by RL under analogous conditions. Under the PL condition, we further derive the optimal RL duration that balances reward improvement against SFT degradation, identify the non-decoupling threshold governing when RL can improve SFT, and bound the gradient misalignment via spectral concentration. Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated without loss of prior performance in the post-training pipeline.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2602.19837v3\",\n      \"url\": \"https://arxiv.org/abs/2602.19837\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Meta-Learning and Meta-Reinforcement Learning -- Tracing the Path towards DeepMind's Adaptive Agent\",\n      \"author\": [\n        \"Bj\\\\\\\"orn Hoppmann\",\n        \"Christoph Scholz\"\n      ],\n      \"summary\": \"arXiv:2602.19837v3 Announce Type: replace-cross Abstract: Humans are highly effective at utilizing prior knowledge to adapt to novel tasks, a capability that standard machine learning models struggle to replicate due to their reliance on task-specific training. Meta-learning overcomes this limitation by allowing models to acquire transferable knowledge from various tasks, enabling rapid adaptation to new challenges with minimal data. 
This survey provides a rigorous, task-based formalization of meta-learning and meta-reinforcement learning and uses that paradigm to chronicle the landmark algorithms that paved the way for DeepMind's Adaptive Agent, consolidating the essential concepts needed to understand the Adaptive Agent and other generalist approaches.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04960v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04960\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance\",\n      \"author\": [\n        \"Song Yu\",\n        \"Li Li\",\n        \"Wenwen Zhao\",\n        \"Zhisheng Yang\"\n      ],\n      \"summary\": \"arXiv:2605.04960v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR), particularly Group Relative Policy Optimization (GRPO), has advanced LLM reasoning. However, GRPO suffers from three credit assignment failures: uniform token-level granularity that ignores heterogeneous informational value, uniform polarity that penalizes correct steps and rewards incorrect ones, and zero-variance collapse that erases outcome-driven gradients. We systematically quantify these failures, revealing highly non-uniform token informativeness, widespread step-level polarity misalignment, and substantial training waste. To address these limitations, we propose Entropy-Progress Aligned GRPO (EP-GRPO), a framework that mines the model's intrinsic information flow for dense, self-supervised guidance. 
EP-GRPO integrates entropy-gated modulation to prioritize high entropy decision pivots, implicit process signals from policy divergence anchored to outcome advantages for directional token-level feedback without external reward models, and cumulative entropy mapping that enables progress-aligned advantage normalization, naturally maintaining gradient flow under zero reward variance. Extensive experiments on mathematical reasoning benchmarks demonstrate that EP-GRPO achieves superior accuracy and efficiency compared to GRPO and its variants. The code will be available.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04712v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04712\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning\",\n      \"author\": [\n        \"Lirui Luo\",\n        \"Guoxi Zhang\",\n        \"Hongming Xu\",\n        \"Cong Fang\",\n        \"Qing Li\"\n      ],\n      \"summary\": \"arXiv:2605.04712v1 Announce Type: new Abstract: In deep reinforcement learning (DRL), an agent is trained from a stream of experience. In a continual learning setting, such agents can suffer from plasticity loss: their ability to learn new skills from new experiences diminishes over training. Recently, Mixture-of-Experts (MoE) networks have been reported to enable scaling laws and facilitate the learning of diverse skills. However, in continual reinforcement learning settings, their performance can degenerate as learning proceeds, indicating a loss of plasticity. To address this, building on Neural Tangent Kernel (NTK) theory, we formalize the plasticity loss in MoE policies as a loss of spectral plasticity. 
We then derive a tractable proxy for spectral plasticity, one expressible in terms of individual expert feature matrices. Leveraging this proxy, we introduce SPHERE, a practical Parseval penalty tailored for MoE-based policies that alleviates the loss of spectral plasticity. On MetaWorld and HumanoidBench, SPHERE improves average success under continual RL by 133% and 50% over an unregularized MoE baseline, while maintaining higher spectral plasticity throughout training.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2604.27859v2\",\n      \"url\": \"https://arxiv.org/abs/2604.27859\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"A Brief Overview: Agentic Reinforcement Learning In Large Language Models\",\n      \"author\": [\n        \"Fangming Cui\",\n        \"Ruixiao Zhu\",\n        \"Cheng Fang\",\n        \"Sunan Li\",\n        \"Jiahong Li\"\n      ],\n      \"summary\": \"arXiv:2604.27859v2 Announce Type: replace Abstract: Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environments. However, the advent of powerful Large Language Models (LLMs) and increasingly complex, open-ended tasks has catalyzed a paradigm shift towards agentic paradigms within RL. This emerging framework extends beyond traditional RL by emphasizing the development of autonomous agents capable of goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning in uncertain, real-world environments. Unlike conventional approaches that rely heavily on static objectives and episodic interactions, LLM-based Agentic RL incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop. 
In this paper, we provide a deep insight for looking the conceptual foundations, methodological innovations, and effective designs underlying this trend. Furthermore, we identify critical challenges and outline promising future directions for building LLM-based Agentic RL.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:36:24.058Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.02063v1\",\n      \"url\": \"https://arxiv.org/abs/2605.02063\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"Coopetition-Gym v1: A Formally Grounded Platform for Mixed-Motive Multi-Agent Reinforcement Learning under Strategic Coopetition\",\n      \"author\": [\n        \"Vik Pant\",\n        \"Eric Yu\"\n      ],\n      \"summary\": \"arXiv:2605.02063v1 Announce Type: cross Abstract: We present Coopetition-Gym v1, a benchmark platform for mixed-motive multi-agent reinforcement learning under strategic coopetition. The platform comprises twenty environments organized into four mechanism classes that correspond to four foundational technical reports: interdependence and complementarity (arXiv:2510.18802), trust and reputation dynamics (arXiv:2510.24909), collective action and loyalty (arXiv:2601.16237), and sequential interaction and reciprocity (arXiv:2604.01240). Each environment carries a closed-form payoff structure and a calibrated interdependence matrix derived from the corresponding report. Every environment exposes a parameterized reward layer configurable across three structurally distinct modes (private, integrated, cooperative). This separation of payoff from reward enables reward-type ablation, the platform's principal methodological apparatus. 
Four of the twenty environments are calibrated against historically documented coopetitive relationships and reproduce their outcomes at 98.3, 81.7, 86.7, and 87.3 percent on the validation rubric (Samsung-Sony LCD, Renault-Nissan Alliance, Apache HTTP Server, Apple iOS App Store). The platform exposes Gymnasium, PettingZoo Parallel, and PettingZoo AEC interfaces and ships 126 reference algorithms: 16 learning algorithms, 7 game-theoretic oracles, 2 heuristic baselines, and 101 constant-action policies. A reference experimental study trained the 16 learning algorithms on every environment under every reward configuration with seven random seeds, producing a 25,708-run training corpus and a 1,116-run behavioral audit corpus, both released under CC-BY-4.0 with Croissant 1.0 metadata. Coopetition-Gym v1 is the first platform to combine continuous-action mixed-motive environments, parameterized reward mutuality, calibrated interdependence coefficients, game-theoretic oracle baselines, and validated case studies.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-05T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-05T16:07:19.561Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2602.03452v2\",\n      \"url\": \"https://arxiv.org/abs/2602.03452\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing\",\n      \"author\": [\n        \"Yujuan Pang\",\n        \"Jiaxin Li\",\n        \"Xin Sheng\",\n        \"Ran Peng\",\n        \"Yong Ma\"\n      ],\n      \"summary\": \"arXiv:2602.03452v2 Announce Type: replace Abstract: Reinforcement learning with verifiable rewards (RLVR) is effective for training large language models on deterministic outcome reasoning tasks. 
Prior work shows RLVR works with few prompts, but prompt selection is often based only on training-accuracy variance, leading to unstable optimization directions and weaker transfer. We revisit prompt selection from a mechanism-level view and argue that an effective minibatch should provide both (i) a reliable positive anchor and (ii) explicit negative learning signals from rare failures. Based on this principle, we propose \\\\emph{positive--negative pairing}: at each update, we sample a hard-but-solvable $q^{+}$ and an easy-but-brittle prompt $q^{-}$(high success rate but not perfect), characterized by low and high empirical success rates under multiple rollouts. We further introduce Weighted GRPO, which reweights binary outcomes at the pair level and uses group-normalized advantages to amplify rare successes on $q^{+}$ into sharp positive guidance while turning rare failures on $q^{-}$ into strong negative penalties. This bidirectional signal provides informative learning feedback for both successes and failures, improving sample efficiency without suppressing exploration. On Qwen2.5-Math-7B, a single paired minibatch per update consistently outperforms a GRPO baseline that selects two prompts via commonly used variance-based selection heuristics: AIME~2025 Pass@8 improves from 16.8 to 22.2, and AMC23 Pass@64 from 94.0 to 97.0, while remaining competitive with large-scale RLVR trained from a pool of 1209 training prompts. 
Similar gains are observed on Qwen2.5-Math-7B-Instruct.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.03434v1\",\n      \"url\": \"https://arxiv.org/abs/2605.03434\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Quantum Hierarchical Reinforcement Learning via Variational Quantum Circuits\",\n      \"author\": [\n        \"Yu-Ting Lee\",\n        \"Samuel Yen-Chi Chen\",\n        \"Fu-Chieh Chang\"\n      ],\n      \"summary\": \"arXiv:2605.03434v1 Announce Type: new Abstract: Reinforcement learning is one of the most challenging learning paradigms where efficacy and efficiency gains are extremely valuable. Hierarchical reinforcement learning is a variant that leverages temporal abstraction to structure decision-making. While parametrized quantum computations have shown success in non-hierarchical reinforcement learning, whether these advantages adapt to hierarchical decision-making remains a critical open question. In this work, we develop a hybrid hierarchical agent based on the option-critic architecture. This hybrid agent substitutes classical components with variational quantum circuits for feature extractors, option-value functions, termination functions, and intra-option policies. Evaluated on standard benchmarking environments, results show that a hybrid agent utilizing a quantum feature extractor can outperform classical baselines while saving up to 66\\\\% trainable parameters. We also identify an architectural bottleneck that quantum option-value estimation severely degrades performance. Further ablation studies reveal how architectural choices of the quantum circuits affect performance. 
Our work establishes design principles for parameter-efficient hybrid hierarchical agents.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-06T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-06T16:00:29.193Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2511.08717v4\",\n      \"url\": \"https://arxiv.org/abs/2511.08717\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Optimal control of the future via prospective learning with control\",\n      \"author\": [\n        \"Yuxin Bai\",\n        \"Aranyak Acharyya\",\n        \"Ashwin De Silva\",\n        \"Zeyu Shen\",\n        \"James Hassett\",\n        \"Joshua T. Vogelstein\"\n      ],\n      \"summary\": \"arXiv:2511.08717v4 Announce Type: replace-cross Abstract: Optimal control of the future is the next frontier for AI. Current approaches to this problem are typically rooted in reinforcement learning (RL). RL is mathematically distinct from supervised learning, which has been the main workhorse for the recent achievements in AI. Moreover, RL typically operates in a stationary environment with episodic resets, limiting its utility. Here, we extend supervised learning to address learning to control in non-stationary, reset-free environments. Using this framework, called ''Prospective Learning with Control'' (PLuC), we prove that under certain fairly general assumptions, empirical risk minimization (ERM) asymptotically achieves the Bayes optimal policy. We then consider a specific instance of prospective learning with control: foraging, a canonical task relevant to both natural and artificial agents. We illustrate that modern RL algorithms, which assume stationarity, struggle in these non-stationary reset-free environments. Even with time-aware modifications, they converge orders of magnitude slower than our prospective foraging agents on a simple 1-D foraging benchmark. 
Code is available at: https://github.com/neurodata/procontrol.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-06T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-06T16:00:29.193Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2602.05890v2\",\n      \"url\": \"https://arxiv.org/abs/2602.05890\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training\",\n      \"author\": [\n        \"Dingwei Zhu\",\n        \"Zhiheng Xi\",\n        \"Shihan Dou\",\n        \"Jiahan Li\",\n        \"Chenhao Huang\",\n        \"Junjie Ye\",\n        \"Sixian Li\",\n        \"Mingxu Chai\",\n        \"Yuhui Wang\",\n        \"Yajie Yang\",\n        \"Ming Zhang\",\n        \"Jiazheng Zhang\",\n        \"Shichun Liu\",\n        \"Caishuang Huang\",\n        \"Yunke Zhang\",\n        \"Yuran Wang\",\n        \"Tao Gui\",\n        \"Xipeng Qiu\",\n        \"Qi Zhang\",\n        \"Xuanjing Huang\"\n      ],\n      \"summary\": \"arXiv:2602.05890v2 Announce Type: replace-cross Abstract: Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain (OOD) generalization, especially in LLM post-training. Recent distributional RL methods improve robustness by modeling values with multiple quantile points, but they still learn each quantile independently as a scalar. This results in rough-grained value representations that lack fine-grained conditioning on state information, struggling under complex and OOD conditions. We propose DFPO (Distributional Value Flow Policy Optimization with Conditional Risk and Consistency Control), a robust distributional RL framework that models values as continuous flows across time steps. 
By scaling value modeling through learning of a value flow field instead of isolated quantile predictions, DFPO captures richer state information for more accurate advantage estimation. To stabilize training under noisy feedback, DFPO further integrates conditional risk control and consistency constraints along value flow trajectories. Experiments on dialogue, math reasoning, and scientific tasks show that DFPO outperforms PPO, FlowRL, and other robust baselines under noisy supervision, achieving improved training stability and generalization.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:22.293Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04880v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04880\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"A Harmonic Mean Formulation of Average Reward Reinforcement Learning in SMDPs\",\n      \"author\": [\n        \"Erel Shtossel\",\n        \"Alicia Vidler\",\n        \"Uri Shaham\",\n        \"Gal A. Kaminka\"\n      ],\n      \"summary\": \"arXiv:2605.04880v1 Announce Type: new Abstract: Recent research has revived and amplified interest in algorithms for undiscounted average reward reinforcement learning in infinite-horizon, non-episodic (continuing) tasks. Semi-Markov decision processes (SMDPs) are of particular interest. In SMDPs, discrete actions stochastically generate both rewards and durations, and the objective is to optimize the average reward rate. Existing algorithms approach this by optimizing the ratio of rewards to durations. However, when rewards and durations are non-stationary (in the infinite horizon), this can be incorrect. This paper presents a novel modified harmonic mean operator that correctly computes reward rates even under such conditions. 
This yields model-free learning algorithms that can work with SMDPs, while maintaining robustness to non-stationary reward and duration distributions over time. We prove theoretical properties of the modified harmonic mean operator, and empirically demonstrate its efficacy in comparison to existing algorithms.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04939v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04939\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"Modular Reinforcement Learning For Cooperative Swarms\",\n      \"author\": [\n        \"Erel Shtossel\",\n        \"Gal A. Kaminka\"\n      ],\n      \"summary\": \"arXiv:2605.04939v1 Announce Type: cross Abstract: A cooperative robot swarm is a collective of computationally-limited robots that share a common goal. Each robot can only interact with a small subset of its peers, without knowing how this affects the collective utility. Recent advances in distributed multi-agent reinforcement learning have demonstrated that it is possible for robots to learn how to interact effectively with others, in a manner that is aligned with the common goal, despite each robot learning independently of others. However, this requires each robot to represent a potentially combinatorial number of interaction states, challenging the memory capabilities of the robots. This paper proposes an alternative approach for representing spatial interaction states for multi-robot reinforcement learning in swarms. A modular (decomposed) representation is used, where each feature of the state is handled by a separate learning procedure, and the results aggregated. 
We demonstrate the efficacy of the approach in numerous experiments with simulated robot swarms carrying out foraging.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:36:24.058Z\"\n    },\n    {\n      \"id\": \"https://www.machinebrief.com/news/revolutionizing-code-generation-with-requirement-aware-learn-5p36\",\n      \"url\": \"https://www.machinebrief.com/news/revolutionizing-code-generation-with-requirement-aware-learn-5p36\",\n      \"source_url\": \"https://www.machinebrief.com/rss.xml\",\n      \"title\": \"Revolutionizing Code Generation with Requirement-Aware Learning\",\n      \"author\": [\n        \"Felix Navarro\"\n      ],\n      \"summary\": \"RECRL, a new curriculum reinforcement learning framework, significantly enhances code generation by addressing requirement difficulty. It shows a notable performance boost across five benchmarks.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-05T15:25:34.000Z\",\n      \"fetched_at\": \"2026-05-05T16:10:22.708Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04368v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04368\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Extending Differential Temporal Difference Methods for Episodic Problems\",\n      \"author\": [\n        \"Kris De Asis\",\n        \"Mohamed Elsayed\",\n        \"Jiamin He\"\n      ],\n      \"summary\": \"arXiv:2605.04368v1 Announce Type: new Abstract: Differential temporal difference (TD) methods are value-based reinforcement learning algorithms that have been proposed for infinite-horizon problems. They rely on reward centering, where each reward is centered by the average reward. This keeps the return bounded and removes a value function's state-independent offset. 
However, reward centering can alter the optimal policy in episodic problems, limiting its applicability. Motivated by recent works that emphasize the role of normalization in streaming deep reinforcement learning, we study reward centering in episodic problems and propose a generalization of differential TD. We prove that this generalization maintains the ordering of policies in the presence of termination, and thus extends differential TD to episodic problems. We show equivalence with a form of linear TD, thereby inheriting theoretical guarantees that have been shown for those algorithms. We then extend several streaming reinforcement learning algorithms to their differential counterparts. Across a range of base algorithms and environments, we empirically validate that reward centering can improve sample efficiency in episodic problems.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.05110v1\",\n      \"url\": \"https://arxiv.org/abs/2605.05110\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"LineRides: Line-Guided Reinforcement Learning for Bicycle Robot Stunts\",\n      \"author\": [\n        \"Seungeun Rho\",\n        \"Shamel Fahmi\",\n        \"Jeonghwan Kim\",\n        \"Arianna Ilvonen\",\n        \"Sehoon Ha\",\n        \"Gabriel Nelson\"\n      ],\n      \"summary\": \"arXiv:2605.05110v1 Announce Type: cross Abstract: Designing reward functions for agile robotic maneuvers in reinforcement learning remains difficult, and demonstration-based approaches often require reference motions that are unavailable for novel platforms or extreme stunts. 
We present LineRides, a line-guided learning framework that enables a custom bicycle robot to acquire diverse, commandable stunt behaviors from a user-provided spatial guideline and sparse key-orientations, without demonstrations or explicit timing. LineRides handles physically infeasible guidelines using a tracking margin that permits controlled deviation, resolves temporal ambiguity by measuring progress via traveled distance along the guideline, and disambiguates motion details through position- and sequence-based key-orientations. We evaluate LineRides on the Ultra Mobility Vehicle (UMV) and show that the policy trained with our methods supports seamless transitions between normal driving and stunt execution, enabling five distinct stunts on command: MiniHop, LargeHop, ThreePointTurn, Backflip, and DriftTurn.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:36:24.058Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.05112v1\",\n      \"url\": \"https://arxiv.org/abs/2605.05112\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime\",\n      \"author\": [\n        \"Tianshu Zhu\",\n        \"Wenyu Zhang\",\n        \"Xiaoying Zuo\",\n        \"Lun Tian\",\n        \"Haotian Zhao\",\n        \"Yucheng Zeng\",\n        \"Jingnan Gu\",\n        \"Daxiang Dong\",\n        \"Jianmin Wu\",\n        \"Dawei Yin\",\n        \"Dou Shen\"\n      ],\n      \"summary\": \"arXiv:2605.05112v1 Announce Type: new Abstract: SWE-bench-style agentic reinforcement learning relies on expensive stateful trajectories, yet substantial compute is wasted on sampled rollout groups with skewed pass rates, where binary rewards provide a weak contrastive signal. 
We frame this inefficiency as a pass-rate control problem and show that a 50% pass rate is the most informative operating point: it maximizes reward entropy, the probability of surviving group filtering, RLOO advantage energy under GRPO, and success--failure contrastive structure. Guided by this principle, we propose Prefix Sampling (PS), which replays trajectory prefixes to steer skewed groups toward this regime: successful prefixes serve as head starts for mostly failing groups, while failing prefixes serve as handicaps for mostly passing groups. In stateful agent environments, prefix states are reconstructed through replay while replayed tokens are excluded from the loss, restricting optimization to continuations generated by the current policy. On SWE-bench-style agentic RL, PS delivers end-to-end wall-clock speedups of 2.01x on Qwen3-14B and 1.55x on Qwen3-32B while preserving or improving final verified performance. For 14B, the SWE-bench Verified peak rises from the baseline peak of 0.273 to 0.295 under PS. 
Additional mathematical reasoning experiments on AIME 2025 show the same pass-rate control pattern and decompose the gains into replay, bidirectional coverage, and adaptive control.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.02913v1\",\n      \"url\": \"https://arxiv.org/abs/2605.02913\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning\",\n      \"author\": [\n        \"Rohan Surana\",\n        \"Gagan Mundada\",\n        \"Xunyi Jiang\",\n        \"Chuhan Wang\",\n        \"Zhenwei Tang\",\n        \"Difan Jiao\",\n        \"Zihan Huang\",\n        \"Yuxin Xiong\",\n        \"Junda Wu\",\n        \"Sheldon Yu\",\n        \"Xintong Li\",\n        \"Raghav Jain\",\n        \"Nikki Kuang\",\n        \"Sizhe Zhou\",\n        \"Bowen Jin\",\n        \"Zhendong Chu\",\n        \"Tong Yu\",\n        \"Ryan Rossi\",\n        \"Kuan-Hao Huang\",\n        \"Jingbo Shang\",\n        \"Jiawei Han\",\n        \"Julian McAuley\"\n      ],\n      \"summary\": \"arXiv:2605.02913v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a central post-training tool for improving the reasoning abilities of large language models (LLMs). In these systems, the rollout, the trajectory sampled from a prompt to termination, including intermediate reasoning steps and optional tool or environment interactions, determines the data the optimizer learns from, yet rollout design is often underreported. This survey provides an optimizer-agnostic view of rollout strategies for RL-based post-training of reasoning LLMs. 
We formalize rollout pipelines with unified notation and introduce Generate-Filter-Control-Replay (GFCR), a lifecycle taxonomy that decomposes rollout pipelines into four modular stages: Generate proposes candidate trajectories and topologies; Filter constructs intermediate signals via verifiers, judges, critics; Control allocates compute and makes continuation/branching/stopping decisions under budgets; and Replay retains and reuses artifacts across rollouts without weight updates, including self-evolving curricula that autonomously generate new training tasks. We complement GFCR with a criterion taxonomy of reliability, coverage, and cost sensitivity that characterizes rollout trade-offs. Using this framework, we synthesize methods spanning RL with verifiable rewards, process supervision, judge-based gating, guided and tree/segment rollouts, adaptive compute allocation, early-exit and partial rollouts, throughput optimization, and replay/recomposition for self-improvement. We ground the framework with case studies in math, code/SQL, multimodal reasoning, tool-using agents, and agentic skill benchmarks that evaluate skill induction, reuse, and cross-task transfer. 
Finally, we provide a diagnostic index that maps common rollout pathologies to GFCR modules and mitigation levers, alongside open challenges for building reproducible, compute-efficient, and trustworthy rollout pipelines.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-06T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-06T16:00:29.193Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04542v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04542\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation\",\n      \"author\": [\n        \"Akiyoshi Tomihari\",\n        \"Issei Sato\"\n      ],\n      \"summary\": \"arXiv:2605.04542v1 Announce Type: new Abstract: Recent analyses question whether reinforcement learning (RL) is responsible for strong reasoning in large language models (LLMs). At the same time, distillation and inference-time sampling, including power sampling, have emerged as effective ways to improve LLM performance. However, the relationship among RL, distillation, and sampling remains unclear. In this study, we focus on the power distribution, the target distribution of power sampling, and show that the power distribution bridges sampling, self-reward KL-regularized RL, and self-distillation. From the sampling perspective, we show that inexpensive local approximations cannot reproduce sequence-level power without information about possible suffixes. From the RL perspective, the power distribution is the closed-form optimizer of KL-regularized RL when the model's sequence-level log-probabilities are used as the reward. This identification leads to power self-distillation, an offline distillation surrogate that shares the same target distribution and amortizes the cost of power sampling into supervised training on teacher samples. 
We further show that power self-distillation can achieve self-reward sharpening, while improvement in a downstream true reward is governed by the covariance between true reward and self-reward under the power distribution. Experiments on reasoning tasks support our analysis: power sampling raises self-reward, true-reward gains depend on alignment with self-reward, and power self-distillation can match or exceed the performance of power sampling at much lower inference cost.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2604.01345v2\",\n      \"url\": \"https://arxiv.org/abs/2604.01345\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Malliavin Calculus for Counterfactual Gradient Estimation in Adaptive Inverse Reinforcement Learning\",\n      \"author\": [\n        \"Vikram Krishnamurthy\",\n        \"Luke Snow\"\n      ],\n      \"summary\": \"arXiv:2604.01345v2 Announce Type: replace Abstract: Inverse reinforcement learning (IRL) recovers the loss function of a forward learner from its observed responses. Adaptive IRL aims to reconstruct the loss function of a forward learner by passively observing its gradients as it performs reinforcement learning (RL). This paper proposes a novel passive Langevin-based algorithm that achieves adaptive IRL. The key difficulty in adaptive IRL is that the required gradients in the passive algorithm are counterfactual, that is, they are conditioned on events of probability zero under the forward learner's trajectory. Therefore, naive Monte Carlo estimators are prohibitively inefficient, and kernel smoothing, though common, suffers from slow convergence. We overcome this by employing Malliavin calculus to efficiently estimate the required counterfactual gradients. 
We reformulate the counterfactual conditioning as a ratio of unconditioned expectations involving Malliavin quantities, thus recovering standard estimation rates. We derive the necessary Malliavin derivatives and their adjoint Skorohod integral formulations for a general Langevin structure, and provide a concrete algorithmic approach which exploits these for counterfactual gradient estimation.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    }\n  ],\n  \"total\": 3509\n}"
    }
  ],
  "details": {
    "items": [
      {
        "id": "https://www.machinebrief.com/news/revolutionizing-language-models-a-new-era-with-adaptive-powe-ais5",
        "url": "https://www.machinebrief.com/news/revolutionizing-language-models-a-new-era-with-adaptive-powe-ais5",
        "source_url": "https://www.machinebrief.com/rss.xml",
        "title": "Revolutionizing Language Models: A New Era with Adaptive Power-Mean Optimization",
        "author": [
          "Felix Navarro"
        ],
        "summary": "Adaptive Power-Mean Policy Optimization (APMPO) transforms reinforcement learning for Large Language Models. By embracing dynamic policy strategies, it outperforms traditional static methods, enhancing reasoning capabilities dramatically.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T14:08:40.000Z",
        "fetched_at": "2026-05-07T16:40:31.825Z"
      },
      {
        "id": "oai:arXiv.org:2605.04920v1",
        "url": "https://arxiv.org/abs/2605.04920",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization",
        "author": [
          "Xiyan Fu",
          "Wei Liu"
        ],
        "summary": "arXiv:2605.04920v1 Announce Type: cross Abstract: Compositional generalization refers to correctly interpret novel combinations of known primitives, which remains a major challenge. Existing approaches often rely on supervised fine-tuning, which encourages models to imitate target outputs. This token-level training paradigm fails to capture the global compositional structure required for generalizing to unseen combinations. In this work, we investigate whether compositional generalization can instead be improved through outcome-level reinforcement learning. We adopt Group Relative Policy Optimization to optimize models based on feedback on their final outputs. Within this framework, we explore both a simple binary outcome reward and a composite reward that provides additional composition feedback. Experiments on multiple compositional benchmarks show that reinforcement learning improves compositional generalization compared to supervised fine-tuning. Further analysis reveals that supervised models tend to overfit frequent training compositions, whereas reinforcement learning improves compositional generalization by reshaping the output distribution, particularly for more complex composition types.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:22.293Z"
      },
      {
        "id": "https://www.machinebrief.com/news/reinforcement-learning-gets-a-safety-upgrade-with-alam-67sn",
        "url": "https://www.machinebrief.com/news/reinforcement-learning-gets-a-safety-upgrade-with-alam-67sn",
        "source_url": "https://www.machinebrief.com/rss.xml",
        "title": "Reinforcement Learning Gets a Safety Upgrade with ALaM",
        "author": [
          "Julian Voss"
        ],
        "summary": "The ALaM framework tackles safety in reinforcement learning by optimizing state-wise constraints. It's a big deal for stable training and risk management.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-05T15:53:58.000Z",
        "fetched_at": "2026-05-05T16:10:22.708Z"
      },
      {
        "id": "oai:arXiv.org:2605.02943v1",
        "url": "https://arxiv.org/abs/2605.02943",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Healthcare AI GYM for Medical Agents",
        "author": [
          "Minbyul Jeong"
        ],
        "summary": "arXiv:2605.02943v1 Announce Type: new Abstract: Clinical reasoning demands multi-step interactions -- gathering patient history, ordering tests, interpreting results, and making safe treatment decisions -- yet a unified training environment provides the breadth of clinical domains and specialized tools to train generalizable medical AI agents through reinforcement learning remains elusive. We present a comprehensive empirical study of multi-turn agentic RL for medical AI, built on \\gym{}, a gymnasium-compatible environment spanning 10 clinical domains with 3.6K+ tasks, 135 domain-specific tools, and a knowledge base of 828K medical passages. Our analysis reveals that agentic multi-turn structure degrades into verbose single-turn monologues, characterized by monotonic length explosion and a simultaneous erosion of tool-use frequency. We characterize how this collapse, alongside distillation instability, stems from the misalignment of sparse terminal rewards with sequential clinical trajectories. We find that vanilla GRPO achieves strong final accuracy on some benchmarks but suffers from training instability, evidenced by significant oscillations in response length and prolonged convergence periods. To improve training efficiency and stability, we propose Turn-level Truncated On-Policy Distillation (TT-OPD), a self-distillation framework where a gradient-free EMA teacher leverages outcome-privileged information to provide dense, outcome-aware KL regularization at every conversation turn. TT-OPD achieves the best performance on 10 of 18 benchmarks with an average +3.9~pp improvement over the non-RL baseline with faster early convergence, controlled response length, and sustained multi-turn tool use.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-06T04:00:00.000Z",
        "fetched_at": "2026-05-06T16:00:29.193Z"
      },
      {
        "id": "oai:arXiv.org:2412.08893v3",
        "url": "https://arxiv.org/abs/2412.08893",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Optimal Control with Natural Images: Efficient Reinforcement Learning using Overcomplete Sparse Codes",
        "author": [
          "Peter N. Loxley"
        ],
        "summary": "arXiv:2412.08893v3 Announce Type: replace Abstract: Optimal control and sequential decision making are widely used in many complex tasks. Optimal control over a sequence of natural images is a first step towards understanding the role of vision in control. Here, we formalize this problem as a reinforcement learning task, and derive general conditions under which an image includes enough information to implement an optimal policy. Reinforcement learning is shown to provide a computationally efficient method for finding optimal policies when natural images are encoded into \"efficient\" image representations. This is demonstrated by introducing a new reinforcement learning benchmark that easily scales to large numbers of states and long horizons. In particular, by representing each image as an overcomplete sparse code, we are able to efficiently solve an optimal control task that is orders of magnitude larger than those tasks solvable using complete codes. Theoretical justification for this behaviour is provided. This work also demonstrates that deep learning is not necessary for efficient optimal control with natural images.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2605.03327v1",
        "url": "https://arxiv.org/abs/2605.03327",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment",
        "author": [
          "Hongbo Jin",
          "Rongpeng Zhu",
          "Zhongjing Du",
          "Xu Jiang",
          "Jingqi Tian",
          "Qiaoman Zhang",
          "Jiayu Ding"
        ],
        "summary": "arXiv:2605.03327v1 Announce Type: new Abstract: Reinforcement learning is crucial for aligning large language models to perform complex reasoning tasks. However, current algorithms such as Group Relative Policy Optimization suffer from coarse grained, sequence level credit assignment, which severely struggles to isolate pivotal reasoning steps within long Chain of Thought generations. Furthermore, the standard unbounded Kullback Leibler divergence penalty induces severe gradient instability and mode seeking conservatism, ultimately stifling the discovery of novel reasoning trajectories. To overcome these limitations, we introduce Distribution Guided Policy Optimization, a novel critic free reinforcement learning framework that reinterprets distribution deviation as a guiding signal rather than a rigid penalty.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-06T04:00:00.000Z",
        "fetched_at": "2026-05-06T16:00:29.193Z"
      },
      {
        "id": "oai:arXiv.org:2605.04066v1",
        "url": "https://arxiv.org/abs/2605.04066",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning",
        "author": [
          "Yiming Huang",
          "Zhenbo Shi",
          "Shuzheng Gao",
          "Cuiyun Gao",
          "Peiyi Han",
          "Chuanyi Liu"
        ],
        "summary": "arXiv:2605.04066v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is an essential paradigm that enhances the reasoning capabilities of Large Language Models (LLMs). However, existing methods typically rely on static policy optimization schemes that misalign with the model's evolving reasoning capabilities. To address this issue, we propose Adaptive Power-Mean Policy Optimization (APMPO), which comprises two main innovations: Power-Mean Policy Optimization (PMPO) and Feedback-Adaptive Clipping (FAC). Specifically, PMPO introduces a generalized power-mean objective. This enables the model to adaptively transition from the signal-amplifying behavior of the arithmetic mean to the consistency-enforcing behavior of the geometric mean. FAC adaptively adjusts clipping bounds based on real-time reward statistics to overcome the limitations of static mechanisms. Capitalizing on these innovations, APMPO improves learning dynamics and reasoning performance. Extensive experiments on nine datasets across three reasoning tasks showcase the superiority of APMPO over state-of-the-art RLVR-based baselines. For instance, APMPO boosts the average Pass@1 score on mathematical reasoning benchmarks by 3.0 points compared to GRPO when using Qwen2.5-3B-Instruct.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:22.293Z"
      },
      {
        "id": "oai:arXiv.org:2605.04065v1",
        "url": "https://arxiv.org/abs/2605.04065",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs",
        "author": [
          "Yiming Huang",
          "Zhenbo Shi",
          "Xin-Cheng Wen",
          "Jichuan Zeng",
          "Cuiyun Gao",
          "Peiyi Han",
          "Chuanyi Liu"
        ],
        "summary": "arXiv:2605.04065v1 Announce Type: new Abstract: Unsupervised reinforcement learning (RL) has emerged as a promising paradigm for enabling self-improvement in large language models (LLMs). However, existing unsupervised RL-based methods often lack the capacity to adapt to the model's evolving reasoning capabilities during training. Therefore, these methods can misdirect policy optimization in the absence of ground-truth supervision. To address this issue, we introduce FREIA, a novel RL-based algorithm built on two key innovations: (1) Free Energy-Driven Reward (FER) adapts rewards to balance consensus and exploration based on the Free Energy Principle. (2) Adaptive Advantage Shaping (AAS) adaptively adjusts learning signals based on the statistical characteristics of sampled rewards. Empirical evaluations on nine datasets across three reasoning tasks showcase that FREIA outperforms other unsupervised RL-based baselines. Notably, in mathematical reasoning tasks, FREIA surpasses other methods by an average of 0.5 to 3.5 points in Pass@1 using the DeepSeek-R1-Distill-Qwen-1.5B model.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:22.293Z"
      },
      {
        "id": "oai:arXiv.org:2506.07548v2",
        "url": "https://arxiv.org/abs/2506.07548",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "Overcoming Environmental Meta-Stationarity in MARL via Adaptive Curriculum and Counterfactual Group Advantage",
        "author": [
          "Weiqiang Jin",
          "Yang Liu",
          "Shixiang Tang",
          "Jinhu Qi",
          "Wentao Zhang",
          "Junli Wang",
          "Biao Zhao",
          "Hongyang Du"
        ],
        "summary": "arXiv:2506.07548v2 Announce Type: replace Abstract: Multi-agent reinforcement learning (MARL) has reached competitive performance on cooperative tasks against scripted adversaries, yet most methods train agents at a single fixed difficulty throughout the entire run. We term this static-difficulty regime environmental meta-stationarity and show that it caps policy generalization and steers learning toward shallow local optima. To break this regime, we propose CL-MARL, a dynamic curriculum learning framework that adapts opponent strength online from win-rate signals, advancing or regressing the task as agents master it. Its scheduler, FlexDiff, fuses momentum-based trend estimation with sliding-window dual-curve monitoring of training and evaluation returns, yielding stable difficulty transitions without manual tuning. Because a moving curriculum amplifies non-stationarity and sparsifies global rewards, we introduce the Counterfactual Group Relative Policy Advantage (CGRPA), which extends GRPO-style group-relative optimization with counterfactual baselines to disentangle each agent's contribution under shifting team dynamics. On the StarCraft Multi-Agent Challenge (SMAC), CL-MARL attains a 40% mean win rate on the super-hard maps with an average episode return of 17.85, exceeding the QMIX, OW-QMIX, DER, EMC, and MARR baselines by +2.94 on average, while reaching its peak win rate roughly 1.28faster on 8m_vs_9m and 1.42 faster on 3s5z_vs_3s6z than the strongest baseline. The implementation is publicly available at https://github.com/NICE-HKU/CL2MARL-SMAC.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:36:24.058Z"
      },
      {
        "id": "oai:arXiv.org:2605.04266v1",
        "url": "https://arxiv.org/abs/2605.04266",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Explaining and Preventing Alignment Collapse in Iterative RLHF",
        "author": [
          "Etienne Gauthier",
          "Francis Bach",
          "Michael I. Jordan"
        ],
        "summary": "arXiv:2605.04266v1 Announce Type: new Abstract: Reinforcement learning from human feedback (RLHF) typically assumes a static or non-strategic reward model (RM). In iterative deployment, however, the policy generates the data on which the RM is retrained, creating a feedback loop. Building on the Stackelberg game formulation of this interaction, we derive an analytical decomposition of the policy's true optimization gradient into a standard policy gradient and a parameter-steering term that captures the policy's influence on the RM's future parameters. We show that standard iterative RLHF, which drops this steering term entirely, suffers from alignment collapse: the policy systematically exploits the RM's blind spots, producing low-quality, high-reward outputs whose feedback reinforces the very errors it exploits. To mitigate this, we propose foresighted policy optimization (FPO), a mechanism-design intervention that restores the missing steering term by regularizing the policy's parameter-steering effect on RM updates. We instantiate FPO via a scalable first-order approximation and demonstrate that it prevents alignment collapse on both controlled environments and an LLM alignment pipeline using Llama-3.2-1B.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2602.14872v2",
        "url": "https://arxiv.org/abs/2602.14872",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "The Implicit Curriculum: Learning Dynamics in RL with Verifiable Rewards",
        "author": [
          "Yu Huang",
          "Zixin Wen",
          "Yuejie Chi",
          "Yuting Wei",
          "Aarti Singh",
          "Yingbin Liang",
          "Yuxin Chen"
        ],
        "summary": "arXiv:2602.14872v2 Announce Type: replace Abstract: Reinforcement learning with verifiable rewards (RLVR) has been a main driver of recent breakthroughs in large reasoning models. Yet it remains a mystery how rewards based solely on final outcomes can help overcome the long-horizon barrier to extended reasoning. To understand this, we develop a theory of the training dynamics of RLVR for transformers on compositional reasoning tasks. Our theory shows that mixed-difficulty training naturally follows an implicit curriculum: without any explicit schedule, easier problems become learnable first and shape the frontier for harder ones, creating a learning progression from easy to hard during optimization. The effectiveness of this curriculum is governed by the smoothness of the difficulty spectrum. When the spectrum is smooth, training dynamics enters a well-behaved relay regime, in which persistent gradient signals on easier problems make slightly harder ones tractable and keep training at the edge of competence. When the spectrum contains abrupt discontinuities, training undergoes grokking-type phase transitions with prolonged plateaus before progress recurs. As a technical contribution, our analysis develops and adapts techniques from Fourier analysis on finite groups to our setting. We validate the predicted mechanisms empirically via synthetic experiments.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2601.07389v2",
        "url": "https://arxiv.org/abs/2601.07389",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training",
        "author": [
          "Xueyan Niu",
          "Bo Bai",
          "Wei Han",
          "Weixi Zhang"
        ],
        "summary": "arXiv:2601.07389v2 Announce Type: replace Abstract: Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). These two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs and expert responses, while RL maximizes reward signals derived from human preferences or rule-based verifiers. Modern reasoning models have widely adopted the practice of alternating SFT and RL training. However, there is no theoretical account of whether they can be decoupled. We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: RL increases SFT loss under both distributional (KL-based) and landscape (PL-based) analyses; and (2) RL-then-SFT coupling: SFT lowers the reward achieved by RL under analogous conditions. Under the PL condition, we further derive the optimal RL duration that balances reward improvement against SFT degradation, identify the non-decoupling threshold governing when RL can improve SFT, and bound the gradient misalignment via spectral concentration. Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated without loss of prior performance in the post-training pipeline.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2602.19837v3",
        "url": "https://arxiv.org/abs/2602.19837",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Meta-Learning and Meta-Reinforcement Learning -- Tracing the Path towards DeepMind's Adaptive Agent",
        "author": [
          "Bj\\\"orn Hoppmann",
          "Christoph Scholz"
        ],
        "summary": "arXiv:2602.19837v3 Announce Type: replace-cross Abstract: Humans are highly effective at utilizing prior knowledge to adapt to novel tasks, a capability that standard machine learning models struggle to replicate due to their reliance on task-specific training. Meta-learning overcomes this limitation by allowing models to acquire transferable knowledge from various tasks, enabling rapid adaptation to new challenges with minimal data. This survey provides a rigorous, task-based formalization of meta-learning and meta-reinforcement learning and uses that paradigm to chronicle the landmark algorithms that paved the way for DeepMind's Adaptive Agent, consolidating the essential concepts needed to understand the Adaptive Agent and other generalist approaches.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2605.04960v1",
        "url": "https://arxiv.org/abs/2605.04960",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance",
        "author": [
          "Song Yu",
          "Li Li",
          "Wenwen Zhao",
          "Zhisheng Yang"
        ],
        "summary": "arXiv:2605.04960v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR), particularly Group Relative Policy Optimization (GRPO), has advanced LLM reasoning. However, GRPO suffers from three credit assignment failures: uniform token-level granularity that ignores heterogeneous informational value, uniform polarity that penalizes correct steps and rewards incorrect ones, and zero-variance collapse that erases outcome-driven gradients. We systematically quantify these failures, revealing highly non-uniform token informativeness, widespread step-level polarity misalignment, and substantial training waste. To address these limitations, we propose Entropy-Progress Aligned GRPO (EP-GRPO), a framework that mines the model's intrinsic information flow for dense, self-supervised guidance. EP-GRPO integrates entropy-gated modulation to prioritize high entropy decision pivots, implicit process signals from policy divergence anchored to outcome advantages for directional token-level feedback without external reward models, and cumulative entropy mapping that enables progress-aligned advantage normalization, naturally maintaining gradient flow under zero reward variance. Extensive experiments on mathematical reasoning benchmarks demonstrate that EP-GRPO achieves superior accuracy and efficiency compared to GRPO and its variants. The code will be available.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2605.04712v1",
        "url": "https://arxiv.org/abs/2605.04712",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning",
        "author": [
          "Lirui Luo",
          "Guoxi Zhang",
          "Hongming Xu",
          "Cong Fang",
          "Qing Li"
        ],
        "summary": "arXiv:2605.04712v1 Announce Type: new Abstract: In deep reinforcement learning (DRL), an agent is trained from a stream of experience. In a continual learning setting, such agents can suffer from plasticity loss: their ability to learn new skills from new experiences diminishes over training. Recently, Mixture-of-Experts (MoE) networks have been reported to enable scaling laws and facilitate the learning of diverse skills. However, in continual reinforcement learning settings, their performance can degenerate as learning proceeds, indicating a loss of plasticity. To address this, building on Neural Tangent Kernel (NTK) theory, we formalize the plasticity loss in MoE policies as a loss of spectral plasticity. We then derive a tractable proxy for spectral plasticity, one expressible in terms of individual expert feature matrices. Leveraging this proxy, we introduce SPHERE, a practical Parseval penalty tailored for MoE-based policies that alleviates the loss of spectral plasticity. On MetaWorld and HumanoidBench, SPHERE improves average success under continual RL by 133% and 50% over an unregularized MoE baseline, while maintaining higher spectral plasticity throughout training.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2604.27859v2",
        "url": "https://arxiv.org/abs/2604.27859",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "A Brief Overview: Agentic Reinforcement Learning In Large Language Models",
        "author": [
          "Fangming Cui",
          "Ruixiao Zhu",
          "Cheng Fang",
          "Sunan Li",
          "Jiahong Li"
        ],
        "summary": "arXiv:2604.27859v2 Announce Type: replace Abstract: Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environments. However, the advent of powerful Large Language Models (LLMs) and increasingly complex, open-ended tasks has catalyzed a paradigm shift towards agentic paradigms within RL. This emerging framework extends beyond traditional RL by emphasizing the development of autonomous agents capable of goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning in uncertain, real-world environments. Unlike conventional approaches that rely heavily on static objectives and episodic interactions, LLM-based Agentic RL incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop. In this paper, we provide a deep insight for looking the conceptual foundations, methodological innovations, and effective designs underlying this trend. Furthermore, we identify critical challenges and outline promising future directions for building LLM-based Agentic RL.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:36:24.058Z"
      },
      {
        "id": "oai:arXiv.org:2605.02063v1",
        "url": "https://arxiv.org/abs/2605.02063",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "Coopetition-Gym v1: A Formally Grounded Platform for Mixed-Motive Multi-Agent Reinforcement Learning under Strategic Coopetition",
        "author": [
          "Vik Pant",
          "Eric Yu"
        ],
        "summary": "arXiv:2605.02063v1 Announce Type: cross Abstract: We present Coopetition-Gym v1, a benchmark platform for mixed-motive multi-agent reinforcement learning under strategic coopetition. The platform comprises twenty environments organized into four mechanism classes that correspond to four foundational technical reports: interdependence and complementarity (arXiv:2510.18802), trust and reputation dynamics (arXiv:2510.24909), collective action and loyalty (arXiv:2601.16237), and sequential interaction and reciprocity (arXiv:2604.01240). Each environment carries a closed-form payoff structure and a calibrated interdependence matrix derived from the corresponding report. Every environment exposes a parameterized reward layer configurable across three structurally distinct modes (private, integrated, cooperative). This separation of payoff from reward enables reward-type ablation, the platform's principal methodological apparatus. Four of the twenty environments are calibrated against historically documented coopetitive relationships and reproduce their outcomes at 98.3, 81.7, 86.7, and 87.3 percent on the validation rubric (Samsung-Sony LCD, Renault-Nissan Alliance, Apache HTTP Server, Apple iOS App Store). The platform exposes Gymnasium, PettingZoo Parallel, and PettingZoo AEC interfaces and ships 126 reference algorithms: 16 learning algorithms, 7 game-theoretic oracles, 2 heuristic baselines, and 101 constant-action policies. A reference experimental study trained the 16 learning algorithms on every environment under every reward configuration with seven random seeds, producing a 25,708-run training corpus and a 1,116-run behavioral audit corpus, both released under CC-BY-4.0 with Croissant 1.0 metadata. Coopetition-Gym v1 is the first platform to combine continuous-action mixed-motive environments, parameterized reward mutuality, calibrated interdependence coefficients, game-theoretic oracle baselines, and validated case studies.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-05T04:00:00.000Z",
        "fetched_at": "2026-05-05T16:07:19.561Z"
      },
      {
        "id": "oai:arXiv.org:2602.03452v2",
        "url": "https://arxiv.org/abs/2602.03452",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing",
        "author": [
          "Yujuan Pang",
          "Jiaxin Li",
          "Xin Sheng",
          "Ran Peng",
          "Yong Ma"
        ],
        "summary": "arXiv:2602.03452v2 Announce Type: replace Abstract: Reinforcement learning with verifiable rewards (RLVR) is effective for training large language models on deterministic outcome reasoning tasks. Prior work shows RLVR works with few prompts, but prompt selection is often based only on training-accuracy variance, leading to unstable optimization directions and weaker transfer. We revisit prompt selection from a mechanism-level view and argue that an effective minibatch should provide both (i) a reliable positive anchor and (ii) explicit negative learning signals from rare failures. Based on this principle, we propose \\emph{positive--negative pairing}: at each update, we sample a hard-but-solvable $q^{+}$ and an easy-but-brittle prompt $q^{-}$(high success rate but not perfect), characterized by low and high empirical success rates under multiple rollouts. We further introduce Weighted GRPO, which reweights binary outcomes at the pair level and uses group-normalized advantages to amplify rare successes on $q^{+}$ into sharp positive guidance while turning rare failures on $q^{-}$ into strong negative penalties. This bidirectional signal provides informative learning feedback for both successes and failures, improving sample efficiency without suppressing exploration. On Qwen2.5-Math-7B, a single paired minibatch per update consistently outperforms a GRPO baseline that selects two prompts via commonly used variance-based selection heuristics: AIME~2025 Pass@8 improves from 16.8 to 22.2, and AMC23 Pass@64 from 94.0 to 97.0, while remaining competitive with large-scale RLVR trained from a pool of 1209 training prompts. Similar gains are observed on Qwen2.5-Math-7B-Instruct.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2605.03434v1",
        "url": "https://arxiv.org/abs/2605.03434",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Quantum Hierarchical Reinforcement Learning via Variational Quantum Circuits",
        "author": [
          "Yu-Ting Lee",
          "Samuel Yen-Chi Chen",
          "Fu-Chieh Chang"
        ],
        "summary": "arXiv:2605.03434v1 Announce Type: new Abstract: Reinforcement learning is one of the most challenging learning paradigms where efficacy and efficiency gains are extremely valuable. Hierarchical reinforcement learning is a variant that leverages temporal abstraction to structure decision-making. While parametrized quantum computations have shown success in non-hierarchical reinforcement learning, whether these advantages adapt to hierarchical decision-making remains a critical open question. In this work, we develop a hybrid hierarchical agent based on the option-critic architecture. This hybrid agent substitutes classical components with variational quantum circuits for feature extractors, option-value functions, termination functions, and intra-option policies. Evaluated on standard benchmarking environments, results show that a hybrid agent utilizing a quantum feature extractor can outperform classical baselines while saving up to 66\\% trainable parameters. We also identify an architectural bottleneck that quantum option-value estimation severely degrades performance. Further ablation studies reveal how architectural choices of the quantum circuits affect performance. Our work establishes design principles for parameter-efficient hybrid hierarchical agents.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-06T04:00:00.000Z",
        "fetched_at": "2026-05-06T16:00:29.193Z"
      },
      {
        "id": "oai:arXiv.org:2511.08717v4",
        "url": "https://arxiv.org/abs/2511.08717",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Optimal control of the future via prospective learning with control",
        "author": [
          "Yuxin Bai",
          "Aranyak Acharyya",
          "Ashwin De Silva",
          "Zeyu Shen",
          "James Hassett",
          "Joshua T. Vogelstein"
        ],
        "summary": "arXiv:2511.08717v4 Announce Type: replace-cross Abstract: Optimal control of the future is the next frontier for AI. Current approaches to this problem are typically rooted in reinforcement learning (RL). RL is mathematically distinct from supervised learning, which has been the main workhorse for the recent achievements in AI. Moreover, RL typically operates in a stationary environment with episodic resets, limiting its utility. Here, we extend supervised learning to address learning to control in non-stationary, reset-free environments. Using this framework, called ''Prospective Learning with Control'' (PLuC), we prove that under certain fairly general assumptions, empirical risk minimization (ERM) asymptotically achieves the Bayes optimal policy. We then consider a specific instance of prospective learning with control: foraging, a canonical task relevant to both natural and artificial agents. We illustrate that modern RL algorithms, which assume stationarity, struggle in these non-stationary reset-free environments. Even with time-aware modifications, they converge orders of magnitude slower than our prospective foraging agents on a simple 1-D foraging benchmark. Code is available at: https://github.com/neurodata/procontrol.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-06T04:00:00.000Z",
        "fetched_at": "2026-05-06T16:00:29.193Z"
      },
      {
        "id": "oai:arXiv.org:2602.05890v2",
        "url": "https://arxiv.org/abs/2602.05890",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training",
        "author": [
          "Dingwei Zhu",
          "Zhiheng Xi",
          "Shihan Dou",
          "Jiahan Li",
          "Chenhao Huang",
          "Junjie Ye",
          "Sixian Li",
          "Mingxu Chai",
          "Yuhui Wang",
          "Yajie Yang",
          "Ming Zhang",
          "Jiazheng Zhang",
          "Shichun Liu",
          "Caishuang Huang",
          "Yunke Zhang",
          "Yuran Wang",
          "Tao Gui",
          "Xipeng Qiu",
          "Qi Zhang",
          "Xuanjing Huang"
        ],
        "summary": "arXiv:2602.05890v2 Announce Type: replace-cross Abstract: Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain (OOD) generalization, especially in LLM post-training. Recent distributional RL methods improve robustness by modeling values with multiple quantile points, but they still learn each quantile independently as a scalar. This results in rough-grained value representations that lack fine-grained conditioning on state information, struggling under complex and OOD conditions. We propose DFPO (Distributional Value Flow Policy Optimization with Conditional Risk and Consistency Control), a robust distributional RL framework that models values as continuous flows across time steps. By scaling value modeling through learning of a value flow field instead of isolated quantile predictions, DFPO captures richer state information for more accurate advantage estimation. To stabilize training under noisy feedback, DFPO further integrates conditional risk control and consistency constraints along value flow trajectories. Experiments on dialogue, math reasoning, and scientific tasks show that DFPO outperforms PPO, FlowRL, and other robust baselines under noisy supervision, achieving improved training stability and generalization.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:22.293Z"
      },
      {
        "id": "oai:arXiv.org:2605.04880v1",
        "url": "https://arxiv.org/abs/2605.04880",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "A Harmonic Mean Formulation of Average Reward Reinforcement Learning in SMDPs",
        "author": [
          "Erel Shtossel",
          "Alicia Vidler",
          "Uri Shaham",
          "Gal A. Kaminka"
        ],
        "summary": "arXiv:2605.04880v1 Announce Type: new Abstract: Recent research has revived and amplified interest in algorithms for undiscounted average reward reinforcement learning in infinite-horizon, non-episodic (continuing) tasks. Semi-Markov decision processes (SMDPs) are of particular interest. In SMDPs, discrete actions stochastically generate both rewards and durations, and the objective is to optimize the average reward rate. Existing algorithms approach this by optimizing the ratio of rewards to durations. However, when rewards and durations are non-stationary (in the infinite horizon), this can be incorrect. This paper presents a novel modified harmonic mean operator that correctly computes reward rates even under such conditions. This yields model-free learning algorithms that can work with SMDPs, while maintaining robustness to non-stationary reward and duration distributions over time. We prove theoretical properties of the modified harmonic mean operator, and empirically demonstrate its efficacy in comparison to existing algorithms.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2605.04939v1",
        "url": "https://arxiv.org/abs/2605.04939",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "Modular Reinforcement Learning For Cooperative Swarms",
        "author": [
          "Erel Shtossel",
          "Gal A. Kaminka"
        ],
        "summary": "arXiv:2605.04939v1 Announce Type: cross Abstract: A cooperative robot swarm is a collective of computationally-limited robots that share a common goal. Each robot can only interact with a small subset of its peers, without knowing how this affects the collective utility. Recent advances in distributed multi-agent reinforcement learning have demonstrated that it is possible for robots to learn how to interact effectively with others, in a manner that is aligned with the common goal, despite each robot learning independently of others. However, this requires each robot to represent a potentially combinatorial number of interaction states, challenging the memory capabilities of the robots. This paper proposes an alternative approach for representing spatial interaction states for multi-robot reinforcement learning in swarms. A modular (decomposed) representation is used, where each feature of the state is handled by a separate learning procedure, and the results aggregated. We demonstrate the efficacy of the approach in numerous experiments with simulated robot swarms carrying out foraging.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:36:24.058Z"
      },
      {
        "id": "https://www.machinebrief.com/news/revolutionizing-code-generation-with-requirement-aware-learn-5p36",
        "url": "https://www.machinebrief.com/news/revolutionizing-code-generation-with-requirement-aware-learn-5p36",
        "source_url": "https://www.machinebrief.com/rss.xml",
        "title": "Revolutionizing Code Generation with Requirement-Aware Learning",
        "author": [
          "Felix Navarro"
        ],
        "summary": "RECRL, a new curriculum reinforcement learning framework, significantly enhances code generation by addressing requirement difficulty. It shows a notable performance boost across five benchmarks.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-05T15:25:34.000Z",
        "fetched_at": "2026-05-05T16:10:22.708Z"
      },
      {
        "id": "oai:arXiv.org:2605.04368v1",
        "url": "https://arxiv.org/abs/2605.04368",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Extending Differential Temporal Difference Methods for Episodic Problems",
        "author": [
          "Kris De Asis",
          "Mohamed Elsayed",
          "Jiamin He"
        ],
        "summary": "arXiv:2605.04368v1 Announce Type: new Abstract: Differential temporal difference (TD) methods are value-based reinforcement learning algorithms that have been proposed for infinite-horizon problems. They rely on reward centering, where each reward is centered by the average reward. This keeps the return bounded and removes a value function's state-independent offset. However, reward centering can alter the optimal policy in episodic problems, limiting its applicability. Motivated by recent works that emphasize the role of normalization in streaming deep reinforcement learning, we study reward centering in episodic problems and propose a generalization of differential TD. We prove that this generalization maintains the ordering of policies in the presence of termination, and thus extends differential TD to episodic problems. We show equivalence with a form of linear TD, thereby inheriting theoretical guarantees that have been shown for those algorithms. We then extend several streaming reinforcement learning algorithms to their differential counterparts. Across a range of base algorithms and environments, we empirically validate that reward centering can improve sample efficiency in episodic problems.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2605.05110v1",
        "url": "https://arxiv.org/abs/2605.05110",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "LineRides: Line-Guided Reinforcement Learning for Bicycle Robot Stunts",
        "author": [
          "Seungeun Rho",
          "Shamel Fahmi",
          "Jeonghwan Kim",
          "Arianna Ilvonen",
          "Sehoon Ha",
          "Gabriel Nelson"
        ],
        "summary": "arXiv:2605.05110v1 Announce Type: cross Abstract: Designing reward functions for agile robotic maneuvers in reinforcement learning remains difficult, and demonstration-based approaches often require reference motions that are unavailable for novel platforms or extreme stunts. We present LineRides, a line-guided learning framework that enables a custom bicycle robot to acquire diverse, commandable stunt behaviors from a user-provided spatial guideline and sparse key-orientations, without demonstrations or explicit timing. LineRides handles physically infeasible guidelines using a tracking margin that permits controlled deviation, resolves temporal ambiguity by measuring progress via traveled distance along the guideline, and disambiguates motion details through position- and sequence-based key-orientations. We evaluate LineRides on the Ultra Mobility Vehicle (UMV) and show that the policy trained with our methods supports seamless transitions between normal driving and stunt execution, enabling five distinct stunts on command: MiniHop, LargeHop, ThreePointTurn, Backflip, and DriftTurn.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:36:24.058Z"
      },
      {
        "id": "oai:arXiv.org:2605.05112v1",
        "url": "https://arxiv.org/abs/2605.05112",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime",
        "author": [
          "Tianshu Zhu",
          "Wenyu Zhang",
          "Xiaoying Zuo",
          "Lun Tian",
          "Haotian Zhao",
          "Yucheng Zeng",
          "Jingnan Gu",
          "Daxiang Dong",
          "Jianmin Wu",
          "Dawei Yin",
          "Dou Shen"
        ],
        "summary": "arXiv:2605.05112v1 Announce Type: new Abstract: SWE-bench-style agentic reinforcement learning relies on expensive stateful trajectories, yet substantial compute is wasted on sampled rollout groups with skewed pass rates, where binary rewards provide a weak contrastive signal. We frame this inefficiency as a pass-rate control problem and show that a 50% pass rate is the most informative operating point: it maximizes reward entropy, the probability of surviving group filtering, RLOO advantage energy under GRPO, and success--failure contrastive structure. Guided by this principle, we propose Prefix Sampling (PS), which replays trajectory prefixes to steer skewed groups toward this regime: successful prefixes serve as head starts for mostly failing groups, while failing prefixes serve as handicaps for mostly passing groups. In stateful agent environments, prefix states are reconstructed through replay while replayed tokens are excluded from the loss, restricting optimization to continuations generated by the current policy. On SWE-bench-style agentic RL, PS delivers end-to-end wall-clock speedups of 2.01x on Qwen3-14B and 1.55x on Qwen3-32B while preserving or improving final verified performance. For 14B, the SWE-bench Verified peak rises from the baseline peak of 0.273 to 0.295 under PS. Additional mathematical reasoning experiments on AIME 2025 show the same pass-rate control pattern and decompose the gains into replay, bidirectional coverage, and adaptive control.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2605.02913v1",
        "url": "https://arxiv.org/abs/2605.02913",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning",
        "author": [
          "Rohan Surana",
          "Gagan Mundada",
          "Xunyi Jiang",
          "Chuhan Wang",
          "Zhenwei Tang",
          "Difan Jiao",
          "Zihan Huang",
          "Yuxin Xiong",
          "Junda Wu",
          "Sheldon Yu",
          "Xintong Li",
          "Raghav Jain",
          "Nikki Kuang",
          "Sizhe Zhou",
          "Bowen Jin",
          "Zhendong Chu",
          "Tong Yu",
          "Ryan Rossi",
          "Kuan-Hao Huang",
          "Jingbo Shang",
          "Jiawei Han",
          "Julian McAuley"
        ],
        "summary": "arXiv:2605.02913v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a central post-training tool for improving the reasoning abilities of large language models (LLMs). In these systems, the rollout, the trajectory sampled from a prompt to termination, including intermediate reasoning steps and optional tool or environment interactions, determines the data the optimizer learns from, yet rollout design is often underreported. This survey provides an optimizer-agnostic view of rollout strategies for RL-based post-training of reasoning LLMs. We formalize rollout pipelines with unified notation and introduce Generate-Filter-Control-Replay (GFCR), a lifecycle taxonomy that decomposes rollout pipelines into four modular stages: Generate proposes candidate trajectories and topologies; Filter constructs intermediate signals via verifiers, judges, critics; Control allocates compute and makes continuation/branching/stopping decisions under budgets; and Replay retains and reuses artifacts across rollouts without weight updates, including self-evolving curricula that autonomously generate new training tasks. We complement GFCR with a criterion taxonomy of reliability, coverage, and cost sensitivity that characterizes rollout trade-offs. Using this framework, we synthesize methods spanning RL with verifiable rewards, process supervision, judge-based gating, guided and tree/segment rollouts, adaptive compute allocation, early-exit and partial rollouts, throughput optimization, and replay/recomposition for self-improvement. We ground the framework with case studies in math, code/SQL, multimodal reasoning, tool-using agents, and agentic skill benchmarks that evaluate skill induction, reuse, and cross-task transfer. Finally, we provide a diagnostic index that maps common rollout pathologies to GFCR modules and mitigation levers, alongside open challenges for building reproducible, compute-efficient, and trustworthy rollout pipelines.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-06T04:00:00.000Z",
        "fetched_at": "2026-05-06T16:00:29.193Z"
      },
      {
        "id": "oai:arXiv.org:2605.04542v1",
        "url": "https://arxiv.org/abs/2605.04542",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation",
        "author": [
          "Akiyoshi Tomihari",
          "Issei Sato"
        ],
        "summary": "arXiv:2605.04542v1 Announce Type: new Abstract: Recent analyses question whether reinforcement learning (RL) is responsible for strong reasoning in large language models (LLMs). At the same time, distillation and inference-time sampling, including power sampling, have emerged as effective ways to improve LLM performance. However, the relationship among RL, distillation, and sampling remains unclear. In this study, we focus on the power distribution, the target distribution of power sampling, and show that the power distribution bridges sampling, self-reward KL-regularized RL, and self-distillation. From the sampling perspective, we show that inexpensive local approximations cannot reproduce sequence-level power without information about possible suffixes. From the RL perspective, the power distribution is the closed-form optimizer of KL-regularized RL when the model's sequence-level log-probabilities are used as the reward. This identification leads to power self-distillation, an offline distillation surrogate that shares the same target distribution and amortizes the cost of power sampling into supervised training on teacher samples. We further show that power self-distillation can achieve self-reward sharpening, while improvement in a downstream true reward is governed by the covariance between true reward and self-reward under the power distribution. Experiments on reasoning tasks support our analysis: power sampling raises self-reward, true-reward gains depend on alignment with self-reward, and power self-distillation can match or exceed the performance of power sampling at much lower inference cost.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2604.01345v2",
        "url": "https://arxiv.org/abs/2604.01345",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Malliavin Calculus for Counterfactual Gradient Estimation in Adaptive Inverse Reinforcement Learning",
        "author": [
          "Vikram Krishnamurthy",
          "Luke Snow"
        ],
        "summary": "arXiv:2604.01345v2 Announce Type: replace Abstract: Inverse reinforcement learning (IRL) recovers the loss function of a forward learner from its observed responses. Adaptive IRL aims to reconstruct the loss function of a forward learner by passively observing its gradients as it performs reinforcement learning (RL). This paper proposes a novel passive Langevin-based algorithm that achieves adaptive IRL. The key difficulty in adaptive IRL is that the required gradients in the passive algorithm are counterfactual, that is, they are conditioned on events of probability zero under the forward learner's trajectory. Therefore, naive Monte Carlo estimators are prohibitively inefficient, and kernel smoothing, though common, suffers from slow convergence. We overcome this by employing Malliavin calculus to efficiently estimate the required counterfactual gradients. We reformulate the counterfactual conditioning as a ratio of unconditioned expectations involving Malliavin quantities, thus recovering standard estimation rates. We derive the necessary Malliavin derivatives and their adjoint Skorohod integral formulations for a general Langevin structure, and provide a concrete algorithmic approach which exploits these for counterfactual gradient estimation.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      }
    ],
    "total": 3509
  }
}

▼✓Web search query: "Gymnasium Farama Foundation reinforcement learning 2024 2025", results: 15

{
  "content": [
  ],
  "details": {
    "results": [
      {
        "title": "Python Reinforcement Learning using Gymnasium – Full Course",
        "url": "https://www.youtube.com/watch?v=vufTSJbzKGU",
        "description": "Gymnasium is an open source Python library maintained by the Farama Foundation ... Stanford CS224R Deep Reinforcement Learning | Spring 2025"
      },
      {
        "title": "Gymnasium Documentation",
        "url": "https://gymnasium.farama.org/index.html",
        "description": "Light Logo Dark Logo Gymnasium Documentation. Farama Foundation logo Farama Foundation. ## An API standard for reinforcement learning with a diverse collection of reference environments. **Gymnasium is a maintained fork of OpenAI’s Gym library.** The Gymnasium interface is simple, pythonic, and capable of representing general RL problems, and has a migration guide for old Gym environments:. import  gymnasium  as  gym # Initialise the environment env = gym. make(\"LunarLander-v3\", render_mode = \"human\") # Reset the environment to generate the first observation observation, info = env. reset(seed = 42) for _ in range(1000): # this is where you would insert your policy action = env. sample()# step (transition) through the environment with the action# receiving the next observation, reward and if the episode has terminated or truncated observation, reward, terminated, truncated, info = env. step(action) # If the episode has ended then we can reset to start a new episode if terminated or truncated: observation, info = env."
      },
      {
        "title": "[PDF] Gymnasium: A Standard Interface for Reinforcement Learning ...",
        "url": "https://arxiv.org/pdf/2407.17032",
        "description": "Gymnasium: A Standardized Interface for Reinforcement Learning Environments Mark Towers† University of Southampton & Farama Foundation mt5g17@soton.ac.uk Ariel Kwiatkowski†‡ Meta AI, FAIR & Farama Foundation kwiat@meta.com Jordan Terry† Farama Foundation jkterry@farama.org John U. Gymnasium is an open-source library that provides a standardized API for RL environments, aiming to tackle this issue, with over 18 million installations. Gymnasium’s main feature is a set of abstractions that allow for wide interoperability between envi-ronments and training algorithms, making it easier for researchers to develop and test new environments and/or RL algorithms. In addition, Gymnasium provides a collection of built-in easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL re-search. In summary, Gymnasium provides the following novel contributions to the field of DRL: • A maintained API for handling Reinforcement Learning Environment with a wide range of built-in environments (Figure 1), a collection of compatible external environments, and support from numerous training libraries."
      },
      {
        "title": "[PDF] arXiv:2407.17032v1 [cs.LG] 24 Jul 2024 - ePrints Soton",
        "url": "https://eprints.soton.ac.uk/493069/1/2407.17032v1.pdf",
        "description": "This paper outlines the main design decisions for Gymnasium, its key features, and the differences to alternative APIs. 1 Introduction With the publication of a Deep Q-Networks (DQN) [Mnih et al., 2013], Reinforcement Learning (RL) was awoken from its Artificial Intelligence (AI) winter, showing that a general neural network-based algorithm can achieve expert-level performance across a range of complex tasks. • Third party - Arcade Learning Environments [Bellemare et al., 2013] for playing Atari 2600 ROMs, Safety Gymnasium [Ji et al., 2023] for testing safe RL algorithms in robotics environments, HighwayEnv [Leurent, 2018] for simulating various driving situations and PyFlyt [Tai et al., 2023] for drone flying simulation."
      },
      {
        "title": "External Environments - Gymnasium Documentation",
        "url": "https://gymnasium.farama.org/environments/third_party_environments/",
        "description": "The Farama Foundation maintains a number of other projects, which use the Gymnasium API, environments include: gridworlds (Minigrid), robotics (Gymnasium-Robotics), 3D navigation (Miniworld), web interaction (MiniWoB++), arcade games (Arcade Learning Environment), Doom (ViZDoom), Meta-objective robotics (Metaworld), autonomous driving (HighwayEnv), Retro Games (stable-retro), and many more. * BlueSky-Gym: Reinforcement Learning Environments for Air Traffic Applications. AnyTrading is a collection of Gym environments for reinforcement learning-based trading algorithms with a great focus on simplicity, flexibility, and comprehensiveness. Craftium wraps the Minetest game engine into the Gymnasium API, providing a modern and easy-to-use platform for designing Minecraft-like RL environments. * gym-pybullet-drones: Environments for quadcopter control. * panda-gym: Robotics environments using the PyBullet physics engine. A simple environment for benchmarking single and multi-agent reinforcement learning algorithms on a clone of Slime Volleyball game. Robo-gym provides a collection of reinforcement learning environments involving robotic tasks applicable in both simulation and real-world robotics. Gym environments that let you control real robots in a laboratory via the internet."
      },
      {
        "title": "Farama-Foundation/Gymnasium: An API standard for single ... - GitHub",
        "url": "https://github.com/farama-foundation/gymnasium",
        "description": "Gymnasium is an open source Python library for developing and comparing reinforcement learning algorithms by providing a standard API to communicate between"
      },
      {
        "title": "Gymnasium: A Standard Interface for Reinforcement Learning Environments | OpenReview",
        "url": "https://openreview.net/forum?id=qPMLvJxtPK&referrer=%5Bthe%20profile%20of%20John%20U.%20Balis%5D(%2Fprofile%3Fid%3D~John_U._Balis1)",
        "description": "back arrowBack to **the profile of John U. ## Gymnasium: A Standard Interface for Reinforcement Learning Environments. **TL;DR:** An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym). **Abstract:** Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium's main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential. Gymnasium is available online at \\url{https://github.com/Farama-Foundation/Gymnasium}."
      },
      {
        "title": "Blog | The Farama Foundation",
        "url": "https://farama.org/blog",
        "description": "Gymnasium offers three options, for which, we present descriptions ... reinforcement learning (“RL”) libraries in a neutral nonprofit body."
      },
      {
        "title": "Gymnasium: A Standard Interface for Reinforcement Learning ...",
        "url": "https://ui.adsabs.harvard.edu/abs/2024arXiv240717032T/abstract",
        "description": "Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue."
      },
      {
        "title": "‪Jordan K Terry‬ - ‪Google Scholar‬",
        "url": "https://scholar.google.com/citations?user=gb83gcIAAAAJ&hl=en",
        "description": "Gymnasium: A standard interface for reinforcement learning environments. M ... URL http://github. com/Farama-Foundation/Gymnasium-Robotics, 10, 0. 66"
      },
      {
        "title": "A Standardized Interface for Reinforcement Learning Environments",
        "url": "https://arxiv.org/html/2407.17032v4",
        "description": "By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full"
      },
      {
        "title": "Build Custom Gymnasium Environments: Step-by-Step Guide",
        "url": "https://ai.gopubby.com/creating-custom-environments-with-gymnasium-a-step-by-step-guide-266f23398fcd",
        "description": "Learn How to Build a 2D Grid Game Environment for AI and Reinforcement Learning Using Farama Foundation's Gymnasium."
      },
      {
        "title": "Reinforcement Learning with Gymnasium: A Practical Guide",
        "url": "https://www.datacamp.com/tutorial/reinforcement-learning-with-gymnasium",
        "description": "# Reinforcement Learning with Gymnasium: A Practical Guide. Understand the basics of Reinforcement Learning (RL) and explore the Gymnasium software package to build and test RL algorithms using Python. Reinforcement Learning (RL) is one of the three main machine learning paradigms, the other two being supervised and unsupervised learning. In RL, an agent learns to interact with its environment to maximize the cumulative rewards. Reinforcement Learning with Human Feedback (RLHF) allows the agent to modify behavior based on human inputs at each step. When deep neural networks are used to apply RL algorithms, it is called Deep Reinforcement Learning. In this tutorial, I’ll show you how to get started with Gymnasium, an open-source Python library for developing and comparing reinforcement learning algorithms. I'll demonstrate how to set it up, explore various RL environments, and use Python to build a simple agent to implement an RL algorithm. Gymnasium is an open-source Python library designed to support the development of RL algorithms."
      },
      {
        "title": "[PDF] GYMNASIUM - OpenReview",
        "url": "https://openreview.net/pdf?id=feFlfuOse1",
        "description": "Through this unified framework, Gymnasium significantly streamlines the process of developing and testing RL algorithms, enabling researchers to focus more on"
      },
      {
        "title": "Gymnasium for Reinforcement Learning | by ANSHUMAN PRAJAPATI",
        "url": "https://medium.com/@abdprajapati090/gymnasium-for-reinforcement-learning-412fd895e7b9",
        "description": "Gymnasium is an open-source Python library that provides: A standard API for defining environments in reinforcement learning (RL). OpenReview+"
      }
    ]
  }
}

▼thought

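The on-site search results are mostly reinforcement learning papers, with no entries about the Gymnasium library itself. Let me gather more detailed information on Gymnasium, including its documentation, GitHub repository, and paper.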

▼✓Fetch web page text URL: "https://gymnasium.farama.org/index.html"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://gymnasium.farama.org/index.html\",\n  \"title\": \"Gymnasium Documentation\",\n  \"summary\": \"A standard API for reinforcement learning and a diverse set of reference environments (formerly Gym)\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><div class=\\\"page\\\">\\n          <a href=\\\"#\\\">\\n            <svg xmlns=\\\"http://www.w3.org/2000/svg\\\" viewBox=\\\"0 0 24 24\\\">\\n              <path d=\\\"M13 20h-2V8l-5.5 5.5-1.42-1.42L12 4.16l7.92 7.92-1.42 1.42L13 8v12z\\\"></path>\\n            </svg>\\n            <span>Back to top</span>\\n          </a>\\n          <div><p>\\n  <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/edit/main/docs/index.md\\\" title=\\\"Edit this page\\\">\\n    \\n    <span>Edit this page</span>\\n  </a>\\n</p>\\n            <p><label for=\\\"__toc\\\">\\n              <p>Toggle table of contents sidebar</p>\\n              <i><svg>\\n                  <use href=\\\"#svg-toc\\\"></use>\\n                </svg></i>\\n            </label>\\n          </p></div>\\n          <article role=\\\"main\\\">\\n            \\n            <figure>\\n<img alt=\\\"Gymnasium Logo\\\" src=\\\"https://gymnasium.farama.org/_images/gymnasium-text.png\\\">\\n</figure>\\n<p>\\n<h2>\\nAn API standard for reinforcement learning with a diverse collection of reference environments\\n</h2>\\n</p>\\n<figure>\\n<a href=\\\"https://gymnasium.farama.org/_images/lunar_lander.gif\\\"><img alt=\\\"Lunar Lander\\\" src=\\\"https://gymnasium.farama.org/_images/lunar_lander.gif\\\">\\n</a>\\n</figure>\\n<p><strong>Gymnasium is a maintained fork of OpenAI’s Gym library.</strong> The Gymnasium interface is simple, pythonic, and capable of representing general RL problems, and has a <a href=\\\"https://gymnasium.farama.org/introduction/migration_guide/\\\"><span>migration guide</span></a> for old Gym environments:</p>\\n<div><pre><span></span><span>import</span><span> 
</span><span>gymnasium</span><span> </span><span>as</span><span> </span><span>gym</span>\\n\\n<span># Initialise the environment</span>\\n<span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"LunarLander-v3\\\"</span><span>,</span> <span>render_mode</span><span>=</span><span>\\\"human\\\"</span><span>)</span>\\n\\n<span># Reset the environment to generate the first observation</span>\\n<span>observation</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>(</span><span>seed</span><span>=</span><span>42</span><span>)</span>\\n<span>for</span> <span>_</span> <span>in</span> <span>range</span><span>(</span><span>1000</span><span>):</span>\\n    <span># this is where you would insert your policy</span>\\n    <span>action</span> <span>=</span> <span>env</span><span>.</span><span>action_space</span><span>.</span><span>sample</span><span>()</span>\\n\\n    <span># step (transition) through the environment with the action</span>\\n    <span># receiving the next observation, reward and if the episode has terminated or truncated</span>\\n    <span>observation</span><span>,</span> <span>reward</span><span>,</span> <span>terminated</span><span>,</span> <span>truncated</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>step</span><span>(</span><span>action</span><span>)</span>\\n\\n    <span># If the episode has ended then we can reset to start a new episode</span>\\n    <span>if</span> <span>terminated</span> <span>or</span> <span>truncated</span><span>:</span>\\n        <span>observation</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>()</span>\\n\\n<span>env</span><span>.</span><span>close</span><span>()</span>\\n</pre></div>\\n\\n\\n\\n\\n\\n\\n          </article>\\n        </div></div>\"\n}"
    }
  ],
  "details": {
    "url": "https://gymnasium.farama.org/index.html",
    "title": "Gymnasium Documentation",
    "summary": "A standard API for reinforcement learning and a diverse set of reference environments (formerly Gym)",
    "content": "<div id=\"readability-page-1\" class=\"page\"><div class=\"page\">\n          <a href=\"#\">\n            <svg xmlns=\"http://www.w3.org/2000/svg\" viewBox=\"0 0 24 24\">\n              <path d=\"M13 20h-2V8l-5.5 5.5-1.42-1.42L12 4.16l7.92 7.92-1.42 1.42L13 8v12z\"></path>\n            </svg>\n            <span>Back to top</span>\n          </a>\n          <div><p>\n  <a href=\"https://github.com/Farama-Foundation/Gymnasium/edit/main/docs/index.md\" title=\"Edit this page\">\n    \n    <span>Edit this page</span>\n  </a>\n</p>\n            <p><label for=\"__toc\">\n              <p>Toggle table of contents sidebar</p>\n              <i><svg>\n                  <use href=\"#svg-toc\"></use>\n                </svg></i>\n            </label>\n          </p></div>\n          <article role=\"main\">\n            \n            <figure>\n<img alt=\"Gymnasium Logo\" src=\"https://gymnasium.farama.org/_images/gymnasium-text.png\">\n</figure>\n<p>\n<h2>\nAn API standard for reinforcement learning with a diverse collection of reference environments\n</h2>\n</p>\n<figure>\n<a href=\"https://gymnasium.farama.org/_images/lunar_lander.gif\"><img alt=\"Lunar Lander\" src=\"https://gymnasium.farama.org/_images/lunar_lander.gif\">\n</a>\n</figure>\n<p><strong>Gymnasium is a maintained fork of OpenAI’s Gym library.</strong> The Gymnasium interface is simple, pythonic, and capable of representing general RL problems, and has a <a href=\"https://gymnasium.farama.org/introduction/migration_guide/\"><span>migration guide</span></a> for old Gym environments:</p>\n<div><pre><span></span><span>import</span><span> </span><span>gymnasium</span><span> </span><span>as</span><span> </span><span>gym</span>\n\n<span># Initialise the environment</span>\n<span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\"LunarLander-v3\"</span><span>,</span> <span>render_mode</span><span>=</span><span>\"human\"</span><span>)</span>\n\n<span># Reset 
the environment to generate the first observation</span>\n<span>observation</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>(</span><span>seed</span><span>=</span><span>42</span><span>)</span>\n<span>for</span> <span>_</span> <span>in</span> <span>range</span><span>(</span><span>1000</span><span>):</span>\n    <span># this is where you would insert your policy</span>\n    <span>action</span> <span>=</span> <span>env</span><span>.</span><span>action_space</span><span>.</span><span>sample</span><span>()</span>\n\n    <span># step (transition) through the environment with the action</span>\n    <span># receiving the next observation, reward and if the episode has terminated or truncated</span>\n    <span>observation</span><span>,</span> <span>reward</span><span>,</span> <span>terminated</span><span>,</span> <span>truncated</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>step</span><span>(</span><span>action</span><span>)</span>\n\n    <span># If the episode has ended then we can reset to start a new episode</span>\n    <span>if</span> <span>terminated</span> <span>or</span> <span>truncated</span><span>:</span>\n        <span>observation</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>()</span>\n\n<span>env</span><span>.</span><span>close</span><span>()</span>\n</pre></div>\n\n\n\n\n\n\n          </article>\n        </div></div>"
  }
}

▼✓Fetch web page text URL: "https://arxiv.org/abs/2407.17032"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://arxiv.org/abs/2407.17032\",\n  \"title\": \"Gymnasium: A Standard Interface for Reinforcement Learning Environments\",\n  \"author\": \"[Submitted on 24 Jul 2024 (v1), last revised 2 Nov 2025 (this version, v4)]\",\n  \"summary\": \"Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. However, despite its promise, RL research is often hindered b…\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><div id=\\\"content-inner\\\">\\n    \\n    \\n    <div><p><span>Authors:</span><a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Towers,+M\\\" rel=\\\"nofollow\\\">Mark Towers</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Kwiatkowski,+A\\\" rel=\\\"nofollow\\\">Ariel Kwiatkowski</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Terry,+J\\\" rel=\\\"nofollow\\\">Jordan Terry</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Balis,+J+U\\\" rel=\\\"nofollow\\\">John U. 
Balis</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=De+Cola,+G\\\" rel=\\\"nofollow\\\">Gianluca De Cola</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Deleu,+T\\\" rel=\\\"nofollow\\\">Tristan Deleu</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Goul%C3%A3o,+M\\\" rel=\\\"nofollow\\\">Manuel Goulão</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Kallinteris,+A\\\" rel=\\\"nofollow\\\">Andreas Kallinteris</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Krimmel,+M\\\" rel=\\\"nofollow\\\">Markus Krimmel</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=KG,+A\\\" rel=\\\"nofollow\\\">Arjun KG</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Perez-Vicente,+R\\\" rel=\\\"nofollow\\\">Rodrigo Perez-Vicente</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Pierr%C3%A9,+A\\\" rel=\\\"nofollow\\\">Andrea Pierré</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Schulhoff,+S\\\" rel=\\\"nofollow\\\">Sander Schulhoff</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Tai,+J+J\\\" rel=\\\"nofollow\\\">Jun Jet Tai</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Tan,+H\\\" rel=\\\"nofollow\\\">Hannah Tan</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Younis,+O+G\\\" rel=\\\"nofollow\\\">Omar G. Younis</a></p></div>            \\n    <p><a href=\\\"https://arxiv.org/pdf/2407.17032\\\">View PDF</a>\\n    <a href=\\\"https://arxiv.org/html/2407.17032v4\\\">HTML (experimental)</a></p><blockquote>\\n            <span>Abstract:</span>Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. 
However, despite its promise, RL research is often hindered by the lack of standardization in environment and algorithm implementations. This makes it difficult for researchers to compare and build upon each other's work, slowing down progress in the field. Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium's main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research. Through this unified framework, Gymnasium significantly streamlines the process of developing and testing RL algorithms, enabling researchers to focus more on innovation and less on implementation details. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential. 
Gymnasium is available online at <a href=\\\"https://github.com/Farama-Foundation/Gymnasium\\\" rel=\\\"external noopener nofollow\\\">this https URL</a>\\n    </blockquote>\\n\\n    <!--CONTEXT-->\\n    <div>\\n      <table summary=\\\"Additional metadata\\\">        <tbody><tr>\\n          <td>Comments:</td>\\n          <td>Accepted at NeurIPS Datasets and Benchmarks 2025</td>\\n        </tr>\\n<tr>\\n          <td>Subjects:</td>\\n          <td>\\n            <span>Machine Learning (cs.LG)</span>; Digital Libraries (cs.DL)</td>\\n        </tr><tr>\\n          <td>Cite as:</td>\\n          <td><span><a href=\\\"https://arxiv.org/abs/2407.17032\\\">arXiv:2407.17032</a> [cs.LG]</span></td>\\n        </tr>\\n        <tr>\\n          <td>&nbsp;</td>\\n          <td>(or <span>\\n              <a href=\\\"https://arxiv.org/abs/2407.17032v4\\\">arXiv:2407.17032v4</a> [cs.LG]</span> for this version)\\n          </td>\\n        </tr>\\n        <tr>\\n          <td>&nbsp;</td>\\n          <td>              <a href=\\\"https://doi.org/10.48550/arXiv.2407.17032\\\" id=\\\"arxiv-doi-link\\\">https://doi.org/10.48550/arXiv.2407.17032</a><div>\\n              <!-- tooltip description -->\\n              <p><span></span>                  arXiv-issued DOI via DataCite</p>\\n            </div>\\n          </td>\\n        </tr></tbody></table>\\n    </div>\\n  </div><div>\\n      <h2>Submission history</h2><p> From: Mark Towers [<a href=\\\"https://arxiv.org/show-email/97d21085/2407.17032\\\" rel=\\\"nofollow\\\">view email</a>]      <br>            <strong><a href=\\\"https://arxiv.org/abs/2407.17032v1\\\" rel=\\\"nofollow\\\">[v1]</a></strong>\\n        Wed, 24 Jul 2024 06:35:05 UTC (388 KB)<br>\\n            <strong><a href=\\\"https://arxiv.org/abs/2407.17032v2\\\" rel=\\\"nofollow\\\">[v2]</a></strong>\\n        Wed, 9 Oct 2024 20:48:15 UTC (1,320 KB)<br>\\n            <strong><a href=\\\"https://arxiv.org/abs/2407.17032v3\\\" rel=\\\"nofollow\\\">[v3]</a></strong>\\n        
Fri, 8 Nov 2024 16:08:51 UTC (1,320 KB)<br>\\n    <strong>[v4]</strong>\\n        Sun, 2 Nov 2025 13:42:19 UTC (1,018 KB)<br>\\n</p></div></div>\"\n}"
    }
  ],
  "details": {
    "url": "https://arxiv.org/abs/2407.17032",
    "title": "Gymnasium: A Standard Interface for Reinforcement Learning Environments",
    "author": "[Submitted on 24 Jul 2024 (v1), last revised 2 Nov 2025 (this version, v4)]",
    "summary": "Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. However, despite its promise, RL research is often hindered b…",
    "content": "<div id=\"readability-page-1\" class=\"page\"><div id=\"content-inner\">\n    \n    \n    <div><p><span>Authors:</span><a href=\"https://arxiv.org/search/cs?searchtype=author&amp;query=Towers,+M\" rel=\"nofollow\">Mark Towers</a>, <a href=\"https://arxiv.org/search/cs?searchtype=author&amp;query=Kwiatkowski,+A\" rel=\"nofollow\">Ariel Kwiatkowski</a>, <a href=\"https://arxiv.org/search/cs?searchtype=author&amp;query=Terry,+J\" rel=\"nofollow\">Jordan Terry</a>, <a href=\"https://arxiv.org/search/cs?searchtype=author&amp;query=Balis,+J+U\" rel=\"nofollow\">John U. Balis</a>, <a href=\"https://arxiv.org/search/cs?searchtype=author&amp;query=De+Cola,+G\" rel=\"nofollow\">Gianluca De Cola</a>, <a href=\"https://arxiv.org/search/cs?searchtype=author&amp;query=Deleu,+T\" rel=\"nofollow\">Tristan Deleu</a>, <a href=\"https://arxiv.org/search/cs?searchtype=author&amp;query=Goul%C3%A3o,+M\" rel=\"nofollow\">Manuel Goulão</a>, <a href=\"https://arxiv.org/search/cs?searchtype=author&amp;query=Kallinteris,+A\" rel=\"nofollow\">Andreas Kallinteris</a>, <a href=\"https://arxiv.org/search/cs?searchtype=author&amp;query=Krimmel,+M\" rel=\"nofollow\">Markus Krimmel</a>, <a href=\"https://arxiv.org/search/cs?searchtype=author&amp;query=KG,+A\" rel=\"nofollow\">Arjun KG</a>, <a href=\"https://arxiv.org/search/cs?searchtype=author&amp;query=Perez-Vicente,+R\" rel=\"nofollow\">Rodrigo Perez-Vicente</a>, <a href=\"https://arxiv.org/search/cs?searchtype=author&amp;query=Pierr%C3%A9,+A\" rel=\"nofollow\">Andrea Pierré</a>, <a href=\"https://arxiv.org/search/cs?searchtype=author&amp;query=Schulhoff,+S\" rel=\"nofollow\">Sander Schulhoff</a>, <a href=\"https://arxiv.org/search/cs?searchtype=author&amp;query=Tai,+J+J\" rel=\"nofollow\">Jun Jet Tai</a>, <a href=\"https://arxiv.org/search/cs?searchtype=author&amp;query=Tan,+H\" rel=\"nofollow\">Hannah Tan</a>, <a href=\"https://arxiv.org/search/cs?searchtype=author&amp;query=Younis,+O+G\" rel=\"nofollow\">Omar G. 
Younis</a></p></div>            \n    <p><a href=\"https://arxiv.org/pdf/2407.17032\">View PDF</a>\n    <a href=\"https://arxiv.org/html/2407.17032v4\">HTML (experimental)</a></p><blockquote>\n            <span>Abstract:</span>Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. However, despite its promise, RL research is often hindered by the lack of standardization in environment and algorithm implementations. This makes it difficult for researchers to compare and build upon each other's work, slowing down progress in the field. Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium's main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research. Through this unified framework, Gymnasium significantly streamlines the process of developing and testing RL algorithms, enabling researchers to focus more on innovation and less on implementation details. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential. 
Gymnasium is available online at <a href=\"https://github.com/Farama-Foundation/Gymnasium\" rel=\"external noopener nofollow\">this https URL</a>\n    </blockquote>\n\n    <!--CONTEXT-->\n    <div>\n      <table summary=\"Additional metadata\">        <tbody><tr>\n          <td>Comments:</td>\n          <td>Accepted at NeurIPS Datasets and Benchmarks 2025</td>\n        </tr>\n<tr>\n          <td>Subjects:</td>\n          <td>\n            <span>Machine Learning (cs.LG)</span>; Digital Libraries (cs.DL)</td>\n        </tr><tr>\n          <td>Cite as:</td>\n          <td><span><a href=\"https://arxiv.org/abs/2407.17032\">arXiv:2407.17032</a> [cs.LG]</span></td>\n        </tr>\n        <tr>\n          <td>&nbsp;</td>\n          <td>(or <span>\n              <a href=\"https://arxiv.org/abs/2407.17032v4\">arXiv:2407.17032v4</a> [cs.LG]</span> for this version)\n          </td>\n        </tr>\n        <tr>\n          <td>&nbsp;</td>\n          <td>              <a href=\"https://doi.org/10.48550/arXiv.2407.17032\" id=\"arxiv-doi-link\">https://doi.org/10.48550/arXiv.2407.17032</a><div>\n              <!-- tooltip description -->\n              <p><span></span>                  arXiv-issued DOI via DataCite</p>\n            </div>\n          </td>\n        </tr></tbody></table>\n    </div>\n  </div><div>\n      <h2>Submission history</h2><p> From: Mark Towers [<a href=\"https://arxiv.org/show-email/97d21085/2407.17032\" rel=\"nofollow\">view email</a>]      <br>            <strong><a href=\"https://arxiv.org/abs/2407.17032v1\" rel=\"nofollow\">[v1]</a></strong>\n        Wed, 24 Jul 2024 06:35:05 UTC (388 KB)<br>\n            <strong><a href=\"https://arxiv.org/abs/2407.17032v2\" rel=\"nofollow\">[v2]</a></strong>\n        Wed, 9 Oct 2024 20:48:15 UTC (1,320 KB)<br>\n            <strong><a href=\"https://arxiv.org/abs/2407.17032v3\" rel=\"nofollow\">[v3]</a></strong>\n        Fri, 8 Nov 2024 16:08:51 UTC (1,320 KB)<br>\n    <strong>[v4]</strong>\n        Sun, 2 Nov 2025 
13:42:19 UTC (1,018 KB)<br>\n</p></div></div>"
  }
}

▼✓Web search result count: 15, query: "Gymnasium vs OpenAI Gym differences history migration Farama Foundation"

{
"content": [
{
"type": "text",
"text": "{\n \"results\": [\n {\n \"title\": \"Gym Migration Guide - Gymnasium Documentation\",\n \"url\": \"https://gymnasium.farama.org/introduction/migration_guide/\",\n \"description\": \"# Gym Migration Guide¶. **If you’re new to Gymnasium**: You can probably skip this page! This guide is for users migrating from older versions of OpenAI Gym. If you’re just starting with RL, head to Basic Usage instead. **If you’re migrating from OpenAI Gym**: This guide will help you update your code to work with Gymnasium. **If you’re updating old tutorials**: Many online RL tutorials use the old v0.21 API. Gymnasium is a fork of OpenAI Gym v0.26, which introduced breaking changes from Gym v0.21. | **Environment Reset** | `obs =env.reset()` | `obs, info =env.reset()` | Training loops |. | **Random Seeding** | `env.seed(42)` | `env.reset(seed=42)` | Reproducibility |. | **Step Function** | `obs,reward,done, info =env.step(action)` | `obs,reward,terminated,truncated, info =env.step(action)` | RL algorithms |. import gym # Environment creation and seeding env = gym. import gymnasium as gym# Note: 'gymnasium' not 'gym' # Environment creation with render mode specified upfront env = gym. make(\\\"LunarLander-v3\\\", render_mode = \\\"human\\\") # Reset with seed parameter observation, info = env.\"\n },\n {\n \"title\": \"OpenAI Gym and Gymnasium: Reinforcement Learning ...\",\n \"url\": \"https://aignishant.medium.com/openai-gym-and-gymnasium-reinforcement-learning-environments-for-python-c889aed0e784\",\n \"description\": \"Gymnasium, maintained by the Farama Foundation, emerged as the successor to OpenAI Gym. 
Key improvements include: More consistent API design\"\n },\n {\n \"title\": \"Feature Request: migrate from gym to gymnasium #598 - GitHub\",\n \"url\": \"https://github.com/duburcqa/jiminy/issues/598\",\n \"description\": \"Gymnasium is the maintained version of openai gym and is compatible with current RL training libraries (rllib and tianshou have already migrated\"\n },\n {\n \"title\": \"Tech Thursdays: A practical guide to Gymnasium (the ... - Medium\",\n \"url\": \"https://medium.com/@gautsoni/tech-thursdays-a-practical-guide-to-gymnasium-the-modern-openai-gym-1b739aaa1a7a\",\n \"description\": \"Today, the actively maintained successor is Gymnasium — a Farama Foundation ... Migration from OpenAI Gym (and compatibility). Coming from older\"\n },\n {\n \"title\": \"Python Reinforcement Learning using Gymnasium – Full Course\",\n \"url\": \"https://www.youtube.com/watch?v=vufTSJbzKGU\",\n \"description\": \"... Gymnasium (previously called OpenAI Gym). Gymnasium ... Gymnasium is maintained by the Farama Foundation and is not associated with OpenAI.\"\n },\n {\n \"title\": \"A Gentle Introduction to OpenAI Gym | intro_to_gym - Wandb\",\n \"url\": \"https://wandb.ai/mukilan/intro_to_gym/reports/A-Gentle-Introduction-to-OpenAI-Gym--VmlldzozMjg5MTA3\",\n \"description\": \"In 2021, a non-profit organization called the Farama Foundation took over Gym. They introduced new features into Gym, renaming it Gymnasium.\"\n },\n {\n \"title\": \"A Standardized Interface for Reinforcement Learning Environments\",\n \"url\": \"https://arxiv.org/html/2407.17032v4\",\n \"description\": \"Gymnasium is the maintained successor to OpenAI Gym, which has become widely adopted, with over a million downloads in April 2025 * **https://\"\n },\n {\n \"title\": \"[PDF] Autonomous and Adaptive Systems Introduction to Gym/Gymnasium\",\n \"url\": \"https://www.mircomusolesi.org/courses/AAS22-23/AAS22-23_Gym.pdf\",\n \"description\": \"algorithms. 
‣Originally developed by Open AI, since October 2022 maintained by the Farama Foundation as the Gymnasium project.\"\n },\n {\n \"title\": \"Gymnasium: A Standard Interface for Reinforcement Learning...\",\n \"url\": \"https://openreview.net/forum?id=qPMLvJxtPK&referrer=%5Bthe%20profile%20of%20John%20U.%20Balis%5D(%2Fprofile%3Fid%3D~John_U._Balis1)\",\n \"description\": \"back arrowBack to **the profile of John U. ## Gymnasium: A Standard Interface for Reinforcement Learning Environments. **TL;DR:** An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym). **Abstract:** Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium's main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential. Gymnasium is available online at \\\\url{https://github.com/Farama-Foundation/Gymnasium}.\"\n },\n {\n \"title\": \"Farama-Foundation/Gymnasium: An API standard for single ... 
- GitHub\",\n \"url\": \"https://github.com/farama-foundation/gymnasium\",\n \"description\": \"Gymnasium is an open source Python library for developing and comparing reinforcement learning algorithms by providing a standard API to communicate between\"\n },\n {\n \"title\": \"External Environments - Gymnasium Documentation\",\n \"url\": \"https://gymnasium.farama.org/environments/third_party_environments/\",\n \"description\": \"The Farama Foundation maintains a number of other projects, which use the Gymnasium API, environments include: gridworlds (Minigrid), robotics (Gymnasium-Robotics), 3D navigation (Miniworld), web interaction (MiniWoB++), arcade games (Arcade Learning Environment), Doom (ViZDoom), Meta-objective robotics (Metaworld), autonomous driving (HighwayEnv), Retro Games (stable-retro), and many more. * BlueSky-Gym: Reinforcement Learning Environments for Air Traffic Applications. AnyTrading is a collection of Gym environments for reinforcement learning-based trading algorithms with a great focus on simplicity, flexibility, and comprehensiveness. Craftium wraps the Minetest game engine into the Gymnasium API, providing a modern and easy-to-use platform for designing Minecraft-like RL environments. * gym-pybullet-drones: Environments for quadcopter control. * panda-gym: Robotics environments using the PyBullet physics engine. A simple environment for benchmarking single and multi-agent reinforcement learning algorithms on a clone of Slime Volleyball game. Robo-gym provides a collection of reinforcement learning environments involving robotic tasks applicable in both simulation and real-world robotics. Gym environments that let you control real robots in a laboratory via the internet.\"\n },\n {\n \"title\": \"Gymnasium: A Standard Interface for Reinforcement Learning Environments\",\n \"url\": \"https://arxiv.org/html/2407.17032v1\",\n \"description\": \"# Gymnasium: A Standard Interface for Reinforcement Learning Environments. 
Gymnasium is an open-source library providing an API for reinforcement learning environments. Gymnasium comes with various built-in environments and utilities to simplify researchers’ work along with being supported by most training libraries. This paper outlines the main design decisions for Gymnasium, its key features, and the differences to alternative APIs. With the publication of a Deep Q-Networks (DQN) (Mnih et al., 2013), Reinforcement Learning (RL) was awoken from its Artificial Intelligence (AI) winter, showing that a general neural network-based algorithm can achieve expert-level performance across a range of complex tasks. Gymnasium is a maintained fork of Gym, bringing many improvements and API updates to enable its continued usage for open-source RL research. In Sections 2 and 3, we outline the design decisions for the project and the environment API specification, respectively. The only restriction on the agent is that it must produce a valid action as specified by the environment’s action space.\"\n },\n {\n \"title\": \"Gym and Gymnasium Environments for Reinforcement Learning\",\n \"url\": \"https://www.linkedin.com/pulse/exploring-differences-gym-gymnasium-environments-learning-sabir-ali\",\n \"description\": \"One of the main differences between Gym and Gymnasium is the scope of their environments. Gym provides a wide range of environments for various\"\n },\n {\n \"title\": \"Which OpenAI Gym version is best/most used? 
- Reddit\",\n \"url\": \"https://www.reddit.com/r/reinforcementlearning/comments/18vtoyi/which_openai_gym_version_is_bestmost_used/\",\n \"description\": \"I was originally using the latest version (now called Gymnasium instead of Gym), but 99% of tutorials and code online use older versions of Gym.\"\n },\n {\n \"title\": \"Introduction Open Ai Gym - GeeksforGeeks\",\n \"url\": \"https://www.geeksforgeeks.org/deep-learning/introduction-open-ai-gym/\",\n \"description\": \"OpenAI Gym is a popular open source toolkit designed to develop and compare reinforcement learning algorithms. It provides a wide variety of standardized environments from simple games to complex simulations where agents can be trained to learn optimal behaviors through trial and error. By offering a consistent interface and benchmarks OpenAI Gym makes it easier for researchers and developers to build, test and share their reinforcement learning models efficiently. ## How to Install Open Ai Gym. ### Step 2: Install Open Ai Gym Library. ## Game Environment using Open Ai Gym. * Your code sets up this environment using OpenAI Gym, resets it to start then runs a loop where at each step the agent randomly chooses to push the cart left or right (action\\\\_space.sample()), applies that action (env.step(action)), gets back the new state (obs). 2. ****Game Playing and AI Training:****Developers use Gym to train AI agents to play video games or board games by learning from trial and error.\"\n }\n ]\n}"
}
],
"details": {
"results": [
{
"title": "Gym Migration Guide - Gymnasium Documentation",
"url": "https://gymnasium.farama.org/introduction/migration_guide/",
"description": "# Gym Migration Guide¶. **If you’re new to Gymnasium**: You can probably skip this page! This guide is for users migrating from older versions of OpenAI Gym. If you’re just starting with RL, head to Basic Usage instead. **If you’re migrating from OpenAI Gym**: This guide will help you update your code to work with Gymnasium. **If you’re updating old tutorials**: Many online RL tutorials use the old v0.21 API. Gymnasium is a fork of OpenAI Gym v0.26, which introduced breaking changes from Gym v0.21. | **Environment Reset** | `obs =env.reset()` | `obs, info =env.reset()` | Training loops |. | **Random Seeding** | `env.seed(42)` | `env.reset(seed=42)` | Reproducibility |. | **Step Function** | `obs,reward,done, info =env.step(action)` | `obs,reward,terminated,truncated, info =env.step(action)` | RL algorithms |. import gym # Environment creation and seeding env = gym. import gymnasium as gym# Note: 'gymnasium' not 'gym' # Environment creation with render mode specified upfront env = gym. make(\"LunarLander-v3\", render_mode = \"human\") # Reset with seed parameter observation, info = env."
},
{
"title": "OpenAI Gym and Gymnasium: Reinforcement Learning ...",
"url": "https://aignishant.medium.com/openai-gym-and-gymnasium-reinforcement-learning-environments-for-python-c889aed0e784",
"description": "Gymnasium, maintained by the Farama Foundation, emerged as the successor to OpenAI Gym. Key improvements include: More consistent API design"
},
{
"title": "Feature Request: migrate from gym to gymnasium #598 - GitHub",
"url": "https://github.com/duburcqa/jiminy/issues/598",
"description": "Gymnasium is the maintained version of openai gym and is compatible with current RL training libraries (rllib and tianshou have already migrated"
},
{
"title": "Tech Thursdays: A practical guide to Gymnasium (the ... - Medium",
"url": "https://medium.com/@gautsoni/tech-thursdays-a-practical-guide-to-gymnasium-the-modern-openai-gym-1b739aaa1a7a",
"description": "Today, the actively maintained successor is Gymnasium — a Farama Foundation ... Migration from OpenAI Gym (and compatibility). Coming from older"
},
{
"title": "Python Reinforcement Learning using Gymnasium – Full Course",
"url": "https://www.youtube.com/watch?v=vufTSJbzKGU",
"description": "... Gymnasium (previously called OpenAI Gym). Gymnasium ... Gymnasium is maintained by the Farama Foundation and is not associated with OpenAI."
},
{
"title": "A Gentle Introduction to OpenAI Gym | intro_to_gym - Wandb",
"url": "https://wandb.ai/mukilan/intro_to_gym/reports/A-Gentle-Introduction-to-OpenAI-Gym--VmlldzozMjg5MTA3",
"description": "In 2021, a non-profit organization called the Farama Foundation took over Gym. They introduced new features into Gym, renaming it Gymnasium."
},
{
"title": "A Standardized Interface for Reinforcement Learning Environments",
"url": "https://arxiv.org/html/2407.17032v4",
"description": "Gymnasium is the maintained successor to OpenAI Gym, which has become widely adopted, with over a million downloads in April 2025 * **https://"
},
{
"title": "[PDF] Autonomous and Adaptive Systems Introduction to Gym/Gymnasium",
"url": "https://www.mircomusolesi.org/courses/AAS22-23/AAS22-23_Gym.pdf",
"description": "algorithms. ‣Originally developed by Open AI, since October 2022 maintained by the Farama Foundation as the Gymnasium project."
},
{
"title": "Gymnasium: A Standard Interface for Reinforcement Learning...",
"url": "https://openreview.net/forum?id=qPMLvJxtPK&referrer=%5Bthe%20profile%20of%20John%20U.%20Balis%5D(%2Fprofile%3Fid%3D~John_U._Balis1)",
"description": "back arrowBack to **the profile of John U. ## Gymnasium: A Standard Interface for Reinforcement Learning Environments. **TL;DR:** An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym). **Abstract:** Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium's main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential. Gymnasium is available online at \\url{https://github.com/Farama-Foundation/Gymnasium}."
},
{
"title": "Farama-Foundation/Gymnasium: An API standard for single ... - GitHub",
"url": "https://github.com/farama-foundation/gymnasium",
"description": "Gymnasium is an open source Python library for developing and comparing reinforcement learning algorithms by providing a standard API to communicate between"
},
{
"title": "External Environments - Gymnasium Documentation",
"url": "https://gymnasium.farama.org/environments/third_party_environments/",
"description": "The Farama Foundation maintains a number of other projects, which use the Gymnasium API, environments include: gridworlds (Minigrid), robotics (Gymnasium-Robotics), 3D navigation (Miniworld), web interaction (MiniWoB++), arcade games (Arcade Learning Environment), Doom (ViZDoom), Meta-objective robotics (Metaworld), autonomous driving (HighwayEnv), Retro Games (stable-retro), and many more. * BlueSky-Gym: Reinforcement Learning Environments for Air Traffic Applications. AnyTrading is a collection of Gym environments for reinforcement learning-based trading algorithms with a great focus on simplicity, flexibility, and comprehensiveness. Craftium wraps the Minetest game engine into the Gymnasium API, providing a modern and easy-to-use platform for designing Minecraft-like RL environments. * gym-pybullet-drones: Environments for quadcopter control. * panda-gym: Robotics environments using the PyBullet physics engine. A simple environment for benchmarking single and multi-agent reinforcement learning algorithms on a clone of Slime Volleyball game. Robo-gym provides a collection of reinforcement learning environments involving robotic tasks applicable in both simulation and real-world robotics. Gym environments that let you control real robots in a laboratory via the internet."
},
{
"title": "Gymnasium: A Standard Interface for Reinforcement Learning Environments",
"url": "https://arxiv.org/html/2407.17032v1",
"description": "# Gymnasium: A Standard Interface for Reinforcement Learning Environments. Gymnasium is an open-source library providing an API for reinforcement learning environments. Gymnasium comes with various built-in environments and utilities to simplify researchers’ work along with being supported by most training libraries. This paper outlines the main design decisions for Gymnasium, its key features, and the differences to alternative APIs. With the publication of a Deep Q-Networks (DQN) (Mnih et al., 2013), Reinforcement Learning (RL) was awoken from its Artificial Intelligence (AI) winter, showing that a general neural network-based algorithm can achieve expert-level performance across a range of complex tasks. Gymnasium is a maintained fork of Gym, bringing many improvements and API updates to enable its continued usage for open-source RL research. In Sections 2 and 3, we outline the design decisions for the project and the environment API specification, respectively. The only restriction on the agent is that it must produce a valid action as specified by the environment’s action space."
},
{
"title": "Gym and Gymnasium Environments for Reinforcement Learning",
"url": "https://www.linkedin.com/pulse/exploring-differences-gym-gymnasium-environments-learning-sabir-ali",
"description": "One of the main differences between Gym and Gymnasium is the scope of their environments. Gym provides a wide range of environments for various"
},
{
"title": "Which OpenAI Gym version is best/most used? - Reddit",
"url": "https://www.reddit.com/r/reinforcementlearning/comments/18vtoyi/which_openai_gym_version_is_bestmost_used/",
"description": "I was originally using the latest version (now called Gymnasium instead of Gym), but 99% of tutorials and code online use older versions of Gym."
},
{
"title": "Introduction Open Ai Gym - GeeksforGeeks",
"url": "https://www.geeksforgeeks.org/deep-learning/introduction-open-ai-gym/",
"description": "OpenAI Gym is a popular open source toolkit designed to develop and compare reinforcement learning algorithms. It provides a wide variety of standardized environments from simple games to complex simulations where agents can be trained to learn optimal behaviors through trial and error. By offering a consistent interface and benchmarks OpenAI Gym makes it easier for researchers and developers to build, test and share their reinforcement learning models efficiently. ## How to Install Open Ai Gym. ### Step 2: Install Open Ai Gym Library. ## Game Environment using Open Ai Gym. * Your code sets up this environment using OpenAI Gym, resets it to start then runs a loop where at each step the agent randomly chooses to push the cart left or right (action\\_space.sample()), applies that action (env.step(action)), gets back the new state (obs). 2. ****Game Playing and AI Training:****Developers use Gym to train AI agents to play video games or board games by learning from trial and error."
}
]
}
}
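
The migration-guide result above lists three API changes between Gym v0.21 and Gymnasium: `reset()` now returns `(obs, info)`, seeding moves from `env.seed(42)` to `env.reset(seed=42)`, and `step()` returns `terminated` and `truncated` instead of a single `done`. The sketch below illustrates that mapping in plain Python, without importing Gymnasium. `OldStyleCounterEnv` and `NewStyleAdapter` are hypothetical names invented for this illustration and are not part of either library; the `"TimeLimit.truncated"` info key is assumed to follow the old Gym time-limit convention.

```python
import random


class OldStyleCounterEnv:
    """Hypothetical toy environment that follows the old Gym v0.21 conventions."""

    def __init__(self, goal=3, max_steps=5):
        self.goal = goal
        self.max_steps = max_steps
        self._rng = random.Random()  # seeded via seed(); unused by the toy dynamics
        self.position = 0
        self.steps = 0

    def seed(self, seed=None):
        # Old API: seeding was a separate method call.
        self._rng.seed(seed)

    def reset(self):
        # Old API: reset() returned only the observation.
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        # Old API: one `done` flag covered both reaching the goal and timing out.
        self.position += action
        self.steps += 1
        reached_goal = self.position >= self.goal
        timed_out = self.steps >= self.max_steps
        info = {}
        if timed_out and not reached_goal:
            # Assumed convention: a time-limit cutoff is flagged in `info`.
            info["TimeLimit.truncated"] = True
        reward = 1.0 if reached_goal else 0.0
        return self.position, reward, reached_goal or timed_out, info


class NewStyleAdapter:
    """Hypothetical adapter exposing the newer reset/step signatures."""

    def __init__(self, env):
        self.env = env

    def reset(self, seed=None):
        # New API: the seed is passed to reset(), which returns (obs, info).
        if seed is not None:
            self.env.seed(seed)
        return self.env.reset(), {}

    def step(self, action):
        # New API: split the old `done` into terminated and truncated.
        obs, reward, done, info = self.env.step(action)
        truncated = bool(info.pop("TimeLimit.truncated", False))
        terminated = done and not truncated
        return obs, reward, terminated, truncated, info


# Usage: the loop has the same shape as the CartPole example in the README.
env = NewStyleAdapter(OldStyleCounterEnv(goal=3, max_steps=5))
obs, info = env.reset(seed=42)
terminated = truncated = False
while not (terminated or truncated):
    obs, reward, terminated, truncated, info = env.step(1)
```

Keeping the two flags separate lets a training loop distinguish an episode that actually ended from one that was only cut off by a time limit, which is why the guide marks the step change as relevant to RL algorithms.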

▼thought

Good, I have a lot of information now. Let me get more details: the GitHub repo, the migration guide, the built-in environments, and the Farama Foundation context.

▼✓Fetch web page body, URL: "https://github.com/farama-foundation/gymnasium"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://github.com/farama-foundation/gymnasium\",\n  \"title\": \"GitHub - Farama-Foundation/Gymnasium: An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)\",\n  \"summary\": \"An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym) - Farama-Foundation/Gymnasium\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><div data-hpc=\\\"true\\\"><article itemprop=\\\"text\\\"><p dir=\\\"auto\\\"><a href=\\\"https://badge.fury.io/py/gymnasium\\\" rel=\\\"nofollow\\\"><img src=\\\"https://camo.githubusercontent.com/b391159f1cb2ce27383263b393800ee2835d088f451900c3f50c62f51c013dbc/68747470733a2f2f696d672e736869656c64732e696f2f707970692f707976657273696f6e732f67796d6e617369756d2e737667\\\" alt=\\\"Python\\\" data-canonical-src=\\\"https://img.shields.io/pypi/pyversions/gymnasium.svg\\\"></a>\\n<a href=\\\"https://badge.fury.io/py/gymnasium\\\" rel=\\\"nofollow\\\"><img src=\\\"https://camo.githubusercontent.com/0777e88232a2ef56fd5da4c0e5b4e9827021e9c1a8dc7b7b629a2b82dce48f5b/68747470733a2f2f62616467652e667572792e696f2f70792f67796d6e617369756d2e737667\\\" alt=\\\"PyPI\\\" data-canonical-src=\\\"https://badge.fury.io/py/gymnasium.svg\\\"></a>\\n<a href=\\\"https://arxiv.org/abs/2407.17032\\\" rel=\\\"nofollow\\\"><img src=\\\"https://camo.githubusercontent.com/b594949273dfef4572c0e54cce28590df41cbb44198329f769b1f814c29cb57f/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f61725869762d323430372e31373033322d6233316231622e737667\\\" alt=\\\"arXiv\\\" data-canonical-src=\\\"https://img.shields.io/badge/arXiv-2407.17032-b31b1b.svg\\\"></a>\\n<a href=\\\"https://pre-commit.com/\\\" rel=\\\"nofollow\\\"><img 
src=\\\"https://camo.githubusercontent.com/9b46a404a478a8f690c0f1e4450b6d22bafb14b5d0a237d9626fe889da71a710/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f7072652d2d636f6d6d69742d656e61626c65642d627269676874677265656e3f6c6f676f3d7072652d636f6d6d6974266c6f676f436f6c6f723d7768697465\\\" alt=\\\"pre-commit\\\" data-canonical-src=\\\"https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&amp;logoColor=white\\\"></a>\\n<a href=\\\"https://github.com/Farama-Foundation/Gymnasium/blob/main/LICENSE\\\"><img src=\\\"https://camo.githubusercontent.com/d0897855a06181ac533d42e8883efb28d79dbd3379df1e24c9f9c88f20bcf6e2/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f466172616d612d466f756e646174696f6e2f47796d6e617369756d\\\" alt=\\\"License\\\" data-canonical-src=\\\"https://img.shields.io/github/license/Farama-Foundation/Gymnasium\\\"></a>\\n<a href=\\\"https://github.com/psf/black\\\"><img src=\\\"https://camo.githubusercontent.com/7019b88be88468d6b83fcbf59d2c06bfa4992bafa7d129f9b89ab29f8c7c1acb/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f636f64652532307374796c652d626c61636b2d3030303030302e737667\\\" alt=\\\"Code style: black\\\" data-canonical-src=\\\"https://img.shields.io/badge/code%20style-black-000000.svg\\\"></a></p>\\n<p dir=\\\"auto\\\">\\n    <a href=\\\"https://gymnasium.farama.org/\\\" rel=\\\"nofollow\\\">\\n    <img src=\\\"https://raw.githubusercontent.com/Farama-Foundation/Gymnasium/main/gymnasium-text.png\\\" width=\\\"500px\\\">\\n</a>\\n</p>\\n<p dir=\\\"auto\\\">Gymnasium is an open source Python library for developing and comparing reinforcement learning algorithms by providing a standard API to communicate between learning algorithms and environments, as well as a standard set of environments compliant with that API. 
This is a fork of OpenAI's <a href=\\\"https://github.com/openai/gym\\\">Gym</a> library by its maintainers (OpenAI handed over maintenance a few years ago to an outside team), and is where future maintenance will occur going forward.</p>\\n<p dir=\\\"auto\\\">The documentation website is at <a href=\\\"https://gymnasium.farama.org/\\\" rel=\\\"nofollow\\\">gymnasium.farama.org</a>, and we have a public discord server (which we also use to coordinate development work) that you can join here: <a href=\\\"https://discord.gg/bnJ6kubTg6\\\" rel=\\\"nofollow\\\">https://discord.gg/bnJ6kubTg6</a></p>\\n\\n<p dir=\\\"auto\\\">Gymnasium includes the following families of environments along with a wide variety of third-party environments</p>\\n<ul dir=\\\"auto\\\">\\n<li><a href=\\\"https://gymnasium.farama.org/environments/classic_control/\\\" rel=\\\"nofollow\\\">Classic Control</a> - These are classic reinforcement learning based on real-world problems and physics.</li>\\n<li><a href=\\\"https://gymnasium.farama.org/environments/box2d/\\\" rel=\\\"nofollow\\\">Box2D</a> - These environments all involve toy games based around physics control, using box2d based physics and PyGame-based rendering</li>\\n<li><a href=\\\"https://gymnasium.farama.org/environments/toy_text/\\\" rel=\\\"nofollow\\\">Toy Text</a> - These environments are designed to be extremely simple, with small discrete state and action spaces, and hence easy to learn. 
As a result, they are suitable for debugging implementations of reinforcement learning algorithms.</li>\\n<li><a href=\\\"https://gymnasium.farama.org/environments/mujoco/\\\" rel=\\\"nofollow\\\">MuJoCo</a> - A physics engine based environments with multi-joint control which are more complex than the Box2D environments.</li>\\n<li><a href=\\\"https://ale.farama.org/\\\" rel=\\\"nofollow\\\">Atari</a> - Emulator of Atari 2600 ROMs simulated that have a high range of complexity for agents to learn.</li>\\n<li><a href=\\\"https://gymnasium.farama.org/environments/third_party_environments/\\\" rel=\\\"nofollow\\\">Third-party</a> - A number of environments have been created that are compatible with the Gymnasium API. Be aware of the version that the software was created for and use the <code>apply_env_compatibility</code> in <code>gymnasium.make</code> if necessary.</li>\\n</ul>\\n<p dir=\\\"auto\\\"><h2 tabindex=\\\"-1\\\" dir=\\\"auto\\\">Installation</h2><a id=\\\"user-content-installation\\\" aria-label=\\\"Permalink: Installation\\\" href=\\\"#installation\\\"></a></p>\\n<p dir=\\\"auto\\\">To install the base Gymnasium library, use <code>pip install gymnasium</code></p>\\n<p dir=\\\"auto\\\">This does not include dependencies for all families of environments (there's a massive number, and some can be problematic to install on certain systems). You can install these dependencies for one family like <code>pip install \\\"gymnasium[atari]\\\"</code> or use <code>pip install \\\"gymnasium[all]\\\"</code> to install all dependencies.</p>\\n<p dir=\\\"auto\\\">We support and test for Python 3.10, 3.11, 3.12 and 3.13 on Linux and macOS. 
We will accept PRs related to Windows, but do not officially support it.</p>\\n<p dir=\\\"auto\\\"><h2 tabindex=\\\"-1\\\" dir=\\\"auto\\\">API</h2><a id=\\\"user-content-api\\\" aria-label=\\\"Permalink: API\\\" href=\\\"#api\\\"></a></p>\\n<p dir=\\\"auto\\\">The Gymnasium API models environments as simple Python <code>env</code> classes. Creating environment instances and interacting with them is very simple- here's an example using the \\\"CartPole-v1\\\" environment:</p>\\n<div dir=\\\"auto\\\"><pre><span>import</span> <span>gymnasium</span> <span>as</span> <span>gym</span>\\n<span>env</span> <span>=</span> <span>gym</span>.<span>make</span>(<span>\\\"CartPole-v1\\\"</span>)\\n\\n<span>observation</span>, <span>info</span> <span>=</span> <span>env</span>.<span>reset</span>(<span>seed</span><span>=</span><span>42</span>)\\n<span>for</span> <span>_</span> <span>in</span> <span>range</span>(<span>1000</span>):\\n    <span>action</span> <span>=</span> <span>env</span>.<span>action_space</span>.<span>sample</span>()\\n    <span>observation</span>, <span>reward</span>, <span>terminated</span>, <span>truncated</span>, <span>info</span> <span>=</span> <span>env</span>.<span>step</span>(<span>action</span>)\\n\\n    <span>if</span> <span>terminated</span> <span>or</span> <span>truncated</span>:\\n        <span>observation</span>, <span>info</span> <span>=</span> <span>env</span>.<span>reset</span>()\\n<span>env</span>.<span>close</span>()</pre></div>\\n<p dir=\\\"auto\\\"><h2 tabindex=\\\"-1\\\" dir=\\\"auto\\\">Notable Related Libraries</h2><a id=\\\"user-content-notable-related-libraries\\\" aria-label=\\\"Permalink: Notable Related Libraries\\\" href=\\\"#notable-related-libraries\\\"></a></p>\\n<p dir=\\\"auto\\\">Please note that this is an incomplete list, and just includes libraries that the maintainers most commonly point newcomers to when asked for recommendations.</p>\\n<ul dir=\\\"auto\\\">\\n<li><a href=\\\"https://github.com/vwxyzjn/cleanrl\\\">CleanRL</a> 
is a learning library based on the Gymnasium API. It is designed to cater to newer people in the field and provides very good reference implementations.</li>\\n<li><a href=\\\"https://github.com/Farama-Foundation/PettingZoo\\\">PettingZoo</a> is a multi-agent version of Gymnasium with a number of implemented environments, for example, multi-agent Atari environments.</li>\\n<li>The Farama Foundation also has a collection of many other <a href=\\\"https://farama.org/projects\\\" rel=\\\"nofollow\\\">environments</a> that are maintained by the same team as Gymnasium and use the Gymnasium API.</li>\\n</ul>\\n<p dir=\\\"auto\\\"><h2 tabindex=\\\"-1\\\" dir=\\\"auto\\\">Environment Versioning</h2><a id=\\\"user-content-environment-versioning\\\" aria-label=\\\"Permalink: Environment Versioning\\\" href=\\\"#environment-versioning\\\"></a></p>\\n<p dir=\\\"auto\\\">Gymnasium keeps strict versioning for reproducibility reasons. All environments end in a suffix like \\\"-v0\\\".  When changes are made to environments that might impact learning results, the number is increased by one to prevent potential confusion. 
These were inherited from Gym.</p>\\n<p dir=\\\"auto\\\"><h2 tabindex=\\\"-1\\\" dir=\\\"auto\\\">Contributing</h2><a id=\\\"user-content-contributing\\\" aria-label=\\\"Permalink: Contributing\\\" href=\\\"#contributing\\\"></a></p>\\n<p dir=\\\"auto\\\">We welcome contributions from the community!\\nPlease see our <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/blob/main/CONTRIBUTING.md\\\">CONTRIBUTING.md</a> for details on how to get started.</p>\\n<p dir=\\\"auto\\\"><h2 tabindex=\\\"-1\\\" dir=\\\"auto\\\">Support Gymnasium's Development</h2><a id=\\\"user-content-support-gymnasiums-development\\\" aria-label=\\\"Permalink: Support Gymnasium's Development\\\" href=\\\"#support-gymnasiums-development\\\"></a></p>\\n<p dir=\\\"auto\\\">If you are financially able to do so and would like to support the development of Gymnasium, please join others in the community in <a href=\\\"https://github.com/sponsors/Farama-Foundation\\\">donating to us</a>.</p>\\n<p dir=\\\"auto\\\"><h2 tabindex=\\\"-1\\\" dir=\\\"auto\\\">Citation</h2><a id=\\\"user-content-citation\\\" aria-label=\\\"Permalink: Citation\\\" href=\\\"#citation\\\"></a></p>\\n<p dir=\\\"auto\\\">You can cite Gymnasium using our related paper (<a href=\\\"https://arxiv.org/abs/2407.17032\\\" rel=\\\"nofollow\\\">https://arxiv.org/abs/2407.17032</a>) as:</p>\\n<div><pre><code>@article{towers2024gymnasium,\\n  title={Gymnasium: A Standard Interface for Reinforcement Learning Environments},\\n  author={Towers, Mark and Kwiatkowski, Ariel and Terry, Jordan and Balis, John U and De Cola, Gianluca and Deleu, Tristan and Goul{\\\\~a}o, Manuel and Kallinteris, Andreas and Krimmel, Markus and KG, Arjun and others},\\n  journal={arXiv preprint arXiv:2407.17032},\\n  year={2024}\\n}\\n</code></pre></div>\\n</article></div></div>\"\n}"
    }
  ],
  "details": {
    "url": "https://github.com/farama-foundation/gymnasium",
    "title": "GitHub - Farama-Foundation/Gymnasium: An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)",
    "summary": "An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym) - Farama-Foundation/Gymnasium",
    "content": "<div id=\"readability-page-1\" class=\"page\"><div data-hpc=\"true\"><article itemprop=\"text\"><p dir=\"auto\"><a href=\"https://badge.fury.io/py/gymnasium\" rel=\"nofollow\"><img src=\"https://camo.githubusercontent.com/b391159f1cb2ce27383263b393800ee2835d088f451900c3f50c62f51c013dbc/68747470733a2f2f696d672e736869656c64732e696f2f707970692f707976657273696f6e732f67796d6e617369756d2e737667\" alt=\"Python\" data-canonical-src=\"https://img.shields.io/pypi/pyversions/gymnasium.svg\"></a>\n<a href=\"https://badge.fury.io/py/gymnasium\" rel=\"nofollow\"><img src=\"https://camo.githubusercontent.com/0777e88232a2ef56fd5da4c0e5b4e9827021e9c1a8dc7b7b629a2b82dce48f5b/68747470733a2f2f62616467652e667572792e696f2f70792f67796d6e617369756d2e737667\" alt=\"PyPI\" data-canonical-src=\"https://badge.fury.io/py/gymnasium.svg\"></a>\n<a href=\"https://arxiv.org/abs/2407.17032\" rel=\"nofollow\"><img src=\"https://camo.githubusercontent.com/b594949273dfef4572c0e54cce28590df41cbb44198329f769b1f814c29cb57f/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f61725869762d323430372e31373033322d6233316231622e737667\" alt=\"arXiv\" data-canonical-src=\"https://img.shields.io/badge/arXiv-2407.17032-b31b1b.svg\"></a>\n<a href=\"https://pre-commit.com/\" rel=\"nofollow\"><img src=\"https://camo.githubusercontent.com/9b46a404a478a8f690c0f1e4450b6d22bafb14b5d0a237d9626fe889da71a710/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f7072652d2d636f6d6d69742d656e61626c65642d627269676874677265656e3f6c6f676f3d7072652d636f6d6d6974266c6f676f436f6c6f723d7768697465\" alt=\"pre-commit\" data-canonical-src=\"https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&amp;logoColor=white\"></a>\n<a href=\"https://github.com/Farama-Foundation/Gymnasium/blob/main/LICENSE\"><img 
src=\"https://camo.githubusercontent.com/d0897855a06181ac533d42e8883efb28d79dbd3379df1e24c9f9c88f20bcf6e2/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f466172616d612d466f756e646174696f6e2f47796d6e617369756d\" alt=\"License\" data-canonical-src=\"https://img.shields.io/github/license/Farama-Foundation/Gymnasium\"></a>\n<a href=\"https://github.com/psf/black\"><img src=\"https://camo.githubusercontent.com/7019b88be88468d6b83fcbf59d2c06bfa4992bafa7d129f9b89ab29f8c7c1acb/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f636f64652532307374796c652d626c61636b2d3030303030302e737667\" alt=\"Code style: black\" data-canonical-src=\"https://img.shields.io/badge/code%20style-black-000000.svg\"></a></p>\n<p dir=\"auto\">\n    <a href=\"https://gymnasium.farama.org/\" rel=\"nofollow\">\n    <img src=\"https://raw.githubusercontent.com/Farama-Foundation/Gymnasium/main/gymnasium-text.png\" width=\"500px\">\n</a>\n</p>\n<p dir=\"auto\">Gymnasium is an open source Python library for developing and comparing reinforcement learning algorithms by providing a standard API to communicate between learning algorithms and environments, as well as a standard set of environments compliant with that API. 
This is a fork of OpenAI's <a href=\"https://github.com/openai/gym\">Gym</a> library by its maintainers (OpenAI handed over maintenance a few years ago to an outside team), and is where future maintenance will occur going forward.</p>\n<p dir=\"auto\">The documentation website is at <a href=\"https://gymnasium.farama.org/\" rel=\"nofollow\">gymnasium.farama.org</a>, and we have a public discord server (which we also use to coordinate development work) that you can join here: <a href=\"https://discord.gg/bnJ6kubTg6\" rel=\"nofollow\">https://discord.gg/bnJ6kubTg6</a></p>\n\n<p dir=\"auto\">Gymnasium includes the following families of environments along with a wide variety of third-party environments</p>\n<ul dir=\"auto\">\n<li><a href=\"https://gymnasium.farama.org/environments/classic_control/\" rel=\"nofollow\">Classic Control</a> - These are classic reinforcement learning based on real-world problems and physics.</li>\n<li><a href=\"https://gymnasium.farama.org/environments/box2d/\" rel=\"nofollow\">Box2D</a> - These environments all involve toy games based around physics control, using box2d based physics and PyGame-based rendering</li>\n<li><a href=\"https://gymnasium.farama.org/environments/toy_text/\" rel=\"nofollow\">Toy Text</a> - These environments are designed to be extremely simple, with small discrete state and action spaces, and hence easy to learn. 
As a result, they are suitable for debugging implementations of reinforcement learning algorithms.</li>\n<li><a href=\"https://gymnasium.farama.org/environments/mujoco/\" rel=\"nofollow\">MuJoCo</a> - A physics engine based environments with multi-joint control which are more complex than the Box2D environments.</li>\n<li><a href=\"https://ale.farama.org/\" rel=\"nofollow\">Atari</a> - Emulator of Atari 2600 ROMs simulated that have a high range of complexity for agents to learn.</li>\n<li><a href=\"https://gymnasium.farama.org/environments/third_party_environments/\" rel=\"nofollow\">Third-party</a> - A number of environments have been created that are compatible with the Gymnasium API. Be aware of the version that the software was created for and use the <code>apply_env_compatibility</code> in <code>gymnasium.make</code> if necessary.</li>\n</ul>\n<p dir=\"auto\"><h2 tabindex=\"-1\" dir=\"auto\">Installation</h2><a id=\"user-content-installation\" aria-label=\"Permalink: Installation\" href=\"#installation\"></a></p>\n<p dir=\"auto\">To install the base Gymnasium library, use <code>pip install gymnasium</code></p>\n<p dir=\"auto\">This does not include dependencies for all families of environments (there's a massive number, and some can be problematic to install on certain systems). You can install these dependencies for one family like <code>pip install \"gymnasium[atari]\"</code> or use <code>pip install \"gymnasium[all]\"</code> to install all dependencies.</p>\n<p dir=\"auto\">We support and test for Python 3.10, 3.11, 3.12 and 3.13 on Linux and macOS. We will accept PRs related to Windows, but do not officially support it.</p>\n<p dir=\"auto\"><h2 tabindex=\"-1\" dir=\"auto\">API</h2><a id=\"user-content-api\" aria-label=\"Permalink: API\" href=\"#api\"></a></p>\n<p dir=\"auto\">The Gymnasium API models environments as simple Python <code>env</code> classes. 
Creating environment instances and interacting with them is very simple- here's an example using the \"CartPole-v1\" environment:</p>\n<div dir=\"auto\"><pre><span>import</span> <span>gymnasium</span> <span>as</span> <span>gym</span>\n<span>env</span> <span>=</span> <span>gym</span>.<span>make</span>(<span>\"CartPole-v1\"</span>)\n\n<span>observation</span>, <span>info</span> <span>=</span> <span>env</span>.<span>reset</span>(<span>seed</span><span>=</span><span>42</span>)\n<span>for</span> <span>_</span> <span>in</span> <span>range</span>(<span>1000</span>):\n    <span>action</span> <span>=</span> <span>env</span>.<span>action_space</span>.<span>sample</span>()\n    <span>observation</span>, <span>reward</span>, <span>terminated</span>, <span>truncated</span>, <span>info</span> <span>=</span> <span>env</span>.<span>step</span>(<span>action</span>)\n\n    <span>if</span> <span>terminated</span> <span>or</span> <span>truncated</span>:\n        <span>observation</span>, <span>info</span> <span>=</span> <span>env</span>.<span>reset</span>()\n<span>env</span>.<span>close</span>()</pre></div>\n<p dir=\"auto\"><h2 tabindex=\"-1\" dir=\"auto\">Notable Related Libraries</h2><a id=\"user-content-notable-related-libraries\" aria-label=\"Permalink: Notable Related Libraries\" href=\"#notable-related-libraries\"></a></p>\n<p dir=\"auto\">Please note that this is an incomplete list, and just includes libraries that the maintainers most commonly point newcomers to when asked for recommendations.</p>\n<ul dir=\"auto\">\n<li><a href=\"https://github.com/vwxyzjn/cleanrl\">CleanRL</a> is a learning library based on the Gymnasium API. 
It is designed to cater to newer people in the field and provides very good reference implementations.</li>\n<li><a href=\"https://github.com/Farama-Foundation/PettingZoo\">PettingZoo</a> is a multi-agent version of Gymnasium with a number of implemented environments, for example, multi-agent Atari environments.</li>\n<li>The Farama Foundation also has a collection of many other <a href=\"https://farama.org/projects\" rel=\"nofollow\">environments</a> that are maintained by the same team as Gymnasium and use the Gymnasium API.</li>\n</ul>\n<p dir=\"auto\"><h2 tabindex=\"-1\" dir=\"auto\">Environment Versioning</h2><a id=\"user-content-environment-versioning\" aria-label=\"Permalink: Environment Versioning\" href=\"#environment-versioning\"></a></p>\n<p dir=\"auto\">Gymnasium keeps strict versioning for reproducibility reasons. All environments end in a suffix like \"-v0\".  When changes are made to environments that might impact learning results, the number is increased by one to prevent potential confusion. 
These were inherited from Gym.</p>\n<p dir=\"auto\"><h2 tabindex=\"-1\" dir=\"auto\">Contributing</h2><a id=\"user-content-contributing\" aria-label=\"Permalink: Contributing\" href=\"#contributing\"></a></p>\n<p dir=\"auto\">We welcome contributions from the community!\nPlease see our <a href=\"https://github.com/Farama-Foundation/Gymnasium/blob/main/CONTRIBUTING.md\">CONTRIBUTING.md</a> for details on how to get started.</p>\n<p dir=\"auto\"><h2 tabindex=\"-1\" dir=\"auto\">Support Gymnasium's Development</h2><a id=\"user-content-support-gymnasiums-development\" aria-label=\"Permalink: Support Gymnasium's Development\" href=\"#support-gymnasiums-development\"></a></p>\n<p dir=\"auto\">If you are financially able to do so and would like to support the development of Gymnasium, please join others in the community in <a href=\"https://github.com/sponsors/Farama-Foundation\">donating to us</a>.</p>\n<p dir=\"auto\"><h2 tabindex=\"-1\" dir=\"auto\">Citation</h2><a id=\"user-content-citation\" aria-label=\"Permalink: Citation\" href=\"#citation\"></a></p>\n<p dir=\"auto\">You can cite Gymnasium using our related paper (<a href=\"https://arxiv.org/abs/2407.17032\" rel=\"nofollow\">https://arxiv.org/abs/2407.17032</a>) as:</p>\n<div><pre><code>@article{towers2024gymnasium,\n  title={Gymnasium: A Standard Interface for Reinforcement Learning Environments},\n  author={Towers, Mark and Kwiatkowski, Ariel and Terry, Jordan and Balis, John U and De Cola, Gianluca and Deleu, Tristan and Goul{\\~a}o, Manuel and Kallinteris, Andreas and Krimmel, Markus and KG, Arjun and others},\n  journal={arXiv preprint arXiv:2407.17032},\n  year={2024}\n}\n</code></pre></div>\n</article></div></div>"
  }
}

▼✓Fetch web page body, URL: "https://gymnasium.farama.org/introduction/migration_guide/"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://gymnasium.farama.org/introduction/migration_guide/\",\n  \"title\": \"Gymnasium Documentation\",\n  \"summary\": \"A standard API for reinforcement learning and a diverse set of reference environments (formerly Gym)\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><article role=\\\"main\\\">\\n            \\n            <section id=\\\"gym-migration-guide\\\">\\n<h2>Gym Migration Guide<a href=\\\"#gym-migration-guide\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<section id=\\\"who-should-read-this-guide\\\">\\n<h2>Who Should Read This Guide?<a href=\\\"#who-should-read-this-guide\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p><strong>If you’re new to Gymnasium</strong>: You can probably skip this page! This guide is for users migrating from older versions of OpenAI Gym. If you’re just starting with RL, head to <a href=\\\"https://gymnasium.farama.org/introduction/basic_usage/\\\"><span>Basic Usage</span></a> instead.</p>\\n<p><strong>If you’re migrating from OpenAI Gym</strong>: This guide will help you update your code to work with Gymnasium. The changes are significant but straightforward once you understand the reasoning behind them.</p>\\n<p><strong>If you’re updating old tutorials</strong>: Many online RL tutorials use the old v0.21 API. This guide shows you how to modernize that code.</p>\\n</section>\\n<section id=\\\"why-did-the-api-change\\\">\\n<h2>Why Did the API Change?<a href=\\\"#why-did-the-api-change\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p>Gymnasium is a fork of <a href=\\\"https://github.com/openai/gym/releases/tag/0.26.2\\\">OpenAI Gym v0.26</a>, which introduced breaking changes from <a href=\\\"https://github.com/openai/gym/releases/tag/v0.21.0\\\">Gym v0.21</a>. 
These changes weren’t made lightly - they solved important problems that made RL research and development more difficult.</p>\\n<p>The main issues with the old API were:</p>\\n<ul>\\n<li><p><strong>Ambiguous episode endings</strong>: The single <code><span>done</span></code> flag couldn’t distinguish between “task completed” and “time limit reached”</p></li>\\n<li><p><strong>Inconsistent seeding</strong>: Random number generation was unreliable and hard to reproduce</p></li>\\n<li><p><strong>Rendering complexity</strong>: Switching between visual modes was unnecessarily complicated</p></li>\\n<li><p><strong>Reproducibility problems</strong>: Subtle bugs made it difficult to reproduce research results</p></li>\\n</ul>\\n<p>For environments that can’t be updated, see the compatibility guide section below.</p>\\n</section>\\n<section id=\\\"quick-reference-complete-changes-table\\\">\\n<h2>Quick Reference: Complete Changes Table<a href=\\\"#quick-reference-complete-changes-table\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<div>\\n<table>\\n<thead>\\n<tr><th><p><strong>Component</strong></p></th>\\n<th><p><strong>v0.21 (Old)</strong></p></th>\\n<th><p><strong>v0.26+ (New)</strong></p></th>\\n<th><p><strong>Impact</strong></p></th>\\n</tr>\\n</thead>\\n<tbody>\\n<tr><td><p><strong>Package Import</strong></p></td>\\n<td><p><code><span>import</span> <span>gym</span></code></p></td>\\n<td><p><code><span>import</span> <span>gymnasium</span> <span>as</span> <span>gym</span></code></p></td>\\n<td><p>All code</p></td>\\n</tr>\\n<tr><td><p><strong>Environment Reset</strong></p></td>\\n<td><p><code><span>obs</span> <span>=</span> <span>env.reset()</span></code></p></td>\\n<td><p><code><span>obs,</span> <span>info</span> <span>=</span> <span>env.reset()</span></code></p></td>\\n<td><p>Training loops</p></td>\\n</tr>\\n<tr><td><p><strong>Random 
Seeding</strong></p></td>\\n<td><p><code><span>env.seed(42)</span></code></p></td>\\n<td><p><code><span>env.reset(seed=42)</span></code></p></td>\\n<td><p>Reproducibility</p></td>\\n</tr>\\n<tr><td><p><strong>Step Function</strong></p></td>\\n<td><p><code><span>obs,</span> <span>reward,</span> <span>done,</span> <span>info</span> <span>=</span> <span>env.step(action)</span></code></p></td>\\n<td><p><code><span>obs,</span> <span>reward,</span> <span>terminated,</span> <span>truncated,</span> <span>info</span> <span>=</span> <span>env.step(action)</span></code></p></td>\\n<td><p>RL algorithms</p></td>\\n</tr>\\n<tr><td><p><strong>Episode Ending</strong></p></td>\\n<td><p><code><span>while</span> <span>not</span> <span>done:</span></code></p></td>\\n<td><p><code><span>while</span> <span>not</span> <span>(terminated</span> <span>or</span> <span>truncated):</span></code></p></td>\\n<td><p>Training loops</p></td>\\n</tr>\\n<tr><td><p><strong>Render Mode</strong></p></td>\\n<td><p><code><span>env.render(mode=\\\"human\\\")</span></code></p></td>\\n<td><p><code><span>gym.make(env_id,</span> <span>render_mode=\\\"human\\\")</span></code></p></td>\\n<td><p>Visualization</p></td>\\n</tr>\\n<tr><td><p><strong>Time Limit Detection</strong></p></td>\\n<td><p><code><span>info.get('TimeLimit.truncated')</span></code></p></td>\\n<td><p><code><span>truncated</span></code> return value</p></td>\\n<td><p>RL algorithms</p></td>\\n</tr>\\n<tr><td><p><strong>Value Bootstrapping</strong></p></td>\\n<td><p><code><span>target</span> <span>=</span> <span>reward</span> <span>+</span> <span>(1-done)</span> <span>*</span> <span>gamma</span> <span>*</span> <span>next_value</span></code></p></td>\\n<td><p><code><span>target</span> <span>=</span> <span>reward</span> <span>+</span> <span>(1-terminated)</span> <span>*</span> <span>gamma</span> <span>*</span> <span>next_value</span></code></p></td>\\n<td><p>RL correctness</p></td>\\n</tr>\\n</tbody>\\n</table>\\n</div>\\n</section>\\n<section 
id=\\\"side-by-side-code-comparison\\\">\\n<h2>Side-by-Side Code Comparison<a href=\\\"#side-by-side-code-comparison\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<section id=\\\"old-v0-21-code\\\">\\n<h3>Old v0.21 Code<a href=\\\"#old-v0-21-code\\\" title=\\\"Link to this heading\\\">¶</a></h3>\\n<div><pre><span></span><span>import</span><span> </span><span>gym</span>\\n\\n<span># Environment creation and seeding</span>\\n<span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"LunarLander-v3\\\"</span><span>,</span> <span>options</span><span>=</span><span>{})</span>\\n<span>env</span><span>.</span><span>seed</span><span>(</span><span>123</span><span>)</span>\\n<span>observation</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>()</span>\\n\\n<span># Training loop</span>\\n<span>done</span> <span>=</span> <span>False</span>\\n<span>while</span> <span>not</span> <span>done</span><span>:</span>\\n    <span>action</span> <span>=</span> <span>env</span><span>.</span><span>action_space</span><span>.</span><span>sample</span><span>()</span>\\n    <span>observation</span><span>,</span> <span>reward</span><span>,</span> <span>done</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>step</span><span>(</span><span>action</span><span>)</span>\\n    <span>env</span><span>.</span><span>render</span><span>(</span><span>mode</span><span>=</span><span>\\\"human\\\"</span><span>)</span>\\n\\n<span>env</span><span>.</span><span>close</span><span>()</span>\\n</pre></div>\\n</section>\\n<section id=\\\"new-v0-26-code-including-v1-0-0\\\">\\n<h3>New v0.26+ Code (Including v1.0.0)<a href=\\\"#new-v0-26-code-including-v1-0-0\\\" title=\\\"Link to this heading\\\">¶</a></h3>\\n<div><pre><span></span><span>import</span><span> </span><span>gymnasium</span><span> </span><span>as</span><span> </span><span>gym</span>  <span># Note: 'gymnasium' not 'gym'</span>\\n\\n<span># 
Environment creation with render mode specified upfront</span>\\n<span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"LunarLander-v3\\\"</span><span>,</span> <span>render_mode</span><span>=</span><span>\\\"human\\\"</span><span>)</span>\\n\\n<span># Reset with seed parameter</span>\\n<span>observation</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>(</span><span>seed</span><span>=</span><span>123</span><span>,</span> <span>options</span><span>=</span><span>{})</span>\\n\\n<span># Training loop with terminated/truncated distinction</span>\\n<span>done</span> <span>=</span> <span>False</span>\\n<span>while</span> <span>not</span> <span>done</span><span>:</span>\\n    <span>action</span> <span>=</span> <span>env</span><span>.</span><span>action_space</span><span>.</span><span>sample</span><span>()</span>\\n    <span>observation</span><span>,</span> <span>reward</span><span>,</span> <span>terminated</span><span>,</span> <span>truncated</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>step</span><span>(</span><span>action</span><span>)</span>\\n\\n    <span># Episode ends if either terminated OR truncated</span>\\n    <span>done</span> <span>=</span> <span>terminated</span> <span>or</span> <span>truncated</span>\\n\\n<span>env</span><span>.</span><span>close</span><span>()</span>\\n</pre></div>\\n</section>\\n</section>\\n<section id=\\\"key-changes-breakdown\\\">\\n<h2>Key Changes Breakdown<a href=\\\"#key-changes-breakdown\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<section id=\\\"package-name-change\\\">\\n<h3>1. 
Package Name Change<a href=\\\"#package-name-change\\\" title=\\\"Link to this heading\\\">¶</a></h3>\\n<p><strong>Old</strong>: <code><span>import</span> <span>gym</span></code>\\n<strong>New</strong>: <code><span>import</span> <span>gymnasium</span> <span>as</span> <span>gym</span></code></p>\\n<p>Why: Gymnasium is a separate project that maintains and improves upon the original Gym codebase.</p>\\n<div><pre><span></span><span># OLD</span>\\n<span>import</span><span> </span><span>gym</span>\\n\\n<span># NEW</span>\\n<span>import</span><span> </span><span>gymnasium</span><span> </span><span>as</span><span> </span><span>gym</span>\\n</pre></div>\\n</section>\\n<section id=\\\"seeding-and-random-number-generation\\\">\\n<h3>2. Seeding and Random Number Generation<a href=\\\"#seeding-and-random-number-generation\\\" title=\\\"Link to this heading\\\">¶</a></h3>\\n<p>The biggest conceptual change is how randomness is handled.</p>\\n<p><strong>Old v0.21</strong>: Separate <code><span>seed()</span></code> method</p>\\n<div><pre><span></span><span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>)</span>\\n<span>env</span><span>.</span><span>seed</span><span>(</span><span>42</span><span>)</span>  <span># Set random seed</span>\\n<span>obs</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>()</span>  <span># Reset environment</span>\\n</pre></div>\\n<p><strong>New v0.26+</strong>: Seed passed to <code><span>reset()</span></code></p>\\n<div><pre><span></span><span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>)</span>\\n<span>obs</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>(</span><span>seed</span><span>=</span><span>42</span><span>)</span>  <span># Seed and reset together</span>\\n</pre></div>\\n<p><strong>Why this changed</strong>: 
Some environments (especially emulated games) can only set their random state at the beginning of an episode, not mid-episode. The old approach could lead to inconsistent behavior.</p>\\n<p><strong>Practical impact</strong>:</p>\\n<div><pre><span></span><span># OLD: Seeding applied to all future episodes</span>\\n<span>env</span><span>.</span><span>seed</span><span>(</span><span>42</span><span>)</span>\\n<span>for</span> <span>episode</span> <span>in</span> <span>range</span><span>(</span><span>10</span><span>):</span>\\n    <span>obs</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>()</span>\\n\\n<span># NEW: Each episode can have its own seed</span>\\n<span>for</span> <span>episode</span> <span>in</span> <span>range</span><span>(</span><span>10</span><span>):</span>\\n    <span>obs</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>(</span><span>seed</span><span>=</span><span>42</span> <span>+</span> <span>episode</span><span>)</span>  <span># Each episode gets unique seed</span>\\n</pre></div>\\n</section>\\n<section id=\\\"environment-reset-changes\\\">\\n<h3>3. 
Environment Reset Changes<a href=\\\"#environment-reset-changes\\\" title=\\\"Link to this heading\\\">¶</a></h3>\\n<p><strong>Old v0.21</strong>: Returns only observation</p>\\n<div><pre><span></span><span>observation</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>()</span>\\n</pre></div>\\n<p><strong>New v0.26+</strong>: Returns observation AND info</p>\\n<div><pre><span></span><span>observation</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>()</span>\\n</pre></div>\\n<p><strong>Why this changed</strong>:</p>\\n<ul>\\n<li><p><code><span>info</span></code> provides consistent access to debugging information</p></li>\\n<li><p><code><span>seed</span></code> parameter enables reproducible episodes</p></li>\\n<li><p><code><span>options</span></code> parameter allows episode-specific configuration</p></li>\\n</ul>\\n<p><strong>Common migration pattern</strong>:</p>\\n<div><pre><span></span><span># If you don't need the new features, just unpack the tuple</span>\\n<span>obs</span><span>,</span> <span>_</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>()</span>  <span># Ignore info with underscore</span>\\n\\n<span># If you want to maintain the same random behavior as v0.21</span>\\n<span>env</span><span>.</span><span>reset</span><span>(</span><span>seed</span><span>=</span><span>42</span><span>)</span>  <span># Set seed once</span>\\n<span># Then for subsequent resets:</span>\\n<span>obs</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>()</span>  <span># Uses internal random state</span>\\n</pre></div>\\n</section>\\n<section id=\\\"step-function-the-done-terminated-truncated-split\\\">\\n<h3>4. 
Step Function: The <code><span>done</span></code> → <code><span>terminated</span></code>/<code><span>truncated</span></code> Split<a href=\\\"#step-function-the-done-terminated-truncated-split\\\" title=\\\"Link to this heading\\\">¶</a></h3>\\n<p>This is the most important change for training algorithms.</p>\\n<p><strong>Old v0.21</strong>: Single <code><span>done</span></code> flag</p>\\n<div><pre><span></span><span>obs</span><span>,</span> <span>reward</span><span>,</span> <span>done</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>step</span><span>(</span><span>action</span><span>)</span>\\n</pre></div>\\n<p><strong>New v0.26+</strong>: Separate <code><span>terminated</span></code> and <code><span>truncated</span></code> flags</p>\\n<div><pre><span></span><span>obs</span><span>,</span> <span>reward</span><span>,</span> <span>terminated</span><span>,</span> <span>truncated</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>step</span><span>(</span><span>action</span><span>)</span>\\n</pre></div>\\n<p><strong>Why this matters</strong>:</p>\\n<ul>\\n<li><p><strong><code><span>terminated</span></code></strong>: Episode ended because the task was completed or failed (agent reached goal, died, etc.)</p></li>\\n<li><p><strong><code><span>truncated</span></code></strong>: Episode ended due to external constraints (time limit, step limit, etc.)</p></li>\\n</ul>\\n<p>This distinction is crucial for value function bootstrapping in RL algorithms:</p>\\n<div><pre><span></span><span># OLD (ambiguous)</span>\\n<span>if</span> <span>done</span><span>:</span>\\n    <span># Should we bootstrap? 
We don't know if this was natural termination or time limit!</span>\\n    <span>next_value</span> <span>=</span> <span>0</span>  <span># Assumption that may be wrong</span>\\n\\n<span># NEW (clear)</span>\\n<span>if</span> <span>terminated</span><span>:</span>\\n    <span>next_value</span> <span>=</span> <span>0</span>      <span># Natural ending - no future value</span>\\n<span>elif</span> <span>truncated</span><span>:</span>\\n    <span>next_value</span> <span>=</span> <span>value_function</span><span>(</span><span>next_obs</span><span>)</span>  <span># Time limit - estimate future value</span>\\n</pre></div>\\n<p><strong>Migration strategy</strong>:</p>\\n<div><pre><span></span><span># Simple migration (works for many cases)</span>\\n<span>obs</span><span>,</span> <span>reward</span><span>,</span> <span>terminated</span><span>,</span> <span>truncated</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>step</span><span>(</span><span>action</span><span>)</span>\\n<span>done</span> <span>=</span> <span>terminated</span> <span>or</span> <span>truncated</span>\\n\\n<span># Better migration (preserves RL algorithm correctness)</span>\\n<span>obs</span><span>,</span> <span>reward</span><span>,</span> <span>terminated</span><span>,</span> <span>truncated</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>step</span><span>(</span><span>action</span><span>)</span>\\n<span>if</span> <span>terminated</span><span>:</span>\\n    <span># Episode naturally ended - use reward as-is</span>\\n    <span>target</span> <span>=</span> <span>reward</span>\\n<span>elif</span> <span>truncated</span><span>:</span>\\n    <span># Episode cut short - may need to estimate remaining value</span>\\n    <span>target</span> <span>=</span> <span>reward</span> <span>+</span> <span>discount</span> <span>*</span> <span>estimate_value</span><span>(</span><span>obs</span><span>)</span>\\n</pre></div>\\n<p>For more information, 
see our <a href=\\\"https://farama.org/Gymnasium-Terminated-Truncated-Step-API\\\">blog post</a> about it.</p>\\n</section>\\n<section id=\\\"render-mode-changes\\\">\\n<h3>5. Render Mode Changes<a href=\\\"#render-mode-changes\\\" title=\\\"Link to this heading\\\">¶</a></h3>\\n<p><strong>Old v0.21</strong>: Render mode specified each time</p>\\n<div><pre><span></span><span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>)</span>\\n<span>env</span><span>.</span><span>render</span><span>(</span><span>mode</span><span>=</span><span>\\\"human\\\"</span><span>)</span>     <span># Visual window</span>\\n<span>env</span><span>.</span><span>render</span><span>(</span><span>mode</span><span>=</span><span>\\\"rgb_array\\\"</span><span>)</span> <span># Get pixel array</span>\\n</pre></div>\\n<p><strong>New v0.26+</strong>: Render mode fixed at creation</p>\\n<div><pre><span></span><span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>,</span> <span>render_mode</span><span>=</span><span>\\\"human\\\"</span><span>)</span>     <span># For visual display</span>\\n<span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>,</span> <span>render_mode</span><span>=</span><span>\\\"rgb_array\\\"</span><span>)</span> <span># For recording</span>\\n<span>env</span><span>.</span><span>render</span><span>()</span>  <span># Uses the mode specified at creation</span>\\n</pre></div>\\n<p><strong>Why this changed</strong>: Some environments can’t switch render modes on-the-fly. 
Fixing the mode at creation enables better optimization and prevents bugs.</p>\\n<p><strong>Practical implications</strong>:</p>\\n<div><pre><span></span><span># OLD: Could switch modes dynamically</span>\\n<span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>)</span>\\n<span>for</span> <span>episode</span> <span>in</span> <span>range</span><span>(</span><span>10</span><span>):</span>\\n    <span># ... episode code ...</span>\\n    <span>if</span> <span>episode</span> <span>%</span> <span>10</span> <span>==</span> <span>0</span><span>:</span>\\n        <span>env</span><span>.</span><span>render</span><span>(</span><span>mode</span><span>=</span><span>\\\"human\\\"</span><span>)</span>  <span># Show every 10th episode</span>\\n\\n<span># NEW: Create separate environments for different purposes</span>\\n<span>training_env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>)</span>  <span># No rendering for speed</span>\\n<span>eval_env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>,</span> <span>render_mode</span><span>=</span><span>\\\"human\\\"</span><span>)</span>  <span># Visual for evaluation</span>\\n\\n<span># Or use None for no rendering, then create visual env when needed</span>\\n<span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>,</span> <span>render_mode</span><span>=</span><span>None</span><span>)</span>  <span># Fast training</span>\\n<span>if</span> <span>need_visualization</span><span>:</span>\\n    <span>visual_env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>,</span> 
<span>render_mode</span><span>=</span><span>\\\"human\\\"</span><span>)</span>\\n</pre></div>\\n</section>\\n</section>\\n<section id=\\\"timelimit-wrapper-changes\\\">\\n<h2>TimeLimit Wrapper Changes<a href=\\\"#timelimit-wrapper-changes\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p>The <a href=\\\"https://gymnasium.farama.org/api/wrappers/misc_wrappers/#gymnasium.wrappers.TimeLimit\\\" title=\\\"gymnasium.wrappers.TimeLimit\\\"><code><span>TimeLimit</span></code></a> wrapper behavior also changed to align with the new termination model.</p>\\n<p><strong>Old v0.21</strong>: Added <code><span>TimeLimit.truncated</span></code> to info dict</p>\\n<div><pre><span></span><span>obs</span><span>,</span> <span>reward</span><span>,</span> <span>done</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>step</span><span>(</span><span>action</span><span>)</span>\\n<span>if</span> <span>done</span> <span>and</span> <span>info</span><span>.</span><span>get</span><span>(</span><span>'TimeLimit.truncated'</span><span>,</span> <span>False</span><span>):</span>\\n    <span># Episode ended due to time limit</span>\\n    <span>pass</span>\\n</pre></div>\\n<p><strong>New v0.26+</strong>: Uses the <code><span>truncated</span></code> return value</p>\\n<div><pre><span></span><span>obs</span><span>,</span> <span>reward</span><span>,</span> <span>terminated</span><span>,</span> <span>truncated</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>step</span><span>(</span><span>action</span><span>)</span>\\n<span>if</span> <span>truncated</span><span>:</span>\\n    <span># Episode ended due to time limit (or other truncation)</span>\\n    <span>pass</span>\\n<span>if</span> <span>terminated</span><span>:</span>\\n    <span># Episode ended naturally (success/failure)</span>\\n    <span>pass</span>\\n</pre></div>\\n<p>This makes time limit detection much cleaner and more explicit.</p>\\n</section>\\n<section 
id=\\\"environment-specific-changes\\\">\\n<h2>Environment-Specific Changes<a href=\\\"#environment-specific-changes\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<section id=\\\"removed-environments\\\">\\n<h3>Removed Environments<a href=\\\"#removed-environments\\\" title=\\\"Link to this heading\\\">¶</a></h3>\\n<p>Some environments were moved or removed:</p>\\n<div><pre><span></span><span># OLD: Robotics environments in main gym</span>\\n<span>import</span><span> </span><span>gym</span>\\n<span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"FetchReach-v1\\\"</span><span>)</span>  <span># No longer available</span>\\n\\n<span># NEW: Moved to separate package</span>\\n<span>import</span><span> </span><span>gymnasium</span>\\n\\n<span>import</span><span> </span><span>gymnasium_robotics</span>\\n<span>import</span><span> </span><span>ale_py</span>\\n\\n<span>gymnasium</span><span>.</span><span>register_envs</span><span>((</span><span>gymnasium_robotics</span><span>,</span> <span>ale_py</span><span>))</span>\\n\\n<span>env</span> <span>=</span> <span>gymnasium</span><span>.</span><span>make</span><span>(</span><span>\\\"FetchReach-v1\\\"</span><span>)</span>\\n<span>env</span> <span>=</span> <span>gymnasium</span><span>.</span><span>make</span><span>(</span><span>\\\"ALE/Pong-v5\\\"</span><span>)</span>\\n</pre></div>\\n</section>\\n</section>\\n<section id=\\\"compatibility-helpers\\\">\\n<h2>Compatibility Helpers<a href=\\\"#compatibility-helpers\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<section id=\\\"loading-openai-gym-environments\\\">\\n<h3>Loading OpenAI Gym environments<a href=\\\"#loading-openai-gym-environments\\\" title=\\\"Link to this heading\\\">¶</a></h3>\\n<p>For environments that can’t be updated to Gymnasium, we provide compatibility wrappers for both v21 and v26 style environments, where either the environment name or the environment itself can be 
passed.</p>\\n<div><pre><span></span><span>import</span><span> </span><span>gymnasium</span>\\n<span>import</span><span> </span><span>shimmy</span>  <span># install shimmy with `pip install shimmy`</span>\\n\\n<span>gymnasium</span><span>.</span><span>register_envs</span><span>(</span><span>shimmy</span><span>)</span>\\n\\n\\n<span># Gym v0.21 style environments</span>\\n<span>env</span> <span>=</span> <span>gymnasium</span><span>.</span><span>make</span><span>(</span><span>\\\"GymV21Environment-v0\\\"</span><span>,</span> <span>env_id</span><span>=</span><span>\\\"CartPole-v1\\\"</span><span>)</span>\\n<span># or</span>\\n<span>env</span> <span>=</span> <span>gymnasium</span><span>.</span><span>make</span><span>(</span><span>\\\"GymV21Environment-v0\\\"</span><span>,</span> <span>env</span><span>=</span><span>OldV21Env</span><span>())</span>\\n\\n<span># Gym v0.26 style environments</span>\\n<span>env</span> <span>=</span> <span>gymnasium</span><span>.</span><span>make</span><span>(</span><span>\\\"GymV26Environment-v0\\\"</span><span>,</span> <span>env_id</span><span>=</span><span>\\\"CartPole-v1\\\"</span><span>)</span>\\n<span># or</span>\\n<span>env</span> <span>=</span> <span>gymnasium</span><span>.</span><span>make</span><span>(</span><span>\\\"GymV26Environment-v0\\\"</span><span>,</span> <span>env</span><span>=</span><span>OldV26Env</span><span>())</span>\\n</pre></div>\\n</section>\\n<section id=\\\"step-api-compatibility\\\">\\n<h3>Step API Compatibility<a href=\\\"#step-api-compatibility\\\" title=\\\"Link to this heading\\\">¶</a></h3>\\n<p>If environments implement the (old) done step API, Gymnasium provides functions (<a href=\\\"https://gymnasium.farama.org/api/utils/#gymnasium.utils.step_api_compatibility.convert_to_terminated_truncated_step_api\\\" title=\\\"gymnasium.utils.step_api_compatibility.convert_to_terminated_truncated_step_api\\\"><code><span>convert_to_terminated_truncated_step_api()</span></code></a> and <a 
href=\\\"https://gymnasium.farama.org/api/utils/#gymnasium.utils.step_api_compatibility.convert_to_done_step_api\\\" title=\\\"gymnasium.utils.step_api_compatibility.convert_to_done_step_api\\\"><code><span>convert_to_done_step_api()</span></code></a>) that will convert an environment with the old step API (using <code><span>done</span></code>) to the new step API (using <code><span>terminated</span></code> and <code><span>truncated</span></code>), and vice versa.</p>\\n</section>\\n</section>\\n<section id=\\\"testing-your-migration\\\">\\n<h2>Testing Your Migration<a href=\\\"#testing-your-migration\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p>After migrating, verify that:</p>\\n<ul>\\n<li><p>[ ] <strong>Import statements</strong> use <code><span>gymnasium</span></code> instead of <code><span>gym</span></code></p></li>\\n<li><p>[ ] <strong>Reset calls</strong> handle the <code><span>(obs,</span> <span>info)</span></code> return format</p></li>\\n<li><p>[ ] <strong>Step calls</strong> handle <code><span>terminated</span></code> and <code><span>truncated</span></code> separately</p></li>\\n<li><p>[ ] <strong>Render mode</strong> is specified during environment creation</p></li>\\n<li><p>[ ] <strong>Random seeding</strong> uses the <code><span>seed</span></code> parameter in <code><span>reset()</span></code></p></li>\\n<li><p>[ ] <strong>Training algorithms</strong> properly distinguish termination types</p></li>\\n</ul>\\n<p>Use <a href=\\\"https://gymnasium.farama.org/api/utils/#gymnasium.utils.env_checker.check_env\\\" title=\\\"gymnasium.utils.env_checker.check_env\\\"><code><span>check_env()</span></code></a> to verify your environment implementation.</p>\\n</section>\\n<section id=\\\"getting-help\\\">\\n<h2>Getting Help<a href=\\\"#getting-help\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p><strong>If you encounter issues during migration</strong>:</p>\\n<ol>\\n<li><p><strong>Check the compatibility guide</strong>: Some old environments can be 
used with compatibility wrappers</p></li>\\n<li><p><strong>Look at the environment documentation</strong>: Each environment may have specific migration notes</p></li>\\n<li><p><strong>Test with simple environments first</strong>: Start with CartPole before moving to complex environments</p></li>\\n<li><p><strong>Compare old vs new behavior</strong>: Run the same code with both APIs to understand differences</p></li>\\n</ol>\\n<p><strong>Common resources</strong>:</p>\\n<ul>\\n<li><p><a href=\\\"https://gymnasium.farama.org/api/env\\\">Gymnasium API documentation</a></p></li>\\n<li><p><a href=\\\"https://github.com/Farama-Foundation/Gymnasium/issues\\\">GitHub issues</a> for bug reports</p></li>\\n<li><p><a href=\\\"https://discord.gg/bnJ6kubTg6\\\">Discord community</a> for questions</p></li>\\n</ul>\\n</section>\\n</section>\\n\\n          </article></div>\"\n}"
    }
  ]
}
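
The migration guide fetched above says that only true termination, not time-limit truncation, should stop value bootstrapping. Below is a minimal sketch of that rule in plain Python with no Gymnasium dependency. The helper names `split_done` and `td_target` are hypothetical, not Gymnasium APIs. The sketch assumes the old v0.21 environment was wrapped in `TimeLimit`, so `info["TimeLimit.truncated"]` is set when the step limit ended the episode.

```python
from typing import Any, Dict, Tuple


def split_done(done: bool, info: Dict[str, Any]) -> Tuple[bool, bool]:
    """Split an old-style `done` flag into (terminated, truncated).

    Hypothetical helper: relies on the v0.21 TimeLimit convention of
    setting info["TimeLimit.truncated"] when the step limit was hit.
    """
    truncated = bool(done and info.get("TimeLimit.truncated", False))
    terminated = bool(done and not truncated)
    return terminated, truncated


def td_target(reward: float, terminated: bool, next_value: float,
              gamma: float = 0.99) -> float:
    """One-step TD target: bootstrap unless the episode truly terminated."""
    return reward + (1.0 - float(terminated)) * gamma * next_value


if __name__ == "__main__":
    # An old-style step that ended only because the time limit was reached.
    terminated, truncated = split_done(True, {"TimeLimit.truncated": True})
    # Truncation keeps the bootstrap term; termination would drop it.
    print(terminated, truncated, td_target(1.0, terminated, 5.0, gamma=0.9))
```

With the new five-value `step()` return, `split_done` is unnecessary: `terminated` can be passed to `td_target` directly, and `truncated` only decides when to call `reset()`.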

▼✓Web search results: 10, query: "Gymnasium built-in environments list classic control Box2D MuJoCo Atari Toy Text"

{
"details": {
"results": [
{
"title": "Complete List - Atari - Gym Documentation",
"url": "https://www.gymlibrary.dev/environments/atari/complete_list/",
"description": "Toy Text. Toggle navigation of Toy Text. Blackjack · Taxi · Cliff Walking · Frozen Lake · Classic Control. Toggle navigation of Classic Control. Acrobot · Cart"
},
{
"title": "https://zoo.cs.yale.edu/classes/cs470/materials/hw...",
"url": "https://zoo.cs.yale.edu/classes/cs470/materials/hws/aima/gym/docs/environments.md",
"description": "# Environments. This is a list of Gym environments, including those packaged with Gym, official OpenAI environments, and third party environment. The code for each environment group is housed in its own subdirectory. These are a variety of algorithmic tasks, such as learning to copy a. env = gym.make('Copy-v0'). The Atari environments are a variety of Atari video games. do the full install, you can install dependencies via `pip install -e. '.[atari]'` (you'll need `cmake` installed) and then get started as. env = gym.make('SpaceInvaders-v0'). Box2d is a 2D physics engine. You can install it via `pip install -e. These are a variety of classic control tasks, which would appear in a. install, you will need to run `pip install -e '.[classic\\_control]'` to. These environments also use MuJoCo. You'll have to also run `pip install -e '.[robotics]'` if. Learn more here: https://github.com/openai/procgen. Learn more here: https://github.com/openai/retro. Learn more here: https://github.com/openai/roboschool."
},
{
"title": "Reinforcement Learning - Colab - Google",
"url": "https://colab.research.google.com/github/henrywoo/MyML/blob/master/Copy_of_reinforcement_learning.ipynb",
"description": "Gym provides us with a collection of environments, such as: Classic control: control theory problems from the classic RL literature; Toy text: simple text"
},
{
"title": "Box2D - Gymnasium Documentation",
"url": "https://gymnasium.farama.org/environments/box2d/",
"description": "# Box2D¶. These environments all involve toy games based around physics control, using box2d based physics and PyGame-based rendering. These environments were contributed back in the early days of OpenAI Gym by Oleg Klimov, and have become popular toy benchmarks ever since. All environments are highly configurable via arguments specified in each environment’s documentation. The unique dependencies for this set of environments can be installed via:. SWIG is necessary for building the wheel for box2d-py, the Python package that provides bindings to box2d."
},
{
"title": "Install Gymnasium (OpenAI Gym) on Windows - YouTube",
"url": "https://www.youtube.com/watch?v=gMgj4pSHLww",
"description": "Create new Conda environment 01:44 pip install gymnasium[classic-control] gymnasium[toy-text] 02:16 pip install gymnasium[mujoco] 02:27 pip"
},
{
"title": "[PDF] Release 0.24.1 - Gym Documentation",
"url": "https://gym-docs.readthedocs.io/_/downloads/en/latest/pdf/",
"description": "This environment is part of the Classic Control environments. ... These are the unused toy-text environments present in Gym prior to Gym 0.20.0."
},
{
"title": "MuJoCo - Gymnasium Documentation",
"url": "https://gymnasium.farama.org/environments/mujoco/",
"description": "The unique dependencies including the MuJoCo simulator for this set of environments can be installed via:. MuJoCo `v3` environments and older, which relied on the `mujoco-py` framework, were migrated to the `gymnasium-robotics` package starting with `gymnasium` v1.2. The state spaces for MuJoCo environments in Gymnasium consist of two parts that are flattened and concatenated together: the position of the body part and joints (`mujoco.MjData.qpos`) and their corresponding velocity (`mujoco.MjData.qvel`) (more information in the MuJoCo Physics State Documentation). The Training performance of `v4` and `v5` is different because of the many changes in the environments, but the Half Cheetah and Swimmer exhibits identical behaviour, Pusher and Swimmer are close (for more information see GitHub Issue). All of the MuJoCo Environments besides the general Gymnasium arguments, and environment specific arguments they also take the following arguments for configuring the renderer:. The MuJoCo simulator renders images with OpenGL and can use 3 different back ends “glfw” (default), “egl”, “omesa”, which can be selected by setting an environment variable."
},
{
"title": "A Standardized Interface for Reinforcement Learning Environments",
"url": "https://arxiv.org/html/2407.17032v4",
"description": "In addition, Gymnasium provides a collection of built-in easy-to-use environments, tools for easily customizing environments, and tools to"
},
{
"title": "Gym/docs/environments.md at master · StanfordVL/Gym - GitHub",
"url": "https://github.com/StanfordVL/Gym/blob/master/docs/environments.md",
"description": "This is a list of Gym environments, including those packaged with Gym, official OpenAI environments, and third party environment. For information on creating"
},
{
"title": "YanglanWang/classic_control - GitHub",
"url": "https://github.com/YanglanWang/classic_control",
"description": "OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. This is the gym open-source library, which gives you access to a"
}
]
}
}

▼thought

Now let me get more details about the Farama Foundation, the ecosystem, and the latest version/developments.

▼✓Web search, results: 10, query: "Farama Foundation Gymnasium ecosystem PettingZoo Minigrid 2025 2026"

{
  "details": {
    "results": [
      {
        "title": "Projects | The Farama Foundation",
        "url": "https://farama.org/projects",
        "description": "PettingZoo and Gymnasium bindings for popular reinforcement learning environments outside of Farama ... Copyright © The Farama Foundation 2026. Contact Us."
      },
      {
        "title": "Farama-Foundation/Gymnasium: An API standard for single ... - GitHub",
        "url": "https://github.com/farama-foundation/gymnasium",
        "description": "PettingZoo is a multi-agent version of Gymnasium with a number of implemented environments, for example, multi-agent Atari environments. The Farama Foundation"
      },
      {
        "title": "GitHub - Farama-Foundation/PettingZoo: An API standard for multi ...",
        "url": "https://github.com/Farama-Foundation/PettingZoo",
        "description": "PettingZoo is a Python library for conducting research in multi-agent reinforcement learning, akin to a multi-agent version of Gymnasium."
      },
      {
        "title": "PettingZoo Documentation",
        "url": "https://pettingzoo.farama.org/index.html",
        "description": "PettingZoo is a simple, pythonic interface capable of representing general multi-agent reinforcement learning (MARL) problems."
      },
      {
        "title": "A Reference Architecture of Reinforcement Learning Frameworks",
        "url": "https://arxiv.org/html/2603.06413v1",
        "description": "(2025) Gymnasium: a standard interface for reinforcement learning environments. Note: GitHub repository: https://github.com/Farama-Foundation/"
      },
      {
        "title": "Installations | The Farama Foundation",
        "url": "https://farama.org/stats/installations",
        "description": "Installations ; PettingZoo5,163,800 ; Minari368,297 ; Gymnasium-Robotics529,442 ; Jumpy38,085 ; Metaworld114,491."
      },
      {
        "title": "Release Notes - PettingZoo Documentation",
        "url": "https://pettingzoo.farama.org/release_notes/index.html",
        "description": "This is a patch release for PettingZoo addressing dependency updates. * Updated Ray dependency in tutorials from 2.7.0 to 2.55.0 (#1329). This is a maintenance release for PettingZoo, adding support for `Python 3.13` and `Python 3.14`. This release also removes the deprecated MPE environments from PettingZoo. MPE environments are now maintained in the package MPE2. * MPE environments were removed from PettingZoo. * Waterworld was removed from PettingZoo. * Support for Python 3.13 and Python 3.14 was added. * Python 3.13 and Python 3.14 were added to CI workflows. * MPE documentation updated to reflect migration to `mpe2`. In this release of PettingZoo, we have introduced support for `python 3.12`, `gymnasium>=1.0.0`, and `agilerl>=2.0.0`. We have deprecated `python 3.8`, and added warnings to prepare the future transfer of the MPE environments to the package MPE2 in a future release. PettingZoo 1.25.0 coincides with a new release of Supersuit. * Update gymnasium to >=1.0.0 (#1272). * AgileRL tutorials updated for version 2.0.0 (#1258)."
      },
      {
        "title": "Projects updated to v1.0.0 · Issue #944 · Farama-Foundation ...",
        "url": "https://github.com/Farama-Foundation/Gymnasium/issues/944",
        "description": "This issue lists popular RL projects and their status for updating to Gymnasium v1.0.0. Please comment on progress or missing projects from"
      },
      {
        "title": "NeurIPS Poster Gymnasium: A Standard Interface for Reinforcement Learning Environments",
        "url": "https://neurips.cc/virtual/2025/poster/121446",
        "description": "# Gymnasium: A Standard Interface for Reinforcement Learning Environments. Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. This makes it difficult for researchers to compare and build upon each other's work, slowing down progress in the field.Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium's main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research.Through this unified framework, Gymnasium significantly streamlines the process of developing and testing RL algorithms, enabling researchers to focus more on innovation and less on implementation details. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential."
      },
      {
        "title": "A Standard Interface for Reinforcement Learning Environments",
        "url": "https://www.researchgate.net/publication/382526464_Gymnasium_A_Standard_Interface_for_Reinforcement_Learning_Environments",
        "description": "Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue."
      }
    ]
  }
}

▼✓Web search, results: 10, query: "Gymnasium v1.0 release changelog latest version 2025"

{
  "details": {
    "results": [
      {
        "title": "A Standardized Interface for Reinforcement Learning Environments",
        "url": "https://arxiv.org/html/2407.17032v4",
        "description": "Gymnasium is the maintained successor to OpenAI Gym, which has become widely adopted, with over a million downloads in April 2025 * **https://"
      },
      {
        "title": "[PDF] Gymnasium: A Standard Interface for Reinforcement Learning ...",
        "url": "https://arxiv.org/pdf/2407.17032",
        "description": "Gymnasium is the maintained successor to OpenAI Gym, which has become widely adopted, with over a million downloads in April 2025* and over 18"
      },
      {
        "title": "Gymnasium v1.0: A Comprehensive Update | The Farama Foundation",
        "url": "https://farama.org/Gymnasium-v1.0",
        "description": "This release marks a major milestone for the Gymnasium project, refining the core API, addressing bugs, and enhancing features. Over 200 pull"
      },
      {
        "title": "Gymnasium Release Notes",
        "url": "https://gymnasium.farama.org/gymnasium_release_notes/index.html",
        "description": "Gymnasium v1.3.0. This release brings a new Taxi environment version, a new RepeatAction wrapper, and a range of bug fixes across vector environments and"
      },
      {
        "title": "Releases · Farama-Foundation/Gymnasium - GitHub",
        "url": "https://github.com/Farama-Foundation/Gymnasium/releases",
        "description": "Gymnasium v1.2.1. Minor update that adds new wrappers, optimizes several environment functions and fixes several bugs. New Features. Added new wrappers to"
      },
      {
        "title": "gymnasium - PyPI",
        "url": "https://pypi.org/project/gymnasium/",
        "description": "Gymnasium is an open source Python library for developing and comparing reinforcement learning algorithms by providing a standard API."
      },
      {
        "title": "Release Notes - Gymnasium-Robotics Documentation",
        "url": "https://robotics.farama.org/release_notes/index.html",
        "description": "Release Notes. This minor release updates MaMuJoCo to follow the latest PettingZoo version 1.23.0 and some minor bug fixes in the Github PyPI publish workflow."
      },
      {
        "title": "[ICLR 2025] Robust Gymnasium: A Unified Modular Benchmark for ...",
        "url": "https://github.com/SafeRL-Lab/Robust-Gymnasium",
        "description": "This benchmark aims to advance robust reinforcement learning (RL) for real-world applications and domain adaptation."
      },
      {
        "title": "Projects updated to v1.0.0 · Issue #944 · Farama-Foundation ...",
        "url": "https://github.com/Farama-Foundation/Gymnasium/issues/944",
        "description": "This issue lists popular RL projects and their status for updating to Gymnasium v1.0.0. Please comment on progress or missing projects from"
      },
      {
        "title": "Gym Documentation",
        "url": "https://www.gymlibrary.dev/",
        "description": "You can read a migration guide for upgrading older version of Gym to Gymnasium here. Click the button to continue to the Gymnasium documentation website. Go to"
      }
    ]
  }
}

▼✓Fetch web page content, URL: "https://gymnasium.farama.org/environments/classic_control/"

{
  "details": {
    "url": "https://gymnasium.farama.org/environments/classic_control/",
    "title": "Gymnasium Documentation",
    "summary": "A standard API for reinforcement learning and a diverse set of reference environments (formerly Gym)",
    "content": "<div id=\"readability-page-1\" class=\"page\"><div class=\"page\">\n          <article role=\"main\">\n            \n            <section id=\"classic-control\">\n<h2>Classic Control<a href=\"#classic-control\" title=\"Link to this heading\">¶</a></h2>\n\n\n\n\n    <p>The unique dependencies for this set of environments can be installed via:</p>\n<div><pre><span></span>pip<span> </span>install<span> </span>gymnasium<span>[</span>classic-control<span>]</span>\n</pre></div>\n<p>There are five classic control environments: Acrobot, CartPole, Mountain Car, Continuous Mountain Car, and Pendulum. All of these environments are stochastic in terms of their initial state, within a given range. In addition, Acrobot has noise applied to the taken action. Also, regarding both mountain car environments, the cars are underpowered to climb the mountain, so it takes some effort to reach the top.</p>\n<p>Among Gymnasium environments, this set of environments can be considered easier ones to solve by a policy.</p>\n<p>All environments are highly configurable via arguments specified in each environment’s documentation.</p>\n</section>\n\n          </article>\n        </div></div>"
  }
}

▼thought

Let me get the v1.0 blog post and the release notes for more details.

▼✓Fetch web page content, URL: "https://farama.org/Gymnasium-v1.0"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://farama.org/Gymnasium-v1.0\",\n  \"title\": \"Gymnasium v1.0: A Comprehensive Update\",\n  \"summary\": \"After years of hard work, Gymnasium v1.0 has officially arrived! This release marks a major milestone for the Gymnasium project, refining the core API, addressing bugs, and enhancing features. Over 20…\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><div>\\n        <h2 id=\\\"gymnasium-v10-a-comprehensive-update\\\">Gymnasium v1.0: A Comprehensive Update</h2>\\n\\n<p>After years of hard work, Gymnasium v1.0 has officially arrived! This release marks a major milestone for the Gymnasium project, refining the core API, addressing bugs, and enhancing features. Over 200 pull requests have been merged since version 0.29.1, culminating in Gymnasium v1.0, a stable release focused on improving the API (<code>Env</code>, <code>Space</code>, and <code>VectorEnv</code>).</p>\\n\\n<p>For a more detailed summary, see our <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.0.0\\\">release notes</a>. We have also published a white paper on Gymnasium you can check out <a href=\\\"https://arxiv.org/abs/2407.17032\\\">here</a> and cite if using in academic work.</p>\\n\\n<p>Let’s dive into some of the key changes!</p>\\n\\n<h2 id=\\\"1-vector-environments-overhaul\\\">1. <strong>Vector Environments Overhaul</strong></h2>\\n<p>Vector environments have been significantly revamped. Previously, <code>VectorEnv</code> inherited from <code>Env</code>, which wasn’t technically valid, and caused various issues with method signatures. In v1.0, <code>Env</code> and <code>VectorEnv</code> are now distinct and fully supported, leading to clearer code and more efficient implementations.</p>\\n\\n<p>Moreover, Gymnasium introduces the <code>gymnasium.make_vec</code> function for easier vectorized environment creation. 
The new setup allows users to specify a vectorization mode (by default: “sync”, “async” or “vector_entry_point”) making it simpler to create and manage multiple environment instances simultaneously.</p>\\n\\n<div><pre><code><span>envs</span> <span>=</span> <span>gym</span><span>.</span><span>make_vec</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>,</span> <span>num_envs</span><span>=</span><span>3</span><span>,</span> <span>vectorization_mode</span><span>=</span><span>\\\"vector_entry_point\\\"</span><span>)</span>\\n</code></pre></div>\\n\\n<p>With this, we also support custom vectorization, defined for each environment separately, which allows for significantly more efficient implementations in certain cases.</p>\\n\\n<p>We also changed the auto-reset behavior of vectorized environments. In the past, when an episode terminated (or was truncated), it would immediately reset, with the final observation being passed in the <code>info</code> dict. This was less than ideal, as <code>info</code> was primarily intended for auxiliary information, not the main observations. Now, the observation returned alongside <code>terminated | truncated == True</code> is the final observation of the episode that just finished. The action following that will be ignored and treated as a reset signal, after which the new episode will begin. We will further elaborate on this in a future blog post.</p>\\n\\n<h2 id=\\\"2-improved-wrappers\\\">2. <strong>Improved Wrappers</strong></h2>\\n<p>With the separation of <code>Env</code> and <code>VectorEnv</code>, wrappers have also undergone changes. Wrappers that previously worked for both environments now have distinct variants for each. 
Standard environment wrappers are housed in <code>gymnasium.wrappers</code>, while vector-specific wrappers are found in <code>gymnasium.wrappers.vector</code>.</p>\\n\\n<p>Several wrappers have been renamed or removed for clarity, such as: <code>AutoResetWrapper</code> -&gt; <code>Autoreset</code> and <code>FrameStack</code> -&gt; <code>FrameStackObservation</code>.\\nNew wrappers have also been introduced, such as <code>DelayObservation</code> and <code>MaxAndSkipObservation</code>, to extend functionality.</p>\\n\\n<h3 id=\\\"3-functional-environments\\\">3. <strong>Functional Environments</strong></h3>\\n<p>To enable a different coding paradigm and allow for easier integration into planning algorithms, Gymnasium introduces <code>FuncEnv</code>, a functional version of <code>Env</code>. This new environment type exposes functions such as <code>reward</code>, <code>observation</code>, and <code>transition</code>, allowing for more flexible control over the environment’s behavior. Furthermore, functional environments implemented in JAX can often be jitted and vmapped, leading to massive performance improvements and easy hardware acceleration.</p>\\n\\n<h3 id=\\\"4-environment-version-changes\\\">4. <strong>Environment Version Changes</strong></h3>\\n<p>Several popular environments have seen updates:</p>\\n<ul>\\n  <li><strong>New MuJoCo v5</strong> have been added to support the latest MuJoCo versions along with new features and bug fixes.</li>\\n  <li><strong>Lunar Lander</strong> and <strong>CarRacing</strong> both received bug fixes, resulting in new environment versions.</li>\\n</ul>\\n\\n<h2 id=\\\"5-removing-the-plugin-system\\\">5. <strong>Removing the Plugin System</strong></h2>\\n<p>In version 1.0 we removed an undocumented plugin system that allowed for registering external environments behind the scenes. In previous versions, users could create environments like Atari or Minigrid without explicitly importing the relevant modules. 
Now, users will need to import these external libraries directly to register environments.</p>\\n\\n<p>Example:</p>\\n<div><pre><code><span>import</span> <span>gymnasium</span> <span>as</span> <span>gym</span>\\n<span>import</span> <span>ale_py</span>\\n\\n<span>gym</span><span>.</span><span>register_envs</span><span>(</span><span>ale_py</span><span>)</span>  <span># optional\\n</span><span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"ALE/Pong-v5\\\"</span><span>)</span>\\n</code></pre></div>\\n<p>This change increases security, transparency and ensures a clearer workflow.</p>\\n\\n<h2 id=\\\"moving-ale-out-of-gymnasium\\\">Moving ALE out of Gymnasium</h2>\\n\\n<p>Atari (Arcade Learning Environment / ALE) and Gymnasium (and Gym) have been interlinked over the course of their existence. With v1.0 we decided to properly split them into two separate projects, with a new dedicated <a href=\\\"https://ale.farama.org/\\\">ALE website</a>. Given the Atari environments’ iconic history in reinforcement learning, we wanted to give a bit of backstory on this decision.</p>\\n\\n<p>The interface reinforcement learning researchers use for Atari environments is based on the Arcade Learning Environment (ALE), which was created at the University of Alberta as a C++ library in 2013 by some of the first researchers at DeepMind, including Marc Bellemare. A few years after this release, Python bindings were added to a public version of the ALE by Ben Goodrich, a researcher at the University of Tennessee (https://github.com/bbitmaster/ale_python_interface). 
This library was then forked by OpenAI, renamed Atari-Py, and released on PyPI as a dependency to enable Atari environments in Gym/Gymnasium.</p>\\n\\n<p>This problem of having Gymnasium depend on a fork of a fork of the ALE meant that any changes (such as new environments) or bug fixes that occurred in the ALE were not available in Atari-Py, which included the benchmark for 56 Atari environments that DeepMind first used in their work showing human level performance in the environments. This was made worse by the fact that DeepMind had made their own major modifications and upgrades to the ALE in this time period (some were published publicly in <a href=\\\"https://github.com/google-deepmind/xitari\\\">Xitari</a>, some were not). This created well founded scientific reproducibility concerns from the original ALE authors, which led to agreements between the original ALE authors, DeepMind and OpenAI to upstream python bindings from Atari-Py and ALE. Then, Gym would depend directly on the ALE, and DeepMind would contribute all their code to ALE so that the public could use it in a centralized and reproducible way. With the release of Gymnasium 1.0, we’re finally completing this process of upstreaming all Atari environments from Gymnasium to ALE.</p>\\n\\n<p>As a result, Atari documentation has been moved to <a href=\\\"https://farama.org/ale.farama.org\\\">ale.farama.org</a> with the Gymnasium links redirecting users. We are maintaining <code>pip install \\\"gymnasium[atari]\\\"</code> (though removing <code>accept-rom-license</code> as it’s unnecessary now) for backward compatibility. Most importantly, the plugin system described previously means that users need to <code>import ale_py</code> in order to register environments.</p>\\n\\n<h2 id=\\\"looking-ahead\\\">Looking Ahead</h2>\\n<p>Gymnasium 1.0 brings a refined, clearer, and more efficient framework for creating and interacting with reinforcement learning environments. 
With a focus on long-term stability, we intend this to be the final breaking change for the foreseeable future. We hope that this release sets the foundation for future growth in the reinforcement learning community.</p>\\n\\n    </div></div>\"\n}"
    }
  ]
}
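
The auto-reset change described in the fetched post (the final observation is returned together with the terminated/truncated flag, and the following action is ignored and treated as a reset signal) can be illustrated with a small stdlib-only toy. This is a sketch under stated assumptions, not Gymnasium's implementation: `CountdownEnv` and `NextStepAutoresetVec` are hypothetical names, the observation is just a step counter, and the zero reward on the reset step is an assumption made for the illustration.

```python
class CountdownEnv:
    """Hypothetical toy environment: an episode ends after `horizon` steps."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the observation is simply the step counter

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.horizon
        return self.t, 1.0, terminated


class NextStepAutoresetVec:
    """Toy synchronous vector env that resets a sub-env on the step AFTER it ends."""

    def __init__(self, envs):
        self.envs = list(envs)
        self.needs_reset = [False] * len(self.envs)

    def reset(self):
        self.needs_reset = [False] * len(self.envs)
        return [env.reset() for env in self.envs]

    def step(self, actions):
        obs, rewards, dones = [], [], []
        for i, (env, action) in enumerate(zip(self.envs, actions)):
            if self.needs_reset[i]:
                # The action is ignored: this step only starts a new episode.
                obs.append(env.reset())
                rewards.append(0.0)  # assumption: no reward on the reset step
                dones.append(False)
                self.needs_reset[i] = False
            else:
                o, r, done = env.step(action)
                obs.append(o)  # on the last step this is the final observation
                rewards.append(r)
                dones.append(done)
                self.needs_reset[i] = done
        return obs, rewards, dones


vec = NextStepAutoresetVec([CountdownEnv(horizon=2), CountdownEnv(horizon=3)])
trace = [vec.reset()]
done_trace = []
for _ in range(4):
    obs, rewards, dones = vec.step([0, 0])
    trace.append(obs)
    done_trace.append(dones)

print(trace)
print(done_trace)
```

In this toy, the observation paired with a `True` flag is the last observation of the finished episode, and the first observation of the new episode only appears one step later. A training loop written against this behaviour should therefore skip the transition produced by the reset step rather than store it as ordinary experience.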

▼✓Fetch web page content, URL: "https://gymnasium.farama.org/gymnasium_release_notes/index.html"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://gymnasium.farama.org/gymnasium_release_notes/index.html\",\n  \"title\": \"Gymnasium Documentation\",\n  \"summary\": \"A standard API for reinforcement learning and a diverse set of reference environments (formerly Gym)\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><article role=\\\"main\\\">\\n            \\n            <section id=\\\"gymnasium-release-notes\\\">\\n<h2>Gymnasium Release Notes<a href=\\\"#gymnasium-release-notes\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<section id=\\\"release-v1-3-0\\\">\\n<h2>v1.3.0<a href=\\\"#release-v1-3-0\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p><em>Released on 2026-04-22 - <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.3.0\\\">GitHub</a> - <a href=\\\"https://pypi.org/project/gymnasium/v1.3.0/\\\">PyPI</a></em></p>\\n<h2>Gymnasium v1.3.0</h2>\\n<p>This release brings a new Taxi environment version, a new <code>RepeatAction</code> wrapper, and a range of bug fixes across vector environments and wrappers.</p>\\n<h2>Core Changes</h2>\\n<ul>\\n<li>The <code>Taxi</code> environment has been updated to <code>v4</code> to correct the <code>is_rainy</code> implementation, which previously did not behave as documented by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"4307782993\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1561\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1561/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1561\\\">#1561</a>)</li>\\n<li><code>pygame</code> has been 
replaced with <a href=\\\"https://pyga.me/\\\" rel=\\\"nofollow\\\"><code>pygame-ce</code></a>, unlocking Python 3.14 compatibility. The drop-in replacement preserves the existing rendering behaviour by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/mwydmuch/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/mwydmuch\\\">@mwydmuch</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3788673329\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1512\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1512/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1512\\\">#1512</a>)</li>\\n<li>Added <code>RepeatAction</code> wrapper that repeats a given action for a fixed number of steps, useful for frame-skipping and coarser control loops by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/Lidang-Jiang/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/Lidang-Jiang\\\">@Lidang-Jiang</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"4194777638\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1553\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1553/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1553\\\">#1553</a>)</li>\\n<li>Optimised <code>Box.__init__</code> to reduce overhead through lazy evaluation of variables by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" 
href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3958776450\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1529\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1529/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1529\\\">#1529</a>)</li>\\n<li>Fix vector <code>RecordVideo</code> wrapper to remove memory leaks across episodes by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/JonahFSD/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/JonahFSD\\\">@JonahFSD</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3914195171\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1527\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1527/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1527\\\">#1527</a>)</li>\\n<li>Fix Vector <code>NormalizeReward</code> wrapper to work identically as the non-vectorized version by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/JonahFSD/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/JonahFSD\\\">@JonahFSD</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3912418439\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1526\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1526/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1526\\\">#1526</a>)</li>\\n</ul>\\n<h2>Bug Fixes</h2>\\n<ul>\\n<li>Removed the 
<code>VectorEnv</code> destructor, which was causing <code>VectorEnv.close()</code> to be called unintentionally by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/TimSchneider42/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/TimSchneider42\\\">@TimSchneider42</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3832761339\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1522\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1522/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1522\\\">#1522</a>)</li>\\n<li>Fixed <code>NormalizeObservation</code> vectorized wrapper to override the <code>observation_space</code> and a <code>float32</code> cast by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/JonahFSD/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/JonahFSD\\\">@JonahFSD</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3914296741\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1528\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1528/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1528\\\">#1528</a>)</li>\\n<li>Fixed <code>RenderFrame</code>'s typing by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/jorenham/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/jorenham\\\">@jorenham</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"4305182899\\\" data-permission-text=\\\"Title is private\\\" 
data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1560\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1560/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1560\\\">#1560</a>)</li>\\n<li>Fixed A2C tutorial's entropy computation when a scalar (rather than tensor) is passed to the loss function by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/epoint95/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/epoint95\\\">@epoint95</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3868402075\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1524\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1524/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1524\\\">#1524</a>)</li>\\n</ul>\\n<h2>Third-Party Environments</h2>\\n<p>10 new community environments have been added to the <a href=\\\"https://gymnasium.farama.org/environments/third_party_environments/\\\" rel=\\\"nofollow\\\">third-party environments list</a>, including a new Cybersecurity environments section.</p>\\n<p><strong>Full Changelog:</strong> <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.2.3...v1.3.0\\\"><tt>v1.2.3...v1.3.0</tt></a></p></section>\\n<section id=\\\"release-v1-2-3\\\">\\n<h2>v1.2.3<a href=\\\"#release-v1-2-3\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p><em>Released on 2025-12-18 - <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.2.3\\\">GitHub</a> - <a href=\\\"https://pypi.org/project/gymnasium/v1.2.3/\\\">PyPI</a></em></p>\\n<h2>Gymnasium V1.2.3</h2>\\n<p>This is a minor release with the most significant being changing the dependency  for \\\"gymnasium[box2d]\\\" from <a 
href=\\\"https://pypi.org/project/box2d-py\\\" rel=\\\"nofollow\\\">box2d-py</a> to <a href=\\\"https://pypi.org/project/Box2D/\\\" rel=\\\"nofollow\\\">box2d</a>. See <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3580605032\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1474\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1474/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1474\\\">#1474</a> for more detail.</p>\\n<p>Additionally, we've fixed several typos and added a couple of third-party projects. Lastly, we fixed the create an environment tutorial to use NumPy [row, col] in <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3704120436\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1490\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1490/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1490\\\">#1490</a></p>\\n<p><strong>Full Changelog</strong>: <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.2.2...v1.2.3\\\"><tt>v1.2.2...v1.2.3</tt></a></p></section>\\n<section id=\\\"release-v1-2-2\\\">\\n<h2>v1.2.2: V1.2.2<a href=\\\"#release-v1-2-2\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p><em>Released on 2025-11-04 - <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.2.2\\\">GitHub</a> - <a href=\\\"https://pypi.org/project/gymnasium/v1.2.2/\\\">PyPI</a></em></p>\\n<h2>Gymnasium v1.2.2</h2>\\n<p>This is a minor update just to add <code>Discrete.dtype</code> and improve <code>DictInfoToList</code></p>\\n<p>These additional changes were made</p>\\n<ul>\\n<li>Add <code>Discrete.dtype</code> parameter by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/VadimBim/hovercard\\\" 
data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/VadimBim\\\">@VadimBim</a> in <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3529955390\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1467\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1467/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1467\\\">#1467</a></li>\\n<li>Improve <code>MultiDiscrete.dtype</code> by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> in <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3542266766\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1469\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1469/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1469\\\">#1469</a></li>\\n<li>Improve <code>DictInfoToList</code> to support vector info with missing binary key by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> in <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3526222616\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1466\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1466/hovercard\\\" 
href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1466\\\">#1466</a></li>\\n</ul>\\n<p>Thanks to Wispr for their support of the project</p>\\n<p><strong>Full Changelog</strong>: <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.2.1...v1.2.2\\\"><tt>v1.2.1...v1.2.2</tt></a></p></section>\\n<section id=\\\"release-v1-2-1\\\">\\n<h2>v1.2.1<a href=\\\"#release-v1-2-1\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p><em>Released on 2025-09-23 - <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.2.1\\\">GitHub</a> - <a href=\\\"https://pypi.org/project/gymnasium/v1.2.1/\\\">PyPI</a></em></p>\\n<h2>Gymnasium v1.2.1</h2>\\n<p>Minor update that adds new wrappers, optimizes several environment functions and fixes several bugs.</p>\\n<h2>New Features</h2>\\n<ul>\\n<li>Added new wrappers to discretize observations and actions (<code>gymnasium.wrappers.DiscretizeObservation</code> and <code>gymnasium.wrappers.DiscretizeAction</code>) by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/sparisi/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/sparisi\\\">@sparisi</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3185358242\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1411\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1411/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1411\\\">#1411</a>)</li>\\n<li>Add vector-based <code>RecordVideo</code> wrapper that allows recording all sub-environments at the same time by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/sparisi/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/sparisi\\\">@sparisi</a> (<a 
data-error-text=\\\"Failed to load title\\\" data-id=\\\"3241889899\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1418\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1418/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1418\\\">#1418</a>)</li>\\n<li>Add new customization to Frozen Lake for reward schedule and success rate by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/Tayzn/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/Tayzn\\\">@Tayzn</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3021051457\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1363\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1363/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1363\\\">#1363</a>)</li>\\n<li>Add tutorial on action masking by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/dantp-ai/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/dantp-ai\\\">@dantp-ai</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2958306932\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1345\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1345/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1345\\\">#1345</a>)</li>\\n<li>Optimize several environment functions for Blackjack and Mujoco-Humanoid by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/aseembits93/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" 
data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/aseembits93\\\">@aseembits93</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3298033175\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1426\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1426/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1426\\\">#1426</a>, <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3320254150\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1429\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1429/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1429\\\">#1429</a>, <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3320248227\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1428\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1428/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1428\\\">#1428</a> and <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3356742105\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1436\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1436/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1436\\\">#1436</a>)</li>\\n</ul>\\n<h2>Fixes</h2>\\n<ul>\\n<li>Fixes memory leak in recording by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pascal-roth/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" 
href=\\\"https://github.com/pascal-roth\\\">@pascal-roth</a> in <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3394001149\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1444\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1444/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1444\\\">#1444</a></li>\\n<li>Fix \\\"pygame is not responding\\\" when rendering the pygame window in the Taxi environment by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/RizkySugiharto/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/RizkySugiharto\\\">@RizkySugiharto</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3182695017\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1410\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1410/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1410\\\">#1410</a>)</li>\\n<li>Fix the vector wrapper warnings by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3205200502\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1415\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1415/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1415\\\">#1415</a>)</li>\\n<li>Fix <code>AsyncVectorEnv.step</code> from hanging forever by <a 
data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/matinmoezzi/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/matinmoezzi\\\">@matinmoezzi</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3243762778\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1419\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1419/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1419\\\">#1419</a>)</li>\\n<li>Clear markers after rendering for MuJoCo OffscreenViewers by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/amacati/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/amacati\\\">@amacati</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3248714510\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1420\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1420/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1420\\\">#1420</a>)</li>\\n<li>Fix array-conversion for torch cuda tensors by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/amacati/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/amacati\\\">@amacati</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3363046088\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1439\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1439/hovercard\\\" 
href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1439\\\">#1439</a>)</li>\\n<li>Add missing dependency on packaging to array-api feature by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/amacati/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/amacati\\\">@amacati</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3361105679\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1437\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1437/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1437\\\">#1437</a>)</li>\\n</ul>\\n<p>Thank you to all the contributors</p>\\n<p><strong>Full Changelog</strong>: <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.2.0...v1.2.1\\\"><tt>v1.2.0...v1.2.1</tt></a></p></section>\\n<section id=\\\"release-v1-2-0\\\">\\n<h2>v1.2.0<a href=\\\"#release-v1-2-0\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p><em>Released on 2025-06-27 - <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.2.0\\\">GitHub</a> - <a href=\\\"https://pypi.org/project/gymnasium/v1.2.0/\\\">PyPI</a></em></p>\\n<h2>Gymnasium v1.2.0</h2>\\n<p>In Gym <a href=\\\"https://github.com/openai/gym/releases/tag/0.24.0\\\">v0.24</a>, v4 MuJoCo environments were added that used a different simulator (mujoco, not mujoco-py).<br>\\nAs 3 years have passed since v0.24 and mujoco-py is unmaintained, continuing to support it limits our ability to support Python 3.13.<br>\\nTherefore, in this release, we have moved the MuJoCo v2 and v3 environments to the <a href=\\\"https://github.com/Farama-Foundation/gymnasium-robotics\\\">Gymnasium-Robotics</a> project, meaning that users who cannot upgrade to the v4 or v5 MuJoCo environments should update their code to:</p>\\n<div 
data-snippet-clipboard-copy-content=\\\"import gymnasium as gym\\nimport gymnasium_robotics  # `pip install &quot;gymnasium-robotics[mujoco-py]&quot;`\\n\\ngym.register_envs(gymnasium_robotics)  # optional\\n\\nenv = gym.make(&quot;Humanoid-v3&quot;)\\\"><pre><span>import</span> <span>gymnasium</span> <span>as</span> <span>gym</span>\\n<span>import</span> <span>gymnasium_robotics</span>  <span># `pip install \\\"gymnasium-robotics[mujoco-py]\\\"`</span>\\n\\n<span>gym</span>.<span>register_envs</span>(<span>gymnasium_robotics</span>)  <span># optional</span>\\n\\n<span>env</span> <span>=</span> <span>gym</span>.<span>make</span>(<span>\\\"Humanoid-v3\\\"</span>)</pre></div>\\n<p>In addition, we have added support for Python 3.13 (and dropped Python 3.8 and 3.9 following NumPy and other projects).</p>\\n<h2>New Features</h2>\\n<ul>\\n<li>Add <code>AddWhiteNoise</code> and <code>ObstructView</code> wrappers that add noise to RGB renderings either across the whole image or sections by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/sparisi/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/sparisi\\\">@sparisi</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2645643074\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1243\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1243/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1243\\\">#1243</a>)</li>\\n<li>Adds rainy and fickle options for Taxi environment that were implemented in the original paper by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/foreverska/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/foreverska\\\">@foreverska</a> (<a 
data-error-text=\\\"Failed to load title\\\" data-id=\\\"2867689892\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1315\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1315/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1315\\\">#1315</a>)</li>\\n<li>Add <code>wrappers.ArrayConversion</code>, a generic conversion wrapper between Array API compatible frameworks (like NumPy, Torch, Jax, etc) by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/amacati/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/amacati\\\">@amacati</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2930111508\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1333\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1333/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1333\\\">#1333</a>)</li>\\n<li>Add garbage collector on RecordVideo wrapper to prevent out of memory problems by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/vicbentu/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/vicbentu\\\">@vicbentu</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3054955909\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1378\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1378/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1378\\\">#1378</a>)</li>\\n</ul>\\n<h2>Bug Fixes</h2>\\n<ul>\\n<li>Fix MuJoCo add_markers for mujoco&gt;=3.2 by <a 
data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/amacati/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/amacati\\\">@amacati</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2900883178\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1329\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1329/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1329\\\">#1329</a>)</li>\\n<li>Update CliffWalking version in envs registry by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/abouelsaadat/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/abouelsaadat\\\">@abouelsaadat</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2936052417\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1334\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1334/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1334\\\">#1334</a>)</li>\\n<li>Fix jax 0.6.0 regression in jax_to_numpy by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/GaetanLepage/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/GaetanLepage\\\">@GaetanLepage</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3004879973\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1361\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1361/hovercard\\\" 
href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1361\\\">#1361</a>)</li>\\n<li>Fix Atari Preprocessing to only check env has NOOP action when <code>noop_max&gt;0</code> by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3098488051\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1393\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1393/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1393\\\">#1393</a>)</li>\\n<li>Fix all warnings in tests/spaces by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/amacati/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/amacati\\\">@amacati</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3126115146\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1396\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1396/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1396\\\">#1396</a>)</li>\\n<li>Don't override SDL's audio driver on Linux by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/AnErrupTion/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/AnErrupTion\\\">@AnErrupTion</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3079104537\\\" data-permission-text=\\\"Title is private\\\" 
data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1387\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1387/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1387\\\">#1387</a>)</li>\\n</ul>\\n<p>In addition, this release includes numerous updates to the documentation, most importantly to the introductory pages, with an aim to make them easier for new users of Gymnasium or RL to understand.</p>\\n<p><strong>Full Changelog</strong>: <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.1.1...v1.2.0\\\"><tt>v1.1.1...v1.2.0</tt></a></p></section>\\n<section id=\\\"release-v1-1-1\\\">\\n<h2>Gymnasium v1.1.1<a href=\\\"#release-v1-1-1\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p><em>Released on 2025-03-06 - <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.1.1\\\">GitHub</a> - <a href=\\\"https://pypi.org/project/gymnasium/v1.1.1/\\\">PyPI</a></em></p>\\n<h2>Changes</h2>\\n<ul>\\n<li>Remove assert on metadata render modes for MuJoCo-based environments in <code>mujoco_env.py</code> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2900123613\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1328\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1328/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1328\\\">#1328</a>)</li>\\n<li>Update <code>wrappers.vector.NumpyToTorch</code> to refer to <code>numpy</code> instead of <code>jax</code>  by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pkuderov/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pkuderov\\\">@pkuderov</a> in <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2883844632\\\" data-permission-text=\\\"Title 
is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1319\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1319/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1319\\\">#1319</a></li>\\n</ul>\\n<p><strong>Full Changelog</strong>: <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.1.0...v1.1.1\\\"><tt>v1.1.0...v1.1.1</tt></a></p></section>\\n<section id=\\\"release-v1-1-0\\\">\\n<h2>v1.1.0<a href=\\\"#release-v1-1-0\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p><em>Released on 2025-02-26 - <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.1.0\\\">GitHub</a> - <a href=\\\"https://pypi.org/project/gymnasium/v1.1.0/\\\">PyPI</a></em></p>\\n<h2>Gymnasium v1.1</h2>\\n<p>In this release, we fix several bugs in Gymnasium v1.0 and add new features that refine the changes it introduced.</p>\\n<p>With the release of Gymnasium v1.0, one of the major changes we made was to the vector environment implementation, improving how users interface with it and extend it. Some users noted that they required backward compatibility with how vector environments automatically reset sub-environments when they terminated or truncated, referred to as the autoreset mode or API. As a result, in v1.1, we have added support to the implemented vector environments (<code>SyncVectorEnv</code> and <code>AsyncVectorEnv</code>) and wrappers for all three possible modes: next-step, same-step and disabled. The autoreset mode in use should be specified in the vector environment metadata, <code>metadata[\\\"autoreset_mode\\\"]</code>, as a <code>gymnasium.vector.AutoresetMode</code> enum. 
For more information on the differences between the autoreset modes and how to use them, read <a href=\\\"https://farama.org/Vector-Autoreset-Mode\\\" rel=\\\"nofollow\\\">https://farama.org/Vector-Autoreset-Mode</a>.</p>\\n<p>In addition, we've added several helpful features:</p>\\n<ul>\\n<li>In Gym v0.25, we added masking for spaces, <code>space.sample(mask=...)</code>, to logically mask out possible samples from spaces. <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2862826251\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1310\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1310/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1310\\\">#1310</a> adds probability masking for each space with <code>space.sample(probability=...)</code> to specify the probability of each sample, which is helpful for RL policies that output a probability distribution of actions.</li>\\n<li>For MuJoCo environments, users have the option to render the robot with RGB images or depth-based images. Previously, it was possible to access only the RGB rendering or only the depth rendering, not both at once. Gymnasium v1.1 adds RGBD rendering that outputs both the RGB and depth-based images as a single output.</li>\\n<li>In Gymnasium v1.0, we improved the conversion wrappers between <code>jax</code>, <code>torch</code> and <code>numpy</code> with <code>JaxToTorch</code>, <code>JaxToNumpy</code>, <code>NumpyToTorch</code>, etc. 
In v1.1, we've improved the wrappers to work with rendering and to be compatible with the full dlpack API.</li>\\n</ul>\\n<h2>New Features</h2>\\n<ul>\\n<li>Space sampling with probability masks by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/mariojerez/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/mariojerez\\\">@mariojerez</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2862826251\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1310\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1310/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1310\\\">#1310</a>)</li>\\n<li>Add rgb + depth rendering to mujoco by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/DavidPL1/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/DavidPL1\\\">@DavidPL1</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2609027305\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1229\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1229/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1229\\\">#1229</a>)</li>\\n<li>Add Render Support for <code>JaxToNumpy</code>, <code>JaxToTorch</code> and <code>NumpyToTorch</code> by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/amacati/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/amacati\\\">@amacati</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2848895679\\\" data-permission-text=\\\"Title is private\\\" 
data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1306\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1306/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1306\\\">#1306</a>)</li>\\n<li>Add full support for dlpack API by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/amacati/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/amacati\\\">@amacati</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2814347019\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1299\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1299/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1299\\\">#1299</a>)</li>\\n<li>Enable the StickyAction wrapper to repeat the old action for more than 1 step by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/sparisi/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/sparisi\\\">@sparisi</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2640457710\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1240\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1240/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1240\\\">#1240</a>)</li>\\n<li>Add <code>wrappers.vector.TransformObs/Action</code> support for a single obs/action space argument by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/howardh/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" 
href=\\\"https://github.com/howardh\\\">@howardh</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2774346221\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1288\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1288/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1288\\\">#1288</a>)</li>\\n<li>Improve 'set_wrapper_attr' for wrappers to modify how variables are updated by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/duburcqa/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/duburcqa\\\">@duburcqa</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2796910732\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1294\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1294/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1294\\\">#1294</a>)</li>\\n<li>Allow <code>AtariPreprocessing</code> to support non-square observations by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/li-plus/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/li-plus\\\">@li-plus</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2865365152\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1312\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1312/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1312\\\">#1312</a>)</li>\\n<li>Improve <code>Wrapper</code> and <code>VectorWrapper</code> error checking by <a 
data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2682633843\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1260\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1260/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1260\\\">#1260</a>)</li>\\n</ul>\\n<h2>Bug Fixes</h2>\\n<ul>\\n<li>Fix <code>get_wrapper_attr</code> / <code>set_wrapper_attr</code> for edge case by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/duburcqa/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/duburcqa\\\">@duburcqa</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2796262300\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1293\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1293/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1293\\\">#1293</a>)</li>\\n<li>Add width and height check for MujocoRenderer to help error checking by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/Beanpow/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/Beanpow\\\">@Beanpow</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2609101033\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1230\\\" data-hovercard-type=\\\"pull_request\\\" 
data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1230/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1230\\\">#1230</a>)</li>\\n<li>Fix CarRacing-v3 with discrete actions by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/Sebastian-Griesbach/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/Sebastian-Griesbach\\\">@Sebastian-Griesbach</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2661510536\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1253\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1253/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1253\\\">#1253</a>)</li>\\n<li>Fix Mujoco Rendering for <code>rgb_array</code> that is upside down by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2701582796\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1264\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1264/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1264\\\">#1264</a>)</li>\\n<li>Revert <code>OrderedDict</code> key ordering in <code>Dict</code> space by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> (<a data-error-text=\\\"Failed 
to load title\\\" data-id=\\\"2776425048\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1291\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1291/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1291\\\">#1291</a>)</li>\\n<li><code>wrappers.vector.NumpyToTorch</code> doesn't require <code>jax</code> to work by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2852244816\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1308\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1308/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1308\\\">#1308</a>)</li>\\n<li>Prevented accidental alteration of <code>env_spec_kwargs</code> in <code>make_vec</code> by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/TimSchneider42/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/TimSchneider42\\\">@TimSchneider42</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2838283083\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1304\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1304/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1304\\\">#1304</a>)</li>\\n<li>Quit Pygame when cliffwalking environment is closed by @sjmikler (<a data-error-text=\\\"Failed to load title\\\" 
data-id=\\\"2807765801\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1297\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1297/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1297\\\">#1297</a>)</li>\\n<li>Add <code>TimeAwareObservation</code> support for environments without a <code>spec</code> by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2775663588\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1289\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1289/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1289\\\">#1289</a>)</li>\\n<li>Improve the error message when ending the MuJoCo renderer without calling <code>env.close</code> by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/a-ayesh/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/a-ayesh\\\">@a-ayesh</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2761798816\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1283\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1283/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1283\\\">#1283</a>)</li>\\n</ul>\\n<h2>Documentation Changes</h2>\\n<ul>\\n<li>Fixed typo in blackjack for the values sampled, 2-10 instead of 2-9 by <a 
data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/artemmiyy/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/artemmiyy\\\">@artemmiyy</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2656508334\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1250\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1250/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1250\\\">#1250</a>)</li>\\n<li>Added some popular third-party beginner tutorials by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/godsboy404/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/godsboy404\\\">@godsboy404</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2737913912\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1271\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1271/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1271\\\">#1271</a>)</li>\\n<li>Update documentation for reward scaling wrappers by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/keraJLi/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/keraJLi\\\">@keraJLi</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2765903176\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1285\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1285/hovercard\\\" 
href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1285\\\">#1285</a>)</li>\\n</ul>\\n<p>Thanks to the 31 new contributors that contributed to this release with the <strong>Full Changelog</strong>: <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.0.0...v1.1.0\\\"><tt>v1.0.0...v1.1.0</tt></a></p></section>\\n<section id=\\\"release-v1-0-0\\\">\\n<h2>v1.0.0<a href=\\\"#release-v1-0-0\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p><em>Released on 2024-10-08 - <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.0.0\\\">GitHub</a> - <a href=\\\"https://pypi.org/project/gymnasium/v1.0.0/\\\">PyPI</a></em></p>\\n<h2>v1.0.0 release notes</h2>\\n<p>Over the last few years, the volunteer team behind Gym and Gymnasium has worked to fix bugs, improve the documentation, add new features, and change the API where appropriate so that the benefits outweigh the co\"\n}"
    }
  ],
  "details": {
    "url": "https://gymnasium.farama.org/gymnasium_release_notes/index.html",
    "title": "Gymnasium Documentation",
    "summary": "A standard API for reinforcement learning and a diverse set of reference environments (formerly Gym)",
    "content": "<div id=\"readability-page-1\" class=\"page\"><article role=\"main\">\n            \n            <section id=\"gymnasium-release-notes\">\n<h2>Gymnasium Release Notes<a href=\"#gymnasium-release-notes\" title=\"Link to this heading\">¶</a></h2>\n<section id=\"release-v1-3-0\">\n<h2>v1.3.0<a href=\"#release-v1-3-0\" title=\"Link to this heading\">¶</a></h2>\n<p><em>Released on 2026-04-22 - <a href=\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.3.0\">GitHub</a> - <a href=\"https://pypi.org/project/gymnasium/v1.3.0/\">PyPI</a></em></p>\n<h2>Gymnasium v1.3.0</h2>\n<p>This release brings a new Taxi environment version, a new <code>RepeatAction</code> wrapper, and a range of bug fixes across vector environments and wrappers.</p>\n<h2>Core Changes</h2>\n<ul>\n<li>The <code>Taxi</code> environment has been updated to <code>v4</code> to correct the <code>is_rainy</code> implementation, which previously did not behave as documented by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> (<a data-error-text=\"Failed to load title\" data-id=\"4307782993\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1561\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1561/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1561\">#1561</a>)</li>\n<li><code>pygame</code> has been replaced with <a href=\"https://pyga.me/\" rel=\"nofollow\"><code>pygame-ce</code></a>, unlocking Python 3.14 compatibility. 
The drop-in replacement preserves the existing rendering behaviour by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/mwydmuch/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/mwydmuch\">@mwydmuch</a> (<a data-error-text=\"Failed to load title\" data-id=\"3788673329\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1512\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1512/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1512\">#1512</a>)</li>\n<li>Added <code>RepeatAction</code> wrapper that repeats a given action for a fixed number of steps, useful for frame-skipping and coarser control loops by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/Lidang-Jiang/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/Lidang-Jiang\">@Lidang-Jiang</a> (<a data-error-text=\"Failed to load title\" data-id=\"4194777638\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1553\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1553/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1553\">#1553</a>)</li>\n<li>Optimised <code>Box.__init__</code> to reduce overhead through lazy evaluation of variables by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> (<a data-error-text=\"Failed to load title\" data-id=\"3958776450\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1529\" data-hovercard-type=\"pull_request\" 
data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1529/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1529\">#1529</a>)</li>\n<li>Fix vector <code>RecordVideo</code> wrapper to remove memory leaks across episodes by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/JonahFSD/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/JonahFSD\">@JonahFSD</a> (<a data-error-text=\"Failed to load title\" data-id=\"3914195171\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1527\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1527/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1527\">#1527</a>)</li>\n<li>Fix Vector <code>NormalizeReward</code> wrapper to work identically as the non-vectorized version by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/JonahFSD/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/JonahFSD\">@JonahFSD</a> (<a data-error-text=\"Failed to load title\" data-id=\"3912418439\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1526\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1526/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1526\">#1526</a>)</li>\n</ul>\n<h2>Bug Fixes</h2>\n<ul>\n<li>Removed the <code>VectorEnv</code> destructor, which was causing <code>VectorEnv.close()</code> to be called unintentionally by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/TimSchneider42/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/TimSchneider42\">@TimSchneider42</a> (<a data-error-text=\"Failed to load title\" data-id=\"3832761339\" 
data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1522\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1522/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1522\">#1522</a>)</li>\n<li>Fixed <code>NormalizeObservation</code> vectorized wrapper to override the <code>observation_space</code> and a <code>float32</code> cast by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/JonahFSD/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/JonahFSD\">@JonahFSD</a> (<a data-error-text=\"Failed to load title\" data-id=\"3914296741\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1528\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1528/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1528\">#1528</a>)</li>\n<li>Fixed <code>RenderFrame</code>'s typing by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/jorenham/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/jorenham\">@jorenham</a> (<a data-error-text=\"Failed to load title\" data-id=\"4305182899\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1560\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1560/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1560\">#1560</a>)</li>\n<li>Fixed A2C tutorial's entropy computation when a scalar (rather than tensor) is passed to the loss function by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/epoint95/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/epoint95\">@epoint95</a> 
(<a data-error-text=\"Failed to load title\" data-id=\"3868402075\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1524\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1524/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1524\">#1524</a>)</li>\n</ul>\n<h2>Third-Party Environments</h2>\n<p>10 new community environments have been added to the <a href=\"https://gymnasium.farama.org/environments/third_party_environments/\" rel=\"nofollow\">third-party environments list</a>, including a new Cybersecurity environments section.</p>\n<p><strong>Full Changelog:</strong> <a href=\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.2.3...v1.3.0\"><tt>v1.2.3...v1.3.0</tt></a></p></section>\n<section id=\"release-v1-2-3\">\n<h2>v1.2.3<a href=\"#release-v1-2-3\" title=\"Link to this heading\">¶</a></h2>\n<p><em>Released on 2025-12-18 - <a href=\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.2.3\">GitHub</a> - <a href=\"https://pypi.org/project/gymnasium/v1.2.3/\">PyPI</a></em></p>\n<h2>Gymnasium V1.2.3</h2>\n<p>This is a minor release; the most significant change is switching the dependency for \"gymnasium[box2d]\" from <a href=\"https://pypi.org/project/box2d-py\" rel=\"nofollow\">box2d-py</a> to <a href=\"https://pypi.org/project/Box2D/\" rel=\"nofollow\">box2d</a>. See <a data-error-text=\"Failed to load title\" data-id=\"3580605032\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1474\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1474/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1474\">#1474</a> for more detail.</p>\n<p>Additionally, we've fixed several typos and added a couple of third-party projects. 
Lastly, we fixed the create an environment tutorial to use NumPy [row, col] in <a data-error-text=\"Failed to load title\" data-id=\"3704120436\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1490\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1490/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1490\">#1490</a></p>\n<p><strong>Full Changelog</strong>: <a href=\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.2.2...v1.2.3\"><tt>v1.2.2...v1.2.3</tt></a></p></section>\n<section id=\"release-v1-2-2\">\n<h2>v1.2.2: V1.2.2<a href=\"#release-v1-2-2\" title=\"Link to this heading\">¶</a></h2>\n<p><em>Released on 2025-11-04 - <a href=\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.2.2\">GitHub</a> - <a href=\"https://pypi.org/project/gymnasium/v1.2.2/\">PyPI</a></em></p>\n<h2>Gymnasium v1.2.2</h2>\n<p>This is a minor update just to add <code>Discrete.dtype</code> and improve <code>DictInfoToList</code></p>\n<p>These additional changes were made</p>\n<ul>\n<li>Add <code>Discrete.dtype</code> parameter by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/VadimBim/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/VadimBim\">@VadimBim</a> in <a data-error-text=\"Failed to load title\" data-id=\"3529955390\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1467\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1467/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1467\">#1467</a></li>\n<li>Improve <code>MultiDiscrete.dtype</code> by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" 
href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> in <a data-error-text=\"Failed to load title\" data-id=\"3542266766\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1469\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1469/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1469\">#1469</a></li>\n<li>Improve <code>DictInfoToList</code> to support vector info with missing binary key by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> in <a data-error-text=\"Failed to load title\" data-id=\"3526222616\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1466\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1466/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1466\">#1466</a></li>\n</ul>\n<p>Thanks to Wispr for their support of the project</p>\n<p><strong>Full Changelog</strong>: <a href=\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.2.1...v1.2.2\"><tt>v1.2.1...v1.2.2</tt></a></p></section>\n<section id=\"release-v1-2-1\">\n<h2>v1.2.1<a href=\"#release-v1-2-1\" title=\"Link to this heading\">¶</a></h2>\n<p><em>Released on 2025-09-23 - <a href=\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.2.1\">GitHub</a> - <a href=\"https://pypi.org/project/gymnasium/v1.2.1/\">PyPI</a></em></p>\n<h2>Gymnasium v1.2.1</h2>\n<p>Minor update that adds new wrappers, optimizes several environment functions and fixes several bugs.</p>\n<h2>New Features</h2>\n<ul>\n<li>Added new wrappers to discretize observations and actions (<code>gymnasium.wrappers.DiscretizeObservation</code> and 
<code>gymnasium.wrappers.DiscretizeAction</code>) by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/sparisi/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/sparisi\">@sparisi</a> (<a data-error-text=\"Failed to load title\" data-id=\"3185358242\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1411\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1411/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1411\">#1411</a>)</li>\n<li>Add vector-based <code>RecordVideo</code> wrapper that allows recording all sub-environments at the same time by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/sparisi/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/sparisi\">@sparisi</a> (<a data-error-text=\"Failed to load title\" data-id=\"3241889899\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1418\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1418/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1418\">#1418</a>)</li>\n<li>Add new customization to Frozen Lake for reward schedule and success rate by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/Tayzn/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/Tayzn\">@Tayzn</a> (<a data-error-text=\"Failed to load title\" data-id=\"3021051457\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1363\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1363/hovercard\" 
href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1363\">#1363</a>)</li>\n<li>Add tutorial on action masking by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/dantp-ai/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/dantp-ai\">@dantp-ai</a> (<a data-error-text=\"Failed to load title\" data-id=\"2958306932\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1345\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1345/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1345\">#1345</a>)</li>\n<li>Optimize several environment functions for Blackjack and Mujoco-Humanoid by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/aseembits93/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/aseembits93\">@aseembits93</a> (<a data-error-text=\"Failed to load title\" data-id=\"3298033175\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1426\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1426/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1426\">#1426</a>, <a data-error-text=\"Failed to load title\" data-id=\"3320254150\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1429\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1429/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1429\">#1429</a>, <a data-error-text=\"Failed to load title\" data-id=\"3320248227\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1428\" data-hovercard-type=\"pull_request\" 
data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1428/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1428\">#1428</a> and <a data-error-text=\"Failed to load title\" data-id=\"3356742105\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1436\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1436/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1436\">#1436</a>)</li>\n</ul>\n<h2>Fixes</h2>\n<ul>\n<li>Fixes memory leak in recording by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pascal-roth/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pascal-roth\">@pascal-roth</a> in <a data-error-text=\"Failed to load title\" data-id=\"3394001149\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1444\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1444/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1444\">#1444</a></li>\n<li>Fix \"pygame is not responding\" when rendering the pygame window in the Taxi environment by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/RizkySugiharto/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/RizkySugiharto\">@RizkySugiharto</a> (<a data-error-text=\"Failed to load title\" data-id=\"3182695017\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1410\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1410/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1410\">#1410</a>)</li>\n<li>Fix the vector wrapper warnings by <a data-hovercard-type=\"user\" 
data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> (<a data-error-text=\"Failed to load title\" data-id=\"3205200502\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1415\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1415/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1415\">#1415</a>)</li>\n<li>Fix <code>AsyncVectorEnv.step</code> from hanging forever by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/matinmoezzi/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/matinmoezzi\">@matinmoezzi</a> (<a data-error-text=\"Failed to load title\" data-id=\"3243762778\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1419\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1419/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1419\">#1419</a>)</li>\n<li>Clear markers after rendering for MuJoCo OffscreenViewers by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/amacati/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/amacati\">@amacati</a> (<a data-error-text=\"Failed to load title\" data-id=\"3248714510\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1420\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1420/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1420\">#1420</a>)</li>\n<li>Fix array-conversion for torch cuda tensors by <a data-hovercard-type=\"user\" 
data-hovercard-url=\"/users/amacati/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/amacati\">@amacati</a> (<a data-error-text=\"Failed to load title\" data-id=\"3363046088\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1439\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1439/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1439\">#1439</a>)</li>\n<li>Add missing dependency on packaging to array-api feature by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/amacati/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/amacati\">@amacati</a> (<a data-error-text=\"Failed to load title\" data-id=\"3361105679\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1437\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1437/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1437\">#1437</a>)</li>\n</ul>\n<p>Thank you to all the contributors</p>\n<p><strong>Full Changelog</strong>: <a href=\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.2.0...v1.2.1\"><tt>v1.2.0...v1.2.1</tt></a></p></section>\n<section id=\"release-v1-2-0\">\n<h2>v1.2.0<a href=\"#release-v1-2-0\" title=\"Link to this heading\">¶</a></h2>\n<p><em>Released on 2025-06-27 - <a href=\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.2.0\">GitHub</a> - <a href=\"https://pypi.org/project/gymnasium/v1.2.0/\">PyPI</a></em></p>\n<h2>Gymnasium v1.2.0</h2>\n<p>In Gym <a href=\"https://github.com/openai/gym/releases/tag/0.24.0\">v0.24</a>, v4 MuJoCo environments were added that used a different simulator (mujoco, not mujoco-py).<br>\nHaving been 3 years since v0.24 and with mujoco-py being 
unmaintained, it is limiting our ability to support Python 3.13.<br>\nTherefore, in this release, we have moved the MuJoCo v2 and v3 to the <a href=\"https://github.com/Farama-Foundation/gymnasium-robotics\">Gymnasium-Robotics</a> project, meaning the users who cannot upgrade to the v4 or v5 MuJoCo environments should update their code to:</p>\n<div data-snippet-clipboard-copy-content=\"import gymnasium as gym\nimport gymnasium_robotics  # `pip install &quot;gymnasium-robotics[mujoco-py]&quot;`\n\ngym.register_envs(gymnasium_robotics)  # optional\n\nenv = gym.make(&quot;Humanoid-v3&quot;)\"><pre><span>import</span> <span>gymnasium</span> <span>as</span> <span>gym</span>\n<span>import</span> <span>gymnasium_robotics</span>  <span># `pip install \"gymnasium-robotics[mujoco-py]\"`</span>\n\n<span>gym</span>.<span>register_envs</span>(<span>gymnasium_robotics</span>)  <span># optional</span>\n\n<span>env</span> <span>=</span> <span>gym</span>.<span>make</span>(<span>\"Humanoid-v3\"</span>)</pre></div>\n<p>In addition, we have added support for Python 3.13 (and dropped Python 3.8 and 3.9 following NumPy and other projects).</p>\n<h2>New Features</h2>\n<ul>\n<li>Add <code>AddWhiteNoise</code> and <code>ObstructView</code> wrappers that add noise to RGB renderings either across the whole image or sections by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/sparisi/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/sparisi\">@sparisi</a> (<a data-error-text=\"Failed to load title\" data-id=\"2645643074\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1243\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1243/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1243\">#1243</a>)</li>\n<li>Adds rainy and fickle options for Taxi environment that were implemented in the original 
paper by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/foreverska/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/foreverska\">@foreverska</a> (<a data-error-text=\"Failed to load title\" data-id=\"2867689892\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1315\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1315/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1315\">#1315</a>)</li>\n<li>Add <code>wrappers.ArrayConversion</code>, a generic conversion wrapper between Array API compatible frameworks (like NumPy, Torch, Jax, etc) by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/amacati/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/amacati\">@amacati</a> (<a data-error-text=\"Failed to load title\" data-id=\"2930111508\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1333\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1333/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1333\">#1333</a>)</li>\n<li>Add garbage collector on RecordVideo wrapper to prevent out of memory problems by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/vicbentu/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/vicbentu\">@vicbentu</a> (<a data-error-text=\"Failed to load title\" data-id=\"3054955909\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1378\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1378/hovercard\" 
href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1378\">#1378</a>)</li>\n</ul>\n<h2>Bug Fixes</h2>\n<ul>\n<li>Fix MuJoCo add_markers for mujoco&gt;=3.2 by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/amacati/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/amacati\">@amacati</a> (<a data-error-text=\"Failed to load title\" data-id=\"2900883178\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1329\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1329/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1329\">#1329</a>)</li>\n<li>Update CliffWalking version in envs registry by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/abouelsaadat/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/abouelsaadat\">@abouelsaadat</a> (<a data-error-text=\"Failed to load title\" data-id=\"2936052417\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1334\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1334/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1334\">#1334</a>)</li>\n<li>Fix jax 0.6.0 regression in jax_to_numpy by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/GaetanLepage/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/GaetanLepage\">@GaetanLepage</a> (<a data-error-text=\"Failed to load title\" data-id=\"3004879973\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1361\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1361/hovercard\" 
href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1361\">#1361</a>)</li>\n<li>Fix Atari Preprocessing to only check env has NOOP action when <code>noop_max&gt;0</code> by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> (<a data-error-text=\"Failed to load title\" data-id=\"3098488051\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1393\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1393/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1393\">#1393</a>)</li>\n<li>Fix all warnings in tests/spaces by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/amacati/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/amacati\">@amacati</a> (<a data-error-text=\"Failed to load title\" data-id=\"3126115146\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1396\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1396/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1396\">#1396</a>)</li>\n<li>Don't override SDL's audio driver on Linux by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/AnErrupTion/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/AnErrupTion\">@AnErrupTion</a> (<a data-error-text=\"Failed to load title\" data-id=\"3079104537\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1387\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1387/hovercard\" 
href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1387\">#1387</a>)</li>\n</ul>\n<p>In addition, this release includes numerous updates to the documentation, most importantly to the introductory pages, with an aim to make them easier for new users of Gymnasium or RL to understand.</p>\n<p><strong>Full Changelog</strong>: <a href=\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.1.1...v1.2.0\"><tt>v1.1.1...v1.2.0</tt></a></p></section>\n<section id=\"release-v1-1-1\">\n<h2>Gymnasium v1.1.1<a href=\"#release-v1-1-1\" title=\"Link to this heading\">¶</a></h2>\n<p><em>Released on 2025-03-06 - <a href=\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.1.1\">GitHub</a> - <a href=\"https://pypi.org/project/gymnasium/v1.1.1/\">PyPI</a></em></p>\n<h2>Changes</h2>\n<ul>\n<li>Remove assert on metadata render modes for MuJoCo-based environments in <code>mujoco_env.py</code> (<a data-error-text=\"Failed to load title\" data-id=\"2900123613\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1328\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1328/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1328\">#1328</a>)</li>\n<li>Update <code>wrappers.vector.NumpyToTorch</code> to refer to <code>numpy</code> instead of <code>jax</code>  by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pkuderov/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pkuderov\">@pkuderov</a> in <a data-error-text=\"Failed to load title\" data-id=\"2883844632\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1319\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1319/hovercard\" 
href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1319\">#1319</a></li>\n</ul>\n<p><strong>Full Changelog</strong>: <a href=\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.1.0...v1.1.1\"><tt>v1.1.0...v1.1.1</tt></a></p></section>\n<section id=\"release-v1-1-0\">\n<h2>v1.1.0<a href=\"#release-v1-1-0\" title=\"Link to this heading\">¶</a></h2>\n<p><em>Released on 2025-02-26 - <a href=\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.1.0\">GitHub</a> - <a href=\"https://pypi.org/project/gymnasium/v1.1.0/\">PyPI</a></em></p>\n<h2>Gymnasium v1.1</h2>\n<p>In this release, we fix several bugs with Gymnasium v1.0 along with new features to improve the changes made.</p>\n<p>With the release of Gymnasium v1.0, one of the major changes we made was to the vector environment implementation, improving how users interface with it and extend it. Some users noted that they required backward compatibility with how vector environments automatically reset sub-environments when they terminated or truncated, referred to as the autoreset mode or API. As a result, in v1.1, we have added support to the implemented vector environments (<code>SyncVectorEnv</code> and <code>AsyncVectorEnv</code>) and wrappers for all three possible modes: next-step, same-step and disabled. To know the type of autoreset mode used, it should be specified in the vector environment metadata, <code>metadata[\"autoreset_mode\"]</code> as a <code>gymnasium.vectors.AutoresetMode</code> enum. For more information on the differences between the autoreset modes and how to use them, read <a href=\"https://farama.org/Vector-Autoreset-Mode\" rel=\"nofollow\">https://farama.org/Vector-Autoreset-Mode</a>.</p>\n<p>In addition, we've added several helpful features</p>\n<ul>\n<li>In Gymnasium v0.25, we added masking for spaces, <code>space.sample(mask=...)</code>, to logically mask out possible samples from spaces. 
<a data-error-text=\"Failed to load title\" data-id=\"2862826251\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1310\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1310/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1310\">#1310</a> adds probability masking for each space with <code>space.sample(probability=...)</code> to specify the probability of each sample, which is helpful for RL policies that output a probability distribution of actions.</li>\n<li>For MuJoCo environments, users have the option to render the robot with RGB images or depth-based images. Previously, it was only possible to access only the RGB or depth rendering. Gymnasium v1.1 adds RGBD rendering that outputs both the RGB and depth-based image as a single output.</li>\n<li>In Gymnasium v1.0, we improve the conversion wrappers between <code>jax</code>, <code>torch</code> and <code>numpy</code> with <code>JaxToTorch</code>, <code>JaxToNumpy</code>, <code>NumpyToTorch</code>, etc. 
In v1.1, we've improved the wrappers to work with rendering and to be compatible with the full dlpack API.</li>\n</ul>\n<h2>New Features</h2>\n<ul>\n<li>Space sampling with probability masks by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/mariojerez/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/mariojerez\">@mariojerez</a> (<a data-error-text=\"Failed to load title\" data-id=\"2862826251\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1310\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1310/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1310\">#1310</a>)</li>\n<li>Add rgb + depth rendering to mujoco by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/DavidPL1/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/DavidPL1\">@DavidPL1</a> (<a data-error-text=\"Failed to load title\" data-id=\"2609027305\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1229\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1229/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1229\">#1229</a>)</li>\n<li>Add Render Support for <code>JaxToNumpy</code>, <code>JaxToTorch</code> and <code>NumpyToTorch</code> by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/amacati/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/amacati\">@amacati</a> (<a data-error-text=\"Failed to load title\" data-id=\"2848895679\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1306\" data-hovercard-type=\"pull_request\" 
data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1306/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1306\">#1306</a>)</li>\n<li>Add full support for dlpack API by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/amacati/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/amacati\">@amacati</a> (<a data-error-text=\"Failed to load title\" data-id=\"2814347019\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1299\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1299/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1299\">#1299</a>)</li>\n<li>Enable the StickyAction wrapper to repeat the old action for more than 1 step by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/sparisi/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/sparisi\">@sparisi</a> (<a data-error-text=\"Failed to load title\" data-id=\"2640457710\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1240\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1240/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1240\">#1240</a>)</li>\n<li>Add <code>wrappers.vector.TransformObs/Action</code> support for a single obs/action space argument by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/howardh/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/howardh\">@howardh</a> (<a data-error-text=\"Failed to load title\" data-id=\"2774346221\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1288\" data-hovercard-type=\"pull_request\" 
data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1288/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1288\">#1288</a>)</li>\n<li>Improve 'set_wrapper_attr' for wrappers to modify how variables are updated by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/duburcqa/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/duburcqa\">@duburcqa</a> (<a data-error-text=\"Failed to load title\" data-id=\"2796910732\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1294\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1294/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1294\">#1294</a>)</li>\n<li>Allow <code>AtariPreprocessing</code> to support non-square observations by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/li-plus/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/li-plus\">@li-plus</a> (<a data-error-text=\"Failed to load title\" data-id=\"2865365152\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1312\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1312/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1312\">#1312</a>)</li>\n<li>Improve <code>Wrapper</code> and <code>VectorWrapper</code> error checking by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> (<a data-error-text=\"Failed to load title\" data-id=\"2682633843\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1260\" 
data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1260/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1260\">#1260</a>)</li>\n</ul>\n<h2>Bug Fixes</h2>\n<ul>\n<li>Fix <code>get_wrapper_attr</code> / <code>set_wrapper_attr</code> for edge case by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/duburcqa/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/duburcqa\">@duburcqa</a> (<a data-error-text=\"Failed to load title\" data-id=\"2796262300\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1293\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1293/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1293\">#1293</a>)</li>\n<li>Add width and height check for MujocoRenderer to help error checking by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/Beanpow/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/Beanpow\">@Beanpow</a> (<a data-error-text=\"Failed to load title\" data-id=\"2609101033\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1230\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1230/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1230\">#1230</a>)</li>\n<li>Fix CarRacing-v3 with discrete actions by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/Sebastian-Griesbach/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/Sebastian-Griesbach\">@Sebastian-Griesbach</a> (<a data-error-text=\"Failed to load title\" data-id=\"2661510536\" data-permission-text=\"Title is private\" 
data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1253\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1253/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1253\">#1253</a>)</li>\n<li>Fix Mujoco Rendering for <code>rgb_array</code> that is upside down by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> (<a data-error-text=\"Failed to load title\" data-id=\"2701582796\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1264\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1264/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1264\">#1264</a>)</li>\n<li>Revert <code>OrderedDict</code> key ordering in <code>Dict</code> space by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> (<a data-error-text=\"Failed to load title\" data-id=\"2776425048\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1291\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1291/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1291\">#1291</a>)</li>\n<li><code>wrappers.vector.NumpyToTorch</code> doesn't require <code>jax</code> to work by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> (<a 
data-error-text=\"Failed to load title\" data-id=\"2852244816\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1308\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1308/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1308\">#1308</a>)</li>\n<li>Prevented accidental alteration of <code>env_spec_kwargs</code> in <code>make_vec</code> by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/TimSchneider42/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/TimSchneider42\">@TimSchneider42</a> (<a data-error-text=\"Failed to load title\" data-id=\"2838283083\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1304\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1304/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1304\">#1304</a>)</li>\n<li>Quit Pygame when cliffwalking environment is closed by @sjmikler (<a data-error-text=\"Failed to load title\" data-id=\"2807765801\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1297\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1297/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1297\">#1297</a>)</li>\n<li>Add <code>TimeAwareObservation</code> support for environments without a <code>spec</code> by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> (<a data-error-text=\"Failed to load title\" data-id=\"2775663588\" data-permission-text=\"Title is private\" 
data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1289\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1289/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1289\">#1289</a>)</li>\n<li>Improve the error message when ending the MuJoCo renderer without calling <code>env.close</code> by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/a-ayesh/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/a-ayesh\">@a-ayesh</a> (<a data-error-text=\"Failed to load title\" data-id=\"2761798816\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1283\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1283/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1283\">#1283</a>)</li>\n</ul>\n<h2>Documentation Changes</h2>\n<ul>\n<li>Fixed typo in blackjack for the values sampled, 2-10 instead of 2-9 by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/artemmiyy/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/artemmiyy\">@artemmiyy</a> (<a data-error-text=\"Failed to load title\" data-id=\"2656508334\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1250\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1250/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1250\">#1250</a>)</li>\n<li>Added some popular third-party beginner tutorials by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/godsboy404/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/godsboy404\">@godsboy404</a> (<a data-error-text=\"Failed to load title\" 
data-id=\"2737913912\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1271\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1271/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1271\">#1271</a>)</li>\n<li>Update documentation for reward scaling wrappers by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/keraJLi/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/keraJLi\">@keraJLi</a> (<a data-error-text=\"Failed to load title\" data-id=\"2765903176\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1285\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1285/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1285\">#1285</a>)</li>\n</ul>\n<p>Thanks to the 31 new contributors that contributed to this release with the <strong>Full Changelog</strong>: <a href=\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.0.0...v1.1.0\"><tt>v1.0.0...v1.1.0</tt></a></p></section>\n<section id=\"release-v1-0-0\">\n<h2>v1.0.0<a href=\"#release-v1-0-0\" title=\"Link to this heading\">¶</a></h2>\n<p><em>Released on 2024-10-08 - <a href=\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.0.0\">GitHub</a> - <a href=\"https://pypi.org/project/gymnasium/v1.0.0/\">PyPI</a></em></p>\n<h2>v1.0.0 release notes</h2>\n<p>Over the last few years, the volunteer team behind Gym and Gymnasium has worked to fix bugs, improve the documentation, add new features, and change the API where appropriate so that the benefits outweigh the co"
  }
}
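The v1.1 release notes fetched above name three autoreset modes for vector environments: next-step, same-step and disabled. The toy sketch below illustrates how the observation stream differs between them for a single sub-environment. It is not Gymnasium's implementation: `CountdownEnv`, `AutoresetRunner`, `rollout`, the local `AutoresetMode` enum and the `final_obs` info key are hypothetical stand-ins, and the semantics follow a reading of the linked Vector-Autoreset-Mode article rather than the library source.

```python
from enum import Enum


class AutoresetMode(Enum):
    """Hypothetical local stand-in for the three modes named in the release notes."""
    NEXT_STEP = "next-step"
    SAME_STEP = "same-step"
    DISABLED = "disabled"


class CountdownEnv:
    """Toy environment: the observation is a step counter; the episode ends at `length`."""

    def __init__(self, length=3):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self):
        self.t += 1
        return self.t, 1.0, self.t >= self.length


class AutoresetRunner:
    """Wraps one environment and applies the chosen autoreset mode (assumed semantics)."""

    def __init__(self, env, mode):
        self.env = env
        self.mode = mode
        self.needs_reset = False
        self.env.reset()

    def step(self):
        info = {}
        if self.mode is AutoresetMode.NEXT_STEP and self.needs_reset:
            # Next-step: the step after termination is spent on the reset.
            self.needs_reset = False
            return self.env.reset(), 0.0, False, info
        obs, reward, terminated = self.env.step()
        if terminated:
            if self.mode is AutoresetMode.SAME_STEP:
                # Same-step: reset immediately and keep the last observation in info.
                info["final_obs"] = obs
                obs = self.env.reset()
            elif self.mode is AutoresetMode.NEXT_STEP:
                self.needs_reset = True
            # Disabled: nothing happens here; the caller must reset.
        return obs, reward, terminated, info


def rollout(mode, n_steps=5):
    """Collect the observations seen over `n_steps` calls to `step`."""
    runner = AutoresetRunner(CountdownEnv(length=3), mode)
    observations = []
    for _ in range(n_steps):
        obs, _, terminated, _ = runner.step()
        observations.append(obs)
        if terminated and mode is AutoresetMode.DISABLED:
            runner.env.reset()  # manual reset by the caller
    return observations


if __name__ == "__main__":
    for mode in AutoresetMode:
        print(mode.value, rollout(mode))
```

With a three-step episode, next-step mode shows the terminal observation and then spends one extra step on the reset, same-step mode replaces the terminal observation with the reset observation, and disabled mode leaves the reset to the caller.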

▼thought

Now I have a comprehensive picture of Gymnasium. Let me also get the Farama Foundation stats page and the third-party environments page for completeness.

▼✓Fetch web page content, URL: "https://farama.org/stats/installations"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://farama.org/stats/installations\",\n  \"title\": \"Installations\",\n  \"summary\": \"Maintaining The World’s Open Source Reinforcement Learning Tools\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><div id=\\\"farama-stats-container\\\"><div><p><span>427,777,793</span></p></div><div id=\\\"collapse_downloads\\\"><div><p><span></span><span>403,904,777</span></p><div id=\\\"collapse_downloads_Gymnasium\\\"><p><span>Gymnasium</span><span>70,396,473</span></p><p><span>Deprecated Libraries Subsumed:</span></p><p><span>gym</span><span>333,508,304</span></p></div></div><div><p><span>PettingZoo</span><span>5,249,005</span></p></div><div><p><span>Minari</span><span>374,042</span></p></div><div><p><span>Gymnasium-Robotics</span><span>543,608</span></p></div><div><p><span>Jumpy</span><span>38,110</span></p></div><div><p><span></span><span>201,991</span></p><div id=\\\"collapse_downloads_MAgent2\\\"><p><span>MAgent2</span><span>43,479</span></p><p><span>Deprecated Libraries Subsumed:</span></p><p><span>magent</span><span>158,512</span></p></div></div><div><p><span>Metaworld</span><span>120,947</span></p></div><div><p><span></span><span>2,426,372</span></p><div id=\\\"collapse_downloads_MiniGrid\\\"><p><span>Minigrid</span><span>1,072,585</span></p><p><span>Deprecated Libraries Subsumed:</span></p><p><span>gym-minigrid</span><span>1,353,787</span></p><p><span>babyai</span><span>0</span></p></div></div><div><p><span>MiniWoB++</span><span>13,724</span></p></div><div><p><span>MOMAland</span><span>7,082</span></p></div><div><p><span></span><span>269,147</span></p><div id=\\\"collapse_downloads_MO-Gymnasium\\\"><p><span>MO-Gymnasium</span><span>259,496</span></p><p><span>Deprecated Libraries Subsumed:</span></p><p><span>mo-gym</span><span>9,651</span></p></div></div><div><p><span>Shimmy</span><span>4,785,290</span></p></div><div><p><span>Arcade Learning 
Environment</span><span>9,192,090</span></p></div><div><p><span>HighwayEnv</span><span>231,619</span></p></div><div><p><span>MPE2</span><span>13,092</span></p></div><div><p><span>Procgen2</span><span>1,670</span></p></div><div><p><span>Stable-Retro</span><span>58,217</span></p></div><div><p><span>ViZDoom</span><span>347,010</span></p></div></div></div></div>\"\n}"
    }
  ]
}

▼✓Web search, result count: 10, query: "Gymnasium wrappers API spaces Env observation action reward step reset"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"results\": [\n    {\n      \"title\": \"Env - Gymnasium Documentation\",\n      \"url\": \"https://gymnasium.farama.org/api/env/\",\n      \"description\": \"* `step()` - Updates an environment with actions returning the next agent observation, the reward for taking that actions, if the environment has terminated or truncated due to the latest action and information from the environment about the step, i.e. metrics, debug info. :   * **observation** (*ObsType*) – An element of the environment’s `observation_space` as the next observation due to the agent actions. :   Resets the environment to an initial internal state, returning an initial observation and info. However, if the environment already has a PRNG and `seed=None` is passed, the PRNG will *not* be reset and the env’s `np_random_seed` will *not* be altered. The environment’s `metadata` render modes (env.metadata[“render\\\\_modes”]) should contain the possible ways to implement the render modes. Changed in version 0.25.0: The render function was changed to no longer accept parameters, rather these parameters should be specified in the environment initialised, i.e., `gymnasium.make(\\\"CartPole-v1\\\",render_mode=\\\"human\\\")`. When implementing an environment, the `Env.reset()` and `Env.step()` functions must be created to describe the dynamics of the environment.\"\n    },\n    {\n      \"title\": \"Core - Gym Documentation\",\n      \"url\": \"https://www.gymlibrary.dev/api/core/\",\n      \"description\": \"## gym.Env¶. gym.Env.step(*self*, *action: ActType*) → Tuple[ObsType, float, bool, bool, dict]¶. When end of episode is reached, you are responsible for calling `reset()` to reset this environment’s state. Accepts an action and returns either a tuple (observation, reward, terminated, truncated, info). :   * **observation** (*object*) – this will be an element of the environment’s `observation_space`. 
* **reward** (*float*) – The amount of reward returned as a result of taking the action. If the environment already has a random number generator and `reset()` is called with `seed=None`, the RNG should not be reset. However, if the environment already has a PRNG and `seed=None` is passed, the PRNG will *not* be reset. This will be an element of `observation_space` (typically a numpy array) and is analogous to the observation returned by `step()`. It is of datatype `Space` provided by Gym. For example, if the observation space is of type `Box` and the shape of the object is `(4,)`, this denotes a valid observation will be an array of 4 numbers.\"\n    },\n    {\n      \"title\": \"Wrappers - Gym Documentation\",\n      \"url\": \"https://www.gymlibrary.dev/api/wrappers/\",\n      \"description\": \"When using the AutoResetWrapper to collect rollouts, note that the when `self.env.step()` returns `done`, a new observation from after calling `self.env.reset()` is returned by `self.step()` alongside the terminal reward and done state from the previous episode . | `AutoResetWrapper` | `gym.Wrapper` | `env` | The wrapped environment will automatically reset when the done state is reached. | `FilterObservation` | `gym.ObservationWrapper` | `env`, `filter_keys=None` | If you have an environment that returns dictionaries as observations, but you would like to only keep a subset of the entries, you can use this wrapper. | `FlattenObservation` | `gym.ObservationWrapper` | `env` | Observation wrapper that flattens the observation |. | `ResizeObservation` | `gym.ObservationWrapper` | `env`, `shape` | This wrapper works on environments with image observations (or more generally observations of shape AxBxC) and resizes the observation to the shape given by the tuple `shape`. 
|  | `gym.ObservationWrapper` | `env`, `f` | This wrapper will apply `f` to observations |.\"\n    },\n    {\n      \"title\": \"Gymnasium/gymnasium/core.py at main - GitHub\",\n      \"url\": \"https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/core.py\",\n      \"description\": \"\\\"\\\"\\\"Wraps a :class:`gymnasium.Env` to allow a modular transformation of the :meth:`step` and :meth:`reset` methods.\"\n    },\n    {\n      \"title\": \"Wrappers - Gymnasium Documentation - Farama Foundation\",\n      \"url\": \"https://gymnasium.farama.org/v0.27.0/api/experimental/wrappers/\",\n      \"description\": \"A passive environment checker wrapper that surrounds the step, reset and render functions to check they follow the gymnasium API. Initialises the wrapper\"\n    },\n    {\n      \"title\": \"Gym Wrappers | alexandervandekleut.github.io\",\n      \"url\": \"https://alexandervandekleut.github.io/gym-wrappers/\",\n      \"description\": \"The `gym.Wrapper` class inherits from the `gym.Env` class, which defines environments according to the OpenAI API for reinforcement learning. Implementing the `gym.Wrapper` class requires defining an `__init__` method that accepts the environment to be extended as a parameter. Wrapper): def __init__(self, env): super(). env = env def step(self, action): next_state, reward, done, info = self. We can modify specific aspects of the environment by using subclasses of `gym.Wrapper` that override how the environment processes observations, rewards, and action. 1. `gym.ObservationWrapper`: Used to modify the observations returned by the environment. To do this, override the `observation` method of the environment. This method accepts a single parameter (the observation to be modified) and returns the modified observation. 2. `gym.RewardWrapper`: Used to modify the rewards returned by the environment. To do this, override the `reward` method of the environment. 
This method accepts a single parameter (the reward to be modified) and returns the modified reward. To do this, override the `action` method of the environment.\"\n    },\n    {\n      \"title\": \"Mastering Gymnasium: RL Environments That Click | by Thinking Loop\",\n      \"url\": \"https://medium.com/@ThinkingLoop/mastering-gymnasium-rl-environments-that-click-02590ac1c533\",\n      \"description\": \"# Mastering Gymnasium: RL Environments That Click | by Thinking Loop | Medium. # Mastering Gymnasium: RL Environments That Click. ## A practical, code-first guide to building, wrapping, vectorizing, and debugging reinforcement learning environments with Gymnasium — without the usual confusion. _Master OpenAI Gymnasium for reinforcement learning: reset/step API, wrappers, vector envs, seeding, recording, and custom environments with practical code._. It’s usually not the algorithm. It’s the environment loop. Gymnasium is where that loop becomes predictable. And once the loop is predictable, RL becomes… almost fun. If you’ve ever fought mysterious `done` logic, broken wrappers, or training runs that “worked yesterday,” this guide is for you. Let’s make Gymnasium feel like a tool you control—not a roulette wheel. ## What Gymnasium actually gives you (and why it matters). Gymnasium is the modern, maintained standard interface for reinforcement learning environments, designed as a successor-style drop-in for the older Gym ecosystem. That matters because the environment API is not a small detail.\"\n    },\n    {\n      \"title\": \"Observation Wrappers - Gymnasium Documentation\",\n      \"url\": \"https://gymnasium.farama.org/v0.28.0/api/wrappers/observation_wrappers/\",\n      \"description\": \"# Observation Wrappers#. The transformation defined in that method must be reflected by the `env` observation space. Otherwise, you need to specify the new observation space of the wrapper by setting `self.observation_space` in the `__init__()` method of your wrapper. 
The function `f` should be defined on the observation space of the base environment, `env`, and should, ideally, return values in the same space. If the transformation you wish to apply to observations returns values in a *different* space, you should subclass `ObservationWrapper`, implement the transformation, and set the new observation space accordingly. make(\\\"CartPole-v1\\\")>>> env = TransformObservation(env, lambda obs: obs +0.1* np. >>> import gymnasium as gym>>> fromgymnasium.wrappers import TransformObservation>>> env = gym. make(\\\"CartPole-v1\\\")>>> env = TransformObservation(env, lambda obs:{'obs': obs, 'time': 0})>>> env. * **ValueError** – If any of the filter\\\\_keys are not included in the original env’s observation space. *class* gymnasium.wrappers.FlattenObservation(*env: Env*)#. >>> import gymnasium as gym>>> fromgymnasium.wrappers import FlattenObservation>>> env = gym.\"\n    },\n    {\n      \"title\": \"OpenAI Gym Step Function : r/reinforcementlearning - Reddit\",\n      \"url\": \"https://www.reddit.com/r/reinforcementlearning/comments/1dezvei/openai_gym_step_function/\",\n      \"description\": \"The environment gets reset, then the step function is called every timestep until it returns a terminal indicator. So you need to handle all of\"\n    },\n    {\n      \"title\": \"Understanding action & observation spaces in gym for custom ...\",\n      \"url\": \"https://stackoverflow.com/questions/76289764/understanding-action-observation-spaces-in-gym-for-custom-environments-and-age\",\n      \"description\": \"The oddity is in the use of gym's observation spaces. In many examples, the custom environment includes initializing a gym observation space.\"\n    }\n  ]\n}"
    }
  ],
  "details": {
    "results": [
      {
        "title": "Env - Gymnasium Documentation",
        "url": "https://gymnasium.farama.org/api/env/",
        "description": "* `step()` - Updates an environment with actions returning the next agent observation, the reward for taking that actions, if the environment has terminated or truncated due to the latest action and information from the environment about the step, i.e. metrics, debug info. :   * **observation** (*ObsType*) – An element of the environment’s `observation_space` as the next observation due to the agent actions. :   Resets the environment to an initial internal state, returning an initial observation and info. However, if the environment already has a PRNG and `seed=None` is passed, the PRNG will *not* be reset and the env’s `np_random_seed` will *not* be altered. The environment’s `metadata` render modes (env.metadata[“render\\_modes”]) should contain the possible ways to implement the render modes. Changed in version 0.25.0: The render function was changed to no longer accept parameters, rather these parameters should be specified in the environment initialised, i.e., `gymnasium.make(\"CartPole-v1\",render_mode=\"human\")`. When implementing an environment, the `Env.reset()` and `Env.step()` functions must be created to describe the dynamics of the environment."
      },
      {
        "title": "Core - Gym Documentation",
        "url": "https://www.gymlibrary.dev/api/core/",
        "description": "## gym.Env¶. gym.Env.step(*self*, *action: ActType*) → Tuple[ObsType, float, bool, bool, dict]¶. When end of episode is reached, you are responsible for calling `reset()` to reset this environment’s state. Accepts an action and returns either a tuple (observation, reward, terminated, truncated, info). :   * **observation** (*object*) – this will be an element of the environment’s `observation_space`. * **reward** (*float*) – The amount of reward returned as a result of taking the action. If the environment already has a random number generator and `reset()` is called with `seed=None`, the RNG should not be reset. However, if the environment already has a PRNG and `seed=None` is passed, the PRNG will *not* be reset. This will be an element of `observation_space` (typically a numpy array) and is analogous to the observation returned by `step()`. It is of datatype `Space` provided by Gym. For example, if the observation space is of type `Box` and the shape of the object is `(4,)`, this denotes a valid observation will be an array of 4 numbers."
      },
      {
        "title": "Wrappers - Gym Documentation",
        "url": "https://www.gymlibrary.dev/api/wrappers/",
        "description": "When using the AutoResetWrapper to collect rollouts, note that the when `self.env.step()` returns `done`, a new observation from after calling `self.env.reset()` is returned by `self.step()` alongside the terminal reward and done state from the previous episode . | `AutoResetWrapper` | `gym.Wrapper` | `env` | The wrapped environment will automatically reset when the done state is reached. | `FilterObservation` | `gym.ObservationWrapper` | `env`, `filter_keys=None` | If you have an environment that returns dictionaries as observations, but you would like to only keep a subset of the entries, you can use this wrapper. | `FlattenObservation` | `gym.ObservationWrapper` | `env` | Observation wrapper that flattens the observation |. | `ResizeObservation` | `gym.ObservationWrapper` | `env`, `shape` | This wrapper works on environments with image observations (or more generally observations of shape AxBxC) and resizes the observation to the shape given by the tuple `shape`. |  | `gym.ObservationWrapper` | `env`, `f` | This wrapper will apply `f` to observations |."
      },
      {
        "title": "Gymnasium/gymnasium/core.py at main - GitHub",
        "url": "https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/core.py",
        "description": "\"\"\"Wraps a :class:`gymnasium.Env` to allow a modular transformation of the :meth:`step` and :meth:`reset` methods."
      },
      {
        "title": "Wrappers - Gymnasium Documentation - Farama Foundation",
        "url": "https://gymnasium.farama.org/v0.27.0/api/experimental/wrappers/",
        "description": "A passive environment checker wrapper that surrounds the step, reset and render functions to check they follow the gymnasium API. Initialises the wrapper"
      },
      {
        "title": "Gym Wrappers | alexandervandekleut.github.io",
        "url": "https://alexandervandekleut.github.io/gym-wrappers/",
        "description": "The `gym.Wrapper` class inherits from the `gym.Env` class, which defines environments according to the OpenAI API for reinforcement learning. Implementing the `gym.Wrapper` class requires defining an `__init__` method that accepts the environment to be extended as a parameter. Wrapper): def __init__(self, env): super(). env = env def step(self, action): next_state, reward, done, info = self. We can modify specific aspects of the environment by using subclasses of `gym.Wrapper` that override how the environment processes observations, rewards, and action. 1. `gym.ObservationWrapper`: Used to modify the observations returned by the environment. To do this, override the `observation` method of the environment. This method accepts a single parameter (the observation to be modified) and returns the modified observation. 2. `gym.RewardWrapper`: Used to modify the rewards returned by the environment. To do this, override the `reward` method of the environment. This method accepts a single parameter (the reward to be modified) and returns the modified reward. To do this, override the `action` method of the environment."
      },
      {
        "title": "Mastering Gymnasium: RL Environments That Click | by Thinking Loop",
        "url": "https://medium.com/@ThinkingLoop/mastering-gymnasium-rl-environments-that-click-02590ac1c533",
        "description": "# Mastering Gymnasium: RL Environments That Click | by Thinking Loop | Medium. # Mastering Gymnasium: RL Environments That Click. ## A practical, code-first guide to building, wrapping, vectorizing, and debugging reinforcement learning environments with Gymnasium — without the usual confusion. _Master OpenAI Gymnasium for reinforcement learning: reset/step API, wrappers, vector envs, seeding, recording, and custom environments with practical code._. It’s usually not the algorithm. It’s the environment loop. Gymnasium is where that loop becomes predictable. And once the loop is predictable, RL becomes… almost fun. If you’ve ever fought mysterious `done` logic, broken wrappers, or training runs that “worked yesterday,” this guide is for you. Let’s make Gymnasium feel like a tool you control—not a roulette wheel. ## What Gymnasium actually gives you (and why it matters). Gymnasium is the modern, maintained standard interface for reinforcement learning environments, designed as a successor-style drop-in for the older Gym ecosystem. That matters because the environment API is not a small detail."
      },
      {
        "title": "Observation Wrappers - Gymnasium Documentation",
        "url": "https://gymnasium.farama.org/v0.28.0/api/wrappers/observation_wrappers/",
        "description": "# Observation Wrappers#. The transformation defined in that method must be reflected by the `env` observation space. Otherwise, you need to specify the new observation space of the wrapper by setting `self.observation_space` in the `__init__()` method of your wrapper. The function `f` should be defined on the observation space of the base environment, `env`, and should, ideally, return values in the same space. If the transformation you wish to apply to observations returns values in a *different* space, you should subclass `ObservationWrapper`, implement the transformation, and set the new observation space accordingly. make(\"CartPole-v1\")>>> env = TransformObservation(env, lambda obs: obs +0.1* np. >>> import gymnasium as gym>>> fromgymnasium.wrappers import TransformObservation>>> env = gym. make(\"CartPole-v1\")>>> env = TransformObservation(env, lambda obs:{'obs': obs, 'time': 0})>>> env. * **ValueError** – If any of the filter\\_keys are not included in the original env’s observation space. *class* gymnasium.wrappers.FlattenObservation(*env: Env*)#. >>> import gymnasium as gym>>> fromgymnasium.wrappers import FlattenObservation>>> env = gym."
      },
      {
        "title": "OpenAI Gym Step Function : r/reinforcementlearning - Reddit",
        "url": "https://www.reddit.com/r/reinforcementlearning/comments/1dezvei/openai_gym_step_function/",
        "description": "The environment gets reset, then the step function is called every timestep until it returns a terminal indicator. So you need to handle all of"
      },
      {
        "title": "Understanding action & observation spaces in gym for custom ...",
        "url": "https://stackoverflow.com/questions/76289764/understanding-action-observation-spaces-in-gym-for-custom-environments-and-age",
        "description": "The oddity is in the use of gym's observation spaces. In many examples, the custom environment includes initializing a gym observation space."
      }
    ]
  }
}

▼thought

Now I have a very comprehensive picture of Gymnasium. Let me compile this into a thorough, well-structured research report. Let me organize the information:

1. What is Gymnasium - definition, origin, relationship to OpenAI Gym
2. The Farama Foundation - the nonprofit behind it
3. Core API design - Env, spaces, step/reset, terminated/truncated
4. Built-in environment families
5. Key features - wrappers, vector environments, functional environments
6. Version history and milestones (v1.0, latest v1.3.0)
7. Ecosystem - related projects (PettingZoo, Minigrid, etc.)
8. Adoption and impact - download stats, NeurIPS 2025 acceptance
9. Migration from OpenAI Gym
10. Practical usage patterns

Let me write this up now.

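Gymnasium Deep-Dive Research Report

Positioning in one sentence

Gymnasium is the de facto standard API for reinforcement learning (RL): the maintained fork of OpenAI Gym, run by the nonprofit Farama Foundation, which gives single-agent RL environments a unified Python interface. As of May 2026, Gymnasium itself has more than 70 million cumulative downloads; together with the roughly 330 million downloads of its predecessor Gym, the ecosystem as a whole exceeds 400 million installs (source: Farama Foundation install statistics).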

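1. Origins and history: from OpenAI Gym to Gymnasium

1.1 The birth and stagnation of OpenAI Gym

OpenAI released the Gym library in 2016 to give RL research a standardized environment interface. After Deep Q-Network (DQN) showed in 2013 that a general neural-network algorithm could reach expert-level performance on Atari games [Mnih et al., 2013], the field grew explosively, but each research group implemented its own environment interface, which made results hard to reproduce and compare. Gym solved that problem.

OpenAI, however, gradually stopped maintaining Gym around 2021. Gym's release history effectively ends at two points that are still in circulation: v0.21 (the old API) and v0.26 (which introduced some breaking changes but never completed the transition).

1.2 The Farama Foundation takes over

In 2021, maintenance of Gym passed from OpenAI to an outside team, which went on to establish the Farama Foundation, a nonprofit dedicated to maintaining open-source RL infrastructure. The project was renamed Gymnasium and officially released as a standalone project in October 2022.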

Key timeline:

Date	Event
2016	OpenAI releases Gym
2021	OpenAI stops maintaining Gym; an outside team (later the Farama Foundation) takes over
2022.10	Gymnasium officially released as a standalone project
2024.07	Gymnasium white paper published (arXiv:2407.17032)
2024.10	Gymnasium v1.0.0 released, the first stable API
2025.11	White paper accepted at NeurIPS 2025 Datasets & Benchmarks
2026.04	Latest version, v1.3.0, released

The project describes itself as follows: "Gymnasium is a maintained fork of OpenAI's Gym library by its maintainers (OpenAI handed over maintenance a few years ago to an outside team), and is where future maintenance will occur going forward." (GitHub)

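2. Farama Foundation: the nonprofit behind the project

The Farama Foundation's mission is to maintain the world's open-source RL tools in a neutral, nonprofit way. It currently maintains more than 15 projects, which together form the infrastructure layer of RL research.

2.1 Core project matrix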

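Project	Role	Cumulative downloads
Gymnasium	Standard API for single-agent RL environments	70M+
PettingZoo	API for multi-agent RL (MARL) environments	5.2M+
Arcade Learning Environment (ALE)	Atari 2600 emulator	9.19M+
Minigrid	Grid-world environments	1.07M+
Gymnasium-Robotics	Robotics simulation environments	540K+
Shimmy	Cross-API compatibility layer	4.78M+
MO-Gymnasium	Multi-objective RL	260K+
HighwayEnv	Autonomous-driving simulation	230K+
ViZDoom	Doom game environments	350K+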

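Farama Foundation projects page

2.2 Governance model

Farama relies on volunteer maintainers plus community donations. Core maintainers include Mark Towers (University of Southampton), Jordan Terry, and Ariel Kwiatkowski (Meta AI / FAIR), among others. The project accepts donations through GitHub Sponsors.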

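3. Core API design

Gymnasium's core philosophy is that an environment is a simple Python class that interacts with the agent through two methods, reset() and step().

3.1 Minimal runnable example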

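import gymnasium as gym

# render_mode is fixed at construction time; "human" opens a window
env = gym.make("CartPole-v1", render_mode="human")
# Seed once on the first reset to make the run reproducible
observation, info = env.reset(seed=42)

for _ in range(1000):
    action = env.action_space.sample()  # random policy
    observation, reward, terminated, truncated, info = env.step(action)

    # Start a new episode when the current one ends for either reason
    if terminated or truncated:
        observation, info = env.reset()

env.close()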

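3.2 The five-tuple returned by `step()`

This is the most fundamental difference between Gymnasium and the old Gym API:

Return value	Type	Meaning
`observation`	`ObsType`	The environment's new observation
`reward`	`float`	Immediate reward
`terminated`	`bool`	The task ended naturally (success or failure)
`truncated`	`bool`	The episode was cut off by an external constraint (timeout or step limit)
`info`	`dict`	Auxiliary diagnostic information

The terminated vs truncated distinction is one of Gymnasium's most important design decisions. The old Gym had a single done flag, which could not distinguish "the agent really finished the task" from "time simply ran out". The difference is critical for bootstrapping in value-based methods: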

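# Correct bootstrap logic
if terminated:
    next_value = 0                   # natural end: no future value
elif truncated:
    next_value = value_fn(next_obs)  # cut off by a time limit: estimate the remaining value
else:
    next_value = value_fn(next_obs)  # ordinary step: bootstrap as usual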

Gymnasium migration guide
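To make the bootstrap rule concrete, here is a minimal sketch of a one-step TD target in plain Python. The function name `td_target` and the constant value function are hypothetical and chosen only for illustration; the point is that only `terminated` zeroes out the bootstrap term, while `truncated` does not.

```python
from typing import Any, Callable


def td_target(
    reward: float,
    next_obs: Any,
    terminated: bool,
    truncated: bool,
    value_fn: Callable[[Any], float],
    gamma: float = 0.99,
) -> float:
    """One-step TD target: r + gamma * V(s'), with V(s') = 0 only on true termination."""
    # `truncated` is deliberately ignored here: a time-limit cut-off says nothing
    # about the value of the next state, so we still bootstrap from it.
    bootstrap = 0.0 if terminated else value_fn(next_obs)
    return reward + gamma * bootstrap


# Hypothetical value function that estimates 10.0 for every state.
def constant_value(obs: Any) -> float:
    return 10.0


target_terminated = td_target(1.0, None, True, False, constant_value, gamma=0.5)
target_truncated = td_target(1.0, None, False, True, constant_value, gamma=0.5)
target_ongoing = td_target(1.0, None, False, False, constant_value, gamma=0.5)
```

With the old single `done` flag, the truncated case would have been treated like the terminated one, which biases value estimates low in time-limited tasks.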

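3.3 The space system (Spaces)

Gymnasium provides a rich set of space types for describing observations and actions:

Space type	Description	Example
`Box`	Continuous tensor space	`Box(low=-1, high=1, shape=(4,))`
`Discrete`	Discrete integer space	`Discrete(2)` (left/right)
`MultiDiscrete`	Multi-dimensional discrete space	`MultiDiscrete([3, 2])`
`MultiBinary`	Multi-dimensional binary space	`MultiBinary(5)`
`Text`	String space	`Text(max_length=10)`
`Dict`	Composite dictionary space	`Dict({"pos": Box(...), "vel": Box(...)})`
`Tuple`	Composite tuple space	`Tuple((Discrete(2), Box(...)))`
`Graph`	Graph-structured space	Nodes + edges + global attributes
`Sequence`	Variable-length sequence space	`Sequence(Box(...))`

v1.1 also added probability-masked sampling (space.sample(probability=...)), which lets a policy's output probability distribution be used directly for sampling (Gymnasium v1.1 Release Notes).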

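4. Built-in environment families

Gymnasium covers five families of environments, four built in and one (Atari) now maintained separately, spanning everything from introductory teaching to frontier research:

4.1 Classic Control

Five environments: Acrobot, CartPole, Mountain Car, Continuous Mountain Car, and Pendulum. They are based on real physics problems with small state and action spaces, which makes them good first tests for RL algorithms. All of them have randomized initial states (Classic Control documentation).

4.2 Box2D

Toy games built on a 2D physics engine: Lunar Lander, Car Racing, and Bipedal Walker. They render with PyGame and are among the most popular benchmarks in the community. v1.2.3 moved the Box2D dependency from box2d-py to box2d (v1.2.3 Release Notes).

4.3 Toy Text

Minimal discrete environments: Blackjack, Taxi, Cliff Walking, and Frozen Lake. Their state and action spaces are tiny, which makes them well suited to debugging RL algorithm implementations. v1.3.0 added Taxi-v4, which fixes an implementation bug in is_rainy (v1.3.0 Release Notes).

4.4 MuJoCo

Continuous-control tasks built on the MuJoCo physics engine: HalfCheetah, Hopper, Humanoid, Ant, Swimmer, Walker2d, Pusher, Reacher, and others. These are the most widely used continuous-control benchmarks in deep RL research. v1.0 introduced the MuJoCo v5 environments, which support recent MuJoCo versions. v1.1 added RGBD rendering, which outputs RGB and depth images together (MuJoCo documentation).

4.5 Atari (now split out as ALE)

The Atari 2600 game environments (Pong, Breakout, Space Invaders, and 60+ games in total) were removed from the Gymnasium core in v1.0 and moved to the standalone Arcade Learning Environment (ALE) project. This completed a multi-year upstreaming effort to unify the ALE forks that DeepMind and OpenAI had each maintained. Users now need to import ale_py to register the Atari environments (Gymnasium v1.0 blog post).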

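5. Key features in depth

5.1 The wrapper system

Wrappers are Gymnasium's most powerful extension mechanism. By nesting wrappers, you can transform observations, rewards, actions, and more without modifying the original environment code: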

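gymnasium.wrappers
├── ObservationWrapper     # base class for transforming observations (exported as gymnasium.ObservationWrapper)
├── RewardWrapper          # base class for transforming rewards (exported as gymnasium.RewardWrapper)
├── ActionWrapper          # base class for transforming actions (exported as gymnasium.ActionWrapper)
├── RecordVideo            # record videos
├── RecordEpisodeStatistics # record per-episode statistics
├── NormalizeObservation   # normalize observations
├── NormalizeReward        # normalize rewards
├── TimeLimit              # episode time limit
├── FrameStackObservation  # frame stacking
├── FlattenObservation     # flatten observations
├── ResizeObservation      # resize image observations
├── ClipReward             # clip rewards
├── TransformObservation   # custom observation transform
├── TransformReward        # custom reward transform
├── TransformAction        # custom action transform
├── StickyAction           # sticky actions (randomly repeats the previous action)
├── AddWhiteNoise          # add white noise (v1.2.0)
├── ObstructView           # obstruct the view (v1.2.0)
├── DiscretizeObservation  # discretize observations (v1.2.1)
├── DiscretizeAction       # discretize actions (v1.2.1)
├── RepeatAction           # repeat actions (v1.3.0)
└── DelayObservation       # delay observations (v1.0)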

Major change in v1.0: wrappers for Env and VectorEnv were split into separate namespaces (gymnasium.wrappers vs gymnasium.wrappers.vector), because the two classes no longer share an inheritance relationship.

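5.2 Vector Environments

v1.0 thoroughly restructured vectorized environments:

Env and VectorEnv are now two independent base classes (in older versions VectorEnv inherited from Env, which was semantically incorrect)
A new gymnasium.make_vec() function creates multiple parallel environments in one line
Three vectorization modes are supported: sync (synchronous), async (asynchronous), and vector_entry_point (environment-defined)
v1.1 added three autoreset modes (next-step, same-step, and disabled) for compatibility with older behavior

# Create 3 parallel environments in one line
envs = gym.make_vec("CartPole-v1", num_envs=3, vectorization_mode="async")

Gymnasium v1.0 blog post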

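5.3 Functional Environments

v1.0 brought FuncEnv, the functional environment paradigm, into the stable API (it was previously experimental). It decomposes an environment's dynamics into pure functions (signatures simplified here; the library's methods also take a random key and optional parameters):

transition(state, action) → state transition
observation(state) → observation
reward(state, action, next_state) → reward
terminal(state) → termination check

This design makes environments easy to implement in JAX, where jit compilation and vmap vectorization can yield large performance gains. It is particularly useful for planning algorithms such as MCTS and model predictive control.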
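The decomposition can be illustrated without any library at all. The sketch below is a plain-Python toy, not Gymnasium's `FuncEnv` class: a hypothetical one-dimensional walk in which the agent moves left or right until it reaches `GOAL`. It uses the simplified signatures listed above, and all names are illustrative.

```python
from typing import List, Sequence, Tuple

GOAL = 3  # hypothetical target position


def transition(state: int, action: int) -> int:
    """Pure state transition: action 1 moves right, anything else moves left."""
    return state + (1 if action == 1 else -1)


def observation(state: int) -> int:
    """The state is fully observable in this toy example."""
    return state


def reward(state: int, action: int, next_state: int) -> float:
    """Reward 1.0 on arriving at the goal, 0.0 otherwise."""
    return 1.0 if next_state == GOAL else 0.0


def terminal(state: int) -> bool:
    return state == GOAL


def rollout(initial_state: int, actions: Sequence[int]) -> Tuple[List[int], float]:
    """Compose the pure functions into an episode; no hidden mutable state."""
    state, total = initial_state, 0.0
    trajectory = [observation(state)]
    for action in actions:
        if terminal(state):
            break
        next_state = transition(state, action)
        total += reward(state, action, next_state)
        state = next_state
        trajectory.append(observation(state))
    return trajectory, total


trajectory, total_reward = rollout(0, [1, 1, 0, 1, 1, 1])
```

Because each function depends only on its arguments, a planner can call `transition` on hypothetical states without disturbing a running episode, which is the property that makes this style attractive for MCTS and similar methods.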

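5.4 Cross-framework interoperability

v1.0 and v1.1 substantially improved support for working across deep learning frameworks:

Conversion wrappers such as JaxToTorch, JaxToNumpy, and NumpyToTorch
Full DLPack API support (zero-copy tensor conversion)
v1.2.0 added ArrayConversion, a general conversion wrapper that supports all Array API compatible frameworks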

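6. Version history and latest developments

6.1 v1.0.0 (2024-10-08): milestone release

More than 200 PRs merged. Core changes:

Env / VectorEnv separation
make_vec() function
FuncEnv functional environments
MuJoCo v5 environments
Plugin system removed (external libraries must be imported explicitly)
ALE split out into a separate project
Commitment: this is the last breaking change for the foreseeable future

6.2 v1.1.0 (2025-02-26)

Probability-masked sampling
RGBD rendering
Autoreset mode support
Full DLPack support

6.3 v1.2.0 (2025-06-27)

MuJoCo v2/v3 environments moved to Gymnasium-Robotics
Python 3.13 support (3.8/3.9 dropped)
AddWhiteNoise and ObstructView wrappers
ArrayConversion general conversion wrapper

6.4 v1.3.0 (2026-04-22): current latest

Taxi-v4 environment
RepeatAction wrapper
pygame → pygame-ce migration (unlocks Python 3.14 compatibility)
Memory-leak fix for vector environments
10 new third-party environments (including a cybersecurity category)

Gymnasium Release Notes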

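7. The ecosystem at a glance

7.1 Training-library compatibility

Almost every mainstream RL training library supports the Gymnasium API:

Library	Type	Notes
CleanRL	Single-file implementations	Officially recommended for beginners
Stable-Baselines3	Classic algorithm collection	PPO, A2C, SAC, TD3, etc.
RLlib	Distributed	Part of the Ray ecosystem
Tianshou	PyTorch	Maintained at Tsinghua University
SB3-Contrib	SB3 extensions	TRPO, QR-DQN, etc.
AgileRL	Evolutionary RL	Combines RL with evolutionary algorithms

7.2 Third-party environment ecosystem

Gymnasium's list of third-party environments keeps growing. Domains covered include:

Robotics: Gymnasium-Robotics, panda-gym, Metaworld, Robo-gym
Autonomous driving: HighwayEnv, BlueSky-Gym (air traffic)
Games: ViZDoom (Doom), Craftium (Minecraft-like), Procgen2
Financial trading: AnyTrading
Cybersecurity (category added in v1.3.0)
Multi-objective RL: MO-Gymnasium
Safe RL: Safety-Gymnasium, Robust-Gymnasium (ICLR 2025)

Third-party environments list

7.3 Multi-agent extension: PettingZoo

PettingZoo is the multi-agent counterpart of Gymnasium and supports two API styles:

AEC API (Agent Environment Cycle): sequential decision-making
Parallel API: simultaneous, parallel decision-making

PettingZoo has more than 5.2 million cumulative downloads and supports Python 3.10-3.14.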

8. Migrating from OpenAI Gym

If you are still using the old OpenAI Gym (import gym), these are the core changes to handle when migrating:

Component	Old Gym v0.21	Gymnasium
Import	`import gym`	`import gymnasium as gym`
Reset	`obs = env.reset()`	`obs, info = env.reset()`
Seeding	`env.seed(42)`	`env.reset(seed=42)`
Step	`obs, reward, done, info = env.step(a)`	`obs, reward, terminated, truncated, info = env.step(a)`
Rendering	`env.render(mode="human")`	`gym.make(id, render_mode="human")`
Bootstrap	`(1-done) * gamma * next_val`	`(1-terminated) * gamma * next_val`

Old environments that cannot be upgraded can be wrapped with the Shimmy compatibility layer.
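To show what the step-signature change means for calling code, here is a hand-rolled sketch that converts an old four-tuple into the new five-tuple. It is not Shimmy's implementation. It assumes the old environment was wrapped in Gym's `TimeLimit`, which recorded time-limit cut-offs under the `"TimeLimit.truncated"` key of `info`; environments that never set that key will have every `done` reported as `terminated`. The function name `to_five_tuple` is hypothetical.

```python
from typing import Any, Dict, Tuple

OldStep = Tuple[Any, float, bool, Dict[str, Any]]
NewStep = Tuple[Any, float, bool, bool, Dict[str, Any]]


def to_five_tuple(old_step_result: OldStep) -> NewStep:
    """Split the legacy `done` flag into `terminated` and `truncated`."""
    obs, reward, done, info = old_step_result
    # Assumption: a time-limit cut-off is flagged in info by the old TimeLimit wrapper.
    truncated = bool(info.get("TimeLimit.truncated", False))
    # Anything that ended the episode and was not a cut-off counts as termination.
    terminated = bool(done) and not truncated
    return obs, reward, terminated, truncated, info


# Fabricated legacy step results, for illustration only.
ongoing = to_five_tuple(("s1", 0.0, False, {}))
failed = to_five_tuple(("s2", -1.0, True, {}))
timed_out = to_five_tuple(("s3", 0.0, True, {"TimeLimit.truncated": True}))
```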

9. Academic impact and citation

The Gymnasium white paper, "Gymnasium: A Standard Interface for Reinforcement Learning Environments", has been accepted at NeurIPS 2025 Datasets & Benchmarks (NeurIPS 2025 Poster). Citation format:

@article{towers2024gymnasium,
  title={Gymnasium: A Standard Interface for Reinforcement Learning Environments},
  author={Towers, Mark and Kwiatkowski, Ariel and Terry, Jordan and Balis, John U and
          De Cola, Gianluca and Deleu, Tristan and Goul{\~a}o, Manuel and
          Kallinteris, Andreas and Krimmel, Markus and KG, Arjun and others},
  journal={arXiv preprint arXiv:2407.17032},
  year={2024}
}

The white paper reports that, as of April 2025, Gymnasium had passed 1 million downloads per month and 18 million cumulative installs (a figure that had grown to 70 million+ by May 2026).

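10. Summary and outlook

Core value

Gymnasium addresses three fundamental problems in RL research:

Standardization: a unified reset/step interface lets algorithms and environments be combined freely
Reproducibility: strict versioning (the -v0, -v1 suffixes) and a seeding mechanism
Ecosystem effect: as the "API standard", it connects the environment-developer and algorithm-developer communities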

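Future directions

According to the v1.0 release blog post, Gymnasium's core API has entered a period of long-term stability. Likely areas of future focus are:

Wider adoption of functional (JAX-native) environments
More third-party environments for vertical domains
Keeping up with Python releases (3.14 is already supported)
Interface adaptation for LLM-based agents (the indexed feed already includes papers such as "Healthcare AI GYM for Medical Agents", which trains medical AI in a gymnasium-compatible environment; arXiv:2605.02943)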

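Additional notes

Basis: this report draws on the official Gymnasium documentation, the GitHub repository, the Farama Foundation website, the arXiv white paper, and the release notes. Every key claim can be checked through the inline links.
Points still to verify: Gymnasium's specific financial data (donation amounts) and each training library's progress on v1.3.0 compatibility were not verified individually; consult each library's release notes.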


Do a deep dive on Gymnasium

▼thought

▼✓Current time, time_zone: "Asia/Shanghai"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"unix_ms\": 1778230255171,\n  \"utc_iso\": \"2026-05-08T08:50:55.171Z\",\n  \"server_timezone\": \"UTC\",\n  \"server_local_calendar\": {\n    \"year\": 2026,\n    \"month\": 5,\n    \"day\": 8,\n    \"hour\": 8,\n    \"minute\": 50,\n    \"second\": 55\n  },\n  \"server_local_display_zh\": \"2026年5月8日星期五 08:50:55\",\n  \"requested_zone\": {\n    \"time_zone\": \"Asia/Shanghai\",\n    \"local_calendar\": {\n      \"year\": 2026,\n      \"month\": 5,\n      \"day\": 8,\n      \"hour\": 16,\n      \"minute\": 50,\n      \"second\": 55\n    },\n    \"local_display_zh\": \"2026年5月8日星期五 16:50:55\"\n  }\n}"
    }
  ],
  "details": {
    "unix_ms": 1778230255171,
    "utc_iso": "2026-05-08T08:50:55.171Z",
    "server_timezone": "UTC",
    "server_local_calendar": {
      "year": 2026,
      "month": 5,
      "day": 8,
      "hour": 8,
      "minute": 50,
      "second": 55
    },
    "server_local_display_zh": "2026年5月8日星期五 08:50:55",
    "requested_zone": {
      "time_zone": "Asia/Shanghai",
      "local_calendar": {
        "year": 2026,
        "month": 5,
        "day": 8,
        "hour": 16,
        "minute": 50,
        "second": 55
      },
      "local_display_zh": "2026年5月8日星期五 16:50:55"
    }
  }
}

▼✓Feed full-text search, keywords: "Gymnasium Farama reinforcement learning", count: 30

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"items\": [\n    {\n      \"id\": \"https://www.machinebrief.com/news/revolutionizing-language-models-a-new-era-with-adaptive-powe-ais5\",\n      \"url\": \"https://www.machinebrief.com/news/revolutionizing-language-models-a-new-era-with-adaptive-powe-ais5\",\n      \"source_url\": \"https://www.machinebrief.com/rss.xml\",\n      \"title\": \"Revolutionizing Language Models: A New Era with Adaptive Power-Mean Optimization\",\n      \"author\": [\n        \"Felix Navarro\"\n      ],\n      \"summary\": \"Adaptive Power-Mean Policy Optimization (APMPO) transforms reinforcement learning for Large Language Models. By embracing dynamic policy strategies, it outperforms traditional static methods, enhancing reasoning capabilities dramatically.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T14:08:40.000Z\",\n      \"fetched_at\": \"2026-05-07T16:40:31.825Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04920v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04920\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization\",\n      \"author\": [\n        \"Xiyan Fu\",\n        \"Wei Liu\"\n      ],\n      \"summary\": \"arXiv:2605.04920v1 Announce Type: cross Abstract: Compositional generalization refers to correctly interpret novel combinations of known primitives, which remains a major challenge. Existing approaches often rely on supervised fine-tuning, which encourages models to imitate target outputs. This token-level training paradigm fails to capture the global compositional structure required for generalizing to unseen combinations. In this work, we investigate whether compositional generalization can instead be improved through outcome-level reinforcement learning. We adopt Group Relative Policy Optimization to optimize models based on feedback on their final outputs. 
Within this framework, we explore both a simple binary outcome reward and a composite reward that provides additional composition feedback. Experiments on multiple compositional benchmarks show that reinforcement learning improves compositional generalization compared to supervised fine-tuning. Further analysis reveals that supervised models tend to overfit frequent training compositions, whereas reinforcement learning improves compositional generalization by reshaping the output distribution, particularly for more complex composition types.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:22.293Z\"\n    },\n    {\n      \"id\": \"https://www.machinebrief.com/news/reinforcement-learning-gets-a-safety-upgrade-with-alam-67sn\",\n      \"url\": \"https://www.machinebrief.com/news/reinforcement-learning-gets-a-safety-upgrade-with-alam-67sn\",\n      \"source_url\": \"https://www.machinebrief.com/rss.xml\",\n      \"title\": \"Reinforcement Learning Gets a Safety Upgrade with ALaM\",\n      \"author\": [\n        \"Julian Voss\"\n      ],\n      \"summary\": \"The ALaM framework tackles safety in reinforcement learning by optimizing state-wise constraints. 
It's a big deal for stable training and risk management.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-05T15:53:58.000Z\",\n      \"fetched_at\": \"2026-05-05T16:10:22.708Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.02943v1\",\n      \"url\": \"https://arxiv.org/abs/2605.02943\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Healthcare AI GYM for Medical Agents\",\n      \"author\": [\n        \"Minbyul Jeong\"\n      ],\n      \"summary\": \"arXiv:2605.02943v1 Announce Type: new Abstract: Clinical reasoning demands multi-step interactions -- gathering patient history, ordering tests, interpreting results, and making safe treatment decisions -- yet a unified training environment provides the breadth of clinical domains and specialized tools to train generalizable medical AI agents through reinforcement learning remains elusive. We present a comprehensive empirical study of multi-turn agentic RL for medical AI, built on \\\\gym{}, a gymnasium-compatible environment spanning 10 clinical domains with 3.6K+ tasks, 135 domain-specific tools, and a knowledge base of 828K medical passages. Our analysis reveals that agentic multi-turn structure degrades into verbose single-turn monologues, characterized by monotonic length explosion and a simultaneous erosion of tool-use frequency. We characterize how this collapse, alongside distillation instability, stems from the misalignment of sparse terminal rewards with sequential clinical trajectories. We find that vanilla GRPO achieves strong final accuracy on some benchmarks but suffers from training instability, evidenced by significant oscillations in response length and prolonged convergence periods. 
To improve training efficiency and stability, we propose Turn-level Truncated On-Policy Distillation (TT-OPD), a self-distillation framework where a gradient-free EMA teacher leverages outcome-privileged information to provide dense, outcome-aware KL regularization at every conversation turn. TT-OPD achieves the best performance on 10 of 18 benchmarks with an average +3.9 pp improvement over the non-RL baseline with faster early convergence, controlled response length, and sustained multi-turn tool use.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-06T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-06T16:00:29.193Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2412.08893v3\",\n      \"url\": \"https://arxiv.org/abs/2412.08893\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Optimal Control with Natural Images: Efficient Reinforcement Learning using Overcomplete Sparse Codes\",\n      \"author\": [\n        \"Peter N. Loxley\"\n      ],\n      \"summary\": \"arXiv:2412.08893v3 Announce Type: replace Abstract: Optimal control and sequential decision making are widely used in many complex tasks. Optimal control over a sequence of natural images is a first step towards understanding the role of vision in control. Here, we formalize this problem as a reinforcement learning task, and derive general conditions under which an image includes enough information to implement an optimal policy. Reinforcement learning is shown to provide a computationally efficient method for finding optimal policies when natural images are encoded into \\\"efficient\\\" image representations. This is demonstrated by introducing a new reinforcement learning benchmark that easily scales to large numbers of states and long horizons. 
In particular, by representing each image as an overcomplete sparse code, we are able to efficiently solve an optimal control task that is orders of magnitude larger than those tasks solvable using complete codes. Theoretical justification for this behaviour is provided. This work also demonstrates that deep learning is not necessary for efficient optimal control with natural images.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.03327v1\",\n      \"url\": \"https://arxiv.org/abs/2605.03327\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment\",\n      \"author\": [\n        \"Hongbo Jin\",\n        \"Rongpeng Zhu\",\n        \"Zhongjing Du\",\n        \"Xu Jiang\",\n        \"Jingqi Tian\",\n        \"Qiaoman Zhang\",\n        \"Jiayu Ding\"\n      ],\n      \"summary\": \"arXiv:2605.03327v1 Announce Type: new Abstract: Reinforcement learning is crucial for aligning large language models to perform complex reasoning tasks. However, current algorithms such as Group Relative Policy Optimization suffer from coarse grained, sequence level credit assignment, which severely struggles to isolate pivotal reasoning steps within long Chain of Thought generations. Furthermore, the standard unbounded Kullback Leibler divergence penalty induces severe gradient instability and mode seeking conservatism, ultimately stifling the discovery of novel reasoning trajectories. 
To overcome these limitations, we introduce Distribution Guided Policy Optimization, a novel critic free reinforcement learning framework that reinterprets distribution deviation as a guiding signal rather than a rigid penalty.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-06T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-06T16:00:29.193Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04066v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04066\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning\",\n      \"author\": [\n        \"Yiming Huang\",\n        \"Zhenbo Shi\",\n        \"Shuzheng Gao\",\n        \"Cuiyun Gao\",\n        \"Peiyi Han\",\n        \"Chuanyi Liu\"\n      ],\n      \"summary\": \"arXiv:2605.04066v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is an essential paradigm that enhances the reasoning capabilities of Large Language Models (LLMs). However, existing methods typically rely on static policy optimization schemes that misalign with the model's evolving reasoning capabilities. To address this issue, we propose Adaptive Power-Mean Policy Optimization (APMPO), which comprises two main innovations: Power-Mean Policy Optimization (PMPO) and Feedback-Adaptive Clipping (FAC). Specifically, PMPO introduces a generalized power-mean objective. This enables the model to adaptively transition from the signal-amplifying behavior of the arithmetic mean to the consistency-enforcing behavior of the geometric mean. FAC adaptively adjusts clipping bounds based on real-time reward statistics to overcome the limitations of static mechanisms. Capitalizing on these innovations, APMPO improves learning dynamics and reasoning performance. 
Extensive experiments on nine datasets across three reasoning tasks showcase the superiority of APMPO over state-of-the-art RLVR-based baselines. For instance, APMPO boosts the average Pass@1 score on mathematical reasoning benchmarks by 3.0 points compared to GRPO when using Qwen2.5-3B-Instruct.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:22.293Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04065v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04065\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs\",\n      \"author\": [\n        \"Yiming Huang\",\n        \"Zhenbo Shi\",\n        \"Xin-Cheng Wen\",\n        \"Jichuan Zeng\",\n        \"Cuiyun Gao\",\n        \"Peiyi Han\",\n        \"Chuanyi Liu\"\n      ],\n      \"summary\": \"arXiv:2605.04065v1 Announce Type: new Abstract: Unsupervised reinforcement learning (RL) has emerged as a promising paradigm for enabling self-improvement in large language models (LLMs). However, existing unsupervised RL-based methods often lack the capacity to adapt to the model's evolving reasoning capabilities during training. Therefore, these methods can misdirect policy optimization in the absence of ground-truth supervision. To address this issue, we introduce FREIA, a novel RL-based algorithm built on two key innovations: (1) Free Energy-Driven Reward (FER) adapts rewards to balance consensus and exploration based on the Free Energy Principle. (2) Adaptive Advantage Shaping (AAS) adaptively adjusts learning signals based on the statistical characteristics of sampled rewards. Empirical evaluations on nine datasets across three reasoning tasks showcase that FREIA outperforms other unsupervised RL-based baselines. 
Notably, in mathematical reasoning tasks, FREIA surpasses other methods by an average of 0.5 to 3.5 points in Pass@1 using the DeepSeek-R1-Distill-Qwen-1.5B model.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:22.293Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2506.07548v2\",\n      \"url\": \"https://arxiv.org/abs/2506.07548\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"Overcoming Environmental Meta-Stationarity in MARL via Adaptive Curriculum and Counterfactual Group Advantage\",\n      \"author\": [\n        \"Weiqiang Jin\",\n        \"Yang Liu\",\n        \"Shixiang Tang\",\n        \"Jinhu Qi\",\n        \"Wentao Zhang\",\n        \"Junli Wang\",\n        \"Biao Zhao\",\n        \"Hongyang Du\"\n      ],\n      \"summary\": \"arXiv:2506.07548v2 Announce Type: replace Abstract: Multi-agent reinforcement learning (MARL) has reached competitive performance on cooperative tasks against scripted adversaries, yet most methods train agents at a single fixed difficulty throughout the entire run. We term this static-difficulty regime environmental meta-stationarity and show that it caps policy generalization and steers learning toward shallow local optima. To break this regime, we propose CL-MARL, a dynamic curriculum learning framework that adapts opponent strength online from win-rate signals, advancing or regressing the task as agents master it. Its scheduler, FlexDiff, fuses momentum-based trend estimation with sliding-window dual-curve monitoring of training and evaluation returns, yielding stable difficulty transitions without manual tuning. 
Because a moving curriculum amplifies non-stationarity and sparsifies global rewards, we introduce the Counterfactual Group Relative Policy Advantage (CGRPA), which extends GRPO-style group-relative optimization with counterfactual baselines to disentangle each agent's contribution under shifting team dynamics. On the StarCraft Multi-Agent Challenge (SMAC), CL-MARL attains a 40% mean win rate on the super-hard maps with an average episode return of 17.85, exceeding the QMIX, OW-QMIX, DER, EMC, and MARR baselines by +2.94 on average, while reaching its peak win rate roughly 1.28× faster on 8m_vs_9m and 1.42× faster on 3s5z_vs_3s6z than the strongest baseline. The implementation is publicly available at https://github.com/NICE-HKU/CL2MARL-SMAC.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:36:24.058Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04266v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04266\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Explaining and Preventing Alignment Collapse in Iterative RLHF\",\n      \"author\": [\n        \"Etienne Gauthier\",\n        \"Francis Bach\",\n        \"Michael I. Jordan\"\n      ],\n      \"summary\": \"arXiv:2605.04266v1 Announce Type: new Abstract: Reinforcement learning from human feedback (RLHF) typically assumes a static or non-strategic reward model (RM). In iterative deployment, however, the policy generates the data on which the RM is retrained, creating a feedback loop. Building on the Stackelberg game formulation of this interaction, we derive an analytical decomposition of the policy's true optimization gradient into a standard policy gradient and a parameter-steering term that captures the policy's influence on the RM's future parameters. 
We show that standard iterative RLHF, which drops this steering term entirely, suffers from alignment collapse: the policy systematically exploits the RM's blind spots, producing low-quality, high-reward outputs whose feedback reinforces the very errors it exploits. To mitigate this, we propose foresighted policy optimization (FPO), a mechanism-design intervention that restores the missing steering term by regularizing the policy's parameter-steering effect on RM updates. We instantiate FPO via a scalable first-order approximation and demonstrate that it prevents alignment collapse on both controlled environments and an LLM alignment pipeline using Llama-3.2-1B.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2602.14872v2\",\n      \"url\": \"https://arxiv.org/abs/2602.14872\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"The Implicit Curriculum: Learning Dynamics in RL with Verifiable Rewards\",\n      \"author\": [\n        \"Yu Huang\",\n        \"Zixin Wen\",\n        \"Yuejie Chi\",\n        \"Yuting Wei\",\n        \"Aarti Singh\",\n        \"Yingbin Liang\",\n        \"Yuxin Chen\"\n      ],\n      \"summary\": \"arXiv:2602.14872v2 Announce Type: replace Abstract: Reinforcement learning with verifiable rewards (RLVR) has been a main driver of recent breakthroughs in large reasoning models. Yet it remains a mystery how rewards based solely on final outcomes can help overcome the long-horizon barrier to extended reasoning. To understand this, we develop a theory of the training dynamics of RLVR for transformers on compositional reasoning tasks. 
Our theory shows that mixed-difficulty training naturally follows an implicit curriculum: without any explicit schedule, easier problems become learnable first and shape the frontier for harder ones, creating a learning progression from easy to hard during optimization. The effectiveness of this curriculum is governed by the smoothness of the difficulty spectrum. When the spectrum is smooth, training dynamics enters a well-behaved relay regime, in which persistent gradient signals on easier problems make slightly harder ones tractable and keep training at the edge of competence. When the spectrum contains abrupt discontinuities, training undergoes grokking-type phase transitions with prolonged plateaus before progress recurs. As a technical contribution, our analysis develops and adapts techniques from Fourier analysis on finite groups to our setting. We validate the predicted mechanisms empirically via synthetic experiments.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2601.07389v2\",\n      \"url\": \"https://arxiv.org/abs/2601.07389\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training\",\n      \"author\": [\n        \"Xueyan Niu\",\n        \"Bo Bai\",\n        \"Wei Han\",\n        \"Weixi Zhang\"\n      ],\n      \"summary\": \"arXiv:2601.07389v2 Announce Type: replace Abstract: Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). These two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs and expert responses, while RL maximizes reward signals derived from human preferences or rule-based verifiers. 
Modern reasoning models have widely adopted the practice of alternating SFT and RL training. However, there is no theoretical account of whether they can be decoupled. We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: RL increases SFT loss under both distributional (KL-based) and landscape (PL-based) analyses; and (2) RL-then-SFT coupling: SFT lowers the reward achieved by RL under analogous conditions. Under the PL condition, we further derive the optimal RL duration that balances reward improvement against SFT degradation, identify the non-decoupling threshold governing when RL can improve SFT, and bound the gradient misalignment via spectral concentration. Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated without loss of prior performance in the post-training pipeline.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2602.19837v3\",\n      \"url\": \"https://arxiv.org/abs/2602.19837\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Meta-Learning and Meta-Reinforcement Learning -- Tracing the Path towards DeepMind's Adaptive Agent\",\n      \"author\": [\n        \"Björn Hoppmann\",\n        \"Christoph Scholz\"\n      ],\n      \"summary\": \"arXiv:2602.19837v3 Announce Type: replace-cross Abstract: Humans are highly effective at utilizing prior knowledge to adapt to novel tasks, a capability that standard machine learning models struggle to replicate due to their reliance on task-specific training. Meta-learning overcomes this limitation by allowing models to acquire transferable knowledge from various tasks, enabling rapid adaptation to new challenges with minimal data. 
This survey provides a rigorous, task-based formalization of meta-learning and meta-reinforcement learning and uses that paradigm to chronicle the landmark algorithms that paved the way for DeepMind's Adaptive Agent, consolidating the essential concepts needed to understand the Adaptive Agent and other generalist approaches.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04960v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04960\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance\",\n      \"author\": [\n        \"Song Yu\",\n        \"Li Li\",\n        \"Wenwen Zhao\",\n        \"Zhisheng Yang\"\n      ],\n      \"summary\": \"arXiv:2605.04960v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR), particularly Group Relative Policy Optimization (GRPO), has advanced LLM reasoning. However, GRPO suffers from three credit assignment failures: uniform token-level granularity that ignores heterogeneous informational value, uniform polarity that penalizes correct steps and rewards incorrect ones, and zero-variance collapse that erases outcome-driven gradients. We systematically quantify these failures, revealing highly non-uniform token informativeness, widespread step-level polarity misalignment, and substantial training waste. To address these limitations, we propose Entropy-Progress Aligned GRPO (EP-GRPO), a framework that mines the model's intrinsic information flow for dense, self-supervised guidance. 
EP-GRPO integrates entropy-gated modulation to prioritize high entropy decision pivots, implicit process signals from policy divergence anchored to outcome advantages for directional token-level feedback without external reward models, and cumulative entropy mapping that enables progress-aligned advantage normalization, naturally maintaining gradient flow under zero reward variance. Extensive experiments on mathematical reasoning benchmarks demonstrate that EP-GRPO achieves superior accuracy and efficiency compared to GRPO and its variants. The code will be available.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04712v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04712\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning\",\n      \"author\": [\n        \"Lirui Luo\",\n        \"Guoxi Zhang\",\n        \"Hongming Xu\",\n        \"Cong Fang\",\n        \"Qing Li\"\n      ],\n      \"summary\": \"arXiv:2605.04712v1 Announce Type: new Abstract: In deep reinforcement learning (DRL), an agent is trained from a stream of experience. In a continual learning setting, such agents can suffer from plasticity loss: their ability to learn new skills from new experiences diminishes over training. Recently, Mixture-of-Experts (MoE) networks have been reported to enable scaling laws and facilitate the learning of diverse skills. However, in continual reinforcement learning settings, their performance can degenerate as learning proceeds, indicating a loss of plasticity. To address this, building on Neural Tangent Kernel (NTK) theory, we formalize the plasticity loss in MoE policies as a loss of spectral plasticity. 
We then derive a tractable proxy for spectral plasticity, one expressible in terms of individual expert feature matrices. Leveraging this proxy, we introduce SPHERE, a practical Parseval penalty tailored for MoE-based policies that alleviates the loss of spectral plasticity. On MetaWorld and HumanoidBench, SPHERE improves average success under continual RL by 133% and 50% over an unregularized MoE baseline, while maintaining higher spectral plasticity throughout training.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2604.27859v2\",\n      \"url\": \"https://arxiv.org/abs/2604.27859\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"A Brief Overview: Agentic Reinforcement Learning In Large Language Models\",\n      \"author\": [\n        \"Fangming Cui\",\n        \"Ruixiao Zhu\",\n        \"Cheng Fang\",\n        \"Sunan Li\",\n        \"Jiahong Li\"\n      ],\n      \"summary\": \"arXiv:2604.27859v2 Announce Type: replace Abstract: Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environments. However, the advent of powerful Large Language Models (LLMs) and increasingly complex, open-ended tasks has catalyzed a paradigm shift towards agentic paradigms within RL. This emerging framework extends beyond traditional RL by emphasizing the development of autonomous agents capable of goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning in uncertain, real-world environments. Unlike conventional approaches that rely heavily on static objectives and episodic interactions, LLM-based Agentic RL incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop. 
In this paper, we provide a deep insight for looking the conceptual foundations, methodological innovations, and effective designs underlying this trend. Furthermore, we identify critical challenges and outline promising future directions for building LLM-based Agentic RL.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:36:24.058Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.02063v1\",\n      \"url\": \"https://arxiv.org/abs/2605.02063\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"Coopetition-Gym v1: A Formally Grounded Platform for Mixed-Motive Multi-Agent Reinforcement Learning under Strategic Coopetition\",\n      \"author\": [\n        \"Vik Pant\",\n        \"Eric Yu\"\n      ],\n      \"summary\": \"arXiv:2605.02063v1 Announce Type: cross Abstract: We present Coopetition-Gym v1, a benchmark platform for mixed-motive multi-agent reinforcement learning under strategic coopetition. The platform comprises twenty environments organized into four mechanism classes that correspond to four foundational technical reports: interdependence and complementarity (arXiv:2510.18802), trust and reputation dynamics (arXiv:2510.24909), collective action and loyalty (arXiv:2601.16237), and sequential interaction and reciprocity (arXiv:2604.01240). Each environment carries a closed-form payoff structure and a calibrated interdependence matrix derived from the corresponding report. Every environment exposes a parameterized reward layer configurable across three structurally distinct modes (private, integrated, cooperative). This separation of payoff from reward enables reward-type ablation, the platform's principal methodological apparatus. 
Four of the twenty environments are calibrated against historically documented coopetitive relationships and reproduce their outcomes at 98.3, 81.7, 86.7, and 87.3 percent on the validation rubric (Samsung-Sony LCD, Renault-Nissan Alliance, Apache HTTP Server, Apple iOS App Store). The platform exposes Gymnasium, PettingZoo Parallel, and PettingZoo AEC interfaces and ships 126 reference algorithms: 16 learning algorithms, 7 game-theoretic oracles, 2 heuristic baselines, and 101 constant-action policies. A reference experimental study trained the 16 learning algorithms on every environment under every reward configuration with seven random seeds, producing a 25,708-run training corpus and a 1,116-run behavioral audit corpus, both released under CC-BY-4.0 with Croissant 1.0 metadata. Coopetition-Gym v1 is the first platform to combine continuous-action mixed-motive environments, parameterized reward mutuality, calibrated interdependence coefficients, game-theoretic oracle baselines, and validated case studies.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-05T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-05T16:07:19.561Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2602.03452v2\",\n      \"url\": \"https://arxiv.org/abs/2602.03452\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing\",\n      \"author\": [\n        \"Yujuan Pang\",\n        \"Jiaxin Li\",\n        \"Xin Sheng\",\n        \"Ran Peng\",\n        \"Yong Ma\"\n      ],\n      \"summary\": \"arXiv:2602.03452v2 Announce Type: replace Abstract: Reinforcement learning with verifiable rewards (RLVR) is effective for training large language models on deterministic outcome reasoning tasks. 
Prior work shows RLVR works with few prompts, but prompt selection is often based only on training-accuracy variance, leading to unstable optimization directions and weaker transfer. We revisit prompt selection from a mechanism-level view and argue that an effective minibatch should provide both (i) a reliable positive anchor and (ii) explicit negative learning signals from rare failures. Based on this principle, we propose positive-negative pairing: at each update, we sample a hard-but-solvable $q^{+}$ and an easy-but-brittle prompt $q^{-}$ (high success rate but not perfect), characterized by low and high empirical success rates under multiple rollouts. We further introduce Weighted GRPO, which reweights binary outcomes at the pair level and uses group-normalized advantages to amplify rare successes on $q^{+}$ into sharp positive guidance while turning rare failures on $q^{-}$ into strong negative penalties. This bidirectional signal provides informative learning feedback for both successes and failures, improving sample efficiency without suppressing exploration. On Qwen2.5-Math-7B, a single paired minibatch per update consistently outperforms a GRPO baseline that selects two prompts via commonly used variance-based selection heuristics: AIME 2025 Pass@8 improves from 16.8 to 22.2, and AMC23 Pass@64 from 94.0 to 97.0, while remaining competitive with large-scale RLVR trained from a pool of 1209 training prompts. 
Similar gains are observed on Qwen2.5-Math-7B-Instruct.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.03434v1\",\n      \"url\": \"https://arxiv.org/abs/2605.03434\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Quantum Hierarchical Reinforcement Learning via Variational Quantum Circuits\",\n      \"author\": [\n        \"Yu-Ting Lee\",\n        \"Samuel Yen-Chi Chen\",\n        \"Fu-Chieh Chang\"\n      ],\n      \"summary\": \"arXiv:2605.03434v1 Announce Type: new Abstract: Reinforcement learning is one of the most challenging learning paradigms where efficacy and efficiency gains are extremely valuable. Hierarchical reinforcement learning is a variant that leverages temporal abstraction to structure decision-making. While parametrized quantum computations have shown success in non-hierarchical reinforcement learning, whether these advantages adapt to hierarchical decision-making remains a critical open question. In this work, we develop a hybrid hierarchical agent based on the option-critic architecture. This hybrid agent substitutes classical components with variational quantum circuits for feature extractors, option-value functions, termination functions, and intra-option policies. Evaluated on standard benchmarking environments, results show that a hybrid agent utilizing a quantum feature extractor can outperform classical baselines while saving up to 66\\\\% trainable parameters. We also identify an architectural bottleneck that quantum option-value estimation severely degrades performance. Further ablation studies reveal how architectural choices of the quantum circuits affect performance. 
Our work establishes design principles for parameter-efficient hybrid hierarchical agents.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-06T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-06T16:00:29.193Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2511.08717v4\",\n      \"url\": \"https://arxiv.org/abs/2511.08717\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Optimal control of the future via prospective learning with control\",\n      \"author\": [\n        \"Yuxin Bai\",\n        \"Aranyak Acharyya\",\n        \"Ashwin De Silva\",\n        \"Zeyu Shen\",\n        \"James Hassett\",\n        \"Joshua T. Vogelstein\"\n      ],\n      \"summary\": \"arXiv:2511.08717v4 Announce Type: replace-cross Abstract: Optimal control of the future is the next frontier for AI. Current approaches to this problem are typically rooted in reinforcement learning (RL). RL is mathematically distinct from supervised learning, which has been the main workhorse for the recent achievements in AI. Moreover, RL typically operates in a stationary environment with episodic resets, limiting its utility. Here, we extend supervised learning to address learning to control in non-stationary, reset-free environments. Using this framework, called ''Prospective Learning with Control'' (PLuC), we prove that under certain fairly general assumptions, empirical risk minimization (ERM) asymptotically achieves the Bayes optimal policy. We then consider a specific instance of prospective learning with control: foraging, a canonical task relevant to both natural and artificial agents. We illustrate that modern RL algorithms, which assume stationarity, struggle in these non-stationary reset-free environments. Even with time-aware modifications, they converge orders of magnitude slower than our prospective foraging agents on a simple 1-D foraging benchmark. 
Code is available at: https://github.com/neurodata/procontrol.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-06T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-06T16:00:29.193Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2602.05890v2\",\n      \"url\": \"https://arxiv.org/abs/2602.05890\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training\",\n      \"author\": [\n        \"Dingwei Zhu\",\n        \"Zhiheng Xi\",\n        \"Shihan Dou\",\n        \"Jiahan Li\",\n        \"Chenhao Huang\",\n        \"Junjie Ye\",\n        \"Sixian Li\",\n        \"Mingxu Chai\",\n        \"Yuhui Wang\",\n        \"Yajie Yang\",\n        \"Ming Zhang\",\n        \"Jiazheng Zhang\",\n        \"Shichun Liu\",\n        \"Caishuang Huang\",\n        \"Yunke Zhang\",\n        \"Yuran Wang\",\n        \"Tao Gui\",\n        \"Xipeng Qiu\",\n        \"Qi Zhang\",\n        \"Xuanjing Huang\"\n      ],\n      \"summary\": \"arXiv:2602.05890v2 Announce Type: replace-cross Abstract: Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain (OOD) generalization, especially in LLM post-training. Recent distributional RL methods improve robustness by modeling values with multiple quantile points, but they still learn each quantile independently as a scalar. This results in rough-grained value representations that lack fine-grained conditioning on state information, struggling under complex and OOD conditions. We propose DFPO (Distributional Value Flow Policy Optimization with Conditional Risk and Consistency Control), a robust distributional RL framework that models values as continuous flows across time steps. 
By scaling value modeling through learning of a value flow field instead of isolated quantile predictions, DFPO captures richer state information for more accurate advantage estimation. To stabilize training under noisy feedback, DFPO further integrates conditional risk control and consistency constraints along value flow trajectories. Experiments on dialogue, math reasoning, and scientific tasks show that DFPO outperforms PPO, FlowRL, and other robust baselines under noisy supervision, achieving improved training stability and generalization.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:22.293Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04880v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04880\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"A Harmonic Mean Formulation of Average Reward Reinforcement Learning in SMDPs\",\n      \"author\": [\n        \"Erel Shtossel\",\n        \"Alicia Vidler\",\n        \"Uri Shaham\",\n        \"Gal A. Kaminka\"\n      ],\n      \"summary\": \"arXiv:2605.04880v1 Announce Type: new Abstract: Recent research has revived and amplified interest in algorithms for undiscounted average reward reinforcement learning in infinite-horizon, non-episodic (continuing) tasks. Semi-Markov decision processes (SMDPs) are of particular interest. In SMDPs, discrete actions stochastically generate both rewards and durations, and the objective is to optimize the average reward rate. Existing algorithms approach this by optimizing the ratio of rewards to durations. However, when rewards and durations are non-stationary (in the infinite horizon), this can be incorrect. This paper presents a novel modified harmonic mean operator that correctly computes reward rates even under such conditions. 
This yields model-free learning algorithms that can work with SMDPs, while maintaining robustness to non-stationary reward and duration distributions over time. We prove theoretical properties of the modified harmonic mean operator, and empirically demonstrate its efficacy in comparison to existing algorithms.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04939v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04939\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"Modular Reinforcement Learning For Cooperative Swarms\",\n      \"author\": [\n        \"Erel Shtossel\",\n        \"Gal A. Kaminka\"\n      ],\n      \"summary\": \"arXiv:2605.04939v1 Announce Type: cross Abstract: A cooperative robot swarm is a collective of computationally-limited robots that share a common goal. Each robot can only interact with a small subset of its peers, without knowing how this affects the collective utility. Recent advances in distributed multi-agent reinforcement learning have demonstrated that it is possible for robots to learn how to interact effectively with others, in a manner that is aligned with the common goal, despite each robot learning independently of others. However, this requires each robot to represent a potentially combinatorial number of interaction states, challenging the memory capabilities of the robots. This paper proposes an alternative approach for representing spatial interaction states for multi-robot reinforcement learning in swarms. A modular (decomposed) representation is used, where each feature of the state is handled by a separate learning procedure, and the results aggregated. 
We demonstrate the efficacy of the approach in numerous experiments with simulated robot swarms carrying out foraging.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:36:24.058Z\"\n    },\n    {\n      \"id\": \"https://www.machinebrief.com/news/revolutionizing-code-generation-with-requirement-aware-learn-5p36\",\n      \"url\": \"https://www.machinebrief.com/news/revolutionizing-code-generation-with-requirement-aware-learn-5p36\",\n      \"source_url\": \"https://www.machinebrief.com/rss.xml\",\n      \"title\": \"Revolutionizing Code Generation with Requirement-Aware Learning\",\n      \"author\": [\n        \"Felix Navarro\"\n      ],\n      \"summary\": \"RECRL, a new curriculum reinforcement learning framework, significantly enhances code generation by addressing requirement difficulty. It shows a notable performance boost across five benchmarks.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-05T15:25:34.000Z\",\n      \"fetched_at\": \"2026-05-05T16:10:22.708Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04368v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04368\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Extending Differential Temporal Difference Methods for Episodic Problems\",\n      \"author\": [\n        \"Kris De Asis\",\n        \"Mohamed Elsayed\",\n        \"Jiamin He\"\n      ],\n      \"summary\": \"arXiv:2605.04368v1 Announce Type: new Abstract: Differential temporal difference (TD) methods are value-based reinforcement learning algorithms that have been proposed for infinite-horizon problems. They rely on reward centering, where each reward is centered by the average reward. This keeps the return bounded and removes a value function's state-independent offset. 
However, reward centering can alter the optimal policy in episodic problems, limiting its applicability. Motivated by recent works that emphasize the role of normalization in streaming deep reinforcement learning, we study reward centering in episodic problems and propose a generalization of differential TD. We prove that this generalization maintains the ordering of policies in the presence of termination, and thus extends differential TD to episodic problems. We show equivalence with a form of linear TD, thereby inheriting theoretical guarantees that have been shown for those algorithms. We then extend several streaming reinforcement learning algorithms to their differential counterparts. Across a range of base algorithms and environments, we empirically validate that reward centering can improve sample efficiency in episodic problems.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.05110v1\",\n      \"url\": \"https://arxiv.org/abs/2605.05110\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"LineRides: Line-Guided Reinforcement Learning for Bicycle Robot Stunts\",\n      \"author\": [\n        \"Seungeun Rho\",\n        \"Shamel Fahmi\",\n        \"Jeonghwan Kim\",\n        \"Arianna Ilvonen\",\n        \"Sehoon Ha\",\n        \"Gabriel Nelson\"\n      ],\n      \"summary\": \"arXiv:2605.05110v1 Announce Type: cross Abstract: Designing reward functions for agile robotic maneuvers in reinforcement learning remains difficult, and demonstration-based approaches often require reference motions that are unavailable for novel platforms or extreme stunts. 
We present LineRides, a line-guided learning framework that enables a custom bicycle robot to acquire diverse, commandable stunt behaviors from a user-provided spatial guideline and sparse key-orientations, without demonstrations or explicit timing. LineRides handles physically infeasible guidelines using a tracking margin that permits controlled deviation, resolves temporal ambiguity by measuring progress via traveled distance along the guideline, and disambiguates motion details through position- and sequence-based key-orientations. We evaluate LineRides on the Ultra Mobility Vehicle (UMV) and show that the policy trained with our methods supports seamless transitions between normal driving and stunt execution, enabling five distinct stunts on command: MiniHop, LargeHop, ThreePointTurn, Backflip, and DriftTurn.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:36:24.058Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.05112v1\",\n      \"url\": \"https://arxiv.org/abs/2605.05112\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime\",\n      \"author\": [\n        \"Tianshu Zhu\",\n        \"Wenyu Zhang\",\n        \"Xiaoying Zuo\",\n        \"Lun Tian\",\n        \"Haotian Zhao\",\n        \"Yucheng Zeng\",\n        \"Jingnan Gu\",\n        \"Daxiang Dong\",\n        \"Jianmin Wu\",\n        \"Dawei Yin\",\n        \"Dou Shen\"\n      ],\n      \"summary\": \"arXiv:2605.05112v1 Announce Type: new Abstract: SWE-bench-style agentic reinforcement learning relies on expensive stateful trajectories, yet substantial compute is wasted on sampled rollout groups with skewed pass rates, where binary rewards provide a weak contrastive signal. 
We frame this inefficiency as a pass-rate control problem and show that a 50% pass rate is the most informative operating point: it maximizes reward entropy, the probability of surviving group filtering, RLOO advantage energy under GRPO, and success--failure contrastive structure. Guided by this principle, we propose Prefix Sampling (PS), which replays trajectory prefixes to steer skewed groups toward this regime: successful prefixes serve as head starts for mostly failing groups, while failing prefixes serve as handicaps for mostly passing groups. In stateful agent environments, prefix states are reconstructed through replay while replayed tokens are excluded from the loss, restricting optimization to continuations generated by the current policy. On SWE-bench-style agentic RL, PS delivers end-to-end wall-clock speedups of 2.01x on Qwen3-14B and 1.55x on Qwen3-32B while preserving or improving final verified performance. For 14B, the SWE-bench Verified peak rises from the baseline peak of 0.273 to 0.295 under PS. 
Additional mathematical reasoning experiments on AIME 2025 show the same pass-rate control pattern and decompose the gains into replay, bidirectional coverage, and adaptive control.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.02913v1\",\n      \"url\": \"https://arxiv.org/abs/2605.02913\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning\",\n      \"author\": [\n        \"Rohan Surana\",\n        \"Gagan Mundada\",\n        \"Xunyi Jiang\",\n        \"Chuhan Wang\",\n        \"Zhenwei Tang\",\n        \"Difan Jiao\",\n        \"Zihan Huang\",\n        \"Yuxin Xiong\",\n        \"Junda Wu\",\n        \"Sheldon Yu\",\n        \"Xintong Li\",\n        \"Raghav Jain\",\n        \"Nikki Kuang\",\n        \"Sizhe Zhou\",\n        \"Bowen Jin\",\n        \"Zhendong Chu\",\n        \"Tong Yu\",\n        \"Ryan Rossi\",\n        \"Kuan-Hao Huang\",\n        \"Jingbo Shang\",\n        \"Jiawei Han\",\n        \"Julian McAuley\"\n      ],\n      \"summary\": \"arXiv:2605.02913v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a central post-training tool for improving the reasoning abilities of large language models (LLMs). In these systems, the rollout, the trajectory sampled from a prompt to termination, including intermediate reasoning steps and optional tool or environment interactions, determines the data the optimizer learns from, yet rollout design is often underreported. This survey provides an optimizer-agnostic view of rollout strategies for RL-based post-training of reasoning LLMs. 
We formalize rollout pipelines with unified notation and introduce Generate-Filter-Control-Replay (GFCR), a lifecycle taxonomy that decomposes rollout pipelines into four modular stages: Generate proposes candidate trajectories and topologies; Filter constructs intermediate signals via verifiers, judges, critics; Control allocates compute and makes continuation/branching/stopping decisions under budgets; and Replay retains and reuses artifacts across rollouts without weight updates, including self-evolving curricula that autonomously generate new training tasks. We complement GFCR with a criterion taxonomy of reliability, coverage, and cost sensitivity that characterizes rollout trade-offs. Using this framework, we synthesize methods spanning RL with verifiable rewards, process supervision, judge-based gating, guided and tree/segment rollouts, adaptive compute allocation, early-exit and partial rollouts, throughput optimization, and replay/recomposition for self-improvement. We ground the framework with case studies in math, code/SQL, multimodal reasoning, tool-using agents, and agentic skill benchmarks that evaluate skill induction, reuse, and cross-task transfer. 
Finally, we provide a diagnostic index that maps common rollout pathologies to GFCR modules and mitigation levers, alongside open challenges for building reproducible, compute-efficient, and trustworthy rollout pipelines.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-06T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-06T16:00:29.193Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04542v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04542\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation\",\n      \"author\": [\n        \"Akiyoshi Tomihari\",\n        \"Issei Sato\"\n      ],\n      \"summary\": \"arXiv:2605.04542v1 Announce Type: new Abstract: Recent analyses question whether reinforcement learning (RL) is responsible for strong reasoning in large language models (LLMs). At the same time, distillation and inference-time sampling, including power sampling, have emerged as effective ways to improve LLM performance. However, the relationship among RL, distillation, and sampling remains unclear. In this study, we focus on the power distribution, the target distribution of power sampling, and show that the power distribution bridges sampling, self-reward KL-regularized RL, and self-distillation. From the sampling perspective, we show that inexpensive local approximations cannot reproduce sequence-level power without information about possible suffixes. From the RL perspective, the power distribution is the closed-form optimizer of KL-regularized RL when the model's sequence-level log-probabilities are used as the reward. This identification leads to power self-distillation, an offline distillation surrogate that shares the same target distribution and amortizes the cost of power sampling into supervised training on teacher samples. 
We further show that power self-distillation can achieve self-reward sharpening, while improvement in a downstream true reward is governed by the covariance between true reward and self-reward under the power distribution. Experiments on reasoning tasks support our analysis: power sampling raises self-reward, true-reward gains depend on alignment with self-reward, and power self-distillation can match or exceed the performance of power sampling at much lower inference cost.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2604.01345v2\",\n      \"url\": \"https://arxiv.org/abs/2604.01345\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Malliavin Calculus for Counterfactual Gradient Estimation in Adaptive Inverse Reinforcement Learning\",\n      \"author\": [\n        \"Vikram Krishnamurthy\",\n        \"Luke Snow\"\n      ],\n      \"summary\": \"arXiv:2604.01345v2 Announce Type: replace Abstract: Inverse reinforcement learning (IRL) recovers the loss function of a forward learner from its observed responses. Adaptive IRL aims to reconstruct the loss function of a forward learner by passively observing its gradients as it performs reinforcement learning (RL). This paper proposes a novel passive Langevin-based algorithm that achieves adaptive IRL. The key difficulty in adaptive IRL is that the required gradients in the passive algorithm are counterfactual, that is, they are conditioned on events of probability zero under the forward learner's trajectory. Therefore, naive Monte Carlo estimators are prohibitively inefficient, and kernel smoothing, though common, suffers from slow convergence. We overcome this by employing Malliavin calculus to efficiently estimate the required counterfactual gradients. 
We reformulate the counterfactual conditioning as a ratio of unconditioned expectations involving Malliavin quantities, thus recovering standard estimation rates. We derive the necessary Malliavin derivatives and their adjoint Skorohod integral formulations for a general Langevin structure, and provide a concrete algorithmic approach which exploits these for counterfactual gradient estimation.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    }\n  ],\n  \"total\": 3509\n}"
    }
  ],
  "details": {
    "items": [
      {
        "id": "https://www.machinebrief.com/news/revolutionizing-language-models-a-new-era-with-adaptive-powe-ais5",
        "url": "https://www.machinebrief.com/news/revolutionizing-language-models-a-new-era-with-adaptive-powe-ais5",
        "source_url": "https://www.machinebrief.com/rss.xml",
        "title": "Revolutionizing Language Models: A New Era with Adaptive Power-Mean Optimization",
        "author": [
          "Felix Navarro"
        ],
        "summary": "Adaptive Power-Mean Policy Optimization (APMPO) transforms reinforcement learning for Large Language Models. By embracing dynamic policy strategies, it outperforms traditional static methods, enhancing reasoning capabilities dramatically.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T14:08:40.000Z",
        "fetched_at": "2026-05-07T16:40:31.825Z"
      },
      {
        "id": "oai:arXiv.org:2605.04920v1",
        "url": "https://arxiv.org/abs/2605.04920",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization",
        "author": [
          "Xiyan Fu",
          "Wei Liu"
        ],
        "summary": "arXiv:2605.04920v1 Announce Type: cross Abstract: Compositional generalization refers to correctly interpreting novel combinations of known primitives, which remains a major challenge. Existing approaches often rely on supervised fine-tuning, which encourages models to imitate target outputs. This token-level training paradigm fails to capture the global compositional structure required for generalizing to unseen combinations. In this work, we investigate whether compositional generalization can instead be improved through outcome-level reinforcement learning. We adopt Group Relative Policy Optimization to optimize models based on feedback on their final outputs. Within this framework, we explore both a simple binary outcome reward and a composite reward that provides additional composition feedback. Experiments on multiple compositional benchmarks show that reinforcement learning improves compositional generalization compared to supervised fine-tuning. Further analysis reveals that supervised models tend to overfit frequent training compositions, whereas reinforcement learning improves compositional generalization by reshaping the output distribution, particularly for more complex composition types.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:22.293Z"
      },
      {
        "id": "https://www.machinebrief.com/news/reinforcement-learning-gets-a-safety-upgrade-with-alam-67sn",
        "url": "https://www.machinebrief.com/news/reinforcement-learning-gets-a-safety-upgrade-with-alam-67sn",
        "source_url": "https://www.machinebrief.com/rss.xml",
        "title": "Reinforcement Learning Gets a Safety Upgrade with ALaM",
        "author": [
          "Julian Voss"
        ],
        "summary": "The ALaM framework tackles safety in reinforcement learning by optimizing state-wise constraints. It's a big deal for stable training and risk management.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-05T15:53:58.000Z",
        "fetched_at": "2026-05-05T16:10:22.708Z"
      },
      {
        "id": "oai:arXiv.org:2605.02943v1",
        "url": "https://arxiv.org/abs/2605.02943",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Healthcare AI GYM for Medical Agents",
        "author": [
          "Minbyul Jeong"
        ],
        "summary": "arXiv:2605.02943v1 Announce Type: new Abstract: Clinical reasoning demands multi-step interactions -- gathering patient history, ordering tests, interpreting results, and making safe treatment decisions -- yet a unified training environment that provides the breadth of clinical domains and specialized tools to train generalizable medical AI agents through reinforcement learning remains elusive. We present a comprehensive empirical study of multi-turn agentic RL for medical AI, built on \\gym{}, a gymnasium-compatible environment spanning 10 clinical domains with 3.6K+ tasks, 135 domain-specific tools, and a knowledge base of 828K medical passages. Our analysis reveals that agentic multi-turn structure degrades into verbose single-turn monologues, characterized by monotonic length explosion and a simultaneous erosion of tool-use frequency. We characterize how this collapse, alongside distillation instability, stems from the misalignment of sparse terminal rewards with sequential clinical trajectories. We find that vanilla GRPO achieves strong final accuracy on some benchmarks but suffers from training instability, evidenced by significant oscillations in response length and prolonged convergence periods. To improve training efficiency and stability, we propose Turn-level Truncated On-Policy Distillation (TT-OPD), a self-distillation framework where a gradient-free EMA teacher leverages outcome-privileged information to provide dense, outcome-aware KL regularization at every conversation turn. TT-OPD achieves the best performance on 10 of 18 benchmarks with an average +3.9 pp improvement over the non-RL baseline with faster early convergence, controlled response length, and sustained multi-turn tool use.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-06T04:00:00.000Z",
        "fetched_at": "2026-05-06T16:00:29.193Z"
      },
      {
        "id": "oai:arXiv.org:2412.08893v3",
        "url": "https://arxiv.org/abs/2412.08893",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Optimal Control with Natural Images: Efficient Reinforcement Learning using Overcomplete Sparse Codes",
        "author": [
          "Peter N. Loxley"
        ],
        "summary": "arXiv:2412.08893v3 Announce Type: replace Abstract: Optimal control and sequential decision making are widely used in many complex tasks. Optimal control over a sequence of natural images is a first step towards understanding the role of vision in control. Here, we formalize this problem as a reinforcement learning task, and derive general conditions under which an image includes enough information to implement an optimal policy. Reinforcement learning is shown to provide a computationally efficient method for finding optimal policies when natural images are encoded into \"efficient\" image representations. This is demonstrated by introducing a new reinforcement learning benchmark that easily scales to large numbers of states and long horizons. In particular, by representing each image as an overcomplete sparse code, we are able to efficiently solve an optimal control task that is orders of magnitude larger than those tasks solvable using complete codes. Theoretical justification for this behaviour is provided. This work also demonstrates that deep learning is not necessary for efficient optimal control with natural images.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2605.03327v1",
        "url": "https://arxiv.org/abs/2605.03327",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment",
        "author": [
          "Hongbo Jin",
          "Rongpeng Zhu",
          "Zhongjing Du",
          "Xu Jiang",
          "Jingqi Tian",
          "Qiaoman Zhang",
          "Jiayu Ding"
        ],
        "summary": "arXiv:2605.03327v1 Announce Type: new Abstract: Reinforcement learning is crucial for aligning large language models to perform complex reasoning tasks. However, current algorithms such as Group Relative Policy Optimization suffer from coarse grained, sequence level credit assignment, which severely struggles to isolate pivotal reasoning steps within long Chain of Thought generations. Furthermore, the standard unbounded Kullback Leibler divergence penalty induces severe gradient instability and mode seeking conservatism, ultimately stifling the discovery of novel reasoning trajectories. To overcome these limitations, we introduce Distribution Guided Policy Optimization, a novel critic free reinforcement learning framework that reinterprets distribution deviation as a guiding signal rather than a rigid penalty.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-06T04:00:00.000Z",
        "fetched_at": "2026-05-06T16:00:29.193Z"
      },
      {
        "id": "oai:arXiv.org:2605.04066v1",
        "url": "https://arxiv.org/abs/2605.04066",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning",
        "author": [
          "Yiming Huang",
          "Zhenbo Shi",
          "Shuzheng Gao",
          "Cuiyun Gao",
          "Peiyi Han",
          "Chuanyi Liu"
        ],
        "summary": "arXiv:2605.04066v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is an essential paradigm that enhances the reasoning capabilities of Large Language Models (LLMs). However, existing methods typically rely on static policy optimization schemes that misalign with the model's evolving reasoning capabilities. To address this issue, we propose Adaptive Power-Mean Policy Optimization (APMPO), which comprises two main innovations: Power-Mean Policy Optimization (PMPO) and Feedback-Adaptive Clipping (FAC). Specifically, PMPO introduces a generalized power-mean objective. This enables the model to adaptively transition from the signal-amplifying behavior of the arithmetic mean to the consistency-enforcing behavior of the geometric mean. FAC adaptively adjusts clipping bounds based on real-time reward statistics to overcome the limitations of static mechanisms. Capitalizing on these innovations, APMPO improves learning dynamics and reasoning performance. Extensive experiments on nine datasets across three reasoning tasks showcase the superiority of APMPO over state-of-the-art RLVR-based baselines. For instance, APMPO boosts the average Pass@1 score on mathematical reasoning benchmarks by 3.0 points compared to GRPO when using Qwen2.5-3B-Instruct.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:22.293Z"
      },
      {
        "id": "oai:arXiv.org:2605.04065v1",
        "url": "https://arxiv.org/abs/2605.04065",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs",
        "author": [
          "Yiming Huang",
          "Zhenbo Shi",
          "Xin-Cheng Wen",
          "Jichuan Zeng",
          "Cuiyun Gao",
          "Peiyi Han",
          "Chuanyi Liu"
        ],
        "summary": "arXiv:2605.04065v1 Announce Type: new Abstract: Unsupervised reinforcement learning (RL) has emerged as a promising paradigm for enabling self-improvement in large language models (LLMs). However, existing unsupervised RL-based methods often lack the capacity to adapt to the model's evolving reasoning capabilities during training. Therefore, these methods can misdirect policy optimization in the absence of ground-truth supervision. To address this issue, we introduce FREIA, a novel RL-based algorithm built on two key innovations: (1) Free Energy-Driven Reward (FER) adapts rewards to balance consensus and exploration based on the Free Energy Principle. (2) Adaptive Advantage Shaping (AAS) adaptively adjusts learning signals based on the statistical characteristics of sampled rewards. Empirical evaluations on nine datasets across three reasoning tasks showcase that FREIA outperforms other unsupervised RL-based baselines. Notably, in mathematical reasoning tasks, FREIA surpasses other methods by an average of 0.5 to 3.5 points in Pass@1 using the DeepSeek-R1-Distill-Qwen-1.5B model.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:22.293Z"
      },
      {
        "id": "oai:arXiv.org:2506.07548v2",
        "url": "https://arxiv.org/abs/2506.07548",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "Overcoming Environmental Meta-Stationarity in MARL via Adaptive Curriculum and Counterfactual Group Advantage",
        "author": [
          "Weiqiang Jin",
          "Yang Liu",
          "Shixiang Tang",
          "Jinhu Qi",
          "Wentao Zhang",
          "Junli Wang",
          "Biao Zhao",
          "Hongyang Du"
        ],
        "summary": "arXiv:2506.07548v2 Announce Type: replace Abstract: Multi-agent reinforcement learning (MARL) has reached competitive performance on cooperative tasks against scripted adversaries, yet most methods train agents at a single fixed difficulty throughout the entire run. We term this static-difficulty regime environmental meta-stationarity and show that it caps policy generalization and steers learning toward shallow local optima. To break this regime, we propose CL-MARL, a dynamic curriculum learning framework that adapts opponent strength online from win-rate signals, advancing or regressing the task as agents master it. Its scheduler, FlexDiff, fuses momentum-based trend estimation with sliding-window dual-curve monitoring of training and evaluation returns, yielding stable difficulty transitions without manual tuning. Because a moving curriculum amplifies non-stationarity and sparsifies global rewards, we introduce the Counterfactual Group Relative Policy Advantage (CGRPA), which extends GRPO-style group-relative optimization with counterfactual baselines to disentangle each agent's contribution under shifting team dynamics. On the StarCraft Multi-Agent Challenge (SMAC), CL-MARL attains a 40% mean win rate on the super-hard maps with an average episode return of 17.85, exceeding the QMIX, OW-QMIX, DER, EMC, and MARR baselines by +2.94 on average, while reaching its peak win rate roughly 1.28x faster on 8m_vs_9m and 1.42x faster on 3s5z_vs_3s6z than the strongest baseline. The implementation is publicly available at https://github.com/NICE-HKU/CL2MARL-SMAC.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:36:24.058Z"
      },
      {
        "id": "oai:arXiv.org:2605.04266v1",
        "url": "https://arxiv.org/abs/2605.04266",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Explaining and Preventing Alignment Collapse in Iterative RLHF",
        "author": [
          "Etienne Gauthier",
          "Francis Bach",
          "Michael I. Jordan"
        ],
        "summary": "arXiv:2605.04266v1 Announce Type: new Abstract: Reinforcement learning from human feedback (RLHF) typically assumes a static or non-strategic reward model (RM). In iterative deployment, however, the policy generates the data on which the RM is retrained, creating a feedback loop. Building on the Stackelberg game formulation of this interaction, we derive an analytical decomposition of the policy's true optimization gradient into a standard policy gradient and a parameter-steering term that captures the policy's influence on the RM's future parameters. We show that standard iterative RLHF, which drops this steering term entirely, suffers from alignment collapse: the policy systematically exploits the RM's blind spots, producing low-quality, high-reward outputs whose feedback reinforces the very errors it exploits. To mitigate this, we propose foresighted policy optimization (FPO), a mechanism-design intervention that restores the missing steering term by regularizing the policy's parameter-steering effect on RM updates. We instantiate FPO via a scalable first-order approximation and demonstrate that it prevents alignment collapse on both controlled environments and an LLM alignment pipeline using Llama-3.2-1B.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2602.14872v2",
        "url": "https://arxiv.org/abs/2602.14872",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "The Implicit Curriculum: Learning Dynamics in RL with Verifiable Rewards",
        "author": [
          "Yu Huang",
          "Zixin Wen",
          "Yuejie Chi",
          "Yuting Wei",
          "Aarti Singh",
          "Yingbin Liang",
          "Yuxin Chen"
        ],
        "summary": "arXiv:2602.14872v2 Announce Type: replace Abstract: Reinforcement learning with verifiable rewards (RLVR) has been a main driver of recent breakthroughs in large reasoning models. Yet it remains a mystery how rewards based solely on final outcomes can help overcome the long-horizon barrier to extended reasoning. To understand this, we develop a theory of the training dynamics of RLVR for transformers on compositional reasoning tasks. Our theory shows that mixed-difficulty training naturally follows an implicit curriculum: without any explicit schedule, easier problems become learnable first and shape the frontier for harder ones, creating a learning progression from easy to hard during optimization. The effectiveness of this curriculum is governed by the smoothness of the difficulty spectrum. When the spectrum is smooth, training dynamics enters a well-behaved relay regime, in which persistent gradient signals on easier problems make slightly harder ones tractable and keep training at the edge of competence. When the spectrum contains abrupt discontinuities, training undergoes grokking-type phase transitions with prolonged plateaus before progress recurs. As a technical contribution, our analysis develops and adapts techniques from Fourier analysis on finite groups to our setting. We validate the predicted mechanisms empirically via synthetic experiments.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2601.07389v2",
        "url": "https://arxiv.org/abs/2601.07389",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training",
        "author": [
          "Xueyan Niu",
          "Bo Bai",
          "Wei Han",
          "Weixi Zhang"
        ],
        "summary": "arXiv:2601.07389v2 Announce Type: replace Abstract: Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). These two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs and expert responses, while RL maximizes reward signals derived from human preferences or rule-based verifiers. Modern reasoning models have widely adopted the practice of alternating SFT and RL training. However, there is no theoretical account of whether they can be decoupled. We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: RL increases SFT loss under both distributional (KL-based) and landscape (PL-based) analyses; and (2) RL-then-SFT coupling: SFT lowers the reward achieved by RL under analogous conditions. Under the PL condition, we further derive the optimal RL duration that balances reward improvement against SFT degradation, identify the non-decoupling threshold governing when RL can improve SFT, and bound the gradient misalignment via spectral concentration. Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated without loss of prior performance in the post-training pipeline.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2602.19837v3",
        "url": "https://arxiv.org/abs/2602.19837",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Meta-Learning and Meta-Reinforcement Learning -- Tracing the Path towards DeepMind's Adaptive Agent",
        "author": [
          "Bj\\\"orn Hoppmann",
          "Christoph Scholz"
        ],
        "summary": "arXiv:2602.19837v3 Announce Type: replace-cross Abstract: Humans are highly effective at utilizing prior knowledge to adapt to novel tasks, a capability that standard machine learning models struggle to replicate due to their reliance on task-specific training. Meta-learning overcomes this limitation by allowing models to acquire transferable knowledge from various tasks, enabling rapid adaptation to new challenges with minimal data. This survey provides a rigorous, task-based formalization of meta-learning and meta-reinforcement learning and uses that paradigm to chronicle the landmark algorithms that paved the way for DeepMind's Adaptive Agent, consolidating the essential concepts needed to understand the Adaptive Agent and other generalist approaches.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2605.04960v1",
        "url": "https://arxiv.org/abs/2605.04960",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance",
        "author": [
          "Song Yu",
          "Li Li",
          "Wenwen Zhao",
          "Zhisheng Yang"
        ],
        "summary": "arXiv:2605.04960v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR), particularly Group Relative Policy Optimization (GRPO), has advanced LLM reasoning. However, GRPO suffers from three credit assignment failures: uniform token-level granularity that ignores heterogeneous informational value, uniform polarity that penalizes correct steps and rewards incorrect ones, and zero-variance collapse that erases outcome-driven gradients. We systematically quantify these failures, revealing highly non-uniform token informativeness, widespread step-level polarity misalignment, and substantial training waste. To address these limitations, we propose Entropy-Progress Aligned GRPO (EP-GRPO), a framework that mines the model's intrinsic information flow for dense, self-supervised guidance. EP-GRPO integrates entropy-gated modulation to prioritize high entropy decision pivots, implicit process signals from policy divergence anchored to outcome advantages for directional token-level feedback without external reward models, and cumulative entropy mapping that enables progress-aligned advantage normalization, naturally maintaining gradient flow under zero reward variance. Extensive experiments on mathematical reasoning benchmarks demonstrate that EP-GRPO achieves superior accuracy and efficiency compared to GRPO and its variants. The code will be available.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2605.04712v1",
        "url": "https://arxiv.org/abs/2605.04712",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning",
        "author": [
          "Lirui Luo",
          "Guoxi Zhang",
          "Hongming Xu",
          "Cong Fang",
          "Qing Li"
        ],
        "summary": "arXiv:2605.04712v1 Announce Type: new Abstract: In deep reinforcement learning (DRL), an agent is trained from a stream of experience. In a continual learning setting, such agents can suffer from plasticity loss: their ability to learn new skills from new experiences diminishes over training. Recently, Mixture-of-Experts (MoE) networks have been reported to enable scaling laws and facilitate the learning of diverse skills. However, in continual reinforcement learning settings, their performance can degenerate as learning proceeds, indicating a loss of plasticity. To address this, building on Neural Tangent Kernel (NTK) theory, we formalize the plasticity loss in MoE policies as a loss of spectral plasticity. We then derive a tractable proxy for spectral plasticity, one expressible in terms of individual expert feature matrices. Leveraging this proxy, we introduce SPHERE, a practical Parseval penalty tailored for MoE-based policies that alleviates the loss of spectral plasticity. On MetaWorld and HumanoidBench, SPHERE improves average success under continual RL by 133% and 50% over an unregularized MoE baseline, while maintaining higher spectral plasticity throughout training.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2604.27859v2",
        "url": "https://arxiv.org/abs/2604.27859",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "A Brief Overview: Agentic Reinforcement Learning In Large Language Models",
        "author": [
          "Fangming Cui",
          "Ruixiao Zhu",
          "Cheng Fang",
          "Sunan Li",
          "Jiahong Li"
        ],
        "summary": "arXiv:2604.27859v2 Announce Type: replace Abstract: Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environments. However, the advent of powerful Large Language Models (LLMs) and increasingly complex, open-ended tasks has catalyzed a paradigm shift towards agentic paradigms within RL. This emerging framework extends beyond traditional RL by emphasizing the development of autonomous agents capable of goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning in uncertain, real-world environments. Unlike conventional approaches that rely heavily on static objectives and episodic interactions, LLM-based Agentic RL incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop. In this paper, we provide a deep insight for looking the conceptual foundations, methodological innovations, and effective designs underlying this trend. Furthermore, we identify critical challenges and outline promising future directions for building LLM-based Agentic RL.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:36:24.058Z"
      },
      {
        "id": "oai:arXiv.org:2605.02063v1",
        "url": "https://arxiv.org/abs/2605.02063",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "Coopetition-Gym v1: A Formally Grounded Platform for Mixed-Motive Multi-Agent Reinforcement Learning under Strategic Coopetition",
        "author": [
          "Vik Pant",
          "Eric Yu"
        ],
        "summary": "arXiv:2605.02063v1 Announce Type: cross Abstract: We present Coopetition-Gym v1, a benchmark platform for mixed-motive multi-agent reinforcement learning under strategic coopetition. The platform comprises twenty environments organized into four mechanism classes that correspond to four foundational technical reports: interdependence and complementarity (arXiv:2510.18802), trust and reputation dynamics (arXiv:2510.24909), collective action and loyalty (arXiv:2601.16237), and sequential interaction and reciprocity (arXiv:2604.01240). Each environment carries a closed-form payoff structure and a calibrated interdependence matrix derived from the corresponding report. Every environment exposes a parameterized reward layer configurable across three structurally distinct modes (private, integrated, cooperative). This separation of payoff from reward enables reward-type ablation, the platform's principal methodological apparatus. Four of the twenty environments are calibrated against historically documented coopetitive relationships and reproduce their outcomes at 98.3, 81.7, 86.7, and 87.3 percent on the validation rubric (Samsung-Sony LCD, Renault-Nissan Alliance, Apache HTTP Server, Apple iOS App Store). The platform exposes Gymnasium, PettingZoo Parallel, and PettingZoo AEC interfaces and ships 126 reference algorithms: 16 learning algorithms, 7 game-theoretic oracles, 2 heuristic baselines, and 101 constant-action policies. A reference experimental study trained the 16 learning algorithms on every environment under every reward configuration with seven random seeds, producing a 25,708-run training corpus and a 1,116-run behavioral audit corpus, both released under CC-BY-4.0 with Croissant 1.0 metadata. Coopetition-Gym v1 is the first platform to combine continuous-action mixed-motive environments, parameterized reward mutuality, calibrated interdependence coefficients, game-theoretic oracle baselines, and validated case studies.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-05T04:00:00.000Z",
        "fetched_at": "2026-05-05T16:07:19.561Z"
      },
      {
        "id": "oai:arXiv.org:2602.03452v2",
        "url": "https://arxiv.org/abs/2602.03452",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing",
        "author": [
          "Yujuan Pang",
          "Jiaxin Li",
          "Xin Sheng",
          "Ran Peng",
          "Yong Ma"
        ],
        "summary": "arXiv:2602.03452v2 Announce Type: replace Abstract: Reinforcement learning with verifiable rewards (RLVR) is effective for training large language models on deterministic outcome reasoning tasks. Prior work shows RLVR works with few prompts, but prompt selection is often based only on training-accuracy variance, leading to unstable optimization directions and weaker transfer. We revisit prompt selection from a mechanism-level view and argue that an effective minibatch should provide both (i) a reliable positive anchor and (ii) explicit negative learning signals from rare failures. Based on this principle, we propose \\emph{positive--negative pairing}: at each update, we sample a hard-but-solvable $q^{+}$ and an easy-but-brittle prompt $q^{-}$(high success rate but not perfect), characterized by low and high empirical success rates under multiple rollouts. We further introduce Weighted GRPO, which reweights binary outcomes at the pair level and uses group-normalized advantages to amplify rare successes on $q^{+}$ into sharp positive guidance while turning rare failures on $q^{-}$ into strong negative penalties. This bidirectional signal provides informative learning feedback for both successes and failures, improving sample efficiency without suppressing exploration. On Qwen2.5-Math-7B, a single paired minibatch per update consistently outperforms a GRPO baseline that selects two prompts via commonly used variance-based selection heuristics: AIME~2025 Pass@8 improves from 16.8 to 22.2, and AMC23 Pass@64 from 94.0 to 97.0, while remaining competitive with large-scale RLVR trained from a pool of 1209 training prompts. Similar gains are observed on Qwen2.5-Math-7B-Instruct.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2605.03434v1",
        "url": "https://arxiv.org/abs/2605.03434",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Quantum Hierarchical Reinforcement Learning via Variational Quantum Circuits",
        "author": [
          "Yu-Ting Lee",
          "Samuel Yen-Chi Chen",
          "Fu-Chieh Chang"
        ],
        "summary": "arXiv:2605.03434v1 Announce Type: new Abstract: Reinforcement learning is one of the most challenging learning paradigms where efficacy and efficiency gains are extremely valuable. Hierarchical reinforcement learning is a variant that leverages temporal abstraction to structure decision-making. While parametrized quantum computations have shown success in non-hierarchical reinforcement learning, whether these advantages adapt to hierarchical decision-making remains a critical open question. In this work, we develop a hybrid hierarchical agent based on the option-critic architecture. This hybrid agent substitutes classical components with variational quantum circuits for feature extractors, option-value functions, termination functions, and intra-option policies. Evaluated on standard benchmarking environments, results show that a hybrid agent utilizing a quantum feature extractor can outperform classical baselines while saving up to 66\\% trainable parameters. We also identify an architectural bottleneck that quantum option-value estimation severely degrades performance. Further ablation studies reveal how architectural choices of the quantum circuits affect performance. Our work establishes design principles for parameter-efficient hybrid hierarchical agents.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-06T04:00:00.000Z",
        "fetched_at": "2026-05-06T16:00:29.193Z"
      },
      {
        "id": "oai:arXiv.org:2511.08717v4",
        "url": "https://arxiv.org/abs/2511.08717",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Optimal control of the future via prospective learning with control",
        "author": [
          "Yuxin Bai",
          "Aranyak Acharyya",
          "Ashwin De Silva",
          "Zeyu Shen",
          "James Hassett",
          "Joshua T. Vogelstein"
        ],
        "summary": "arXiv:2511.08717v4 Announce Type: replace-cross Abstract: Optimal control of the future is the next frontier for AI. Current approaches to this problem are typically rooted in reinforcement learning (RL). RL is mathematically distinct from supervised learning, which has been the main workhorse for the recent achievements in AI. Moreover, RL typically operates in a stationary environment with episodic resets, limiting its utility. Here, we extend supervised learning to address learning to control in non-stationary, reset-free environments. Using this framework, called ''Prospective Learning with Control'' (PLuC), we prove that under certain fairly general assumptions, empirical risk minimization (ERM) asymptotically achieves the Bayes optimal policy. We then consider a specific instance of prospective learning with control: foraging, a canonical task relevant to both natural and artificial agents. We illustrate that modern RL algorithms, which assume stationarity, struggle in these non-stationary reset-free environments. Even with time-aware modifications, they converge orders of magnitude slower than our prospective foraging agents on a simple 1-D foraging benchmark. Code is available at: https://github.com/neurodata/procontrol.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-06T04:00:00.000Z",
        "fetched_at": "2026-05-06T16:00:29.193Z"
      },
      {
        "id": "oai:arXiv.org:2602.05890v2",
        "url": "https://arxiv.org/abs/2602.05890",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training",
        "author": [
          "Dingwei Zhu",
          "Zhiheng Xi",
          "Shihan Dou",
          "Jiahan Li",
          "Chenhao Huang",
          "Junjie Ye",
          "Sixian Li",
          "Mingxu Chai",
          "Yuhui Wang",
          "Yajie Yang",
          "Ming Zhang",
          "Jiazheng Zhang",
          "Shichun Liu",
          "Caishuang Huang",
          "Yunke Zhang",
          "Yuran Wang",
          "Tao Gui",
          "Xipeng Qiu",
          "Qi Zhang",
          "Xuanjing Huang"
        ],
        "summary": "arXiv:2602.05890v2 Announce Type: replace-cross Abstract: Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain (OOD) generalization, especially in LLM post-training. Recent distributional RL methods improve robustness by modeling values with multiple quantile points, but they still learn each quantile independently as a scalar. This results in rough-grained value representations that lack fine-grained conditioning on state information, struggling under complex and OOD conditions. We propose DFPO (Distributional Value Flow Policy Optimization with Conditional Risk and Consistency Control), a robust distributional RL framework that models values as continuous flows across time steps. By scaling value modeling through learning of a value flow field instead of isolated quantile predictions, DFPO captures richer state information for more accurate advantage estimation. To stabilize training under noisy feedback, DFPO further integrates conditional risk control and consistency constraints along value flow trajectories. Experiments on dialogue, math reasoning, and scientific tasks show that DFPO outperforms PPO, FlowRL, and other robust baselines under noisy supervision, achieving improved training stability and generalization.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:22.293Z"
      },
      {
        "id": "oai:arXiv.org:2605.04880v1",
        "url": "https://arxiv.org/abs/2605.04880",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "A Harmonic Mean Formulation of Average Reward Reinforcement Learning in SMDPs",
        "author": [
          "Erel Shtossel",
          "Alicia Vidler",
          "Uri Shaham",
          "Gal A. Kaminka"
        ],
        "summary": "arXiv:2605.04880v1 Announce Type: new Abstract: Recent research has revived and amplified interest in algorithms for undiscounted average reward reinforcement learning in infinite-horizon, non-episodic (continuing) tasks. Semi-Markov decision processes (SMDPs) are of particular interest. In SMDPs, discrete actions stochastically generate both rewards and durations, and the objective is to optimize the average reward rate. Existing algorithms approach this by optimizing the ratio of rewards to durations. However, when rewards and durations are non-stationary (in the infinite horizon), this can be incorrect. This paper presents a novel modified harmonic mean operator that correctly computes reward rates even under such conditions. This yields model-free learning algorithms that can work with SMDPs, while maintaining robustness to non-stationary reward and duration distributions over time. We prove theoretical properties of the modified harmonic mean operator, and empirically demonstrate its efficacy in comparison to existing algorithms.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2605.04939v1",
        "url": "https://arxiv.org/abs/2605.04939",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "Modular Reinforcement Learning For Cooperative Swarms",
        "author": [
          "Erel Shtossel",
          "Gal A. Kaminka"
        ],
        "summary": "arXiv:2605.04939v1 Announce Type: cross Abstract: A cooperative robot swarm is a collective of computationally-limited robots that share a common goal. Each robot can only interact with a small subset of its peers, without knowing how this affects the collective utility. Recent advances in distributed multi-agent reinforcement learning have demonstrated that it is possible for robots to learn how to interact effectively with others, in a manner that is aligned with the common goal, despite each robot learning independently of others. However, this requires each robot to represent a potentially combinatorial number of interaction states, challenging the memory capabilities of the robots. This paper proposes an alternative approach for representing spatial interaction states for multi-robot reinforcement learning in swarms. A modular (decomposed) representation is used, where each feature of the state is handled by a separate learning procedure, and the results aggregated. We demonstrate the efficacy of the approach in numerous experiments with simulated robot swarms carrying out foraging.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:36:24.058Z"
      },
      {
        "id": "https://www.machinebrief.com/news/revolutionizing-code-generation-with-requirement-aware-learn-5p36",
        "url": "https://www.machinebrief.com/news/revolutionizing-code-generation-with-requirement-aware-learn-5p36",
        "source_url": "https://www.machinebrief.com/rss.xml",
        "title": "Revolutionizing Code Generation with Requirement-Aware Learning",
        "author": [
          "Felix Navarro"
        ],
        "summary": "RECRL, a new curriculum reinforcement learning framework, significantly enhances code generation by addressing requirement difficulty. It shows a notable performance boost across five benchmarks.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-05T15:25:34.000Z",
        "fetched_at": "2026-05-05T16:10:22.708Z"
      },
      {
        "id": "oai:arXiv.org:2605.04368v1",
        "url": "https://arxiv.org/abs/2605.04368",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Extending Differential Temporal Difference Methods for Episodic Problems",
        "author": [
          "Kris De Asis",
          "Mohamed Elsayed",
          "Jiamin He"
        ],
        "summary": "arXiv:2605.04368v1 Announce Type: new Abstract: Differential temporal difference (TD) methods are value-based reinforcement learning algorithms that have been proposed for infinite-horizon problems. They rely on reward centering, where each reward is centered by the average reward. This keeps the return bounded and removes a value function's state-independent offset. However, reward centering can alter the optimal policy in episodic problems, limiting its applicability. Motivated by recent works that emphasize the role of normalization in streaming deep reinforcement learning, we study reward centering in episodic problems and propose a generalization of differential TD. We prove that this generalization maintains the ordering of policies in the presence of termination, and thus extends differential TD to episodic problems. We show equivalence with a form of linear TD, thereby inheriting theoretical guarantees that have been shown for those algorithms. We then extend several streaming reinforcement learning algorithms to their differential counterparts. Across a range of base algorithms and environments, we empirically validate that reward centering can improve sample efficiency in episodic problems.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2605.05110v1",
        "url": "https://arxiv.org/abs/2605.05110",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "LineRides: Line-Guided Reinforcement Learning for Bicycle Robot Stunts",
        "author": [
          "Seungeun Rho",
          "Shamel Fahmi",
          "Jeonghwan Kim",
          "Arianna Ilvonen",
          "Sehoon Ha",
          "Gabriel Nelson"
        ],
        "summary": "arXiv:2605.05110v1 Announce Type: cross Abstract: Designing reward functions for agile robotic maneuvers in reinforcement learning remains difficult, and demonstration-based approaches often require reference motions that are unavailable for novel platforms or extreme stunts. We present LineRides, a line-guided learning framework that enables a custom bicycle robot to acquire diverse, commandable stunt behaviors from a user-provided spatial guideline and sparse key-orientations, without demonstrations or explicit timing. LineRides handles physically infeasible guidelines using a tracking margin that permits controlled deviation, resolves temporal ambiguity by measuring progress via traveled distance along the guideline, and disambiguates motion details through position- and sequence-based key-orientations. We evaluate LineRides on the Ultra Mobility Vehicle (UMV) and show that the policy trained with our methods supports seamless transitions between normal driving and stunt execution, enabling five distinct stunts on command: MiniHop, LargeHop, ThreePointTurn, Backflip, and DriftTurn.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:36:24.058Z"
      },
      {
        "id": "oai:arXiv.org:2605.05112v1",
        "url": "https://arxiv.org/abs/2605.05112",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime",
        "author": [
          "Tianshu Zhu",
          "Wenyu Zhang",
          "Xiaoying Zuo",
          "Lun Tian",
          "Haotian Zhao",
          "Yucheng Zeng",
          "Jingnan Gu",
          "Daxiang Dong",
          "Jianmin Wu",
          "Dawei Yin",
          "Dou Shen"
        ],
        "summary": "arXiv:2605.05112v1 Announce Type: new Abstract: SWE-bench-style agentic reinforcement learning relies on expensive stateful trajectories, yet substantial compute is wasted on sampled rollout groups with skewed pass rates, where binary rewards provide a weak contrastive signal. We frame this inefficiency as a pass-rate control problem and show that a 50% pass rate is the most informative operating point: it maximizes reward entropy, the probability of surviving group filtering, RLOO advantage energy under GRPO, and success--failure contrastive structure. Guided by this principle, we propose Prefix Sampling (PS), which replays trajectory prefixes to steer skewed groups toward this regime: successful prefixes serve as head starts for mostly failing groups, while failing prefixes serve as handicaps for mostly passing groups. In stateful agent environments, prefix states are reconstructed through replay while replayed tokens are excluded from the loss, restricting optimization to continuations generated by the current policy. On SWE-bench-style agentic RL, PS delivers end-to-end wall-clock speedups of 2.01x on Qwen3-14B and 1.55x on Qwen3-32B while preserving or improving final verified performance. For 14B, the SWE-bench Verified peak rises from the baseline peak of 0.273 to 0.295 under PS. Additional mathematical reasoning experiments on AIME 2025 show the same pass-rate control pattern and decompose the gains into replay, bidirectional coverage, and adaptive control.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2605.02913v1",
        "url": "https://arxiv.org/abs/2605.02913",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning",
        "author": [
          "Rohan Surana",
          "Gagan Mundada",
          "Xunyi Jiang",
          "Chuhan Wang",
          "Zhenwei Tang",
          "Difan Jiao",
          "Zihan Huang",
          "Yuxin Xiong",
          "Junda Wu",
          "Sheldon Yu",
          "Xintong Li",
          "Raghav Jain",
          "Nikki Kuang",
          "Sizhe Zhou",
          "Bowen Jin",
          "Zhendong Chu",
          "Tong Yu",
          "Ryan Rossi",
          "Kuan-Hao Huang",
          "Jingbo Shang",
          "Jiawei Han",
          "Julian McAuley"
        ],
        "summary": "arXiv:2605.02913v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a central post-training tool for improving the reasoning abilities of large language models (LLMs). In these systems, the rollout, the trajectory sampled from a prompt to termination, including intermediate reasoning steps and optional tool or environment interactions, determines the data the optimizer learns from, yet rollout design is often underreported. This survey provides an optimizer-agnostic view of rollout strategies for RL-based post-training of reasoning LLMs. We formalize rollout pipelines with unified notation and introduce Generate-Filter-Control-Replay (GFCR), a lifecycle taxonomy that decomposes rollout pipelines into four modular stages: Generate proposes candidate trajectories and topologies; Filter constructs intermediate signals via verifiers, judges, critics; Control allocates compute and makes continuation/branching/stopping decisions under budgets; and Replay retains and reuses artifacts across rollouts without weight updates, including self-evolving curricula that autonomously generate new training tasks. We complement GFCR with a criterion taxonomy of reliability, coverage, and cost sensitivity that characterizes rollout trade-offs. Using this framework, we synthesize methods spanning RL with verifiable rewards, process supervision, judge-based gating, guided and tree/segment rollouts, adaptive compute allocation, early-exit and partial rollouts, throughput optimization, and replay/recomposition for self-improvement. We ground the framework with case studies in math, code/SQL, multimodal reasoning, tool-using agents, and agentic skill benchmarks that evaluate skill induction, reuse, and cross-task transfer. Finally, we provide a diagnostic index that maps common rollout pathologies to GFCR modules and mitigation levers, alongside open challenges for building reproducible, compute-efficient, and trustworthy rollout pipelines.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-06T04:00:00.000Z",
        "fetched_at": "2026-05-06T16:00:29.193Z"
      },
      {
        "id": "oai:arXiv.org:2605.04542v1",
        "url": "https://arxiv.org/abs/2605.04542",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation",
        "author": [
          "Akiyoshi Tomihari",
          "Issei Sato"
        ],
        "summary": "arXiv:2605.04542v1 Announce Type: new Abstract: Recent analyses question whether reinforcement learning (RL) is responsible for strong reasoning in large language models (LLMs). At the same time, distillation and inference-time sampling, including power sampling, have emerged as effective ways to improve LLM performance. However, the relationship among RL, distillation, and sampling remains unclear. In this study, we focus on the power distribution, the target distribution of power sampling, and show that the power distribution bridges sampling, self-reward KL-regularized RL, and self-distillation. From the sampling perspective, we show that inexpensive local approximations cannot reproduce sequence-level power without information about possible suffixes. From the RL perspective, the power distribution is the closed-form optimizer of KL-regularized RL when the model's sequence-level log-probabilities are used as the reward. This identification leads to power self-distillation, an offline distillation surrogate that shares the same target distribution and amortizes the cost of power sampling into supervised training on teacher samples. We further show that power self-distillation can achieve self-reward sharpening, while improvement in a downstream true reward is governed by the covariance between true reward and self-reward under the power distribution. Experiments on reasoning tasks support our analysis: power sampling raises self-reward, true-reward gains depend on alignment with self-reward, and power self-distillation can match or exceed the performance of power sampling at much lower inference cost.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2604.01345v2",
        "url": "https://arxiv.org/abs/2604.01345",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Malliavin Calculus for Counterfactual Gradient Estimation in Adaptive Inverse Reinforcement Learning",
        "author": [
          "Vikram Krishnamurthy",
          "Luke Snow"
        ],
        "summary": "arXiv:2604.01345v2 Announce Type: replace Abstract: Inverse reinforcement learning (IRL) recovers the loss function of a forward learner from its observed responses. Adaptive IRL aims to reconstruct the loss function of a forward learner by passively observing its gradients as it performs reinforcement learning (RL). This paper proposes a novel passive Langevin-based algorithm that achieves adaptive IRL. The key difficulty in adaptive IRL is that the required gradients in the passive algorithm are counterfactual, that is, they are conditioned on events of probability zero under the forward learner's trajectory. Therefore, naive Monte Carlo estimators are prohibitively inefficient, and kernel smoothing, though common, suffers from slow convergence. We overcome this by employing Malliavin calculus to efficiently estimate the required counterfactual gradients. We reformulate the counterfactual conditioning as a ratio of unconditioned expectations involving Malliavin quantities, thus recovering standard estimation rates. We derive the necessary Malliavin derivatives and their adjoint Skorohod integral formulations for a general Langevin structure, and provide a concrete algorithmic approach which exploits these for counterfactual gradient estimation.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      }
    ],
    "total": 3509
  }
}

▼✓Web search query: "Gymnasium Farama Foundation reinforcement learning 2024 2025", result count: 15

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"results\": [\n    {\n      \"title\": \"Python Reinforcement Learning using Gymnasium – Full Course\",\n      \"url\": \"https://www.youtube.com/watch?v=vufTSJbzKGU\",\n      \"description\": \"Gymnasium is an open source Python library maintained by the Farama Foundation ... Stanford CS224R Deep Reinforcement Learning | Spring 2025\"\n    },\n    {\n      \"title\": \"Gymnasium Documentation\",\n      \"url\": \"https://gymnasium.farama.org/index.html\",\n      \"description\": \"Light Logo Dark Logo Gymnasium Documentation. Farama Foundation logo Farama Foundation. ## An API standard for reinforcement learning with a diverse collection of reference environments. **Gymnasium is a maintained fork of OpenAI’s Gym library.** The Gymnasium interface is simple, pythonic, and capable of representing general RL problems, and has a migration guide for old Gym environments:. import  gymnasium  as  gym # Initialise the environment env = gym. make(\\\"LunarLander-v3\\\", render_mode = \\\"human\\\") # Reset the environment to generate the first observation observation, info = env. reset(seed = 42) for _ in range(1000): # this is where you would insert your policy action = env. sample()# step (transition) through the environment with the action# receiving the next observation, reward and if the episode has terminated or truncated observation, reward, terminated, truncated, info = env. 
step(action) # If the episode has ended then we can reset to start a new episode if terminated or truncated: observation, info = env.\"\n    },\n    {\n      \"title\": \"[PDF] Gymnasium: A Standard Interface for Reinforcement Learning ...\",\n      \"url\": \"https://arxiv.org/pdf/2407.17032\",\n      \"description\": \"Gymnasium: A Standardized Interface for Reinforcement Learning Environments Mark Towers† University of Southampton & Farama Foundation mt5g17@soton.ac.uk Ariel Kwiatkowski†‡ Meta AI, FAIR & Farama Foundation kwiat@meta.com Jordan Terry† Farama Foundation jkterry@farama.org John U. Gymnasium is an open-source library that provides a standardized API for RL environments, aiming to tackle this issue, with over 18 million installations. Gymnasium’s main feature is a set of abstractions that allow for wide interoperability between envi-ronments and training algorithms, making it easier for researchers to develop and test new environments and/or RL algorithms. In addition, Gymnasium provides a collection of built-in easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL re-search. In summary, Gymnasium provides the following novel contributions to the field of DRL: • A maintained API for handling Reinforcement Learning Environment with a wide range of built-in environments (Figure 1), a collection of compatible external environments, and support from numerous training libraries.\"\n    },\n    {\n      \"title\": \"[PDF] arXiv:2407.17032v1 [cs.LG] 24 Jul 2024 - ePrints Soton\",\n      \"url\": \"https://eprints.soton.ac.uk/493069/1/2407.17032v1.pdf\",\n      \"description\": \"This paper outlines the main design decisions for Gymnasium, its key features, and the differences to alternative APIs. 
1 Introduction With the publication of a Deep Q-Networks (DQN) [Mnih et al., 2013], Reinforcement Learning (RL) was awoken from its Artificial Intelligence (AI) winter, showing that a general neural network-based algorithm can achieve expert-level performance across a range of complex tasks. • Third party - Arcade Learning Environments [Bellemare et al., 2013] for playing Atari 2600 ROMs, Safety Gymnasium [Ji et al., 2023] for testing safe RL algorithms in robotics environments, HighwayEnv [Leurent, 2018] for simulating various driving situations and PyFlyt [Tai et al., 2023] for drone flying simulation.\"\n    },\n    {\n      \"title\": \"External Environments - Gymnasium Documentation\",\n      \"url\": \"https://gymnasium.farama.org/environments/third_party_environments/\",\n      \"description\": \"The Farama Foundation maintains a number of other projects, which use the Gymnasium API, environments include: gridworlds (Minigrid), robotics (Gymnasium-Robotics), 3D navigation (Miniworld), web interaction (MiniWoB++), arcade games (Arcade Learning Environment), Doom (ViZDoom), Meta-objective robotics (Metaworld), autonomous driving (HighwayEnv), Retro Games (stable-retro), and many more. * BlueSky-Gym: Reinforcement Learning Environments for Air Traffic Applications. AnyTrading is a collection of Gym environments for reinforcement learning-based trading algorithms with a great focus on simplicity, flexibility, and comprehensiveness. Craftium wraps the Minetest game engine into the Gymnasium API, providing a modern and easy-to-use platform for designing Minecraft-like RL environments. * gym-pybullet-drones: Environments for quadcopter control. * panda-gym: Robotics environments using the PyBullet physics engine. A simple environment for benchmarking single and multi-agent reinforcement learning algorithms on a clone of Slime Volleyball game. 
Robo-gym provides a collection of reinforcement learning environments involving robotic tasks applicable in both simulation and real-world robotics. Gym environments that let you control real robots in a laboratory via the internet.\"\n    },\n    {\n      \"title\": \"Farama-Foundation/Gymnasium: An API standard for single ... - GitHub\",\n      \"url\": \"https://github.com/farama-foundation/gymnasium\",\n      \"description\": \"Gymnasium is an open source Python library for developing and comparing reinforcement learning algorithms by providing a standard API to communicate between\"\n    },\n    {\n      \"title\": \"Gymnasium: A Standard Interface for Reinforcement Learning Environments | OpenReview\",\n      \"url\": \"https://openreview.net/forum?id=qPMLvJxtPK&referrer=%5Bthe%20profile%20of%20John%20U.%20Balis%5D(%2Fprofile%3Fid%3D~John_U._Balis1)\",\n      \"description\": \"back arrowBack to **the profile of John U. ## Gymnasium: A Standard Interface for Reinforcement Learning Environments. **TL;DR:** An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym). **Abstract:** Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium's main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential. 
Gymnasium is available online at \\\\url{https://github.com/Farama-Foundation/Gymnasium}.\"\n    },\n    {\n      \"title\": \"Blog | The Farama Foundation\",\n      \"url\": \"https://farama.org/blog\",\n      \"description\": \"Gymnasium offers three options, for which, we present descriptions ... reinforcement learning (“RL”) libraries in a neutral nonprofit body.\"\n    },\n    {\n      \"title\": \"Gymnasium: A Standard Interface for Reinforcement Learning ...\",\n      \"url\": \"https://ui.adsabs.harvard.edu/abs/2024arXiv240717032T/abstract\",\n      \"description\": \"Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue.\"\n    },\n    {\n      \"title\": \"‪Jordan K Terry‬ - ‪Google Scholar‬\",\n      \"url\": \"https://scholar.google.com/citations?user=gb83gcIAAAAJ&hl=en\",\n      \"description\": \"Gymnasium: A standard interface for reinforcement learning environments. M ... URL http://github. com/Farama-Foundation/Gymnasium-Robotics, 10, 0. 66\"\n    },\n    {\n      \"title\": \"A Standardized Interface for Reinforcement Learning Environments\",\n      \"url\": \"https://arxiv.org/html/2407.17032v4\",\n      \"description\": \"By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full\"\n    },\n    {\n      \"title\": \"Build Custom Gymnasium Environments: Step-by-Step Guide\",\n      \"url\": \"https://ai.gopubby.com/creating-custom-environments-with-gymnasium-a-step-by-step-guide-266f23398fcd\",\n      \"description\": \"Learn How to Build a 2D Grid Game Environment for AI and Reinforcement Learning Using Farama Foundation's Gymnasium.\"\n    },\n    {\n      \"title\": \"Reinforcement Learning with Gymnasium: A Practical Guide\",\n      \"url\": \"https://www.datacamp.com/tutorial/reinforcement-learning-with-gymnasium\",\n      \"description\": \"# Reinforcement Learning with Gymnasium: A Practical Guide. 
Understand the basics of Reinforcement Learning (RL) and explore the Gymnasium software package to build and test RL algorithms using Python. Reinforcement Learning (RL) is one of the three main machine learning paradigms, the other two being supervised and unsupervised learning. In RL, an agent learns to interact with its environment to maximize the cumulative rewards. Reinforcement Learning with Human Feedback (RLHF) allows the agent to modify behavior based on human inputs at each step. When deep neural networks are used to apply RL algorithms, it is called Deep Reinforcement Learning. In this tutorial, I’ll show you how to get started with Gymnasium, an open-source Python library for developing and comparing reinforcement learning algorithms. I'll demonstrate how to set it up, explore various RL environments, and use Python to build a simple agent to implement an RL algorithm. Gymnasium is an open-source Python library designed to support the development of RL algorithms.\"\n    },\n    {\n      \"title\": \"[PDF] GYMNASIUM - OpenReview\",\n      \"url\": \"https://openreview.net/pdf?id=feFlfuOse1\",\n      \"description\": \"Through this unified framework, Gymnasium significantly streamlines the process of developing and testing RL algorithms, enabling researchers to focus more on\"\n    },\n    {\n      \"title\": \"Gymnasium for Reinforcement Learning | by ANSHUMAN PRAJAPATI\",\n      \"url\": \"https://medium.com/@abdprajapati090/gymnasium-for-reinforcement-learning-412fd895e7b9\",\n      \"description\": \"Gymnasium is an open-source Python library that provides: A standard API for defining environments in reinforcement learning (RL). OpenReview+\"\n    }\n  ]\n}"
    }
  ],
  "details": {
    "results": [
      {
        "title": "Python Reinforcement Learning using Gymnasium – Full Course",
        "url": "https://www.youtube.com/watch?v=vufTSJbzKGU",
        "description": "Gymnasium is an open source Python library maintained by the Farama Foundation ... Stanford CS224R Deep Reinforcement Learning | Spring 2025"
      },
      {
        "title": "Gymnasium Documentation",
        "url": "https://gymnasium.farama.org/index.html",
        "description": "Light Logo Dark Logo Gymnasium Documentation. Farama Foundation logo Farama Foundation. ## An API standard for reinforcement learning with a diverse collection of reference environments. **Gymnasium is a maintained fork of OpenAI’s Gym library.** The Gymnasium interface is simple, pythonic, and capable of representing general RL problems, and has a migration guide for old Gym environments:. import  gymnasium  as  gym # Initialise the environment env = gym. make(\"LunarLander-v3\", render_mode = \"human\") # Reset the environment to generate the first observation observation, info = env. reset(seed = 42) for _ in range(1000): # this is where you would insert your policy action = env. sample()# step (transition) through the environment with the action# receiving the next observation, reward and if the episode has terminated or truncated observation, reward, terminated, truncated, info = env. step(action) # If the episode has ended then we can reset to start a new episode if terminated or truncated: observation, info = env."
      },
      {
        "title": "[PDF] Gymnasium: A Standard Interface for Reinforcement Learning ...",
        "url": "https://arxiv.org/pdf/2407.17032",
        "description": "Gymnasium: A Standardized Interface for Reinforcement Learning Environments Mark Towers† University of Southampton & Farama Foundation mt5g17@soton.ac.uk Ariel Kwiatkowski†‡ Meta AI, FAIR & Farama Foundation kwiat@meta.com Jordan Terry† Farama Foundation jkterry@farama.org John U. Gymnasium is an open-source library that provides a standardized API for RL environments, aiming to tackle this issue, with over 18 million installations. Gymnasium’s main feature is a set of abstractions that allow for wide interoperability between envi-ronments and training algorithms, making it easier for researchers to develop and test new environments and/or RL algorithms. In addition, Gymnasium provides a collection of built-in easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL re-search. In summary, Gymnasium provides the following novel contributions to the field of DRL: • A maintained API for handling Reinforcement Learning Environment with a wide range of built-in environments (Figure 1), a collection of compatible external environments, and support from numerous training libraries."
      },
      {
        "title": "[PDF] arXiv:2407.17032v1 [cs.LG] 24 Jul 2024 - ePrints Soton",
        "url": "https://eprints.soton.ac.uk/493069/1/2407.17032v1.pdf",
        "description": "This paper outlines the main design decisions for Gymnasium, its key features, and the differences to alternative APIs. 1 Introduction With the publication of a Deep Q-Networks (DQN) [Mnih et al., 2013], Reinforcement Learning (RL) was awoken from its Artificial Intelligence (AI) winter, showing that a general neural network-based algorithm can achieve expert-level performance across a range of complex tasks. • Third party - Arcade Learning Environments [Bellemare et al., 2013] for playing Atari 2600 ROMs, Safety Gymnasium [Ji et al., 2023] for testing safe RL algorithms in robotics environments, HighwayEnv [Leurent, 2018] for simulating various driving situations and PyFlyt [Tai et al., 2023] for drone flying simulation."
      },
      {
        "title": "External Environments - Gymnasium Documentation",
        "url": "https://gymnasium.farama.org/environments/third_party_environments/",
        "description": "The Farama Foundation maintains a number of other projects, which use the Gymnasium API, environments include: gridworlds (Minigrid), robotics (Gymnasium-Robotics), 3D navigation (Miniworld), web interaction (MiniWoB++), arcade games (Arcade Learning Environment), Doom (ViZDoom), Meta-objective robotics (Metaworld), autonomous driving (HighwayEnv), Retro Games (stable-retro), and many more. * BlueSky-Gym: Reinforcement Learning Environments for Air Traffic Applications. AnyTrading is a collection of Gym environments for reinforcement learning-based trading algorithms with a great focus on simplicity, flexibility, and comprehensiveness. Craftium wraps the Minetest game engine into the Gymnasium API, providing a modern and easy-to-use platform for designing Minecraft-like RL environments. * gym-pybullet-drones: Environments for quadcopter control. * panda-gym: Robotics environments using the PyBullet physics engine. A simple environment for benchmarking single and multi-agent reinforcement learning algorithms on a clone of Slime Volleyball game. Robo-gym provides a collection of reinforcement learning environments involving robotic tasks applicable in both simulation and real-world robotics. Gym environments that let you control real robots in a laboratory via the internet."
      },
      {
        "title": "Farama-Foundation/Gymnasium: An API standard for single ... - GitHub",
        "url": "https://github.com/farama-foundation/gymnasium",
        "description": "Gymnasium is an open source Python library for developing and comparing reinforcement learning algorithms by providing a standard API to communicate between"
      },
      {
        "title": "Gymnasium: A Standard Interface for Reinforcement Learning Environments | OpenReview",
        "url": "https://openreview.net/forum?id=qPMLvJxtPK&referrer=%5Bthe%20profile%20of%20John%20U.%20Balis%5D(%2Fprofile%3Fid%3D~John_U._Balis1)",
        "description": "back arrowBack to **the profile of John U. ## Gymnasium: A Standard Interface for Reinforcement Learning Environments. **TL;DR:** An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym). **Abstract:** Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium's main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential. Gymnasium is available online at \\url{https://github.com/Farama-Foundation/Gymnasium}."
      },
      {
        "title": "Blog | The Farama Foundation",
        "url": "https://farama.org/blog",
        "description": "Gymnasium offers three options, for which, we present descriptions ... reinforcement learning (“RL”) libraries in a neutral nonprofit body."
      },
      {
        "title": "Gymnasium: A Standard Interface for Reinforcement Learning ...",
        "url": "https://ui.adsabs.harvard.edu/abs/2024arXiv240717032T/abstract",
        "description": "Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue."
      },
      {
        "title": "‪Jordan K Terry‬ - ‪Google Scholar‬",
        "url": "https://scholar.google.com/citations?user=gb83gcIAAAAJ&hl=en",
        "description": "Gymnasium: A standard interface for reinforcement learning environments. M ... URL http://github. com/Farama-Foundation/Gymnasium-Robotics, 10, 0. 66"
      },
      {
        "title": "A Standardized Interface for Reinforcement Learning Environments",
        "url": "https://arxiv.org/html/2407.17032v4",
        "description": "By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full"
      },
      {
        "title": "Build Custom Gymnasium Environments: Step-by-Step Guide",
        "url": "https://ai.gopubby.com/creating-custom-environments-with-gymnasium-a-step-by-step-guide-266f23398fcd",
        "description": "Learn How to Build a 2D Grid Game Environment for AI and Reinforcement Learning Using Farama Foundation's Gymnasium."
      },
      {
        "title": "Reinforcement Learning with Gymnasium: A Practical Guide",
        "url": "https://www.datacamp.com/tutorial/reinforcement-learning-with-gymnasium",
        "description": "# Reinforcement Learning with Gymnasium: A Practical Guide. Understand the basics of Reinforcement Learning (RL) and explore the Gymnasium software package to build and test RL algorithms using Python. Reinforcement Learning (RL) is one of the three main machine learning paradigms, the other two being supervised and unsupervised learning. In RL, an agent learns to interact with its environment to maximize the cumulative rewards. Reinforcement Learning with Human Feedback (RLHF) allows the agent to modify behavior based on human inputs at each step. When deep neural networks are used to apply RL algorithms, it is called Deep Reinforcement Learning. In this tutorial, I’ll show you how to get started with Gymnasium, an open-source Python library for developing and comparing reinforcement learning algorithms. I'll demonstrate how to set it up, explore various RL environments, and use Python to build a simple agent to implement an RL algorithm. Gymnasium is an open-source Python library designed to support the development of RL algorithms."
      },
      {
        "title": "[PDF] GYMNASIUM - OpenReview",
        "url": "https://openreview.net/pdf?id=feFlfuOse1",
        "description": "Through this unified framework, Gymnasium significantly streamlines the process of developing and testing RL algorithms, enabling researchers to focus more on"
      },
      {
        "title": "Gymnasium for Reinforcement Learning | by ANSHUMAN PRAJAPATI",
        "url": "https://medium.com/@abdprajapati090/gymnasium-for-reinforcement-learning-412fd895e7b9",
        "description": "Gymnasium is an open-source Python library that provides: A standard API for defining environments in reinforcement learning (RL). OpenReview+"
      }
    ]
  }
}
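
Note on the documentation snippet returned above: the search engine flattened the example loop, so the `reset` / `step` contract is hard to read. The sketch below shows the shape of that loop only. It does not import Gymnasium; `CoinFlipEnv`, `sample_action`, and `run` are hypothetical stand-ins written for illustration, and the reward rule and episode length are assumptions. What it mirrors from the snippet is that `reset` returns `(observation, info)` and `step` returns `(observation, reward, terminated, truncated, info)`.

```python
import random


class CoinFlipEnv:
    """Hypothetical stand-in that mimics the reset/step contract (not Gymnasium)."""

    def __init__(self, max_steps=10):
        self.max_steps = max_steps
        self._rng = random.Random()
        self._steps = 0

    def reset(self, seed=None):
        # As in the quoted snippet, reset returns (observation, info).
        if seed is not None:
            self._rng.seed(seed)
        self._steps = 0
        return 0, {}

    def sample_action(self):
        # Stand-in for sampling a random action: pick 0 or 1.
        return self._rng.randint(0, 1)

    def step(self, action):
        # step returns (observation, reward, terminated, truncated, info).
        self._steps += 1
        coin = self._rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        terminated = reward == 0.0                 # a wrong guess ends the episode
        truncated = self._steps >= self.max_steps  # assumed time limit reached
        return coin, reward, terminated, truncated, {}


def run(env, total_steps=100, seed=42):
    """Run the loop shape from the snippet and count finished episodes."""
    observation, info = env.reset(seed=seed)
    episodes = 0
    total_reward = 0.0
    for _ in range(total_steps):
        action = env.sample_action()  # a real policy would go here
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        # Either flag ends the episode, so reset to start a new one.
        if terminated or truncated:
            episodes += 1
            observation, info = env.reset()
    return episodes, total_reward
```

Keeping `terminated` and `truncated` as separate flags lets a training loop tell an episode that ended because of the task apart from one that was cut off by a time limit; the stand-in sets both independently for that reason.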

▼thought

▼✓Fetch web page body URL: "https://gymnasium.farama.org/index.html"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://gymnasium.farama.org/index.html\",\n  \"title\": \"Gymnasium Documentation\",\n  \"summary\": \"A standard API for reinforcement learning and a diverse set of reference environments (formerly Gym)\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><div class=\\\"page\\\">\\n          <a href=\\\"#\\\">\\n            <svg xmlns=\\\"http://www.w3.org/2000/svg\\\" viewBox=\\\"0 0 24 24\\\">\\n              <path d=\\\"M13 20h-2V8l-5.5 5.5-1.42-1.42L12 4.16l7.92 7.92-1.42 1.42L13 8v12z\\\"></path>\\n            </svg>\\n            <span>Back to top</span>\\n          </a>\\n          <div><p>\\n  <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/edit/main/docs/index.md\\\" title=\\\"Edit this page\\\">\\n    \\n    <span>Edit this page</span>\\n  </a>\\n</p>\\n            <p><label for=\\\"__toc\\\">\\n              <p>Toggle table of contents sidebar</p>\\n              <i><svg>\\n                  <use href=\\\"#svg-toc\\\"></use>\\n                </svg></i>\\n            </label>\\n          </p></div>\\n          <article role=\\\"main\\\">\\n            \\n            <figure>\\n<img alt=\\\"Gymnasium Logo\\\" src=\\\"https://gymnasium.farama.org/_images/gymnasium-text.png\\\">\\n</figure>\\n<p>\\n<h2>\\nAn API standard for reinforcement learning with a diverse collection of reference environments\\n</h2>\\n</p>\\n<figure>\\n<a href=\\\"https://gymnasium.farama.org/_images/lunar_lander.gif\\\"><img alt=\\\"Lunar Lander\\\" src=\\\"https://gymnasium.farama.org/_images/lunar_lander.gif\\\">\\n</a>\\n</figure>\\n<p><strong>Gymnasium is a maintained fork of OpenAI’s Gym library.</strong> The Gymnasium interface is simple, pythonic, and capable of representing general RL problems, and has a <a href=\\\"https://gymnasium.farama.org/introduction/migration_guide/\\\"><span>migration guide</span></a> for old Gym environments:</p>\\n<div><pre><span></span><span>import</span><span> 
</span><span>gymnasium</span><span> </span><span>as</span><span> </span><span>gym</span>\\n\\n<span># Initialise the environment</span>\\n<span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"LunarLander-v3\\\"</span><span>,</span> <span>render_mode</span><span>=</span><span>\\\"human\\\"</span><span>)</span>\\n\\n<span># Reset the environment to generate the first observation</span>\\n<span>observation</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>(</span><span>seed</span><span>=</span><span>42</span><span>)</span>\\n<span>for</span> <span>_</span> <span>in</span> <span>range</span><span>(</span><span>1000</span><span>):</span>\\n    <span># this is where you would insert your policy</span>\\n    <span>action</span> <span>=</span> <span>env</span><span>.</span><span>action_space</span><span>.</span><span>sample</span><span>()</span>\\n\\n    <span># step (transition) through the environment with the action</span>\\n    <span># receiving the next observation, reward and if the episode has terminated or truncated</span>\\n    <span>observation</span><span>,</span> <span>reward</span><span>,</span> <span>terminated</span><span>,</span> <span>truncated</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>step</span><span>(</span><span>action</span><span>)</span>\\n\\n    <span># If the episode has ended then we can reset to start a new episode</span>\\n    <span>if</span> <span>terminated</span> <span>or</span> <span>truncated</span><span>:</span>\\n        <span>observation</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>()</span>\\n\\n<span>env</span><span>.</span><span>close</span><span>()</span>\\n</pre></div>\\n\\n\\n\\n\\n\\n\\n          </article>\\n        </div></div>\"\n}"
    }
  ]
}

▼✓Fetch page content URL: "https://arxiv.org/abs/2407.17032"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://arxiv.org/abs/2407.17032\",\n  \"title\": \"Gymnasium: A Standard Interface for Reinforcement Learning Environments\",\n  \"author\": \"[Submitted on 24 Jul 2024 (v1), last revised 2 Nov 2025 (this version, v4)]\",\n  \"summary\": \"Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. However, despite its promise, RL research is often hindered b…\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><div id=\\\"content-inner\\\">\\n    \\n    \\n    <div><p><span>Authors:</span><a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Towers,+M\\\" rel=\\\"nofollow\\\">Mark Towers</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Kwiatkowski,+A\\\" rel=\\\"nofollow\\\">Ariel Kwiatkowski</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Terry,+J\\\" rel=\\\"nofollow\\\">Jordan Terry</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Balis,+J+U\\\" rel=\\\"nofollow\\\">John U. 
Balis</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=De+Cola,+G\\\" rel=\\\"nofollow\\\">Gianluca De Cola</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Deleu,+T\\\" rel=\\\"nofollow\\\">Tristan Deleu</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Goul%C3%A3o,+M\\\" rel=\\\"nofollow\\\">Manuel Goulão</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Kallinteris,+A\\\" rel=\\\"nofollow\\\">Andreas Kallinteris</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Krimmel,+M\\\" rel=\\\"nofollow\\\">Markus Krimmel</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=KG,+A\\\" rel=\\\"nofollow\\\">Arjun KG</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Perez-Vicente,+R\\\" rel=\\\"nofollow\\\">Rodrigo Perez-Vicente</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Pierr%C3%A9,+A\\\" rel=\\\"nofollow\\\">Andrea Pierré</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Schulhoff,+S\\\" rel=\\\"nofollow\\\">Sander Schulhoff</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Tai,+J+J\\\" rel=\\\"nofollow\\\">Jun Jet Tai</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Tan,+H\\\" rel=\\\"nofollow\\\">Hannah Tan</a>, <a href=\\\"https://arxiv.org/search/cs?searchtype=author&amp;query=Younis,+O+G\\\" rel=\\\"nofollow\\\">Omar G. Younis</a></p></div>            \\n    <p><a href=\\\"https://arxiv.org/pdf/2407.17032\\\">View PDF</a>\\n    <a href=\\\"https://arxiv.org/html/2407.17032v4\\\">HTML (experimental)</a></p><blockquote>\\n            <span>Abstract:</span>Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. 
However, despite its promise, RL research is often hindered by the lack of standardization in environment and algorithm implementations. This makes it difficult for researchers to compare and build upon each other's work, slowing down progress in the field. Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium's main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research. Through this unified framework, Gymnasium significantly streamlines the process of developing and testing RL algorithms, enabling researchers to focus more on innovation and less on implementation details. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential. 
Gymnasium is available online at <a href=\\\"https://github.com/Farama-Foundation/Gymnasium\\\" rel=\\\"external noopener nofollow\\\">this https URL</a>\\n    </blockquote>\\n\\n    <!--CONTEXT-->\\n    <div>\\n      <table summary=\\\"Additional metadata\\\">        <tbody><tr>\\n          <td>Comments:</td>\\n          <td>Accepted at NeurIPS Datasets and Benchmarks 2025</td>\\n        </tr>\\n<tr>\\n          <td>Subjects:</td>\\n          <td>\\n            <span>Machine Learning (cs.LG)</span>; Digital Libraries (cs.DL)</td>\\n        </tr><tr>\\n          <td>Cite as:</td>\\n          <td><span><a href=\\\"https://arxiv.org/abs/2407.17032\\\">arXiv:2407.17032</a> [cs.LG]</span></td>\\n        </tr>\\n        <tr>\\n          <td>&nbsp;</td>\\n          <td>(or <span>\\n              <a href=\\\"https://arxiv.org/abs/2407.17032v4\\\">arXiv:2407.17032v4</a> [cs.LG]</span> for this version)\\n          </td>\\n        </tr>\\n        <tr>\\n          <td>&nbsp;</td>\\n          <td>              <a href=\\\"https://doi.org/10.48550/arXiv.2407.17032\\\" id=\\\"arxiv-doi-link\\\">https://doi.org/10.48550/arXiv.2407.17032</a><div>\\n              <!-- tooltip description -->\\n              <p><span></span>                  arXiv-issued DOI via DataCite</p>\\n            </div>\\n          </td>\\n        </tr></tbody></table>\\n    </div>\\n  </div><div>\\n      <h2>Submission history</h2><p> From: Mark Towers [<a href=\\\"https://arxiv.org/show-email/97d21085/2407.17032\\\" rel=\\\"nofollow\\\">view email</a>]      <br>            <strong><a href=\\\"https://arxiv.org/abs/2407.17032v1\\\" rel=\\\"nofollow\\\">[v1]</a></strong>\\n        Wed, 24 Jul 2024 06:35:05 UTC (388 KB)<br>\\n            <strong><a href=\\\"https://arxiv.org/abs/2407.17032v2\\\" rel=\\\"nofollow\\\">[v2]</a></strong>\\n        Wed, 9 Oct 2024 20:48:15 UTC (1,320 KB)<br>\\n            <strong><a href=\\\"https://arxiv.org/abs/2407.17032v3\\\" rel=\\\"nofollow\\\">[v3]</a></strong>\\n        
Fri, 8 Nov 2024 16:08:51 UTC (1,320 KB)<br>\\n    <strong>[v4]</strong>\\n        Sun, 2 Nov 2025 13:42:19 UTC (1,018 KB)<br>\\n</p></div></div>\"\n}"
    }
  ]
}

▼✓Web search result count: 15, query: "Gymnasium vs OpenAI Gym differences history migration Farama Foundation"

▼thought

Good, I have a lot of information now. Let me get more details: the GitHub repo, the migration guide, the built-in environments, and the Farama Foundation context.

▼✓Fetch page content URL: "https://github.com/farama-foundation/gymnasium"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://github.com/farama-foundation/gymnasium\",\n  \"title\": \"GitHub - Farama-Foundation/Gymnasium: An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)\",\n  \"summary\": \"An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym) - Farama-Foundation/Gymnasium\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><div data-hpc=\\\"true\\\"><article itemprop=\\\"text\\\"><p dir=\\\"auto\\\">\\n    <a href=\\\"https://gymnasium.farama.org/\\\" rel=\\\"nofollow\\\">\\n    <img src=\\\"https://raw.githubusercontent.com/Farama-Foundation/Gymnasium/main/gymnasium-text.png\\\" width=\\\"500px\\\">\\n</a>\\n</p>\\n<p dir=\\\"auto\\\">Gymnasium is an open source Python library for developing and comparing reinforcement learning algorithms by providing a standard API to communicate between learning algorithms and environments, as well as a standard set of environments compliant with that API. 
This is a fork of OpenAI's <a href=\\\"https://github.com/openai/gym\\\">Gym</a> library by its maintainers (OpenAI handed over maintenance a few years ago to an outside team), and is where future maintenance will occur going forward.</p>\\n<p dir=\\\"auto\\\">The documentation website is at <a href=\\\"https://gymnasium.farama.org/\\\" rel=\\\"nofollow\\\">gymnasium.farama.org</a>, and we have a public discord server (which we also use to coordinate development work) that you can join here: <a href=\\\"https://discord.gg/bnJ6kubTg6\\\" rel=\\\"nofollow\\\">https://discord.gg/bnJ6kubTg6</a></p>\\n\\n<p dir=\\\"auto\\\">Gymnasium includes the following families of environments along with a wide variety of third-party environments</p>\\n<ul dir=\\\"auto\\\">\\n<li><a href=\\\"https://gymnasium.farama.org/environments/classic_control/\\\" rel=\\\"nofollow\\\">Classic Control</a> - These are classic reinforcement learning based on real-world problems and physics.</li>\\n<li><a href=\\\"https://gymnasium.farama.org/environments/box2d/\\\" rel=\\\"nofollow\\\">Box2D</a> - These environments all involve toy games based around physics control, using box2d based physics and PyGame-based rendering</li>\\n<li><a href=\\\"https://gymnasium.farama.org/environments/toy_text/\\\" rel=\\\"nofollow\\\">Toy Text</a> - These environments are designed to be extremely simple, with small discrete state and action spaces, and hence easy to learn. 
As a result, they are suitable for debugging implementations of reinforcement learning algorithms.</li>\\n<li><a href=\\\"https://gymnasium.farama.org/environments/mujoco/\\\" rel=\\\"nofollow\\\">MuJoCo</a> - A physics engine based environments with multi-joint control which are more complex than the Box2D environments.</li>\\n<li><a href=\\\"https://ale.farama.org/\\\" rel=\\\"nofollow\\\">Atari</a> - Emulator of Atari 2600 ROMs simulated that have a high range of complexity for agents to learn.</li>\\n<li><a href=\\\"https://gymnasium.farama.org/environments/third_party_environments/\\\" rel=\\\"nofollow\\\">Third-party</a> - A number of environments have been created that are compatible with the Gymnasium API. Be aware of the version that the software was created for and use the <code>apply_env_compatibility</code> in <code>gymnasium.make</code> if necessary.</li>\\n</ul>\\n<p dir=\\\"auto\\\"><h2 tabindex=\\\"-1\\\" dir=\\\"auto\\\">Installation</h2><a id=\\\"user-content-installation\\\" aria-label=\\\"Permalink: Installation\\\" href=\\\"#installation\\\"></a></p>\\n<p dir=\\\"auto\\\">To install the base Gymnasium library, use <code>pip install gymnasium</code></p>\\n<p dir=\\\"auto\\\">This does not include dependencies for all families of environments (there's a massive number, and some can be problematic to install on certain systems). You can install these dependencies for one family like <code>pip install \\\"gymnasium[atari]\\\"</code> or use <code>pip install \\\"gymnasium[all]\\\"</code> to install all dependencies.</p>\\n<p dir=\\\"auto\\\">We support and test for Python 3.10, 3.11, 3.12 and 3.13 on Linux and macOS. 
We will accept PRs related to Windows, but do not officially support it.</p>\\n<p dir=\\\"auto\\\"><h2 tabindex=\\\"-1\\\" dir=\\\"auto\\\">API</h2><a id=\\\"user-content-api\\\" aria-label=\\\"Permalink: API\\\" href=\\\"#api\\\"></a></p>\\n<p dir=\\\"auto\\\">The Gymnasium API models environments as simple Python <code>env</code> classes. Creating environment instances and interacting with them is very simple- here's an example using the \\\"CartPole-v1\\\" environment:</p>\\n<div dir=\\\"auto\\\"><pre><span>import</span> <span>gymnasium</span> <span>as</span> <span>gym</span>\\n<span>env</span> <span>=</span> <span>gym</span>.<span>make</span>(<span>\\\"CartPole-v1\\\"</span>)\\n\\n<span>observation</span>, <span>info</span> <span>=</span> <span>env</span>.<span>reset</span>(<span>seed</span><span>=</span><span>42</span>)\\n<span>for</span> <span>_</span> <span>in</span> <span>range</span>(<span>1000</span>):\\n    <span>action</span> <span>=</span> <span>env</span>.<span>action_space</span>.<span>sample</span>()\\n    <span>observation</span>, <span>reward</span>, <span>terminated</span>, <span>truncated</span>, <span>info</span> <span>=</span> <span>env</span>.<span>step</span>(<span>action</span>)\\n\\n    <span>if</span> <span>terminated</span> <span>or</span> <span>truncated</span>:\\n        <span>observation</span>, <span>info</span> <span>=</span> <span>env</span>.<span>reset</span>()\\n<span>env</span>.<span>close</span>()</pre></div>\\n<p dir=\\\"auto\\\"><h2 tabindex=\\\"-1\\\" dir=\\\"auto\\\">Notable Related Libraries</h2><a id=\\\"user-content-notable-related-libraries\\\" aria-label=\\\"Permalink: Notable Related Libraries\\\" href=\\\"#notable-related-libraries\\\"></a></p>\\n<p dir=\\\"auto\\\">Please note that this is an incomplete list, and just includes libraries that the maintainers most commonly point newcomers to when asked for recommendations.</p>\\n<ul dir=\\\"auto\\\">\\n<li><a href=\\\"https://github.com/vwxyzjn/cleanrl\\\">CleanRL</a> 
is a learning library based on the Gymnasium API. It is designed to cater to newer people in the field and provides very good reference implementations.</li>\\n<li><a href=\\\"https://github.com/Farama-Foundation/PettingZoo\\\">PettingZoo</a> is a multi-agent version of Gymnasium with a number of implemented environments, for example, multi-agent Atari environments.</li>\\n<li>The Farama Foundation also has a collection of many other <a href=\\\"https://farama.org/projects\\\" rel=\\\"nofollow\\\">environments</a> that are maintained by the same team as Gymnasium and use the Gymnasium API.</li>\\n</ul>\\n<p dir=\\\"auto\\\"><h2 tabindex=\\\"-1\\\" dir=\\\"auto\\\">Environment Versioning</h2><a id=\\\"user-content-environment-versioning\\\" aria-label=\\\"Permalink: Environment Versioning\\\" href=\\\"#environment-versioning\\\"></a></p>\\n<p dir=\\\"auto\\\">Gymnasium keeps strict versioning for reproducibility reasons. All environments end in a suffix like \\\"-v0\\\".  When changes are made to environments that might impact learning results, the number is increased by one to prevent potential confusion. 
These were inherited from Gym.</p>\\n<p dir=\\\"auto\\\"><h2 tabindex=\\\"-1\\\" dir=\\\"auto\\\">Contributing</h2><a id=\\\"user-content-contributing\\\" aria-label=\\\"Permalink: Contributing\\\" href=\\\"#contributing\\\"></a></p>\\n<p dir=\\\"auto\\\">We welcome contributions from the community!\\nPlease see our <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/blob/main/CONTRIBUTING.md\\\">CONTRIBUTING.md</a> for details on how to get started.</p>\\n<p dir=\\\"auto\\\"><h2 tabindex=\\\"-1\\\" dir=\\\"auto\\\">Support Gymnasium's Development</h2><a id=\\\"user-content-support-gymnasiums-development\\\" aria-label=\\\"Permalink: Support Gymnasium's Development\\\" href=\\\"#support-gymnasiums-development\\\"></a></p>\\n<p dir=\\\"auto\\\">If you are financially able to do so and would like to support the development of Gymnasium, please join others in the community in <a href=\\\"https://github.com/sponsors/Farama-Foundation\\\">donating to us</a>.</p>\\n<p dir=\\\"auto\\\"><h2 tabindex=\\\"-1\\\" dir=\\\"auto\\\">Citation</h2><a id=\\\"user-content-citation\\\" aria-label=\\\"Permalink: Citation\\\" href=\\\"#citation\\\"></a></p>\\n<p dir=\\\"auto\\\">You can cite Gymnasium using our related paper (<a href=\\\"https://arxiv.org/abs/2407.17032\\\" rel=\\\"nofollow\\\">https://arxiv.org/abs/2407.17032</a>) as:</p>\\n<div><pre><code>@article{towers2024gymnasium,\\n  title={Gymnasium: A Standard Interface for Reinforcement Learning Environments},\\n  author={Towers, Mark and Kwiatkowski, Ariel and Terry, Jordan and Balis, John U and De Cola, Gianluca and Deleu, Tristan and Goul{\\\\~a}o, Manuel and Kallinteris, Andreas and Krimmel, Markus and KG, Arjun and others},\\n  journal={arXiv preprint arXiv:2407.17032},\\n  year={2024}\\n}\\n</code></pre></div>\\n</article></div></div>\"\n}"
    }
  ],
  "details": {
    "url": "https://github.com/farama-foundation/gymnasium",
    "title": "GitHub - Farama-Foundation/Gymnasium: An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)",
    "summary": "An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym) - Farama-Foundation/Gymnasium",
    "content": "<div id=\"readability-page-1\" class=\"page\"><div data-hpc=\"true\"><article itemprop=\"text\"><p dir=\"auto\">\n    <a href=\"https://gymnasium.farama.org/\" rel=\"nofollow\">\n    <img src=\"https://raw.githubusercontent.com/Farama-Foundation/Gymnasium/main/gymnasium-text.png\" width=\"500px\">\n</a>\n</p>\n<p dir=\"auto\">Gymnasium is an open source Python library for developing and comparing reinforcement learning algorithms by providing a standard API to communicate between learning algorithms and environments, as well as a standard set of environments compliant with that API. 
This is a fork of OpenAI's <a href=\"https://github.com/openai/gym\">Gym</a> library by its maintainers (OpenAI handed over maintenance a few years ago to an outside team), and is where future maintenance will occur going forward.</p>\n<p dir=\"auto\">The documentation website is at <a href=\"https://gymnasium.farama.org/\" rel=\"nofollow\">gymnasium.farama.org</a>, and we have a public discord server (which we also use to coordinate development work) that you can join here: <a href=\"https://discord.gg/bnJ6kubTg6\" rel=\"nofollow\">https://discord.gg/bnJ6kubTg6</a></p>\n\n<p dir=\"auto\">Gymnasium includes the following families of environments along with a wide variety of third-party environments</p>\n<ul dir=\"auto\">\n<li><a href=\"https://gymnasium.farama.org/environments/classic_control/\" rel=\"nofollow\">Classic Control</a> - These are classic reinforcement learning based on real-world problems and physics.</li>\n<li><a href=\"https://gymnasium.farama.org/environments/box2d/\" rel=\"nofollow\">Box2D</a> - These environments all involve toy games based around physics control, using box2d based physics and PyGame-based rendering</li>\n<li><a href=\"https://gymnasium.farama.org/environments/toy_text/\" rel=\"nofollow\">Toy Text</a> - These environments are designed to be extremely simple, with small discrete state and action spaces, and hence easy to learn. 
As a result, they are suitable for debugging implementations of reinforcement learning algorithms.</li>\n<li><a href=\"https://gymnasium.farama.org/environments/mujoco/\" rel=\"nofollow\">MuJoCo</a> - Physics-engine-based environments with multi-joint control, which are more complex than the Box2D environments.</li>\n<li><a href=\"https://ale.farama.org/\" rel=\"nofollow\">Atari</a> - Emulated Atari 2600 games that offer a wide range of complexity for agents to learn.</li>\n<li><a href=\"https://gymnasium.farama.org/environments/third_party_environments/\" rel=\"nofollow\">Third-party</a> - A number of environments have been created that are compatible with the Gymnasium API. Be aware of the version that the software was created for and use the <code>apply_env_compatibility</code> in <code>gymnasium.make</code> if necessary.</li>\n</ul>\n<p dir=\"auto\"><h2 tabindex=\"-1\" dir=\"auto\">Installation</h2><a id=\"user-content-installation\" aria-label=\"Permalink: Installation\" href=\"#installation\"></a></p>\n<p dir=\"auto\">To install the base Gymnasium library, use <code>pip install gymnasium</code>.</p>\n<p dir=\"auto\">This does not include dependencies for all families of environments (there's a massive number, and some can be problematic to install on certain systems). You can install the dependencies for a single family with, for example, <code>pip install \"gymnasium[atari]\"</code>, or use <code>pip install \"gymnasium[all]\"</code> to install all dependencies.</p>\n<p dir=\"auto\">We support and test for Python 3.10, 3.11, 3.12 and 3.13 on Linux and macOS. We will accept PRs related to Windows, but do not officially support it.</p>\n<p dir=\"auto\"><h2 tabindex=\"-1\" dir=\"auto\">API</h2><a id=\"user-content-api\" aria-label=\"Permalink: API\" href=\"#api\"></a></p>\n<p dir=\"auto\">The Gymnasium API models environments as simple Python <code>env</code> classes. 
Creating environment instances and interacting with them is very simple- here's an example using the \"CartPole-v1\" environment:</p>\n<div dir=\"auto\"><pre><span>import</span> <span>gymnasium</span> <span>as</span> <span>gym</span>\n<span>env</span> <span>=</span> <span>gym</span>.<span>make</span>(<span>\"CartPole-v1\"</span>)\n\n<span>observation</span>, <span>info</span> <span>=</span> <span>env</span>.<span>reset</span>(<span>seed</span><span>=</span><span>42</span>)\n<span>for</span> <span>_</span> <span>in</span> <span>range</span>(<span>1000</span>):\n    <span>action</span> <span>=</span> <span>env</span>.<span>action_space</span>.<span>sample</span>()\n    <span>observation</span>, <span>reward</span>, <span>terminated</span>, <span>truncated</span>, <span>info</span> <span>=</span> <span>env</span>.<span>step</span>(<span>action</span>)\n\n    <span>if</span> <span>terminated</span> <span>or</span> <span>truncated</span>:\n        <span>observation</span>, <span>info</span> <span>=</span> <span>env</span>.<span>reset</span>()\n<span>env</span>.<span>close</span>()</pre></div>\n<p dir=\"auto\"><h2 tabindex=\"-1\" dir=\"auto\">Notable Related Libraries</h2><a id=\"user-content-notable-related-libraries\" aria-label=\"Permalink: Notable Related Libraries\" href=\"#notable-related-libraries\"></a></p>\n<p dir=\"auto\">Please note that this is an incomplete list, and just includes libraries that the maintainers most commonly point newcomers to when asked for recommendations.</p>\n<ul dir=\"auto\">\n<li><a href=\"https://github.com/vwxyzjn/cleanrl\">CleanRL</a> is a learning library based on the Gymnasium API. 
It is designed to cater to newer people in the field and provides very good reference implementations.</li>\n<li><a href=\"https://github.com/Farama-Foundation/PettingZoo\">PettingZoo</a> is a multi-agent version of Gymnasium with a number of implemented environments, for example, multi-agent Atari environments.</li>\n<li>The Farama Foundation also has a collection of many other <a href=\"https://farama.org/projects\" rel=\"nofollow\">environments</a> that are maintained by the same team as Gymnasium and use the Gymnasium API.</li>\n</ul>\n<p dir=\"auto\"><h2 tabindex=\"-1\" dir=\"auto\">Environment Versioning</h2><a id=\"user-content-environment-versioning\" aria-label=\"Permalink: Environment Versioning\" href=\"#environment-versioning\"></a></p>\n<p dir=\"auto\">Gymnasium keeps strict versioning for reproducibility reasons. All environments end in a suffix like \"-v0\".  When changes are made to environments that might impact learning results, the number is increased by one to prevent potential confusion. 
These were inherited from Gym.</p>\n<p dir=\"auto\"><h2 tabindex=\"-1\" dir=\"auto\">Contributing</h2><a id=\"user-content-contributing\" aria-label=\"Permalink: Contributing\" href=\"#contributing\"></a></p>\n<p dir=\"auto\">We welcome contributions from the community!\nPlease see our <a href=\"https://github.com/Farama-Foundation/Gymnasium/blob/main/CONTRIBUTING.md\">CONTRIBUTING.md</a> for details on how to get started.</p>\n<p dir=\"auto\"><h2 tabindex=\"-1\" dir=\"auto\">Support Gymnasium's Development</h2><a id=\"user-content-support-gymnasiums-development\" aria-label=\"Permalink: Support Gymnasium's Development\" href=\"#support-gymnasiums-development\"></a></p>\n<p dir=\"auto\">If you are financially able to do so and would like to support the development of Gymnasium, please join others in the community in <a href=\"https://github.com/sponsors/Farama-Foundation\">donating to us</a>.</p>\n<p dir=\"auto\"><h2 tabindex=\"-1\" dir=\"auto\">Citation</h2><a id=\"user-content-citation\" aria-label=\"Permalink: Citation\" href=\"#citation\"></a></p>\n<p dir=\"auto\">You can cite Gymnasium using our related paper (<a href=\"https://arxiv.org/abs/2407.17032\" rel=\"nofollow\">https://arxiv.org/abs/2407.17032</a>) as:</p>\n<div><pre><code>@article{towers2024gymnasium,\n  title={Gymnasium: A Standard Interface for Reinforcement Learning Environments},\n  author={Towers, Mark and Kwiatkowski, Ariel and Terry, Jordan and Balis, John U and De Cola, Gianluca and Deleu, Tristan and Goul{\\~a}o, Manuel and Kallinteris, Andreas and Krimmel, Markus and KG, Arjun and others},\n  journal={arXiv preprint arXiv:2407.17032},\n  year={2024}\n}\n</code></pre></div>\n</article></div></div>"
  }
}
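
A rough, self-contained sketch of the interaction loop quoted in the README above. It deliberately avoids importing Gymnasium: `CoinFlipEnv` and `run` are hypothetical names invented for illustration, and the class only mimics the `reset(seed=...)` and five-value `step()` signatures shown in the README rather than subclassing the real `gymnasium.Env`.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment (not part of Gymnasium).

    It imitates the reset/step signatures from the README example.
    """

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self._rng = random.Random()
        self._steps = 0

    def reset(self, seed=None):
        # Re-seed only when a seed is passed; otherwise keep the RNG state.
        if seed is not None:
            self._rng = random.Random(seed)
        self._steps = 0
        return 0, {}  # (observation, info)

    def step(self, action):
        self._steps += 1
        heads = self._rng.random() < 0.5
        reward = 1.0 if action == int(heads) else 0.0
        # terminated: the task itself ended (the guess was correct).
        terminated = reward == 1.0
        # truncated: an external step limit cut the episode short.
        truncated = (not terminated) and self._steps >= self.max_steps
        return int(heads), reward, terminated, truncated, {}


def run(env, total_steps, seed):
    """Mirror the README loop and count how many episodes finished."""
    observation, info = env.reset(seed=seed)
    episodes = 0
    for _ in range(total_steps):
        action = 1  # fixed guess; a real agent would choose from the observation
        observation, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            episodes += 1
            observation, info = env.reset()
    return episodes


if __name__ == "__main__":
    print(run(CoinFlipEnv(), total_steps=100, seed=42))
```

Seeding once through `reset(seed=...)` and then calling `reset()` without a seed keeps the run reproducible, which is the pattern the README loop follows.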

▼✓Fetch page text, URL: "https://gymnasium.farama.org/introduction/migration_guide/"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://gymnasium.farama.org/introduction/migration_guide/\",\n  \"title\": \"Gymnasium Documentation\",\n  \"summary\": \"A standard API for reinforcement learning and a diverse set of reference environments (formerly Gym)\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><article role=\\\"main\\\">\\n            \\n            <section id=\\\"gym-migration-guide\\\">\\n<h2>Gym Migration Guide<a href=\\\"#gym-migration-guide\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<section id=\\\"who-should-read-this-guide\\\">\\n<h2>Who Should Read This Guide?<a href=\\\"#who-should-read-this-guide\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p><strong>If you’re new to Gymnasium</strong>: You can probably skip this page! This guide is for users migrating from older versions of OpenAI Gym. If you’re just starting with RL, head to <a href=\\\"https://gymnasium.farama.org/introduction/basic_usage/\\\"><span>Basic Usage</span></a> instead.</p>\\n<p><strong>If you’re migrating from OpenAI Gym</strong>: This guide will help you update your code to work with Gymnasium. The changes are significant but straightforward once you understand the reasoning behind them.</p>\\n<p><strong>If you’re updating old tutorials</strong>: Many online RL tutorials use the old v0.21 API. This guide shows you how to modernize that code.</p>\\n</section>\\n<section id=\\\"why-did-the-api-change\\\">\\n<h2>Why Did the API Change?<a href=\\\"#why-did-the-api-change\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p>Gymnasium is a fork of <a href=\\\"https://github.com/openai/gym/releases/tag/0.26.2\\\">OpenAI Gym v0.26</a>, which introduced breaking changes from <a href=\\\"https://github.com/openai/gym/releases/tag/v0.21.0\\\">Gym v0.21</a>. 
These changes weren’t made lightly - they solved important problems that made RL research and development more difficult.</p>\\n<p>The main issues with the old API were:</p>\\n<ul>\\n<li><p><strong>Ambiguous episode endings</strong>: The single <code><span>done</span></code> flag couldn’t distinguish between “task completed” and “time limit reached”</p></li>\\n<li><p><strong>Inconsistent seeding</strong>: Random number generation was unreliable and hard to reproduce</p></li>\\n<li><p><strong>Rendering complexity</strong>: Switching between visual modes was unnecessarily complicated</p></li>\\n<li><p><strong>Reproducibility problems</strong>: Subtle bugs made it difficult to reproduce research results</p></li>\\n</ul>\\n<p>For environments that can’t be updated, see the compatibility guide section below.</p>\\n</section>\\n<section id=\\\"quick-reference-complete-changes-table\\\">\\n<h2>Quick Reference: Complete Changes Table<a href=\\\"#quick-reference-complete-changes-table\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<div>\\n<table>\\n<thead>\\n<tr><th><p><strong>Component</strong></p></th>\\n<th><p><strong>v0.21 (Old)</strong></p></th>\\n<th><p><strong>v0.26+ (New)</strong></p></th>\\n<th><p><strong>Impact</strong></p></th>\\n</tr>\\n</thead>\\n<tbody>\\n<tr><td><p><strong>Package Import</strong></p></td>\\n<td><p><code><span>import</span> <span>gym</span></code></p></td>\\n<td><p><code><span>import</span> <span>gymnasium</span> <span>as</span> <span>gym</span></code></p></td>\\n<td><p>All code</p></td>\\n</tr>\\n<tr><td><p><strong>Environment Reset</strong></p></td>\\n<td><p><code><span>obs</span> <span>=</span> <span>env.reset()</span></code></p></td>\\n<td><p><code><span>obs,</span> <span>info</span> <span>=</span> <span>env.reset()</span></code></p></td>\\n<td><p>Training loops</p></td>\\n</tr>\\n<tr><td><p><strong>Random 
Seeding</strong></p></td>\\n<td><p><code><span>env.seed(42)</span></code></p></td>\\n<td><p><code><span>env.reset(seed=42)</span></code></p></td>\\n<td><p>Reproducibility</p></td>\\n</tr>\\n<tr><td><p><strong>Step Function</strong></p></td>\\n<td><p><code><span>obs,</span> <span>reward,</span> <span>done,</span> <span>info</span> <span>=</span> <span>env.step(action)</span></code></p></td>\\n<td><p><code><span>obs,</span> <span>reward,</span> <span>terminated,</span> <span>truncated,</span> <span>info</span> <span>=</span> <span>env.step(action)</span></code></p></td>\\n<td><p>RL algorithms</p></td>\\n</tr>\\n<tr><td><p><strong>Episode Ending</strong></p></td>\\n<td><p><code><span>while</span> <span>not</span> <span>done:</span></code></p></td>\\n<td><p><code><span>while</span> <span>not</span> <span>(terminated</span> <span>or</span> <span>truncated):</span></code></p></td>\\n<td><p>Training loops</p></td>\\n</tr>\\n<tr><td><p><strong>Render Mode</strong></p></td>\\n<td><p><code><span>env.render(mode=\\\"human\\\")</span></code></p></td>\\n<td><p><code><span>gym.make(env_id,</span> <span>render_mode=\\\"human\\\")</span></code></p></td>\\n<td><p>Visualization</p></td>\\n</tr>\\n<tr><td><p><strong>Time Limit Detection</strong></p></td>\\n<td><p><code><span>info.get('TimeLimit.truncated')</span></code></p></td>\\n<td><p><code><span>truncated</span></code> return value</p></td>\\n<td><p>RL algorithms</p></td>\\n</tr>\\n<tr><td><p><strong>Value Bootstrapping</strong></p></td>\\n<td><p><code><span>target</span> <span>=</span> <span>reward</span> <span>+</span> <span>(1-done)</span> <span>*</span> <span>gamma</span> <span>*</span> <span>next_value</span></code></p></td>\\n<td><p><code><span>target</span> <span>=</span> <span>reward</span> <span>+</span> <span>(1-terminated)</span> <span>*</span> <span>gamma</span> <span>*</span> <span>next_value</span></code></p></td>\\n<td><p>RL correctness</p></td>\\n</tr>\\n</tbody>\\n</table>\\n</div>\\n</section>\\n<section 
id=\\\"side-by-side-code-comparison\\\">\\n<h2>Side-by-Side Code Comparison<a href=\\\"#side-by-side-code-comparison\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<section id=\\\"old-v0-21-code\\\">\\n<h3>Old v0.21 Code<a href=\\\"#old-v0-21-code\\\" title=\\\"Link to this heading\\\">¶</a></h3>\\n<div><pre><span></span><span>import</span><span> </span><span>gym</span>\\n\\n<span># Environment creation and seeding</span>\\n<span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"LunarLander-v3\\\"</span><span>,</span> <span>options</span><span>=</span><span>{})</span>\\n<span>env</span><span>.</span><span>seed</span><span>(</span><span>123</span><span>)</span>\\n<span>observation</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>()</span>\\n\\n<span># Training loop</span>\\n<span>done</span> <span>=</span> <span>False</span>\\n<span>while</span> <span>not</span> <span>done</span><span>:</span>\\n    <span>action</span> <span>=</span> <span>env</span><span>.</span><span>action_space</span><span>.</span><span>sample</span><span>()</span>\\n    <span>observation</span><span>,</span> <span>reward</span><span>,</span> <span>done</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>step</span><span>(</span><span>action</span><span>)</span>\\n    <span>env</span><span>.</span><span>render</span><span>(</span><span>mode</span><span>=</span><span>\\\"human\\\"</span><span>)</span>\\n\\n<span>env</span><span>.</span><span>close</span><span>()</span>\\n</pre></div>\\n</section>\\n<section id=\\\"new-v0-26-code-including-v1-0-0\\\">\\n<h3>New v0.26+ Code (Including v1.0.0)<a href=\\\"#new-v0-26-code-including-v1-0-0\\\" title=\\\"Link to this heading\\\">¶</a></h3>\\n<div><pre><span></span><span>import</span><span> </span><span>gymnasium</span><span> </span><span>as</span><span> </span><span>gym</span>  <span># Note: 'gymnasium' not 'gym'</span>\\n\\n<span># 
Environment creation with render mode specified upfront</span>\\n<span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"LunarLander-v3\\\"</span><span>,</span> <span>render_mode</span><span>=</span><span>\\\"human\\\"</span><span>)</span>\\n\\n<span># Reset with seed parameter</span>\\n<span>observation</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>(</span><span>seed</span><span>=</span><span>123</span><span>,</span> <span>options</span><span>=</span><span>{})</span>\\n\\n<span># Training loop with terminated/truncated distinction</span>\\n<span>done</span> <span>=</span> <span>False</span>\\n<span>while</span> <span>not</span> <span>done</span><span>:</span>\\n    <span>action</span> <span>=</span> <span>env</span><span>.</span><span>action_space</span><span>.</span><span>sample</span><span>()</span>\\n    <span>observation</span><span>,</span> <span>reward</span><span>,</span> <span>terminated</span><span>,</span> <span>truncated</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>step</span><span>(</span><span>action</span><span>)</span>\\n\\n    <span># Episode ends if either terminated OR truncated</span>\\n    <span>done</span> <span>=</span> <span>terminated</span> <span>or</span> <span>truncated</span>\\n\\n<span>env</span><span>.</span><span>close</span><span>()</span>\\n</pre></div>\\n</section>\\n</section>\\n<section id=\\\"key-changes-breakdown\\\">\\n<h2>Key Changes Breakdown<a href=\\\"#key-changes-breakdown\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<section id=\\\"package-name-change\\\">\\n<h3>1. 
Package Name Change<a href=\\\"#package-name-change\\\" title=\\\"Link to this heading\\\">¶</a></h3>\\n<p><strong>Old</strong>: <code><span>import</span> <span>gym</span></code>\\n<strong>New</strong>: <code><span>import</span> <span>gymnasium</span> <span>as</span> <span>gym</span></code></p>\\n<p>Why: Gymnasium is a separate project that maintains and improves upon the original Gym codebase.</p>\\n<div><pre><span></span><span># OLD</span>\\n<span>import</span><span> </span><span>gym</span>\\n\\n<span># NEW</span>\\n<span>import</span><span> </span><span>gymnasium</span><span> </span><span>as</span><span> </span><span>gym</span>\\n</pre></div>\\n</section>\\n<section id=\\\"seeding-and-random-number-generation\\\">\\n<h3>2. Seeding and Random Number Generation<a href=\\\"#seeding-and-random-number-generation\\\" title=\\\"Link to this heading\\\">¶</a></h3>\\n<p>The biggest conceptual change is how randomness is handled.</p>\\n<p><strong>Old v0.21</strong>: Separate <code><span>seed()</span></code> method</p>\\n<div><pre><span></span><span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>)</span>\\n<span>env</span><span>.</span><span>seed</span><span>(</span><span>42</span><span>)</span>  <span># Set random seed</span>\\n<span>obs</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>()</span>  <span># Reset environment</span>\\n</pre></div>\\n<p><strong>New v0.26+</strong>: Seed passed to <code><span>reset()</span></code></p>\\n<div><pre><span></span><span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>)</span>\\n<span>obs</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>(</span><span>seed</span><span>=</span><span>42</span><span>)</span>  <span># Seed and reset together</span>\\n</pre></div>\\n<p><strong>Why this changed</strong>: 
Some environments (especially emulated games) can only set their random state at the beginning of an episode, not mid-episode. The old approach could lead to inconsistent behavior.</p>\\n<p><strong>Practical impact</strong>:</p>\\n<div><pre><span></span><span># OLD: Seeding applied to all future episodes</span>\\n<span>env</span><span>.</span><span>seed</span><span>(</span><span>42</span><span>)</span>\\n<span>for</span> <span>episode</span> <span>in</span> <span>range</span><span>(</span><span>10</span><span>):</span>\\n    <span>obs</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>()</span>\\n\\n<span># NEW: Each episode can have its own seed</span>\\n<span>for</span> <span>episode</span> <span>in</span> <span>range</span><span>(</span><span>10</span><span>):</span>\\n    <span>obs</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>(</span><span>seed</span><span>=</span><span>42</span> <span>+</span> <span>episode</span><span>)</span>  <span># Each episode gets unique seed</span>\\n</pre></div>\\n</section>\\n<section id=\\\"environment-reset-changes\\\">\\n<h3>3. 
Environment Reset Changes<a href=\\\"#environment-reset-changes\\\" title=\\\"Link to this heading\\\">¶</a></h3>\\n<p><strong>Old v0.21</strong>: Returns only observation</p>\\n<div><pre><span></span><span>observation</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>()</span>\\n</pre></div>\\n<p><strong>New v0.26+</strong>: Returns observation AND info</p>\\n<div><pre><span></span><span>observation</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>()</span>\\n</pre></div>\\n<p><strong>Why this changed</strong>:</p>\\n<ul>\\n<li><p><code><span>info</span></code> provides consistent access to debugging information</p></li>\\n<li><p><code><span>seed</span></code> parameter enables reproducible episodes</p></li>\\n<li><p><code><span>options</span></code> parameter allows episode-specific configuration</p></li>\\n</ul>\\n<p><strong>Common migration pattern</strong>:</p>\\n<div><pre><span></span><span># If you don't need the new features, just unpack the tuple</span>\\n<span>obs</span><span>,</span> <span>_</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>()</span>  <span># Ignore info with underscore</span>\\n\\n<span># If you want to maintain the same random behavior as v0.21</span>\\n<span>env</span><span>.</span><span>reset</span><span>(</span><span>seed</span><span>=</span><span>42</span><span>)</span>  <span># Set seed once</span>\\n<span># Then for subsequent resets:</span>\\n<span>obs</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>reset</span><span>()</span>  <span># Uses internal random state</span>\\n</pre></div>\\n</section>\\n<section id=\\\"step-function-the-done-terminated-truncated-split\\\">\\n<h3>4. 
Step Function: The <code><span>done</span></code> → <code><span>terminated</span></code>/<code><span>truncated</span></code> Split<a href=\\\"#step-function-the-done-terminated-truncated-split\\\" title=\\\"Link to this heading\\\">¶</a></h3>\\n<p>This is the most important change for training algorithms.</p>\\n<p><strong>Old v0.21</strong>: Single <code><span>done</span></code> flag</p>\\n<div><pre><span></span><span>obs</span><span>,</span> <span>reward</span><span>,</span> <span>done</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>step</span><span>(</span><span>action</span><span>)</span>\\n</pre></div>\\n<p><strong>New v0.26+</strong>: Separate <code><span>terminated</span></code> and <code><span>truncated</span></code> flags</p>\\n<div><pre><span></span><span>obs</span><span>,</span> <span>reward</span><span>,</span> <span>terminated</span><span>,</span> <span>truncated</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>step</span><span>(</span><span>action</span><span>)</span>\\n</pre></div>\\n<p><strong>Why this matters</strong>:</p>\\n<ul>\\n<li><p><strong><code><span>terminated</span></code></strong>: Episode ended because the task was completed or failed (agent reached goal, died, etc.)</p></li>\\n<li><p><strong><code><span>truncated</span></code></strong>: Episode ended due to external constraints (time limit, step limit, etc.)</p></li>\\n</ul>\\n<p>This distinction is crucial for value function bootstrapping in RL algorithms:</p>\\n<div><pre><span></span><span># OLD (ambiguous)</span>\\n<span>if</span> <span>done</span><span>:</span>\\n    <span># Should we bootstrap? 
We don't know if this was natural termination or time limit!</span>\\n    <span>next_value</span> <span>=</span> <span>0</span>  <span># Assumption that may be wrong</span>\\n\\n<span># NEW (clear)</span>\\n<span>if</span> <span>terminated</span><span>:</span>\\n    <span>next_value</span> <span>=</span> <span>0</span>      <span># Natural ending - no future value</span>\\n<span>elif</span> <span>truncated</span><span>:</span>\\n    <span>next_value</span> <span>=</span> <span>value_function</span><span>(</span><span>next_obs</span><span>)</span>  <span># Time limit - estimate future value</span>\\n</pre></div>\\n<p><strong>Migration strategy</strong>:</p>\\n<div><pre><span></span><span># Simple migration (works for many cases)</span>\\n<span>obs</span><span>,</span> <span>reward</span><span>,</span> <span>terminated</span><span>,</span> <span>truncated</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>step</span><span>(</span><span>action</span><span>)</span>\\n<span>done</span> <span>=</span> <span>terminated</span> <span>or</span> <span>truncated</span>\\n\\n<span># Better migration (preserves RL algorithm correctness)</span>\\n<span>obs</span><span>,</span> <span>reward</span><span>,</span> <span>terminated</span><span>,</span> <span>truncated</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>step</span><span>(</span><span>action</span><span>)</span>\\n<span>if</span> <span>terminated</span><span>:</span>\\n    <span># Episode naturally ended - use reward as-is</span>\\n    <span>target</span> <span>=</span> <span>reward</span>\\n<span>elif</span> <span>truncated</span><span>:</span>\\n    <span># Episode cut short - may need to estimate remaining value</span>\\n    <span>target</span> <span>=</span> <span>reward</span> <span>+</span> <span>discount</span> <span>*</span> <span>estimate_value</span><span>(</span><span>obs</span><span>)</span>\\n</pre></div>\\n<p>For more information, 
see our <a href=\\\"https://farama.org/Gymnasium-Terminated-Truncated-Step-API\\\">blog post</a> about it.</p>\\n</section>\\n<section id=\\\"render-mode-changes\\\">\\n<h3>5. Render Mode Changes<a href=\\\"#render-mode-changes\\\" title=\\\"Link to this heading\\\">¶</a></h3>\\n<p><strong>Old v0.21</strong>: Render mode specified each time</p>\\n<div><pre><span></span><span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>)</span>\\n<span>env</span><span>.</span><span>render</span><span>(</span><span>mode</span><span>=</span><span>\\\"human\\\"</span><span>)</span>     <span># Visual window</span>\\n<span>env</span><span>.</span><span>render</span><span>(</span><span>mode</span><span>=</span><span>\\\"rgb_array\\\"</span><span>)</span> <span># Get pixel array</span>\\n</pre></div>\\n<p><strong>New v0.26+</strong>: Render mode fixed at creation</p>\\n<div><pre><span></span><span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>,</span> <span>render_mode</span><span>=</span><span>\\\"human\\\"</span><span>)</span>     <span># For visual display</span>\\n<span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>,</span> <span>render_mode</span><span>=</span><span>\\\"rgb_array\\\"</span><span>)</span> <span># For recording</span>\\n<span>env</span><span>.</span><span>render</span><span>()</span>  <span># Uses the mode specified at creation</span>\\n</pre></div>\\n<p><strong>Why this changed</strong>: Some environments can’t switch render modes on-the-fly. 
Fixing the mode at creation enables better optimization and prevents bugs.</p>\\n<p><strong>Practical implications</strong>:</p>\\n<div><pre><span></span><span># OLD: Could switch modes dynamically</span>\\n<span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>)</span>\\n<span>for</span> <span>episode</span> <span>in</span> <span>range</span><span>(</span><span>10</span><span>):</span>\\n    <span># ... episode code ...</span>\\n    <span>if</span> <span>episode</span> <span>%</span> <span>10</span> <span>==</span> <span>0</span><span>:</span>\\n        <span>env</span><span>.</span><span>render</span><span>(</span><span>mode</span><span>=</span><span>\\\"human\\\"</span><span>)</span>  <span># Show every 10th episode</span>\\n\\n<span># NEW: Create separate environments for different purposes</span>\\n<span>training_env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>)</span>  <span># No rendering for speed</span>\\n<span>eval_env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>,</span> <span>render_mode</span><span>=</span><span>\\\"human\\\"</span><span>)</span>  <span># Visual for evaluation</span>\\n\\n<span># Or use None for no rendering, then create visual env when needed</span>\\n<span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>,</span> <span>render_mode</span><span>=</span><span>None</span><span>)</span>  <span># Fast training</span>\\n<span>if</span> <span>need_visualization</span><span>:</span>\\n    <span>visual_env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>,</span> 
<span>render_mode</span><span>=</span><span>\\\"human\\\"</span><span>)</span>\\n</pre></div>\\n</section>\\n</section>\\n<section id=\\\"timelimit-wrapper-changes\\\">\\n<h2>TimeLimit Wrapper Changes<a href=\\\"#timelimit-wrapper-changes\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p>The <a href=\\\"https://gymnasium.farama.org/api/wrappers/misc_wrappers/#gymnasium.wrappers.TimeLimit\\\" title=\\\"gymnasium.wrappers.TimeLimit\\\"><code><span>TimeLimit</span></code></a> wrapper behavior also changed to align with the new termination model.</p>\\n<p><strong>Old v0.21</strong>: Added <code><span>TimeLimit.truncated</span></code> to info dict</p>\\n<div><pre><span></span><span>obs</span><span>,</span> <span>reward</span><span>,</span> <span>done</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>step</span><span>(</span><span>action</span><span>)</span>\\n<span>if</span> <span>done</span> <span>and</span> <span>info</span><span>.</span><span>get</span><span>(</span><span>'TimeLimit.truncated'</span><span>,</span> <span>False</span><span>):</span>\\n    <span># Episode ended due to time limit</span>\\n    <span>pass</span>\\n</pre></div>\\n<p><strong>New v0.26+</strong>: Uses the <code><span>truncated</span></code> return value</p>\\n<div><pre><span></span><span>obs</span><span>,</span> <span>reward</span><span>,</span> <span>terminated</span><span>,</span> <span>truncated</span><span>,</span> <span>info</span> <span>=</span> <span>env</span><span>.</span><span>step</span><span>(</span><span>action</span><span>)</span>\\n<span>if</span> <span>truncated</span><span>:</span>\\n    <span># Episode ended due to time limit (or other truncation)</span>\\n    <span>pass</span>\\n<span>if</span> <span>terminated</span><span>:</span>\\n    <span># Episode ended naturally (success/failure)</span>\\n    <span>pass</span>\\n</pre></div>\\n<p>This makes time limit detection much cleaner and more explicit.</p>\\n</section>\\n<section 
id=\\\"environment-specific-changes\\\">\\n<h2>Environment-Specific Changes<a href=\\\"#environment-specific-changes\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<section id=\\\"removed-environments\\\">\\n<h3>Removed Environments<a href=\\\"#removed-environments\\\" title=\\\"Link to this heading\\\">¶</a></h3>\\n<p>Some environments were moved or removed:</p>\\n<div><pre><span></span><span># OLD: Robotics environments in main gym</span>\\n<span>import</span><span> </span><span>gym</span>\\n<span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"FetchReach-v1\\\"</span><span>)</span>  <span># No longer available</span>\\n\\n<span># NEW: Moved to separate package</span>\\n<span>import</span><span> </span><span>gymnasium</span>\\n\\n<span>import</span><span> </span><span>gymnasium_robotics</span>\\n<span>import</span><span> </span><span>ale_py</span>\\n\\n<span>gymnasium</span><span>.</span><span>register_envs</span><span>((</span><span>gymnasium_robotics</span><span>,</span> <span>ale_py</span><span>))</span>\\n\\n<span>env</span> <span>=</span> <span>gymnasium</span><span>.</span><span>make</span><span>(</span><span>\\\"FetchReach-v1\\\"</span><span>)</span>\\n<span>env</span> <span>=</span> <span>gymnasium</span><span>.</span><span>make</span><span>(</span><span>\\\"ALE/Pong-v5\\\"</span><span>)</span>\\n</pre></div>\\n</section>\\n</section>\\n<section id=\\\"compatibility-helpers\\\">\\n<h2>Compatibility Helpers<a href=\\\"#compatibility-helpers\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<section id=\\\"loading-openai-gym-environments\\\">\\n<h3>Loading OpenAI Gym environments<a href=\\\"#loading-openai-gym-environments\\\" title=\\\"Link to this heading\\\">¶</a></h3>\\n<p>For environments that can’t be updated to Gymnasium, we provide compatibility wrappers either for v21 and v26 style environments, where either the environment name or the environment itself can be 
passed.</p>\\n<div><pre><span></span><span>import</span><span> </span><span>gymnasium</span>\\n<span>import</span><span> </span><span>shimmy</span>  <span># install shimmy with `pip install shimmy`</span>\\n\\n<span>gymnasium</span><span>.</span><span>register_envs</span><span>(</span><span>shimmy</span><span>)</span>\\n\\n\\n<span># Gym v0.21 style environments</span>\\n<span>env</span> <span>=</span> <span>gymnasium</span><span>.</span><span>make</span><span>(</span><span>\\\"GymV21Environment-v0\\\"</span><span>,</span> <span>env_id</span><span>=</span><span>\\\"CartPole-v1\\\"</span><span>)</span>\\n<span># or</span>\\n<span>env</span> <span>=</span> <span>gymnasium</span><span>.</span><span>make</span><span>(</span><span>\\\"GymV21Environment-v0\\\"</span><span>,</span> <span>env</span><span>=</span><span>OldV21Env</span><span>())</span>\\n\\n<span># Gym v0.26 style environments</span>\\n<span>env</span> <span>=</span> <span>gymnasium</span><span>.</span><span>make</span><span>(</span><span>\\\"GymV26Environment-v0\\\"</span><span>,</span> <span>env_id</span><span>=</span><span>\\\"CartPole-v1\\\"</span><span>)</span>\\n<span># or</span>\\n<span>env</span> <span>=</span> <span>gymnasium</span><span>.</span><span>make</span><span>(</span><span>\\\"GymV26Environment-v0\\\"</span><span>,</span> <span>env</span><span>=</span><span>OldV26Env</span><span>())</span>\\n</pre></div>\\n</section>\\n<section id=\\\"step-api-compatibility\\\">\\n<h3>Step API Compatibility<a href=\\\"#step-api-compatibility\\\" title=\\\"Link to this heading\\\">¶</a></h3>\\n<p>If environments implement the (old) done step API, Gymnasium provides functions (<a href=\\\"https://gymnasium.farama.org/api/utils/#gymnasium.utils.step_api_compatibility.convert_to_terminated_truncated_step_api\\\" title=\\\"gymnasium.utils.step_api_compatibility.convert_to_terminated_truncated_step_api\\\"><code><span>convert_to_terminated_truncated_step_api()</span></code></a> and <a 
href=\\\"https://gymnasium.farama.org/api/utils/#gymnasium.utils.step_api_compatibility.convert_to_done_step_api\\\" title=\\\"gymnasium.utils.step_api_compatibility.convert_to_done_step_api\\\"><code><span>convert_to_done_step_api()</span></code></a>) that convert step returns from the old step API (using <code><span>done</span></code>) to the new step API (using <code><span>terminated</span></code> and <code><span>truncated</span></code>), and vice versa.</p>\\n</section>\\n</section>\\n<section id=\\\"testing-your-migration\\\">\\n<h2>Testing Your Migration<a href=\\\"#testing-your-migration\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p>After migrating, verify that:</p>\\n<ul>\\n<li><p>[ ] <strong>Import statements</strong> use <code><span>gymnasium</span></code> instead of <code><span>gym</span></code></p></li>\\n<li><p>[ ] <strong>Reset calls</strong> handle the <code><span>(obs,</span> <span>info)</span></code> return format</p></li>\\n<li><p>[ ] <strong>Step calls</strong> handle <code><span>terminated</span></code> and <code><span>truncated</span></code> separately</p></li>\\n<li><p>[ ] <strong>Render mode</strong> is specified during environment creation</p></li>\\n<li><p>[ ] <strong>Random seeding</strong> uses the <code><span>seed</span></code> parameter in <code><span>reset()</span></code></p></li>\\n<li><p>[ ] <strong>Training algorithms</strong> properly distinguish termination from truncation</p></li>\\n</ul>\\n<p>Use <a href=\\\"https://gymnasium.farama.org/api/utils/#gymnasium.utils.env_checker.check_env\\\" title=\\\"gymnasium.utils.env_checker.check_env\\\"><code><span>check_env()</span></code></a> to verify your implementation.</p>\\n</section>\\n<section id=\\\"getting-help\\\">\\n<h2>Getting Help<a href=\\\"#getting-help\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p><strong>If you encounter issues during migration</strong>:</p>\\n<ol>\\n<li><p><strong>Check the compatibility guide</strong>: Some old environments can be 
used with compatibility wrappers</p></li>\\n<li><p><strong>Look at the environment documentation</strong>: Each environment may have specific migration notes</p></li>\\n<li><p><strong>Test with simple environments first</strong>: Start with CartPole before moving to complex environments</p></li>\\n<li><p><strong>Compare old vs new behavior</strong>: Run the same code with both APIs to understand differences</p></li>\\n</ol>\\n<p><strong>Common resources</strong>:</p>\\n<ul>\\n<li><p><a href=\\\"https://gymnasium.farama.org/api/env\\\">Gymnasium API documentation</a></p></li>\\n<li><p><a href=\\\"https://github.com/Farama-Foundation/Gymnasium/issues\\\">GitHub issues</a> for bug reports</p></li>\\n<li><p><a href=\\\"https://discord.gg/bnJ6kubTg6\\\">Discord community</a> for questions</p></li>\\n</ul>\\n</section>\\n</section>\\n\\n          </article></div>\"\n}"
    }
  ],
  "details": {
    "url": "https://gymnasium.farama.org/introduction/migration_guide/",
    "title": "Gymnasium Documentation",
    "summary": "A standard API for reinforcement learning and a diverse set of reference environments (formerly Gym)"
  }
}
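
Note on the migration checklist fetched above: the split between `terminated` and `truncated` can be illustrated without installing anything. The sketch below is plain Python, not Gymnasium's own compatibility code. The helper names (`to_new_step_api`, `to_old_step_api`, `td_target`) are hypothetical, and the `"TimeLimit.truncated"` info key is an assumption about how legacy Gym environments flagged time-limit cutoffs.

```python
# Assumed legacy convention: old-API environments mark a time-limit cutoff
# under this info key. Treat the key name as an assumption, not a guarantee.
LEGACY_TRUNCATION_KEY = "TimeLimit.truncated"


def to_new_step_api(obs, reward, done, info):
    """Convert an old 4-tuple step result into the new 5-tuple form."""
    info = dict(info)  # copy so the caller's dict is not mutated
    truncated = bool(info.pop(LEGACY_TRUNCATION_KEY, False))
    # An episode that ended without the truncation flag ended "for real".
    terminated = bool(done) and not truncated
    return obs, reward, terminated, truncated, info


def to_old_step_api(obs, reward, terminated, truncated, info):
    """Convert a new 5-tuple step result back into the old 4-tuple form."""
    info = dict(info)
    done = bool(terminated or truncated)
    if done:
        # Record whether the episode ended only because of a time limit.
        info[LEGACY_TRUNCATION_KEY] = bool(truncated and not terminated)
    return obs, reward, done, info


def td_target(reward, next_value, terminated, gamma=0.99):
    """One-step TD target that bootstraps unless the episode terminated.

    Truncation is deliberately ignored here: a time-limit cutoff does not
    mean the next state has zero value.
    """
    bootstrap = 0.0 if terminated else 1.0
    return reward + gamma * next_value * bootstrap
```

The last helper is the reason the checklist asks training code to distinguish the two flags: zeroing the bootstrap term on truncation would bias value estimates for states near the time limit.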

▼✓Web search, results: 10, query: "Gymnasium built-in environments list classic control Box2D MuJoCo Atari Toy Text"

▼thought

Now let me get more details about the Farama Foundation, the ecosystem, and the latest version/developments.

▼✓Web search, results: 10, query: "Farama Foundation Gymnasium ecosystem PettingZoo Minigrid 2025 2026"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"results\": [\n    {\n      \"title\": \"Projects | The Farama Foundation\",\n      \"url\": \"https://farama.org/projects\",\n      \"description\": \"PettingZoo and Gymnasium bindings for popular reinforcement learning environments outside of Farama ... Copyright © The Farama Foundation 2026. Contact Us.\"\n    },\n    {\n      \"title\": \"Farama-Foundation/Gymnasium: An API standard for single ... - GitHub\",\n      \"url\": \"https://github.com/farama-foundation/gymnasium\",\n      \"description\": \"PettingZoo is a multi-agent version of Gymnasium with a number of implemented environments, for example, multi-agent Atari environments. The Farama Foundation\"\n    },\n    {\n      \"title\": \"GitHub - Farama-Foundation/PettingZoo: An API standard for multi ...\",\n      \"url\": \"https://github.com/Farama-Foundation/PettingZoo\",\n      \"description\": \"PettingZoo is a Python library for conducting research in multi-agent reinforcement learning, akin to a multi-agent version of Gymnasium.\"\n    },\n    {\n      \"title\": \"PettingZoo Documentation\",\n      \"url\": \"https://pettingzoo.farama.org/index.html\",\n      \"description\": \"PettingZoo is a simple, pythonic interface capable of representing general multi-agent reinforcement learning (MARL) problems.\"\n    },\n    {\n      \"title\": \"A Reference Architecture of Reinforcement Learning Frameworks\",\n      \"url\": \"https://arxiv.org/html/2603.06413v1\",\n      \"description\": \"(2025) Gymnasium: a standard interface for reinforcement learning environments. 
Note: GitHub repository: https://github.com/Farama-Foundation/\"\n    },\n    {\n      \"title\": \"Installations | The Farama Foundation\",\n      \"url\": \"https://farama.org/stats/installations\",\n      \"description\": \"Installations ; PettingZoo5,163,800 ; Minari368,297 ; Gymnasium-Robotics529,442 ; Jumpy38,085 ; Metaworld114,491.\"\n    },\n    {\n      \"title\": \"Release Notes - PettingZoo Documentation\",\n      \"url\": \"https://pettingzoo.farama.org/release_notes/index.html\",\n      \"description\": \"This is a patch release for PettingZoo addressing dependency updates. * Updated Ray dependency in tutorials from 2.7.0 to 2.55.0 (#1329). This is a maintenance release for PettingZoo, adding support for `Python 3.13` and `Python 3.14`. This release also removes the deprecated MPE environments from PettingZoo. MPE environments are now maintained in the package MPE2. * MPE environments were removed from PettingZoo. * Waterworld was removed from PettingZoo. * Support for Python 3.13 and Python 3.14 was added. * Python 3.13 and Python 3.14 were added to CI workflows. * MPE documentation updated to reflect migration to `mpe2`. In this release of PettingZoo, we have introduced support for `python 3.12`, `gymnasium>=1.0.0`, and `agilerl>=2.0.0`. We have deprecated `python 3.8`, and added warnings to prepare the future transfer of the MPE environments to the package MPE2 in a future release. PettingZoo 1.25.0 coincides with a new release of Supersuit. * Update gymnasium to >=1.0.0 (#1272). * AgileRL tutorials updated for version 2.0.0 (#1258).\"\n    },\n    {\n      \"title\": \"Projects updated to v1.0.0 · Issue #944 · Farama-Foundation ...\",\n      \"url\": \"https://github.com/Farama-Foundation/Gymnasium/issues/944\",\n      \"description\": \"This issue lists popular RL projects and their status for updating to Gymnasium v1.0.0. 
Please comment on progress or missing projects from\"\n    },\n    {\n      \"title\": \"NeurIPS Poster Gymnasium: A Standard Interface for Reinforcement Learning Environments\",\n      \"url\": \"https://neurips.cc/virtual/2025/poster/121446\",\n      \"description\": \"# Gymnasium: A Standard Interface for Reinforcement Learning Environments. Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. This makes it difficult for researchers to compare and build upon each other's work, slowing down progress in the field.Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium's main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research.Through this unified framework, Gymnasium significantly streamlines the process of developing and testing RL algorithms, enabling researchers to focus more on innovation and less on implementation details. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential.\"\n    },\n    {\n      \"title\": \"A Standard Interface for Reinforcement Learning Environments\",\n      \"url\": \"https://www.researchgate.net/publication/382526464_Gymnasium_A_Standard_Interface_for_Reinforcement_Learning_Environments\",\n      \"description\": \"Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue.\"\n    }\n  ]\n}"
    }
  ],
  "details": {
    "results": [
      {
        "title": "Projects | The Farama Foundation",
        "url": "https://farama.org/projects",
        "description": "PettingZoo and Gymnasium bindings for popular reinforcement learning environments outside of Farama ... Copyright © The Farama Foundation 2026. Contact Us."
      },
      {
        "title": "Farama-Foundation/Gymnasium: An API standard for single ... - GitHub",
        "url": "https://github.com/farama-foundation/gymnasium",
        "description": "PettingZoo is a multi-agent version of Gymnasium with a number of implemented environments, for example, multi-agent Atari environments. The Farama Foundation"
      },
      {
        "title": "GitHub - Farama-Foundation/PettingZoo: An API standard for multi ...",
        "url": "https://github.com/Farama-Foundation/PettingZoo",
        "description": "PettingZoo is a Python library for conducting research in multi-agent reinforcement learning, akin to a multi-agent version of Gymnasium."
      },
      {
        "title": "PettingZoo Documentation",
        "url": "https://pettingzoo.farama.org/index.html",
        "description": "PettingZoo is a simple, pythonic interface capable of representing general multi-agent reinforcement learning (MARL) problems."
      },
      {
        "title": "A Reference Architecture of Reinforcement Learning Frameworks",
        "url": "https://arxiv.org/html/2603.06413v1",
        "description": "(2025) Gymnasium: a standard interface for reinforcement learning environments. Note: GitHub repository: https://github.com/Farama-Foundation/"
      },
      {
        "title": "Installations | The Farama Foundation",
        "url": "https://farama.org/stats/installations",
        "description": "Installations ; PettingZoo5,163,800 ; Minari368,297 ; Gymnasium-Robotics529,442 ; Jumpy38,085 ; Metaworld114,491."
      },
      {
        "title": "Release Notes - PettingZoo Documentation",
        "url": "https://pettingzoo.farama.org/release_notes/index.html",
        "description": "This is a patch release for PettingZoo addressing dependency updates. * Updated Ray dependency in tutorials from 2.7.0 to 2.55.0 (#1329). This is a maintenance release for PettingZoo, adding support for `Python 3.13` and `Python 3.14`. This release also removes the deprecated MPE environments from PettingZoo. MPE environments are now maintained in the package MPE2. * MPE environments were removed from PettingZoo. * Waterworld was removed from PettingZoo. * Support for Python 3.13 and Python 3.14 was added. * Python 3.13 and Python 3.14 were added to CI workflows. * MPE documentation updated to reflect migration to `mpe2`. In this release of PettingZoo, we have introduced support for `python 3.12`, `gymnasium>=1.0.0`, and `agilerl>=2.0.0`. We have deprecated `python 3.8`, and added warnings to prepare the future transfer of the MPE environments to the package MPE2 in a future release. PettingZoo 1.25.0 coincides with a new release of Supersuit. * Update gymnasium to >=1.0.0 (#1272). * AgileRL tutorials updated for version 2.0.0 (#1258)."
      },
      {
        "title": "Projects updated to v1.0.0 · Issue #944 · Farama-Foundation ...",
        "url": "https://github.com/Farama-Foundation/Gymnasium/issues/944",
        "description": "This issue lists popular RL projects and their status for updating to Gymnasium v1.0.0. Please comment on progress or missing projects from"
      },
      {
        "title": "NeurIPS Poster Gymnasium: A Standard Interface for Reinforcement Learning Environments",
        "url": "https://neurips.cc/virtual/2025/poster/121446",
        "description": "# Gymnasium: A Standard Interface for Reinforcement Learning Environments. Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. This makes it difficult for researchers to compare and build upon each other's work, slowing down progress in the field.Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium's main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research.Through this unified framework, Gymnasium significantly streamlines the process of developing and testing RL algorithms, enabling researchers to focus more on innovation and less on implementation details. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential."
      },
      {
        "title": "A Standard Interface for Reinforcement Learning Environments",
        "url": "https://www.researchgate.net/publication/382526464_Gymnasium_A_Standard_Interface_for_Reinforcement_Learning_Environments",
        "description": "Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue."
      }
    ]
  }
}

▼✓Web search, results: 10, query: "Gymnasium v1.0 release changelog latest version 2025"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"results\": [\n    {\n      \"title\": \"A Standardized Interface for Reinforcement Learning Environments\",\n      \"url\": \"https://arxiv.org/html/2407.17032v4\",\n      \"description\": \"Gymnasium is the maintained successor to OpenAI Gym, which has become widely adopted, with over a million downloads in April 2025 * **https://\"\n    },\n    {\n      \"title\": \"[PDF] Gymnasium: A Standard Interface for Reinforcement Learning ...\",\n      \"url\": \"https://arxiv.org/pdf/2407.17032\",\n      \"description\": \"Gymnasium is the maintained successor to OpenAI Gym, which has become widely adopted, with over a million downloads in April 2025* and over 18\"\n    },\n    {\n      \"title\": \"Gymnasium v1.0: A Comprehensive Update | The Farama Foundation\",\n      \"url\": \"https://farama.org/Gymnasium-v1.0\",\n      \"description\": \"This release marks a major milestone for the Gymnasium project, refining the core API, addressing bugs, and enhancing features. Over 200 pull\"\n    },\n    {\n      \"title\": \"Gymnasium Release Notes\",\n      \"url\": \"https://gymnasium.farama.org/gymnasium_release_notes/index.html\",\n      \"description\": \"Gymnasium v1.3.0. This release brings a new Taxi environment version, a new RepeatAction wrapper, and a range of bug fixes across vector environments and\"\n    },\n    {\n      \"title\": \"Releases · Farama-Foundation/Gymnasium - GitHub\",\n      \"url\": \"https://github.com/Farama-Foundation/Gymnasium/releases\",\n      \"description\": \"Gymnasium v1.2.1. Minor update that adds new wrappers, optimizes several environment functions and fixes several bugs. New Features. 
Added new wrappers to\"\n    },\n    {\n      \"title\": \"gymnasium - PyPI\",\n      \"url\": \"https://pypi.org/project/gymnasium/\",\n      \"description\": \"Gymnasium is an open source Python library for developing and comparing reinforcement learning algorithms by providing a standard API.\"\n    },\n    {\n      \"title\": \"Release Notes - Gymnasium-Robotics Documentation\",\n      \"url\": \"https://robotics.farama.org/release_notes/index.html\",\n      \"description\": \"Release Notes. This minor release updates MaMuJoCo to follow the latest PettingZoo version 1.23.0 and some minor bug fixes in the Github PyPI publish workflow.\"\n    },\n    {\n      \"title\": \"[ICLR 2025] Robust Gymnasium: A Unified Modular Benchmark for ...\",\n      \"url\": \"https://github.com/SafeRL-Lab/Robust-Gymnasium\",\n      \"description\": \"This benchmark aims to advance robust reinforcement learning (RL) for real-world applications and domain adaptation.\"\n    },\n    {\n      \"title\": \"Projects updated to v1.0.0 · Issue #944 · Farama-Foundation ...\",\n      \"url\": \"https://github.com/Farama-Foundation/Gymnasium/issues/944\",\n      \"description\": \"This issue lists popular RL projects and their status for updating to Gymnasium v1.0.0. Please comment on progress or missing projects from\"\n    },\n    {\n      \"title\": \"Gym Documentation\",\n      \"url\": \"https://www.gymlibrary.dev/\",\n      \"description\": \"You can read a migration guide for upgrading older version of Gym to Gymnasium here. Click the button to continue to the Gymnasium documentation website. Go to\"\n    }\n  ]\n}"
    }
  ],
  "details": {
    "results": [
      {
        "title": "A Standardized Interface for Reinforcement Learning Environments",
        "url": "https://arxiv.org/html/2407.17032v4",
        "description": "Gymnasium is the maintained successor to OpenAI Gym, which has become widely adopted, with over a million downloads in April 2025 * **https://"
      },
      {
        "title": "[PDF] Gymnasium: A Standard Interface for Reinforcement Learning ...",
        "url": "https://arxiv.org/pdf/2407.17032",
        "description": "Gymnasium is the maintained successor to OpenAI Gym, which has become widely adopted, with over a million downloads in April 2025* and over 18"
      },
      {
        "title": "Gymnasium v1.0: A Comprehensive Update | The Farama Foundation",
        "url": "https://farama.org/Gymnasium-v1.0",
        "description": "This release marks a major milestone for the Gymnasium project, refining the core API, addressing bugs, and enhancing features. Over 200 pull"
      },
      {
        "title": "Gymnasium Release Notes",
        "url": "https://gymnasium.farama.org/gymnasium_release_notes/index.html",
        "description": "Gymnasium v1.3.0. This release brings a new Taxi environment version, a new RepeatAction wrapper, and a range of bug fixes across vector environments and"
      },
      {
        "title": "Releases · Farama-Foundation/Gymnasium - GitHub",
        "url": "https://github.com/Farama-Foundation/Gymnasium/releases",
        "description": "Gymnasium v1.2.1. Minor update that adds new wrappers, optimizes several environment functions and fixes several bugs. New Features. Added new wrappers to"
      },
      {
        "title": "gymnasium - PyPI",
        "url": "https://pypi.org/project/gymnasium/",
        "description": "Gymnasium is an open source Python library for developing and comparing reinforcement learning algorithms by providing a standard API."
      },
      {
        "title": "Release Notes - Gymnasium-Robotics Documentation",
        "url": "https://robotics.farama.org/release_notes/index.html",
        "description": "Release Notes. This minor release updates MaMuJoCo to follow the latest PettingZoo version 1.23.0 and some minor bug fixes in the Github PyPI publish workflow."
      },
      {
        "title": "[ICLR 2025] Robust Gymnasium: A Unified Modular Benchmark for ...",
        "url": "https://github.com/SafeRL-Lab/Robust-Gymnasium",
        "description": "This benchmark aims to advance robust reinforcement learning (RL) for real-world applications and domain adaptation."
      },
      {
        "title": "Projects updated to v1.0.0 · Issue #944 · Farama-Foundation ...",
        "url": "https://github.com/Farama-Foundation/Gymnasium/issues/944",
        "description": "This issue lists popular RL projects and their status for updating to Gymnasium v1.0.0. Please comment on progress or missing projects from"
      },
      {
        "title": "Gym Documentation",
        "url": "https://www.gymlibrary.dev/",
        "description": "You can read a migration guide for upgrading older version of Gym to Gymnasium here. Click the button to continue to the Gymnasium documentation website. Go to"
      }
    ]
  }
}

▼✓Fetch page content, URL: "https://gymnasium.farama.org/environments/classic_control/"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://gymnasium.farama.org/environments/classic_control/\",\n  \"title\": \"Gymnasium Documentation\",\n  \"summary\": \"A standard API for reinforcement learning and a diverse set of reference environments (formerly Gym)\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><div class=\\\"page\\\">\\n          <article role=\\\"main\\\">\\n            \\n            <section id=\\\"classic-control\\\">\\n<h2>Classic Control<a href=\\\"#classic-control\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n\\n\\n\\n\\n    <p>The unique dependencies for this set of environments can be installed via:</p>\\n<div><pre><span></span>pip<span> </span>install<span> </span>gymnasium<span>[</span>classic-control<span>]</span>\\n</pre></div>\\n<p>There are five classic control environments: Acrobot, CartPole, Mountain Car, Continuous Mountain Car, and Pendulum. All of these environments are stochastic in terms of their initial state, within a given range. In addition, Acrobot has noise applied to the taken action. 
Also, regarding both mountain car environments, the cars are underpowered to climb the mountain, so it takes some effort to reach the top.</p>\\n<p>Among Gymnasium environments, this set of environments can be considered easier ones to solve by a policy.</p>\\n<p>All environments are highly configurable via arguments specified in each environment’s documentation.</p>\\n</section>\\n\\n          </article>\\n        </div></div>\"\n}"
    }
  ],
  "details": {
    "url": "https://gymnasium.farama.org/environments/classic_control/",
    "title": "Gymnasium Documentation",
    "summary": "A standard API for reinforcement learning and a diverse set of reference environments (formerly Gym)",
    "content": "<div id=\"readability-page-1\" class=\"page\"><div class=\"page\">\n          <article role=\"main\">\n            \n            <section id=\"classic-control\">\n<h2>Classic Control<a href=\"#classic-control\" title=\"Link to this heading\">¶</a></h2>\n\n\n\n\n    <p>The unique dependencies for this set of environments can be installed via:</p>\n<div><pre><span></span>pip<span> </span>install<span> </span>gymnasium<span>[</span>classic-control<span>]</span>\n</pre></div>\n<p>There are five classic control environments: Acrobot, CartPole, Mountain Car, Continuous Mountain Car, and Pendulum. All of these environments are stochastic in terms of their initial state, within a given range. In addition, Acrobot has noise applied to the taken action. Also, regarding both mountain car environments, the cars are underpowered to climb the mountain, so it takes some effort to reach the top.</p>\n<p>Among Gymnasium environments, this set of environments can be considered easier ones to solve by a policy.</p>\n<p>All environments are highly configurable via arguments specified in each environment’s documentation.</p>\n</section>\n\n          </article>\n        </div></div>"
  }
}

▼thought

Let me get the v1.0 blog post and the release notes for more details.

▼✓Fetch page content, URL: "https://farama.org/Gymnasium-v1.0"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://farama.org/Gymnasium-v1.0\",\n  \"title\": \"Gymnasium v1.0: A Comprehensive Update\",\n  \"summary\": \"After years of hard work, Gymnasium v1.0 has officially arrived! This release marks a major milestone for the Gymnasium project, refining the core API, addressing bugs, and enhancing features. Over 20…\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><div>\\n        <h2 id=\\\"gymnasium-v10-a-comprehensive-update\\\">Gymnasium v1.0: A Comprehensive Update</h2>\\n\\n<p>After years of hard work, Gymnasium v1.0 has officially arrived! This release marks a major milestone for the Gymnasium project, refining the core API, addressing bugs, and enhancing features. Over 200 pull requests have been merged since version 0.29.1, culminating in Gymnasium v1.0, a stable release focused on improving the API (<code>Env</code>, <code>Space</code>, and <code>VectorEnv</code>).</p>\\n\\n<p>For a more detailed summary, see our <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.0.0\\\">release notes</a>. We have also published a white paper on Gymnasium you can check out <a href=\\\"https://arxiv.org/abs/2407.17032\\\">here</a> and cite if using in academic work.</p>\\n\\n<p>Let’s dive into some of the key changes!</p>\\n\\n<h2 id=\\\"1-vector-environments-overhaul\\\">1. <strong>Vector Environments Overhaul</strong></h2>\\n<p>Vector environments have been significantly revamped. Previously, <code>VectorEnv</code> inherited from <code>Env</code>, which wasn’t technically valid, and caused various issues with method signatures. In v1.0, <code>Env</code> and <code>VectorEnv</code> are now distinct and fully supported, leading to clearer code and more efficient implementations.</p>\\n\\n<p>Moreover, Gymnasium introduces the <code>gymnasium.make_vec</code> function for easier vectorized environment creation. 
The new setup allows users to specify a vectorization mode (by default: “sync”, “async” or “vector_entry_point”) making it simpler to create and manage multiple environment instances simultaneously.</p>\\n\\n<div><pre><code><span>envs</span> <span>=</span> <span>gym</span><span>.</span><span>make_vec</span><span>(</span><span>\\\"CartPole-v1\\\"</span><span>,</span> <span>num_envs</span><span>=</span><span>3</span><span>,</span> <span>vectorization_mode</span><span>=</span><span>\\\"vector_entry_point\\\"</span><span>)</span>\\n</code></pre></div>\\n\\n<p>With this, we also support custom vectorization, defined for each environment separately, which allows for significantly more efficient implementations in certain cases.</p>\\n\\n<p>We also changed the auto-reset behavior of vectorized environments. In the past, when an episode terminated (or was truncated), it would immediately reset, with the final observation being passed in the <code>info</code> dict. This was less than ideal, as <code>info</code> was primarily intended for auxiliary information, not the main observations. Now, the observation returned alongside <code>terminated | truncated == True</code> is the final observation of the episode that just finished. The action following that will be ignored and treated as a reset signal, after which the new episode will begin. We will further elaborate on this in a future blog post.</p>\\n\\n<h2 id=\\\"2-improved-wrappers\\\">2. <strong>Improved Wrappers</strong></h2>\\n<p>With the separation of <code>Env</code> and <code>VectorEnv</code>, wrappers have also undergone changes. Wrappers that previously worked for both environments now have distinct variants for each. 
Standard environment wrappers are housed in <code>gymnasium.wrappers</code>, while vector-specific wrappers are found in <code>gymnasium.wrappers.vector</code>.</p>\\n\\n<p>Several wrappers have been renamed or removed for clarity, such as: <code>AutoResetWrapper</code> -&gt; <code>Autoreset</code> and <code>FrameStack</code> -&gt; <code>FrameStackObservation</code>.\\nNew wrappers have also been introduced, such as <code>DelayObservation</code> and <code>MaxAndSkipObservation</code>, to extend functionality.</p>\\n\\n<h3 id=\\\"3-functional-environments\\\">3. <strong>Functional Environments</strong></h3>\\n<p>To enable a different coding paradigm and allow for easier integration into planning algorithms, Gymnasium introduces <code>FuncEnv</code>, a functional version of <code>Env</code>. This new environment type exposes functions such as <code>reward</code>, <code>observation</code>, and <code>transition</code>, allowing for more flexible control over the environment’s behavior. Furthermore, functional environments implemented in JAX can often be jitted and vmapped, leading to massive performance improvements and easy hardware acceleration.</p>\\n\\n<h3 id=\\\"4-environment-version-changes\\\">4. <strong>Environment Version Changes</strong></h3>\\n<p>Several popular environments have seen updates:</p>\\n<ul>\\n  <li><strong>New MuJoCo v5</strong> have been added to support the latest MuJoCo versions along with new features and bug fixes.</li>\\n  <li><strong>Lunar Lander</strong> and <strong>CarRacing</strong> both received bug fixes, resulting in new environment versions.</li>\\n</ul>\\n\\n<h2 id=\\\"5-removing-the-plugin-system\\\">5. <strong>Removing the Plugin System</strong></h2>\\n<p>In version 1.0 we removed an undocumented plugin system that allowed for registering external environments behind the scenes. In previous versions, users could create environments like Atari or Minigrid without explicitly importing the relevant modules. 
Now, users will need to import these external libraries directly to register environments.</p>\\n\\n<p>Example:</p>\\n<div><pre><code><span>import</span> <span>gymnasium</span> <span>as</span> <span>gym</span>\\n<span>import</span> <span>ale_py</span>\\n\\n<span>gym</span><span>.</span><span>register_envs</span><span>(</span><span>ale_py</span><span>)</span>  <span># optional\\n</span><span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\\\"ALE/Pong-v5\\\"</span><span>)</span>\\n</code></pre></div>\\n<p>This change increases security, transparency and ensures a clearer workflow.</p>\\n\\n<h2 id=\\\"moving-ale-out-of-gymnasium\\\">Moving ALE out of Gymnasium</h2>\\n\\n<p>Atari (Arcade Learning Environment / ALE) and Gymnasium (and Gym) have been interlinked over the course of their existence. With v1.0 we decided to properly split them into two separate projects, with a new dedicated <a href=\\\"https://ale.farama.org/\\\">ALE website</a>. Given the Atari environments’ iconic history in reinforcement learning, we wanted to give a bit of backstory on this decision.</p>\\n\\n<p>The interface reinforcement learning researchers use for Atari environments is based on the Arcade Learning Environment (ALE), which was created at the University of Alberta as a C++ library in 2013 by some of the first researchers at DeepMind, including Marc Bellemare. A few years after this release, Python bindings were added to a public version of the ALE by Ben Goodrich, a researcher at the University of Tennessee (https://github.com/bbitmaster/ale_python_interface). 
This library was then forked by OpenAI, renamed Atari-Py, and released on PyPI as a dependency to enable Atari environments in Gym/Gymnasium.</p>\\n\\n<p>This problem of having Gymnasium depend on a fork of a fork of the ALE meant that any changes (such as new environments) or bug fixes that occurred in the ALE were not available in Atari-Py, which included the benchmark for 56 Atari environments that DeepMind first used in their work showing human level performance in the environments. This was made worse by the fact that DeepMind had made their own major modifications and upgrades to the ALE in this time period (some were published publicly in <a href=\\\"https://github.com/google-deepmind/xitari\\\">Xitari</a>, some were not). This created well founded scientific reproducibility concerns from the original ALE authors, which led to agreements between the original ALE authors, DeepMind and OpenAI to upstream python bindings from Atari-Py and ALE. Then, Gym would depend directly on the ALE, and DeepMind would contribute all their code to ALE so that the public could use it in a centralized and reproducible way. With the release of Gymnasium 1.0, we’re finally completing this process of upstreaming all Atari environments from Gymnasium to ALE.</p>\\n\\n<p>As a result, Atari documentation has been moved to <a href=\\\"https://farama.org/ale.farama.org\\\">ale.farama.org</a> with the Gymnasium links redirecting users. We are maintaining <code>pip install \\\"gymnasium[atari]\\\"</code> (though removing <code>accept-rom-license</code> as it’s unnecessary now) for backward compatibility. Most importantly, the plugin system described previously means that users need to <code>import ale_py</code> in order to register environments.</p>\\n\\n<h2 id=\\\"looking-ahead\\\">Looking Ahead</h2>\\n<p>Gymnasium 1.0 brings a refined, clearer, and more efficient framework for creating and interacting with reinforcement learning environments. 
With a focus on long-term stability, we intend this to be the final breaking change for the foreseeable future. We hope that this release sets the foundation for future growth in the reinforcement learning community.</p>\\n\\n    </div></div>\"\n}"
    }
  ],
  "details": {
    "url": "https://farama.org/Gymnasium-v1.0",
    "title": "Gymnasium v1.0: A Comprehensive Update",
    "summary": "After years of hard work, Gymnasium v1.0 has officially arrived! This release marks a major milestone for the Gymnasium project, refining the core API, addressing bugs, and enhancing features. Over 20…",
    "content": "<div id=\"readability-page-1\" class=\"page\"><div>\n        <h2 id=\"gymnasium-v10-a-comprehensive-update\">Gymnasium v1.0: A Comprehensive Update</h2>\n\n<p>After years of hard work, Gymnasium v1.0 has officially arrived! This release marks a major milestone for the Gymnasium project, refining the core API, addressing bugs, and enhancing features. Over 200 pull requests have been merged since version 0.29.1, culminating in Gymnasium v1.0, a stable release focused on improving the API (<code>Env</code>, <code>Space</code>, and <code>VectorEnv</code>).</p>\n\n<p>For a more detailed summary, see our <a href=\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.0.0\">release notes</a>. We have also published a white paper on Gymnasium you can check out <a href=\"https://arxiv.org/abs/2407.17032\">here</a> and cite if using in academic work.</p>\n\n<p>Let’s dive into some of the key changes!</p>\n\n<h2 id=\"1-vector-environments-overhaul\">1. <strong>Vector Environments Overhaul</strong></h2>\n<p>Vector environments have been significantly revamped. Previously, <code>VectorEnv</code> inherited from <code>Env</code>, which wasn’t technically valid, and caused various issues with method signatures. In v1.0, <code>Env</code> and <code>VectorEnv</code> are now distinct and fully supported, leading to clearer code and more efficient implementations.</p>\n\n<p>Moreover, Gymnasium introduces the <code>gymnasium.make_vec</code> function for easier vectorized environment creation. 
The new setup allows users to specify a vectorization mode (by default: “sync”, “async” or “vector_entry_point”) making it simpler to create and manage multiple environment instances simultaneously.</p>\n\n<div><pre><code><span>envs</span> <span>=</span> <span>gym</span><span>.</span><span>make_vec</span><span>(</span><span>\"CartPole-v1\"</span><span>,</span> <span>num_envs</span><span>=</span><span>3</span><span>,</span> <span>vectorization_mode</span><span>=</span><span>\"vector_entry_point\"</span><span>)</span>\n</code></pre></div>\n\n<p>With this, we also support custom vectorization, defined for each environment separately, which allows for significantly more efficient implementations in certain cases.</p>\n\n<p>We also changed the auto-reset behavior of vectorized environments. In the past, when an episode terminated (or was truncated), it would immediately reset, with the final observation being passed in the <code>info</code> dict. This was less than ideal, as <code>info</code> was primarily intended for auxiliary information, not the main observations. Now, the observation returned alongside <code>terminated | truncated == True</code> is the final observation of the episode that just finished. The action following that will be ignored and treated as a reset signal, after which the new episode will begin. We will further elaborate on this in a future blog post.</p>\n\n<h2 id=\"2-improved-wrappers\">2. <strong>Improved Wrappers</strong></h2>\n<p>With the separation of <code>Env</code> and <code>VectorEnv</code>, wrappers have also undergone changes. Wrappers that previously worked for both environments now have distinct variants for each. 
Standard environment wrappers are housed in <code>gymnasium.wrappers</code>, while vector-specific wrappers are found in <code>gymnasium.wrappers.vector</code>.</p>\n\n<p>Several wrappers have been renamed or removed for clarity, such as: <code>AutoResetWrapper</code> -&gt; <code>Autoreset</code> and <code>FrameStack</code> -&gt; <code>FrameStackObservation</code>.\nNew wrappers have also been introduced, such as <code>DelayObservation</code> and <code>MaxAndSkipObservation</code>, to extend functionality.</p>\n\n<h3 id=\"3-functional-environments\">3. <strong>Functional Environments</strong></h3>\n<p>To enable a different coding paradigm and allow for easier integration into planning algorithms, Gymnasium introduces <code>FuncEnv</code>, a functional version of <code>Env</code>. This new environment type exposes functions such as <code>reward</code>, <code>observation</code>, and <code>transition</code>, allowing for more flexible control over the environment’s behavior. Furthermore, functional environments implemented in JAX can often be jitted and vmapped, leading to massive performance improvements and easy hardware acceleration.</p>\n\n<h3 id=\"4-environment-version-changes\">4. <strong>Environment Version Changes</strong></h3>\n<p>Several popular environments have seen updates:</p>\n<ul>\n  <li><strong>New MuJoCo v5</strong> environments have been added to support the latest MuJoCo versions along with new features and bug fixes.</li>\n  <li><strong>Lunar Lander</strong> and <strong>CarRacing</strong> both received bug fixes, resulting in new environment versions.</li>\n</ul>\n\n<h2 id=\"5-removing-the-plugin-system\">5. <strong>Removing the Plugin System</strong></h2>\n<p>In version 1.0 we removed an undocumented plugin system that allowed for registering external environments behind the scenes. In previous versions, users could create environments like Atari or Minigrid without explicitly importing the relevant modules. 
Now, users will need to import these external libraries directly to register environments.</p>\n\n<p>Example:</p>\n<div><pre><code><span>import</span> <span>gymnasium</span> <span>as</span> <span>gym</span>\n<span>import</span> <span>ale_py</span>\n\n<span>gym</span><span>.</span><span>register_envs</span><span>(</span><span>ale_py</span><span>)</span>  <span># optional\n</span><span>env</span> <span>=</span> <span>gym</span><span>.</span><span>make</span><span>(</span><span>\"ALE/Pong-v5\"</span><span>)</span>\n</code></pre></div>\n<p>This change increases security, transparency and ensures a clearer workflow.</p>\n\n<h2 id=\"moving-ale-out-of-gymnasium\">Moving ALE out of Gymnasium</h2>\n\n<p>Atari (Arcade Learning Environment / ALE) and Gymnasium (and Gym) have been interlinked over the course of their existence. With v1.0 we decided to properly split them into two separate projects, with a new dedicated <a href=\"https://ale.farama.org/\">ALE website</a>. Given the Atari environments’ iconic history in reinforcement learning, we wanted to give a bit of backstory on this decision.</p>\n\n<p>The interface reinforcement learning researchers use for Atari environments is based on the Arcade Learning Environment (ALE), which was created at the University of Alberta as a C++ library in 2013 by some of the first researchers at DeepMind, including Marc Bellemare. A few years after this release, Python bindings were added to a public version of the ALE by Ben Goodrich, a researcher at the University of Tennessee (https://github.com/bbitmaster/ale_python_interface). 
This library was then forked by OpenAI, renamed Atari-Py, and released on PyPI as a dependency to enable Atari environments in Gym/Gymnasium.</p>\n\n<p>This problem of having Gymnasium depend on a fork of a fork of the ALE meant that any changes (such as new environments) or bug fixes that occurred in the ALE were not available in Atari-Py, which included the benchmark for 56 Atari environments that DeepMind first used in their work showing human level performance in the environments. This was made worse by the fact that DeepMind had made their own major modifications and upgrades to the ALE in this time period (some were published publicly in <a href=\"https://github.com/google-deepmind/xitari\">Xitari</a>, some were not). This created well founded scientific reproducibility concerns from the original ALE authors, which led to agreements between the original ALE authors, DeepMind and OpenAI to upstream python bindings from Atari-Py and ALE. Then, Gym would depend directly on the ALE, and DeepMind would contribute all their code to ALE so that the public could use it in a centralized and reproducible way. With the release of Gymnasium 1.0, we’re finally completing this process of upstreaming all Atari environments from Gymnasium to ALE.</p>\n\n<p>As a result, Atari documentation has been moved to <a href=\"https://farama.org/ale.farama.org\">ale.farama.org</a> with the Gymnasium links redirecting users. We are maintaining <code>pip install \"gymnasium[atari]\"</code> (though removing <code>accept-rom-license</code> as it’s unnecessary now) for backward compatibility. Most importantly, the plugin system described previously means that users need to <code>import ale_py</code> in order to register environments.</p>\n\n<h2 id=\"looking-ahead\">Looking Ahead</h2>\n<p>Gymnasium 1.0 brings a refined, clearer, and more efficient framework for creating and interacting with reinforcement learning environments. 
With a focus on long-term stability, we intend this to be the final breaking change for the foreseeable future. We hope that this release sets the foundation for future growth in the reinforcement learning community.</p>\n\n    </div></div>"
  }
}
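
The auto-reset change described in the v1.0 post above can be illustrated with a small stand-alone sketch. It does not use Gymnasium itself: `CountdownEnv`, `NextStepAutoreset`, and `rollout` are hypothetical names made up for this illustration, and the zero reward returned on the reset step is an assumption rather than something the post states. The sketch only mirrors the behavior as described: the step that ends an episode returns that episode's final observation directly, and the next action is ignored and treated as a reset signal.

```python
class CountdownEnv:
    """Hypothetical toy environment: the observation is a step counter
    and the episode terminates once it reaches `length`."""

    def __init__(self, length=3):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t, {}

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.length
        return self.t, 1.0, terminated, False, {}


class NextStepAutoreset:
    """Sketch of the auto-reset behavior described in the post.
    This is an illustration, not Gymnasium's actual implementation."""

    def __init__(self, env):
        self.env = env
        self.needs_reset = False

    def reset(self):
        self.needs_reset = False
        return self.env.reset()

    def step(self, action):
        if self.needs_reset:
            # The previous step ended the episode, so `action` is ignored
            # and treated as a reset signal. Reward 0.0 is an assumption.
            obs, info = self.env.reset()
            self.needs_reset = False
            return obs, 0.0, False, False, info
        obs, reward, terminated, truncated, info = self.env.step(action)
        # The final observation is returned as-is rather than stored in `info`.
        self.needs_reset = terminated or truncated
        return obs, reward, terminated, truncated, info


def rollout(num_steps=5):
    """Collect (observation, terminated) pairs to make the reset step visible."""
    env = NextStepAutoreset(CountdownEnv(length=3))
    obs, _ = env.reset()
    trace = [obs]
    for _ in range(num_steps):
        obs, _, terminated, _, _ = env.step(0)
        trace.append((obs, terminated))
    return trace
```

In a training loop, this means the transition that follows a finished episode carries no real action, so it is usually skipped when filling a replay buffer.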

▼✓Fetch web page content URL: "https://gymnasium.farama.org/gymnasium_release_notes/index.html"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://gymnasium.farama.org/gymnasium_release_notes/index.html\",\n  \"title\": \"Gymnasium Documentation\",\n  \"summary\": \"A standard API for reinforcement learning and a diverse set of reference environments (formerly Gym)\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><article role=\\\"main\\\">\\n            \\n            <section id=\\\"gymnasium-release-notes\\\">\\n<h2>Gymnasium Release Notes<a href=\\\"#gymnasium-release-notes\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<section id=\\\"release-v1-3-0\\\">\\n<h2>v1.3.0<a href=\\\"#release-v1-3-0\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p><em>Released on 2026-04-22 - <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.3.0\\\">GitHub</a> - <a href=\\\"https://pypi.org/project/gymnasium/v1.3.0/\\\">PyPI</a></em></p>\\n<h2>Gymnasium v1.3.0</h2>\\n<p>This release brings a new Taxi environment version, a new <code>RepeatAction</code> wrapper, and a range of bug fixes across vector environments and wrappers.</p>\\n<h2>Core Changes</h2>\\n<ul>\\n<li>The <code>Taxi</code> environment has been updated to <code>v4</code> to correct the <code>is_rainy</code> implementation, which previously did not behave as documented by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"4307782993\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1561\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1561/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1561\\\">#1561</a>)</li>\\n<li><code>pygame</code> has been 
replaced with <a href=\\\"https://pyga.me/\\\" rel=\\\"nofollow\\\"><code>pygame-ce</code></a>, unlocking Python 3.14 compatibility. The drop-in replacement preserves the existing rendering behaviour by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/mwydmuch/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/mwydmuch\\\">@mwydmuch</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3788673329\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1512\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1512/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1512\\\">#1512</a>)</li>\\n<li>Added <code>RepeatAction</code> wrapper that repeats a given action for a fixed number of steps, useful for frame-skipping and coarser control loops by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/Lidang-Jiang/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/Lidang-Jiang\\\">@Lidang-Jiang</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"4194777638\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1553\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1553/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1553\\\">#1553</a>)</li>\\n<li>Optimised <code>Box.__init__</code> to reduce overhead through lazy evaluation of variables by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" 
href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3958776450\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1529\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1529/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1529\\\">#1529</a>)</li>\\n<li>Fix vector <code>RecordVideo</code> wrapper to remove memory leaks across episodes by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/JonahFSD/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/JonahFSD\\\">@JonahFSD</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3914195171\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1527\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1527/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1527\\\">#1527</a>)</li>\\n<li>Fix Vector <code>NormalizeReward</code> wrapper to work identically as the non-vectorized version by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/JonahFSD/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/JonahFSD\\\">@JonahFSD</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3912418439\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1526\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1526/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1526\\\">#1526</a>)</li>\\n</ul>\\n<h2>Bug Fixes</h2>\\n<ul>\\n<li>Removed the 
<code>VectorEnv</code> destructor, which was causing <code>VectorEnv.close()</code> to be called unintentionally by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/TimSchneider42/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/TimSchneider42\\\">@TimSchneider42</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3832761339\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1522\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1522/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1522\\\">#1522</a>)</li>\\n<li>Fixed <code>NormalizeObservation</code> vectorized wrapper to override the <code>observation_space</code> and a <code>float32</code> cast by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/JonahFSD/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/JonahFSD\\\">@JonahFSD</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3914296741\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1528\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1528/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1528\\\">#1528</a>)</li>\\n<li>Fixed <code>RenderFrame</code>'s typing by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/jorenham/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/jorenham\\\">@jorenham</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"4305182899\\\" data-permission-text=\\\"Title is private\\\" 
data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1560\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1560/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1560\\\">#1560</a>)</li>\\n<li>Fixed A2C tutorial's entropy computation when a scalar (rather than tensor) is passed to the loss function by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/epoint95/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/epoint95\\\">@epoint95</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3868402075\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1524\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1524/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1524\\\">#1524</a>)</li>\\n</ul>\\n<h2>Third-Party Environments</h2>\\n<p>10 new community environments have been added to the <a href=\\\"https://gymnasium.farama.org/environments/third_party_environments/\\\" rel=\\\"nofollow\\\">third-party environments list</a>, including a new Cybersecurity environments section.</p>\\n<p><strong>Full Changelog:</strong> <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.2.3...v1.3.0\\\"><tt>v1.2.3...v1.3.0</tt></a></p></section>\\n<section id=\\\"release-v1-2-3\\\">\\n<h2>v1.2.3<a href=\\\"#release-v1-2-3\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p><em>Released on 2025-12-18 - <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.2.3\\\">GitHub</a> - <a href=\\\"https://pypi.org/project/gymnasium/v1.2.3/\\\">PyPI</a></em></p>\\n<h2>Gymnasium V1.2.3</h2>\\n<p>This is a minor release with the most significant being changing the dependency  for \\\"gymnasium[box2d]\\\" from <a 
href=\\\"https://pypi.org/project/box2d-py\\\" rel=\\\"nofollow\\\">box2d-py</a> to <a href=\\\"https://pypi.org/project/Box2D/\\\" rel=\\\"nofollow\\\">box2d</a>. See <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3580605032\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1474\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1474/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1474\\\">#1474</a> for more detail.</p>\\n<p>Additionally, we've fixed several typos and added a couple of third-party projects. Lastly, we fixed the create an environment tutorial to use NumPy [row, col] in <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3704120436\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1490\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1490/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1490\\\">#1490</a></p>\\n<p><strong>Full Changelog</strong>: <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.2.2...v1.2.3\\\"><tt>v1.2.2...v1.2.3</tt></a></p></section>\\n<section id=\\\"release-v1-2-2\\\">\\n<h2>v1.2.2: V1.2.2<a href=\\\"#release-v1-2-2\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p><em>Released on 2025-11-04 - <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.2.2\\\">GitHub</a> - <a href=\\\"https://pypi.org/project/gymnasium/v1.2.2/\\\">PyPI</a></em></p>\\n<h2>Gymnasium v1.2.2</h2>\\n<p>This is a minor update just to add <code>Discrete.dtype</code> and improve <code>DictInfoToList</code></p>\\n<p>These additional changes were made</p>\\n<ul>\\n<li>Add <code>Discrete.dtype</code> parameter by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/VadimBim/hovercard\\\" 
data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/VadimBim\\\">@VadimBim</a> in <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3529955390\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1467\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1467/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1467\\\">#1467</a></li>\\n<li>Improve <code>MultiDiscrete.dtype</code> by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> in <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3542266766\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1469\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1469/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1469\\\">#1469</a></li>\\n<li>Improve <code>DictInfoToList</code> to support vector info with missing binary key by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> in <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3526222616\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1466\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1466/hovercard\\\" 
href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1466\\\">#1466</a></li>\\n</ul>\\n<p>Thanks to Wispr for their support of the project</p>\\n<p><strong>Full Changelog</strong>: <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.2.1...v1.2.2\\\"><tt>v1.2.1...v1.2.2</tt></a></p></section>\\n<section id=\\\"release-v1-2-1\\\">\\n<h2>v1.2.1<a href=\\\"#release-v1-2-1\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p><em>Released on 2025-09-23 - <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.2.1\\\">GitHub</a> - <a href=\\\"https://pypi.org/project/gymnasium/v1.2.1/\\\">PyPI</a></em></p>\\n<h2>Gymnasium v1.2.1</h2>\\n<p>Minor update that adds new wrappers, optimizes several environment functions and fixes several bugs.</p>\\n<h2>New Features</h2>\\n<ul>\\n<li>Added new wrappers to discretize observations and actions (<code>gymnasium.wrappers.DiscretizeObservation</code> and <code>gymnasium.wrappers.DiscretizeAction</code>) by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/sparisi/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/sparisi\\\">@sparisi</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3185358242\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1411\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1411/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1411\\\">#1411</a>)</li>\\n<li>Add vector-based <code>RecordVideo</code> wrapper that allows recording all sub-environments at the same time by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/sparisi/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/sparisi\\\">@sparisi</a> (<a 
data-error-text=\\\"Failed to load title\\\" data-id=\\\"3241889899\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1418\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1418/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1418\\\">#1418</a>)</li>\\n<li>Add new customization to Frozen Lake for reward schedule and success rate by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/Tayzn/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/Tayzn\\\">@Tayzn</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3021051457\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1363\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1363/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1363\\\">#1363</a>)</li>\\n<li>Add tutorial on action masking by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/dantp-ai/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/dantp-ai\\\">@dantp-ai</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2958306932\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1345\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1345/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1345\\\">#1345</a>)</li>\\n<li>Optimize several environment functions for Blackjack and Mujoco-Humanoid by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/aseembits93/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" 
data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/aseembits93\\\">@aseembits93</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3298033175\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1426\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1426/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1426\\\">#1426</a>, <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3320254150\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1429\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1429/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1429\\\">#1429</a>, <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3320248227\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1428\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1428/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1428\\\">#1428</a> and <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3356742105\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1436\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1436/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1436\\\">#1436</a>)</li>\\n</ul>\\n<h2>Fixes</h2>\\n<ul>\\n<li>Fixes memory leak in recording by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pascal-roth/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" 
href=\\\"https://github.com/pascal-roth\\\">@pascal-roth</a> in <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3394001149\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1444\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1444/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1444\\\">#1444</a></li>\\n<li>Fix \\\"pygame is not responding\\\" when rendering the pygame window in the Taxi environment by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/RizkySugiharto/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/RizkySugiharto\\\">@RizkySugiharto</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3182695017\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1410\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1410/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1410\\\">#1410</a>)</li>\\n<li>Fix the vector wrapper warnings by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3205200502\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1415\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1415/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1415\\\">#1415</a>)</li>\\n<li>Fix <code>AsyncVectorEnv.step</code> from hanging forever by <a 
data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/matinmoezzi/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/matinmoezzi\\\">@matinmoezzi</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3243762778\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1419\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1419/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1419\\\">#1419</a>)</li>\\n<li>Clear markers after rendering for MuJoCo OffscreenViewers by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/amacati/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/amacati\\\">@amacati</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3248714510\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1420\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1420/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1420\\\">#1420</a>)</li>\\n<li>Fix array-conversion for torch cuda tensors by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/amacati/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/amacati\\\">@amacati</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3363046088\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1439\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1439/hovercard\\\" 
href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1439\\\">#1439</a>)</li>\\n<li>Add missing dependency on packaging to array-api feature by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/amacati/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/amacati\\\">@amacati</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3361105679\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1437\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1437/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1437\\\">#1437</a>)</li>\\n</ul>\\n<p>Thank you to all the contributors</p>\\n<p><strong>Full Changelog</strong>: <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.2.0...v1.2.1\\\"><tt>v1.2.0...v1.2.1</tt></a></p></section>\\n<section id=\\\"release-v1-2-0\\\">\\n<h2>v1.2.0<a href=\\\"#release-v1-2-0\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p><em>Released on 2025-06-27 - <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.2.0\\\">GitHub</a> - <a href=\\\"https://pypi.org/project/gymnasium/v1.2.0/\\\">PyPI</a></em></p>\\n<h2>Gymnasium v1.2.0</h2>\\n<p>In Gym <a href=\\\"https://github.com/openai/gym/releases/tag/0.24.0\\\">v0.24</a>, v4 MuJoCo environments were added that used a different simulator (mujoco, not mujoco-py).<br>\\nIt has been 3 years since v0.24, and with mujoco-py unmaintained, this is limiting our ability to support Python 3.13.<br>\\nTherefore, in this release, we have moved the MuJoCo v2 and v3 environments to the <a href=\\\"https://github.com/Farama-Foundation/gymnasium-robotics\\\">Gymnasium-Robotics</a> project, meaning that users who cannot upgrade to the v4 or v5 MuJoCo environments should update their code to:</p>\\n<div 
data-snippet-clipboard-copy-content=\\\"import gymnasium as gym\\nimport gymnasium_robotics  # `pip install &quot;gymnasium-robotics[mujoco-py]&quot;`\\n\\ngym.register_envs(gymnasium_robotics)  # optional\\n\\nenv = gym.make(&quot;Humanoid-v3&quot;)\\\"><pre><span>import</span> <span>gymnasium</span> <span>as</span> <span>gym</span>\\n<span>import</span> <span>gymnasium_robotics</span>  <span># `pip install \\\"gymnasium-robotics[mujoco-py]\\\"`</span>\\n\\n<span>gym</span>.<span>register_envs</span>(<span>gymnasium_robotics</span>)  <span># optional</span>\\n\\n<span>env</span> <span>=</span> <span>gym</span>.<span>make</span>(<span>\\\"Humanoid-v3\\\"</span>)</pre></div>\\n<p>In addition, we have added support for Python 3.13 (and dropped Python 3.8 and 3.9 following NumPy and other projects).</p>\\n<h2>New Features</h2>\\n<ul>\\n<li>Add <code>AddWhiteNoise</code> and <code>ObstructView</code> wrappers that add noise to RGB renderings either across the whole image or sections by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/sparisi/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/sparisi\\\">@sparisi</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2645643074\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1243\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1243/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1243\\\">#1243</a>)</li>\\n<li>Adds rainy and fickle options for Taxi environment that were implemented in the original paper by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/foreverska/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/foreverska\\\">@foreverska</a> (<a 
data-error-text=\\\"Failed to load title\\\" data-id=\\\"2867689892\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1315\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1315/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1315\\\">#1315</a>)</li>\\n<li>Add <code>wrappers.ArrayConversion</code>, a generic conversion wrapper between Array API compatible frameworks (like NumPy, Torch, Jax, etc) by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/amacati/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/amacati\\\">@amacati</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2930111508\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1333\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1333/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1333\\\">#1333</a>)</li>\\n<li>Add garbage collector on RecordVideo wrapper to prevent out of memory problems by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/vicbentu/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/vicbentu\\\">@vicbentu</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3054955909\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1378\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1378/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1378\\\">#1378</a>)</li>\\n</ul>\\n<h2>Bug Fixes</h2>\\n<ul>\\n<li>Fix MuJoCo add_markers for mujoco&gt;=3.2 by <a 
data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/amacati/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/amacati\\\">@amacati</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2900883178\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1329\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1329/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1329\\\">#1329</a>)</li>\\n<li>Update CliffWalking version in envs registry by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/abouelsaadat/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/abouelsaadat\\\">@abouelsaadat</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2936052417\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1334\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1334/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1334\\\">#1334</a>)</li>\\n<li>Fix jax 0.6.0 regression in jax_to_numpy by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/GaetanLepage/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/GaetanLepage\\\">@GaetanLepage</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3004879973\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1361\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1361/hovercard\\\" 
href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1361\\\">#1361</a>)</li>\\n<li>Fix Atari Preprocessing to only check env has NOOP action when <code>noop_max&gt;0</code> by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3098488051\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1393\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1393/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1393\\\">#1393</a>)</li>\\n<li>Fix all warnings in tests/spaces by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/amacati/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/amacati\\\">@amacati</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3126115146\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1396\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1396/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1396\\\">#1396</a>)</li>\\n<li>Don't override SDL's audio driver on Linux by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/AnErrupTion/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/AnErrupTion\\\">@AnErrupTion</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"3079104537\\\" data-permission-text=\\\"Title is private\\\" 
data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1387\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1387/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1387\\\">#1387</a>)</li>\\n</ul>\\n<p>In addition, this release includes numerous updates to the documentation, most importantly to the introductory pages, with an aim to make them easier for new users of Gymnasium or RL to understand.</p>\\n<p><strong>Full Changelog</strong>: <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.1.1...v1.2.0\\\"><tt>v1.1.1...v1.2.0</tt></a></p></section>\\n<section id=\\\"release-v1-1-1\\\">\\n<h2>Gymnasium v1.1.1<a href=\\\"#release-v1-1-1\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p><em>Released on 2025-03-06 - <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.1.1\\\">GitHub</a> - <a href=\\\"https://pypi.org/project/gymnasium/v1.1.1/\\\">PyPI</a></em></p>\\n<h2>Changes</h2>\\n<ul>\\n<li>Remove assert on metadata render modes for MuJoCo-based environments in <code>mujoco_env.py</code> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2900123613\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1328\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1328/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1328\\\">#1328</a>)</li>\\n<li>Update <code>wrappers.vector.NumpyToTorch</code> to refer to <code>numpy</code> instead of <code>jax</code>  by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pkuderov/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pkuderov\\\">@pkuderov</a> in <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2883844632\\\" data-permission-text=\\\"Title 
is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1319\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1319/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1319\\\">#1319</a></li>\\n</ul>\\n<p><strong>Full Changelog</strong>: <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.1.0...v1.1.1\\\"><tt>v1.1.0...v1.1.1</tt></a></p></section>\\n<section id=\\\"release-v1-1-0\\\">\\n<h2>v1.1.0<a href=\\\"#release-v1-1-0\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p><em>Released on 2025-02-26 - <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.1.0\\\">GitHub</a> - <a href=\\\"https://pypi.org/project/gymnasium/v1.1.0/\\\">PyPI</a></em></p>\\n<h2>Gymnasium v1.1</h2>\\n<p>In this release, we fix several bugs with Gymnasium v1.0 along with new features to improve the changes made.</p>\\n<p>With the release of Gymnasium v1.0, one of the major changes we made was to the vector environment implementation, improving how users interface with it and extend it. Some users noted that they required backward compatibility with how vector environments automatically reset sub-environments when they terminated or truncated, referred to as the autoreset mode or API. As a result, in v1.1, we have added support to the implemented vector environments (<code>SyncVectorEnv</code> and <code>AsyncVectorEnv</code>) and wrappers for all three possible modes: next-step, same-step and disabled. The autoreset mode in use should be specified in the vector environment metadata, <code>metadata[\\\"autoreset_mode\\\"]</code>, as a <code>gymnasium.vector.AutoresetMode</code> enum. 
For more information on the differences between the autoreset modes and how to use them, read <a href=\\\"https://farama.org/Vector-Autoreset-Mode\\\" rel=\\\"nofollow\\\">https://farama.org/Vector-Autoreset-Mode</a>.</p>\\n<p>In addition, we've added several helpful features:</p>\\n<ul>\\n<li>In Gym v0.25, we added masking for spaces, <code>space.sample(mask=...)</code>, to logically mask out possible samples from spaces. <a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2862826251\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1310\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1310/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1310\\\">#1310</a> adds probability masking for each space with <code>space.sample(probability=...)</code> to specify the probability of each sample, which is helpful for RL policies that output a probability distribution of actions.</li>\\n<li>For MuJoCo environments, users have the option to render the robot with RGB images or depth-based images. Previously, it was possible to access only the RGB or only the depth rendering. Gymnasium v1.1 adds RGBD rendering that outputs both the RGB and depth-based image as a single output.</li>\\n<li>In Gymnasium v1.0, we improved the conversion wrappers between <code>jax</code>, <code>torch</code> and <code>numpy</code> with <code>JaxToTorch</code>, <code>JaxToNumpy</code>, <code>NumpyToTorch</code>, etc. 
In v1.1, we've improved the wrappers to work with rendering and to be compatible with the full dlpack API.</li>\\n</ul>\\n<h2>New Features</h2>\\n<ul>\\n<li>Space sampling with probability masks by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/mariojerez/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/mariojerez\\\">@mariojerez</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2862826251\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1310\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1310/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1310\\\">#1310</a>)</li>\\n<li>Add rgb + depth rendering to mujoco by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/DavidPL1/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/DavidPL1\\\">@DavidPL1</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2609027305\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1229\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1229/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1229\\\">#1229</a>)</li>\\n<li>Add Render Support for <code>JaxToNumpy</code>, <code>JaxToTorch</code> and <code>NumpyToTorch</code> by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/amacati/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/amacati\\\">@amacati</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2848895679\\\" data-permission-text=\\\"Title is private\\\" 
data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1306\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1306/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1306\\\">#1306</a>)</li>\\n<li>Add full support for dlpack API by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/amacati/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/amacati\\\">@amacati</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2814347019\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1299\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1299/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1299\\\">#1299</a>)</li>\\n<li>Enable the StickyAction wrapper to repeat the old action for more than 1 step by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/sparisi/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/sparisi\\\">@sparisi</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2640457710\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1240\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1240/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1240\\\">#1240</a>)</li>\\n<li>Add <code>wrappers.vector.TransformObs/Action</code> support for a single obs/action space argument by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/howardh/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" 
href=\\\"https://github.com/howardh\\\">@howardh</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2774346221\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1288\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1288/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1288\\\">#1288</a>)</li>\\n<li>Improve 'set_wrapper_attr' for wrappers to modify how variables are updated by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/duburcqa/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/duburcqa\\\">@duburcqa</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2796910732\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1294\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1294/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1294\\\">#1294</a>)</li>\\n<li>Allow <code>AtariPreprocessing</code> to support non-square observations by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/li-plus/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/li-plus\\\">@li-plus</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2865365152\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1312\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1312/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1312\\\">#1312</a>)</li>\\n<li>Improve <code>Wrapper</code> and <code>VectorWrapper</code> error checking by <a 
data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2682633843\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1260\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1260/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1260\\\">#1260</a>)</li>\\n</ul>\\n<h2>Bug Fixes</h2>\\n<ul>\\n<li>Fix <code>get_wrapper_attr</code> / <code>set_wrapper_attr</code> for edge case by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/duburcqa/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/duburcqa\\\">@duburcqa</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2796262300\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1293\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1293/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1293\\\">#1293</a>)</li>\\n<li>Add width and height check for MujocoRenderer to help error checking by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/Beanpow/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/Beanpow\\\">@Beanpow</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2609101033\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1230\\\" data-hovercard-type=\\\"pull_request\\\" 
data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1230/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1230\\\">#1230</a>)</li>\\n<li>Fix CarRacing-v3 with discrete actions by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/Sebastian-Griesbach/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/Sebastian-Griesbach\\\">@Sebastian-Griesbach</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2661510536\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1253\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1253/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1253\\\">#1253</a>)</li>\\n<li>Fix Mujoco Rendering for <code>rgb_array</code> that is upside down by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2701582796\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1264\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1264/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1264\\\">#1264</a>)</li>\\n<li>Revert <code>OrderedDict</code> key ordering in <code>Dict</code> space by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> (<a data-error-text=\\\"Failed 
to load title\\\" data-id=\\\"2776425048\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1291\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1291/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1291\\\">#1291</a>)</li>\\n<li><code>wrappers.vector.NumpyToTorch</code> doesn't require <code>jax</code> to work by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2852244816\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1308\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1308/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1308\\\">#1308</a>)</li>\\n<li>Prevented accidental alteration of <code>env_spec_kwargs</code> in <code>make_vec</code> by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/TimSchneider42/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/TimSchneider42\\\">@TimSchneider42</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2838283083\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1304\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1304/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1304\\\">#1304</a>)</li>\\n<li>Quit Pygame when cliffwalking environment is closed by @sjmikler (<a data-error-text=\\\"Failed to load title\\\" 
data-id=\\\"2807765801\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1297\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1297/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1297\\\">#1297</a>)</li>\\n<li>Add <code>TimeAwareObservation</code> support for environments without a <code>spec</code> by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/pseudo-rnd-thoughts/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/pseudo-rnd-thoughts\\\">@pseudo-rnd-thoughts</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2775663588\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1289\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1289/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1289\\\">#1289</a>)</li>\\n<li>Improve the error message when ending the MuJoCo renderer without calling <code>env.close</code> by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/a-ayesh/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/a-ayesh\\\">@a-ayesh</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2761798816\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1283\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1283/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1283\\\">#1283</a>)</li>\\n</ul>\\n<h2>Documentation Changes</h2>\\n<ul>\\n<li>Fixed typo in blackjack for the values sampled, 2-10 instead of 2-9 by <a 
data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/artemmiyy/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/artemmiyy\\\">@artemmiyy</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2656508334\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1250\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1250/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1250\\\">#1250</a>)</li>\\n<li>Added some popular third-party beginner tutorials by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/godsboy404/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/godsboy404\\\">@godsboy404</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2737913912\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1271\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1271/hovercard\\\" href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1271\\\">#1271</a>)</li>\\n<li>Update documentation for reward scaling wrappers by <a data-hovercard-type=\\\"user\\\" data-hovercard-url=\\\"/users/keraJLi/hovercard\\\" data-octo-click=\\\"hovercard-link-click\\\" data-octo-dimensions=\\\"link_type:self\\\" href=\\\"https://github.com/keraJLi\\\">@keraJLi</a> (<a data-error-text=\\\"Failed to load title\\\" data-id=\\\"2765903176\\\" data-permission-text=\\\"Title is private\\\" data-url=\\\"https://github.com/Farama-Foundation/Gymnasium/issues/1285\\\" data-hovercard-type=\\\"pull_request\\\" data-hovercard-url=\\\"/Farama-Foundation/Gymnasium/pull/1285/hovercard\\\" 
href=\\\"https://github.com/Farama-Foundation/Gymnasium/pull/1285\\\">#1285</a>)</li>\\n</ul>\\n<p>Thanks to the 31 new contributors that contributed to this release with the <strong>Full Changelog</strong>: <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.0.0...v1.1.0\\\"><tt>v1.0.0...v1.1.0</tt></a></p></section>\\n<section id=\\\"release-v1-0-0\\\">\\n<h2>v1.0.0<a href=\\\"#release-v1-0-0\\\" title=\\\"Link to this heading\\\">¶</a></h2>\\n<p><em>Released on 2024-10-08 - <a href=\\\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.0.0\\\">GitHub</a> - <a href=\\\"https://pypi.org/project/gymnasium/v1.0.0/\\\">PyPI</a></em></p>\\n<h2>v1.0.0 release notes</h2>\\n<p>Over the last few years, the volunteer team behind Gym and Gymnasium has worked to fix bugs, improve the documentation, add new features, and change the API where appropriate so that the benefits outweigh the co\"\n}"
    }
  ],
  "details": {
    "url": "https://gymnasium.farama.org/gymnasium_release_notes/index.html",
    "title": "Gymnasium Documentation",
    "summary": "A standard API for reinforcement learning and a diverse set of reference environments (formerly Gym)",
    "content": "<div id=\"readability-page-1\" class=\"page\"><article role=\"main\">\n            \n            <section id=\"gymnasium-release-notes\">\n<h2>Gymnasium Release Notes<a href=\"#gymnasium-release-notes\" title=\"Link to this heading\">¶</a></h2>\n<section id=\"release-v1-3-0\">\n<h2>v1.3.0<a href=\"#release-v1-3-0\" title=\"Link to this heading\">¶</a></h2>\n<p><em>Released on 2026-04-22 - <a href=\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.3.0\">GitHub</a> - <a href=\"https://pypi.org/project/gymnasium/v1.3.0/\">PyPI</a></em></p>\n<h2>Gymnasium v1.3.0</h2>\n<p>This release brings a new Taxi environment version, a new <code>RepeatAction</code> wrapper, and a range of bug fixes across vector environments and wrappers.</p>\n<h2>Core Changes</h2>\n<ul>\n<li>The <code>Taxi</code> environment has been updated to <code>v4</code> to correct the <code>is_rainy</code> implementation, which previously did not behave as documented by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> (<a data-error-text=\"Failed to load title\" data-id=\"4307782993\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1561\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1561/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1561\">#1561</a>)</li>\n<li><code>pygame</code> has been replaced with <a href=\"https://pyga.me/\" rel=\"nofollow\"><code>pygame-ce</code></a>, unlocking Python 3.14 compatibility. 
The drop-in replacement preserves the existing rendering behaviour by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/mwydmuch/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/mwydmuch\">@mwydmuch</a> (<a data-error-text=\"Failed to load title\" data-id=\"3788673329\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1512\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1512/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1512\">#1512</a>)</li>\n<li>Added <code>RepeatAction</code> wrapper that repeats a given action for a fixed number of steps, useful for frame-skipping and coarser control loops by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/Lidang-Jiang/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/Lidang-Jiang\">@Lidang-Jiang</a> (<a data-error-text=\"Failed to load title\" data-id=\"4194777638\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1553\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1553/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1553\">#1553</a>)</li>\n<li>Optimised <code>Box.__init__</code> to reduce overhead through lazy evaluation of variables by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> (<a data-error-text=\"Failed to load title\" data-id=\"3958776450\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1529\" data-hovercard-type=\"pull_request\" 
data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1529/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1529\">#1529</a>)</li>\n<li>Fix vector <code>RecordVideo</code> wrapper to remove memory leaks across episodes by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/JonahFSD/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/JonahFSD\">@JonahFSD</a> (<a data-error-text=\"Failed to load title\" data-id=\"3914195171\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1527\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1527/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1527\">#1527</a>)</li>\n<li>Fix Vector <code>NormalizeReward</code> wrapper to work identically as the non-vectorized version by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/JonahFSD/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/JonahFSD\">@JonahFSD</a> (<a data-error-text=\"Failed to load title\" data-id=\"3912418439\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1526\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1526/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1526\">#1526</a>)</li>\n</ul>\n<h2>Bug Fixes</h2>\n<ul>\n<li>Removed the <code>VectorEnv</code> destructor, which was causing <code>VectorEnv.close()</code> to be called unintentionally by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/TimSchneider42/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/TimSchneider42\">@TimSchneider42</a> (<a data-error-text=\"Failed to load title\" data-id=\"3832761339\" 
data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1522\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1522/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1522\">#1522</a>)</li>\n<li>Fixed <code>NormalizeObservation</code> vectorized wrapper to override the <code>observation_space</code> and a <code>float32</code> cast by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/JonahFSD/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/JonahFSD\">@JonahFSD</a> (<a data-error-text=\"Failed to load title\" data-id=\"3914296741\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1528\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1528/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1528\">#1528</a>)</li>\n<li>Fixed <code>RenderFrame</code>'s typing by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/jorenham/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/jorenham\">@jorenham</a> (<a data-error-text=\"Failed to load title\" data-id=\"4305182899\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1560\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1560/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1560\">#1560</a>)</li>\n<li>Fixed A2C tutorial's entropy computation when a scalar (rather than tensor) is passed to the loss function by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/epoint95/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/epoint95\">@epoint95</a> 
(<a data-error-text=\"Failed to load title\" data-id=\"3868402075\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1524\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1524/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1524\">#1524</a>)</li>\n</ul>\n<h2>Third-Party Environments</h2>\n<p>10 new community environments have been added to the <a href=\"https://gymnasium.farama.org/environments/third_party_environments/\" rel=\"nofollow\">third-party environments list</a>, including a new Cybersecurity environments section.</p>\n<p><strong>Full Changelog:</strong> <a href=\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.2.3...v1.3.0\"><tt>v1.2.3...v1.3.0</tt></a></p></section>\n<section id=\"release-v1-2-3\">\n<h2>v1.2.3<a href=\"#release-v1-2-3\" title=\"Link to this heading\">¶</a></h2>\n<p><em>Released on 2025-12-18 - <a href=\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.2.3\">GitHub</a> - <a href=\"https://pypi.org/project/gymnasium/v1.2.3/\">PyPI</a></em></p>\n<h2>Gymnasium v1.2.3</h2>\n<p>This is a minor release; the most significant change is switching the dependency for \"gymnasium[box2d]\" from <a href=\"https://pypi.org/project/box2d-py\" rel=\"nofollow\">box2d-py</a> to <a href=\"https://pypi.org/project/Box2D/\" rel=\"nofollow\">box2d</a>. See <a data-error-text=\"Failed to load title\" data-id=\"3580605032\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1474\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1474/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1474\">#1474</a> for more detail.</p>\n<p>Additionally, we've fixed several typos and added a couple of third-party projects. 
Lastly, we fixed the create-an-environment tutorial to use NumPy [row, col] order in <a data-error-text=\"Failed to load title\" data-id=\"3704120436\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1490\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1490/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1490\">#1490</a>.</p>\n<p><strong>Full Changelog</strong>: <a href=\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.2.2...v1.2.3\"><tt>v1.2.2...v1.2.3</tt></a></p></section>\n<section id=\"release-v1-2-2\">\n<h2>v1.2.2<a href=\"#release-v1-2-2\" title=\"Link to this heading\">¶</a></h2>\n<p><em>Released on 2025-11-04 - <a href=\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.2.2\">GitHub</a> - <a href=\"https://pypi.org/project/gymnasium/v1.2.2/\">PyPI</a></em></p>\n<h2>Gymnasium v1.2.2</h2>\n<p>This is a minor update that adds <code>Discrete.dtype</code> and improves <code>DictInfoToList</code>.</p>\n<p>The following changes were made:</p>\n<ul>\n<li>Add <code>Discrete.dtype</code> parameter by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/VadimBim/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/VadimBim\">@VadimBim</a> in <a data-error-text=\"Failed to load title\" data-id=\"3529955390\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1467\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1467/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1467\">#1467</a></li>\n<li>Improve <code>MultiDiscrete.dtype</code> by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" 
href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> in <a data-error-text=\"Failed to load title\" data-id=\"3542266766\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1469\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1469/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1469\">#1469</a></li>\n<li>Improve <code>DictInfoToList</code> to support vector info with missing binary key by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> in <a data-error-text=\"Failed to load title\" data-id=\"3526222616\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1466\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1466/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1466\">#1466</a></li>\n</ul>\n<p>Thanks to Wispr for their support of the project</p>\n<p><strong>Full Changelog</strong>: <a href=\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.2.1...v1.2.2\"><tt>v1.2.1...v1.2.2</tt></a></p></section>\n<section id=\"release-v1-2-1\">\n<h2>v1.2.1<a href=\"#release-v1-2-1\" title=\"Link to this heading\">¶</a></h2>\n<p><em>Released on 2025-09-23 - <a href=\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.2.1\">GitHub</a> - <a href=\"https://pypi.org/project/gymnasium/v1.2.1/\">PyPI</a></em></p>\n<h2>Gymnasium v1.2.1</h2>\n<p>Minor update that adds new wrappers, optimizes several environment functions and fixes several bugs.</p>\n<h2>New Features</h2>\n<ul>\n<li>Added new wrappers to discretize observations and actions (<code>gymnasium.wrappers.DiscretizeObservation</code> and 
<code>gymnasium.wrappers.DiscretizeAction</code>) by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/sparisi/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/sparisi\">@sparisi</a> (<a data-error-text=\"Failed to load title\" data-id=\"3185358242\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1411\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1411/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1411\">#1411</a>)</li>\n<li>Add vector-based <code>RecordVideo</code> wrapper that allows recording all sub-environments at the same time by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/sparisi/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/sparisi\">@sparisi</a> (<a data-error-text=\"Failed to load title\" data-id=\"3241889899\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1418\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1418/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1418\">#1418</a>)</li>\n<li>Add new customization to Frozen Lake for reward schedule and success rate by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/Tayzn/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/Tayzn\">@Tayzn</a> (<a data-error-text=\"Failed to load title\" data-id=\"3021051457\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1363\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1363/hovercard\" 
href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1363\">#1363</a>)</li>\n<li>Add tutorial on action masking by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/dantp-ai/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/dantp-ai\">@dantp-ai</a> (<a data-error-text=\"Failed to load title\" data-id=\"2958306932\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1345\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1345/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1345\">#1345</a>)</li>\n<li>Optimize several environment functions for Blackjack and Mujoco-Humanoid by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/aseembits93/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/aseembits93\">@aseembits93</a> (<a data-error-text=\"Failed to load title\" data-id=\"3298033175\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1426\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1426/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1426\">#1426</a>, <a data-error-text=\"Failed to load title\" data-id=\"3320254150\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1429\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1429/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1429\">#1429</a>, <a data-error-text=\"Failed to load title\" data-id=\"3320248227\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1428\" data-hovercard-type=\"pull_request\" 
data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1428/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1428\">#1428</a> and <a data-error-text=\"Failed to load title\" data-id=\"3356742105\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1436\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1436/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1436\">#1436</a>)</li>\n</ul>\n<h2>Fixes</h2>\n<ul>\n<li>Fixes memory leak in recording by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pascal-roth/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pascal-roth\">@pascal-roth</a> in <a data-error-text=\"Failed to load title\" data-id=\"3394001149\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1444\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1444/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1444\">#1444</a></li>\n<li>Fix \"pygame is not responding\" when rendering the pygame window in the Taxi environment by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/RizkySugiharto/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/RizkySugiharto\">@RizkySugiharto</a> (<a data-error-text=\"Failed to load title\" data-id=\"3182695017\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1410\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1410/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1410\">#1410</a>)</li>\n<li>Fix the vector wrapper warnings by <a data-hovercard-type=\"user\" 
data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> (<a data-error-text=\"Failed to load title\" data-id=\"3205200502\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1415\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1415/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1415\">#1415</a>)</li>\n<li>Fix <code>AsyncVectorEnv.step</code> from hanging forever by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/matinmoezzi/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/matinmoezzi\">@matinmoezzi</a> (<a data-error-text=\"Failed to load title\" data-id=\"3243762778\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1419\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1419/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1419\">#1419</a>)</li>\n<li>Clear markers after rendering for MuJoCo OffscreenViewers by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/amacati/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/amacati\">@amacati</a> (<a data-error-text=\"Failed to load title\" data-id=\"3248714510\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1420\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1420/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1420\">#1420</a>)</li>\n<li>Fix array-conversion for torch cuda tensors by <a data-hovercard-type=\"user\" 
data-hovercard-url=\"/users/amacati/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/amacati\">@amacati</a> (<a data-error-text=\"Failed to load title\" data-id=\"3363046088\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1439\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1439/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1439\">#1439</a>)</li>\n<li>Add missing dependency on packaging to array-api feature by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/amacati/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/amacati\">@amacati</a> (<a data-error-text=\"Failed to load title\" data-id=\"3361105679\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1437\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1437/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1437\">#1437</a>)</li>\n</ul>\n<p>Thank you to all the contributors</p>\n<p><strong>Full Changelog</strong>: <a href=\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.2.0...v1.2.1\"><tt>v1.2.0...v1.2.1</tt></a></p></section>\n<section id=\"release-v1-2-0\">\n<h2>v1.2.0<a href=\"#release-v1-2-0\" title=\"Link to this heading\">¶</a></h2>\n<p><em>Released on 2025-06-27 - <a href=\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.2.0\">GitHub</a> - <a href=\"https://pypi.org/project/gymnasium/v1.2.0/\">PyPI</a></em></p>\n<h2>Gymnasium v1.2.0</h2>\n<p>In Gym <a href=\"https://github.com/openai/gym/releases/tag/0.24.0\">v0.24</a>, v4 MuJoCo environments were added that used a different simulator (mujoco, not mujoco-py).<br>\nHaving been 3 years since v0.24 and with mujoco-py being 
unmaintained, it is limiting our ability to support Python 3.13.<br>\nTherefore, in this release, we have moved the MuJoCo v2 and v3 to the <a href=\"https://github.com/Farama-Foundation/gymnasium-robotics\">Gymnasium-Robotics</a> project, meaning the users who cannot upgrade to the v4 or v5 MuJoCo environments should update their code to:</p>\n<div data-snippet-clipboard-copy-content=\"import gymnasium as gym\nimport gymnasium_robotics  # `pip install &quot;gymnasium-robotics[mujoco-py]&quot;`\n\ngym.register_envs(gymnasium_robotics)  # optional\n\nenv = gym.make(&quot;Humanoid-v3&quot;)\"><pre><span>import</span> <span>gymnasium</span> <span>as</span> <span>gym</span>\n<span>import</span> <span>gymnasium_robotics</span>  <span># `pip install \"gymnasium-robotics[mujoco-py]\"`</span>\n\n<span>gym</span>.<span>register_envs</span>(<span>gymnasium_robotics</span>)  <span># optional</span>\n\n<span>env</span> <span>=</span> <span>gym</span>.<span>make</span>(<span>\"Humanoid-v3\"</span>)</pre></div>\n<p>In addition, we have added support for Python 3.13 (and dropped Python 3.8 and 3.9 following NumPy and other projects).</p>\n<h2>New Features</h2>\n<ul>\n<li>Add <code>AddWhiteNoise</code> and <code>ObstructView</code> wrappers that add noise to RGB renderings either across the whole image or sections by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/sparisi/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/sparisi\">@sparisi</a> (<a data-error-text=\"Failed to load title\" data-id=\"2645643074\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1243\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1243/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1243\">#1243</a>)</li>\n<li>Adds rainy and fickle options for Taxi environment that were implemented in the original 
paper by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/foreverska/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/foreverska\">@foreverska</a> (<a data-error-text=\"Failed to load title\" data-id=\"2867689892\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1315\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1315/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1315\">#1315</a>)</li>\n<li>Add <code>wrappers.ArrayConversion</code>, a generic conversion wrapper between Array API compatible frameworks (like NumPy, Torch, Jax, etc) by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/amacati/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/amacati\">@amacati</a> (<a data-error-text=\"Failed to load title\" data-id=\"2930111508\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1333\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1333/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1333\">#1333</a>)</li>\n<li>Add garbage collector on RecordVideo wrapper to prevent out of memory problems by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/vicbentu/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/vicbentu\">@vicbentu</a> (<a data-error-text=\"Failed to load title\" data-id=\"3054955909\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1378\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1378/hovercard\" 
href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1378\">#1378</a>)</li>\n</ul>\n<h2>Bug Fixes</h2>\n<ul>\n<li>Fix MuJoCo add_markers for mujoco&gt;=3.2 by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/amacati/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/amacati\">@amacati</a> (<a data-error-text=\"Failed to load title\" data-id=\"2900883178\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1329\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1329/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1329\">#1329</a>)</li>\n<li>Update CliffWalking version in envs registry by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/abouelsaadat/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/abouelsaadat\">@abouelsaadat</a> (<a data-error-text=\"Failed to load title\" data-id=\"2936052417\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1334\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1334/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1334\">#1334</a>)</li>\n<li>Fix jax 0.6.0 regression in jax_to_numpy by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/GaetanLepage/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/GaetanLepage\">@GaetanLepage</a> (<a data-error-text=\"Failed to load title\" data-id=\"3004879973\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1361\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1361/hovercard\" 
href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1361\">#1361</a>)</li>\n<li>Fix Atari Preprocessing to only check env has NOOP action when <code>noop_max&gt;0</code> by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> (<a data-error-text=\"Failed to load title\" data-id=\"3098488051\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1393\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1393/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1393\">#1393</a>)</li>\n<li>Fix all warnings in tests/spaces by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/amacati/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/amacati\">@amacati</a> (<a data-error-text=\"Failed to load title\" data-id=\"3126115146\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1396\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1396/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1396\">#1396</a>)</li>\n<li>Don't override SDL's audio driver on Linux by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/AnErrupTion/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/AnErrupTion\">@AnErrupTion</a> (<a data-error-text=\"Failed to load title\" data-id=\"3079104537\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1387\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1387/hovercard\" 
href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1387\">#1387</a>)</li>\n</ul>\n<p>In addition, this release includes numerous updates to the documentation, most importantly to the introductory pages, with an aim to make them easier for new users of Gymnasium or RL to understand.</p>\n<p><strong>Full Changelog</strong>: <a href=\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.1.1...v1.2.0\"><tt>v1.1.1...v1.2.0</tt></a></p></section>\n<section id=\"release-v1-1-1\">\n<h2>Gymnasium v1.1.1<a href=\"#release-v1-1-1\" title=\"Link to this heading\">¶</a></h2>\n<p><em>Released on 2025-03-06 - <a href=\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.1.1\">GitHub</a> - <a href=\"https://pypi.org/project/gymnasium/v1.1.1/\">PyPI</a></em></p>\n<h2>Changes</h2>\n<ul>\n<li>Remove assert on metadata render modes for MuJoCo-based environments in <code>mujoco_env.py</code> (<a data-error-text=\"Failed to load title\" data-id=\"2900123613\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1328\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1328/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1328\">#1328</a>)</li>\n<li>Update <code>wrappers.vector.NumpyToTorch</code> to refer to <code>numpy</code> instead of <code>jax</code>  by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pkuderov/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pkuderov\">@pkuderov</a> in <a data-error-text=\"Failed to load title\" data-id=\"2883844632\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1319\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1319/hovercard\" 
href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1319\">#1319</a></li>\n</ul>\n<p><strong>Full Changelog</strong>: <a href=\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.1.0...v1.1.1\"><tt>v1.1.0...v1.1.1</tt></a></p></section>\n<section id=\"release-v1-1-0\">\n<h2>v1.1.0<a href=\"#release-v1-1-0\" title=\"Link to this heading\">¶</a></h2>\n<p><em>Released on 2025-02-26 - <a href=\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.1.0\">GitHub</a> - <a href=\"https://pypi.org/project/gymnasium/v1.1.0/\">PyPI</a></em></p>\n<h2>Gymnasium v1.1</h2>\n<p>In this release, we fix several bugs with Gymnasium v1.0 along with new features to improve the changes made.</p>\n<p>With the release of Gymnasium v1.0, one of the major changes we made was to the vector environment implementation, improving how users interface with it and extend it. Some users noted that they required backward compatibility with how vector environments automatically reset sub-environments when they terminated or truncated, referred to as the autoreset mode or API. As a result, in v1.1, we have added support to the implemented vector environments (<code>SyncVectorEnv</code> and <code>AsyncVectorEnv</code>) and wrappers for all three possible modes: next-step, same-step and disabled. To know the type of autoreset mode used, it should be specified in the vector environment metadata, <code>metadata[\"autoreset_mode\"]</code> as a <code>gymnasium.vectors.AutoresetMode</code> enum. For more information on the differences between the autoreset modes and how to use them, read <a href=\"https://farama.org/Vector-Autoreset-Mode\" rel=\"nofollow\">https://farama.org/Vector-Autoreset-Mode</a>.</p>\n<p>In addition, we've added several helpful features</p>\n<ul>\n<li>In Gymnasium v0.25, we added masking for spaces, <code>space.sample(mask=...)</code>, to logically mask out possible samples from spaces. 
<a data-error-text=\"Failed to load title\" data-id=\"2862826251\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1310\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1310/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1310\">#1310</a> adds probability masking for each space with <code>space.sample(probability=...)</code> to specify the probability of each sample, which is helpful for RL policies that output a probability distribution of actions.</li>\n<li>For MuJoCo environments, users have the option to render the robot with RGB images or depth-based images. Previously, it was only possible to access only the RGB or depth rendering. Gymnasium v1.1 adds RGBD rendering that outputs both the RGB and depth-based image as a single output.</li>\n<li>In Gymnasium v1.0, we improve the conversion wrappers between <code>jax</code>, <code>torch</code> and <code>numpy</code> with <code>JaxToTorch</code>, <code>JaxToNumpy</code>, <code>NumpyToTorch</code>, etc. 
In v1.1, we've improved the wrappers to work with rendering and to be compatible with the full dlpack API.</li>\n</ul>\n<h2>New Features</h2>\n<ul>\n<li>Space sampling with probability masks by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/mariojerez/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/mariojerez\">@mariojerez</a> (<a data-error-text=\"Failed to load title\" data-id=\"2862826251\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1310\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1310/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1310\">#1310</a>)</li>\n<li>Add rgb + depth rendering to mujoco by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/DavidPL1/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/DavidPL1\">@DavidPL1</a> (<a data-error-text=\"Failed to load title\" data-id=\"2609027305\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1229\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1229/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1229\">#1229</a>)</li>\n<li>Add Render Support for <code>JaxToNumpy</code>, <code>JaxToTorch</code> and <code>NumpyToTorch</code> by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/amacati/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/amacati\">@amacati</a> (<a data-error-text=\"Failed to load title\" data-id=\"2848895679\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1306\" data-hovercard-type=\"pull_request\" 
data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1306/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1306\">#1306</a>)</li>\n<li>Add full support for dlpack API by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/amacati/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/amacati\">@amacati</a> (<a data-error-text=\"Failed to load title\" data-id=\"2814347019\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1299\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1299/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1299\">#1299</a>)</li>\n<li>Enable the StickyAction wrapper to repeat the old action for more than 1 step by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/sparisi/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/sparisi\">@sparisi</a> (<a data-error-text=\"Failed to load title\" data-id=\"2640457710\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1240\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1240/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1240\">#1240</a>)</li>\n<li>Add <code>wrappers.vector.TransformObs/Action</code> support for a single obs/action space argument by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/howardh/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/howardh\">@howardh</a> (<a data-error-text=\"Failed to load title\" data-id=\"2774346221\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1288\" data-hovercard-type=\"pull_request\" 
data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1288/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1288\">#1288</a>)</li>\n<li>Improve 'set_wrapper_attr' for wrappers to modify how variables are updated by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/duburcqa/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/duburcqa\">@duburcqa</a> (<a data-error-text=\"Failed to load title\" data-id=\"2796910732\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1294\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1294/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1294\">#1294</a>)</li>\n<li>Allow <code>AtariPreprocessing</code> to support non-square observations by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/li-plus/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/li-plus\">@li-plus</a> (<a data-error-text=\"Failed to load title\" data-id=\"2865365152\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1312\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1312/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1312\">#1312</a>)</li>\n<li>Improve <code>Wrapper</code> and <code>VectorWrapper</code> error checking by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> (<a data-error-text=\"Failed to load title\" data-id=\"2682633843\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1260\" 
data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1260/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1260\">#1260</a>)</li>\n</ul>\n<h2>Bug Fixes</h2>\n<ul>\n<li>Fix <code>get_wrapper_attr</code> / <code>set_wrapper_attr</code> for edge case by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/duburcqa/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/duburcqa\">@duburcqa</a> (<a data-error-text=\"Failed to load title\" data-id=\"2796262300\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1293\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1293/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1293\">#1293</a>)</li>\n<li>Add width and height check for MujocoRenderer to help error checking by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/Beanpow/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/Beanpow\">@Beanpow</a> (<a data-error-text=\"Failed to load title\" data-id=\"2609101033\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1230\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1230/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1230\">#1230</a>)</li>\n<li>Fix CarRacing-v3 with discrete actions by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/Sebastian-Griesbach/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/Sebastian-Griesbach\">@Sebastian-Griesbach</a> (<a data-error-text=\"Failed to load title\" data-id=\"2661510536\" data-permission-text=\"Title is private\" 
data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1253\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1253/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1253\">#1253</a>)</li>\n<li>Fix Mujoco Rendering for <code>rgb_array</code> that is upside down by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> (<a data-error-text=\"Failed to load title\" data-id=\"2701582796\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1264\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1264/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1264\">#1264</a>)</li>\n<li>Revert <code>OrderedDict</code> key ordering in <code>Dict</code> space by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> (<a data-error-text=\"Failed to load title\" data-id=\"2776425048\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1291\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1291/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1291\">#1291</a>)</li>\n<li><code>wrappers.vector.NumpyToTorch</code> doesn't require <code>jax</code> to work by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> (<a 
data-error-text=\"Failed to load title\" data-id=\"2852244816\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1308\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1308/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1308\">#1308</a>)</li>\n<li>Prevented accidental alteration of <code>env_spec_kwargs</code> in <code>make_vec</code> by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/TimSchneider42/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/TimSchneider42\">@TimSchneider42</a> (<a data-error-text=\"Failed to load title\" data-id=\"2838283083\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1304\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1304/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1304\">#1304</a>)</li>\n<li>Quit Pygame when cliffwalking environment is closed by @sjmikler (<a data-error-text=\"Failed to load title\" data-id=\"2807765801\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1297\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1297/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1297\">#1297</a>)</li>\n<li>Add <code>TimeAwareObservation</code> support for environments without a <code>spec</code> by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/pseudo-rnd-thoughts/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/pseudo-rnd-thoughts\">@pseudo-rnd-thoughts</a> (<a data-error-text=\"Failed to load title\" data-id=\"2775663588\" data-permission-text=\"Title is private\" 
data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1289\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1289/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1289\">#1289</a>)</li>\n<li>Improve the error message when ending the MuJoCo renderer without calling <code>env.close</code> by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/a-ayesh/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/a-ayesh\">@a-ayesh</a> (<a data-error-text=\"Failed to load title\" data-id=\"2761798816\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1283\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1283/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1283\">#1283</a>)</li>\n</ul>\n<h2>Documentation Changes</h2>\n<ul>\n<li>Fixed typo in blackjack for the values sampled, 2-10 instead of 2-9 by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/artemmiyy/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/artemmiyy\">@artemmiyy</a> (<a data-error-text=\"Failed to load title\" data-id=\"2656508334\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1250\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1250/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1250\">#1250</a>)</li>\n<li>Added some popular third-party beginner tutorials by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/godsboy404/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/godsboy404\">@godsboy404</a> (<a data-error-text=\"Failed to load title\" 
data-id=\"2737913912\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1271\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1271/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1271\">#1271</a>)</li>\n<li>Update documentation for reward scaling wrappers by <a data-hovercard-type=\"user\" data-hovercard-url=\"/users/keraJLi/hovercard\" data-octo-click=\"hovercard-link-click\" data-octo-dimensions=\"link_type:self\" href=\"https://github.com/keraJLi\">@keraJLi</a> (<a data-error-text=\"Failed to load title\" data-id=\"2765903176\" data-permission-text=\"Title is private\" data-url=\"https://github.com/Farama-Foundation/Gymnasium/issues/1285\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/Farama-Foundation/Gymnasium/pull/1285/hovercard\" href=\"https://github.com/Farama-Foundation/Gymnasium/pull/1285\">#1285</a>)</li>\n</ul>\n<p>Thanks to the 31 new contributors that contributed to this release with the <strong>Full Changelog</strong>: <a href=\"https://github.com/Farama-Foundation/Gymnasium/compare/v1.0.0...v1.1.0\"><tt>v1.0.0...v1.1.0</tt></a></p></section>\n<section id=\"release-v1-0-0\">\n<h2>v1.0.0<a href=\"#release-v1-0-0\" title=\"Link to this heading\">¶</a></h2>\n<p><em>Released on 2024-10-08 - <a href=\"https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.0.0\">GitHub</a> - <a href=\"https://pypi.org/project/gymnasium/v1.0.0/\">PyPI</a></em></p>\n<h2>v1.0.0 release notes</h2>\n<p>Over the last few years, the volunteer team behind Gym and Gymnasium has worked to fix bugs, improve the documentation, add new features, and change the API where appropriate so that the benefits outweigh the co"
  }
}
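
Note on the v1.1.0 entry above: it names three autoreset modes for vector environments (next-step, same-step and disabled) without showing how they differ in practice. Below is a minimal pure-Python sketch of my reading of those modes. It does not import Gymnasium and is not the library's implementation: `CountdownEnv`, `ToyVectorEnv`, `rollout`, the local `AutoresetMode` enum, and the use of a `final_obs` info key are all illustrative assumptions.

```python
from enum import Enum


class AutoresetMode(Enum):
    """Toy stand-in for the three modes named in the release notes."""
    NEXT_STEP = "next-step"
    SAME_STEP = "same-step"
    DISABLED = "disabled"


class CountdownEnv:
    """Hypothetical sub-environment: obs counts up, episode ends at `length`."""

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self):
        self.t += 1
        return self.t, 1.0, self.t >= self.length  # obs, reward, done


class ToyVectorEnv:
    """Steps several CountdownEnv instances and applies one autoreset mode."""

    def __init__(self, lengths, mode):
        self.envs = [CountdownEnv(n) for n in lengths]
        self.mode = mode
        self.needs_reset = [False] * len(self.envs)

    def reset(self):
        self.needs_reset = [False] * len(self.envs)
        return [env.reset() for env in self.envs]

    def step(self):
        obs, rewards, dones, infos = [], [], [], []
        for i, env in enumerate(self.envs):
            info = {}
            if self.needs_reset[i]:
                if self.mode is AutoresetMode.DISABLED:
                    raise RuntimeError(f"sub-env {i} is done; reset it manually")
                # next-step: this whole call is spent on the reset
                o, r, d = env.reset(), 0.0, False
                self.needs_reset[i] = False
            else:
                o, r, d = env.step()
                if d and self.mode is AutoresetMode.SAME_STEP:
                    # same-step: keep the last obs in info, return the reset obs
                    info["final_obs"] = o
                    o = env.reset()
                elif d:
                    self.needs_reset[i] = True
            obs.append(o)
            rewards.append(r)
            dones.append(d)
            infos.append(info)
        return obs, rewards, dones, infos


def rollout(mode, n_steps=4):
    """Trace (obs, done, final_obs) of a single length-2 sub-environment."""
    venv = ToyVectorEnv([2], mode)
    venv.reset()
    trace = []
    for _ in range(n_steps):
        obs, _, dones, infos = venv.step()
        trace.append((obs[0], dones[0], infos[0].get("final_obs")))
    return trace


if __name__ == "__main__":
    for mode in (AutoresetMode.NEXT_STEP, AutoresetMode.SAME_STEP):
        print(mode.value, rollout(mode))
```

In this toy, next-step mode returns the terminal observation with `done=True` and spends the following `step()` call on the reset, while same-step mode returns the fresh reset observation immediately and keeps the terminal one under `final_obs`. Disabled mode raises on the step after termination, standing in for "the caller must reset". In the real library the mode is read from the vector environment's metadata, as the release notes describe.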

▼thought

Now I have a comprehensive picture of Gymnasium. Let me also get the Farama Foundation stats page and the third-party environments page for completeness.

▼✓Fetch web page content, URL: "https://farama.org/stats/installations"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://farama.org/stats/installations\",\n  \"title\": \"Installations\",\n  \"summary\": \"Maintaining The World’s Open Source Reinforcement Learning Tools\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><div id=\\\"farama-stats-container\\\"><div><p><span>427,777,793</span></p></div><div id=\\\"collapse_downloads\\\"><div><p><span></span><span>403,904,777</span></p><div id=\\\"collapse_downloads_Gymnasium\\\"><p><span>Gymnasium</span><span>70,396,473</span></p><p><span>Deprecated Libraries Subsumed:</span></p><p><span>gym</span><span>333,508,304</span></p></div></div><div><p><span>PettingZoo</span><span>5,249,005</span></p></div><div><p><span>Minari</span><span>374,042</span></p></div><div><p><span>Gymnasium-Robotics</span><span>543,608</span></p></div><div><p><span>Jumpy</span><span>38,110</span></p></div><div><p><span></span><span>201,991</span></p><div id=\\\"collapse_downloads_MAgent2\\\"><p><span>MAgent2</span><span>43,479</span></p><p><span>Deprecated Libraries Subsumed:</span></p><p><span>magent</span><span>158,512</span></p></div></div><div><p><span>Metaworld</span><span>120,947</span></p></div><div><p><span></span><span>2,426,372</span></p><div id=\\\"collapse_downloads_MiniGrid\\\"><p><span>Minigrid</span><span>1,072,585</span></p><p><span>Deprecated Libraries Subsumed:</span></p><p><span>gym-minigrid</span><span>1,353,787</span></p><p><span>babyai</span><span>0</span></p></div></div><div><p><span>MiniWoB++</span><span>13,724</span></p></div><div><p><span>MOMAland</span><span>7,082</span></p></div><div><p><span></span><span>269,147</span></p><div id=\\\"collapse_downloads_MO-Gymnasium\\\"><p><span>MO-Gymnasium</span><span>259,496</span></p><p><span>Deprecated Libraries Subsumed:</span></p><p><span>mo-gym</span><span>9,651</span></p></div></div><div><p><span>Shimmy</span><span>4,785,290</span></p></div><div><p><span>Arcade Learning 
Environment</span><span>9,192,090</span></p></div><div><p><span>HighwayEnv</span><span>231,619</span></p></div><div><p><span>MPE2</span><span>13,092</span></p></div><div><p><span>Procgen2</span><span>1,670</span></p></div><div><p><span>Stable-Retro</span><span>58,217</span></p></div><div><p><span>ViZDoom</span><span>347,010</span></p></div></div></div></div>\"\n}"
    }
  ]
}

▼✓Web search, result count: 10, query: "Gymnasium wrappers API spaces Env observation action reward step reset"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"results\": [\n    {\n      \"title\": \"Env - Gymnasium Documentation\",\n      \"url\": \"https://gymnasium.farama.org/api/env/\",\n      \"description\": \"* `step()` - Updates an environment with actions returning the next agent observation, the reward for taking that actions, if the environment has terminated or truncated due to the latest action and information from the environment about the step, i.e. metrics, debug info. :   * **observation** (*ObsType*) – An element of the environment’s `observation_space` as the next observation due to the agent actions. :   Resets the environment to an initial internal state, returning an initial observation and info. However, if the environment already has a PRNG and `seed=None` is passed, the PRNG will *not* be reset and the env’s `np_random_seed` will *not* be altered. The environment’s `metadata` render modes (env.metadata[“render\\\\_modes”]) should contain the possible ways to implement the render modes. Changed in version 0.25.0: The render function was changed to no longer accept parameters, rather these parameters should be specified in the environment initialised, i.e., `gymnasium.make(\\\"CartPole-v1\\\",render_mode=\\\"human\\\")`. When implementing an environment, the `Env.reset()` and `Env.step()` functions must be created to describe the dynamics of the environment.\"\n    },\n    {\n      \"title\": \"Core - Gym Documentation\",\n      \"url\": \"https://www.gymlibrary.dev/api/core/\",\n      \"description\": \"## gym.Env¶. gym.Env.step(*self*, *action: ActType*) → Tuple[ObsType, float, bool, bool, dict]¶. When end of episode is reached, you are responsible for calling `reset()` to reset this environment’s state. Accepts an action and returns either a tuple (observation, reward, terminated, truncated, info). :   * **observation** (*object*) – this will be an element of the environment’s `observation_space`. 
* **reward** (*float*) – The amount of reward returned as a result of taking the action. If the environment already has a random number generator and `reset()` is called with `seed=None`, the RNG should not be reset. However, if the environment already has a PRNG and `seed=None` is passed, the PRNG will *not* be reset. This will be an element of `observation_space` (typically a numpy array) and is analogous to the observation returned by `step()`. It is of datatype `Space` provided by Gym. For example, if the observation space is of type `Box` and the shape of the object is `(4,)`, this denotes a valid observation will be an array of 4 numbers.\"\n    },\n    {\n      \"title\": \"Wrappers - Gym Documentation\",\n      \"url\": \"https://www.gymlibrary.dev/api/wrappers/\",\n      \"description\": \"When using the AutoResetWrapper to collect rollouts, note that the when `self.env.step()` returns `done`, a new observation from after calling `self.env.reset()` is returned by `self.step()` alongside the terminal reward and done state from the previous episode . | `AutoResetWrapper` | `gym.Wrapper` | `env` | The wrapped environment will automatically reset when the done state is reached. | `FilterObservation` | `gym.ObservationWrapper` | `env`, `filter_keys=None` | If you have an environment that returns dictionaries as observations, but you would like to only keep a subset of the entries, you can use this wrapper. | `FlattenObservation` | `gym.ObservationWrapper` | `env` | Observation wrapper that flattens the observation |. | `ResizeObservation` | `gym.ObservationWrapper` | `env`, `shape` | This wrapper works on environments with image observations (or more generally observations of shape AxBxC) and resizes the observation to the shape given by the tuple `shape`. 
|  | `gym.ObservationWrapper` | `env`, `f` | This wrapper will apply `f` to observations |.\"\n    },\n    {\n      \"title\": \"Gymnasium/gymnasium/core.py at main - GitHub\",\n      \"url\": \"https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/core.py\",\n      \"description\": \"\\\"\\\"\\\"Wraps a :class:`gymnasium.Env` to allow a modular transformation of the :meth:`step` and :meth:`reset` methods.\"\n    },\n    {\n      \"title\": \"Wrappers - Gymnasium Documentation - Farama Foundation\",\n      \"url\": \"https://gymnasium.farama.org/v0.27.0/api/experimental/wrappers/\",\n      \"description\": \"A passive environment checker wrapper that surrounds the step, reset and render functions to check they follow the gymnasium API. Initialises the wrapper\"\n    },\n    {\n      \"title\": \"Gym Wrappers | alexandervandekleut.github.io\",\n      \"url\": \"https://alexandervandekleut.github.io/gym-wrappers/\",\n      \"description\": \"The `gym.Wrapper` class inherits from the `gym.Env` class, which defines environments according to the OpenAI API for reinforcement learning. Implementing the `gym.Wrapper` class requires defining an `__init__` method that accepts the environment to be extended as a parameter. Wrapper): def __init__(self, env): super(). env = env def step(self, action): next_state, reward, done, info = self. We can modify specific aspects of the environment by using subclasses of `gym.Wrapper` that override how the environment processes observations, rewards, and action. 1. `gym.ObservationWrapper`: Used to modify the observations returned by the environment. To do this, override the `observation` method of the environment. This method accepts a single parameter (the observation to be modified) and returns the modified observation. 2. `gym.RewardWrapper`: Used to modify the rewards returned by the environment. To do this, override the `reward` method of the environment. 
This method accepts a single parameter (the reward to be modified) and returns the modified reward. To do this, override the `action` method of the environment.\"\n    },\n    {\n      \"title\": \"Mastering Gymnasium: RL Environments That Click | by Thinking Loop\",\n      \"url\": \"https://medium.com/@ThinkingLoop/mastering-gymnasium-rl-environments-that-click-02590ac1c533\",\n      \"description\": \"# Mastering Gymnasium: RL Environments That Click | by Thinking Loop | Medium. # Mastering Gymnasium: RL Environments That Click. ## A practical, code-first guide to building, wrapping, vectorizing, and debugging reinforcement learning environments with Gymnasium — without the usual confusion. _Master OpenAI Gymnasium for reinforcement learning: reset/step API, wrappers, vector envs, seeding, recording, and custom environments with practical code._. It’s usually not the algorithm. It’s the environment loop. Gymnasium is where that loop becomes predictable. And once the loop is predictable, RL becomes… almost fun. If you’ve ever fought mysterious `done` logic, broken wrappers, or training runs that “worked yesterday,” this guide is for you. Let’s make Gymnasium feel like a tool you control—not a roulette wheel. ## What Gymnasium actually gives you (and why it matters). Gymnasium is the modern, maintained standard interface for reinforcement learning environments, designed as a successor-style drop-in for the older Gym ecosystem. That matters because the environment API is not a small detail.\"\n    },\n    {\n      \"title\": \"Observation Wrappers - Gymnasium Documentation\",\n      \"url\": \"https://gymnasium.farama.org/v0.28.0/api/wrappers/observation_wrappers/\",\n      \"description\": \"# Observation Wrappers#. The transformation defined in that method must be reflected by the `env` observation space. Otherwise, you need to specify the new observation space of the wrapper by setting `self.observation_space` in the `__init__()` method of your wrapper. 
The function `f` should be defined on the observation space of the base environment, `env`, and should, ideally, return values in the same space. If the transformation you wish to apply to observations returns values in a *different* space, you should subclass `ObservationWrapper`, implement the transformation, and set the new observation space accordingly. make(\\\"CartPole-v1\\\")>>> env = TransformObservation(env, lambda obs: obs +0.1* np. >>> import gymnasium as gym>>> fromgymnasium.wrappers import TransformObservation>>> env = gym. make(\\\"CartPole-v1\\\")>>> env = TransformObservation(env, lambda obs:{'obs': obs, 'time': 0})>>> env. * **ValueError** – If any of the filter\\\\_keys are not included in the original env’s observation space. *class* gymnasium.wrappers.FlattenObservation(*env: Env*)#. >>> import gymnasium as gym>>> fromgymnasium.wrappers import FlattenObservation>>> env = gym.\"\n    },\n    {\n      \"title\": \"OpenAI Gym Step Function : r/reinforcementlearning - Reddit\",\n      \"url\": \"https://www.reddit.com/r/reinforcementlearning/comments/1dezvei/openai_gym_step_function/\",\n      \"description\": \"The environment gets reset, then the step function is called every timestep until it returns a terminal indicator. So you need to handle all of\"\n    },\n    {\n      \"title\": \"Understanding action & observation spaces in gym for custom ...\",\n      \"url\": \"https://stackoverflow.com/questions/76289764/understanding-action-observation-spaces-in-gym-for-custom-environments-and-age\",\n      \"description\": \"The oddity is in the use of gym's observation spaces. In many examples, the custom environment includes initializing a gym observation space.\"\n    }\n  ]\n}"
    }
  ],
  "details": {
    "results": [
      {
        "title": "Env - Gymnasium Documentation",
        "url": "https://gymnasium.farama.org/api/env/",
        "description": "* `step()` - Updates an environment with actions returning the next agent observation, the reward for taking that actions, if the environment has terminated or truncated due to the latest action and information from the environment about the step, i.e. metrics, debug info. :   * **observation** (*ObsType*) – An element of the environment’s `observation_space` as the next observation due to the agent actions. :   Resets the environment to an initial internal state, returning an initial observation and info. However, if the environment already has a PRNG and `seed=None` is passed, the PRNG will *not* be reset and the env’s `np_random_seed` will *not* be altered. The environment’s `metadata` render modes (env.metadata[“render\\_modes”]) should contain the possible ways to implement the render modes. Changed in version 0.25.0: The render function was changed to no longer accept parameters, rather these parameters should be specified in the environment initialised, i.e., `gymnasium.make(\"CartPole-v1\",render_mode=\"human\")`. When implementing an environment, the `Env.reset()` and `Env.step()` functions must be created to describe the dynamics of the environment."
      },
      {
        "title": "Core - Gym Documentation",
        "url": "https://www.gymlibrary.dev/api/core/",
        "description": "## gym.Env¶. gym.Env.step(*self*, *action: ActType*) → Tuple[ObsType, float, bool, bool, dict]¶. When end of episode is reached, you are responsible for calling `reset()` to reset this environment’s state. Accepts an action and returns either a tuple (observation, reward, terminated, truncated, info). :   * **observation** (*object*) – this will be an element of the environment’s `observation_space`. * **reward** (*float*) – The amount of reward returned as a result of taking the action. If the environment already has a random number generator and `reset()` is called with `seed=None`, the RNG should not be reset. However, if the environment already has a PRNG and `seed=None` is passed, the PRNG will *not* be reset. This will be an element of `observation_space` (typically a numpy array) and is analogous to the observation returned by `step()`. It is of datatype `Space` provided by Gym. For example, if the observation space is of type `Box` and the shape of the object is `(4,)`, this denotes a valid observation will be an array of 4 numbers."
      },
      {
        "title": "Wrappers - Gym Documentation",
        "url": "https://www.gymlibrary.dev/api/wrappers/",
        "description": "When using the AutoResetWrapper to collect rollouts, note that the when `self.env.step()` returns `done`, a new observation from after calling `self.env.reset()` is returned by `self.step()` alongside the terminal reward and done state from the previous episode . | `AutoResetWrapper` | `gym.Wrapper` | `env` | The wrapped environment will automatically reset when the done state is reached. | `FilterObservation` | `gym.ObservationWrapper` | `env`, `filter_keys=None` | If you have an environment that returns dictionaries as observations, but you would like to only keep a subset of the entries, you can use this wrapper. | `FlattenObservation` | `gym.ObservationWrapper` | `env` | Observation wrapper that flattens the observation |. | `ResizeObservation` | `gym.ObservationWrapper` | `env`, `shape` | This wrapper works on environments with image observations (or more generally observations of shape AxBxC) and resizes the observation to the shape given by the tuple `shape`. |  | `gym.ObservationWrapper` | `env`, `f` | This wrapper will apply `f` to observations |."
      },
      {
        "title": "Gymnasium/gymnasium/core.py at main - GitHub",
        "url": "https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/core.py",
        "description": "\"\"\"Wraps a :class:`gymnasium.Env` to allow a modular transformation of the :meth:`step` and :meth:`reset` methods."
      },
      {
        "title": "Wrappers - Gymnasium Documentation - Farama Foundation",
        "url": "https://gymnasium.farama.org/v0.27.0/api/experimental/wrappers/",
        "description": "A passive environment checker wrapper that surrounds the step, reset and render functions to check they follow the gymnasium API. Initialises the wrapper"
      },
      {
        "title": "Gym Wrappers | alexandervandekleut.github.io",
        "url": "https://alexandervandekleut.github.io/gym-wrappers/",
        "description": "The `gym.Wrapper` class inherits from the `gym.Env` class, which defines environments according to the OpenAI API for reinforcement learning. Implementing the `gym.Wrapper` class requires defining an `__init__` method that accepts the environment to be extended as a parameter. Wrapper): def __init__(self, env): super(). env = env def step(self, action): next_state, reward, done, info = self. We can modify specific aspects of the environment by using subclasses of `gym.Wrapper` that override how the environment processes observations, rewards, and action. 1. `gym.ObservationWrapper`: Used to modify the observations returned by the environment. To do this, override the `observation` method of the environment. This method accepts a single parameter (the observation to be modified) and returns the modified observation. 2. `gym.RewardWrapper`: Used to modify the rewards returned by the environment. To do this, override the `reward` method of the environment. This method accepts a single parameter (the reward to be modified) and returns the modified reward. To do this, override the `action` method of the environment."
      },
      {
        "title": "Mastering Gymnasium: RL Environments That Click | by Thinking Loop",
        "url": "https://medium.com/@ThinkingLoop/mastering-gymnasium-rl-environments-that-click-02590ac1c533",
        "description": "# Mastering Gymnasium: RL Environments That Click | by Thinking Loop | Medium. # Mastering Gymnasium: RL Environments That Click. ## A practical, code-first guide to building, wrapping, vectorizing, and debugging reinforcement learning environments with Gymnasium — without the usual confusion. _Master OpenAI Gymnasium for reinforcement learning: reset/step API, wrappers, vector envs, seeding, recording, and custom environments with practical code._. It’s usually not the algorithm. It’s the environment loop. Gymnasium is where that loop becomes predictable. And once the loop is predictable, RL becomes… almost fun. If you’ve ever fought mysterious `done` logic, broken wrappers, or training runs that “worked yesterday,” this guide is for you. Let’s make Gymnasium feel like a tool you control—not a roulette wheel. ## What Gymnasium actually gives you (and why it matters). Gymnasium is the modern, maintained standard interface for reinforcement learning environments, designed as a successor-style drop-in for the older Gym ecosystem. That matters because the environment API is not a small detail."
      },
      {
        "title": "Observation Wrappers - Gymnasium Documentation",
        "url": "https://gymnasium.farama.org/v0.28.0/api/wrappers/observation_wrappers/",
        "description": "# Observation Wrappers#. The transformation defined in that method must be reflected by the `env` observation space. Otherwise, you need to specify the new observation space of the wrapper by setting `self.observation_space` in the `__init__()` method of your wrapper. The function `f` should be defined on the observation space of the base environment, `env`, and should, ideally, return values in the same space. If the transformation you wish to apply to observations returns values in a *different* space, you should subclass `ObservationWrapper`, implement the transformation, and set the new observation space accordingly. make(\"CartPole-v1\")>>> env = TransformObservation(env, lambda obs: obs +0.1* np. >>> import gymnasium as gym>>> fromgymnasium.wrappers import TransformObservation>>> env = gym. make(\"CartPole-v1\")>>> env = TransformObservation(env, lambda obs:{'obs': obs, 'time': 0})>>> env. * **ValueError** – If any of the filter\\_keys are not included in the original env’s observation space. *class* gymnasium.wrappers.FlattenObservation(*env: Env*)#. >>> import gymnasium as gym>>> fromgymnasium.wrappers import FlattenObservation>>> env = gym."
      },
      {
        "title": "OpenAI Gym Step Function : r/reinforcementlearning - Reddit",
        "url": "https://www.reddit.com/r/reinforcementlearning/comments/1dezvei/openai_gym_step_function/",
        "description": "The environment gets reset, then the step function is called every timestep until it returns a terminal indicator. So you need to handle all of"
      },
      {
        "title": "Understanding action & observation spaces in gym for custom ...",
        "url": "https://stackoverflow.com/questions/76289764/understanding-action-observation-spaces-in-gym-for-custom-environments-and-age",
        "description": "The oddity is in the use of gym's observation spaces. In many examples, the custom environment includes initializing a gym observation space."
      }
    ]
  }
}

▼thought

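Gymnasium: An In-Depth Research Report

One-Line Positioning

1. Origins and History: From OpenAI Gym to Gymnasium

1.1 The Birth and Stagnation of OpenAI Gym

However, OpenAI gradually stopped maintaining Gym around 2021. Gym's official releases effectively stalled at v0.21 (the old API) and v0.26 (which introduced some breaking changes without completing the transition).

1.2 Farama Foundation Takes Over

Key timeline:

Date	Event
2016	OpenAI releases Gym
2021	OpenAI stops maintenance; Farama Foundation takes over
2022.10	Gymnasium officially released as a standalone project
2024.07	Gymnasium white paper published (arXiv:2407.17032)
2024.10	Gymnasium v1.0.0 released, the first stable API
2025.11	White paper accepted at NeurIPS 2025 Datasets & Benchmarks
2026.04	Latest version, v1.3.0, released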

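2. Farama Foundation: The Nonprofit Behind the Project

The Farama Foundation's mission is to maintain open-source RL tooling worldwide in a neutral, nonprofit manner. It currently maintains more than 15 projects, which together form an infrastructure layer for RL research.

2.1 Core Project Matrix

Project	Role	Cumulative downloads
Gymnasium	Standard API for single-agent RL environments	70M+
PettingZoo	API for multi-agent RL (MARL) environments	5.2M+
Arcade Learning Environment (ALE)	Atari 2600 emulator	9.19M+
Minigrid	Grid-world environments	1.07M+
Gymnasium-Robotics	Robotics simulation environments	540K+
Shimmy	Cross-API compatibility layer	4.78M+
MO-Gymnasium	Multi-objective RL	260K+
HighwayEnv	Autonomous-driving simulation	230K+
ViZDoom	Doom game environment	350K+

Farama Foundation projects page

2.2 Governance Model

3. Core API Design

Gymnasium's core philosophy: an environment is a simple Python class that interacts with the agent through two methods, reset() and step().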

3.1 Minimal Working Example

import gymnasium as gym

env = gym.make("CartPole-v1", render_mode="human")
observation, info = env.reset(seed=42)

for _ in range(1000):
    action = env.action_space.sample()  # random policy
    observation, reward, terminated, truncated, info = env.step(action)

    # Start a new episode whenever the current one ends for either reason.
    if terminated or truncated:
        observation, info = env.reset()

env.close()

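3.2 The Five-Tuple Returned by `step()`

This is the most fundamental difference between Gymnasium and the old Gym:

Return value	Type	Meaning
`observation`	`ObsType`	The environment's new observation
`reward`	`float`	Immediate reward
`terminated`	`bool`	The task ended naturally (success or failure)
`truncated`	`bool`	The episode was cut off by an external constraint (timeout or step limit)
`info`	`dict`	Auxiliary diagnostic information

Separating the two flags matters for value bootstrapping, because only a natural ending means there is no future value:

# Correct bootstrap logic
if terminated:
    next_value = 0.0                 # natural end: no future value
else:
    next_value = value_fn(next_obs)  # truncated or still running: estimate the remaining value

Gymnasium migration guide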
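To make the bootstrap rule concrete, here is a minimal sketch of a one-step TD target. The function name `td_target` and its parameters are illustrative assumptions, not part of Gymnasium; the point is only that `terminated` zeroes the bootstrap term while `truncated` does not.

```python
def td_target(reward: float, next_value: float, terminated: bool, gamma: float = 0.99) -> float:
    """One-step TD target that bootstraps unless the episode truly terminated."""
    # `truncated` is deliberately not a parameter: a time-limit cutoff does not
    # make the next state worthless, so only `terminated` removes the bootstrap term.
    return reward + gamma * (0.0 if terminated else next_value)


# Same reward and next-state estimate, different episode endings.
print(td_target(1.0, 10.0, terminated=True, gamma=0.5))   # terminal: 1.0
print(td_target(1.0, 10.0, terminated=False, gamma=0.5))  # truncated or ongoing: 6.0
```

Treating a truncated step like a terminal one would pull the target down to the bare reward, which biases value estimates for states near the time limit.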

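3.3 The Spaces System

Gymnasium provides a rich set of space types for describing observations and actions:

Space type	Description	Example
`Box`	Continuous n-dimensional array space	`Box(low=-1, high=1, shape=(4,))`
`Discrete`	Discrete integer space	`Discrete(2)`, e.g. left/right
`MultiDiscrete`	Multi-dimensional discrete space	`MultiDiscrete([3, 2])`
`MultiBinary`	Multi-dimensional binary space	`MultiBinary(5)`
`Text`	String space	`Text(max_length=10)`
`Dict`	Dictionary composite space	`Dict({"pos": Box(...), "vel": Box(...)})`
`Tuple`	Tuple composite space	`Tuple((Discrete(2), Box(...)))`
`Graph`	Graph-structured space	Nodes, edges, and edge links
`Sequence`	Variable-length sequence space	`Sequence(Box(...))`

v1.1 also added probability-mask sampling (space.sample(probability=...)), which lets a policy's output probability distribution be used directly for sampling (Gymnasium v1.1 Release Notes).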
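As a rough illustration of what probability-based sampling means for a discrete action space, the sketch below uses only the standard library. `sample_discrete` is a hypothetical stand-in written for this note; it is not Gymnasium's implementation and does not reproduce the exact argument validation of `space.sample(probability=...)`.

```python
import random


def sample_discrete(n, probability=None, rng=None):
    """Draw an integer in [0, n); `probability` optionally gives per-action weights."""
    if rng is None:
        rng = random.Random()
    if probability is None:
        return rng.randrange(n)  # uniform sampling, the default behaviour
    if len(probability) != n or abs(sum(probability) - 1.0) > 1e-6:
        raise ValueError("probability must have n entries that sum to 1")
    # Weighted draw: actions with probability 0 can never be selected.
    return rng.choices(range(n), weights=probability, k=1)[0]


rng = random.Random(0)
policy_probs = [0.0, 1.0, 0.0]  # e.g. the output of a deterministic policy head
print(sample_discrete(3, policy_probs, rng))  # always action 1
```

With the real library, the equivalent call would be made on the space object itself, so the policy's probabilities never need to be converted into a manual sampling step.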

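4. Built-in Environment Families

Gymnasium ships five families of built-in environments, ranging in complexity from introductory teaching to frontier research:

4.1 Classic Control

4.2 Box2D

4.3 Toy Text

4.4 MuJoCo

4.5 Atari (now split out as ALE)

5. Key Features in Depth

5.1 The Wrapper System

Wrappers are Gymnasium's most powerful extension mechanism. By nesting wrappers, you can transform observations, rewards, actions, and more without modifying the original environment's code: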

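gymnasium.wrappers
├── ObservationWrapper     # transform observations (base class, exposed as gymnasium.ObservationWrapper)
├── RewardWrapper          # transform rewards (base class, exposed as gymnasium.RewardWrapper)
├── ActionWrapper          # transform actions (base class, exposed as gymnasium.ActionWrapper)
├── RecordVideo            # record videos
├── RecordEpisodeStatistics # record episode statistics
├── NormalizeObservation   # normalize observations
├── NormalizeReward        # normalize rewards
├── TimeLimit              # time limit
├── FrameStackObservation  # frame stacking
├── FlattenObservation     # flatten observations
├── ResizeObservation      # resize image observations
├── ClipReward             # clip rewards
├── TransformObservation   # custom observation transform
├── TransformReward        # custom reward transform
├── TransformAction        # custom action transform
├── StickyAction           # sticky actions (repeat the previous action with some probability)
├── AddWhiteNoise          # add white noise (v1.2.0)
├── ObstructView           # obstruct the view (v1.2.0)
├── DiscretizeObservation  # discretize observations (v1.2.1)
├── DiscretizeAction       # discretize actions (v1.2.1)
├── RepeatAction           # repeat actions (v1.3.0)
└── DelayObservation       # delay observations (v1.0)

A major change in v1.0: wrappers for Env and VectorEnv were split into separate namespaces (gymnasium.wrappers vs gymnasium.wrappers.vector), because the two classes no longer share an inheritance relationship.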
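The nesting idea can be sketched without importing Gymnasium at all. The classes below (`CountdownEnv`, `ToyWrapper`, `ScaleReward`, `StepLimit`) are hypothetical stand-ins that only mirror the reset/step interface; the real base classes carry additional attributes such as `observation_space` and `action_space`.

```python
class CountdownEnv:
    """Toy environment: the episode terminates after `horizon` steps."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return self.t, {}

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.horizon, False, {}


class ToyWrapper:
    """Forwards everything to the wrapped env; subclasses override what they change."""

    def __init__(self, env):
        self.env = env

    def reset(self, seed=None):
        return self.env.reset(seed=seed)

    def step(self, action):
        return self.env.step(action)


class ScaleReward(ToyWrapper):
    """Multiplies every reward by a constant factor."""

    def __init__(self, env, scale):
        super().__init__(env)
        self.scale = scale

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        return obs, reward * self.scale, terminated, truncated, info


class StepLimit(ToyWrapper):
    """Sets truncated=True once `max_steps` steps have elapsed in an episode."""

    def __init__(self, env, max_steps):
        super().__init__(env)
        self.max_steps = max_steps
        self.elapsed = 0

    def reset(self, seed=None):
        self.elapsed = 0
        return self.env.reset(seed=seed)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.elapsed += 1
        return obs, reward, terminated, truncated or self.elapsed >= self.max_steps, info


# Wrappers compose from the inside out without touching CountdownEnv itself.
env = StepLimit(ScaleReward(CountdownEnv(horizon=5), scale=0.5), max_steps=2)
env.reset()
print(env.step(0))  # (1, 0.5, False, False, {})
print(env.step(0))  # (2, 0.5, False, True, {})
```

Note that the step limit reports `truncated`, not `terminated`, which keeps the distinction from section 3.2 intact through the wrapper stack.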

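5.2 Vector Environments

v1.0 thoroughly reworked vectorized environments:

Env and VectorEnv are now two independent base classes (in earlier versions VectorEnv inherited from Env, which was semantically incorrect)
A new gymnasium.make_vec() function creates multiple parallel environments in one line
Three vectorization modes are supported: sync, async, and vector_entry_point (environment-defined)
v1.1 added three autoreset modes (next-step, same-step, and disabled) for compatibility with older behavior

# Create 3 parallel environments in one line
import gymnasium as gym

envs = gym.make_vec("CartPole-v1", num_envs=3, vectorization_mode="async")

Gymnasium v1.0 blog post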
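The following standard-library sketch shows, under simplifying assumptions, what synchronous vectorization with next-step autoreset looks like: a sub-environment that finished on one step is reset on the following `step()` call, and the action passed to it on that call is ignored. `CountdownEnv` and `ToySyncVector` are hypothetical names, plain lists stand in for batched arrays, and the real implementation differs in many details.

```python
class CountdownEnv:
    """Toy environment: the episode terminates after `horizon` steps."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return self.t, {}

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.horizon, False, {}


class ToySyncVector:
    """Steps several envs in lockstep and returns per-field batches."""

    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]
        self._needs_reset = [False] * len(self.envs)

    def reset(self):
        results = [env.reset() for env in self.envs]
        self._needs_reset = [False] * len(self.envs)
        return [obs for obs, _ in results], [info for _, info in results]

    def step(self, actions):
        batch = []
        for i, (env, action) in enumerate(zip(self.envs, actions)):
            if self._needs_reset[i]:
                # Next-step autoreset: this call resets the finished sub-env
                # instead of applying the action.
                obs, info = env.reset()
                result = (obs, 0.0, False, False, info)
                self._needs_reset[i] = False
            else:
                result = env.step(action)
                self._needs_reset[i] = result[2] or result[3]
            batch.append(result)
        obs, rewards, terminated, truncated, infos = map(list, zip(*batch))
        return obs, rewards, terminated, truncated, infos


envs = ToySyncVector([lambda: CountdownEnv(1), lambda: CountdownEnv(3)])
envs.reset()
print(envs.step([0, 0])[:3])  # ([1, 1], [1.0, 1.0], [True, False])
print(envs.step([0, 0])[:3])  # ([0, 2], [0.0, 1.0], [False, False])
```

In the second call the first sub-environment returns its reset observation with zero reward, which is why rollout code has to skip that transition when building training batches.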

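5.3 Functional Environments

v1.0 ships FuncEnv, a functional environment paradigm that had previously been available only as an experimental API. It decomposes the environment's dynamics into pure functions:

transition(state, action) → state transition
observation(state) → observation
reward(state, action, next_state) → reward
terminal(state) → termination check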
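A minimal sketch of this decomposition, written as plain module-level functions for a toy counter task. It is a simplification and an assumption about shape only: the actual FuncEnv methods also take a random-number-generator argument and optional parameters, which are omitted here, and `initial` and `rollout` are names chosen for this example.

```python
def initial():
    """Initial state: a counter starting at zero."""
    return 0


def transition(state, action):
    """Pure state transition: add the action to the counter."""
    return state + action


def observation(state):
    """The toy task is fully observable, so the observation is the state."""
    return state


def reward(state, action, next_state):
    """Reward 1.0 only for landing exactly on the goal value 3."""
    return 1.0 if next_state == 3 else 0.0


def terminal(state):
    """The episode ends once the counter reaches or passes 3."""
    return state >= 3


def rollout(actions):
    """Drive the pure functions with an explicit state; nothing is mutated in place."""
    state, total = initial(), 0.0
    for action in actions:
        if terminal(state):
            break
        next_state = transition(state, action)
        total += reward(state, action, next_state)
        state = next_state
    return observation(state), total


print(rollout([1, 1, 1, 1]))  # (3, 1.0)
print(rollout([2, 2]))        # (4, 0.0)
```

Because the state is passed explicitly rather than stored on an object, functions of this shape are straightforward to trace, batch, or compile in frameworks such as JAX.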

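5.4 Cross-Framework Interoperability

v1.0 and v1.1 substantially improved support for working across deep learning frameworks:

Conversion wrappers such as JaxToTorch, JaxToNumpy, and NumpyToTorch
Full DLPack API support (zero-copy tensor conversion)
v1.2.0 added ArrayConversion, a general-purpose conversion wrapper that supports all Array API compatible frameworks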

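6. Version History and Latest Developments

6.1 v1.0.0 (2024-10-08): Milestone

More than 200 PRs were merged. Core changes:

Env / VectorEnv separation
make_vec() function
FuncEnv functional environments
MuJoCo v5 environments
Plugin system removed (external libraries must now be imported explicitly)
ALE split out into a separate project
Commitment: this is the last breaking change for the foreseeable future

6.2 v1.1.0 (2025-02-26)

Probability-mask sampling
RGBD rendering
Autoreset mode support
Full DLPack support

6.3 v1.2.0 (2025-06-27)

MuJoCo v2/v3 environments moved to Gymnasium-Robotics
Python 3.13 support (3.8/3.9 dropped)
AddWhiteNoise and ObstructView wrappers
ArrayConversion general-purpose conversion

6.4 v1.3.0 (2026-04-22): Current Latest

Taxi-v4 environment
RepeatAction wrapper
Migration from pygame to pygame-ce (unlocking Python 3.14 compatibility)
Memory-leak fix for vector environments
10 new third-party environments (including a cybersecurity category)

Gymnasium Release Notes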

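7. The Ecosystem at a Glance

7.1 Training Library Compatibility

Nearly all mainstream RL training libraries support the Gymnasium API:

Training library	Type	Notes
CleanRL	Single-file implementations	Officially recommended for beginners
Stable-Baselines3	Classic algorithm suite	PPO, A2C, SAC, TD3, etc.
RLlib	Distributed	Part of the Ray ecosystem
Tianshou	PyTorch	Maintained at Tsinghua University
SB3-Contrib	SB3 extensions	TRPO, QR-DQN, etc.
AgileRL	Evolutionary RL	Combines RL with evolutionary algorithms

7.2 Third-Party Environment Ecosystem

Gymnasium's list of third-party environments keeps growing. Covered domains include:

Robotics: Gymnasium-Robotics, panda-gym, Metaworld, Robo-gym
Autonomous driving: HighwayEnv, BlueSky-Gym (air traffic)
Games: ViZDoom (Doom), Craftium (Minecraft-like), Procgen2
Financial trading: AnyTrading
Cybersecurity (a category added in v1.3.0)
Multi-objective RL: MO-Gymnasium
Safe RL: Safety-Gymnasium, Robust-Gymnasium (ICLR 2025)

Third-party environments list

7.3 Multi-Agent Extension: PettingZoo

PettingZoo is the multi-agent counterpart of Gymnasium and supports two API styles:

AEC API (Agent Environment Cycle): sequential decision-making
Parallel API: simultaneous, parallel decision-making

PettingZoo has more than 5.2 million cumulative downloads and supports Python 3.10-3.14.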

8. Migrating from OpenAI Gym

If you are still using the old OpenAI Gym (import gym), these are the core changes when migrating:

Component	Old Gym v0.21	New Gymnasium
Import	`import gym`	`import gymnasium as gym`
Reset	`obs = env.reset()`	`obs, info = env.reset()`
Seeding	`env.seed(42)`	`env.reset(seed=42)`
Step	`obs, reward, done, info = env.step(a)`	`obs, reward, terminated, truncated, info = env.step(a)`
Rendering	`env.render(mode="human")`	`gym.make(id, render_mode="human")`
Bootstrap	`(1-done) * gamma * next_val`	`(1-terminated) * gamma * next_val`

For old environments that cannot be upgraded, the Shimmy compatibility layer can be used to wrap them.
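To show what the step-signature change amounts to, here is a small sketch that converts an old-style four-tuple into the new five-tuple. It assumes the old environment was wrapped in Gym's `TimeLimit`, which recorded time-limit cutoffs under the `"TimeLimit.truncated"` key of `info`. The helper name `to_five_tuple` is hypothetical, and this is not Shimmy's implementation.

```python
def to_five_tuple(old_result):
    """Convert (obs, reward, done, info) into (obs, reward, terminated, truncated, info)."""
    obs, reward, done, info = old_result
    info = dict(info)  # copy so the caller's dict is left untouched
    # Old Gym signalled a time-limit cutoff through this info key.
    hit_time_limit = bool(info.pop("TimeLimit.truncated", False))
    terminated = done and not hit_time_limit
    truncated = done and hit_time_limit
    return obs, reward, terminated, truncated, info


print(to_five_tuple((0, 1.0, True, {})))                             # (0, 1.0, True, False, {})
print(to_five_tuple((0, 1.0, True, {"TimeLimit.truncated": True})))  # (0, 1.0, False, True, {})
```

An old environment that enforces its own step limit internally without setting that key cannot be disambiguated this way; every `done` would then be reported as `terminated`.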

9. Academic Impact and Citation

The Gymnasium white paper, "Gymnasium: A Standard Interface for Reinforcement Learning Environments", has been accepted at NeurIPS 2025 Datasets & Benchmarks (NeurIPS 2025 Poster). Citation format:

@article{towers2024gymnasium,
  title={Gymnasium: A Standard Interface for Reinforcement Learning Environments},
  author={Towers, Mark and Kwiatkowski, Ariel and Terry, Jordan and Balis, John U and
          De Cola, Gianluca and Deleu, Tristan and Goul{\~a}o, Manuel and
          Kallinteris, Andreas and Krimmel, Markus and KG, Arjun and others},
  journal={arXiv preprint arXiv:2407.17032},
  year={2024}
}

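The white paper states that, as of April 2025, Gymnasium had more than 1 million downloads per month and more than 18 million cumulative installs (a figure that had grown to 70M+ by May 2026).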

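10. Summary and Outlook

Core Value

Gymnasium addresses three fundamental problems in RL research:

Standardization: a unified reset/step interface lets algorithms and environments be combined freely
Reproducibility: strict environment versioning (the -v0, -v1 suffixes) and a seeding mechanism
Ecosystem effects: as the "API standard", it connects two communities, environment developers and algorithm developers

Future Directions

According to the v1.0 release blog post, Gymnasium's core API has entered a phase of long-term stability. Likely priorities going forward are:

Wider adoption of functional (JAX-native) environments
More third-party environments in vertical domains
Keeping pace with Python releases (3.14 is already supported)
Interface adaptation for LLM-based agents (this site already indexes papers such as "Healthcare AI GYM for Medical Agents", which trains medical AI in gymnasium-compatible environments, arXiv:2605.02943)

Additional Notes

Scope of evidence: this report is based on the official Gymnasium documentation, the GitHub repository, the Farama Foundation website, the arXiv white paper, and the release notes; all key claims can be checked through the inline links.
Points still to verify: Gymnasium's specific financial figures (donation amounts) and each training library's progress on v1.3.0 compatibility were not checked individually; consult each training library's release notes.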
