Large Language Models (LLMs) as agents often struggle in out-of-distribution (OOD) environments. These environments are dynamic, governed by task-specific rules and stochasticity, making them difficult to capture using only pretrained knowledge. Standard reinforcement learning (RL) fine-tuning frequently suffers from overfitting: models tend to exploit one narrow trajectory (high Pass@1) while failing to maintain diverse solution coverage (low Pass@k). This brittleness prevents agents from generalizing and scaling to more complex scenarios.
SPA (Self-Play Agent) addresses this limitation by equipping LLM agents with an internal world model that explicitly represents both environment states and their transitions. By grounding reasoning in structured representations, SPA provides agents with the tools to understand, predict, and plan effectively. This world model is acquired through self-play supervised finetuning (SFT), and then leveraged in downstream PPO optimization.
SPA integrates three stages into a unified pipeline:
- Self-play data collection: the agent interacts with the environment using its current policy, gathering trajectories of states, actions, and resulting next states.
- World-model supervised finetuning (SFT): the collected trajectories are converted into supervision that teaches the agent to represent states and predict transitions.
- Reinforcement learning with PPO: the finetuned agent is further optimized with PPO, building on the internalized world model.
The key innovation of SPA lies in internalizing environment dynamics into the agent itself. World modeling is broken into two complementary components:
- State estimation: grounding the current observation in a structured representation, including explicit coordinates for the relevant entities.
- Transition modeling: predicting the next state that results from taking a candidate action in the current state.
Example (Sokoban prompt with structured coordinates):
You are solving the Sokoban puzzle. Push all boxes to targets.
State:
######
#___O#
#__X_#
###P_#
######
Player at (3,3); Box at (2,3); Goal at (1,4).
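For concreteness, here is a minimal sketch of how such a coordinate line can be derived from the raw grid. It is not code from the paper; describe_state is a hypothetical helper, and the symbol conventions are taken from the example above (# wall, _ floor, O goal, X box, P player).

# Minimal sketch: derive the structured coordinate line from a raw grid.
# Assumed symbols (from the example above): '#' wall, '_' floor,
# 'O' goal, 'X' box, 'P' player. Handles a single box/goal for brevity.

def describe_state(grid: str) -> str:
    positions = {}
    for r, row in enumerate(grid.strip().splitlines()):
        for c, cell in enumerate(row):
            if cell in "PXO":
                positions[cell] = (r, c)
    return (
        f"Player at {positions['P']}; "
        f"Box at {positions['X']}; "
        f"Goal at {positions['O']}."
    )

grid = "######\n#___O#\n#__X_#\n###P_#\n######"
print(describe_state(grid))  # Player at (3, 3); Box at (2, 3); Goal at (1, 4).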
SPA jointly optimizes two objectives:
- World-modeling SFT loss: computed over the full reasoning trace (the <think> and <answer> spans), ensuring the agent learns state grounding and transition prediction.
- PPO loss: computed only on the <answer> tokens, allowing RL to focus on action quality while leveraging the internalized world model for stability and efficiency.

This separation enforces a clean division: SFT captures the environment's rules, PPO learns how to act optimally within them.
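In code, this split might look roughly like the sketch below (PyTorch-style; function names, shapes, and hyperparameters are illustrative assumptions, not the paper's implementation). The SFT loss supervises every token of the trace, while the PPO surrogate is averaged only over tokens flagged by an answer mask.

# Illustrative sketch of the two objectives (not the paper's code).
import torch
import torch.nn.functional as F

def world_model_sft_loss(logits, targets):
    # Supervise the full trace: the <think> span (state grounding and
    # transition prediction) as well as the <answer> span.
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

def ppo_policy_loss(logprobs, old_logprobs, advantages, answer_mask, clip_eps=0.2):
    # Clipped PPO surrogate, applied only where answer_mask == 1,
    # i.e. the tokens inside <answer> ... </answer>.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    per_token = -torch.min(unclipped, clipped)
    return (per_token * answer_mask).sum() / answer_mask.sum().clamp(min=1)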
SPA scales consistently across model sizes and families.
Structured state representations dramatically reduce perplexity.
Transition-model learning is central to RL scaling. When the SFT loss on current and next states is masked out, PPO training shows no improvement. This confirms that the ability to predict future states is indispensable for effective policy learning.
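The ablation can be pictured as masking the state spans out of the SFT labels, as in the hedged sketch below (PyTorch-style, assumed interface; state_token_mask marks tokens inside <observation> and <prediction>).

# Sketch of the ablation: exclude state tokens from SFT supervision
# by setting their labels to the ignore index (assumed interface).
import torch
import torch.nn.functional as F

def sft_loss_without_states(logits, targets, state_token_mask):
    # state_token_mask == 1 marks tokens inside <observation>/<prediction>.
    labels = targets.masked_fill(state_token_mask.bool(), -100)
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )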
Ground-truth coordinates are critical. Without explicit spatial grounding, the model struggles to align its predictions with environment dynamics. Experiments show that randomizing coordinates leads to a collapse in performance, highlighting the necessity of structured state descriptions.
Effective exploration depends heavily on the initial policy. Generating world-modeling trajectories with random actions instead of the agent's own policy degrades downstream learning. Self-play produces higher-quality data, aligned with the agent's reasoning and exploration behavior, and therefore more robust world models.
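A minimal sketch of the self-play collection loop described here (env and policy interfaces are hypothetical, gym-like stand-ins, not the paper's code): the agent's own policy drives exploration, and each transition is rendered into the <observation>/<prediction>/<answer> format used in the rollout example further below.

# Hypothetical self-play collection loop for world-model SFT data.
def collect_world_model_data(env, policy, num_episodes=1000):
    examples = []
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)                     # self-play: the agent's own policy
            next_state, reward, done = env.step(action)
            target = (
                "<think>\n"
                f"<observation>\n{state}\n</observation>\n"
                f"<prediction>\n{next_state}\n</prediction>\n"
                "</think>\n"
                f"<answer>{action}</answer>"
            )
            examples.append({"prompt": state, "target": target})
            state = next_state
    return examples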
SPA-trained agents sustain higher Pass@k scores over time, suggesting improved coverage of diverse solution paths while still exploiting efficient strategies.
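For reference, Pass@k is typically computed with the standard unbiased estimator introduced with HumanEval (Chen et al., 2021): from n sampled rollouts per task with c successes, the probability that at least one of k samples succeeds. A short sketch, not code from the SPA paper:

# Unbiased Pass@k estimator (Chen et al., 2021); not from the SPA paper.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples, drawn from n attempts
    # of which c succeeded, solves the task.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=64, c=12, k=8))  # e.g. 64 rollouts, 12 successes, Pass@8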
Training a world model on simple tasks (FrozenLake 4×4) significantly accelerates convergence and improves asymptotic performance on more difficult variants (FrozenLake 6×6).
While intra-domain transfer works well (easy-to-hard), cross-environment transfer (e.g., Sokoban → FrozenLake) remains limited, showing that environment-specific modeling is still necessary.
Together, these findings explain why SPA consistently outperforms vanilla RL and online world-modeling baselines.
Example rollout (Sokoban): the agent restates the observed state, predicts the outcome of its chosen action, and then emits that action.
<think>
<observation>
######
#___O#
#__X_#
###P_#
######
Player at (3,3); Box at (2,3); Goal at (1,4).
</observation>
<prediction>
######
#___O#
#____#
###XP#
######
</prediction>
</think>
<answer>Up</answer>
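Since the PPO stage only scores the <answer> span, a rollout like the one above has to be parsed back into its pieces. A minimal sketch of such a parser (helper name and regexes are illustrative assumptions):

# Illustrative parser for rollouts in the format shown above.
import re

def parse_rollout(text: str) -> dict:
    # Pull out the predicted next state and the chosen action.
    prediction = re.search(r"<prediction>\s*(.*?)\s*</prediction>", text, re.DOTALL)
    answer = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return {
        "predicted_state": prediction.group(1) if prediction else None,
        "action": answer.group(1) if answer else None,
    }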
@misc{chen2025spa,
  title={Internalizing World Models via Self-Play Finetuning for Agentic RL},
  author={Shiqi Chen and Tongyao Zhu and Zian Wang and Jinghan Zhang and Kangrui Wang and Siyang Gao and Teng Xiao and Yee Whye Teh and Junxian He and Manling Li},
  year={2025},
  eprint={2510.15047},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.15047},
}