
Junhyuk Oh*, Iurii Kemaev*†, Greg Farquhar*, Dan A. Calian*, Matteo Hessel,
Luisa Zintgraf, Satinder Singh, Hado van Hasselt, David Silver

Google DeepMind
*Equal contribution. †Engineering lead.

Published in Nature (2025)

Intro

The field of artificial intelligence (AI) has been revolutionized by replacing hand-crafted components with those learned from data and experience. The next natural step is to allow the learning algorithms themselves to be learned, from experience.

Many of the most successful AI agents are based on reinforcement learning (RL), in which agents learn by interacting with environments, achieving numerous landmarks including the mastery of complex competitive games such as Go, chess, and StarCraft.

Traditional RL algorithms can be written down with equations and implemented in code. They are designed by human experts in a laborious process of trial and error, guided by experiment, theory, and human intuitions.

In contrast, our discovered rule, which we call DiscoRL, is represented by a neural network, which can be much more flexible than simple mathematical equations. Instead of being hand-crafted, it is learned by an automated process from the experience of many agents interacting with diverse and complex environments.

DiscoRL outperforms many existing RL algorithms on a variety of benchmarks and becomes stronger with more environments used for discovery.

Method

RL agents typically make predictions that are useful for learning, such as the value of a certain action. The semantics of these predictions are determined by the agent's update rule. In our framework, the agent makes additional predictions without pre-defined semantics, opening up the possibility of discovering entirely new prediction semantics.
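To make this concrete, here is a minimal JAX sketch of a hypothetical agent network with a policy head, a conventional value head, and an extra prediction head whose semantics are left open. The layer sizes, names (such as agent_forward), and architecture are illustrative assumptions, not the paper's actual agent.

import jax
import jax.numpy as jnp

NUM_ACTIONS = 18      # e.g. the full Atari action set (assumption)
NUM_PREDICTIONS = 16  # size of the extra prediction head (assumption)

def init_agent_params(key, obs_dim, hidden=256):
    k1, k2, k3, k4 = jax.random.split(key, 4)
    scale = 1.0 / jnp.sqrt(obs_dim)
    return {
        'torso': jax.random.normal(k1, (obs_dim, hidden)) * scale,
        'policy': jax.random.normal(k2, (hidden, NUM_ACTIONS)) * 0.01,
        'value': jax.random.normal(k3, (hidden, 1)) * 0.01,
        'pred': jax.random.normal(k4, (hidden, NUM_PREDICTIONS)) * 0.01,
    }

def agent_forward(params, obs):
    h = jax.nn.relu(obs @ params['torso'])
    policy_logits = h @ params['policy']   # policy head
    value = (h @ params['value'])[0]       # conventional value estimate
    predictions = h @ params['pred']       # extra predictions with no pre-defined semantics
    return policy_logits, value, predictions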

RL agents optimise their policies and predictions using a loss function. This function depends on the agent’s own predictions, as well as the rewards it receives while interacting with its environment.

Instead of manually defining the loss function with equations, we use a neural network, called the ‘meta-network’, to define the loss function for the predictions and policy. The meta-network is randomly initialised, so it initially acts as a random update rule. The meta-learning process then optimises the meta-network to gradually discover more efficient update rules.
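The sketch below illustrates the idea of a learned loss: an assumed two-layer MLP, here called meta_loss, maps per-timestep agent outputs and the reward to a scalar loss. The real meta-network's inputs, architecture, and outputs are richer; this only shows how the loss can be a parameterised function rather than a hand-written equation. It reuses the constants and agent_forward outputs from the previous sketch.

def init_meta_params(key, hidden=64):
    in_dim = NUM_ACTIONS + 1 + NUM_PREDICTIONS + 1  # log-policy, value, predictions, reward
    k1, k2 = jax.random.split(key)
    return {
        'w1': jax.random.normal(k1, (in_dim, hidden)) / jnp.sqrt(in_dim),
        'w2': jax.random.normal(k2, (hidden, 1)) / jnp.sqrt(hidden),
    }

def meta_loss(meta_params, policy_logits, value, predictions, reward):
    # The meta-network maps per-step quantities to a scalar loss; with randomly
    # initialised weights this is a random update rule, which meta-learning then shapes.
    features = jnp.concatenate([
        jax.nn.log_softmax(policy_logits),
        jnp.atleast_1d(value),
        predictions,
        jnp.atleast_1d(reward),
    ])
    h = jax.nn.relu(features @ meta_params['w1'])
    return (h @ meta_params['w2'])[0]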

In order to discover a strong update rule from experience, we create a large population of agents, each of which interacts with its own environment. Each agent uses the shared meta-network to update its predictions and policy. We then estimate the agents' performance and compute a meta-gradient, which indicates how to adjust the meta-network to improve that performance. Over time, the discovered rule becomes a stronger and faster RL algorithm. After the discovery process is complete, the discovered rule can be used to train new agents in unseen environments.
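The following sketch shows one meta-gradient step for a single agent under strong simplifying assumptions: a one-step inner update on the learned loss and a simple return-weighted log-likelihood on held-out experience as the performance estimate. The actual system runs hundreds of agents, differentiates through longer unrolls, and uses a different performance objective; names such as inner_update and meta_objective are hypothetical.

def inner_update(agent_params, meta_params, obs, reward, lr=1e-3):
    # The agent improves its predictions and policy by descending the *learned* loss.
    def per_step_loss(p):
        logits, value, preds = agent_forward(p, obs)
        return meta_loss(meta_params, logits, value, preds, reward)
    grads = jax.grad(per_step_loss)(agent_params)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, agent_params, grads)

def meta_objective(meta_params, agent_params, obs, reward,
                   held_out_obs, held_out_action, held_out_return):
    # Performance of the *updated* agent, estimated here with a simple
    # return-weighted log-likelihood on held-out experience (an assumption,
    # not the paper's exact performance estimate).
    updated = inner_update(agent_params, meta_params, obs, reward)
    logits, _, _ = agent_forward(updated, held_out_obs)
    log_prob = jax.nn.log_softmax(logits)[held_out_action]
    return -(held_out_return * log_prob)

# The gradient with respect to the meta-network flows *through* the agent's update.
meta_grad_fn = jax.grad(meta_objective)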

In order to scale up the discovery process, we developed a meta-learning framework that can handle hundreds of parallel agents. To improve fault tolerance, and to facilitate our research, we ensured that all aspects of agents, environments, and meta-learning were deterministic and checkpointable, providing full reproducibility. We also implemented a number of optimisations to handle the compute-intensive meta-gradients, including mixed-mode differentiation, recursive gradient checkpointing, mixed-precision training, and pre-emptive parameter offloading(+).

(+) Kemaev, I., Calian, D. A., Zintgraf, L. M., Farquhar, G. & van Hasselt, H. Scalable meta-learning via mixed-mode differentiation. In International Conference on Machine Learning (2025).
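For illustration only, the JAX sketch below shows two of the techniques named above: activation rematerialisation via jax.checkpoint (gradient checkpointing) and a mixed-mode, forward-over-reverse Hessian-vector product. It is a generic example, not the project's implementation; see the cited paper for the actual approach.

import jax
import jax.numpy as jnp

@jax.checkpoint   # rematerialise this block's activations during the backward pass
def expensive_block(params, x):
    for w in params:
        x = jnp.tanh(x @ w)
    return x

def loss(params, x):
    return jnp.sum(expensive_block(params, x) ** 2)

def hvp(params, x, vector):
    # Mixed-mode differentiation: forward-mode (jvp) over reverse-mode (grad)
    # gives a Hessian-vector product without materialising the full Hessian.
    grad_fn = lambda p: jax.grad(loss)(p, x)
    _, hv = jax.jvp(grad_fn, (params,), (vector,))
    return hv

keys = jax.random.split(jax.random.PRNGKey(0), 4)
params = [jax.random.normal(k, (8, 8)) for k in keys]
x = jnp.ones((2, 8))
v = jax.tree_util.tree_map(jnp.ones_like, params)
print(jax.tree_util.tree_map(jnp.shape, hvp(params, x, v)))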

Results

After large-scale meta-learning, we find that DiscoRL outperforms the state of the art, or performs competitively with it, on a number of challenging benchmarks.

1. Meta-learn Disco57 on Atari57

We meta-learn the Disco57 update rule on the diverse set of 57 standard Atari games (Atari57).

2. Evaluate Disco57

Disco57 is evaluated zero-shot on the unseen ProcGen and DMLab-30 domains to measure its generalisation capabilities.

3. Expand training set: meta-learn Disco103

We expand the training set with the more challenging ProcGen and DMLab-30 domains to meta-learn Disco103.

4. Evaluate on unseen domains

We evaluate both Disco57 and Disco103 on unseen domains: Crafter, NetHack, and Sokoban.

DiscoRL generalises, performing well in environments that were not used for discovery and that have radically different observation and action spaces. DiscoRL also generalises when used to train agents with many more parameters and much more data than those used for discovery.

The discovery process scales, increasing performance as we increase the number, diversity, and complexity of training environments, as well as the overall amount of experience consumed.

We find that the discovered predictions capture novel semantics, identifying important features of upcoming events on moderate time-scales, such as future policy entropies and large-reward events. See the manuscript for more details.

The overall results suggest that the design of RL algorithms may, in the future, be led by automated methods that can scale effectively with data and compute.

Citation


@Article{DiscoRL2025,
  author  = {Oh, Junhyuk and Farquhar, Greg and Kemaev, Iurii and Calian, Dan A. and Hessel, Matteo and Zintgraf, Luisa and Singh, Satinder and van Hasselt, Hado and Silver, David},
  journal = {Nature},
  title   = {Discovering State-of-the-art Reinforcement Learning Algorithms},
  year    = {2025},
  doi     = {10.1038/s41586-025-09761-x}
}

Code Availability

We provide the meta-training and evaluation code, with the meta-parameters of Disco103, under an open source Apache 2.0 licence, on GitHub.

Acknowledgements

We thank Sebastian Flennerhag, Zita Marinho, Angelos Filos, Surya Bhupatiraju, Andras György, and Andrei A. Rusu for their feedback and discussions about related ideas. We also thank Blanca Huergo Muñoz, Manuel Kroiss, and Dan Horgan for their help with the engineering infrastructure. Finally, we thank Raia Hadsell, Koray Kavukcuoglu, Nando de Freitas, and Oriol Vinyals for their high-level feedback on the project, and Simon Osindero and Doina Precup for their feedback on the early version of this work.

License and disclaimer

Copyright 2025 Google LLC

All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0

All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode

Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.

This is not an official Google product.
