Metamon enables reinforcement learning (RL) research on Pokémon Showdown by providing:
- A standardized suite of teams and opponents for evaluation.
- A large dataset of RL trajectories "reconstructed" from real human battles.
- Starting points for training imitation learning (IL) and RL policies.
Metamon is the codebase behind "Human-Level Competitive Pokémon via Scalable Offline RL and Transformers" (RLC, 2025). Please check out our project website for an overview of our results. This README documents the dataset, pretrained models, training, and evaluation details to help you get battling!
Pokémon Showdown hosts many different rulesets spanning nine generations of the video game franchise. Metamon initially focused on the most popular singles ruleset ("OverUsed") for Generations 1, 2, 3, and 4. However, we are gradually expanding to Gen 9 to support the NeurIPS 2025 PokéAgent Challenge. This is a large project that will not be finalized in time for the competition launch; please stay tuned for updates.
The current status is:
| | Gen 1 OU | Gen 2 OU | Gen 3 OU | Gen 4 OU | Gen 9 OU |
|---|---|---|---|---|---|
| Datasets | ✅ | ✅ | ✅ | ✅ | 🟠 (beta) |
| Teams | ✅ | ✅ | ✅ | ✅ | ✅ |
| Heuristic Baselines | ✅ | ✅ | ✅ | ✅ | ✅ |
| Learned Baselines | ✅ | ✅ | ✅ | ✅ | 🟠 (beta) |
We also support the UnderUsed (UU), NeverUsed (NU), and Ubers tiers for Generations 1, 2, 3, and 4, though constant rule changes and small dataset sizes have always made these a bit of an afterthought.
Metamon is written and tested for Linux and Python 3.10+. We recommend creating a fresh virtual environment or conda environment:
```bash
conda create -n metamon python==3.10
conda activate metamon
```
Then, install with:
```bash
git clone --recursive git@github.com:UT-Austin-RPL/metamon.git
cd metamon
pip install -e .
```
To install Pokémon Showdown, we'll need a modern version of npm / Node.js (instructions here). Note that Showdown undergoes constant updates; breaking changes are rare, but they do happen. The version that downloads with this repo (`metamon/server`) is always supported.
```bash
cd server/pokemon-showdown
npm install
```
We will need to have the Showdown server running in the background while using Metamon:
```bash
# in the background (`screen`, etc.)
node pokemon-showdown start --no-security
# --no-security removes battle speed throttling and password requirements on your local server
```
If necessary, we can customize the server settings (`config/config.js`) or the rules for each game mode.
Verify that installation has gone smoothly with:
```bash
# run a few test battles on the local server
python -m metamon.env
```
Metamon provides large datasets of Pokémon team files, human battles, and other statistics that will automatically download when requested. Specify a path with:
```bash
# add to ~/.bashrc
export METAMON_CACHE_DIR=/path/to/plenty/of/disk/space
```
Metamon makes it easy to turn Pokémon into an RL research problem. Pick a set of Pokémon teams to play with, an observation space, an action space, and a reward function:
```python
from metamon.env import get_metamon_teams
from metamon.interface import DefaultObservationSpace, DefaultShapedReward, DefaultActionSpace

team_set = get_metamon_teams("gen1ou", "competitive")
obs_space = DefaultObservationSpace()
reward_fn = DefaultShapedReward()
action_space = DefaultActionSpace()
```
Then, battle against built-in baselines (or any `poke_env.Player`):
```python
from metamon.env import BattleAgainstBaseline
from metamon.baselines import get_baseline

env = BattleAgainstBaseline(
    battle_format="gen1ou",
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_fn,
    team_set=team_set,
    opponent_type=get_baseline("Gen1BossAI"),
)

# standard `gymnasium` environment
obs, info = env.reset()
next_obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```
The more flexible option is to request battles on our local Showdown server and battle anyone else who is online (humans, pretrained agents, or other Pokémon AI projects). If it plays Showdown, we can battle against it!
```python
from metamon.env import QueueOnLocalLadder

env = QueueOnLocalLadder(
    battle_format="gen1ou",
    player_username="my_scary_username",
    num_battles=10,
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_fn,
    player_team_set=team_set,
)
```
Metamon's main feature is that it creates a dataset of "reconstructed" human demonstrations for these environments:
```python
from metamon.data import ParsedReplayDataset

# pytorch dataset. examples are converted to
# the chosen obs/actions/rewards on-the-fly.
offline_dset = ParsedReplayDataset(
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_fn,
    formats=["gen1ou"],
)

obs_seq, action_seq, reward_seq, done_seq = offline_dset[0]
```
We can save our own agents' experience in the same format:
```python
env = QueueOnLocalLadder(
    ...,  # rest of args
    save_trajectories_to="my_data_path",
)
online_dset = ParsedReplayDataset(
    dset_root="my_data_path",
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_fn,
)

obs, info = env.reset()
terminated = False
while not terminated:
    *_, terminated, _, _ = env.step(env.action_space.sample())

# find completed battles before loading new examples
online_dset.refresh_files()
```
You are free to use this data to train an agent however you'd like, but we provide starting points for smaller-scale IL (`python -m metamon.il.train`) and RL (`python -m metamon.rl.train`), and a large set of pretrained models from our paper.
To run agents on the PokéAgent Challenge ladder:
1. Go to the link above and click "Choose name" in the top right corner. Pick a username that begins with "PAC".
2. Click the gear icon, then "register", and create a password.
3. Use `metamon.env.PokeAgentLadder` exactly how you use `QueueOnLocalLadder` in local tests. Provide your account details with the `player_username` and `player_password` args, as in the sketch below.
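For reference, here is a minimal sketch of connecting to the ladder, assuming `PokeAgentLadder` takes the same constructor arguments as `QueueOnLocalLadder` plus `player_password` (the username and password below are placeholders):

```python
from metamon.env import PokeAgentLadder

# reuses obs_space, action_space, reward_fn, and team_set from the Quick Start
env = PokeAgentLadder(
    battle_format="gen1ou",
    player_username="PACMyUsername",  # must begin with "PAC"
    player_password="my_password",
    num_battles=10,
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_fn,
    player_team_set=team_set,
)
```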
We have made every checkpoint of 20 models available on huggingface at `jakegrigsby/metamon`. Pretrained models can run without research GPUs, but you will need to install `amago`, an RL codebase by the same authors. Follow the instructions here.
Load and run pretrained models with `metamon.rl.evaluate`. For example:
```bash
python -m metamon.rl.evaluate --eval_type heuristic --agent SyntheticRLV2 --gens 1 --formats ou --total_battles 100
```
This runs the default checkpoint of the best model for 100 battles against a set of heuristic baselines highlighted in the paper.
Or to battle against whatever is logged onto the local Showdown server (including other pretrained models that are already waiting):
```bash
python -m metamon.rl.evaluate --eval_type ladder --agent SyntheticRLV2 --gens 1 --formats ou --total_battles 50 --username <pick unique username> --team_set competitive
```
Deploy pretrained agents on the PokéAgent Challenge ladder:
```bash
python -m metamon.rl.evaluate --eval_type pokeagent --agent SyntheticRLV2 --gens 1 --formats ou --total_battles 10 --username <your username> --password <your password> --team_set competitive
```
Some model sizes have several variants testing different RL objectives. See `metamon/rl/pretrained.py` for a complete list; agents trained after the paper have some discussion in their docstrings there.
| Model Name (`--agent`) | Description |
|---|---|
| `SmallIL` (2 variants) | 15M imitation learning model trained on 1M human battles |
| `SmallRL` (5 variants) | 15M actor-critic model trained on 1M human battles |
| `MediumIL` | 50M imitation learning model trained on 1M human battles |
| `MediumRL` (3 variants) | 50M actor-critic model trained on 1M human battles |
| `LargeIL` | 200M imitation learning model trained on 1M human battles |
| `LargeRL` | 200M actor-critic model trained on 1M human battles |
| `SyntheticRLV0` | 200M actor-critic model trained on 1M human + 1M diverse self-play battles |
| `SyntheticRLV1` | 200M actor-critic model trained on 1M human + 2M diverse self-play battles |
| `SyntheticRLV1_SelfPlay` | `SyntheticRLV1` fine-tuned on 2M extra battles against itself |
| `SyntheticRLV1_PlusPlus` | `SyntheticRLV1` fine-tuned on 2M extra battles against diverse opponents |
| `SyntheticRLV2` | Final 200M actor-critic model with value classification, trained on 1M human + 4M diverse self-play battles |
| `SmallRLGen9Beta` | Prototype 15M actor-critic model trained after the dataset was expanded to include Gen 9 OU |
| `Abra` | 57M actor-critic model trained on parsed-replays v3 (Gen 9!) and a small set of synthetic battles. First of a new series of Gen 9 OU-compatible policies trained in a similar style to the paper's "Synthetic" agents. |
Here is a reference of human evals for key models according to our paper:
> [!TIP]
> Most of these policies predate our expansion to Gen 9. They can play Gen 9 OU, but won't play it well. Gen 9 training runs are ongoing.
Showdown creates "replays" of battles that players can choose to upload to the website before they expire. We gathered all surviving historical replays for Gen 1-4 OU/NU/UU/Ubers and Gen 9 OU, and continuously save new battles to grow the dataset.
Datasets are stored on huggingface in two formats:
| Name | Size | Description |
|---|---|---|
| `metamon-raw-replays` | 2M battles | Our curated set of Pokémon Showdown replay `.json` files, kept to save the Showdown API some download requests and to maintain an official reference of our training data. Regularly updated as new battles are played and collected. |
| `metamon-parsed-replays` | 4M trajectories | The RL-compatible version of the dataset as reconstructed by the replay parser. This dataset has been significantly expanded and improved since the original paper. |
Parsed replays will download automatically when requested by the `ParsedReplayDataset`, but these datasets are large. Use `python -m metamon.data.download parsed-replays` to download them in advance.
In Showdown RL, we have to embrace a mismatch between the trajectories we observe in our own battles and those we gather from other players' replays. In short, replays are saved from the point-of-view of a spectator rather than the point-of-view of a player. The server sends info to the players that it does not save to its replay, and we need to try to simulate that missing info. Metamon goes to great lengths to handle this, and is always improving (more info here), but there is no way to be perfect.

Therefore, replay data is perhaps best viewed as pretraining data for an offline-to-online finetuning problem. Self-collected data from the online env fixes inaccuracies and helps us concentrate on the teams we'll be using on the ladder. The whole project is now set up to do this (see Quick Start).
Team sets are dirs of Showdown team files that are randomly sampled between episodes. They are stored on huggingface at `jakegrigsby/metamon-teams` and can be downloaded in advance with `python -m metamon.data.download teams`.

```python
metamon.env.get_metamon_teams(battle_format: str, set_name: str)
```
| `set_name` | Teams Per Battle Format | Description |
|---|---|---|
| `"competitive"` | Varies (< 30) | Human-made teams scraped from forum threads. These are usually official "sample teams" designed by experts for beginners, but we are less selective for non-OU tiers. This is the set used for human ladder evaluations in the paper. |
| `"paper_variety"` | 1k (Gen 1-4 only) | Procedurally generated teams with unrealistic OOD lead-off Pokémon. The paper calls this the "variety set". Movesets were generated by sampling from all-time usage stats. |
| `"paper_replays"` | 1k (Gen 1-4 OU only) | Predicted teams from replays. The paper calls this the "replay set". Surpassed by the `"modern_replays"` set below. Used the original prediction strategy of sampling from all-time usage stats. |
| `"modern_replays"` | 8k-20k (OU only) | Predicted teams based on recent replays using the best prediction strategy we have available for each generation. The result is a diverse set representing the recent metagame with blanks filled in by a mixture of historical trends. |
The HF readme has more information.
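For example, swapping the Quick Start's team set for the replay-derived one is a one-line change:

```python
from metamon.env import get_metamon_teams

# sample teams predicted from recent replays instead of the "competitive" set
team_set = get_metamon_teams("gen1ou", "modern_replays")
```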
We can also use our own directory of team files with, for example:
```python
from metamon.env import TeamSet

team_set = TeamSet("/path/to/your/team/dir", "gen3ou")  # (path, battle_format)
```
Note that the files would need to have the extension `.{battle_format}_team` (e.g., `.gen3nu_team`).
`baselines/` contains baseline opponents that we can battle against via `BattleAgainstBaseline`. `baselines/heuristics` provides more than a dozen heuristic opponents and starter code for developing new ones (or mixing ground-truth Pokémon knowledge into ML agents). `baselines/model_based` ties the simple `il` model checkpoints to `poke-env` (with CPU inference).
Here is an overview of the opponents mentioned in the paper:
```python
from metamon.baselines import get_baseline, get_all_baseline_names

opponent = get_baseline(name)  # get a specific baseline
available = get_all_baseline_names()  # list all available baselines
```
| `name` | Description |
|---|---|
| `BugCatcher` | An actively bad trainer that always picks the least damaging move. When forced to switch, picks the Pokémon in its party with the worst type matchup vs the player. |
| `RandomBaseline` | Selects a legal move (or switch) uniformly at random; measures the most basic level of learning early in training runs. |
| `Gen1BossAI` | Emulates opponents in the original Generation 1 games. Usually chooses random moves, but prefers stat-boosting moves on the second turn and "super effective" moves when available. |
| `Grunt` | A maximally offensive player that selects the move dealing the greatest damage against the current opposing Pokémon (using the damage equation and a type chart), and picks the best matchup by type when forced to switch. |
| `GymLeader` | Improves upon `Grunt` by additionally taking factors such as health into account. It prioritizes stat boosts when the current Pokémon is very healthy and heal moves when unhealthy. |
| `PokeEnvHeuristic` | The `SimpleHeuristicsPlayer` baseline provided by `poke-env`, with configurable difficulty (shortcuts like `EasyPokeEnvHeuristic`). |
| `EmeraldKaizo` | An adaptation of the AI in a Pokémon Emerald ROM hack intended to be as difficult as possible. It selects actions by scoring the available options against a rule set that includes handwritten conditional statements for a large portion of the moves in the game. |
| `BaseRNN` | A simple RNN IL policy trained on an early version of our parsed replay dataset. Runs inference on CPU. |
Compare baselines with:
```bash
python -m metamon.baselines.compete --battle_format gen2ou --player GymLeader --opponent RandomBaseline --battles 10
```
Here is a reference for the relative strength of some heuristic baselines from the paper:
Metamon tries to separate the RL from the Pokémon. All we need to do is pick an `ObservationSpace`, `ActionSpace`, and `RewardFunction`:
- The environment outputs a `UniversalState`.
- Our `ObservationSpace` maps the `UniversalState` to the input of our agent.
- Our agent outputs an action however we'd like.
- Our `ActionSpace` converts the agent's choice to a `UniversalAction`.
- The environment takes the current (`UniversalState`, `UniversalAction`) and outputs the next `UniversalState`. Our `RewardFunction` gives the agent a scalar reward.
- Repeat until victory.
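In `gymnasium` terms, that loop looks like this minimal sketch (reusing the `env`, spaces, and reward function built in the Quick Start; the conversions above all happen inside `env.step()`):

```python
obs, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # stand-in for our agent's policy
    obs, reward, terminated, truncated, info = env.step(action)
```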
`UniversalState` defines all the features we have access to at each timestep. The `ObservationSpace` packs those features into a policy input. We could create a custom version with more/fewer features by inheriting from `metamon.interface.ObservationSpace`.
| Observation Space | Description |
|---|---|
| `DefaultObservationSpace` | The text/numerical observation space used in our paper. |
| `ExpandedObservationSpace` | A slight improvement based on lessons learned from the paper. It also adds Tera types for Gen 9. |
| `TeamPreviewObservationSpace` | Further extends `ExpandedObservationSpace` with a preview of the opponent's team (for Gen 9). |
Text features have inconsistent length, but we can translate to int IDs from a list of known vocab words. The built-in observation spaces are designed such that the "tokenized" version will have fixed length.
```python
from metamon.interface import TokenizedObservationSpace, DefaultObservationSpace
from metamon.tokenizer import get_tokenizer

base_obs = DefaultObservationSpace()
tokenized_space = TokenizedObservationSpace(
    base_obs_space=base_obs,
    tokenizer=get_tokenizer("DefaultObservationSpace-v0"),
)
```
The vocabs are in `metamon/tokenizer`; they are generated by tracking unique words across the entire replay dataset, with an unknown token for rare cases we may have missed.
| Tokenizer Name | Description |
|---|---|
| `allreplays-v3` | Legacy version for pre-release models. |
| `DefaultObservationSpace-v0` | Updated post-release vocabulary as of `metamon-parsed-replays` dataset `v2`. |
| `DefaultObservationSpace-v1` | Updated vocabulary as of `metamon-parsed-replays` dataset `v3-beta` (adds ~1k words for Gen 9). |
Metamon uses a fixed `UniversalAction` space of 13 discrete choices:
- `{0, 1, 2, 3}` use the active Pokémon's moves in alphabetical order.
- `{4, 5, 6, 7, 8}` switch to the other Pokémon in the party in alphabetical order.
- `{9, 10, 11, 12}` are wildcards for generation-specific gimmicks. Currently, they only apply to Gen 9, where they pick moves (in alphabetical order) with Terastallization.
That might not be how we want to set up our agent. The `ActionSpace` converts between whatever the output of the policy might be and the `UniversalAction`.
| Action Space | Description |
|---|---|
| `DefaultActionSpace` | The standard discrete space of 13 choices; supports Gen 9. |
| `MinimalActionSpace` | The original space of 9 choices (4 moves + 5 switches), which is all we need for Gens 1-4. |
Any new action spaces would be added to `metamon.interface.ALL_ACTION_SPACES`. A text action space (for LLM agents) is on the short-term roadmap.
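For Gens 1-4, a quick sketch of swapping in the smaller space (assuming `MinimalActionSpace` is importable from `metamon.interface` like the other built-in spaces):

```python
from metamon.interface import MinimalActionSpace  # assumed import path, like DefaultActionSpace

# 9 choices: 4 moves + 5 switches, which covers everything in Gens 1-4
action_space = MinimalActionSpace()
```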
Reward functions assign a scalar reward based on consecutive states, R(s, s').
| Reward Function | Description |
|---|---|
| `DefaultShapedReward` | The shaped reward used by the paper: +/- 100 for win/loss, plus light shaping for damage dealt, health recovered, and status received/inflicted. |
| `BinaryReward` | Removes the smaller shaping terms and simply provides +/- 100 for win/loss. |
Any new reward functions would be added to `metamon.interface.ALL_REWARD_FUNCTIONS`, and we can implement a new one by inheriting from `metamon.interface.RewardFunction`.
We trained all of our main RL & IL models with `amago`. Everything you need to train your own model on metamon data and evaluate against Pokémon baselines is provided in `metamon/rl/`.
```bash
cd metamon/rl/
export METAMON_WANDB_PROJECT="my_wandb_project_name"
export METAMON_WANDB_ENTITY="my_wandb_username"
```
See `python train.py --help` for options. The training script implements offline RL on the human battle dataset and an optional extra dataset of self-play battles you may have collected.
We might retrain the "`SmallIL`" model like this:
```bash
python -m metamon.rl.train --run_name AnyNameHere --model_gin_config small_agent.gin --train_gin_config il.gin --save_dir ~/my_checkpoint_path/ --log
```
"SmallRL
" would be the same command with --train_gin_config exp_rl.gin
. Scan rl/pretrained.py
to see the configs used by each pretrained agent. Larger training runs take days to complete and can (optionally) use mulitple GPUs (link). An example of a smaller RNN config is provided in small_rnn.gin
.
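For instance, a hypothetical smaller run that swaps in the RNN config would reuse the flags from the `SmallIL` command above:

```bash
# sketch only: same training script, but with the smaller RNN architecture config
python -m metamon.rl.train --run_name MyRNNRun --model_gin_config small_rnn.gin --train_gin_config il.gin --save_dir ~/my_checkpoint_path/ --log
```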
See `python finetune_from_hf.py --help` to finetune an existing model on a new dataset, training objective, or reward function! It provides the same setup as the main `train` script but takes care of downloading and matching the config details of our public models. Finetuning will inherit the architecture of the base model but allows changes to the `--train_gin_config` and `--reward_function`. Note that the best settings for quick finetuning runs are likely different from those of the original run!
We might finetune "`SmallRL`" on the new Gen 9 replay dataset and custom battles like this:
```bash
python -m metamon.rl.finetune_from_hf --finetune_from_model SmallRL --run_name MyCustomSmallRL --save_dir ~/metamon_finetunes/ --custom_replay_dir /my/custom/parsed_replay_dataset --custom_replay_sample_weight .25 --epochs 10 --steps_per_epoch 10000 --log --formats gen9ou --eval_gens 9
```
You can start from any checkpoint number with `--finetune_from_ckpt`. See the huggingface page for a full list. Defaults to the official eval checkpoint.
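For example (the checkpoint number below is hypothetical; check the huggingface page for the ones that actually exist):

```bash
# sketch: resume finetuning from a specific (hypothetical) checkpoint instead of the eval default
python -m metamon.rl.finetune_from_hf --finetune_from_model SmallRL --finetune_from_ckpt 40 --run_name MyCustomSmallRL --save_dir ~/metamon_finetunes/
```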
Customize the agent architecture by creating new `.gin` files in `rl/configs/models/`. Customize the RL hyperparameters by creating new files in `rl/configs/training/`. Here is a link to a lot more information about configuring training runs. `amago` is modular, and you can swap just about any piece of the agent with your own ideas. Here is a link to more information about custom components.
`metamon.rl.evaluate` provides quick-setup evals (`pretrained_vs_baselines`, `pretrained_vs_local_ladder`, and `pretrained_vs_pokeagent_ladder`). Full explanations are provided in the source file.
To eval a custom agent trained from scratch (`rl.train`), we'd create a `LocalPretrainedModel`. `LocalFinetunedModel` provides some quick setup for models finetuned with `rl.finetune_from_hf`. `examples/evaluate_custom_models.py` shows an example of each and deploys them on the PokéAgent Challenge ladder!
`il/` is old toy code that does basic behavior cloning with RNNs. We used it to train early learning-based baselines (`BaseRNN`, `WinsOnlyRNN`, and `MiniRNN`) that you can play against with the `BattleAgainstBaseline` env. We may add more of these as the dataset grows/improves and more architectures are tried. Playing around with this code might be an easier way to get started, but note that the main `rl/train` script can also be configured to do RNN BC, only faster and on multiple GPUs.
Get started with something like:
```bash
cd metamon/il/
python train.py --run_name any_name_will_do --model_config configs/transformer_embedding.gin --gpu 0
```
To support the main raw-replays, parsed-replays, and teams datasets, metamon creates a few resources that may be useful for other purposes:
Showdown records the frequency of team choices (items, moves, abilities, etc.) brought to battles in a given month. The community mainly uses this data to consider rule changes, but we use it to help predict missing details of partially revealed teams. We load data for an arbitrary window of history around the date a battle was played, and fall back to all-time stats for rare Pokémon where data is limited:
```python
from datetime import date

from metamon.backend.team_prediction.usage_stats import get_usage_stats

usage_stats = get_usage_stats(
    "gen1ou",
    start_date=date(2017, 12, 1),
    end_date=date(2018, 3, 30),
)
alakazam_info: dict = usage_stats["Alakazam"]  # non-alphanumeric chars and case are flexible
```
Download usage stats in advance with:
```bash
python -m metamon.data.download usage-stats
```
The data is stored on huggingface at `jakegrigsby/metamon-usage-stats`.
One of the main problems the replay parser has to solve is predicting a player's full team based on the "partially revealed" team at the end of the battle. As part of this, we record the revealed team in the standard Showdown team builder format, but with some magic keywords for missing elements. For example:
```
Tyranitar @ Custap Berry
Ability: Sand Stream
EVs: $missing_ev$ HP / $missing_ev$ Atk / $missing_ev$ Def / $missing_ev$ SpA / $missing_ev$ SpD / $missing_ev$ Spe
$missing_nature$ Nature
IVs: 31 HP / 31 Atk / 31 Def / 31 SpA / 31 SpD / 31 Spe
- Stealth Rock
- Stone Edge
- Pursuit
- $missing_move$
```
Given the size of our replay dataset, this creates a massive set of real (but incomplete) human team choices. The files are stored alongside the parsed-replay dataset and downloaded with:
```bash
python -m metamon.data.download revealed-teams
```
`metamon/backend/team_prediction` contains tools for filling in the blanks of these files, but this is all poorly documented and changes frequently, so we'll leave it at that for now.
Originally, metamon handled reconstruction of training data from old replays but used `poke-env` to play new battles:
In an experimental new feature, we now allow all of the Pokémon logic (the "battle backend") to switch to the replay version. Pass `battle_backend="metamon"` to any of the environments. `poke-env` still handles all communication with Showdown, but this helps reduce the sim2sim gap. Because it is focused on Gens 1-4 without privileged player info, and is tested on every replay ever saved, the metamon backend is more accurate for our use case and will be significantly easier to maintain. However, the `"poke-env"` option is still faster and more stable and will only be deprecated after the PokéAgent Challenge.
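As a minimal sketch, the keyword is the only change from the Quick Start environment (the comment about the default backend name is an assumption):

```python
# same setup as the Quick Start, but with the experimental replay-parser backend
env = BattleAgainstBaseline(
    battle_format="gen1ou",
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_fn,
    team_set=team_set,
    opponent_type=get_baseline("Gen1BossAI"),
    battle_backend="metamon",  # the "poke-env" backend remains the faster, more stable option
)
```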
This project owes a huge debt to the amazing `poke-env`, as well as Pokémon resources like Bulbapedia, Smogon, and of course Pokémon Showdown.
```bibtex
@misc{grigsby2025metamon,
      title={Human-Level Competitive Pok\'emon via Scalable Offline Reinforcement Learning with Transformers},
      author={Jake Grigsby and Yuqi Xie and Justin Sasek and Steven Zheng and Yuke Zhu},
      year={2025},
      eprint={2504.04395},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.04395},
}
```