cornell-zhang/heurigym

📙 About · 📚 Problems · 🔥 Quick Start · 🚀 LLM Solver Agent · 🤝 Contribute · 📜 Citation

📘 About

HeuriGym is a benchmark for evaluating how well LLMs generate and refine heuristics for real-world combinatorial optimization (CO) tasks through agentic, code-driven interaction.

[Figure: overview of the HeuriGym framework]

🔍 Why HeuriGym?

Existing LLM benchmarks fall short:

  • 🎯 Closed-form tasks (e.g., AIME, HumanEval): Saturated, too simplistic for real-world reasoning.
  • 🤖 Subjective evaluations (e.g., Chatbot Arena): Noisy, inconsistent, and unreliable for technical tasks.

HeuriGym fills this gap with:

  • 🧩 Open-ended problems: Well-defined objectives with large solution spaces.
  • 🤖 Agentic interaction: LLMs improve heuristics through feedback-driven code execution.
  • 📏 Expert comparison metrics: Measure both pass rate and quality relative to expert solutions.

Let LLMs think, code, and improve—just like real solvers.

📚 Problems

The initial release of the HeuriGym benchmark includes nine distinct optimization problems spanning four scientific and engineering domains.

Domain     Problem                               Difficulty
EDA        Operator scheduling                   ★
EDA        Technology mapping                    ★★
EDA        Global routing                        ★★★
Compilers  E-graph extraction                    ★
Compilers  Intra-operator parallelism            ★★
CompBio    Protein sequence design               ★
CompBio    Mendelian error detection             ★★
Logistics  Airline crew pairing                  ★★
Logistics  Pickup and delivery w/ time windows   ★★★

🔥 Quick Start

  1. Clone the repository:
git clone https://github.com/cornell-zhang/heurigym.git
cd heurigym
  2. Install the required dependencies:
pip install -r requirements.txt
  3. Set up the API keys:
# You need a HuggingFace token to download the dataset.
export HUGGINGFACE_TOKEN=<your_huggingface_key_here>
# If you are using Google models, you also need a Google API key.
export GOOGLE_API_KEY=<your_google_key_here>
  4. Run the agent to solve the operator scheduling problem with Gemini 2.5 Pro:
python llm_solver_agent.py --problem operator_scheduling \
                           --models gemini-2.5-pro-preview-05-06
  5. Check the results in the llm_solutions directory.

Best results are saved in best_results.json and error analysis is saved in error_summary.json.

🚀 LLM Solver Agent

Create a .env file in the root directory with the API keys for the models you want to use:

# Required only if using models from OpenAI (e.g., o4-mini:high)
OPENAI_API_KEY=your_openai_key_here

# Required only if using models from Anthropic (e.g., claude-3-7-sonnet-20250219)
ANTHROPIC_API_KEY=your_anthropic_key_here

# Required only if using models from DeepSeek (e.g., deepseek-chat, deepseek-coder)
DEEPSEEK_API_KEY=your_deepseek_key_here

# Required only if using models from Google (e.g., gemini-2.5-flash-preview-04-17, gemini-2.5-pro-preview-05-06)
GOOGLE_API_KEY=your_google_key_here

# Required only if using models from OpenRouter (e.g., openrouter/meta-llama/llama-4-maverick)
OPENROUTER_API_KEY=your_openrouter_key_here

# Required only if using models from Alibaba (e.g., qwen3-235b-a22b)
DASHSCOPE_API_KEY=your_alibaba_key_here

You also need a HuggingFace token to download the dataset; add it to the same .env file:

HUGGINGFACE_TOKEN=your_huggingface_key_here
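
For reference, here is a minimal sketch of how such a .env file can be read in Python with the python-dotenv package. This is an assumption for illustration; llm_solver_agent.py may load the keys differently:

# Minimal sketch: loading API keys from .env with python-dotenv.
# Assumes python-dotenv is installed; the actual loading logic in
# llm_solver_agent.py may differ.
import os
from dotenv import load_dotenv

load_dotenv()  # read KEY=value pairs from .env into the process environment

# Fetch whichever keys your chosen models require.
hf_token = os.getenv("HUGGINGFACE_TOKEN")
google_key = os.getenv("GOOGLE_API_KEY")
if hf_token is None:
    raise RuntimeError("HUGGINGFACE_TOKEN is missing; add it to your .env file")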

Usage

Run the agent to solve the operator scheduling problem with Gemini 2.5 Pro:

# Requires GOOGLE_API_KEY
python llm_solver_agent.py --problem operator_scheduling \
                           --models gemini-2.5-pro-preview-05-06

Run the agent to solve the e-graph extraction problem with Claude 3.7 Sonnet:

# Requires ANTHROPIC_API_KEY
python llm_solver_agent.py --problem egraph_extraction \
                           --models claude-3-7-sonnet-20250219

Run the agent to solve the airline crew pairing problem with o4-mini:high:

# Requires OPENAI_API_KEY
python llm_solver_agent.py --problem crew_pairing \
                           --models o4-mini:high

Command Line Arguments

The agent supports the following command line arguments (a combined example follows the list):

python llm_solver_agent.py [options]

Options:

  • --models MODEL1 MODEL2 ...: List of models to use (default: all supported models)
  • --iterations N: Maximum number of iterations for each model (default: 3)
  • --problem PROBLEM_NAME: Specific problem to solve (folder name)
  • --timeout TIMEOUT: Timeout in seconds for program execution (default: 10)
  • --temperature TEMPERATURE: Temperature for LLM generation (default: 0.0)
  • --stream: Enable streaming output from LLM (default: False, but True for Qwen models)
  • --history_rounds H: Number of previous rounds to keep in conversation history (default: None, keep all history)
  • --num_cores C: Number of CPU cores to use for program execution (default: 8)
  • --few_shots S: Number of training examples to provide to LLMs (default: None, use all examples)

The agent will:

  1. Scan all directories in the workspace for README.md files
  2. Parse the problem descriptions
  3. Request solutions from configured LLMs with iterative improvement
  4. Save solutions in the llm_solutions directory
  5. Collect results, analyze all solutions, find the best results, and perform error analysis. The best results are saved in best_results.json and the error analysis in error_summary.json (an illustrative layout is sketched below).
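
As a rough, hypothetical illustration (the exact layout is produced by the agent and may differ), the llm_solutions directory could end up looking like this:

llm_solutions/
└── operator_scheduling/                  # one folder per problem (hypothetical)
    └── gemini-2.5-pro-preview-05-06/     # one folder per model
        ├── best_results.json             # best solution found across iterations
        └── error_summary.json            # categorized failures per iteration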

🤝 Contribute

We welcome contributions to the HeuriGym benchmark!

To add a new problem to the benchmark suite, you need to create a new folder in the problems directory. The folder should have two subfolders:

  • dataset: A folder for problem instances
  • program: A folder for the program template

You can copy the template folder as a starting point. There are several files you need to implement or include:

  • README.md: Problem description, formalization, and input/output format
  • solver.py: A template solver function for the LLM to fill in. Feel free to overload the solve function by copying it into your problem folder.
  • verifier.py: After the LLM provides a solution, the verifier checks whether the solution is valid. Please implement the verify function in this file.
  • evaluator.py: After the solution is verified, the evaluator calculates the cost of the solution. Please implement the evaluate function in this file (a sketch of both functions follows below).
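
Below is a hypothetical sketch of what the verify and evaluate functions could look like. The actual signatures expected by the framework are defined in the template folder, so treat these only as illustrations:

# verifier.py (hypothetical sketch; check the template folder for the real signature)
def verify(input_file: str, solution_file: str) -> tuple[bool, str]:
    """Return (True, "") if the solution satisfies all hard constraints,
    otherwise (False, <human-readable error message>)."""
    raise NotImplementedError

# evaluator.py (hypothetical sketch)
def evaluate(input_file: str, solution_file: str) -> float:
    """Return the objective cost of a verified solution (lower is better)."""
    raise NotImplementedError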

📜 Citation

@article{chen2025heurigym,
    title={HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization}, 
    author={Hongzheng Chen and Yingheng Wang and Yaohui Cai and Hins Hu and Jiajie Li and Shirley Huang and Chenhui Deng and Rongjian Liang and Shufeng Kong and Haoxing Ren and Samitha Samaranayake and Carla P. Gomes and Zhiru Zhang},
    journal={arXiv preprint arXiv:2506.07972},
    year={2025}
}
