This package extends the popular inspect_ai
framework to provide flexible MCQ evaluation capabilities for agentic RAG systems.
This project reproduces and extends the evaluation of PaperQA2's performance on the LitQA benchmark, as described in the paper "Language agents achieve superhuman synthesis of scientific knowledge". The framework provides tools for:
- Reproducible Evaluation: Systematic testing of PaperQA2 configurations
- Multi-Agent Integration: Custom wrapper systems for structured evaluation
- Comprehensive Analysis: Hyperparameter studies and performance comparisons
- Standardized Metrics: Accuracy, precision, recall, and F1-score calculations
- Inspect AI Integration: Straightforward integration with Inspect AI for multi-agent evaluation (see the sketch after this list)
- Custom PaperQA Agent: A PaperQA2 wrapper designed for easy use and integration with Inspect AI
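For orientation, a minimal Inspect AI multiple-choice task is sketched below. It uses only stock inspect_ai components (Sample, multiple_choice, choice) on a single hypothetical question, assuming a recent inspect_ai release; the package's custom agents and bridge components plug into tasks like this via paperqa2_analysis.inspect_ai_custom, whose exact wiring is not shown here.

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice

@task
def litqa_style_mcq():
    # One illustrative MCQ sample; real runs would load the LitQA data instead.
    return Task(
        dataset=[
            Sample(
                input="What is the main finding of the study?",
                choices=["Finding A", "Finding B", "Finding C", "Finding D"],
                target="A",
            )
        ],
        solver=multiple_choice(),
        scorer=choice(),
    )

# Run with: from inspect_ai import eval; eval(litqa_style_mcq(), model="openai/gpt-4o-mini")
```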
- Python 3.11 or higher
- OpenAI API key (needed for GPT models)
- Google API key (needed for Gemini embedding models)
# Clone the repository
git clone https://github.com/yourusername/paperQA2_analysis.git
cd paperQA2_analysis
# Install in development mode
pip install -e .
The package will automatically install required dependencies:
- inspect-ai: Evaluation framework
- paper-qa>=5: PaperQA2 implementation
- ag2[openai]: Agent framework
- pydantic: Data validation
- pandas: Data manipulation
from paperqa2_analysis.agents.paperqa_agent import PaperQAAgent
from paperqa2_analysis.evaluate import evaluate_agent
# Initialize PaperQA agent
agent = PaperQAAgent(
    model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
    max_sources=5,
    evidence_k=15
)
# Evaluate on test data
results = evaluate_agent(agent, test_data)
print(f"Accuracy: {results['accuracy']:.2%}")
print(f"Precision: {results['precision']:.2%}")
from paperqa2_analysis.agents.paperqa_gemini_embed_agent import PaperQAGeminiEmbedAgent
# Use Google's advanced embedding model
agent = PaperQAGeminiEmbedAgent(
    model="gpt-4o-mini",
    embedding_model="text-embedding-004",
    max_sources=10,
    evidence_k=20,
    use_rcs=True  # Enable re-ranking and contextual summarization
)
The framework expects data in the following format:
import pandas as pd
# Required columns
data = pd.DataFrame({
    'question': ['What is the main finding of the study?', ...],
    'ideal': ['A', 'B', 'C', 'D'],                           # Correct answers
    'distractors': [['B', 'C', 'D'], ['A', 'C', 'D'], ...]   # Incorrect options
})
Standard PaperQA2 implementation with OpenAI models.
from paperqa2_analysis.agents.paperqa_agent import PaperQAAgent
agent = PaperQAAgent(
    model="gpt-4o-mini",
    embedding_model="text-embedding-3-small"
)
PaperQA2 with Google's advanced embedding models.
from paperqa2_analysis.agents.paperqa_gemini_embed_agent import PaperQAGeminiEmbedAgent
agent = PaperQAGeminiEmbedAgent(
    model="gpt-4o-mini",
    embedding_model="text-embedding-004"
)
Multi-agent wrapper for structured evaluation.
from paperqa2_analysis.agents.bridge_agent import BridgeAgent
agent = BridgeAgent(
    primary_agent=paperqa_agent,
    parser_agent=parser_agent
)
The framework provides comprehensive evaluation metrics (a computation sketch for the standard ones follows this list):
- Accuracy: Overall correctness across all questions
- Precision: Correctness of answered questions
- Recall: Coverage of questions attempted
- F1-Score: Harmonic mean of precision and recall
- PaperQA Score: Custom metric from original paper
- Answered Recall: Accuracy on attempted questions
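As a rough illustration of how the first four metrics relate, the sketch below computes them from lists of predicted and correct options, treating a designated "unsure" label as an unanswered question. The function name, the unsure label, and the exact conventions are assumptions made for illustration; the package's evaluate_agent may differ in its details.

```python
from typing import Sequence

UNSURE = "Insufficient information"  # assumed label for declined answers

def mcq_metrics(predictions: Sequence[str], targets: Sequence[str]) -> dict:
    """Illustrative metric computation for MCQ evaluation with abstentions."""
    total = len(targets)
    answered = [(p, t) for p, t in zip(predictions, targets) if p != UNSURE]
    correct = sum(p == t for p, t in answered)
    accuracy = correct / total if total else 0.0               # correct over all questions
    precision = correct / len(answered) if answered else 0.0   # correct over answered questions
    recall = len(answered) / total if total else 0.0           # fraction of questions attempted
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```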
Our reproduction study achieved:
- Superhuman Precision: All RAG configurations achieved at least 77.3% precision (human benchmark: 73.8%)
- Best Performance: GPT-4o-Mini + text-embedding-004 achieved 89.5% precision
- Robust Performance: Near-superhuman results even with suboptimal hyperparameters
- Reproducibility Challenges: Identified performance fluctuations due to API load and hardware
paperQA2_analysis/
├── paperqa2_analysis/       # Main package
│   ├── agents/              # Agent implementations
│   ├── inspect_ai_custom/   # Custom evaluation components
│   └── evaluate.py          # Evaluation utilities
├── demo/                    # Example scripts and notebooks
├── data/                    # LitQA dataset
├── logs/                    # Evaluation logs
├── report/                  # Research paper
└── summary/                 # Executive summary
from demo.paperqa_single_demo import evaluate_single_question
result = evaluate_single_question(
    question="What is the main finding?",
    options=["A", "B", "C", "D"],
    correct_answer="A",
    agent=agent
)
from demo.full_demo import run_full_evaluation
results = run_full_evaluation(
    test_data=test_df,
    agent_configs=configurations,
    num_runs=3
)
from demo.answer_cutoff import study_answer_cutoff
results = study_answer_cutoff(
    agent=agent,
    test_data=test_df,
    cutoff_values=[5, 10, 15]
)
export OPENAI_API_KEY="your-openai-key"
export GOOGLE_API_KEY="your-google-key" # Optional
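If you want to confirm the keys are visible to Python before starting a run, a quick check (not part of the package) looks like this:

```python
import os

# OPENAI_API_KEY is needed for the GPT-based examples; GOOGLE_API_KEY only for Gemini embeddings.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY before running an evaluation")
if not os.environ.get("GOOGLE_API_KEY"):
    print("GOOGLE_API_KEY not set; Gemini embedding agents will be unavailable")
```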
# Retrieval parameters
max_sources = 5 # Number of sources to include in answer
evidence_k = 15 # Number of evidence chunks to retrieve
# Search parameters
top_k = 30 # Number of chunks to retrieve initially
use_rcs = True # Enable re-ranking and contextual summarization
# Model selection
model = "gpt-4o-mini" # LLM for reasoning
embedding_model = "text-embedding-3-small" # Embedding model
| Configuration | Accuracy | Precision | Notes |
|---|---|---|---|
| GPT-4o-Mini + text-embedding-004 | 82.1% | 89.5% | Best |
| GPT-4-Turbo + text-embedding-3-small | 78.9% | 85.2% | Baseline |
| GPT-4.1 + text-embedding-3-small | 71.2% | 77.3% | Subpar |
| Human Benchmark | 73.8% | 73.8% | Reference |
- API Rate Limits: Use sequential execution instead of parallel runs (see the retry sketch after this list)
- Timeout Errors: Increase timeout settings or reduce batch size
- Memory Issues: Reduce max_sources or evidence_k parameters
- Embedding Errors: Verify API keys and model availability
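For the rate-limit and timeout items above, a simple retry with exponential backoff around a whole evaluation run is often enough. The sketch below wraps evaluate_agent from the quick-start example; the attempt count, delays, and broad exception handling are illustrative choices rather than package features.

```python
import time

from paperqa2_analysis.evaluate import evaluate_agent

def evaluate_with_retries(agent, test_data, max_attempts=3, base_delay=10.0):
    """Retry a full evaluation run with exponential backoff between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return evaluate_agent(agent, test_data)
        except Exception as exc:  # ideally narrow this to the provider's rate-limit / timeout errors
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc!r}); retrying in {delay:.0f}s")
            time.sleep(delay)
```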
import logging

# Enable detailed logging output
logging.basicConfig(level=logging.DEBUG)

# Turn on the agent's own debug mode
agent = PaperQAAgent(debug=True)
We welcome contributions! Please see our Contributing Guidelines for details.
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/
# Format code
black paperqa2_analysis/
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this code in your research, please cite:
@article{skarlinski_language_2024,
  title={Language agents achieve superhuman synthesis of scientific knowledge},
  author={Skarlinski, Michael D. and Cox, Sam and Laurent, Jon M. and Braza, James D. and Hinks, Michaela and Hammerling, Michael J. and Ponnapati, Manvitha and Rodriques, Samuel G. and White, Andrew D.},
  journal={arXiv preprint arXiv:2409.13740},
  year={2024}
}
- Author: Phong-Anh Nguyen Trinh
- Email: pan31@cam.ac.uk
- Institution: Department of Physics, University of Cambridge
- Original PaperQA2 authors for the foundational work
- Inspect AI team for the evaluation framework
- OpenAI and Google for providing API access
Note: This is a research reproduction project. Results may vary depending on API availability, model updates, and system configurations.