PaperQA2 Analysis

Python 3.11+ | MIT License | arXiv:2409.13740

This package extends the popular inspect_ai framework to provide flexible MCQ evaluation capabilities for agentic RAG systems.

Overview

This project reproduces and extends the evaluation of PaperQA2's performance on the LitQA benchmark, as described in the paper "Language agents achieve superhuman synthesis of scientific knowledge". The framework provides tools for:

  • Reproducible Evaluation: Systematic testing of PaperQA2 configurations
  • Multi-Agent Integration: Custom wrapper systems for structured evaluation
  • Comprehensive Analysis: Hyperparameter studies and performance comparisons
  • Standardized Metrics: Accuracy, precision, recall, and F1-score calculations

Key Features

  • Inspect AI Integration: Plugs directly into Inspect AI for multi-agent evaluation (a minimal task sketch follows this list).
  • Custom PaperQA Agent: Wraps PaperQA2 so it can be dropped into Inspect AI evaluations with minimal setup.
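
As a rough illustration of the Inspect AI side of the integration, the sketch below assembles a plain multiple-choice task from stock Inspect AI components (the multiple_choice solver and choice scorer). It is not this package's exact wiring; the custom agents and evaluation components live under paperqa2_analysis. A single toy sample stands in for the LitQA questions.

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.solver import multiple_choice
from inspect_ai.scorer import choice

@task
def litqa_style_mcq():
    # One toy sample; a real run would load the LitQA questions instead
    dataset = [
        Sample(
            input="What is the main finding of the study?",
            choices=["Option A", "Option B", "Option C", "Option D"],
            target="A",  # letter of the correct choice
        )
    ]
    return Task(dataset=dataset, solver=multiple_choice(), scorer=choice())

Running such a task then follows the standard Inspect AI workflow (for example, via inspect eval).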

Installation

Prerequisites

  • Python 3.11 or higher
  • OpenAI API key (for GPT models, optional)
  • Google API key (for Gemini models, optional)

Install from Source

# Clone the repository
git clone https://github.com/yourusername/paperQA2_analysis.git
cd paperQA2_analysis

# Install in development mode
pip install -e .

Install Dependencies

The package will automatically install required dependencies:

  • inspect-ai: Evaluation framework
  • paper-qa>=5: PaperQA2 implementation
  • ag2[openai]: Agent framework
  • pydantic: Data validation
  • pandas: Data manipulation

🔧 Quick Start

Basic Usage

from paperqa2_analysis.agents.paperqa_agent import PaperQAAgent
from paperqa2_analysis.evaluate import evaluate_agent

# Initialize PaperQA agent
agent = PaperQAAgent(
    model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
    max_sources=5,
    evidence_k=15
)

# Evaluate on test data
results = evaluate_agent(agent, test_data)
print(f"Accuracy: {results['accuracy']:.2%}")
print(f"Precision: {results['precision']:.2%}")

Advanced Configuration

from paperqa2_analysis.agents.paperqa_gemini_embed_agent import PaperQAGeminiEmbedAgent

# Use Google's advanced embedding model
agent = PaperQAGeminiEmbedAgent(
    model="gpt-4o-mini",
    embedding_model="text-embedding-004",
    max_sources=10,
    evidence_k=20,
    use_rcs=True  # Enable re-ranking and contextual summarization
)

📊 Data Format

The framework expects data in the following format:

import pandas as pd

# Required columns
data = pd.DataFrame({
    'question': ['What is the main finding of the study?', ...],  # Question text
    'ideal': ['A', ...],                                          # Correct answer per question
    'distractors': [['B', 'C', 'D'], ...]                         # Incorrect options per question
})
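
When the options are presented to an agent, the correct answer is typically combined with its distractors and shuffled per question. A minimal sketch of that step (the build_options helper is illustrative, not part of the package):

import random

def build_options(ideal: str, distractors: list[str], seed: int = 0) -> list[str]:
    # Combine the correct answer with its distractors and shuffle deterministically
    options = [ideal, *distractors]
    random.Random(seed).shuffle(options)
    return options

data['options'] = [
    build_options(ideal, distractors)
    for ideal, distractors in zip(data['ideal'], data['distractors'])
]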

🧪 Available Agents

PaperQAAgent

Standard PaperQA2 implementation with OpenAI models.

from paperqa2_analysis.agents.paperqa_agent import PaperQAAgent

agent = PaperQAAgent(
    model="gpt-4o-mini",
    embedding_model="text-embedding-3-small"
)

PaperQAGeminiEmbedAgent

PaperQA2 with Google's advanced embedding models.

from paperqa2_analysis.agents.paperqa_gemini_embed_agent import PaperQAGeminiEmbedAgent

agent = PaperQAGeminiEmbedAgent(
    model="gpt-4o-mini",
    embedding_model="text-embedding-004"
)

BridgeAgent

Multi-agent wrapper for structured evaluation.

from paperqa2_analysis.agents.bridge_agent import BridgeAgent

agent = BridgeAgent(
    primary_agent=paperqa_agent,
    parser_agent=parser_agent
)

📈 Evaluation Metrics

The framework provides comprehensive evaluation metrics (a sketch of how they relate follows the list):

  • Accuracy: Overall correctness across all questions
  • Precision: Correctness of answered questions
  • Recall: Coverage of questions attempted
  • F1-Score: Harmonic mean of precision and recall
  • PaperQA Score: Custom metric from original paper
  • Answered Recall: Accuracy on attempted questions
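
As a rough illustration of how these quantities relate when an agent may decline to answer, here is a self-contained sketch; it mirrors the definitions above but is not the package's exact implementation (mcq_metrics and the use of None for unanswered questions are assumptions):

def mcq_metrics(predictions: list[str | None], answers: list[str]) -> dict[str, float]:
    # None marks a question the agent declined to answer
    total = len(answers)
    answered = [(p, a) for p, a in zip(predictions, answers) if p is not None]
    correct = sum(p == a for p, a in answered)

    accuracy = correct / total if total else 0.0              # correct over all questions
    precision = correct / len(answered) if answered else 0.0  # correct over answered questions
    recall = len(answered) / total if total else 0.0          # coverage of questions attempted
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# 3 of 4 questions answered, 2 of those correct
print(mcq_metrics(["A", "C", None, "B"], ["A", "B", "C", "B"]))
# accuracy 0.5, precision ~0.67, recall 0.75, f1 ~0.71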

🔬 Research Results

Our reproduction study achieved:

  • Superhuman Precision: All RAG configurations achieved >77.3% precision (human benchmark: 73.8%)
  • Best Performance: GPT-4o-Mini + text-embedding-004 achieved 89.5% precision
  • Robust Performance: Near-superhuman results even with suboptimal hyperparameters
  • Reproducibility Challenges: Identified performance fluctuations due to API load and hardware

📁 Project Structure

paperQA2_analysis/
├── paperqa2_analysis/          # Main package
│   ├── agents/                 # Agent implementations
│   ├── inspect_ai_custom/      # Custom evaluation components
│   └── evaluate.py             # Evaluation utilities
├── demo/                       # Example scripts and notebooks
├── data/                       # LitQA dataset
├── logs/                       # Evaluation logs
├── report/                     # Research paper
└── summary/                    # Executive summary

🎯 Examples

Single Question Evaluation

from demo.paperqa_single_demo import evaluate_single_question

result = evaluate_single_question(
    question="What is the main finding?",
    options=["A", "B", "C", "D"],
    correct_answer="A",
    agent=agent
)

Full Benchmark Evaluation

from demo.full_demo import run_full_evaluation

results = run_full_evaluation(
    test_data=test_df,
    agent_configs=configurations,
    num_runs=3
)

Hyperparameter Study

from demo.answer_cutoff import study_answer_cutoff

results = study_answer_cutoff(
    agent=agent,
    test_data=test_df,
    cutoff_values=[5, 10, 15]
)

🔧 Configuration

Environment Variables

export OPENAI_API_KEY="your-openai-key"
export GOOGLE_API_KEY="your-google-key"  # Optional

Model Parameters

# Retrieval parameters
max_sources = 5      # Number of sources to include in answer
evidence_k = 15      # Number of evidence chunks to retrieve

# Search parameters
top_k = 30          # Number of chunks to retrieve initially
use_rcs = True      # Enable re-ranking and contextual summarization

# Model selection
model = "gpt-4o-mini"           # LLM for reasoning
embedding_model = "text-embedding-3-small"  # Embedding model
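
These parameters map onto the agent constructors shown earlier. A sketch of wiring them together, assuming the PaperQAAgent keyword arguments from the Quick Start example (whether top_k is accepted directly by the constructor is an assumption; use_rcs appears only in the Gemini agent example above):

from paperqa2_analysis.agents.paperqa_agent import PaperQAAgent

agent = PaperQAAgent(
    model=model,                      # LLM for reasoning
    embedding_model=embedding_model,  # embedding model
    max_sources=max_sources,          # sources included in the answer
    evidence_k=evidence_k,            # evidence chunks retrieved
)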

📊 Performance Comparison

Configuration                          Accuracy   Precision   Notes
GPT-4o-Mini + text-embedding-004       82.1%      89.5%       Best
GPT-4-Turbo + text-embedding-3-small   78.9%      85.2%       Baseline
GPT-4.1 + text-embedding-3-small       71.2%      77.3%       Subpar
Human Benchmark                        73.8%      73.8%       Reference

🐛 Troubleshooting

Common Issues

  1. API Rate Limits: Run questions sequentially instead of in parallel (see the backoff sketch after this list)
  2. Timeout Errors: Increase timeout settings or reduce batch size
  3. Memory Issues: Reduce max_sources or evidence_k parameters
  4. Embedding Errors: Verify API keys and model availability
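
For the rate-limit and timeout issues above, running questions sequentially with a simple exponential backoff usually helps. A stdlib-only sketch that wraps any per-question call (with_backoff is illustrative, not a package utility; narrow the exception type to your API client's rate-limit and timeout errors in practice):

import time

def with_backoff(call, *args, retries: int = 5, base_delay: float = 2.0, **kwargs):
    # Retry a per-question call sequentially, backing off exponentially between attempts
    for attempt in range(retries):
        try:
            return call(*args, **kwargs)
        except Exception as exc:
            if attempt == retries - 1:
                raise
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc!r}); retrying in {delay:.0f}s")
            time.sleep(delay)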

Debug Mode

import logging
logging.basicConfig(level=logging.DEBUG)

# Enable detailed logging
agent = PaperQAAgent(debug=True)

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Format code
black paperqa2_analysis/

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📚 Citation

If you use this code in your research, please cite:

@article{skarlinski_language_2024,
    title={Language agents achieve superhuman synthesis of scientific knowledge},
    author={Skarlinski, Michael D. and Cox, Sam and Laurent, Jon M. and Braza, James D. and Hinks, Michaela and Hammerling, Michael J. and Ponnapati, Manvitha and Rodriques, Samuel G. and White, Andrew D.},
    journal={arXiv preprint arXiv:2409.13740},
    year={2024}
}

📞 Contact

  • Author: Phong-Anh Nguyen Trinh
  • Email: pan31@cam.ac.uk
  • Institution: Department of Physics, University of Cambridge

🙏 Acknowledgments

  • Original PaperQA2 authors for the foundational work
  • Inspect AI team for the evaluation framework
  • OpenAI and Google for providing API access

Note: This is a research reproduction project. Results may vary depending on API availability, model updates, and system configurations.
