  TableEval

This repository contains code and data for the paper "TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering."


🔍 Overview

TableEval is the first cross-language tabular question-answering benchmark supporting Simplified Chinese, Traditional Chinese, and English. It features:

  • Real-World Domains: Financial Disclosures, Academic Papers, Administrative Records, and Industry Reports.
  • Comprehensive Data:
    • 617 carefully inspected Excel spreadsheets with diverse structures, including hierarchical headers, nested cells, and merged layouts (see the loading sketch after this list).
    • 2,325 QA pairs across 6 major tasks and 16 fine-grained sub-tasks, assessing various capabilities (e.g., information retrieval, reasoning, data analysis, multi-turn conversations).
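
To get a feel for the table formats, here is a minimal loading sketch using pandas. It assumes pandas and openpyxl are installed; the file name below is a placeholder, not an actual file in data/tables/:

# Minimal sketch: loading a TableEval spreadsheet with pandas.
# Assumes `pip install pandas openpyxl`; "example.xlsx" is a placeholder name.
import pandas as pd

# Many TableEval tables use hierarchical (multi-row) headers, so we read
# the first two rows as a MultiIndex header. Adjust `header` per table.
df = pd.read_excel("data/tables/example.xlsx", header=[0, 1])

print(df.columns)  # MultiIndex reflecting the nested header structure
print(df.head())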

We also introduce SEAT (Structured Evaluation for Answers in TableQA), a novel evaluation framework that:

  • Provides fine-grained evaluation at the sub-question level.
  • Leverages LLMs to extract final answers from model responses, comparing them with reference answers one by one and clearly visualizing correctness.
  • Uses the F1-score as the evaluation metric and achieves high consistency with human judgments (see the scoring sketch after this list).
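
As an illustration only (not the repository's actual implementation, which lives in run_evaluation.py and utils.py), a sub-question-level F1 over extracted answers might look like this:

# Illustrative sketch of sub-question-level F1 scoring.
# This is NOT the repository's implementation; function and variable
# names here are hypothetical.
from collections import Counter

def f1_score(predicted: list[str], reference: list[str]) -> float:
    """F1 between a model's extracted sub-answers and the references."""
    pred_counts = Counter(a.strip().lower() for a in predicted)
    ref_counts = Counter(a.strip().lower() for a in reference)
    overlap = sum((pred_counts & ref_counts).values())  # per-answer matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(f1_score(["42", "Beijing"], ["42", "Shanghai"]))  # 0.5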

🔥 Latest News

  • [2025-06-05]: We released the benchmark and the code! Feel free to open an issue or contact us with any questions.

🏆 Leaderboard

| Models | Avg | Information Retrieval | Numerical Analysis | Reasoning | Data Analysis | Multi-turn Conversation | Table Structure Understanding |
| --- | --- | --- | --- | --- | --- | --- | --- |
| o1-preview | 83.43 | 88.30 | 87.08 | 82.88 | 77.89 | 83.38 | 81.03 |
| claude-3-5-sonnet-20241022 | 83.32 | 89.62 | 91.06 | 85.76 | 84.01 | 87.94 | 61.51 |
| deepseek-r1 | 82.46 | 90.15 | 88.56 | 87.91 | 77.79 | 78.29 | 72.05 |
| gpt-4o-2024-11-20 | 78.79 | 88.24 | 86.00 | 83.05 | 81.47 | 83.20 | 50.79 |
| QwQ-32B-Preview | 78.14 | 89.33 | 85.75 | 81.37 | 71.69 | 82.15 | 58.53 |
| deepseek-chat | 77.95 | 91.20 | 82.61 | 81.72 | 77.45 | 85.83 | 48.89 |
| Qwen2.5-32B-Instruct | 75.50 | 86.32 | 84.10 | 76.09 | 77.60 | 82.25 | 46.61 |
| Qwen2.5-72B-Instruct | 74.23 | 82.68 | 81.53 | 74.85 | 78.94 | 81.90 | 45.50 |
| qwen-max-2024-09-19 | 73.34 | 84.42 | 81.35 | 72.64 | 78.09 | 80.18 | 43.35 |
| DeepSeek-V2.5-1210 | 73.27 | 87.41 | 79.10 | 71.49 | 77.97 | 78.72 | 44.94 |
| Llama-3.3-70B-Instruct | 72.94 | 87.42 | 76.70 | 73.38 | 81.27 | 80.62 | 38.24 |
| Qwen2.5-Coder-32B-Instruct | 70.75 | 79.82 | 77.00 | 73.03 | 76.33 | 74.89 | 43.44 |
| Qwen2.5-14B-Instruct | 70.02 | 84.72 | 78.93 | 68.65 | 75.06 | 75.05 | 37.72 |
| gpt-4o-mini-2024-07-18 | 68.47 | 82.64 | 76.15 | 73.13 | 70.70 | 73.66 | 34.56 |
| Qwen2.5-7B-Instruct | 59.60 | 69.23 | 64.29 | 59.38 | 69.71 | 68.67 | 26.35 |
| glm-4-9b-chat | 53.61 | 66.19 | 51.09 | 55.09 | 62.47 | 64.36 | 22.44 |
| Llama-3.1-8B-Instruct | 49.26 | 67.40 | 53.35 | 48.82 | 57.06 | 53.15 | 15.76 |
| DeepSeek-Coder-V2-Lite-Instruct | 48.30 | 60.40 | 56.39 | 50.03 | 51.51 | 50.62 | 20.83 |
| DeepSeek-V2-Lite-Chat | 36.75 | 48.52 | 35.43 | 35.97 | 51.80 | 41.61 | 7.15 |

(Updated: 2025-03-06)

🛠️ Installation & Setup

Step 1: Clone this repository

git clone https://github.com/wenge-research/TableEval.git
cd TableEval

Step 2: Create a virtual environment (optional) and install dependencies:

conda create -n tableeval python=3.11
conda activate tableeval
pip install -r requirements.txt

🤖 Evaluate

This section outlines how to configure API keys, generate model responses, and run evaluations. It is designed to help you get started with TableEval quickly.

Step 1. Configure API keys

We currently support OpenAI-compatible API servers. Please create or update the config/api_config.yaml file with your API settings, and keep your API keys secure and out of public repositories. Below is an example configuration:

gpt-4o-2024-11-20:
  model_name: gpt-4o-2024-11-20
  api_key: YOUR_OPENAI_API_KEY
  base_url: YOUR_OPENAI_API_URL
  temperature: 0.0
  max_tokens: 8192
  top_p: 1.0
  seed: 33
  timeout: 600    # Request timeout in seconds
  max_retries: 3  # Maximum number of retries if a request fails

YOUR_CUSTOM_API_MODEL:
  model_name: YOUR_CUSTOM_API_MODEL_NAME  
  api_key: YOUR_CUSTOM_API_KEY
  base_url: YOUR_CUSTOM_API_URL
  temperature: 0.6
  max_tokens: 8192
  top_p: 0.95
  seed: 33
  timeout: 600   
  max_retries: 3 
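
For reference, here is a minimal sketch of calling an endpoint configured this way with the official openai Python package. This is not the repository's own wrapper (see openai_client.py); the model name and prompt are illustrative:

# Minimal sketch of calling an OpenAI-compatible endpoint, mirroring the
# fields in config/api_config.yaml. Assumes `pip install openai`.
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY", base_url="YOUR_OPENAI_API_URL")

response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    messages=[{"role": "user", "content": "What is in row 3, column 2?"}],
    temperature=0.0,
    max_tokens=8192,
    top_p=1.0,
    seed=33,
    timeout=600,  # request timeout in seconds
)
print(response.choices[0].message.content)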

If you are using open-source LLMs, we recommend vLLM and sglang, which provide HTTP servers compatible with OpenAI's Chat APIs.
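
As a sketch, assuming vLLM's OpenAI-compatible server defaults (the model name below is illustrative), you could serve a local model like this:

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct --port 8000

and point a matching api_config.yaml entry at it:

Qwen2.5-7B-Instruct:
  model_name: Qwen/Qwen2.5-7B-Instruct
  api_key: EMPTY            # vLLM does not require a real key by default
  base_url: http://localhost:8000/v1
  temperature: 0.0
  max_tokens: 8192
  top_p: 1.0
  seed: 33
  timeout: 600
  max_retries: 3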

Step 2: Generate Model Responses

Run the following command to generate model responses using the specified model:

python run_prediction.py --model_name gpt-4o-mini

This command will call the model to generate responses for the questions and store the results in the outputs directory.

Step 3: Run Evaluation

After generating responses, execute the following command to run the evaluation:

python run_evaluation.py --llm_judge_name gpt-4o-2024-11-20 --prediction_file_path ./outputs/20250604_212820/gpt-4o-mini_cot_markdown.json

  • --llm_judge_name: Specifies the judge model used for evaluation.
  • --prediction_file_path: Path to the file containing prediction results. Both relative and absolute paths are supported.

Upon successful execution, you can review the evaluation results in the specified output directory.

🔧 Advanced Parameter Guide

Both run_prediction.py and run_evaluation.py support additional configurable parameters, letting you adjust the prediction and evaluation settings.

To view the full list of parameters for run_prediction.py, use:

python run_prediction.py --help

Parameter details:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --model_name | str | "gpt-4o-mini" | Name of the model to use for inference |
| --test_data_filepath | str | "./data/TableEval-test.jsonl" | File path of the evaluation dataset in JSONL format |
| --config_file | str | "./config/api_config.yaml" | YAML configuration file for API authentication parameters |
| --prompt_file | str | "./config/prompts.yaml" | YAML file containing prompt templates for evaluation tasks |
| --output_directory | str | "./outputs" | Directory path to store prediction and evaluation results |
| --max_workers | int | 5 | Maximum number of parallel worker processes |
| --context_type | str | "markdown" | Table formatting syntax for model input (markdown, html, latex) |
| --specific_tasks | list[str] | None | Filter evaluation tasks by name (default None = all tasks), e.g. 信息查询 ("Information Retrieval") |
| --specific_ids | list[str] | None | Filter dataset samples by ID (default None = all IDs), e.g. 1 2 |
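
For example, to generate predictions for only the information retrieval task (信息查询) with HTML-formatted tables and more parallel workers:

python run_prediction.py --model_name gpt-4o-mini --context_type html --specific_tasks 信息查询 --max_workers 10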

To view the full list of parameters for run_evaluation.py, use:

python run_evaluation.py --help

Parameter details:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --llm_judge_name | str | "gpt-4o-2024-11-20" | Name of the LLM judge to use for evaluation |
| --prediction_file_path | str |  | File path of the prediction results in JSON format |
| --model_name | str | "gpt-4o-mini" | Name of the model to use for inference |
| --config_file | str | "./config/api_config.yaml" | YAML configuration file for API authentication parameters |
| --prompt_file | str | "./config/prompts.yaml" | YAML file containing prompt templates for evaluation tasks |
| --max_workers | int | 5 | Maximum number of parallel worker processes |
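
For example, to run the evaluation with more parallel workers (the prediction file path is the one produced in Step 2):

python run_evaluation.py --llm_judge_name gpt-4o-2024-11-20 --prediction_file_path ./outputs/20250604_212820/gpt-4o-mini_cot_markdown.json --max_workers 10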

📁 Repository Structure

TableEval-main/
├── assets/                  # Static resources (images, diagrams, etc.)
├── config/                  # Configuration files
│   ├── api_config.yaml      # Step 1: Set up API keys and other configs
│   ├── prompts.yaml         # LLM prompt templates
│   └── logging.yaml         # Logging format and handlers
├── data/                    # Dataset storage
│   ├── tables/              # Tables in Excel format
│   ├── TableEval-meta.jsonl # Table metadata (context, source, size, etc.)
│   └── TableEval-test.jsonl # Evaluation dataset with ground truths
├── outputs/                 # Output directory
│   ├── evaluation/          # Evaluation results
│   ├── logs/                # Log files
│   ├── prediction/          # Model prediction outputs
│   └── scores/              # Final evaluation scores
├── openai_client.py         # OpenAI API client wrapper
├── README.md                
├── requirements.txt         # Python dependencies
├── run_evaluation.py        # Step 3: LLM evaluation & metric calculation
├── run_prediction.py        # Step 2: Generate model predictions
└── utils.py                 # Helper functions                     
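
If you want to inspect the dataset before running anything, here is a minimal sketch; it assumes only that data/TableEval-test.jsonl is JSON Lines, as documented above, and makes no assumptions about field names:

# Minimal sketch: inspect the first record of the JSONL evaluation dataset.
import json

with open("data/TableEval-test.jsonl", encoding="utf-8") as f:
    first = json.loads(next(f))

print(sorted(first.keys()))  # discover the record schema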

📚 Citation

If you find TableEval useful, please consider citing our paper:

@misc{zhu2025tableevalrealworldbenchmarkcomplex,
      title={TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering}, 
      author={Junnan Zhu and Jingyi Wang and Bohan Yu and Xiaoyu Wu and Junbo Li and Lei Wang and Nan Xu},
      year={2025},
      eprint={2506.03949},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.03949}, 
}
