This repository contains code and data for the paper "TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering."
TableEval is the first cross-language tabular question-answering benchmark supporting Simplified Chinese, Traditional Chinese, and English. It features:
- Real-World Domains: Financial Disclosures, Academic Papers, Administrative Records, and Industry Reports.
- Comprehensive Data:
- 617 carefully inspected Excel spreadsheets with diverse structures including hierarchical headers, nested cells, and merged layouts.
- 2,325 QA pairs across 6 major tasks & 16 fine-grained sub-tasks, assessing various capabilities (e.g., information retrieval, reasoning, data analysis, multi-turn conversations).
We also introduce SEAT (Structured Evaluation for Answers in TableQA), a novel evaluation framework that:
- Provides fine-grained evaluation at the sub-question level.
- Leverages LLMs to extract final answers from model responses, comparing them with reference answers one by one and clearly visualizing correctness.
- Uses F1-score as the evaluation metric and achieves high consistency with human judgments.
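As a concrete illustration of the scoring, the sketch below computes a sub-question-level F1 from the counts an LLM judge might return after matching extracted answer items against reference items. It is an assumption-laden example, not the repository's actual run_evaluation.py logic.

```python
# Illustrative sketch only: F1 over answer items matched by an LLM judge.
# The real SEAT aggregation in run_evaluation.py may differ.

def item_f1(num_matched: int, num_predicted: int, num_reference: int) -> float:
    """F1 between extracted answer items and reference answer items."""
    if num_predicted == 0 or num_reference == 0:
        return 0.0
    precision = num_matched / num_predicted
    recall = num_matched / num_reference
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the judge matched 3 of 4 extracted items against 5 reference items.
print(round(item_f1(3, 4, 5), 4))  # 0.6667
```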
[2025-06-05]: We released the benchmark and the code! Please feel free to open an issue or contact us with any questions.
Models | Avg | Information Retrieval | Numerical Analysis | Reasoning | Data Analysis | Multi-turn Conversation | Table Structure Understanding |
---|---|---|---|---|---|---|---|
o1-preview | 83.43 | 88.30 | 87.08 | 82.88 | 77.89 | 83.38 | 81.03 |
claude-3-5-sonnet-20241022 | 83.32 | 89.62 | 91.06 | 85.76 | 84.01 | 87.94 | 61.51 |
deepseek-r1 | 82.46 | 90.15 | 88.56 | 87.91 | 77.79 | 78.29 | 72.05 |
gpt-4o-2024-11-20 | 78.79 | 88.24 | 86.00 | 83.05 | 81.47 | 83.20 | 50.79 |
QwQ-32B-Preview | 78.14 | 89.33 | 85.75 | 81.37 | 71.69 | 82.15 | 58.53 |
deepseek-chat | 77.95 | 91.20 | 82.61 | 81.72 | 77.45 | 85.83 | 48.89 |
Qwen2.5-32B-Instruct | 75.50 | 86.32 | 84.10 | 76.09 | 77.60 | 82.25 | 46.61 |
Qwen2.5-72B-Instruct | 74.23 | 82.68 | 81.53 | 74.85 | 78.94 | 81.90 | 45.50 |
qwen-max-2024-09-19 | 73.34 | 84.42 | 81.35 | 72.64 | 78.09 | 80.18 | 43.35 |
DeepSeek-V2.5-1210 | 73.27 | 87.41 | 79.10 | 71.49 | 77.97 | 78.72 | 44.94 |
Llama-3.3-70B-Instruct | 72.94 | 87.42 | 76.70 | 73.38 | 81.27 | 80.62 | 38.24 |
Qwen2.5-Coder-32B-Instruct | 70.75 | 79.82 | 77.00 | 73.03 | 76.33 | 74.89 | 43.44 |
Qwen2.5-14B-Instruct | 70.02 | 84.72 | 78.93 | 68.65 | 75.06 | 75.05 | 37.72 |
gpt-4o-mini-2024-07-18 | 68.47 | 82.64 | 76.15 | 73.13 | 70.70 | 73.66 | 34.56 |
Qwen2.5-7B-Instruct | 59.60 | 69.23 | 64.29 | 59.38 | 69.71 | 68.67 | 26.35 |
glm-4-9b-chat | 53.61 | 66.19 | 51.09 | 55.09 | 62.47 | 64.36 | 22.44 |
Llama-3.1-8B-Instruct | 49.26 | 67.40 | 53.35 | 48.82 | 57.06 | 53.15 | 15.76 |
DeepSeek-Coder-V2-Lite-Instruct | 48.30 | 60.40 | 56.39 | 50.03 | 51.51 | 50.62 | 20.83 |
DeepSeek-V2-Lite-Chat | 36.75 | 48.52 | 35.43 | 35.97 | 51.80 | 41.61 | 7.15 |
(Updated: 2025-03-06)
Step 1: Clone this repository
git clone https://github.com/wenge-research/TableEval.git
cd TableEval
Step 2: Create a virtual environment (optional) and install dependencies:
conda create -n tableeval python=3.11
conda activate tableeval
pip install -r requirements.txt
This section outlines how to configure API keys, generate model responses, and run evaluations. It is designed to help you get started with TableEval quickly.
Step 1: Configure API Keys
Currently, we support OpenAI-compatible API servers. Please create or update the config/api_config.yaml file with your API settings. Keep your API keys secure and do not commit them to public repositories. Below is an example configuration:
gpt-4o-2024-11-20:
  model_name: gpt-4o-2024-11-20
  api_key: YOUR_OPENAI_API_KEY
  base_url: YOUR_OPENAI_API_URL
  temperature: 0.0
  max_tokens: 8192
  top_p: 1.0
  seed: 33
  timeout: 600       # Request timeout in seconds
  max_retries: 3     # Maximum number of retries if a request fails

YOUR_CUSTOM_API_MODEL:
  model_name: YOUR_CUSTOM_API_MODEL_NAME
  api_key: YOUR_CUSTOM_API_KEY
  base_url: YOUR_CUSTOM_API_URL
  temperature: 0.6
  max_tokens: 8192
  top_p: 0.95
  seed: 33
  timeout: 600
  max_retries: 3
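For reference, the fields above follow the OpenAI Chat Completions convention. The rough sketch below shows how such a config entry translates into a request; it is only an illustration, not the repository's client wrapper in openai_client.py.

```python
# Illustrative only: how the config fields map to an OpenAI-compatible call.
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY", base_url="YOUR_OPENAI_API_URL")
response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.0,
    max_tokens=8192,
    top_p=1.0,
    seed=33,
    timeout=600,
)
print(response.choices[0].message.content)
```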
If you are using open-source LLMs, we recommend vLLM or SGLang, both of which provide an HTTP server compatible with OpenAI's Chat API.
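For example, with a recent vLLM release you could expose a local model through an OpenAI-compatible endpoint (the model name and port here are placeholders) and then set base_url in config/api_config.yaml to http://localhost:8000/v1:

vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000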
Step 2: Generate Model Responses
Run the following command to generate model responses using the specified model:
python run_prediction.py --model_name gpt-4o-mini
This command will call the model to generate responses for the questions and store the results in the outputs directory.
Step 3: Run Evaluation
After generating responses, execute the following command to run the evaluation:
python run_evaluation.py --llm_judge_name gpt-4o-2024-11-20 --prediction_file_path ./outputs/20250604_212820/gpt-4o-mini_cot_markdown.json
- --llm_judge_name: Specifies the judge model name used for evaluation.
- --prediction_file_path: Path to the file containing prediction results. Both relative and absolute paths are supported.
Upon successful execution, you can review the evaluation results in the specified output directory.
🔧 Advanced Parameter Guide
run_prediction.py and run_evaluation.py support more configurable parameters, enabling users to adjust the prediction and evaluation settings.
To view the full list of parameters of run_prediction.py, use:
python run_prediction.py --help
Parameter details:
Parameter | Type | Default | Description |
---|---|---|---|
--model_name | str | "gpt-4o-mini" | Name of the model to use for inference |
--test_data_filepath | str | "./data/TableEval-test.jsonl" | File path containing the evaluation dataset in JSONL format |
--config_file | str | "./config/api_config.yaml" | YAML configuration file for API authentication parameters |
--prompt_file | str | "./config/prompts.yaml" | YAML file containing prompt templates for evaluation tasks |
--output_directory | str | "./outputs" | Directory path to store prediction and evaluation results |
--max_workers | int | 5 | Maximum number of parallel worker processes |
--context_type | str | "markdown" | Table formatting syntax for model input (markdown, html, latex) |
--specific_tasks | list[str] | None | Filter evaluation tasks by name (default None = all tasks), e.g. 信息查询 (Information Retrieval) |
--specific_ids | list[str] | None | Filter dataset samples by ID (default None = all IDs), e.g. 1 2 |
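For instance, several of these options can be combined in one run (the context type and sample IDs below are only illustrative values):

python run_prediction.py --model_name gpt-4o-mini --context_type html --specific_ids 1 2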
To view the full list of parameters of run_evaluation.py, use:
python run_evaluation.py --help
Parameter details:
Parameter | Type | Default | Description |
---|---|---|---|
--llm_judge_name | str | "gpt-4o-2024-11-20" | Name of the LLM judge to use for evaluation |
--prediction_file_path | str | | File path containing prediction results in JSON format |
--model_name | str | "gpt-4o-mini" | Name of the model to use for inference |
--config_file | str | "./config/api_config.yaml" | YAML configuration file for API authentication parameters |
--prompt_file | str | "./config/prompts.yaml" | YAML file containing prompt templates for evaluation tasks |
--max_workers | int | 5 | Maximum number of parallel worker processes |
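For example, the evaluation from Step 3 can be rerun with more parallel workers (the prediction path is the same illustrative one used above):

python run_evaluation.py --llm_judge_name gpt-4o-2024-11-20 --prediction_file_path ./outputs/20250604_212820/gpt-4o-mini_cot_markdown.json --max_workers 10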
TableEval-main/
├── assets/                      # Static resources (images, diagrams, etc.)
├── config/                      # Configuration files
│   ├── api_config.yaml          # Step 1: Set up API keys and other configs
│   ├── prompts.yaml             # LLM prompt templates
│   └── logging.yaml             # Logging format and handlers
├── data/                        # Dataset storage
│   ├── tables/                  # Tables in Excel format
│   ├── TableEval-meta.jsonl     # Table metadata (context, source, size, etc.)
│   └── TableEval-test.jsonl     # Evaluation dataset with ground truths
├── outputs/                     # Output directory
│   ├── evaluation/              # Evaluation results
│   ├── logs/                    # Log files
│   ├── prediction/              # Model prediction outputs
│   └── scores/                  # Final evaluation scores
├── openai_client.py             # OpenAI API client wrapper
├── README.md
├── requirements.txt             # Python dependencies
├── run_evaluation.py            # Step 3: LLM evaluation & metric calculation
├── run_prediction.py            # Step 2: Generate model predictions
└── utils.py                     # Helper functions
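If you want to inspect the data directly, TableEval-test.jsonl is plain JSONL (one JSON object per line). The snippet below simply prints the first record's field names, since they are not documented here:

```python
import json

# Print the keys of the first record in the JSONL test set.
with open("./data/TableEval-test.jsonl", encoding="utf-8") as f:
    first_record = json.loads(next(f))
print(sorted(first_record.keys()))
```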
If you find TableEval useful, please consider citing our paper:
@misc{zhu2025tableevalrealworldbenchmarkcomplex,
title={TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering},
author={Junnan Zhu and Jingyi Wang and Bohan Yu and Xiaoyu Wu and Junbo Li and Lei Wang and Nan Xu},
year={2025},
eprint={2506.03949},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.03949},
}