This repository contains code and data for the paper "TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering."
TableEval is the first cross-language tabular question-answering benchmark supporting Simplified Chinese, Traditional Chinese, and English. It features:
- Real-World Domains: Financial Disclosures, Academic Papers, Administrative Records, and Industry Reports.
- Comprehensive Data:
- 617 carefully inspected Excel spreadsheets with diverse structures including hierarchical headers, nested cells, and merged layouts.
- 2,325 QA pairs across 6 major tasks & 16 fine-grained sub-tasks, assessing various capabilities (e.g., information retrieval, reasoning, data analysis, multi-turn conversations).
We also introduce SEAT (Structured Evaluation for Answers in TableQA), a novel evaluation framework that:
- Provides fine-grained evaluation at the sub-question level.
- Leverages LLMs to extract final answers from model responses, comparing them with reference answers one by one and clearly visualizing correctness.
- Uses F1-score as the evaluation metric and achieves high consistency with human judgments.
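As a concrete illustration of the scoring, the sketch below computes a sub-question-level F1 from the counts an LLM judge might return after matching extracted answer items against reference items. It is an assumption-laden example, not the repository's actual run_evaluation.py logic.

```python
# Illustrative sketch only: F1 over answer items matched by an LLM judge.
# The real SEAT aggregation in run_evaluation.py may differ.

def item_f1(num_matched: int, num_predicted: int, num_reference: int) -> float:
    """F1 between extracted answer items and reference answer items."""
    if num_predicted == 0 or num_reference == 0:
        return 0.0
    precision = num_matched / num_predicted
    recall = num_matched / num_reference
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the judge matched 3 of 4 extracted items against 5 reference items.
print(round(item_f1(3, 4, 5), 4))  # 0.6667
```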
[2025-06-05]: We released the benchmark and the code! Please feel free to open an issue or contact us with any questions.
Models | Avg | Information Retrieval | Numerical Analysis | Reasoning | Data Analysis | Multi-turn Conversation | Table Structure Understanding |
---|---|---|---|---|---|---|---|
o1-preview | 83.43 | 88.30 | 87.08 | 82.88 | 77.89 | 83.38 | 81.03 |
claude-3-5-sonnet-20241022 | 83.32 | 89.62 | 91.06 | 85.76 | 84.01 | 87.94 | 61.51 |
deepseek-r1 | 82.46 | 90.15 | 88.56 | 87.91 | 77.79 | 78.29 | 72.05 |
gpt-4o-2024-11-20 | 78.79 | 88.24 | 86.00 | 83.05 | 81.47 | 83.20 | 50.79 |
QwQ-32B-Preview | 78.14 | 89.33 | 85.75 | 81.37 | 71.69 | 82.15 | 58.53 |
deepseek-chat | 77.95 | 91.20 | 82.61 | 81.72 | 77.45 | 85.83 | 48.89 |
Qwen2.5-32B-Instruct | 75.50 | 86.32 | 84.10 | 76.09 | 77.60 | 82.25 | 46.61 |
Qwen2.5-72B-Instruct | 74.23 | 82.68 | 81.53 | 74.85 | 78.94 | 81.90 | 45.50 |
qwen-max-2024-09-19 | 73.34 | 84.42 | 81.35 | 72.64 | 78.09 | 80.18 | 43.35 |
DeepSeek-V2.5-1210 | 73.27 | 87.41 | 79.10 | 71.49 | 77.97 | 78.72 | 44.94 |
Llama-3.3-70B-Instruct | 72.94 | 87.42 | 76.70 | 73.38 | 81.27 | 80.62 | 38.24 |
Qwen2.5-Coder-32B-Instruct | 70.75 | 79.82 | 77.00 | 73.03 | 76.33 | 74.89 | 43.44 |
Qwen2.5-14B-Instruct | 70.02 | 84.72 | 78.93 | 68.65 | 75.06 | 75.05 | 37.72 |
gpt-4o-mini-2024-07-18 | 68.47 | 82.64 | 76.15 | 73.13 | 70.70 | 73.66 | 34.56 |
Qwen2.5-7B-Instruct | 59.60 | 69.23 | 64.29 | 59.38 | 69.71 | 68.67 | 26.35 |
glm-4-9b-chat | 53.61 | 66.19 | 51.09 | 55.09 | 62.47 | 64.36 | 22.44 |
Llama-3.1-8B-Instruct | 49.26 | 67.40 | 53.35 | 48.82 | 57.06 | 53.15 | 15.76 |
DeepSeek-Coder-V2-Lite-Instruct | 48.30 | 60.40 | 56.39 | 50.03 | 51.51 | 50.62 | 20.83 |
DeepSeek-V2-Lite-Chat | 36.75 | 48.52 | 35.43 | 35.97 | 51.80 | 41.61 | 7.15 |
(Updated: 2025-03-06)
Step 1: Clone this repository
git clone https://github.com/wenge-research/TableEval.git
cd TableEval
Step 2: Create a virtual environment (optional) and install dependencies:
conda create -n tableeval python=3.11
conda activate tableeval
pip install -r requirements.txt
This section outlines how to configure API keys, generate model responses, and run evaluations. It is designed to help you get started with TableEval quickly.
Step 1: Configure API Keys
Currently, we support OpenAI-compatible API servers. Please create or update the config/api_config.yaml file with your API settings. Keep your API keys secure and do not commit them to public repositories. Below is an example configuration:
gpt-4o-2024-11-20:
  model_name: gpt-4o-2024-11-20
  api_key: YOUR_OPENAI_API_KEY
  base_url: YOUR_OPENAI_API_URL
  temperature: 0.0
  max_tokens: 8192
  top_p: 1.0
  seed: 33
  timeout: 600       # Request timeout in seconds
  max_retries: 3     # Maximum number of retries if a request fails

YOUR_CUSTOM_API_MODEL:
  model_name: YOUR_CUSTOM_API_MODEL_NAME
  api_key: YOUR_CUSTOM_API_KEY
  base_url: YOUR_CUSTOM_API_URL
  temperature: 0.6
  max_tokens: 8192
  top_p: 0.95
  seed: 33
  timeout: 600
  max_retries: 3
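For reference, the fields above follow the OpenAI Chat Completions convention. The rough sketch below shows how such a config entry translates into a request; it is only an illustration, not the repository's client wrapper in openai_client.py.

```python
# Illustrative only: how the config fields map to an OpenAI-compatible call.
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY", base_url="YOUR_OPENAI_API_URL")
response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.0,
    max_tokens=8192,
    top_p=1.0,
    seed=33,
    timeout=600,
)
print(response.choices[0].message.content)
```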
If you are using open-source LLMs, we recommend vLLM or SGLang, both of which provide an HTTP server compatible with OpenAI's Chat API.
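For example, with a recent vLLM release you could expose a local model through an OpenAI-compatible endpoint (the model name and port here are placeholders) and then set base_url in config/api_config.yaml to http://localhost:8000/v1:

vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000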
Step 2: Generate Model Responses
Run the following command to generate model responses using the specified model:
python run_prediction.py --model_name gpt-4o-mini
This command will call the model to generate responses for the questions and store the results in the outputs directory.
Step 3: Run Evaluation
After generating responses, execute the following command to run the evaluation:
python run_evaluation.py --llm_judge_name gpt-4o-2024-11-20 --prediction_file_path ./outputs/20250604_212820/gpt-4o-mini_cot_markdown.json
- --llm_judge_name: Specifies the judge model name used for evaluation.
- --prediction_file_path: Path to the file containing prediction results. Both relative and absolute paths are supported.
Upon successful execution, you can review the evaluation results in the specified output directory.
🔧 Advanced Parameter Guide
run_prediction.py and run_evaluation.py support more configurable parameters, enabling users to adjust the prediction and evaluation settings.
To view the full list of parameters of run_prediction.py, use:
python run_prediction.py --help
Parameter details:
Parameter | Type | Default | Description |
---|---|---|---|
--model_name | str | "gpt-4o-mini" | Name of the model to use for inference |
--test_data_filepath | str | "./data/TableEval-test.jsonl" | File path containing the evaluation dataset in JSONL format |
--config_file | str | "./config/api_config.yaml" | YAML configuration file for API authentication parameters |
--prompt_file | str | "./config/prompts.yaml" | YAML file containing prompt templates for evaluation tasks |
--output_directory | str | "./outputs" | Directory path to store prediction and evaluation results |
--max_workers | int | 5 | Maximum number of parallel worker processes |
--context_type | str | "markdown" | Table formatting syntax for model input (markdown, html, latex) |
--specific_tasks | list[str] | None | Filter evaluation tasks by name (default None = all tasks), e.g. 信息查询 (Information Retrieval) |
--specific_ids | list[str] | None | Filter dataset samples by ID (default None = all IDs), e.g. 1 2 |
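For instance, several of these options can be combined in one run (the context type and sample IDs below are only illustrative values):

python run_prediction.py --model_name gpt-4o-mini --context_type html --specific_ids 1 2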
To view the full list of parameters of run_evaluation.py, use:
python run_evaluation.py --help
Parameter details:
Parameter | Type | Default | Description |
---|---|---|---|
--llm_judge_name | str | "gpt-4o-2024-11-20" | Name of the LLM judge to use for evaluation |
--prediction_file_path | str | | File path containing prediction results in JSON format |
--model_name | str | "gpt-4o-mini" | Name of the model to use for inference |
--config_file | str | "./config/api_config.yaml" | YAML configuration file for API authentication parameters |
--prompt_file | str | "./config/prompts.yaml" | YAML file containing prompt templates for evaluation tasks |
--max_workers | int | 5 | Maximum number of parallel worker processes |
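For example, the evaluation from Step 3 can be rerun with more parallel workers (the prediction path is the same illustrative one used above):

python run_evaluation.py --llm_judge_name gpt-4o-2024-11-20 --prediction_file_path ./outputs/20250604_212820/gpt-4o-mini_cot_markdown.json --max_workers 10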
TableEval-main/
├── assets/                      # Static resources (images, diagrams, etc.)
├── config/                      # Configuration files
│   ├── api_config.yaml          # Step 1: Set up API keys and other configs
│   ├── prompts.yaml             # LLM prompt templates
│   └── logging.yaml             # Logging format and handlers
├── data/                        # Dataset storage
│   ├── tables/                  # Tables in Excel format
│   ├── TableEval-meta.jsonl     # Table metadata (context, source, size, etc.)
│   └── TableEval-test.jsonl     # Evaluation dataset with ground truths
├── outputs/                     # Output directory
│   ├── evaluation/              # Evaluation results
│   ├── logs/                    # Log files
│   ├── prediction/              # Model prediction outputs
│   └── scores/                  # Final evaluation scores
├── openai_client.py             # OpenAI API client wrapper
├── README.md
├── requirements.txt             # Python dependencies
├── run_evaluation.py            # Step 3: LLM evaluation & metric calculation
├── run_prediction.py            # Step 2: Generate model predictions
└── utils.py                     # Helper functions
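If you want to inspect the data directly, TableEval-test.jsonl is plain JSONL (one JSON object per line). The snippet below simply prints the first record's field names, since they are not documented here:

```python
import json

# Print the keys of the first record in the JSONL test set.
with open("./data/TableEval-test.jsonl", encoding="utf-8") as f:
    first_record = json.loads(next(f))
print(sorted(first_record.keys()))
```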
If you find TableEval useful, please consider citing our paper:
@misc{zhu2025tableevalrealworldbenchmarkcomplex,
title={TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering},
author={Junnan Zhu and Jingyi Wang and Bohan Yu and Xiaoyu Wu and Junbo Li and Lei Wang and Nan Xu},
year={2025},
eprint={2506.03949},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.03949},
}