Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA
[📄 arXiv](https://arxiv.org/abs/2505.21115) • [🌐 GitHub](https://github.com/s-nlp/EverGreen-classification) • 🤗 HuggingFace
This repository contains the implementation of a multilingual text classification system that categorizes questions based on their temporal mutability.
Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions -- whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves SoTA performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o’s retrieval behavior.
This project implements a classifier that determines whether questions have answers that are:
- Evergreen: Answers that almost never change (e.g., "What was Cain's brother's name?")
- Mutable: Answers that typically change over several years or less (e.g., "What breed of dog is considered the smallest in the world?")
The system supports classification across 7 languages: English, Russian, French, German, Hebrew, Arabic, and Chinese.
- Multilingual Support: Train and evaluate on 7 different languages
- Multiple Model Support: Compatible with various transformer models (mDeBERTa, E5, mBERT)
- Per-Language Metrics: Detailed F1 scores for each language
- Efficient Generation: Uses vLLM for fast inference with multiple LLMs
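
Once trained, a checkpoint can be used like any HuggingFace sequence-classification model. The snippet below is a minimal inference sketch: the model id `s-nlp/EG-E5` is a placeholder (check the HuggingFace collection linked below for the actual checkpoint names), and the label mapping is assumed to follow the `is_evergreen` convention (1 = evergreen).

```python
# Minimal inference sketch using the transformers pipeline API.
# NOTE: the model id below is a placeholder -- substitute the actual
# checkpoint name from the HuggingFace collection.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="s-nlp/EG-E5",  # hypothetical model id
)

questions = [
    "What was Cain's brother's name?",                             # expected: evergreen
    "What breed of dog is considered the smallest in the world?",  # expected: mutable
]

for question, prediction in zip(questions, classifier(questions)):
    # prediction looks like {"label": "LABEL_1", "score": 0.97};
    # with is_evergreen labels, LABEL_1 corresponds to evergreen.
    print(f"{question} -> {prediction['label']} ({prediction['score']:.3f})")
```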
```bash
# Clone the repository
git clone https://github.com/s-nlp/EverGreen-classification.git
cd EverGreen-classification

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
```
evergreen-classification/
├── README.md
├── requirements.txt
├── config/
│   └── config.yaml          # Configuration file
├── src/
│   ├── train.py             # Training script
│   ├── generate.py          # Generation/evaluation script
│   └── utils/
│       └── data_utils.py
├── datasets/
│   └── aliases/
├── models/                  # Saved models directory
├── results/                 # Training results
└── docs/
    └── paper.pdf            # Research paper
```
The expected CSV format for training data:
- `is_evergreen`: binary classification label (0 or 1)
- Language columns: `Russian`, `English`, `French`, `German`, `Hebrew`, `Arabic`, `Chinese`
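
For illustration, a compatible file can be assembled with pandas. The following is a minimal sketch in which the non-English values are placeholders rather than real translations, and the assumption is that each language column holds the question text in that language:

```python
import pandas as pd

# One binary label column plus one question column per language.
# The non-English values below are placeholders, not real translations.
rows = [
    {
        "is_evergreen": 1,
        "English": "What was Cain's brother's name?",
        "Russian": "<Russian translation>",
        "French": "<French translation>",
        "German": "<German translation>",
        "Hebrew": "<Hebrew translation>",
        "Arabic": "<Arabic translation>",
        "Chinese": "<Chinese translation>",
    },
]

pd.DataFrame(rows).to_csv("data/datasets/train.csv", index=False)
```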
The datasets are also available on HuggingFace.
```bash
# Basic training
python src/train.py \
    --model_name "intfloat/multilingual-e5-large-instruct" \
    --output_dir "./results/multilingual-e5-large" \
    --num_epochs 8 \
    --batch_size 16

# Training with custom configuration
python src/train.py --config config/config.yaml
```
```bash
# Generate translations using OpenAI
python src/generate.py \
    --mode translate \
    --input_file "data/datasets/train.csv" \
    --api_key "your-openai-api-key"

# Generate classifications using vLLM
python src/generate.py \
    --mode classify \
    --model "meta-llama/Llama-3.1-8B-Instruct" \
    --input_file "data/datasets/train.csv" \
    --output_file "data/datasets/output.csv"
```
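
Under the hood, the classify mode builds on vLLM's offline generation API. The following is a rough sketch of that flow, not the exact code in `generate.py`; the prompt wording and output parsing are assumptions:

```python
# Sketch of vLLM-based classification (approximation of generate.py's
# classify mode; prompt wording and label parsing are assumptions).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=8)  # deterministic, short output

prompt_template = (
    "Does the answer to the following question change over time? "
    "Reply with 'evergreen' or 'mutable'.\nQuestion: {question}\nAnswer:"
)

questions = ["What was Cain's brother's name?"]
outputs = llm.generate([prompt_template.format(question=q) for q in questions], params)

for q, out in zip(questions, outputs):
    label = out.outputs[0].text.strip().lower()
    print(f"{q} -> {label}")
```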
| Model | Overall F1 | English | Russian | French | German | Hebrew | Arabic | Chinese |
|---|---|---|---|---|---|---|---|---|
| multilingual-e5-large-instruct | 0.910 | 0.913 | 0.909 | 0.910 | 0.904 | 0.900 | 0.897 | 0.906 |
| multilingual-e5-small | 0.821 | 0.822 | 0.819 | 0.815 | 0.804 | 0.807 | 0.817 | 0.815 |
| mdeberta-v3-base | 0.836 | 0.842 | 0.845 | 0.841 | 0.832 | 0.825 | 0.831 | 0.836 |
| bert-base-multilingual-cased | 0.893 | 0.900 | 0.889 | 0.884 | 0.889 | 0.883 | 0.902 | 0.891 |
The fine-tuned multilingual-e5-small and multilingual-e5-large-instruct models are available in our HuggingFace Collections.
- Learning Rate: 4.676e-05
- Batch Size: 16
- Epochs: 8
- Warmup Steps: 500
- Weight Decay: 0.01
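
For reference, these defaults map directly onto HuggingFace `TrainingArguments`. The sketch below uses real argument names from the transformers library, but how `train.py` wires them internally is an assumption:

```python
from transformers import TrainingArguments

# Default hyperparameters from above, expressed as HuggingFace
# TrainingArguments (the exact wiring inside train.py may differ).
training_args = TrainingArguments(
    output_dir="./results/multilingual-e5-large",
    num_train_epochs=8,
    per_device_train_batch_size=16,
    learning_rate=4.676e-5,
    warmup_steps=500,
    weight_decay=0.01,
)
```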
Create a `config/config.yaml` file:
```yaml
model:
  name: "intfloat/multilingual-e5-large-instruct"
  max_length: 64

training:
  epochs: 8
  batch_size: 16
  learning_rate: 4.676e-05
  warmup_steps: 500
  weight_decay: 0.01

data:
  train_path: "/datasets/train.csv"
  test_path: "/datasets/test.csv"
  languages: ["Russian", "English", "French", "German", "Hebrew", "Arabic", "Chinese"]
```
We use the OpenAI API for additional validation and synthetic data generation, so you need to export an OpenAI API key to use it:
```bash
export OPENAI_API_KEY="your-api-key-here"
```
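
For the translate mode specifically, `generate.py` presumably calls the OpenAI chat completions API along these lines. This is a hedged sketch: the model choice and prompt wording are assumptions, not the repository's exact code:

```python
from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment by default.
client = OpenAI()

def translate(question: str, target_language: str) -> str:
    # Model choice and prompt wording are illustrative assumptions.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user",
             "content": f"Translate this question into {target_language}: {question}"},
        ],
    )
    return response.choices[0].message.content

print(translate("What was Cain's brother's name?", "French"))
```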
If you use this code in your research, please cite:
```bibtex
@misc{pletenev2025truetomorrowmultilingualevergreen,
      title={Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA},
      author={Sergey Pletenev and Maria Marina and Nikolay Ivanov and Daria Galimzianova and Nikita Krayko and Mikhail Salnikov and Vasily Konovalov and Alexander Panchenko and Viktor Moskvoretskii},
      year={2025},
      eprint={2505.21115},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.21115},
}
```
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.