This repository contains code that performs the precision experiment of Raha on data lakes with no ground truth. Specifically, Raha needed to be adapted to allow efficient experiment execution by grouping all user labeling into one step.
- Create an anaconda environment using `environment.yml`
- Install the adapted Raha:

  ```
  cd raha
  pip install -e .
  cd ..
  ```
There are multiple steps required to run the different experiments. The Raha precision experiment is executed in the following steps:
- Configure the configs in `raha_experiment/hydra_configs`
- Set up the experimental data:
  - Prepare a data lake (for example the WDC data lake) in a folder (every dataset needs to be in CSV format)
  - Sample datasets from the data lake with `python raha_experiment/pre-processing/sample_data_lake.py`
  - Create the folder structure necessary for Raha with `python raha_experiment/pre-processing/setup_experiments.py`
- Run `python raha_experiment/start_experiments.py`
- After the script has finished, the sampled tuples need to be labeled. This can be done by running `python raha_experiment/label_experiments.py`. This requires user input!
  - Entering the same value means the cell is correct; any other value marks it as an error.
  - Depending on the labeling budget, this part can take a lot of time. Using `raha.start` and `raha.end` in the file `raha_experiment/hydra_configs/base.yaml`, the work can be split into different sessions.
  - If needed, the execution can be stopped with `Ctrl+C`. Remember to set `raha.start` in the file `raha_experiment/hydra_configs/base.yaml` before resuming.
- The last step uses the labels to execute the prediction part of Raha. This can be done by running `python raha_experiment/finish_experiments.py`. `raha.start` and `raha.end` in the file `raha_experiment/hydra_configs/base.yaml` also control the execution here.
- The results of Raha can then be further processed with:
  - `python raha_experiment/post-processing/retrieve_detected_errors.py`: collects all detected errors into one CSV
  - `python raha_experiment/post-processing/sample_errors.py`: samples a number of errors from the CSV that contains all errors
  - `python raha_experiment/post-processing/retrieve_labels.py`: collects all user labels so that others can see what has been labeled (to ensure consistency)
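The session splitting via `raha.start` and `raha.end` could look as follows in `raha_experiment/hydra_configs/base.yaml`. This is only a hypothetical sketch: the key names come from the text above, but the surrounding structure of the actual file may differ.

```yaml
raha:
  # index of the first experiment to label in this session
  start: 0
  # index after the last experiment to label in this session;
  # resume a later session by setting start to the previous end
  end: 50
```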
The configs are configured for the experiment run in the paper. Only the config at `raha_experiment/hydra_configs/preprocessing.yaml` needs to be changed, as it is configured for the test_lake. Changes:

- `sampling.maximum_columns`: `20` -> `10`
- `sampling.amount_of_datasets`: `1` -> `100`
- `sampling.unidetect_training_corpus`: `null` -> a pickled list of paths pointing to datasets in the lake that have been used for training UniDetect or should in general be excluded
- `sampling.datalake_path`: `./test_lake` -> a path pointing towards the WDC lake
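The value for `sampling.unidetect_training_corpus` is a path to a pickled Python list of dataset paths. A minimal sketch for producing such a file; the file name and the dataset paths are placeholders:

```python
import pickle

# Paths of datasets in the lake that were used for training UniDetect,
# or that should otherwise be excluded from sampling (placeholders).
excluded_datasets = [
    "wdc_lake/table_001.csv",
    "wdc_lake/table_002.csv",
]

# Write the list as a pickle; point sampling.unidetect_training_corpus
# in preprocessing.yaml at this file.
with open("unidetect_training_corpus.pkl", "wb") as f:
    pickle.dump(excluded_datasets, f)

# Sanity check: the file round-trips back to the same list.
with open("unidetect_training_corpus.pkl", "rb") as f:
    assert pickle.load(f) == excluded_datasets
```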