This repository contains code that performs the precision experiment of Raha on data lakes with no ground truth. Specifically, Raha needed to be adapted to allow efficient experiment execution by grouping all user labeling into one step.
- Create an anaconda environment using `environment.yml`
- Install the adapted Raha:

  ```
  cd raha
  pip install -e .
  cd ..
  ```
There are multiple steps required to run the different experiments. The Raha precision experiment is executed in the following steps:
- Configure the configs in `raha_experiment/hydra_configs`
- Set up the experimental data:
  - Prepare a data lake (for example the WDC data lake) in a folder (every dataset needs to be in CSV format)
  - Sample datasets from the data lake with `python raha_experiment/pre-processing/sample_data_lake.py`
  - Create the folder structure necessary for Raha with `python raha_experiment/pre-processing/setup_experiments.py`
- Run `python raha_experiment/start_experiments.py`
- After the script has finished, the sampled tuples need to be labeled. This can be done by running `python raha_experiment/label_experiments.py`. This requires user input!
  - Entering the same value means the cell is correct; any other value marks it as an error.
  - Depending on the labeling budget, this part can take a lot of time. Using `raha.start` and `raha.end` in the file `raha_experiment/hydra_configs/base.yaml`, the work can be split into different sessions.
  - If needed, the execution can be stopped with `Ctrl+C`. Remember to set `raha.start` in the file `raha_experiment/hydra_configs/base.yaml` before resuming.
- The last step uses the labels to execute the prediction part of Raha. This can be done by running `python raha_experiment/finish_experiments.py`. `raha.start` and `raha.end` in the file `raha_experiment/hydra_configs/base.yaml` also control the execution here.
- The results of Raha can then be further processed with:
  - `python raha_experiment/post-processing/retrieve_detected_errors.py`: collects all detected errors into one CSV
  - `python raha_experiment/post-processing/sample_errors.py`: samples a number of errors from the CSV that contains all errors
  - `python raha_experiment/post-processing/retrieve_labels.py`: collects all user labels so that others can see what has been labeled (to ensure consistency)
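The session splitting via `raha.start` and `raha.end` could look as follows in `raha_experiment/hydra_configs/base.yaml`. This is only a hypothetical sketch: the key names come from the text above, but the surrounding structure of the actual file may differ.

```yaml
raha:
  # index of the first experiment to label in this session
  start: 0
  # index after the last experiment to label in this session;
  # resume a later session by setting start to the previous end
  end: 50
```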
The configs are configured for the experiment run in the paper. Only the config at `raha_experiment/hydra_configs/preprocessing.yaml` needs to be changed, as it is configured for the test_lake. Changes:

- `sampling.maximum_columns`: `20` -> `10`
- `sampling.amount_of_datasets`: `1` -> `100`
- `sampling.unidetect_training_corpus`: `null` -> a pickled list of paths pointing to datasets in the lake that have been used for training UniDetect or should in general be excluded
- `sampling.datalake_path`: `./test_lake` -> a path pointing towards the WDC lake
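The value for `sampling.unidetect_training_corpus` is a path to a pickled Python list of dataset paths. A minimal sketch for producing such a file; the file name and the dataset paths are placeholders:

```python
import pickle

# Paths of datasets in the lake that were used for training UniDetect,
# or that should otherwise be excluded from sampling (placeholders).
excluded_datasets = [
    "wdc_lake/table_001.csv",
    "wdc_lake/table_002.csv",
]

# Write the list as a pickle; point sampling.unidetect_training_corpus
# in preprocessing.yaml at this file.
with open("unidetect_training_corpus.pkl", "wb") as f:
    pickle.dump(excluded_datasets, f)

# Sanity check: the file round-trips back to the same list.
with open("unidetect_training_corpus.pkl", "rb") as f:
    assert pickle.load(f) == excluded_datasets
```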