This repository contains the code of our data discovery system MaCo as well as the corresponding demonstration.
For the WHO and Movie datasets, we demonstrate how the pipeline can be used to enrich a given input dataset with features from a data lake.
We provide the demonstration notebook on Google Colab. Running the code cells installs MaCo in the current session, after which it can be used directly.
Please note: The duplicate tables graph cannot be displayed within Google Colab. Please use a local Jupyter notebook instance for the full demo.
Please follow these steps to use the demonstration notebook on your local machine:
- Install Anaconda.
- Create a new environment with Python 3.9 and activate it:

  ```
  conda create -n MaCo python=3.9
  conda activate MaCo
  ```
- Clone the GitHub repository:

  ```
  git clone https://github.com/LUH-DBS/datalake_indexes.git
  cd datalake_indexes
  ```
- Install the requirements:

  ```
  pip install -r requirements.txt
  ```
- Run a Jupyter notebook instance:

  ```
  jupyter notebook
  ```
- Open `datalakes_indexes_demo.ipynb`.
Please note: The code has been tested on macOS only.
You can also install and use MaCo in your own projects by following these steps:
- Install Anaconda.
- Create a new environment with Python 3.9 and activate it:

  ```
  conda create -n MaCo python=3.9
  conda activate MaCo
  ```
- Clone the GitHub repository:

  ```
  git clone https://github.com/LUH-DBS/datalake_indexes.git
  cd datalake_indexes
  ```
- Install MaCo:

  ```
  pip install .
  ```
- Use MATE and COCOA in your own code, as in the walkthrough below:

```python
import pandas as pd
import psycopg2

from maco.data_handler import DataHandler
from maco.cocoa import COCOA
from maco.mate import MATE
from maco.duplicate_detection import DuplicateDetection
from maco.util import get_cleaned_text

# Connect to the PostgreSQL instance that stores the index tables.
# psycopg2.connect takes keyword arguments (or a DSN string), not a dict;
# add dbname, host, and port as needed for your setup.
conn = psycopg2.connect(
    user="user",
    password="password"
)
# Create a DataHandler and pass the DB relation names
data_handler = DataHandler(
    conn,
    main_table="MaCo_main_tokenized",
    column_headers_table="MaCo_column_headers",
    table_info_table="MaCo_table_info",
    cocoa_index_table="MaCo_cocoa_index"
)
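
# Optional: create a tiny example data lake first, so the walkthrough can
# run end to end (hypothetical sample data; any folder of CSV, Parquet,
# or JSON files works).
import os

os.makedirs("my_data_lake", exist_ok=True)
pd.DataFrame({
    "country": ["germany", "france"],
    "population": [83000000, 67000000],
}).to_csv("my_data_lake/countries.csv", index=False)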
# Index the data lake
data_handler.add_tables_folder("my_data_lake") # add folder containing csv/parquet/json files
data_handler.update_index()
# Read and prepare the input dataset
input_dataset = pd.read_csv("input_dataset.csv")
input_dataset = input_dataset.applymap(get_cleaned_text)  # tokenize cells so they match the tokenized index
# Joinability discovery using MATE
top_joinable_tables = MATE(data_handler).join_search(
input_dataset,
["query_column_A", "query_column_B"],
10, # number of top table-column combinations to return
k_c=5000 # number of candidate table-column combinations to fetch
)
# Use the result
for joinability_score, table_id, columns, join_map in top_joinable_tables:
    print(f"table {table_id}: columns {columns}, score {joinability_score}")
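
# As a quick check, preview the best candidate table
# (this assumes top_joinable_tables is sorted by joinability score).
_, best_table_id, _, _ = top_joinable_tables[0]
print(data_handler.get_table(best_table_id).head())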
# Duplicate detection using XASH index
dup = DuplicateDetection(data_handler)
duplicate_tables = [] # stores all duplicate tables for the joinable tables
for _, table_id, _, _ in top_joinable_tables:
    table = data_handler.get_table(table_id)
    duplicate_tables += dup.get_duplicate_tables(table)
duplicate_relations = dup.get_relations(duplicate_tables) # relations within duplicates
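
# The duplicate tables graph mentioned above can be rendered, for example,
# with networkx (a sketch; it assumes duplicate_relations is an iterable
# of (table_id, table_id) pairs and that networkx/matplotlib are installed).
import networkx as nx
import matplotlib.pyplot as plt

duplicate_graph = nx.Graph()
duplicate_graph.add_edges_from(duplicate_relations)
nx.draw(duplicate_graph, with_labels=True)
plt.show()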
# Correlation calculation using COCOA
top_correlating_columns = COCOA(data_handler).enrich_multicolumn(
    input_dataset,
    top_joinable_tables,
    10,              # number of top correlating columns to return
    "target_column"  # input column to correlate against
)
# Use the result
for corr_coeff, table_col_id, is_numeric in top_correlating_columns:
    print(f"column {table_col_id}: correlation {corr_coeff}, numeric: {is_numeric}")
```
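
For example, to keep only strongly correlating numeric features (the 0.5 threshold is an arbitrary choice for illustration):

```python
strong_features = [
    (corr_coeff, table_col_id)
    for corr_coeff, table_col_id, is_numeric in top_correlating_columns
    if is_numeric and abs(corr_coeff) >= 0.5
]
```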