esracosgun

This pull request implements a deduplication pipeline for string tuples using distributed representations and locality-sensitive hashing (LSH), as required for the assignment. It covers the following steps:

  • Distributed Representation: Each tuple is mapped to a dense vector by averaging GloVe embeddings.
  • LSH-based Blocking: Similar tuples are grouped into candidate buckets using random-hyperplane LSH.
  • Similarity Computation: Candidate pairs are compared using cosine similarity or Euclidean distance.
  • Duplicate Filtering: Candidate pairs whose similarity exceeds a threshold are treated as duplicates; only the first occurrence is retained as unique.

Outputs:

  • Y_unique: Deduplicated tuples (first occurrence only).
  • Y_duplicates: All detected duplicates removed from the input.

@github-project-automation github-project-automation bot moved this to In Progress in SystemDS PR Queue Jul 14, 2025
@esracosgun esracosgun changed the title AMLS Exercise: Builtin for tuples deduplication [SYSTEMDS-3178] AMLS Exercise: Builtin for tuples deduplication Jul 14, 2025
@esracosgun esracosgun changed the title [SYSTEMDS-3178] AMLS Exercise: Builtin for tuples deduplication [SYSTEMDS-3178] Builtin for tuples deduplication (AMLS Exercise) Jul 14, 2025
@esracosgun esracosgun changed the title [SYSTEMDS-3178] Builtin for tuples deduplication (AMLS Exercise) [SYSTEMDS-3178] Builtin for tuples deduplication Jul 16, 2025