Semantic Cache

An interactive CLI program that combines SQL++ keyword search with vector similarity search, using Couchbase Capella and LangChain, to retrieve contextually relevant results from the hotel data in Couchbase's travel-sample dataset.

Setup Instructions

Prerequisites

Python 3.10.x or higher
Provision a Couchbase Capella free tier cluster
Ensure travel-sample dataset bucket is installed (provided by free tier)

Setup Steps

Clone the repository and cd into its directory
Create and activate a virtual environment: python -m venv venv and source venv/bin/activate
Install dependencies: python -m pip install -r src/requirements.txt
Update the environment variables in .env.sample and rename the file to .env
Create indexes in Couchbase

Vector index: travel_inventory_hotel_hugging_face_vector_index
FTS index: travel_inventory_hotel_fts_index

Run the program: python src/main.py
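
Before running the program, it can help to confirm that the values from .env were actually picked up. The sketch below is only an illustration: the variable names (CB_CONNECTION_STRING, CB_USERNAME, CB_PASSWORD) are hypothetical placeholders, so substitute whatever keys .env.sample defines.

```python
import os

# Hypothetical names; replace with the keys listed in .env.sample.
REQUIRED_VARS = ("CB_CONNECTION_STRING", "CB_USERNAME", "CB_PASSWORD")


def missing_settings(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not str(env.get(name, "")).strip()]


# Example: report anything missing from the current process environment.
missing = missing_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present.")
```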

Embedding & Schema

Embedding Model: all-mpnet-base-v2

Roughly 32M downloads per month on Hugging Face suggest it is widely adopted for sentence embedding
Produces 768-dimensional embeddings that capture the semantic meaning of the text
Strikes a balance between performance and accuracy
The Hugging Face documentation explicitly describes it as a good choice for semantic search
Alternative considered: all-MiniLM-L6-v2, which is faster but appears less accurate in comparison
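
To make the size side of this tradeoff concrete, the sketch below estimates raw vector storage for the two models. It assumes each dimension is held as a 4-byte float32, which describes an in-memory array and not how Couchbase stores JSON arrays. The 384-dimension figure for all-MiniLM-L6-v2 comes from that model's published description, not from this project.

```python
BYTES_PER_FLOAT32 = 4  # assumption: one float32 per dimension

# Output dimensions of the two models discussed above.
MODEL_DIMENSIONS = {
    "all-mpnet-base-v2": 768,
    "all-MiniLM-L6-v2": 384,
}


def raw_vector_bytes(model_name, num_documents):
    """Estimate bytes needed to hold one embedding per document."""
    dims = MODEL_DIMENSIONS[model_name]
    return dims * BYTES_PER_FLOAT32 * num_documents


for model in MODEL_DIMENSIONS:
    print(model, raw_vector_bytes(model, 1000), "bytes per 1000 documents")
```

Under these assumptions the larger model needs twice the vector storage per document.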

Schema:

{
    "id": "hotel_123",
    "name": "Oceanfront Resort",
    "description": "Luxury beachfront resort...",
    "description_minilm_vector": [0.123, 0.456, 0.789, ...],
    "city": "Miami",
    "state": "Florida",
    "country": "USA"
}

Uses Couchbase as a backend database for storing embeddings
Leverages langchain_couchbase package to access Couchbase's native vector store
Documents stored in travel-sample bucket under inventory scope and hotel collection
Full text search on the index embedding
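
One way to picture how a document acquires its vector field is sketched below. The helper name build_hotel_document and the embed callable are hypothetical. In the real project the embedding would come from the Hugging Face model and the write would go through langchain_couchbase, neither of which is shown here.

```python
from typing import Any, Callable, Dict, List

# Field names taken from the schema above.
SCHEMA_FIELDS = ("id", "name", "description", "city", "state", "country")
VECTOR_FIELD = "description_minilm_vector"


def build_hotel_document(
    hotel: Dict[str, Any],
    embed: Callable[[str], List[float]],
) -> Dict[str, Any]:
    """Copy the schema fields from a hotel record and attach its description embedding."""
    doc = {key: hotel.get(key) for key in SCHEMA_FIELDS}
    doc[VECTOR_FIELD] = embed(doc["description"] or "")
    return doc


def fake_embed(text: str) -> List[float]:
    """Stand-in embedder for illustration only; a real model returns 768 floats."""
    return [float(len(text)), 0.0, 0.0]


example = build_hotel_document(
    {"id": "hotel_123", "name": "Oceanfront Resort", "description": "Luxury beachfront resort",
     "city": "Miami", "state": "Florida", "country": "USA"},
    fake_embed,
)
print(sorted(example))
```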

SQL++ & Vector Search

Keyword Search (SQL++) Strengths:

Allows for finding exact matches when users search specific hotel features like "oceanfront"
Filters hotels by specific amenities mentioned in descriptions
Quick results for common hotel keywords
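
A keyword filter of this kind might be expressed as a parameterized SQL++ statement. The sketch below only builds the statement and its named parameter and does not connect to a cluster. The function name and the choice of a LIKE match on description are assumptions, not the project's actual query.

```python
def build_keyword_query(keyword: str, limit: int = 5):
    """Build a SQL++ statement and named parameters for a description keyword match."""
    statement = (
        "SELECT h.name, h.city, h.description "
        "FROM `travel-sample`.inventory.hotel AS h "
        "WHERE LOWER(h.description) LIKE $pattern "
        f"LIMIT {int(limit)}"  # int() keeps the inlined limit numeric
    )
    # The keyword travels as a named parameter rather than being spliced into the text.
    # Note: % and _ inside the keyword still act as LIKE wildcards.
    params = {"pattern": f"%{keyword.lower()}%"}
    return statement, params


statement, params = build_keyword_query("oceanfront", limit=3)
print(statement)
print(params)
```

With the Couchbase Python SDK, a pair like this would typically be passed to cluster.query using QueryOptions(named_parameters=params).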

Vector Search Strengths:

Captures the intent behind user queries, such as recognizing that "oceanfront" is similar to "beachfront"
Retrieves hotels with similar features, even if not explicitly mentioned in user queries
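
The idea behind vector similarity can be shown with a toy example. The sketch below ranks documents by cosine similarity using hand-made 3-dimensional vectors. Real embeddings from all-mpnet-base-v2 have 768 dimensions, and the similarity metric configured on the Couchbase vector index is not stated here, so cosine is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, documents, top_k=3):
    """Return document names ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), name) for name, vec in documents.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


# Toy vectors: the first axis loosely stands for "near the sea".
toy_documents = {
    "beachfront resort": [0.9, 0.1, 0.0],
    "downtown business hotel": [0.0, 0.2, 0.9],
}
print(rank_by_similarity([1.0, 0.0, 0.0], toy_documents))
```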

How They Complement Each Other:

Vector search finds similar concepts, while keyword search finds exact matches quickly
This approach ensures users don't miss out on relevant hotels due to phrasing
Users find relevant hotels regardless of how they phrase their search queries
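
A simple way to combine the two result sets is a weighted sum of per-document scores. The sketch below is an illustration under stated assumptions: both score dictionaries are taken to be normalized to the range 0 to 1, the function name hybrid_rank is hypothetical, and the project's actual merging logic may differ.

```python
def hybrid_rank(keyword_scores, vector_scores, keyword_weight=0.5, vector_weight=0.5):
    """Merge two {doc_id: score} dicts into a list of (doc_id, score), best first."""
    doc_ids = set(keyword_scores) | set(vector_scores)
    combined = {
        doc_id: keyword_weight * keyword_scores.get(doc_id, 0.0)
        + vector_weight * vector_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Sort by descending score, then by id so ties are deterministic.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


keyword_hits = {"hotel_1": 1.0}                  # exact match on "oceanfront"
vector_hits = {"hotel_1": 0.5, "hotel_2": 0.8}   # "beachfront" is semantically close
print(hybrid_rank(keyword_hits, vector_hits))
```

A document missing from one result set simply contributes a zero for that side, so a hotel found only by vector search can still appear in the merged list.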

Challenges & Improvements

Challenges

Having never used LangChain before, I spent significant time understanding the technical exercise requirements and the LangChain framework. The langchain-couchbase package helped abstract complexity, and reading through its source code proved to be time well spent. I found these resources helpful: LangChain Couchbase Documentation, LangChain Couchbase API Reference.
Deciding on an embedding model presented challenges in balancing performance and accuracy. I opted for all-mpnet-base-v2 due to its popularity and support for semantic understanding. The tradeoff was managing larger embeddings for better accuracy at the cost of slower performance. For the presentation, I considered using an in-memory embedding model like all-MiniLM-L6-v2 which stores smaller embeddings but would provide less semantic understanding. Ultimately, I stuck with all-mpnet-base-v2 even though downloading the model locally took a while.

Improvements

We're relying on basic print statements for user feedback which makes it difficult to track issues if this project scales. Adding a logging system and retry logic for failed API calls would improve the user experience.
Users cannot adjust the scoring weights of the search results, which could be valuable for tuning relevance.
Restructure main.py by extracting the logic from if __name__ == "__main__" into a separate function (likely called main) so the program can be extended to other uses, such as a web app.
Create unit tests for the embedding and hybrid search logic to ensure coverage of the application.
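
As a rough idea of what the logging and retry improvement could look like, the sketch below wraps a call in retries with exponential backoff and logs each failure. The helper name call_with_retry, the logger name, and the default delays are all hypothetical, and none of this exists in the project yet.

```python
import logging
import time

logger = logging.getLogger("semantic_cache")  # hypothetical logger name


def call_with_retry(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying on any exception with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            logger.warning("Attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # out of retries: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

The sleep parameter is injectable so unit tests can run without real delays.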
