Summary:
We are looking for a code example that shows how to use the ModernColBERT model to vectorize chunks incrementally read from a file and how to store them in a vector database, e.g. FAISS.
Details:
We have two requirements and are looking for a solution that meets both:
- We want to use the lightonai/GTE-ModernColBERT-v1 embedding model with vector stores, e.g. FAISS. We are using LangChain to integrate with FAISS.
- We want to incrementally read chunks of a file (or files), create embeddings (vectorize) for each chunk using the lightonai/GTE-ModernColBERT-v1 model, and write each chunk along with its vector into FAISS.
We have not been able to figure out whether this can be done.
The typical process for vectorizing chunks and storing them in FAISS, using the LangChain wrapper classes, looks like this:
from uuid import uuid4

import faiss
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_openai import OpenAIEmbeddings

index = faiss.IndexFlatL2(len(OpenAIEmbeddings().embed_query("hello world")))
vector_store = FAISS(
    embedding_function=OpenAIEmbeddings(),
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
# Loop: read the file in chunks and add each chunk with its id
# ("input.txt" and the chunk size are placeholders)
from langchain_core.documents import Document

with open("input.txt") as f:
    while text := f.read(2000):  # read the next chunk from the file
        chunk = Document(page_content=text)
        vector_store.add_documents(documents=[chunk], ids=[str(uuid4())])
We have tried many approaches, including the following:

tokenizer = AutoTokenizer.from_pretrained(EMBEDDING_MODEL_GTE_MODERN_COLBERT)
embedding_model = AutoModel.from_pretrained(EMBEDDING_MODEL_GTE_MODERN_COLBERT)  # returns a ModernBertModel instance

# We need a way to vectorize a sample string and get the vector's length.
# The following line does not work because ModernBertModel has no encode()
# method, and we don't want to hardcode the vector dimension to 128.
vector_dimension = len(embedding_model.encode("hello world"))
index = faiss.IndexFlatL2(vector_dimension)

# We need an embedding function to pass to the FAISS constructor for the
# ModernBERT model. The following does not work because embedding_model
# (a ModernBertModel) cannot be assigned to the embedding_function parameter.
vector_store = FAISS(
    embedding_function=embedding_model,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
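For the dimension question specifically, one workaround sketch: a raw transformers model has no encode() method, but a single forward pass (or model.config.hidden_size) reveals the output width without hardcoding it. Note this is the backbone's hidden size, not the 128-dim ColBERT projection, which lives in the PyLate/sentence-transformers head that AutoModel does not load.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "lightonai/GTE-ModernColBERT-v1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)  # ModernBertModel backbone

# Run one forward pass on a sample string and read the vector width off
# the output tensor; model.config.hidden_size gives the same number.
inputs = tokenizer("hello world", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, hidden)
vector_dimension = hidden.shape[-1]
```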
Questions:
In general, we are looking for a code example that shows how to use the ModernColBERT model to vectorize chunks incrementally read from a file and store them in a vector database, e.g. FAISS.
- We are looking for a way to create an embedding for a sample string so we can determine the vector dimension without hardcoding it to 128.
- We are looking for an embedding_function that can be passed as a parameter to the FAISS constructor (the same requirement applies to the Chroma vector DB). The vector DB uses this embedding function to convert strings/documents into vectors (create embeddings) when strings/documents are added to it.
Any help is much appreciated.