Summary:
We are looking for a code example that shows how to use the ModernColBERT model to vectorize chunks incrementally read from a file and how to store them in a vector database, e.g. FAISS.
Details:
We have two requirements and are looking for a solution that meets both:
- We want to use the lightonai/GTE-ModernColBERT-v1 embedding model with vector stores, e.g. FAISS. We are using LangChain to integrate with FAISS.
- We want to incrementally read chunks of a file (or files), create embeddings (vectorize) for each chunk using the lightonai/GTE-ModernColBERT-v1 model, and write each chunk along with its vector into FAISS.
We have not been able to figure out whether this can be done.
The typical process for vectorizing chunks and storing them in FAISS, using the LangChain wrapper classes, looks like this:
from uuid import uuid4

import faiss
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_openai import OpenAIEmbeddings

index = faiss.IndexFlatL2(len(OpenAIEmbeddings().embed_query("hello world")))
vector_store = FAISS(
    embedding_function=OpenAIEmbeddings(),
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
# Loop: read the file in chunks and add each chunk with its id
# ("input.txt" and the chunk size are placeholders)
from langchain_core.documents import Document

with open("input.txt") as f:
    while text := f.read(2000):  # read the next chunk from the file
        chunk = Document(page_content=text)
        vector_store.add_documents(documents=[chunk], ids=[str(uuid4())])
We have tried many approaches, including the following:

tokenizer = AutoTokenizer.from_pretrained(EMBEDDING_MODEL_GTE_MODERN_COLBERT)
embedding_model = AutoModel.from_pretrained(EMBEDDING_MODEL_GTE_MODERN_COLBERT)  # returns a ModernBertModel instance

# We need a way to vectorize a sample string and get the vector's length.
# The following line does not work because ModernBertModel has no encode()
# method, and we don't want to hardcode the vector dimension to 128.
vector_dimension = len(embedding_model.encode("hello world"))
index = faiss.IndexFlatL2(vector_dimension)

# We need an embedding function to pass to the FAISS constructor for the
# ModernBERT model. The following does not work because embedding_model
# (a ModernBertModel) cannot be assigned to the embedding_function parameter.
vector_store = FAISS(
    embedding_function=embedding_model,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
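For the dimension question specifically, one workaround sketch: a raw transformers model has no encode() method, but a single forward pass (or model.config.hidden_size) reveals the output width without hardcoding it. Note this is the backbone's hidden size, not the 128-dim ColBERT projection, which lives in the PyLate/sentence-transformers head that AutoModel does not load.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "lightonai/GTE-ModernColBERT-v1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)  # ModernBertModel backbone

# Run one forward pass on a sample string and read the vector width off
# the output tensor; model.config.hidden_size gives the same number.
inputs = tokenizer("hello world", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, hidden)
vector_dimension = hidden.shape[-1]
```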
Questions:
In general, we are looking for a code example that shows how to use the ModernColBERT model to vectorize chunks incrementally read from a file and store them in a vector database, e.g. FAISS.
- We are looking for a way to create an embedding for a sample string so we can determine the vector dimension without hardcoding it to 128.
- We are looking for an embedding_function that can be passed as a parameter to the FAISS constructor (the same requirement applies to the Chroma vector DB). The vector DB uses this embedding function to convert strings/documents into vectors (create embeddings) when strings/documents are added to it.
Any help is much appreciated.