Vectors belong to a larger category of _tensors_.
If you want to convert text into vectors, you would typically interact with the LLM at a specific stage in the following process.
* **Tokenization:** The text is first tokenized, which means breaking it down into smaller units. Tokens are usually words or sub-words. This is the first step, but it's not yet the vectorization process.
* **Embedding (Vectorization):** After tokenization, the text is passed through an **embedding layer**. This is where the interaction with the LLM happens. The LLM takes the tokens and converts them into dense numerical representations—**vectors**. These vectors are high-dimensional (e.g. 768 dimensions in the case of BERT or GPT-3's default embeddings) and contain semantic information about the text.
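The two steps above can be sketched in plain Python. This is a toy illustration, not a real LLM: the whitespace tokenizer, the hand-made 4-dimensional embedding table, and the averaging step are all invented for clarity, whereas real models use learned sub-word tokenizers and hundreds of dimensions.

```python
def tokenize(text: str) -> list[str]:
    # Step 1: break the text into smaller units (here: lowercase words).
    return text.lower().split()

# Toy embedding table; a real one is learned during training.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1, 0.2],
    "queen": [0.9, 0.7, 0.2, 0.2],
    "the":   [0.1, 0.1, 0.0, 0.0],
}

def embed(text: str) -> list[float]:
    # Step 2: map each token to its vector, then average the token
    # vectors into one vector representing the whole text.
    vectors = [EMBEDDINGS.get(tok, [0.0] * 4) for tok in tokenize(text)]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

print(embed("the king"))  # ≈ [0.5, 0.45, 0.05, 0.1]
```

Averaging token vectors is the crudest possible pooling strategy; transformer models instead use attention to build context-aware sentence embeddings.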
Here is an example from Anshu's article [Understanding the Fundamental Limitations of Vector-Based Retrieval for Building LLM-powered Chatbot](https://medium.com/thirdai-blog/understanding-the-fundamental-limitations-of-vector-based-retrieval-for-building-llm-powered-48bb7b5a57b3), where a corpus of text documents is broken down into smaller blocks of text (chunks). Each chunk is then fed to a trained language model like BERT or GPT to generate a vector representation, also known as an embedding. The embeddings are then stored in the vector database.
However, any changes or updates to the LLM require reindexing everything in the vector database. You also need the exact same model for querying: changing dimensions is not allowed. So you can imagine the cost of using an LLM to power your solution.
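A minimal sketch of why the same model is required for querying. The index class and its dimension check are hypothetical, but the failure mode is real: a vector store can only compare vectors of the dimensionality it was created with, so embeddings from a different model cannot be queried against it.

```python
class TinyVectorIndex:
    """Toy in-memory index that, like a real vector database,
    rejects vectors whose dimensionality differs from the index's."""

    def __init__(self, dims: int):
        self.dims = dims
        self.vectors: dict[str, list[float]] = {}

    def add(self, doc_id: str, vector: list[float]) -> None:
        if len(vector) != self.dims:
            raise ValueError(f"expected {self.dims} dims, got {len(vector)}")
        self.vectors[doc_id] = vector

index = TinyVectorIndex(dims=768)    # indexed with model A (768 dims)
index.add("doc1", [0.0] * 768)       # fine

try:
    index.add("doc2", [0.0] * 1536)  # embedding from a different model
except ValueError as e:
    print(e)                         # switching models means reindexing
```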
## Why (not) use vectors?
Vectors can be used to determine the similarity of different objects. You can convert any kind of data, from text, images, and audio to other unstructured data, into vectors. Then, you determine their semantic similarity by measuring the distance between vectors. The K-nearest neighbors (KNN) are the ones that are the most similar to the vector that you are looking for.
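As a small sketch, here is K-nearest neighbors by cosine similarity in pure Python. The 3-dimensional word vectors are invented for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: 1.0 means same direction, 0.0 means orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def knn(query, vectors, k=2):
    # Rank all stored vectors by similarity to the query, keep the top k.
    ranked = sorted(vectors.items(),
                    key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

words = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}
print(knn([0.8, 0.8, 0.1], words))  # → ['king', 'queen']
```

Real vector databases avoid this exhaustive scan with approximate indexes (e.g. HNSW), but the ranking idea is the same.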
This is useful for finding words that are similar to each other even if their representations are completely different. For example, "king" and "queen" are similar but they look different. The word "king" in English and "roi" in French are also similar. This kind of semantic similarity is difficult to achieve in traditional full-text search, yet it is very useful for many activities such as recruiting, e-commerce, etc.
There are also cases where you don't want to use vectors. When you know precisely what you are searching for, you want to ensure the search criteria are precise and strictly applied by the database / search engine. You don't want any irrelevant results to appear, even if they look similar. For example, if you are looking for kings of France, you don't want any kings of England even if they are similar. You want exact matches in this case.
## Vector Database
A vector database is a specific kind of database that saves information in the form of multi-dimensional vectors representing certain characteristics or qualities. According to the article [The Top 5 Vector Databases](https://www.datacamp.com/blog/the-top-5-vector-databases) by Moez Ali, there are a lot of vector databases in the market. They are either dedicated vector databases or existing databases that support vector search.
It plays a crucial role in finding similar assets by querying for neighboring vectors. Vector databases are typically used to power vector search use cases like visual, semantic, and multimodal search. These kinds of search can be used as a stand-alone search query or as part of a hybrid search by combining them with a full-text search.
Recently I had the chance to witness the updates from Elasticsearch and MongoDB, so I'm going to explore those engines and show you how they store vectors there.
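In Elasticsearch, you first create an index whose mapping declares a `dense_vector` field. A minimal sketch, where the index name follows the post's `my-index` but the field names and dimension count are illustrative assumptions:

```sh
PUT my-index
{
  "mappings": {
    "properties": {
      "my_text":   { "type": "text" },
      "my_vector": { "type": "dense_vector", "dims": 3 }
    }
  }
}
```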
Then you'll need to put the documents into the index with their vectors. In these two documents, the vector is the embedding of the text, probably pre-processed by an LLM.
```sh
PUT my-index/_doc/1
{ ... }
```
Documents are ranked by the vector field similarity to the query vector. There are different algorithms for calculating the vector similarity: `l2_norm`, `dot_product`, `cosine`, and `max_inner_product`. See [official documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html).
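A rough sketch of what these four similarity functions compute, in plain Python. The raw formulas are shown here; the final `_score` transformations Elasticsearch applies on top of them are omitted, and the example vectors are invented.

```python
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_norm(a, b):
    # Euclidean distance: smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # Direction only: vector length does not matter.
    return dot_product(a, b) / (math.hypot(*a) * math.hypot(*b))

def max_inner_product(a, b):
    # Same formula as dot_product; Elasticsearch just allows it on
    # vectors that are not normalized to unit length.
    return dot_product(a, b)

a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(dot_product(a, b))  # 28.0
print(l2_norm(a, b))      # ≈ 3.742 (sqrt(14))
print(cosine(a, b))       # ≈ 1.0 (same direction)
```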
In the paragraphs above, we talked about vector search. This is great for finding information when you are more or less clear about what you are looking for. Now let's talk about hybrid search, a combination of full-text search and vector search.
The motivation behind hybrid search is quite clear: users often have a precise idea of what they want in certain aspects of their query, but they may be less certain about others. For example, in an e-commerce scenario, a user might want to buy products from a specific marketplace, within a particular category, and at a fixed price range. However, they might be more flexible with the search query used to describe the product.
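A toy sketch of that idea, with invented product data: the structured criteria (marketplace, category, price) are strict filters, while the free-text part is ranked by vector similarity. Real engines like Elasticsearch combine BM25 and kNN scores (e.g. with reciprocal rank fusion) rather than this simple filter-then-rank.

```python
products = [
    {"name": "running shoes",  "category": "sports",  "price": 80,
     "vector": [0.9, 0.1]},
    {"name": "trail sneakers", "category": "sports",  "price": 95,
     "vector": [0.8, 0.2]},
    {"name": "leather boots",  "category": "fashion", "price": 90,
     "vector": [0.3, 0.7]},
]

def hybrid_search(query_vector, category, max_price):
    # Structured criteria are strict filters (exact match) ...
    candidates = [p for p in products
                  if p["category"] == category and p["price"] <= max_price]
    # ... while the fuzzy part of the query is ranked by similarity.
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return sorted(candidates,
                  key=lambda p: dot(query_vector, p["vector"]),
                  reverse=True)

results = hybrid_search([1.0, 0.0], category="sports", max_price=100)
print([p["name"] for p in results])  # → ['running shoes', 'trail sneakers']
```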
Elastic is also building the Elasticsearch Relevance Engine (ESRE), designed to power artificial intelligence-based search applications. Use ESRE to apply semantic search with superior relevance out of the box (without domain adaptation), integrate with external large language models (LLMs), implement hybrid search, and use third-party or your own transformer models. Here is an example of the GenAI architecture with Google Cloud and Elasticsearch for retail, presented by Delphin Barankanira during the Meetup ElasticFR 91 on June 23, 2024 (video: <https://youtu.be/Uti0fB5HpRY?si=E0_7g3Ja24zpD3sM>).
In this architecture, you can see how an LLM is integrated into the database of the retailer company to provide a semantic search experience. Not only does the system allow users to ask questions and use the LLM to provide relevant answers, it also allows the retail company to check the availability of the products using hybrid search and to control access using role-based access control (RBAC) via LDAP. This becomes the relevant context and is then used by Vertex AI, developed by Google, to provide the final answer to the customer.
## Vector in MongoDB
MongoDB announced their support for vectors recently.
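For reference, an Atlas Vector Search index definition is a small JSON document. A sketch, where the `path` value and dimension count are illustrative assumptions while the keys `type`, `path`, `numDimensions`, and `similarity` follow MongoDB's documented format:

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "my_vector",
      "numDimensions": 3,
      "similarity": "cosine"
    }
  ]
}
```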
From the user's perspective, this is very similar to the configuration shown in Elasticsearch. The field "numDimensions" defines the number of dimensions in the vector, and "similarity" is the default algorithm used for comparing the similarity of vectors when searching for the top K-nearest neighbors.
Once the data are persisted in MongoDB, you can perform a `$vectorSearch` query to search the information in the given index. MongoDB supports two types of vector searches: ANN search and ENN search. For ANN search, Atlas Vector Search finds vector embeddings in your data that are closest to the vector embedding in your query, based on their proximity in multi-dimensional space and on the number of neighbors that it considers. For ENN search, Atlas Vector Search exhaustively searches all the indexed vector embeddings by calculating the distance between all the embeddings, and finds the exact nearest neighbor for the vector embedding in your query. This is computationally intensive.
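The difference can be sketched in plain Python with toy 2-dimensional data. ENN is a brute-force scan over every stored vector, which is exact but linear in the collection size; ANN indexes (e.g. HNSW graphs) approximate this result to avoid scanning everything.

```python
import math

stored = {
    "doc1": [0.9, 0.1],
    "doc2": [0.2, 0.8],
    "doc3": [0.7, 0.3],
}

def enn(query, k=1):
    # Exact nearest neighbors: compute the distance to EVERY stored
    # embedding, then keep the k closest. Always correct, but the cost
    # grows linearly with the number of stored vectors.
    ranked = sorted(stored, key=lambda d: math.dist(query, stored[d]))
    return ranked[:k]

print(enn([1.0, 0.0]))  # → ['doc1']
```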
After learning these concepts, it makes me realize several things as a normal software engineer with little AI knowledge:
1. AI Engineers are Software Engineers. Most of the hard work for AI projects is handled by LLMs or databases, which respectively handle the production of vectors and the storage of vectors. Therefore, as an AI engineer in a company, your role is mainly to choose how to integrate LLMs and vectors into the existing system architecture to better fit the business requirements.
2. Using AI sounds extremely expensive. You have to call an LLM as the encoder for creating the vectors, both for the existing data and the user queries. The vectors have to be produced by the same LLM, otherwise the queries in the database will fail. So you have to choose an LLM model, e.g. `gpt-4o`, and stick with it. Then, when a new LLM model is chosen (because it's newer, more cost-effective, etc.), you will have to stop the world and replace everything again in your database.
3. Not all applications need semantic search. Semantic search is a revolutionary tool for domains where users cannot precisely define what they want, due to their lack of knowledge of the things that they are looking for, the flexibility of the scope that they can allow in their queries, etc. It's typically useful for e-commerce, recruiting, and content management. But in other cases, it may not be that important.
4. Companies owning the data are the kings. As you can see, the vectors are used as the embeddings of the existing data. So if you don't have data, then it's hard to get opportunities to leverage LLMs for business.