Commit 56db1b6

How to test an analyzer in Elasticsearch? (#104)
* How to test an analyzer in Elasticsearch?
* update
* Add more
* update
1 parent 4320b3b commit 56db1b6

File tree

5 files changed: +222 −7


.article_num

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-213
+214

_data/images.yml

Lines changed: 9 additions & 1 deletion
@@ -1,10 +1,18 @@
+/assets/patterns/pawel-czerwinski-xubOAAKUwXc-unsplash.jpg:
+  author: Pawel Czerwinski
+  url: https://unsplash.com/@pawel_czerwinski
+  height: 1920
+  width: 1280
+  license: Unsplash License
+  license_url: https://unsplash.com/license
+
 /assets/patterns/pawel-czerwinski-dQuNjCvy9uc-unsplash.jpg:
   author: Pawel Czerwinski
   url: https://unsplash.com/@pawel_czerwinski
   height: 1920
   width: 1280
   license: Unsplash License
-  license_url: https://unsplash.com/@pawel_czerwinski
+  license_url: https://unsplash.com/license

 /assets/bg-aakash-dhage-Ir43SiiFUOA-unsplash.jpg:
   author: Aakash Dhage

_data/locale.yml

Lines changed: 5 additions & 5 deletions
@@ -19,7 +19,7 @@ en: &EN
   FOLLOW_US : "Follow us on [NAME]."
   EMAIL_ME : "Send me an Email."
   EMAIL_US : "Send us an Email."
-  COPYRIGHT_DATES : "2016 - 2023"
+  COPYRIGHT_DATES : "2016 - 2024"
   BACKGROUND_PREFIX : "Photo by"
   BACKGROUND_MIDDLE : "on"

@@ -53,7 +53,7 @@ zh-Hans: &ZH_HANS
   FOLLOW_US : "在 [NAME] 上关注我们。"
   EMAIL_ME : "给我发邮件。"
   EMAIL_US : "给我们发邮件。"
-  COPYRIGHT_DATES : "2016 - 2023"
+  COPYRIGHT_DATES : "2016 - 2024"
   BACKGROUND_PREFIX : "照片由"
   BACKGROUND_MIDDLE : "提供 /"

@@ -84,7 +84,7 @@ zh-Hant: &ZH_HANT
   FOLLOW_US : "在 [NAME] 上關注我們。"
   EMAIL_ME : "給我發郵件。"
   EMAIL_US : "給我們發郵件。"
-  COPYRIGHT_DATES : "2016 - 2023"
+  COPYRIGHT_DATES : "2016 - 2024"

 zh-TW:
   <<: *ZH_HANT

@@ -111,7 +111,7 @@ ko: &KO
   FOLLOW_US : "[NAME]에서 팔로우하기"
   EMAIL_ME : "이메일 보내기"
   EMAIL_US : "이메일 보내기"
-  COPYRIGHT_DATES : "2016 - 2023"
+  COPYRIGHT_DATES : "2016 - 2024"

 ko-KR:
   <<: *KO

@@ -136,7 +136,7 @@ fr: &FR
   FOLLOW_US : "Suivez-nous sur [NAME]."
   EMAIL_ME : "Envoyez-moi un courriel."
   EMAIL_US : "Envoyez-nous un courriel"
-  COPYRIGHT_DATES : "2016 - 2023"
+  COPYRIGHT_DATES : "2016 - 2024"
   DONATE : "Faites un don de [NAME]."

 fr-BE:
Lines changed: 207 additions & 0 deletions
---
article_num: 214
layout: post
type: classic
title: How to test an analyzer in Elasticsearch?
subtitle: >
  Understanding how your content is processed by Elasticsearch

lang: en
date: 2024-08-04 09:30:23 +0200
categories: [elasticsearch]
tags: [elasticsearch]
comments: true
excerpt: >
  Understanding how your content is processed by Elasticsearch, even if you have little experience with Elasticsearch, using the Analyze API.

image: /assets/patterns/pawel-czerwinski-xubOAAKUwXc-unsplash.jpg
cover: /assets/patterns/pawel-czerwinski-xubOAAKUwXc-unsplash.jpg
article_header:
  type: overlay
  theme: dark
  background_color: "#203028"
  background_image:
    gradient: "linear-gradient(135deg, rgba(0, 0, 0, .6), rgba(0, 0, 0, .4))"
wechat: false
---

## Introduction

We are going to talk about text processing in Elasticsearch, more specifically, how to test analyzers in Elasticsearch. The analyzer is a powerful concept: it is useful for processing both your content and your queries. However, an analyzer is also complex. It contains several kinds of components, including character filters, tokenizers, and token filters. The official Elastic documentation provides a lot of information about each component, but it is still really difficult to understand the exact behavior of each analyzer and how your data are processed. Also, the official documentation only explains the technical aspects of each component; you still need to translate and apply them to your application. That is, you need to find other ways to evaluate whether the current setup meets the business requirements, which can be specific to your industry, the language of the content, the geographic zone, etc. In this article, we are going to use the Analyze API to evaluate the data ingestion end-to-end: you provide the content and see the output generated by Elasticsearch.

## Analyzer Overview

```mermaid
---
title: Analyzer Example
---
flowchart LR
    subgraph analyzer
        subgraph character_filters
            html_strip
        end
        subgraph tokenizer
            standard
        end
        subgraph token_filters
            direction TB
            lowercase --> ascii_folding
        end
    end

    input --> character_filters
    character_filters --> tokenizer
    tokenizer --> token_filters
    token_filters --> output
```

Here is an example showing the components inside an analyzer.

* [Character filters](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-charfilters.html) are used to preprocess the stream of characters before it is passed to the tokenizer. A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. For example, the HTML Strip Character Filter strips out HTML elements like `<b>` and decodes HTML entities like `&amp;`.
* [A tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html) receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. For example, the whitespace tokenizer breaks text into tokens whenever it sees any whitespace.
* [Token filters](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html) accept a stream of tokens from a tokenizer and can modify tokens (e.g. lowercasing), delete tokens (e.g. remove stopwords), or add tokens (e.g. synonyms).
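
To make the diagram concrete, here is a minimal sketch of how such an analyzer could be declared when creating an index. The exact composition is an assumption for illustration: in particular, the custom `ascii_folding` filter name and the `preserve_original` option are choices of this sketch, not a setup confirmed by this article.

```js
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        // char filter -> tokenizer -> token filters, as in the diagram
        "lowercase_ascii_folding_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "ascii_folding"]
        }
      },
      "filter": {
        // emit the folded token and keep the original (accented) one
        "ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      }
    }
  }
}
```

With `preserve_original` enabled, the folded token and the original token are both indexed at the same position, which would explain the duplicated tokens shown later in this article.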

Once you have the analyzer, you can use it at different levels: at the field level, at the index level, at the percolator level for queries, at the ingestion-pipeline level, at the search-query level, etc. Here is an example where we use the analyzer at the field level. Under the mappings of your property (your field), you specify the analyzer used for analyzing the data.

```js
{
  "settings": {
    // ...
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "lowercase_ascii_folding_analyzer"
      }
      // ...
    }
  }
}
```

## Analyze API

The [Analyze API](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html) is an API for viewing the terms produced by an analyzer: you choose an analyzer and an input text, and you inspect the tokens produced by that analyzer. Recently, I was working on supporting French and Chinese for the [ChatGPT QuickSearch Extension](https://chromewebstore.google.com/detail/chatgpt-quicksearch/jclniokkhcjpgfijopjahldoepdikcko). The examples below come from that work.
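
Note that the Analyze API can also be called without any index, which is convenient for testing a built-in analyzer before creating anything:

```sh
GET /_analyze
{
  "analyzer": "standard",
  "text": "À bientôt !"
}
```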

You can use the Analyze API by explicitly specifying the analyzer. Here I choose the lowercase ASCII folding analyzer and use the French sentence "À bientôt !" ("See you soon!") as the test input.

```sh
GET /my_index/_analyze
{
  "analyzer": "lowercase_ascii_folding_analyzer",
  "text": "À bientôt !"
}
```

From the results below, you can see that the accents are removed from the characters 'à' and 'ô'. The analyzer produced two additional folded tokens, while the original words are preserved at the same positions.

```json
{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "à",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "bientot",
      "start_offset": 2,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "bientôt",
      "start_offset": 2,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
```
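
If you want to understand which component produced which token, the Analyze API also accepts an `explain` parameter, which returns the token stream after each step of the analysis:

```sh
GET /my_index/_analyze
{
  "analyzer": "lowercase_ascii_folding_analyzer",
  "text": "À bientôt !",
  "explain": true
}
```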

Sometimes, however, you don't want to limit your evaluation to a given analyzer: you want to target a field instead, because the field is closer to the application level than the analyzer. When you target the field, your validation remains valid even if the underlying analyzer changes.

```sh
GET /my_index/_analyze
{
  "field": "content",
  "text": "À bientôt !"
}
```
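
You can also compose an ad hoc analyzer directly in the request, without defining it in any index. This is handy for quick experiments; the combination below is only an example:

```sh
GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "<b>À bientôt !</b>"
}
```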

If you work with Chinese content, you can evaluate it in the same way. Chinese is completely different from English or French: the tokenizer needs to understand and tokenize ideographs. One possible solution is to use the [International Components for Unicode (ICU)](https://icu.unicode.org/) plugin. Here are the tokens produced for 中华人民共和国国歌 ("National Anthem of the People's Republic of China").

```sh
GET /my_chinese_index/_analyze
{
  "field": "content",
  "text": "中华人民共和国国歌"
}
```

```json
{
  "tokens": [
    {
      "token": "中华",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "人民",
      "start_offset": 2,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "共和国",
      "start_offset": 4,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "国歌",
      "start_offset": 7,
      "end_offset": 9,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    }
  ]
}
```
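
For reference, here is a minimal sketch of how `my_chinese_index` could be configured, assuming the `analysis-icu` plugin is installed (`bin/elasticsearch-plugin install analysis-icu`). The analyzer name `chinese_analyzer` is a hypothetical choice for illustration:

```js
PUT /my_chinese_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        // "icu_analyzer" is the analyzer type provided by the ICU plugin
        "chinese_analyzer": {
          "type": "icu_analyzer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "chinese_analyzer"
      }
    }
  }
}
```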

## Other Considerations

When working with analyzers, you need to consider the impact of changes on your existing documents, especially when you have a lot of documents. Any change can impact the data ingestion, the data storage, or the data retrieval process. Therefore, it is important to evaluate the impact using the Analyze API and to write non-regression tests to ensure the quality of your application. Then, at the document level, you can also use the Explain API to understand whether a document matches a specific query. This is not directly related to the analyzer, but if you change your analyzer, it may change the terms produced by the system, which impacts the queries.
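
As a quick sketch of the Explain API, assuming a document with id `1` exists in `my_index`, the following request explains whether (and why) that document matches the query:

```sh
GET /my_index/_explain/1
{
  "query": {
    "match": {
      "content": "bientot"
    }
  }
}
```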

## Conclusion

In this article, we went through an overview of the analyzer. Then, we saw how to use the Analyze API to evaluate specific content with an analyzer.
Interested to know more? You can subscribe to [the feed of my blog](/feed.xml), follow me
on [Twitter](https://twitter.com/mincong_h) or
[GitHub](https://github.com/mincong-h/). I hope you enjoyed this article. See you next time!

## References

* "Elasticsearch: analyzer", by Elastic 中国社区官方博客 (Elastic China Community Official Blog), https://blog.csdn.net/UbuntuTouch/article/details/100392478
* "Analyze API", Elasticsearch Documentation, https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
675 KB (binary image file; preview not shown)
