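Elasticsearch Analysis and Tokenization

Preface

Suppose we want to tag a blog post with the word "大神" (expert). Instead of keeping it as one term, the blog system crudely splits it into two tokens, "大" and "神", which is clearly not what users expect.

This is a limitation of Elasticsearch's built-in language analyzers: they do not handle every language well, Chinese in particular. In such cases we need an additional Chinese analyzer to help.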

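Analysis

Analysis is the process of splitting text into a series of terms. For example, "美国留给伊拉克的是个烂摊子吗?" might be analyzed into: 美国, 伊拉克, 烂摊子.

Analyzer

The analyzer is the component in Elasticsearch that performs analysis. The official documentation describes an analyzer as three layers:

Character Filters: pre-process the raw text of the document, for example converting Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789), or stripping HTML markup. An analyzer may have zero or more character filters, and they are applied in order. PS: think of a filter chain, similar to filters or interceptors in a Servlet container.
Tokenizer: the core. It splits the raw text into tokens according to some rule, and an analyzer has exactly one. PS: a token here is simply an individual word; the text is cut into several pieces, much like String.split in Java.
Token Filter: post-processes the tokens produced by the tokenizer, for example converting them to lowercase. A token filter receives a token stream and may add, remove, or change tokens, but it is not allowed to change the position or character offsets of a token. An analyzer may have zero or more token filters, and they are applied in order.

Order of invocation: Character Filters ---> Tokenizer ---> Token Filters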
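The three stages can be pictured as a chain of plain functions. The sketch below is a toy model in Python, not Elasticsearch's implementation; `strip_html`, `whitespace_tokenize`, `lowercase`, and `analyze` are hypothetical names standing in for a character filter, a tokenizer, a token filter, and the analyzer that chains them.

```python
import re
from typing import Callable, List, Sequence


def strip_html(text: str) -> str:
    """Character filter: remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenize(text: str) -> List[str]:
    """Tokenizer: split the character stream on runs of whitespace."""
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def analyze(
    text: str,
    char_filters: Sequence[Callable[[str], str]],
    tokenizer: Callable[[str], List[str]],
    token_filters: Sequence[Callable[[List[str]], List[str]]],
) -> List[str]:
    # 1. zero or more character filters, applied in order
    for char_filter in char_filters:
        text = char_filter(text)
    # 2. exactly one tokenizer
    tokens = tokenizer(text)
    # 3. zero or more token filters, applied in order
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("<p>The QUICK brown fox</p>", [strip_html], whitespace_tokenize, [lowercase]))
# ['the', 'quick', 'brown', 'fox']
```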

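Summary and Recap

An analyzer is a package made of three parts: character filters, a tokenizer, and token filters.
An analyzer can have zero or more character filters.
An analyzer must have exactly one tokenizer.
An analyzer can have zero or more token filters.
A character filter transforms characters: it receives a character stream and outputs a character stream.
A tokenizer splits text: it receives a character stream and outputs a token stream (the individual words the text is split into are called tokens).
A token filter processes tokens: it receives a token stream and outputs a token stream.
In short, the whole job of an analyzer is to break text into individual words: text ----> characters ----> tokens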

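Analyze API

Elasticsearch provides the _analyze endpoint for testing analysis. You can either name a field of an existing index, so that the field's analyzer is used, or name an analyzer explicitly, and in both cases supply the text to analyze.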
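As a rough sketch of those two ways of calling `_analyze`, the Python below builds the HTTP requests without sending them. The helper `build_analyze_request`, the host `http://localhost:9200`, the index `my_index`, and the field `title` are all placeholders chosen for illustration.

```python
import json
from typing import Optional
from urllib import request


def build_analyze_request(base_url: str, body: dict, index: Optional[str] = None) -> request.Request:
    """Build (but do not send) a POST request for the _analyze endpoint."""
    path = f"/{index}/_analyze" if index else "/_analyze"
    return request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Name an analyzer explicitly; no index is needed.
by_analyzer = build_analyze_request(
    "http://localhost:9200", {"analyzer": "standard", "text": "hello world"}
)

# Use whatever analyzer the mapping assigns to a field of an existing index.
by_field = build_analyze_request(
    "http://localhost:9200", {"field": "title", "text": "hello world"}, index="my_index"
)

print(by_analyzer.get_method(), by_analyzer.full_url)
print(by_field.get_method(), by_field.full_url)
# To actually send one against a running node: request.urlopen(by_analyzer).read()
```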

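Predefined Analyzers

Elasticsearch ships with a set of built-in analyzers, covered in section 2 below. The default is standard; a different one can be specified in the index mapping (similar to a table schema) when the index is created.

Because an inverted index is built for each field of a document, you can also specify an analyzer per field when defining the index mapping.

Below is a quick test of three of them: standard, simple, and whitespace. The rest are not tested here.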

[Figures: _analyze output for the standard, simple, and whitespace analyzers]

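Testing the remaining analyzers is left to the reader. The choice of analyzer matters: splitting text the way you actually need both saves space and gives better search results. Chinese is especially hard to segment. For example, the default standard analyzer splits "中国驻洛杉矶领事馆遭亚裔男子枪击嫌犯已自首" into single characters (中, 国, 驻, 洛, 杉, 矶, 领, 事, 馆, 遭, 亚, 裔, 男, 子, 枪, 击, 嫌, 犯, 已, 自, 首), which is clearly not what we want. Another example is "乒乓球拍卖完了": should it be split as "乒乓球/拍卖/完了" or as "乒乓球拍/卖完了"?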
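To see why "乒乓球拍卖完了" is ambiguous, here is a small Python sketch of dictionary-based maximum matching, scanning once from the left and once from the right. It only illustrates the ambiguity; it is not a description of how IK or any other plugin segments text, and the five-word `DICTIONARY` is made up for the example.

```python
from typing import List, Set

# Hypothetical mini-dictionary, just large enough to show the ambiguity.
DICTIONARY: Set[str] = {"乒乓球", "乒乓球拍", "拍卖", "卖", "完了"}


def forward_max_match(text: str, words: Set[str]) -> List[str]:
    """Scan left to right, always taking the longest dictionary word."""
    max_len = max(len(w) for w in words)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in words or size == 1:  # fall back to a single character
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text: str, words: Set[str]) -> List[str]:
    """Scan right to left, always taking the longest dictionary word."""
    max_len = max(len(w) for w in words)
    tokens, end = [], len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            if piece in words or size == 1:
                tokens.append(piece)
                end -= size
                break
    return tokens[::-1]


print(forward_max_match("乒乓球拍卖完了", DICTIONARY))   # ['乒乓球拍', '卖', '完了']
print(backward_max_match("乒乓球拍卖完了", DICTIONARY))  # ['乒乓球', '拍卖', '完了']
```

Both results are built entirely from dictionary words, so the dictionary alone cannot decide between them.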

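A commonly used Chinese analysis plugin is IK. It segments Chinese and English text reasonably well and supports custom dictionaries. It is open source and hosted on GitHub: https://github.com/medcl/elasticsearch-analysis-ik

Install the IK plugin as follows: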

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.4.0/elasticsearch-analysis-ik-6.4.0.zip

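Note:

Replace 6.4.0 with the version of Elasticsearch you have installed. After installation the IK plugin is located under /elasticsearch/plugins/, and you can specify ik_smart as the analyzer directly. IK provides two analyzers, ik_smart and ik_max_word; you can compare them by running the same text through both with the _analyze API.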
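One way to run that comparison is to send the same text to `_analyze` twice, changing only the analyzer name. The Python sketch below just prints the two request bodies; the sample sentence is reused from earlier in this article, and sending the bodies (for example with curl, as in section 1) is left out. According to the IK README, ik_smart produces the coarsest-grained split and ik_max_word the finest-grained one.

```python
import json

TEXT = "中国驻洛杉矶领事馆遭亚裔男子枪击嫌犯已自首"

# Request bodies for POST /_analyze; only the analyzer name differs.
bodies = {
    name: {"analyzer": name, "text": TEXT}
    for name in ("ik_smart", "ik_max_word")
}

for name, body in bodies.items():
    print(name, "->", json.dumps(body, ensure_ascii=False))
```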

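When Analysis Happens

1. When a document is created or updated, Elasticsearch analyzes the relevant fields. For example, if a field is of type text, inserting a document analyzes that field's content and updates the field's inverted index. This is called index-time analysis.

2. At query time, the query text is analyzed. For example, a search for "苹果手机" is split into two terms, "苹果" and "手机". This is called search-time analysis.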
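The two moments can be illustrated with a toy inverted index in Python. Everything here is a simplification made for the example: `analyze` is a stand-in analyzer (lowercase, split on whitespace), `TinyIndex` is a hypothetical class, and the search requires every query term to match, which is an assumption of this sketch rather than Elasticsearch's default behaviour.

```python
from collections import defaultdict
from typing import Dict, List, Set


def analyze(text: str) -> List[str]:
    """Stand-in analyzer: lowercase, then split on whitespace."""
    return text.lower().split()


class TinyIndex:
    def __init__(self) -> None:
        self.postings: Dict[str, Set[int]] = defaultdict(set)

    def index(self, doc_id: int, text: str) -> None:
        # Index-time analysis: analyze the document text and record
        # each resulting term in the inverted index.
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query: str) -> Set[int]:
        # Search-time analysis: run the query through the same analyzer,
        # then return the documents that contain every query term.
        terms = analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


idx = TinyIndex()
idx.index(1, "Apple phone review")
idx.index(2, "Apple pie recipe")
print(sorted(idx.search("apple PHONE")))  # [1]
print(sorted(idx.search("apple")))        # [1, 2]
```

Because the query passes through the same analyzer as the documents, "PHONE" still matches the indexed term "phone".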

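You can specify a field's analyzer when creating the index:

[Figure: specifying the field's analyzer in the index mapping]

You can also specify the analyzer at query time:

[Figure: specifying the analyzer in a query]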
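As a sketch of both options, the Python below writes out request bodies for creating an index with a per-field analyzer and for a match query that overrides the analyzer. The index name `blog`, the field `title`, and the choice of the IK analyzers are illustrative assumptions; the `_doc` mapping type follows the 6.x style used elsewhere in this article.

```python
import json

# Body for: PUT /blog
create_index_body = {
    "mappings": {
        "_doc": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "ik_max_word",      # used at index time
                    "search_analyzer": "ik_smart",  # used for queries on this field
                }
            }
        }
    }
}

# Body for: POST /blog/_search, overriding the analyzer for this one query
search_body = {
    "query": {
        "match": {
            "title": {"query": "苹果手机", "analyzer": "ik_smart"}
        }
    }
}

print(json.dumps(create_index_body, ensure_ascii=False, indent=2))
print(json.dumps(search_body, ensure_ascii=False, indent=2))
```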

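In practice, decide explicitly whether each field needs to be analyzed. If it does not, turn analysis off for that field (for example by mapping it as keyword instead of text); this saves space and improves write throughput. The analyzer you use in production should be chosen on the basis of your own tests.

1. Testing Analyzers

The analyze API is a tool for inspecting the analysis process. (PS: similar to an execution plan.)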

curl -X POST "192.168.1.134:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "whitespace",
  "text":     "The quick brown fox."
}
'

curl -X POST "192.168.1.134:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "filter":  [ "lowercase", "asciifolding" ],
  "text":      "Is this déja vu?"
}
'

Output of the first request:

{
    "tokens":[
        {
            "token":"The",
            "start_offset":0,
            "end_offset":3,
            "type":"word",
            "position":0
        },
        {
            "token":"quick",
            "start_offset":4,
            "end_offset":9,
            "type":"word",
            "position":1
        },
        {
            "token":"brown",
            "start_offset":10,
            "end_offset":15,
            "type":"word",
            "position":2
        },
        {
            "token":"fox.",
            "start_offset":16,
            "end_offset":20,
            "type":"word",
            "position":3
        }
    ]
}

As you can see, the position and character offsets are recorded for each term.
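To make the meaning of these fields concrete, the following Python sketch recomputes them for a whitespace split. It is a toy re-implementation for illustration, and `whitespace_tokens` is a hypothetical helper, not part of any Elasticsearch client.

```python
import re
from typing import Dict, List


def whitespace_tokens(text: str) -> List[Dict[str, object]]:
    """Split on whitespace and record each token's offsets and position."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "token": match.group(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
            "type": "word",
            "position": position,           # ordinal of the token in the stream
        })
    return tokens


for tok in whitespace_tokens("The quick brown fox."):
    print(tok)
```

For "The quick brown fox." this yields the same tokens, offsets, and positions as the response above.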

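2. Analyzer

2.1. Configuring Built-in Analyzers

Built-in analyzers can be used directly without any configuration. Their default settings can, however, be changed. For example, the standard analyzer can be configured with a list of stop words: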

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": { 
          "type":      "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "my_text": {
          "type":     "text",
          "analyzer": "standard", 
          "fields": {
            "english": {
              "type":     "text",
              "analyzer": "std_english" 
            }
          }
        }
      }
    }
  }
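}
'

In this example we define a std_english analyzer based on the standard analyzer and configure it to remove the predefined list of English stop words. In the mapping that follows, the my_text field uses standard, while my_text.english uses std_english. The two requests below therefore give different results: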

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "field": "my_text",
  "text": "The old brown cow"
}
'

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "field": "my_text.english",
  "text": "The old brown cow"
}
'

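The first request uses the standard analyzer, so the result is: [ the, old, brown, cow ]

The second uses std_english, so the result is: [ old, brown, cow ]

2.2. Standard Analyzer (default)

If nothing else is specified, standard is the default analyzer. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm) and works for most languages.

For example: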

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog'\''s bone."
}
'

The text above produces the following terms:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]

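2.2.1. Configuration

The standard analyzer accepts the following parameters:

max_token_length: the maximum token length; a longer token is split at max_token_length intervals. Defaults to 255.
stopwords: a predefined stop word list such as _english_, or an array of stop words. Defaults to _none_.
stopwords_path: the path to a file containing stop words.

2.2.2. Example Configuration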

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}
'

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog'\''s bone."
}
'

This produces the following terms:

[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
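The "jumpe, d" pair comes from max_token_length: 5, which cuts any longer token into pieces of at most five characters. The Python sketch below mimics that step on the tokens of the example sentence. It is an approximation for illustration only: `split_long_tokens` is a hypothetical helper, and `STOPWORDS` contains just "the" as a tiny stand-in for the real _english_ list.

```python
from typing import Iterable, List


def split_long_tokens(tokens: Iterable[str], max_token_length: int) -> List[str]:
    """Cut every token into chunks of at most max_token_length characters."""
    pieces = []
    for token in tokens:
        for start in range(0, len(token), max_token_length):
            pieces.append(token[start:start + max_token_length])
    return pieces


# Tokens of the example sentence before lowercasing and stop word removal.
raw = ["The", "2", "QUICK", "Brown", "Foxes", "jumped", "over", "the", "lazy", "dog's", "bone"]
STOPWORDS = {"the"}  # stand-in for _english_; only "the" matters for this sentence

terms = [t.lower() for t in split_long_tokens(raw, 5)]
terms = [t for t in terms if t not in STOPWORDS]
print(terms)
# ['2', 'quick', 'brown', 'foxes', 'jumpe', 'd', 'over', 'lazy', "dog's", 'bone']
```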

2.2.3. Definition

The standard analyzer consists of the following two parts:

Tokenizer

Standard Tokenizer

Token Filters

Standard Token Filter
Lower Case Token Filter
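Stop Token Filter (disabled by default)

If you need to go beyond the configuration parameters, you can recreate it as a custom analyzer and modify that: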

curl -X PUT "localhost:9200/standard_example" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"       
          ]
        }
      }
    }
  }
}
'

2.3. Simple Analyzer

The simple analyzer breaks text into terms whenever it encounters a character that is not a letter, and lowercases every term. For example:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog'\''s bone."
}
'

The output is:

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
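The rule "split on anything that is not a letter, then lowercase" fits in one regular expression. The Python function below is an approximation for illustration; `simple_analyze` is a hypothetical name, and Python's notion of a letter may differ from Lucene's in edge cases.

```python
import re
from typing import List


def simple_analyze(text: str) -> List[str]:
    # [^\W\d_]+ matches runs of letters: word characters minus digits and "_".
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


print(simple_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['the', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```

Note that the digit "2" disappears and "dog's" becomes two terms, because neither digits nor the apostrophe are letters.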

2.3.1. Customization

curl -X PUT "localhost:9200/simple_example" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_simple": {
          "tokenizer": "lowercase",
          "filter": [         
          ]
        }
      }
    }
  }
}
'

2.4. Whitespace Analyzer

The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.

Example:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog'\''s bone."
}
'

The output is:

[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

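2.5. Stop Analyzer

The stop analyzer is much like the simple analyzer; the only difference is that it adds support for removing stop words. By default it uses the _english_ stop word list.

(PS: suppose the sentence is "this is an apple", and suppose only "this" and "is" are stop words. The simple analyzer would output [ this, is, an, apple ], while the stop analyzer would output [ an, apple ]. That is the difference between the two: stop does not emit stop words, in other words it does not treat a stop word as a term.)

(PS: stop words are not delimiters. They are very common words such as "the" or "is" that carry little meaning for search; the text is tokenized first and the stop words are then dropped from the token stream.)

2.5.1. Example Output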

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
    "analyzer": "stop",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog'\''s bone."
}
'

Output:

[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]

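2.5.2. Configuration

The stop analyzer accepts the following parameters:

stopwords: a predefined stop word list (such as _english_) or an array of stop words. Defaults to _english_.
stopwords_path: the path to a file containing stop words, relative to the Elasticsearch config directory.

2.5.3. Example Configuration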

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}
'

The configuration above defines a stop analyzer with two stop words: the and over.

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog'\''s bone."
}
'

With this configuration, the request outputs:

[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
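The effect of the custom stop word list can be approximated in a few lines of Python: tokenize on non-letters, lowercase, then drop any term that appears in the list. This is a sketch for illustration; `stop_analyze` is a hypothetical helper, not the plugin's or Lucene's code.

```python
import re
from typing import Iterable, List


def stop_analyze(text: str, stopwords: Iterable[str]) -> List[str]:
    stop = {w.lower() for w in stopwords}
    # Same splitting rule as the simple analyzer: runs of letters, lowercased.
    terms = (t.lower() for t in re.findall(r"[^\W\d_]+", text))
    # Stop words are removed after tokenization; they do not act as separators.
    return [t for t in terms if t not in stop]


print(stop_analyze(
    "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
    ["the", "over"],
))
# ['quick', 'brown', 'foxes', 'jumped', 'lazy', 'dog', 's', 'bone']
```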

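2.6. Pattern Analyzer

The pattern analyzer uses a Java regular expression to split text into terms. The default pattern is \W+ (one or more non-word characters).

2.6.1. Example Output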

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog'\''s bone."
}
'

Because the default pattern splits on non-word characters, the output is:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

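2.6.2. Configuration

The pattern analyzer accepts the following parameters:

pattern: a Java regular expression. Defaults to \W+.
flags: Java regular expression flags, for example CASE_INSENSITIVE or COMMENTS.
lowercase: whether to convert all terms to lowercase. Defaults to true.
stopwords: a predefined stop word list, or an array of stop words. Defaults to _none_.
stopwords_path: the path to a file containing stop words.

2.6.3. Example Configuration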

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type":      "pattern",
          "pattern":   "\\W|_",
          "lowercase": true
        }
      }
    }
  }
}
'

The example above splits on non-word characters or underscores, and lowercases the resulting terms.

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
'

With this configuration, the example outputs:

[ john, smith, foo, bar, com ]
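The same splitting can be tried out locally with Python's `re` module. This is only an approximation: `pattern_analyze` is a hypothetical helper, and Python and Java regular expressions agree on `\W` and `_` for this input but are not identical in general.

```python
import re
from typing import List


def pattern_analyze(text: str, pattern: str = r"\W+", lowercase: bool = True) -> List[str]:
    # Split wherever the pattern matches and discard empty pieces.
    tokens = [t for t in re.split(pattern, text) if t]
    return [t.lower() for t in tokens] if lowercase else tokens


print(pattern_analyze("John_Smith@foo-bar.com", r"\W|_"))
# ['john', 'smith', 'foo', 'bar', 'com']
print(pattern_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['the', '2', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```

With the default pattern, "John_Smith" would stay in one piece, because the underscore counts as a word character; adding `|_` to the pattern is what separates the two names.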

2.7. Language Analyzers

These analyzers support text analysis for specific languages. The built-in (predefined) languages are: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai

2.8. Custom Analyzer

As mentioned earlier, an analyzer consists of three parts:

zero or more character filters
a tokenizer
zero or more token filters

2.8.1. Example Configuration

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type":      "custom", 
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
'
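To get a feel for what this combination does, here is a rough Python imitation of the four building blocks. It is a simplified sketch, not the real filters: the real html_strip also handles entities, the real standard tokenizer follows Unicode segmentation rules, and the real asciifolding covers far more characters than accented letters. All function names are made up for the example.

```python
import re
import unicodedata
from typing import List


def html_strip(text: str) -> str:
    """char_filter: remove HTML tags (very simplified)."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text: str) -> List[str]:
    """tokenizer: approximate the standard tokenizer with runs of word characters."""
    return re.findall(r"\w+", text)


def ascii_fold(token: str) -> str:
    """filter: strip accents by decomposing and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def my_custom_analyzer(text: str) -> List[str]:
    # char_filter -> tokenizer -> lowercase -> asciifolding
    return [ascii_fold(t.lower()) for t in tokenize(html_strip(text))]


print(my_custom_analyzer("Is this <b>d\u00e9j\u00e0 vu</b>?"))
# ['is', 'this', 'deja', 'vu']
```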

3. Tokenizer

3.1. Standard Tokenizer

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog'\''s bone."
}
'

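4. Chinese Analyzers

4.1. smartCN

A simple analyzer for Chinese or mixed Chinese-English text.

The plugin provides the smartcn analyzer and the smartcn_tokenizer tokenizer, and requires no configuration.

# Install
bin/elasticsearch-plugin install analysis-smartcn
# Remove
bin/elasticsearch-plugin remove analysis-smartcn

Let's test it with the sentence "今天天气真好".

With the smartcn analyzer the result is:

[ 今天, 天气, 真, 好 ]

With the standard analyzer the result would be:

[ 今, 天, 天, 气, 真, 好 ]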

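4.2. IK Analyzer

Download the matching version from https://github.com/medcl/elasticsearch-analysis-ik/releases?after=v6.6.0; here I use 6.4.0.

Next, create an ik directory under the Elasticsearch plugins directory and unzip the downloaded file into it.

Finally, restart Elasticsearch.

Now test it with the same sentence as before.

The output is: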

{
    "tokens": [
        {
            "token": "今天天气",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "今天",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "天天",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "天气",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "真好",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 4
        }
    ]
}

This is clearly somewhat better than smartcn.
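When comparing analyzers it helps to reduce an `_analyze` response to its token strings and to check whether tokens overlap. The Python sketch below does this for the IK response shown above, pasted in as a string; `token_strings` and `has_overlaps` are hypothetical helpers written for this example.

```python
import json
from typing import List

RESPONSE = """
{"tokens": [
  {"token": "今天天气", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 0},
  {"token": "今天",     "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 1},
  {"token": "天天",     "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 2},
  {"token": "天气",     "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 3},
  {"token": "真好",     "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 4}
]}
"""


def token_strings(response: dict) -> List[str]:
    """Return just the token text, in stream order."""
    return [t["token"] for t in response["tokens"]]


def has_overlaps(response: dict) -> bool:
    """True if any token starts before the previous token has ended."""
    tokens = response["tokens"]
    return any(cur["start_offset"] < prev["end_offset"]
               for prev, cur in zip(tokens, tokens[1:]))


parsed = json.loads(RESPONSE)
print(token_strings(parsed))  # ['今天天气', '今天', '天天', '天气', '真好']
print(has_overlaps(parsed))   # True
```

The overlapping tokens mean that a search for "今天", "天气", or "今天天气" can each match this sentence.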

References:

Elasticsearch official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html

https://github.com/medcl/elasticsearch-analysis-ik


Page updated: 2024-03-11
