Preface
Elasticsearch is indexing software whose native language is English, and its out-of-the-box tokenization of Chinese is frankly dreadful. Fortunately the ik analyzer is here to get us out of this awkward spot.
A truly dreadful example
Indexing the data
First, let's index four documents:
curl -XPOST http://localhost:9200/ik/fulltext/1 -H 'Content-Type:application/json' -d '{"content":"美国留给伊拉克的是个烂摊子吗"}'
curl -XPOST http://localhost:9200/ik/fulltext/2 -H 'Content-Type:application/json' -d '{"content":"公安部:各地校车将享最高路权"}'
curl -XPOST http://localhost:9200/ik/fulltext/3 -H 'Content-Type:application/json' -d '{"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"}'
curl -XPOST http://localhost:9200/ik/fulltext/4 -H 'Content-Type:application/json' -d '{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}'
Trying a search
We now have four documents, so let's try searching for 中国. Give it a shot:
root@ubuntu-87:~# curl -XPOST "http://localhost:9200/ik/fulltext/_search?pretty" -H 'Content-Type:application/json' -d'
{
"query" : { "match" : { "content" : "中国" }},
"highlight" : {
"fields" : {
"content" : {}
}
}
}
'
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 1.264571,
"hits" : [
{
"_index" : "ik",
"_type" : "fulltext",
"_id" : "4",
"_score" : 1.264571,
"_source" : {
"content" : "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
},
"highlight" : {
"content" : [
"<em>中</em><em>国</em>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
]
}
},
{
"_index" : "ik",
"_type" : "fulltext",
"_id" : "3",
"_score" : 0.68324494,
"_source" : {
"content" : "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
},
"highlight" : {
"content" : [
"<em>中</em>韩渔警冲突调查:韩警平均每天扣1艘<em>中</em><em>国</em>渔船"
]
}
},
{
"_index" : "ik",
"_type" : "fulltext",
"_id" : "1",
"_score" : 0.2876821,
"_source" : {
"content" : "美国留给伊拉克的是个烂摊子吗"
},
"highlight" : {
"content" : [
"美<em>国</em>留给伊拉克的是个烂摊子吗"
]
}
}
]
}
}
Three documents came back, and the highlight sections show exactly which terms Elasticsearch considered matches:
- <em>中</em><em>国</em>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首
- <em>中</em>韩渔警冲突调查:韩警平均每天扣1艘<em>中</em><em>国</em>渔船
- 美<em>国</em>留给伊拉克的是个烂摊子吗
This is clearly not what we wanted!
The cause of the problem
It turns out the tokenization is to blame: the default analyzer has no idea what to do with Chinese, so it simply split the text character by character. If you're not yet familiar with analyzers and tokenization, have a look at Elasticsearch當中的分析器-Analyzer.
Let's see exactly how it tokenized our sentences.
Take the first document as an example: 美国留给伊拉克的是个烂摊子吗
root@ubuntu-87:~# curl -H "Content-Type:application/json" "http://localhost:9200/ik/fulltext/1/_termvectors?pretty" -d '{ "fields" : ["content"] }'
{
"_index" : "ik",
"_type" : "fulltext",
"_id" : "1",
"_version" : 1,
"found" : true,
"took" : 0,
"term_vectors" : {
"content" : {
"field_statistics" : {
"sum_doc_freq" : 14,
"doc_count" : 1,
"sum_ttf" : 14
},
"terms" : {
"个" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 9,
"start_offset" : 9,
"end_offset" : 10
}
]
},
"伊" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 4,
"start_offset" : 4,
"end_offset" : 5
}
]
},
"克" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 6,
"start_offset" : 6,
"end_offset" : 7
}
]
},
"吗" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 13,
"start_offset" : 13,
"end_offset" : 14
}
]
},
"国" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 1,
"start_offset" : 1,
"end_offset" : 2
}
]
},
"子" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 12,
"start_offset" : 12,
"end_offset" : 13
}
]
},
"拉" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 5,
"start_offset" : 5,
"end_offset" : 6
}
]
},
"摊" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 11,
"start_offset" : 11,
"end_offset" : 12
}
]
},
"是" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 8,
"start_offset" : 8,
"end_offset" : 9
}
]
},
"烂" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 10,
"start_offset" : 10,
"end_offset" : 11
}
]
},
"留" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 2,
"start_offset" : 2,
"end_offset" : 3
}
]
},
"的" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 7,
"start_offset" : 7,
"end_offset" : 8
}
]
},
"给" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 3,
"start_offset" : 3,
"end_offset" : 4
}
]
},
"美" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 0,
"start_offset" : 0,
"end_offset" : 1
}
]
}
}
}
}
}
Every single character in the sentence was split off on its own, chopped into mincemeat. How tragic!
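If you want to confirm this without going through _termvectors, the _analyze API shows the same one-character-per-token behaviour. A minimal sketch using the built-in standard analyzer (same host and port as before):
# Ask the default standard analyzer to tokenize the sentence directly
curl -XPOST "http://localhost:9200/_analyze?pretty" -H 'Content-Type:application/json' -d '
{
  "analyzer": "standard",
  "text": "美国留给伊拉克的是个烂摊子吗"
}'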
The ik analyzer
To rescue us from this horrifying, horrifying state of Chinese tokenization, we have to switch analyzers. The ik analyzer, a third-party plugin and the star of this article, is exactly the cure for Elasticsearch's habit of butchering Chinese text.
GitHub portal: elasticsearch-analysis-ik
Installation
Let's install it with the built-in plugin command. The one thing you absolutely must watch out for is that the Elasticsearch version and the ik analyzer version have to match.
# Change into the Elasticsearch directory
cd /usr/share/elasticsearch
# Install the plugin
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.4.2/elasticsearch-analysis-ik-6.4.2.zip
# Restart Elasticsearch for the plugin to take effect
systemctl restart elasticsearch
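To double-check that the plugin actually landed, you can list the installed plugins; the ik plugin should appear in the output:
# List installed plugins (run from the Elasticsearch directory as above)
./bin/elasticsearch-plugin list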
Testing
Time to test it. First we create an index:
curl -XPUT "http://localhost:9200/ik1?pretty"
{
"acknowledged" : true,
"shards_acknowledged" : true,
"index" : "ik1"
}
Then we configure the mapping, telling it to use the ik analyzer:
curl -XPOST "http://localhost:9200/ik1/fulltext/_mapping?pretty" -H 'Content-Type:application/json' -d'
{
"properties": {
"content": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_max_word"
}
}
}'
{
"acknowledged" : true
}
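Before indexing anything, you can ask the new index to analyze a sample sentence and confirm that 中国 now comes out as a single term. A minimal check:
# Analyze a sentence against the ik1 index with ik_max_word
curl -XPOST "http://localhost:9200/ik1/_analyze?pretty" -H 'Content-Type:application/json' -d '
{
  "analyzer": "ik_max_word",
  "text": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
}'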
Next, index the same four documents as before:
curl -XPOST http://localhost:9200/ik1/fulltext/1 -H 'Content-Type:application/json' -d '{"content":"美国留给伊拉克的是个烂摊子吗"}'
curl -XPOST http://localhost:9200/ik1/fulltext/2 -H 'Content-Type:application/json' -d '{"content":"公安部:各地校车将享最高路权"}'
curl -XPOST http://localhost:9200/ik1/fulltext/3 -H 'Content-Type:application/json' -d '{"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"}'
curl -XPOST http://localhost:9200/ik1/fulltext/4 -H 'Content-Type:application/json' -d '{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}'
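Side note: newly indexed documents only become searchable after a refresh, which by default happens about once per second. If you query immediately and see zero hits, you can force one first:
# Force a refresh so the four documents are searchable right away
curl -XPOST "http://localhost:9200/ik1/_refresh?pretty"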
Finally, run the same search again. As you can see, the results are already much more accurate:
curl -XPOST "http://localhost:9200/ik1/fulltext/_search?pretty" -H 'Content-Type:application/json' -d'
{
"query" : { "match" : { "content" : "中国" }},
"highlight" : {
"fields" : {
"content" : {}
}
}
}
'
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.6489038,
"hits": [
{
"_index": "ik1",
"_type": "fulltext",
"_id": "4",
"_score": 0.6489038,
"_source": {
"content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
},
"highlight": {
"content": [
"<em>中国</em>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
]
}
},
{
"_index": "ik1",
"_type": "fulltext",
"_id": "3",
"_score": 0.2876821,
"_source": {
"content": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
},
"highlight": {
"content": [
"中韩渔警冲突调查:韩警平均每天扣1艘<em>中国</em>渔船"
]
}
}
]
}
}
The search returned two documents, and this time the highlights cover the whole word:
- <em>中国</em>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首
- 中韩渔警冲突调查:韩警平均每天扣1艘<em>中国</em>渔船
That's more like it!
Verifying the tokenization
Let's see how Elasticsearch tokenizes our text now that the ik analyzer is in place.
Again, take the first document as the example: 美国留给伊拉克的是个烂摊子吗
root@ubuntu-87:~# curl -H "Content-Type:application/json" "http://localhost:9200/ik1/fulltext/1/_termvectors?pretty" -d '{ "fields": ["content"] }'
{
"_index" : "ik1",
"_type" : "fulltext",
"_id" : "1",
"_version" : 1,
"found" : true,
"took" : 0,
"term_vectors" : {
"content" : {
"field_statistics" : {
"sum_doc_freq" : 9,
"doc_count" : 1,
"sum_ttf" : 9
},
"terms" : {
"个" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 5,
"start_offset" : 9,
"end_offset" : 10
}
]
},
"伊拉克" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 2,
"start_offset" : 4,
"end_offset" : 7
}
]
},
"吗" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 8,
"start_offset" : 13,
"end_offset" : 14
}
]
},
"摊子" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 7,
"start_offset" : 11,
"end_offset" : 13
}
]
},
"是" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 4,
"start_offset" : 8,
"end_offset" : 9
}
]
},
"烂摊子" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 6,
"start_offset" : 10,
"end_offset" : 13
}
]
},
"留给" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 1,
"start_offset" : 2,
"end_offset" : 4
}
]
},
"的" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 3,
"start_offset" : 7,
"end_offset" : 8
}
]
},
"美国" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 0,
"start_offset" : 0,
"end_offset" : 2
}
]
}
}
}
}
}
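The ik plugin also registers a second, coarser analyzer called ik_smart, which prefers the longest match instead of emitting every possible word (ik_max_word above produced both 烂摊子 and 摊子, for example). If you're curious, you can compare the two on the same sentence:
# Analyze the same sentence with ik_smart for comparison
curl -XPOST "http://localhost:9200/ik1/_analyze?pretty" -H 'Content-Type:application/json' -d '
{
  "analyzer": "ik_smart",
  "text": "美国留给伊拉克的是个烂摊子吗"
}'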
Epilogue
With the ik tokenizer, the Chinese sentences that used to be chopped to bits are finally set free~