Adding the ik Analyzer to Elasticsearch for Better Chinese Search (Part 1)

Preface

As software whose mother tongue is English, Elasticsearch handles Chinese tokenization appallingly out of the box. Fortunately the ik analyzer is available to rescue us from this awkward situation.


A Painfully Bad Example

Indexing the Data

First, let's index four documents:

curl -XPOST http://localhost:9200/ik/fulltext/1 -H 'Content-Type:application/json' -d '{"content":"美国留给伊拉克的是个烂摊子吗"}'
curl -XPOST http://localhost:9200/ik/fulltext/2 -H 'Content-Type:application/json' -d '{"content":"公安部:各地校车将享最高路权"}'
curl -XPOST http://localhost:9200/ik/fulltext/3 -H 'Content-Type:application/json' -d '{"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"}'
curl -XPOST http://localhost:9200/ik/fulltext/4 -H 'Content-Type:application/json' -d '{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}'
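
If you prefer a single round trip, the same four documents can also be sent through the _bulk API. A minimal sketch, equivalent to the four commands above (the bulk body is newline-delimited JSON, which is why --data-binary is used to preserve the line breaks):

# index all four documents in one request
curl -XPOST "http://localhost:9200/ik/fulltext/_bulk?pretty" -H 'Content-Type:application/x-ndjson' --data-binary '{"index":{"_id":"1"}}
{"content":"美国留给伊拉克的是个烂摊子吗"}
{"index":{"_id":"2"}}
{"content":"公安部:各地校车将享最高路权"}
{"index":{"_id":"3"}}
{"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"}
{"index":{"_id":"4"}}
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}
'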

Trying a Search

We now have four documents; let's try searching for 中国:

curl -XPOST "http://localhost:9200/ik/fulltext/_search?pretty" -H 'Content-Type:application/json' -d'
{
    "query" : { "match" : { "content" : "中国" }},
    "highlight" : {
        "fields" : {
            "content" : {}
        }
    }
}
'
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.264571,
    "hits" : [
      {
        "_index" : "ik",
        "_type" : "fulltext",
        "_id" : "4",
        "_score" : 1.264571,
        "_source" : {
          "content" : "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
        },
        "highlight" : {
          "content" : [
            "<em>中</em><em>国</em>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
          ]
        }
      },
      {
        "_index" : "ik",
        "_type" : "fulltext",
        "_id" : "3",
        "_score" : 0.68324494,
        "_source" : {
          "content" : "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
        },
        "highlight" : {
          "content" : [
            "<em>中</em>韩渔警冲突调查:韩警平均每天扣1艘<em>中</em><em>国</em>渔船"
          ]
        }
      },
      {
        "_index" : "ik",
        "_type" : "fulltext",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "content" : "美国留给伊拉克的是个烂摊子吗"
        },
        "highlight" : {
          "content" : [
            "美<em>国</em>留给伊拉克的是个烂摊子吗"
          ]
        }
      }
    ]
  }
}

Three documents came back. The highlight sections show exactly which terms Elasticsearch considered a match:

  1. <em>中</em><em>国</em>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首
  2. <em>中</em>韩渔警冲突调查:韩警平均每天扣1艘<em>中</em><em>国</em>渔船
  3. 美<em>国</em>留给伊拉克的是个烂摊子吗

This is clearly not the result we want!

Root Cause

It turns out tokenization is to blame: the default analyzer has no idea what Chinese is, so it splits the text into individual characters. If tokenization is new to you, see the earlier post "Analyzers in Elasticsearch (Analyzer)".
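
You can watch this happen directly with the _analyze API. Running the default standard analyzer over a Chinese string returns one token per character; 中国, for instance, comes back as the two separate tokens 中 and 国:

# analyze a Chinese string with the default standard analyzer
curl -XPOST "http://localhost:9200/_analyze?pretty" -H 'Content-Type:application/json' -d '
{
  "analyzer": "standard",
  "text": "中国"
}'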

Let's see how it carved up our sentences, taking the first document as the example: 美国留给伊拉克的是个烂摊子吗

curl -H "Content-Type:application/json" "http://localhost:9200/ik/fulltext/1/_termvectors?pretty" -d '{ "fields" : ["content"] }'
{
  "_index" : "ik",
  "_type" : "fulltext",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "took" : 0,
  "term_vectors" : {
    "content" : {
      "field_statistics" : {
        "sum_doc_freq" : 14,
        "doc_count" : 1,
        "sum_ttf" : 14
      },
      "terms" : {
        "个" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 9,
              "start_offset" : 9,
              "end_offset" : 10
            }
          ]
        },
        "伊" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 4,
              "start_offset" : 4,
              "end_offset" : 5
            }
          ]
        },
        "克" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 6,
              "start_offset" : 6,
              "end_offset" : 7
            }
          ]
        },
        "吗" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 13,
              "start_offset" : 13,
              "end_offset" : 14
            }
          ]
        },
        "国" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 1,
              "end_offset" : 2
            }
          ]
        },
        "子" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 12,
              "start_offset" : 12,
              "end_offset" : 13
            }
          ]
        },
        "拉" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 5,
              "start_offset" : 5,
              "end_offset" : 6
            }
          ]
        },
        "摊" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 11,
              "start_offset" : 11,
              "end_offset" : 12
            }
          ]
        },
        "是" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 8,
              "start_offset" : 8,
              "end_offset" : 9
            }
          ]
        },
        "烂" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 10,
              "start_offset" : 10,
              "end_offset" : 11
            }
          ]
        },
        "留" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 2,
              "end_offset" : 3
            }
          ]
        },
        "的" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 7,
              "start_offset" : 7,
              "end_offset" : 8
            }
          ]
        },
        "给" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 3,
              "start_offset" : 3,
              "end_offset" : 4
            }
          ]
        },
        "美" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 1
            }
          ]
        }
      }
    }
  }
}

Every character in the sentence has been split off on its own, chopped to bits, you could say. Tragic!


The ik Analyzer

To escape this dreadful state of Chinese tokenization, we need a different analyzer. The ik analyzer, a third-party plugin and the star of this article, is the cure for Elasticsearch's habit of mangling Chinese text.

GitHub link: elasticsearch-analysis-ik (https://github.com/medcl/elasticsearch-analysis-ik)

Installation

Let's install it through the built-in plugin tool. The one thing you absolutely must watch: the ik analyzer version has to match your Elasticsearch version (6.4.2 in the example below).

# Enter the elasticsearch directory
cd /usr/share/elasticsearch

# Install the plugin
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.4.2/elasticsearch-analysis-ik-6.4.2.zip

# Restart Elasticsearch for the plugin to take effect
systemctl restart elasticsearch
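
Once Elasticsearch is back up, it's worth confirming the plugin actually loaded; either of these checks should list analysis-ik:

# list installed plugins from the command line
./bin/elasticsearch-plugin list

# or ask the running cluster
curl "http://localhost:9200/_cat/plugins?v"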

Testing

Time to test. First we create an index:

curl -XPUT "http://localhost:9200/ik1?pretty"
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "ik1"
}

Then configure the mapping, telling Elasticsearch to use the ik analyzer on the content field:

curl -XPOST "http://localhost:9200/ik1/fulltext/_mapping?pretty" -H 'Content-Type:application/json' -d'
{
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_max_word"
            }
        }

}'

{
  "acknowledged" : true
}
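
A note on ik_max_word: the plugin actually ships two analyzers. ik_max_word splits text at the finest granularity, emitting every word it can find (including overlapping ones), which is usually what you want at index time; ik_smart produces a coarser, non-overlapping split. You can compare the two on our new index with the _analyze API:

# finest-grained, exhaustive splitting (what the mapping above uses)
curl -XPOST "http://localhost:9200/ik1/_analyze?pretty" -H 'Content-Type:application/json' -d '
{ "analyzer": "ik_max_word", "text": "美国留给伊拉克的是个烂摊子吗" }'

# coarser splitting
curl -XPOST "http://localhost:9200/ik1/_analyze?pretty" -H 'Content-Type:application/json' -d '
{ "analyzer": "ik_smart", "text": "美国留给伊拉克的是个烂摊子吗" }'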

Next, index the same four documents as before:

curl -XPOST http://localhost:9200/ik1/fulltext/1 -H 'Content-Type:application/json' -d '{"content":"美国留给伊拉克的是个烂摊子吗"}'
curl -XPOST http://localhost:9200/ik1/fulltext/2 -H 'Content-Type:application/json' -d '{"content":"公安部:各地校车将享最高路权"}'
curl -XPOST http://localhost:9200/ik1/fulltext/3 -H 'Content-Type:application/json' -d '{"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"}'
curl -XPOST http://localhost:9200/ik1/fulltext/4 -H 'Content-Type:application/json' -d '{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}'

Finally, run the search again; as you can see, the results are now far more accurate:

curl -XPOST "http://localhost:9200/ik1/fulltext/_search?pretty" -H 'Content-Type:application/json' -d'
{
    "query" : { "match" : { "content" : "中国" }},
    "highlight" : {
        "fields" : {
            "content" : {}
        }
    }
}
' 

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.6489038,
    "hits": [
      {
        "_index": "ik1",
        "_type": "fulltext",
        "_id": "4",
        "_score": 0.6489038,
        "_source": {
          "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
        },
        "highlight": {
          "content": [
            "<em>中国</em>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
          ]
        }
      },
      {
        "_index": "ik1",
        "_type": "fulltext",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "content": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
        },
        "highlight": {
          "content": [
            "中韩渔警冲突调查:韩警平均每天扣1艘<em>中国</em>渔船"
          ]
        }
      }
    ]
  }
}

The search returns two documents:

  1. 中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首
  2. 中韩渔警冲突调查:韩警平均每天扣1艘中国渔船

Now that's more like it!

Verifying the Tokenization

Let's see how Elasticsearch tokenizes text now that the ik analyzer is in play, again using the first document as the example: 美国留给伊拉克的是个烂摊子吗

curl -H "Content-Type:application/json" "http://localhost:9200/ik1/fulltext/1/_termvectors?pretty" -d '{ "fields": ["content"] }'
{
  "_index" : "ik1",
  "_type" : "fulltext",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "took" : 0,
  "term_vectors" : {
    "content" : {
      "field_statistics" : {
        "sum_doc_freq" : 9,
        "doc_count" : 1,
        "sum_ttf" : 9
      },
      "terms" : {
        "个" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 5,
              "start_offset" : 9,
              "end_offset" : 10
            }
          ]
        },
        "伊拉克" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 4,
              "end_offset" : 7
            }
          ]
        },
        "吗" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 8,
              "start_offset" : 13,
              "end_offset" : 14
            }
          ]
        },
        "摊子" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 7,
              "start_offset" : 11,
              "end_offset" : 13
            }
          ]
        },
        "是" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 4,
              "start_offset" : 8,
              "end_offset" : 9
            }
          ]
        },
        "烂摊子" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 6,
              "start_offset" : 10,
              "end_offset" : 13
            }
          ]
        },
        "留给" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 2,
              "end_offset" : 4
            }
          ]
        },
        "的" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 3,
              "start_offset" : 7,
              "end_offset" : 8
            }
          ]
        },
        "美国" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 2
            }
          ]
        }
      }
    }
  }
}
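
Notice that 美国, 伊拉克, and 烂摊子 now survive as whole words (ik_max_word also keeps the overlapping 摊子). And because the mapping sets search_analyzer to ik_max_word as well, the query string is analyzed the same way: 中国 should stay a single token, which is exactly why the character-level false match on document 1 (美国...) is gone. You can confirm it yourself:

# analyze the query string with the same analyzer used at search time
curl -XPOST "http://localhost:9200/ik1/_analyze?pretty" -H 'Content-Type:application/json' -d '
{ "analyzer": "ik_max_word", "text": "中国" }'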

Epilogue

Thanks to the ik analyzer, Chinese sentences that used to be hacked into meaningless pieces are finally set free.