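The elasticsearch-analysis-hanlp plugins by KennFalcon, AnyListen, and muxiaobai
are no longer updated, so I had to rewrite one myself. The code also draws on the IK analyzer and some of the official Elasticsearch examples.
Note: this is a classic plugin, not a stable plugin, so it must be recompiled and retested for every version change.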
- Clone the repository, then build and package it
# Large files (>100 MB) are tracked with git-lfs, as recorded in .gitattributes
git clone https://github.com/Y2k38/analysis-hanlp.git
cd analysis-hanlp
./gradlew build
- Run the install command
cd /path/to/elasticsearch
./bin/elasticsearch-plugin install file:///path/to/analysis-hanlp/build/distributions/elasticsearch-analysis-hanlp-x.y.z.zip
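- File permission hack
ES currently supports only the read and readlink file permissions, but HanLP needs to read and write cache files. One workaround is to place the data in the config directory, which is readable and writable even when the security policy declares only read. This is very much a hack.
# cd /path/to/elasticsearch
# The hanlp.properties config file already points root to config/analysis-hanlp/
# 1. readlink is not permitted, so this approach was abandoned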
# ln -s plugins/analysis-hanlp/data config/analysis-hanlp/
# 2. The data is moved manually with mv rather than shipped in config, because ES does not allow directories inside config (a hack)
mv plugins/analysis-hanlp/data config/analysis-hanlp/
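The move above is easy to forget after a reinstall, so a small check may help. The following is a minimal sketch, not part of the plugin: `check_hanlp_layout` is a hypothetical helper, and it assumes only the directory names used in the commands above (`plugins/analysis-hanlp` and `config/analysis-hanlp/data`).

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the plugin): verifies that the plugin
# is installed and that its data directory was moved under config/.
# $1 is the Elasticsearch home directory, e.g. /path/to/elasticsearch.
check_hanlp_layout() {
  es_home="$1"
  # The plugin itself must be installed.
  if [ ! -d "$es_home/plugins/analysis-hanlp" ]; then
    echo "missing: plugins/analysis-hanlp"
    return 1
  fi
  # The data directory must no longer sit inside the plugin directory.
  if [ -d "$es_home/plugins/analysis-hanlp/data" ]; then
    echo "not moved: plugins/analysis-hanlp/data"
    return 1
  fi
  # The data directory must exist where hanlp.properties points.
  if [ ! -d "$es_home/config/analysis-hanlp/data" ]; then
    echo "missing: config/analysis-hanlp/data"
    return 1
  fi
  echo "ok"
}
```

Calling `check_hanlp_layout /path/to/elasticsearch` prints `ok` when the layout matches the steps above. Running `./bin/elasticsearch-plugin list` additionally shows whether Elasticsearch itself sees the plugin.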
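Feature list: see KennFalcon/elasticsearch-analysis-hanlp
Supported tokenizers:
- hanlp: HanLP default tokenizer
- hanlp_standard: standard tokenizer
- hanlp_index: index tokenizer
- hanlp_nlp: NLP tokenizer
- hanlp_crf: CRF tokenizer
- hanlp_n_short: N-shortest-path tokenizer
- hanlp_dijkstra: shortest-path tokenizer
- hanlp_speed: high-speed dictionary tokenizer
Note: the current version removes the local dictionary hot reload found in KennFalcon's version.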
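To use one of these tokenizers at index time, it can be wrapped in a custom analyzer in the index settings. The sketch below is an illustration under assumptions: `hanlp_text` is a hypothetical analyzer name, the host `localhost:9200` is taken from the request example in this README, and the index name is whatever you pass in.

```shell
#!/bin/sh
# Prints index settings that define a custom analyzer on top of the given
# tokenizer. "hanlp_text" is a hypothetical analyzer name chosen for this example.
hanlp_settings() {
  tokenizer="$1"
  cat <<EOF
{
  "settings": {
    "analysis": {
      "analyzer": {
        "hanlp_text": {
          "type": "custom",
          "tokenizer": "$tokenizer"
        }
      }
    }
  }
}
EOF
}

# Creates an index with those settings.
# $1 = index name, $2 = tokenizer name; the host is an assumption.
create_index() {
  curl -s -X PUT "http://localhost:9200/$1" \
    -H 'Content-Type: application/json' \
    -d "$(hanlp_settings "$2")"
}
```

For example, `create_index my_index hanlp_index` would create an index whose `hanlp_text` analyzer uses the index tokenizer. Fields then opt in through `"analyzer": "hanlp_text"` in their mapping.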
POST http://localhost:9200/twitter2/_analyze
{
"text": "美国阿拉斯加州发生8.0级地震",
"tokenizer": "hanlp"
}
{
"tokens": [
{
"token": "美国",
"start_offset": 0,
"end_offset": 2,
"type": "nsf",
"position": 0
},
{
"token": "阿拉斯加州",
"start_offset": 0,
"end_offset": 5,
"type": "nsf",
"position": 1
},
{
"token": "发生",
"start_offset": 0,
"end_offset": 2,
"type": "v",
"position": 2
},
{
"token": "8.0",
"start_offset": 0,
"end_offset": 3,
"type": "m",
"position": 3
},
{
"token": "级",
"start_offset": 0,
"end_offset": 1,
"type": "q",
"position": 4
},
{
"token": "地震",
"start_offset": 0,
"end_offset": 2,
"type": "n",
"position": 5
}
]
}
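The same request can be sent from the command line with curl. This is a minimal sketch: `analyze_payload` and `analyze` are hypothetical helper names, and the payload builder does no JSON escaping, so it assumes the text contains no double quotes or backslashes.

```shell
#!/bin/sh
# Builds the JSON body for the _analyze API.
# $1 = text to analyze, $2 = tokenizer name.
# No JSON escaping is done: the text must not contain quotes or backslashes.
analyze_payload() {
  printf '{"text": "%s", "tokenizer": "%s"}\n' "$1" "$2"
}

# Sends the request. $1 = index name, $2 = text, $3 = tokenizer name.
# The host localhost:9200 matches the request example in this README.
analyze() {
  curl -s -X POST "http://localhost:9200/$1/_analyze" \
    -H 'Content-Type: application/json' \
    -d "$(analyze_payload "$2" "$3")"
}
```

Calling `analyze twitter2 "美国阿拉斯加州发生8.0级地震" hanlp` sends the request shown above. Swapping the last argument for another tokenizer name from the list, such as `hanlp_index`, makes it easy to compare segmentations.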