1 When should you use Elasticsearch?
1.1 Scenarios
Relational database queries are hitting a bottleneck: consider it! Why only "consider"? ES's strength is querying, but practice shows that when it is used as a primary database, a document written just now may not show up in an immediately following query (see the sketch after these scenarios).
Data analysis scenarios: consider it! Why only "consider"? For simple, general-purpose needs it can be used at large scale, but for specialized workloads a more focused data product is the better choice; for complex aggregations, for example, ClickHouse handles deep aggregations over hundreds of millions of rows better than Elasticsearch.
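As a minimal sketch of that write-then-read delay (assuming a local node on localhost:9200 and a hypothetical index named nrt_test): a freshly indexed document only becomes searchable after the next refresh, which by default happens about once per second.
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Write a document into a hypothetical test index.
es.index(index="nrt_test", id=1, document={"title": "hello"})

# Searching immediately afterwards may return 0 hits: ES is near-real-time,
# so new documents only become visible to search after a refresh.
res = es.search(index="nrt_test", query={"match": {"title": "hello"}})
print(res['hits']['total']['value'])  # often 0 right after the write

# Force a refresh (or wait for the periodic one), then the document is found.
es.indices.refresh(index="nrt_test")
res = es.search(index="nrt_test", query={"match": {"title": "hello"}})
print(res['hits']['total']['value'])  # 1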
1.2 Versions
2 Deployment
2.1 Deployment options
2.1.1 Docker deployment
(1) Install and start Docker Desktop.
(2) Run:
docker network create elastic
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.16.2
docker run --name es01-test --net elastic -p 127.0.0.1:9200:9200 -p 127.0.0.1:9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.16.2
(3) Install and run Kibana
To analyze, visualize, and manage Elasticsearch data using an intuitive UI, install Kibana.
In a new terminal session, run:
docker pull docker.elastic.co/kibana/kibana:7.16.2
docker run --name kib01-test --net elastic -p 127.0.0.1:5601:5601 -e "ELASTICSEARCH_HOSTS=http://es01-test:9200" docker.elastic.co/kibana/kibana:7.16.2
To access Kibana, go to http://localhost:5601
2.1.2 Bare-metal deployment
Server
Version 7.17.5
# start
./bin/elasticsearch
Client
2.2 Common deployment errors
2.2.1 Fixing: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
This error appears on restart after changing network.host to 0.0.0.0 in the configuration file.
Fix
# edit /etc/sysctl.conf and add the following setting
vm.max_map_count=262144
# then run sysctl -p to apply it
sysctl -p
2.2.2 Fixing: the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nodes] must be configured
As the message says, discovery configuration is missing: at least one of discovery.seed_hosts, discovery.seed_providers, or cluster.initial_master_nodes must be set.
discovery.seed_hosts: list of cluster hosts
discovery.seed_providers: provides the list of cluster hosts from a configuration file
cluster.initial_master_nodes: nodes eligible for the initial master election at startup; required in production
Fix
# custom config
node.name: "node-1"
discovery.seed_hosts: ["127.0.0.1"]
cluster.initial_master_nodes: ["node-1"]
3 Terminology
3.1 Index, Type, and Document
Type is deprecated in 7.0 and later. (It can still be used, but it is no longer recommended.)
A MySQL instance can hold multiple Databases, and a Database can hold multiple Tables.
After Type was deprecated in ES:
An ES instance corresponds to a Database in a MySQL instance.
An Index corresponds to a Table in MySQL.
3.2 Forward and inverted indexes
They are like the table of contents of a book: the table of contents makes it easy to find the page of a chapter or section, but not where a specific keyword appears. Some books have an index at the back whose job is exactly that: locating where certain keywords occur.
For a search engine:
The forward index maps a document Id to the document's content and terms; that is, the document content can be fetched by Id.
The inverted index maps a term to document Ids; that is, document Ids can be found by searching for a term.
The query flow with an inverted index is: first find the matching document Ids by keyword, then look up each Id's full content via the forward index, and finally return the results the user wants.
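As an illustration only (a toy sketch, not how ES stores data internally), here is a forward and an inverted index built in Python with naive whitespace tokenization:
# Forward index: document id -> full content.
docs = {
    1: "elasticsearch is a search engine",
    2: "mysql is a relational database",
    3: "elasticsearch and mysql can work together",
}

# Inverted index: term -> set of document ids containing it.
inverted_index = {}
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index.setdefault(term, set()).add(doc_id)

# Query flow: term -> doc ids (inverted index) -> full content (forward index).
for doc_id in sorted(inverted_index.get("elasticsearch", set())):
    print(doc_id, docs[doc_id])
# 1 elasticsearch is a search engine
# 3 elasticsearch and mysql can work together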
3.2.1 Structure of the inverted index
The inverted index is the core of a search engine and has two main parts:
Term Dictionary: records all the terms produced by analyzing the documents; it is usually implemented as a B+Tree for efficient lookup.
Posting List: records the set of documents each term appears in; it consists of posting entries (Postings).
A posting entry (Posting) mainly contains the following information:
1. Document Id: identifies the document the term appears in.
2. Term Frequency (TF): the number of times the term occurs in the document, used for relevance scoring.
3. Position: the term's position(s) within the document's token stream, used for phrase search.
4. Offset: the start and end character offsets of the term in the document, used for highlighting.
3.3 Analysis (tokenization)
Tokenization is the process of turning text into a sequence of terms; it is also called text analysis, and in ES it is known as Analysis.
3.3.1 Analyzers
An Analyzer is the ES component dedicated to tokenization. It is composed of:
Character Filter: pre-processes the raw text, e.g. stripping HTML tags.
Tokenizer: splits the text into terms according to certain rules.
Token Filters: post-process the tokenizer's output, e.g. lowercasing, removing, or adding tokens.
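A small sketch of the three stages using the _analyze API, assuming a local client; html_strip, standard, and lowercase are all built into ES:
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Run the three analyzer stages explicitly: strip HTML, tokenize, lowercase.
result = es.indices.analyze(body={
    "char_filter": ["html_strip"],  # Character Filter: remove HTML tags
    "tokenizer": "standard",        # Tokenizer: split the text into terms
    "filter": ["lowercase"],        # Token Filters: post-process the terms
    "text": "<p>Hello World</p>"
})
print([t["token"] for t in result["tokens"]])  # ['hello', 'world']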
3.3.2 Analyze API
ES provides an API for testing analysis so you can verify the results; the endpoint is _analyze.
This API can be used in the following ways:
Test by specifying an analyzer directly
POST _analyze
{
"analyzer": "standard",
"text": "hello world"
}
analyzer specifies the analyzer to use (here the built-in standard analyzer), and text is the text to analyze.
Python example
import json

result = es.indices.analyze(body={"text": "hello world", "analyzer": "standard"})
print(json.dumps(result, indent=4, ensure_ascii=False))
---
{
"tokens": [
{
"end_offset": 5,
"token": "hello",
"type": "<ALPHANUM>",
"start_offset": 0,
"position": 0
},
{
"end_offset": 11,
"token": "world",
"type": "<ALPHANUM>",
"start_offset": 6,
"position": 1
}
]
}
Test against a field of an index
Use case: after an index is created, if queries on some field do not behave as expected, you can test how that field is analyzed.
POST text_index/_analyze
{
"field": "username",
"text": "hello world"
}
Python example
import json

result = es.indices.analyze(index="text_index", body={"field": "zone", "text": "hello world"})
print(json.dumps(result, indent=4, ensure_ascii=False))
---
{
"tokens": [
{
"end_offset": 5,
"token": "hello",
"type": "<ALPHANUM>",
"start_offset": 0,
"position": 0
},
{
"end_offset": 11,
"token": "world",
"type": "<ALPHANUM>",
"start_offset": 6,
"position": 1
}
]
}
3.3.3 Built-in analyzers
standard: the default analyzer; splits on word boundaries, supports multiple languages, and lowercases tokens.
simple: splits on non-letter characters and lowercases tokens.
keyword: does not tokenize; outputs the whole input as a single term.
pattern: splits on a custom regular expression; the default is \W+.
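A quick way to compare these analyzers side by side (a sketch, assuming a local client; the expected tokens are shown as comments):
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

for analyzer in ["standard", "simple", "keyword"]:
    result = es.indices.analyze(body={"analyzer": analyzer, "text": "Hello-World 42"})
    print(analyzer, [t["token"] for t in result["tokens"]])

# standard ['hello', 'world', '42']   split on word boundaries, lowercased
# simple   ['hello', 'world']         split on non-letters, digits dropped
# keyword  ['Hello-World 42']         the whole input kept as one token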
3.3.4 Chinese tokenization
IK
Tokenizes both Chinese and English, with ik_smart and ik_max_word modes; supports custom dictionaries and hot-reloading of the tokenization dictionary.
jieba
The most popular tokenizer in Python; supports tokenization and part-of-speech tagging, as well as traditional Chinese, custom dictionaries, and parallel tokenization.
3.3.5 Notes on using analysis
Analysis generally comes into play in the following situations:
When a document is created or updated, its fields are analyzed.
Decide explicitly whether a field needs analysis; for fields that do not, set type to keyword to save space and improve write performance.
Make good use of the _analyze API to inspect the exact tokenization of your text.
3.4 Mapping
Mapping is a key concept in Elasticsearch. It determines what data type each field in an index is stored as, which analyzer parses it, whether it has sub-fields, and so on.
Without an explicit mapping, every text field uses the standard analyzer by default, so if you want IK analysis you must configure a custom mapping.
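A minimal sketch of such a custom mapping, assuming the analysis-ik plugin is installed and using a hypothetical index named articles (index creation fails if the plugin is missing):
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Text fields use the IK analyzer instead of the default standard analyzer.
es.indices.create(index="articles", body={
    "mappings": {
        "properties": {
            "title":   {"type": "text", "analyzer": "ik_max_word"},
            "content": {"type": "text", "analyzer": "ik_max_word"},
            "tag":     {"type": "keyword"}  # keyword fields are not analyzed
        }
    }
})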
3.4.1 Core mapping data types
Elasticsearch has many data types; only the common ones are covered here. Only the text type can be analyzed; other types cannot.
Integers: byte, short, integer, long
3.4.2 How dynamic mapping assigns field types
When field types are assigned automatically as above, only text fields get an analyzer; the default is the standard analyzer.
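A small sketch of dynamic mapping in action, using a hypothetical index dyn_test that does not exist yet; ES creates it and infers the field types from the first document:
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

es.index(index="dyn_test", id=1, document={
    "age": 30,                   # inferred as long
    "name": "wang wu",           # inferred as text with a keyword sub-field
    "registered": "2022-01-03"   # inferred as date (default date detection)
})

print(es.indices.get_mapping(index="dyn_test"))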
3.4.3 Viewing an index's mapping
You can inspect the mapping of an existing index with the following command:
GET <index_name>/_mapping
For example:
GET test_index/_mapping
The result looks something like:
{
"test_index": { # index name
"mappings": { # mappings
"test_type": { # type name
"properties": { # fields
"age": { # field name
"type": "long" # field type
},
"gender": {
"type": "text",
"fields": { # sub-fields
"keyword": { # sub-field name
"type": "keyword", # sub-field type: keyword, a text type that is not analyzed
"ignore_above": 256 # strings longer than this are not indexed in the sub-field
}
}
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
3.4.4 Customizing the mapping
You can customize the mapping when creating an index, i.e. specify each field's type and the analyzer its data uses.
When customizing a mapping manually, you can only add new mapping settings; existing mappings cannot be modified.
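A hedged sketch of this rule, using the hypothetical test_index from earlier: adding a brand-new field is allowed, while changing an existing field's type is rejected by ES.
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Adding a new field to an existing mapping works.
es.indices.put_mapping(index="test_index", body={
    "properties": {
        "nickname": {"type": "keyword"}
    }
})

# Trying to change an existing field's type (e.g. age from long to text) fails
# with a mapper exception; the usual workaround is to create a new index with
# the desired mapping and reindex the data into it.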
3.4.5 Sub-fields in a mapping
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
Querying name uses the standard analyzer.
Querying name.keyword uses keyword, i.e. no analysis.
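A sketch of the difference, assuming documents in test_index have a name field mapped as above:
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Full-text query on the analyzed "name" field: matches individual terms.
res = es.search(index="test_index", query={"match": {"name": "wang"}})

# Exact match on the un-analyzed "name.keyword" sub-field:
# only documents whose whole name equals "wang wu" match.
res = es.search(index="test_index", query={"term": {"name.keyword": "wang wu"}})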
4 Interacting with ES using curl
4.1 View cluster info
curl localhost:9200
{
"name" : "86302a07d5ab",
"cluster_name" : "docker-cluster",
"cluster_uuid" : "hRFZ_8Q1SPOLkiL8R-D0UQ",
"version" : {
"number" : "7.16.2",
"build_flavor" : "default",
"build_type" : "docker",
"build_hash" : "2b937c44140b6559905130a8650c64dbd0879cfb",
"build_date" : "2021-12-18T19:42:46.604893745Z",
"build_snapshot" : false,
"lucene_version" : "8.10.1",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}
4.2 List all Indexes
$ curl 'localhost:9200/_mapping?pretty=true'
{ }
4.3 Create an Index
Create an Index with an HTTP PUT request:
$ curl -X PUT 'localhost:9200/school?pretty=true'
{
"acknowledged":true,
"shards_acknowledged":true,
"index":"school"
}
List all Indexes again:
$ curl 'localhost:9200/_mapping?pretty=true'
{
"school" : {
"mappings" : { }
}
}
4.4 Delete an Index
$ curl -X DELETE 'localhost:9200/school?pretty=true'
{
"acknowledged" : true
}
4.5 Add a record
curl -X POST -H "Content-Type: application/json" 'localhost:9200/school/_doc/1?pretty=true' -d '
{
"name": "王五"
}'
{
"_index" : "school",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1
}
4.6 Get a record
curl 'localhost:9200/school/_doc/1?pretty=true'
{
"_index" : "school",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"found" : true,
"_source" : {
"name" : "王五"
}
}
4.7 Update a record
$ curl -X POST -H "Content-Type: application/json" 'localhost:9200/school/_doc/1?pretty=true' -d '
{
"name": "张三换名字了"
}'
{
"_index" : "school",
"_type" : "_doc",
"_id" : "1",
"_version" : 2,
"result" : "updated",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 1,
"_primary_term" : 1
}
4.8 Delete a record
$ curl -X DELETE 'localhost:9200/school/_doc/1?pretty=true'
{
"_index" : "school",
"_type" : "_doc",
"_id" : "1",
"_version" : 3,
"result" : "deleted",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 2,
"_primary_term" : 1
}
5 Elasticsearch as Database
5.1 SQL
Elasticsearch 6.3 and later include a SQL feature; the supported syntax is:
SELECT select_expr [, ...]
[ FROM table_name ]
[ WHERE condition ]
[ GROUP BY grouping_element [, ...] ]
[ HAVING condition]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ LIMIT [ count ] ]
[ PIVOT ( aggregation_expr FOR column IN ( value [ [ AS ] alias ] [, ...] ) ) ]
5.1.1 Add test records
Just run the following in Kibana's Dev Tools:
POST /account/_bulk
{"index":{"_id":"1"}}
{"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"amberduke@pyrami.com","city":"Brogan","state":"IL"}
{"index":{"_id":"6"}}
{"account_number":6,"balance":5686,"firstname":"Hattie","lastname":"Bond","age":36,"gender":"M","address":"671 Bristol Street","employer":"Netagy","email":"hattiebond@netagy.com","city":"Dante","state":"TN"}
{"index":{"_id":"13"}}
{"account_number":13,"balance":32838,"firstname":"Nanette","lastname":"Bates","age":28,"gender":"F","address":"789 Madison Street","employer":"Quility","email":"nanettebates@quility.com","city":"Nogal","state":"VA"}
{"index":{"_id":"18"}}
{"account_number":18,"balance":4180,"firstname":"Dale","lastname":"Adams","age":33,"gender":"M","address":"467 Hutchinson Court","employer":"Boink","email":"daleadams@boink.com","city":"Orick","state":"MD"}
{"index":{"_id":"20"}}
{"account_number":20,"balance":16418,"firstname":"Elinor","lastname":"Ratliff","age":36,"gender":"M","address":"282 Kings Place","employer":"Scentric","email":"elinorratliff@scentric.com","city":"Ribera","state":"WA"}
{"index":{"_id":"25"}}
{"account_number":25,"balance":40540,"firstname":"Virginia","lastname":"Ayala","age":39,"gender":"F","address":"171 Putnam Avenue","employer":"Filodyne","email":"virginiaayala@filodyne.com","city":"Nicholson","state":"PA"}
{"index":{"_id":"32"}}
{"account_number":32,"balance":48086,"firstname":"Dillard","lastname":"Mcpherson","age":34,"gender":"F","address":"702 Quentin Street","employer":"Quailcom","email":"dillardmcpherson@quailcom.com","city":"Veguita","state":"IN"}
{"index":{"_id":"37"}}
{"account_number":37,"balance":18612,"firstname":"Mcgee","lastname":"Mooney","age":39,"gender":"M","address":"826 Fillmore Place","employer":"Reversus","email":"mcgeemooney@reversus.com","city":"Tooleville","state":"OK"}
{"index":{"_id":"44"}}
{"account_number":44,"balance":34487,"firstname":"Aurelia","lastname":"Harding","age":37,"gender":"M","address":"502 Baycliff Terrace","employer":"Orbalix","email":"aureliaharding@orbalix.com","city":"Yardville","state":"DE"}
{"index":{"_id":"49"}}
{"account_number":49,"balance":29104,"firstname":"Fulton","lastname":"Holt","age":23,"gender":"F","address":"451 Humboldt Street","employer":"Anocha","email":"fultonholt@anocha.com","city":"Sunriver","state":"RI"}
5.1.2 Query the first 5 records
The format parameter controls the response format: txt is plain text and easier to read at a glance; the default is json.
POST /_sql?format=txt
{
"query": "SELECT account_number,address,age,balance FROM account LIMIT 5"
}
txt result
account_number | address | age | balance
---------------+--------------------+---------------+---------------
1 |880 Holmes Lane |32 |39225
6 |671 Bristol Street |36 |5686
13 |789 Madison Street |28 |32838
18 |467 Hutchinson Court|33 |4180
20 |282 Kings Place |36 |16418
json result
{
"columns" : [
{
"name" : "account_number",
"type" : "long"
},
{
"name" : "address",
"type" : "text"
},
{
"name" : "age",
"type" : "long"
},
{
"name" : "balance",
"type" : "long"
}
],
"rows" : [
[
1,
"880 Holmes Lane",
32,
39225
],
[
6,
"671 Bristol Street",
36,
5686
],
[
13,
"789 Madison Street",
28,
32838
],
[
18,
"467 Hutchinson Court",
33,
4180
],
[
20,
"282 Kings Place",
36,
16418
]
]
}
Python client
from elasticsearch import Elasticsearch
es = Elasticsearch([{'host':'localhost','port':9200}])
res = es.sql.query(body={'query': 'SELECT account_number,address,age,balance FROM account LIMIT 5'})
print(res)
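The response is a dict with columns and rows; a small sketch of printing it as a table:
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
res = es.sql.query(body={'query': 'SELECT account_number,address,age,balance FROM account LIMIT 5'})

# res['columns'] is a list of {"name": ..., "type": ...}; res['rows'] is a list of row values.
print(" | ".join(col['name'] for col in res['columns']))
for row in res['rows']:
    print(" | ".join(str(value) for value in row))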
5.1.3 Translate SQL into DSL
from elasticsearch import Elasticsearch
es = Elasticsearch([{'host':'localhost','port':9200}])
res = es.sql.translate(body={'query': 'SELECT account_number,address,age,balance FROM account LIMIT 5'})
print(res)
Result
{
"size" : 5,
"_source" : false,
"fields" : [
{
"field" : "account_number"
},
{
"field" : "address"
},
{
"field" : "age"
},
{
"field" : "balance"
}
],
"sort" : [
{
"_doc" : {
"order" : "asc"
}
}
]
}
5.1.4 Common SQL operations
WHERE
You can use a WHERE clause to filter results, e.g. query records whose state field is VA:
POST /_sql?format=txt
{
"query": "SELECT account_number,address,age,balance,state FROM account WHERE state='VA' LIMIT 10 "
}
GROUP BY
You can use GROUP BY to group the data and compute per-group statistics such as the record count, the maximum age, and the average balance:
POST /_sql?format=txt
{
"query": "SELECT state,COUNT(*),MAX(age),AVG(balance) FROM account GROUP BY state LIMIT 10"
}
HAVING
You can use HAVING to filter the grouped data again, e.g. keep only groups with more than 15 records:
POST /_sql?format=txt
{
"query": "SELECT state,COUNT(*),MAX(age),AVG(balance) FROM account GROUP BY state HAVING COUNT(*)>15 LIMIT 10"
}
ORDER BY
You can use ORDER BY to sort the data, e.g. by the balance field from high to low:
POST /_sql?format=txt
{
"query": "SELECT account_number,address,age,balance,state FROM account ORDER BY balance DESC LIMIT 10 "
}
DESCRIBE
You can use DESCRIBE to see which fields a table (an index in ES) has, e.g. the fields of the account table:
POST /_sql?format=txt
{
"query": "DESCRIBE account"
}
SHOW TABLES
You can use SHOW TABLES to list all tables (indexes in ES).
POST /_sql?format=txt
{
"query": "SHOW TABLES"
}
5.1.5 Full-text search functions
Full-text search functions are specific to ES. Using the MATCH or QUERY function enables full-text search, and the SCORE function returns the relevance score.
MATCH()
Use the MATCH function to find records whose address contains Street:
POST /_sql?format=txt
{
"query": "SELECT account_number,address,age,balance,SCORE() FROM account WHERE MATCH(address,'Street') LIMIT 10"
}
QUERY()
Use the QUERY function to find records whose address contains Street:
POST /_sql?format=txt
{
"query": "SELECT account_number,address,age,balance,SCORE() FROM account WHERE QUERY('address:Street') LIMIT 10"
}
5.2 Table structure definition
Elasticsearch has its own DSL, which mainly falls into two areas:
6 elasticsearch-py
6.1 Example
from datetime import datetime
from elasticsearch import Elasticsearch
es = Elasticsearch()
doc = {
'author': 'kimchy',
'text': 'Elasticsearch: cool. bonsai cool.',
'timestamp': datetime.now(),
}
res = es.index(index="test-index", id=1, document=doc)
# 'created' on the first run, 'updated' on subsequent runs
print(res['result'])
res = es.get(index="test-index", id=1)
# {u'text': u'Elasticsearch: cool. bonsai cool.', u'author': u'kimchy', u'timestamp': u'2022-01-03T16:55:52.518771'}
print(res['_source'])
es.indices.refresh(index="test-index")
res = es.search(index="test-index", query={"match_all": {}})
# Got 1 Hits:
print("Got %d Hits:" % res['hits']['total']['value'])
for hit in res['hits']['hits']:
# 2022-01-03T16:55:52.518771 kimchy: Elasticsearch: cool. bonsai cool.
print("%(timestamp)s %(author)s: %(text)s" % hit["_source"])
7 Kibana
Open http://localhost:5601 and choose Dev Tools, or go directly to http://localhost:5601/app/kibana#/dev_tools .