1 When should you use Elasticsearch?
1.1 Scenarios
Relational database queries are hitting a bottleneck: consider it! Why only "consider"? ES's strength is querying, but practice shows that when it is used as a primary database, a document written just now may not show up in an immediately following query (see the sketch after these scenarios).
Data analysis scenarios: consider it! Why only "consider"? For simple, general-purpose needs it can be used at large scale, but for specialized workloads a more focused data product is the better choice; for complex aggregations, for example, ClickHouse handles deep aggregations over hundreds of millions of rows better than Elasticsearch.
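As a minimal sketch of that write-then-read delay (assuming a local node on localhost:9200 and a hypothetical index named nrt_test): a freshly indexed document only becomes searchable after the next refresh, which by default happens about once per second.
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Write a document into a hypothetical test index.
es.index(index="nrt_test", id=1, document={"title": "hello"})

# Searching immediately afterwards may return 0 hits: ES is near-real-time,
# so new documents only become visible to search after a refresh.
res = es.search(index="nrt_test", query={"match": {"title": "hello"}})
print(res['hits']['total']['value'])  # often 0 right after the write

# Force a refresh (or wait for the periodic one), then the document is found.
es.indices.refresh(index="nrt_test")
res = es.search(index="nrt_test", query={"match": {"title": "hello"}})
print(res['hits']['total']['value'])  # 1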
1.2 Versions
2 Deployment
2.1 Deployment options
2.1.1 Docker deployment
(1) Install and start Docker Desktop.
(2) Run:
docker network create elastic
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.16.2
docker run --name es01-test --net elastic -p 127.0.0.1:9200:9200 -p 127.0.0.1:9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.16.2
(3) Install and run Kibana
To analyze, visualize, and manage Elasticsearch data using an intuitive UI, install Kibana.
In a new terminal session, run:
docker pull docker.elastic.co/kibana/kibana:7.16.2
docker run --name kib01-test --net elastic -p 127.0.0.1:5601:5601 -e "ELASTICSEARCH_HOSTS=http://es01-test:9200" docker.elastic.co/kibana/kibana:7.16.2
To access Kibana, go to http://localhost:5601
2.1.2 Bare-metal deployment
Server
Version 7.17.5
# start
./bin/elasticsearch
Client
2.2 Common deployment errors
2.2.1 Fixing: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
This error appears on restart after changing network.host to 0.0.0.0 in the configuration file.
Fix
# edit /etc/sysctl.conf and add the following setting
vm.max_map_count=262144
# then run sysctl -p to apply it
sysctl -p
2.2.2 Fixing: the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nodes] must be configured
As the message says, discovery configuration is missing: at least one of discovery.seed_hosts, discovery.seed_providers, or cluster.initial_master_nodes must be set.
discovery.seed_hosts: list of cluster hosts
discovery.seed_providers: provides the list of cluster hosts from a configuration file
cluster.initial_master_nodes: nodes eligible for the initial master election at startup; required in production
Fix
# custom config
node.name: "node-1"
discovery.seed_hosts: ["127.0.0.1"]
cluster.initial_master_nodes: ["node-1"]
3 Terminology
3.1 Index, Type, and Document
Type is deprecated in 7.0 and later. (It can still be used, but it is no longer recommended.)
A MySQL instance can hold multiple Databases, and a Database can hold multiple Tables.
After Type was deprecated in ES:
An ES instance corresponds to a Database in a MySQL instance.
An Index corresponds to a Table in MySQL.
3.2 Forward and inverted indexes
They are like the table of contents of a book: the table of contents makes it easy to find the page of a chapter or section, but not where a specific keyword appears. Some books have an index at the back whose job is exactly that: locating where certain keywords occur.
For a search engine:
The forward index maps a document Id to the document's content and terms; that is, the document content can be fetched by Id.
The inverted index maps a term to document Ids; that is, document Ids can be found by searching for a term.
The query flow with an inverted index is: first find the matching document Ids by keyword, then look up each Id's full content via the forward index, and finally return the results the user wants.
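As an illustration only (a toy sketch, not how ES stores data internally), here is a forward and an inverted index built in Python with naive whitespace tokenization:
# Forward index: document id -> full content.
docs = {
    1: "elasticsearch is a search engine",
    2: "mysql is a relational database",
    3: "elasticsearch and mysql can work together",
}

# Inverted index: term -> set of document ids containing it.
inverted_index = {}
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index.setdefault(term, set()).add(doc_id)

# Query flow: term -> doc ids (inverted index) -> full content (forward index).
for doc_id in sorted(inverted_index.get("elasticsearch", set())):
    print(doc_id, docs[doc_id])
# 1 elasticsearch is a search engine
# 3 elasticsearch and mysql can work together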
3.2.1 Structure of the inverted index
The inverted index is the core of a search engine and has two main parts:
Term Dictionary: records all the terms produced by analyzing the documents; it is usually implemented as a B+Tree for efficient lookup.
Posting List: records the set of documents each term appears in; it consists of posting entries (Postings).
A posting entry (Posting) mainly contains the following information:
1. Document Id: identifies the document the term appears in.
2. Term Frequency (TF): the number of times the term occurs in the document, used for relevance scoring.
3. Position: the term's position(s) within the document's token stream, used for phrase search.
4. Offset: the start and end character offsets of the term in the document, used for highlighting.
3.3 Analysis (tokenization)
Tokenization is the process of turning text into a sequence of terms; it is also called text analysis, and in ES it is known as Analysis.
3.3.1 Analyzers
An Analyzer is the ES component dedicated to tokenization. It is composed of:
Character Filter: pre-processes the raw text, e.g. stripping HTML tags.
Tokenizer: splits the text into terms according to certain rules.
Token Filters: post-process the tokenizer's output, e.g. lowercasing, removing, or adding tokens.
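A small sketch of the three stages using the _analyze API, assuming a local client; html_strip, standard, and lowercase are all built into ES:
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Run the three analyzer stages explicitly: strip HTML, tokenize, lowercase.
result = es.indices.analyze(body={
    "char_filter": ["html_strip"],  # Character Filter: remove HTML tags
    "tokenizer": "standard",        # Tokenizer: split the text into terms
    "filter": ["lowercase"],        # Token Filters: post-process the terms
    "text": "<p>Hello World</p>"
})
print([t["token"] for t in result["tokens"]])  # ['hello', 'world']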
3.3.2 Analyze API
ES provides an API for testing analysis so you can verify the results; the endpoint is _analyze.
This API can be used in the following ways:
Test by specifying an analyzer directly
POST _analyze
{
"analyzer": "standard",
"text": "hello world"
}
analyzer specifies the analyzer to use (here the built-in standard analyzer), and text is the text to analyze.
Python example
import json

result = es.indices.analyze(body={"text": "hello world", "analyzer": "standard"})
print(json.dumps(result, indent=4, ensure_ascii=False))
---
{
"tokens": [
{
"end_offset": 5,
"token": "hello",
"type": "<ALPHANUM>",
"start_offset": 0,
"position": 0
},
{
"end_offset": 11,
"token": "world",
"type": "<ALPHANUM>",
"start_offset": 6,
"position": 1
}
]
}
Test against a field of an index
Use case: after an index is created, if queries on some field do not behave as expected, you can test how that field is analyzed.
POST text_index/_analyze
{
"field": "username",
"text": "hello world"
}
Python example
import json

result = es.indices.analyze(index="text_index", body={"field": "zone", "text": "hello world"})
print(json.dumps(result, indent=4, ensure_ascii=False))
---
{
"tokens": [
{
"end_offset": 5,
"token": "hello",
"type": "<ALPHANUM>",
"start_offset": 0,
"position": 0
},
{
"end_offset": 11,
"token": "world",
"type": "<ALPHANUM>",
"start_offset": 6,
"position": 1
}
]
}
3.3.3 Built-in analyzers
standard: the default analyzer; splits on word boundaries, supports multiple languages, and lowercases tokens.
simple: splits on non-letter characters and lowercases tokens.
keyword: does not tokenize; outputs the whole input as a single term.
pattern: splits on a custom regular expression; the default is \W+.
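A quick way to compare these analyzers side by side (a sketch, assuming a local client; the expected tokens are shown as comments):
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

for analyzer in ["standard", "simple", "keyword"]:
    result = es.indices.analyze(body={"analyzer": analyzer, "text": "Hello-World 42"})
    print(analyzer, [t["token"] for t in result["tokens"]])

# standard ['hello', 'world', '42']   split on word boundaries, lowercased
# simple   ['hello', 'world']         split on non-letters, digits dropped
# keyword  ['Hello-World 42']         the whole input kept as one token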
3.3.4 Chinese tokenization
IK
Tokenizes both Chinese and English, with ik_smart and ik_max_word modes; supports custom dictionaries and hot-reloading of the tokenization dictionary.
jieba
The most popular tokenizer in Python; supports tokenization and part-of-speech tagging, as well as traditional Chinese, custom dictionaries, and parallel tokenization.
3.3.5 Notes on using analysis
Analysis generally comes into play in the following situations:
When a document is created or updated, its fields are analyzed.
Decide explicitly whether a field needs analysis; for fields that do not, set type to keyword to save space and improve write performance.
Make good use of the _analyze API to inspect the exact tokenization of your text.
3.4 Mapping
Mapping is a key concept in Elasticsearch. It determines what data type each field in an index is stored as, which analyzer parses it, whether it has sub-fields, and so on.
Without an explicit mapping, every text field uses the standard analyzer by default, so if you want IK analysis you must configure a custom mapping.
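A minimal sketch of such a custom mapping, assuming the analysis-ik plugin is installed and using a hypothetical index named articles (index creation fails if the plugin is missing):
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Text fields use the IK analyzer instead of the default standard analyzer.
es.indices.create(index="articles", body={
    "mappings": {
        "properties": {
            "title":   {"type": "text", "analyzer": "ik_max_word"},
            "content": {"type": "text", "analyzer": "ik_max_word"},
            "tag":     {"type": "keyword"}  # keyword fields are not analyzed
        }
    }
})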
3.4.1 Core mapping data types
Elasticsearch has many data types; only the common ones are covered here. Only the text type can be analyzed; other types cannot.
Integers: byte, short, integer, long
3.4.2 How dynamic mapping assigns field types
When field types are assigned automatically as above, only text fields get an analyzer; the default is the standard analyzer.
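A small sketch of dynamic mapping in action, using a hypothetical index dyn_test that does not exist yet; ES creates it and infers the field types from the first document:
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

es.index(index="dyn_test", id=1, document={
    "age": 30,                   # inferred as long
    "name": "wang wu",           # inferred as text with a keyword sub-field
    "registered": "2022-01-03"   # inferred as date (default date detection)
})

print(es.indices.get_mapping(index="dyn_test"))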
3.4.3 Viewing an index's mapping
You can inspect the mapping of an existing index with the following command:
GET <index_name>/_mapping
For example:
GET test_index/_mapping
The result looks something like:
{
"test_index": { # index name
"mappings": { # mappings
"test_type": { # type name
"properties": { # fields
"age": { # field name
"type": "long" # field type
},
"gender": {
"type": "text",
"fields": { # sub-fields
"keyword": { # sub-field name
"type": "keyword", # sub-field type: keyword, a text type that is not analyzed
"ignore_above": 256 # strings longer than this are not indexed in the sub-field
}
}
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
3.4.4 Customizing the mapping
You can customize the mapping when creating an index, i.e. specify each field's type and the analyzer its data uses.
When customizing a mapping manually, you can only add new mapping settings; existing mappings cannot be modified.
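A hedged sketch of this rule, using the hypothetical test_index from earlier: adding a brand-new field is allowed, while changing an existing field's type is rejected by ES.
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Adding a new field to an existing mapping works.
es.indices.put_mapping(index="test_index", body={
    "properties": {
        "nickname": {"type": "keyword"}
    }
})

# Trying to change an existing field's type (e.g. age from long to text) fails
# with a mapper exception; the usual workaround is to create a new index with
# the desired mapping and reindex the data into it.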
3.4.5 Sub-fields in a mapping
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
Querying name uses the standard analyzer.
Querying name.keyword uses keyword, i.e. no analysis.
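A sketch of the difference, assuming documents in test_index have a name field mapped as above:
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Full-text query on the analyzed "name" field: matches individual terms.
res = es.search(index="test_index", query={"match": {"name": "wang"}})

# Exact match on the un-analyzed "name.keyword" sub-field:
# only documents whose whole name equals "wang wu" match.
res = es.search(index="test_index", query={"term": {"name.keyword": "wang wu"}})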
4 Interacting with ES using curl
4.1 View cluster info
curl localhost:9200
{
"name" : "86302a07d5ab",
"cluster_name" : "docker-cluster",
"cluster_uuid" : "hRFZ_8Q1SPOLkiL8R-D0UQ",
"version" : {
"number" : "7.16.2",
"build_flavor" : "default",
"build_type" : "docker",
"build_hash" : "2b937c44140b6559905130a8650c64dbd0879cfb",
"build_date" : "2021-12-18T19:42:46.604893745Z",
"build_snapshot" : false,
"lucene_version" : "8.10.1",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}
4.2 List all Indexes
$ curl 'localhost:9200/_mapping?pretty=true'
{ }
4.3 Create an Index
Create an Index with an HTTP PUT request:
$ curl -X PUT 'localhost:9200/school?pretty=true'
{
"acknowledged":true,
"shards_acknowledged":true,
"index":"school"
}
List all Indexes again:
$ curl 'localhost:9200/_mapping?pretty=true'
{
"school" : {
"mappings" : { }
}
}
4.4 Delete an Index
$ curl -X DELETE 'localhost:9200/school?pretty=true'
{
"acknowledged" : true
}
4.5 Add a record
curl -X POST -H "Content-Type: application/json" 'localhost:9200/school/_doc/1?pretty=true' -d '
{
"name": "王五"
}'
{
"_index" : "school",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1
}
4.6 Get a record
curl 'localhost:9200/school/_doc/1?pretty=true'
{
"_index" : "school",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"found" : true,
"_source" : {
"name" : "王五"
}
}
4.7 Update a record
$ curl -X POST -H "Content-Type: application/json" 'localhost:9200/school/_doc/1?pretty=true' -d '
{
"name": "张三换名字了"
}'
{
"_index" : "school",
"_type" : "_doc",
"_id" : "1",
"_version" : 2,
"result" : "updated",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 1,
"_primary_term" : 1
}
4.8 Delete a record
$ curl -X DELETE 'localhost:9200/school/_doc/1?pretty=true'
{
"_index" : "school",
"_type" : "_doc",
"_id" : "1",
"_version" : 3,
"result" : "deleted",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 2,
"_primary_term" : 1
}
5 Elasticsearch as Database
5.1 SQL
Elasticsearch 6.3 and later include a SQL feature; the supported syntax is:
SELECT select_expr [, ...]
[ FROM table_name ]
[ WHERE condition ]
[ GROUP BY grouping_element [, ...] ]
[ HAVING condition]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ LIMIT [ count ] ]
[ PIVOT ( aggregation_expr FOR column IN ( value [ [ AS ] alias ] [, ...] ) ) ]
5.1.1 Add test records
Just run the following in Kibana's Dev Tools:
POST /account/_bulk
{"index":{"_id":"1"}}
{"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"amberduke@pyrami.com","city":"Brogan","state":"IL"}
{"index":{"_id":"6"}}
{"account_number":6,"balance":5686,"firstname":"Hattie","lastname":"Bond","age":36,"gender":"M","address":"671 Bristol Street","employer":"Netagy","email":"hattiebond@netagy.com","city":"Dante","state":"TN"}
{"index":{"_id":"13"}}
{"account_number":13,"balance":32838,"firstname":"Nanette","lastname":"Bates","age":28,"gender":"F","address":"789 Madison Street","employer":"Quility","email":"nanettebates@quility.com","city":"Nogal","state":"VA"}
{"index":{"_id":"18"}}
{"account_number":18,"balance":4180,"firstname":"Dale","lastname":"Adams","age":33,"gender":"M","address":"467 Hutchinson Court","employer":"Boink","email":"daleadams@boink.com","city":"Orick","state":"MD"}
{"index":{"_id":"20"}}
{"account_number":20,"balance":16418,"firstname":"Elinor","lastname":"Ratliff","age":36,"gender":"M","address":"282 Kings Place","employer":"Scentric","email":"elinorratliff@scentric.com","city":"Ribera","state":"WA"}
{"index":{"_id":"25"}}
{"account_number":25,"balance":40540,"firstname":"Virginia","lastname":"Ayala","age":39,"gender":"F","address":"171 Putnam Avenue","employer":"Filodyne","email":"virginiaayala@filodyne.com","city":"Nicholson","state":"PA"}
{"index":{"_id":"32"}}
{"account_number":32,"balance":48086,"firstname":"Dillard","lastname":"Mcpherson","age":34,"gender":"F","address":"702 Quentin Street","employer":"Quailcom","email":"dillardmcpherson@quailcom.com","city":"Veguita","state":"IN"}
{"index":{"_id":"37"}}
{"account_number":37,"balance":18612,"firstname":"Mcgee","lastname":"Mooney","age":39,"gender":"M","address":"826 Fillmore Place","employer":"Reversus","email":"mcgeemooney@reversus.com","city":"Tooleville","state":"OK"}
{"index":{"_id":"44"}}
{"account_number":44,"balance":34487,"firstname":"Aurelia","lastname":"Harding","age":37,"gender":"M","address":"502 Baycliff Terrace","employer":"Orbalix","email":"aureliaharding@orbalix.com","city":"Yardville","state":"DE"}
{"index":{"_id":"49"}}
{"account_number":49,"balance":29104,"firstname":"Fulton","lastname":"Holt","age":23,"gender":"F","address":"451 Humboldt Street","employer":"Anocha","email":"fultonholt@anocha.com","city":"Sunriver","state":"RI"}
5.1.2 Query the first 5 records
The format parameter controls the response format: txt is plain text and easier to read at a glance; the default is json.
POST /_sql?format=txt
{
"query": "SELECT account_number,address,age,balance FROM account LIMIT 5"
}
txt result
account_number | address | age | balance
---------------+--------------------+---------------+---------------
1 |880 Holmes Lane |32 |39225
6 |671 Bristol Street |36 |5686
13 |789 Madison Street |28 |32838
18 |467 Hutchinson Court|33 |4180
20 |282 Kings Place |36 |16418
json result
{
"columns" : [
{
"name" : "account_number",
"type" : "long"
},
{
"name" : "address",
"type" : "text"
},
{
"name" : "age",
"type" : "long"
},
{
"name" : "balance",
"type" : "long"
}
],
"rows" : [
[
1,
"880 Holmes Lane",
32,
39225
],
[
6,
"671 Bristol Street",
36,
5686
],
[
13,
"789 Madison Street",
28,
32838
],
[
18,
"467 Hutchinson Court",
33,
4180
],
[
20,
"282 Kings Place",
36,
16418
]
]
}
Python client
from elasticsearch import Elasticsearch
es = Elasticsearch([{'host':'localhost','port':9200}])
res = es.sql.query(body={'query': 'SELECT account_number,address,age,balance FROM account LIMIT 5'})
print(res)
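The response is a dict with columns and rows; a small sketch of printing it as a table:
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
res = es.sql.query(body={'query': 'SELECT account_number,address,age,balance FROM account LIMIT 5'})

# res['columns'] is a list of {"name": ..., "type": ...}; res['rows'] is a list of row values.
print(" | ".join(col['name'] for col in res['columns']))
for row in res['rows']:
    print(" | ".join(str(value) for value in row))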
5.1.3 Translate SQL into DSL
from elasticsearch import Elasticsearch
es = Elasticsearch([{'host':'localhost','port':9200}])
res = es.sql.translate(body={'query': 'SELECT account_number,address,age,balance FROM account LIMIT 5'})
print(res)
Result
{
"size" : 5,
"_source" : false,
"fields" : [
{
"field" : "account_number"
},
{
"field" : "address"
},
{
"field" : "age"
},
{
"field" : "balance"
}
],
"sort" : [
{
"_doc" : {
"order" : "asc"
}
}
]
}
5.1.4 Common SQL operations
WHERE
You can use a WHERE clause to filter results, e.g. query records whose state field is VA:
POST /_sql?format=txt
{
"query": "SELECT account_number,address,age,balance,state FROM account WHERE state='VA' LIMIT 10 "
}
GROUP BY
You can use GROUP BY to group the data and compute per-group statistics such as the record count, the maximum age, and the average balance:
POST /_sql?format=txt
{
"query": "SELECT state,COUNT(*),MAX(age),AVG(balance) FROM account GROUP BY state LIMIT 10"
}
HAVING
You can use HAVING to filter the grouped data again, e.g. keep only groups with more than 15 records:
POST /_sql?format=txt
{
"query": "SELECT state,COUNT(*),MAX(age),AVG(balance) FROM account GROUP BY state HAVING COUNT(*)>15 LIMIT 10"
}
ORDER BY
You can use ORDER BY to sort the data, e.g. by the balance field from high to low:
POST /_sql?format=txt
{
"query": "SELECT account_number,address,age,balance,state FROM account ORDER BY balance DESC LIMIT 10 "
}
DESCRIBE
You can use DESCRIBE to see which fields a table (an index in ES) has, e.g. the fields of the account table:
POST /_sql?format=txt
{
"query": "DESCRIBE account"
}
SHOW TABLES
You can use SHOW TABLES to list all tables (indexes in ES).
POST /_sql?format=txt
{
"query": "SHOW TABLES"
}
5.1.5 Full-text search functions
Full-text search functions are specific to ES. Using the MATCH or QUERY function enables full-text search, and the SCORE function returns the relevance score.
MATCH()
Use the MATCH function to find records whose address contains Street:
POST /_sql?format=txt
{
"query": "SELECT account_number,address,age,balance,SCORE() FROM account WHERE MATCH(address,'Street') LIMIT 10"
}
QUERY()
Use the QUERY function to find records whose address contains Street:
POST /_sql?format=txt
{
"query": "SELECT account_number,address,age,balance,SCORE() FROM account WHERE QUERY('address:Street') LIMIT 10"
}
5.2 Table structure definition
Elasticsearch has its own DSL, which mainly falls into two areas:
6 elasticsearch-py
6.1 Example
from datetime import datetime
from elasticsearch import Elasticsearch
es = Elasticsearch()
doc = {
'author': 'kimchy',
'text': 'Elasticsearch: cool. bonsai cool.',
'timestamp': datetime.now(),
}
res = es.index(index="test-index", id=1, document=doc)
# 'created' on the first run, 'updated' on subsequent runs
print(res['result'])
res = es.get(index="test-index", id=1)
# {u'text': u'Elasticsearch: cool. bonsai cool.', u'author': u'kimchy', u'timestamp': u'2022-01-03T16:55:52.518771'}
print(res['_source'])
es.indices.refresh(index="test-index")
res = es.search(index="test-index", query={"match_all": {}})
# Got 1 Hits:
print("Got %d Hits:" % res['hits']['total']['value'])
for hit in res['hits']['hits']:
# 2022-01-03T16:55:52.518771 kimchy: Elasticsearch: cool. bonsai cool.
print("%(timestamp)s %(author)s: %(text)s" % hit["_source"])
7 Kibana
Open http://localhost:5601 and choose Dev Tools, or go directly to http://localhost:5601/app/kibana#/dev_tools .