Published 2018-10-09 · Updated 2023-01-31 · Category: Programming

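Getting Started with Elasticsearch

Introduction

What is Elasticsearch

An open-source search engine built on Apache Lucene
Written in Java, with a simple, easy-to-use RESTful API
Scales out horizontally with little effort and can handle petabytes of structured or unstructured data

Use cases

Analytics engine for massive data sets
Site search engine
Data warehouse

Real-world use at major companies:

Wikipedia, GitHub: real-time site search
Baidu: real-time log monitoring platform

The descriptions in this introduction are taken from 瓦力老师's course 《ElasticSearch入门》, which targets version 5.x. This article is based on version 6.4.

Tips: The sections below list the Elasticsearch form next to an "equivalent" SQL form to aid understanding. In practice the two are not fully equivalent in every scenario or result set, so treat the SQL as a reference only.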

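Basic concepts

index

An index.

The official documentation traditionally likened an index to a "database" in a relational database, but Elastic has since disowned that analogy.

type

The index type. In 6.x, an index can have one and only one type.

The official documentation traditionally likened a type to a "table" in a relational database, but Elastic has since disowned that analogy.

Types were deprecated (not removed) in 6.0.0; see "Removal of mapping types" in the official documentation. For compatibility with later versions, the official recommendation is to use _doc as the type name.

In the code that follows, it is best to de-emphasize the type parameter.

document

Document data, similar to a "row" in a database.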

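Basic usage

Requests and output

Requests
- The API follows REST conventions. URI format: http://localhost:9200/<index>/<type>/<document_id>
- ?pretty=true: pretty-prints the JSON response

Output

Data format: JSON (YAML is also supported by appending ?format=yaml to the URL)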

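Field	Description
`took`	Time the request took, in milliseconds
`timed_out`	Whether the request timed out
`_shards.total`	Total number of shards queried, including shards across multiple `index` targets
`_shards.successful`	Number of shards queried successfully, including shards across multiple `index` targets
`hits`	The result data for matching documents
`hits.total`	Total number of matching documents
`hits.max_score`	Highest score among the matching documents
`hits.hits`	The matching documents themselves
`hits.hits._index`	Index the document belongs to
`hits.hits._type`	Index type the document belongs to
`hits.hits._id`	ID of the document
`hits.hits._score`	Relevance score of the document for this query
`hits.hits._source`	Original document data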
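To make the response fields concrete, here is a small Python sketch that reads them out of a search response. The `response` dictionary is a hypothetical sample shaped like a 6.x result (where `hits.total` is a plain integer), and `summarize` is a helper name invented for this illustration, not part of any client library.

```python
# Hypothetical sample response, shaped like a 6.x search result.
response = {
    "took": 3,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
    "hits": {
        "total": 1,
        "max_score": 0.2876821,
        "hits": [
            {
                "_index": "twitter",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.2876821,
                "_source": {"user": "kimchy", "message": "trying out Elasticsearch"},
            }
        ],
    },
}


def summarize(resp):
    """Pull the commonly used fields out of a search response."""
    shards = resp["_shards"]
    return {
        "took_ms": resp["took"],
        "all_shards_ok": shards["successful"] == shards["total"],
        "total": resp["hits"]["total"],
        "documents": [(hit["_id"], hit["_source"]) for hit in resp["hits"]["hits"]],
    }


summary = summarize(response)
print(summary["total"], summary["documents"][0][0])
```

Comparing `_shards.successful` with `_shards.total` is a cheap way to notice partial results before trusting `hits.total`.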

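Indices

Inverted index

Elasticsearch uses a structure called an inverted index, which is designed for fast full-text search. An inverted index consists of a list of all the unique terms that appear in any document and, for each term, a list of the documents in which it appears.

For example, say we have two documents, each with a content field containing the following: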

The quick brown fox jumped over the lazy dog
Quick brown foxes leap over lazy dogs in summer

Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
------------------------

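Important: you can only find terms that actually exist in the index, so both the indexed text and the query string must be normalized into the same form.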
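The table and the normalization rule can be reproduced with a few lines of Python. This is only a sketch of the idea: it splits on whitespace and optionally lowercases, which is far simpler than a real Elasticsearch analyzer, and `build_inverted_index` is a name made up for this example.

```python
from collections import defaultdict

DOCS = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def build_inverted_index(docs, normalize=None):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            term = normalize(token) if normalize else token
            index[term].add(doc_id)
    return dict(index)


# Without normalization, "Quick" and "quick" are two unrelated terms.
raw_index = build_inverted_index(DOCS)
# Lowercasing at index time merges them into one term.
lowercased_index = build_inverted_index(DOCS, normalize=str.lower)

print(sorted(raw_index["quick"]))         # [1]
print(sorted(raw_index["Quick"]))         # [2]
print(sorted(lowercased_index["quick"]))  # [1, 2]
```

A query for `quick` against the raw index misses document 2, which is exactly why the query string has to go through the same normalization as the indexed text.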

Create / configure an index

PUT /twitter

Delete an index

DELETE /twitter

View an index

GET /twitter

Documents

Add a document

PUT twitter/_doc/1
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

Update a document

// By document ID (a PUT replaces the entire document)
PUT twitter/_doc/1
{
    "counter" : 1,
    "tags" : ["red"]
}

// Update via script (for example, incrementing a counter)
POST twitter/_doc/1/_update
{
    "script" : {
        "source": "ctx._source.counter += params.count",
        "lang": "painless",
        "params" : {
            "count" : 4
        }
    }
}

Delete a document

// By document ID
DELETE /twitter/_doc/1

// By query
POST twitter/_delete_by_query
{
  "query": { 
    "match": {
      "message": "some message"
    }
  }
}

Get a document

GET twitter/_doc/1

Search documents (see the official documentation for URI search and the query string syntax)

GET twitter/_search?q=user:kimchy

Selecting fields

Elasticsearch

{
    "_source": ["*", "user", "user.age", "user*"]
}

SQL

select *, user from table;

From / Size (see the official documentation)

Elasticsearch

{
    "from" : 10,
    "size" : 20
}

SQL

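select * from table limit 10, 20;

from defaults to 0 and size defaults to 10, so a search with neither parameter returns only the first 10 hits. This differs from SQL, where a query without a limit returns every row.

Note that from + size cannot exceed the index setting index.max_result_window, which defaults to 10,000.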
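As a sketch of how page-based pagination maps onto from and size, the helper below converts a 1-based page number into a request body and rejects pages that would cross the default window. The function name `page_to_from_size` and the constant are assumptions for illustration; a real index may be configured with a different index.max_result_window.

```python
MAX_RESULT_WINDOW = 10_000  # assumed: the default value of index.max_result_window


def page_to_from_size(page, per_page=10):
    """Translate a 1-based page number into an Elasticsearch from/size body."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    offset = (page - 1) * per_page
    if offset + per_page > MAX_RESULT_WINDOW:
        raise ValueError("from + size exceeds index.max_result_window")
    return {"from": offset, "size": per_page}


print(page_to_from_size(2, per_page=20))  # {'from': 20, 'size': 20}
```

With 10 hits per page, page 1000 is the last one this check allows; page 1001 would need from + size = 10,010.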

Sorting

Elasticsearch

{
    "sort" : [
        {"post_date" : {"order" : "asc"}},
        "user",
        {"name" : "desc"},
        {"age" : "desc" },
        "_score"
    ]
}

SQL

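select * from table order by post_date asc, user, name desc, age desc;

If sort is not set, results are ordered by _score from highest to lowest.

Common DSL queries

Match full-text queries

match

Default matching (see the official documentation)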

Elasticsearch

GET /_search
{
    "query": {
        "match" : {
            "message" : "this is a test"
        }
    }
}

SQL

select * from table
where message like '%this%'
    or message like '%is%'
    or message like '%a%'
    or message like '%test%';

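Example where the two are not equivalent: if message is this is a test and you search for st, Elasticsearch does not match (st is not an indexed term), but the SQL LIKE does.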
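The difference comes from matching whole terms versus matching substrings. The sketch below imitates both in plain Python. The `analyze` function is a deliberately crude stand-in for an analyzer (lowercase, split on non-alphanumeric characters), and all function names are invented for this illustration.

```python
import re


def analyze(text):
    """Crude stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


def match_any(field_value, query):
    """True if at least one query term is also an indexed term (OR semantics)."""
    return bool(set(analyze(field_value)) & set(analyze(query)))


def sql_like(field_value, needle):
    """Rough equivalent of: field LIKE '%needle%'."""
    return needle in field_value


message = "this is a test"
print(match_any(message, "st"))      # False: "st" is not a whole term
print(sql_like(message, "st"))       # True: "st" is a substring of "test"
print(match_any(message, "a quiz"))  # True: the term "a" matches
```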

match_phrase

Phrase matching (see the official documentation)

Elasticsearch

GET /_search
{
    "query": {
        "match_phrase" : {
            "message" : "this is a test"
        }
    }
}

SQL

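select * from table where message like '%this is a test%';

Example where the two are not equivalent: if message is Mozilla/5.0 Windows and you search for Mozilla 5.0, Elasticsearch matches (the analyzer drops the slash, so the terms line up), but the SQL LIKE does not.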
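A phrase match compares sequences of terms, not raw characters. The following Python sketch shows why the user-agent example matches. It assumes the same crude `analyze` stand-in as a simplification of a real analyzer (which tokenizes differently in detail), and `match_phrase` here is an illustrative function, not an Elasticsearch API.

```python
import re


def analyze(text):
    """Crude stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


def match_phrase(field_value, phrase):
    """True if the query terms appear in the field as one contiguous run."""
    doc_terms, query_terms = analyze(field_value), analyze(phrase)
    n = len(query_terms)
    return n > 0 and any(
        doc_terms[i:i + n] == query_terms
        for i in range(len(doc_terms) - n + 1)
    )


user_agent = "Mozilla/5.0 Windows"
print(match_phrase(user_agent, "Mozilla 5.0"))      # True: same terms, same order
print("Mozilla 5.0" in user_agent)                  # False: the raw substring differs
print(match_phrase(user_agent, "Windows Mozilla"))  # False: order matters
```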

match_phrase_prefix

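Phrase prefix matching (see the official documentation)

Typically used for search boxes, to suggest results as the user types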

Elasticsearch

GET /_search
{
    "query": {
        "match_phrase_prefix" : {
            "message" : {
                "query" : "quick brown f",
                "max_expansions" : 10
            }
        }
    }
}

SQL

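select * from table where message like '%quick brown f%';

Important: how match_phrase_prefix differs from match_phrase:

Take the f in the example above. match_phrase_prefix looks through the terms produced by the analyzer for any term that starts with f, whereas match_phrase looks for a term that is exactly f.

For example:

Field value: 我是中国人
Indexed terms after analysis: 我, 是, 中国, 人
Search text: 是中
match_phrase_prefix matches; match_phrase does not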
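The example above can be traced in Python. This sketch takes the indexed terms exactly as listed and assumes the search text 是中 is analyzed into the two terms 是 and 中; that split is an assumption, since it depends on the analyzer in use. Both functions are illustrative stand-ins for the query semantics.

```python
def match_phrase(index_terms, query_terms):
    """Every query term must equal the indexed term at the same offset."""
    n = len(query_terms)
    return n > 0 and any(
        index_terms[i:i + n] == query_terms
        for i in range(len(index_terms) - n + 1)
    )


def match_phrase_prefix(index_terms, query_terms):
    """Like match_phrase, but the last query term only has to be a prefix."""
    n = len(query_terms)
    if n == 0:
        return False
    *head, last = query_terms
    for i in range(len(index_terms) - n + 1):
        window = index_terms[i:i + n]
        if window[:-1] == head and window[-1].startswith(last):
            return True
    return False


index_terms = ["我", "是", "中国", "人"]
query_terms = ["是", "中"]  # assumed analysis of the search text 是中

print(match_phrase(index_terms, query_terms))         # False: no term equals 中
print(match_phrase_prefix(index_terms, query_terms))  # True: 中国 starts with 中
```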

multi_match

Multi-field matching (see the official documentation)

Typically used for keyword search

Elasticsearch

GET /_search
{
  "query": {
    "multi_match" : {
      "query": "this is", 
      "fields": ["subject", "message"]
    }
  }
}

SQL

select * from table
where subject like '%this%'
    or subject like '%is%'
    or message like '%this%'
    or message like '%is%';

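Other usage

Wildcard field names: "fields": ["subject", "*_message"]
Per-field _score boosting: "fields" : [ "subject^3", "message" ]
Other match types, such as best_fields (the default), phrase, phrase_prefix, and so on

Warning: a single query cannot target more than 1024 fields.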

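Term-level queries

term

Looks up an exact value in the inverted index. It behaves like =, but not only like =; it is closer to PHP's in_array (see the official documentation).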

Elasticsearch

POST _search
{
  "query": {
    "term" : { "user" : "Kimchy" } 
  }
}

SQL

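select * from table where user = 'Kimchy';

Why doesn't the term query match my document?

String fields can be of type text (treated as full text, like the body of an email) or keyword (treated as exact values, like an email address or a zip code). Exact values (like numbers, dates, and keywords) have the exact value specified in the field added to the inverted index in order to make them searchable. Text fields, by contrast, are analyzed first, so the indexed terms may differ from the original string. (Paraphrased from the official documentation, for reference only.)

Personal takeaway: in everyday use, term queries fit numeric data best; for strings, query a keyword field instead. When Elasticsearch generates a mapping dynamically, it maps a string as text and by default also adds a keyword sub-field, so for a field called name you can query name.keyword.

Note on arrays: when a field holds an array, such as {"a": [1, 2]}, the query {"term": {"a": 1}} also finds the document. This is the in_array reading mentioned above.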
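Both points (text versus keyword, and the in_array behavior) can be sketched in Python. The analysis step here is a rough assumption (lowercase and split on whitespace), and `indexed_terms` and `term_query` are names invented for this illustration.

```python
def analyze_text(value):
    """Rough stand-in for a text field: lowercase and split on whitespace."""
    return value.lower().split()


def indexed_terms(value, field_type):
    """Return the set of terms a field value would contribute to the index."""
    values = value if isinstance(value, list) else [value]
    terms = set()
    for item in values:
        if field_type == "text":
            terms.update(analyze_text(item))
        else:
            # keyword and numeric fields are indexed as exact values
            terms.add(item)
    return terms


def term_query(value, field_type, wanted):
    """A term query asks: is this exact term in the index for the field?"""
    return wanted in indexed_terms(value, field_type)


print(term_query("Kimchy", "text", "Kimchy"))     # False: indexed as "kimchy"
print(term_query("Kimchy", "text", "kimchy"))     # True
print(term_query("Kimchy", "keyword", "Kimchy"))  # True: exact value kept
print(term_query([1, 2], "long", 1))              # True: in_array behavior
```

This is why a term query for Kimchy against a text field can come back empty while the same query against the keyword sub-field succeeds.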

terms

Looks up several exact values in the inverted index, equivalent to in (see the official documentation).

Elasticsearch

GET /_search
{
    "query": {
        "terms" : { "user" : ["kimchy", "elasticsearch"]}
    }
}

SQL

select * from table where user in ('kimchy', 'elasticsearch');

range

Range query (see the official documentation)

Elasticsearch

GET _search
{
    "query": {
        "range" : {
            "age" : {
                "gte" : 10,
                "lte" : 20
            }
        }
    }
}

SQL

select * from table where age >= 10 and age <= 20;

Parameters

Parameter	Meaning
`gte`	>=
`gt`	>
`lte`	<=
`lt`	<

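Compound queries

bool

A query that combines other query clauses. The clause types are:

Clause	Description
`must`	The condition must hold for matching documents, and it contributes to the score
`filter`	The condition must hold for matching documents, but unlike `must` its score is ignored. Filter clauses skip scoring and are candidates for caching
`should`	The condition should hold for matching documents, and each clause that matches adds to the score. At least one must match when the bool query has no `must` or `filter` clause; otherwise the clauses are optional
`must_not`	The condition must not hold for matching documents. It does not contribute to the score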

Elasticsearch

POST _search
{
  "query": {
    "bool" : {
      "must" : {
        "term" : { "user" : "kimchy" }
      },
      "filter": {
        "term" : { "tag" : "tech" }
      },
      "must_not" : {
        "range" : {
          "age" : { "gte" : 10, "lte" : 20 }
        }
      },
      "should" : [
        { "term" : { "tag" : "wow" } },
        { "term" : { "tag" : "elasticsearch" } }
      ]
    }
  }
}

SQL

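select * from table
where user = 'kimchy'
    and tag = 'tech'
    and (age < 10 or age > 20); -- not (age >= 10 and age <= 20)
-- The should clauses (tag = 'wow' or tag = 'elasticsearch') are not required here:
-- with must/filter present they only raise _score.

Personal understanding: bool works like (), must (and filter) like and, should like or (a required or only when the bool has no must or filter clause), and must_not like not.

Note: bool queries can be nested to any depth, but only the four clause types above can appear directly under bool.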
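The way the four clause types combine can be sketched as a small Python evaluator over plain predicates. This is a toy model under stated assumptions: the score is simply one point per matching must or should clause, which is not how Elasticsearch computes relevance, and `bool_match` is a name invented for this illustration.

```python
def bool_match(doc, must=(), filter_=(), should=(), must_not=()):
    """Return (matched, score) for predicates grouped like a bool query."""
    if not all(clause(doc) for clause in (*must, *filter_)):
        return False, 0
    if any(clause(doc) for clause in must_not):
        return False, 0
    should_hits = sum(1 for clause in should if clause(doc))
    # With no must/filter clause, at least one should clause has to match.
    if should and not (must or filter_) and should_hits == 0:
        return False, 0
    # Toy score: filter and must_not never contribute.
    return True, len(must) + should_hits


clauses = dict(
    must=[lambda d: d["user"] == "kimchy"],
    filter_=[lambda d: d["tag"] == "tech"],
    must_not=[lambda d: 10 <= d["age"] <= 20],
    should=[lambda d: d["tag"] == "wow", lambda d: d["tag"] == "elasticsearch"],
)

adult = {"user": "kimchy", "tag": "tech", "age": 30}
teen = {"user": "kimchy", "tag": "tech", "age": 15}

print(bool_match(adult, **clauses))                   # (True, 1)
print(bool_match(teen, **clauses))                    # (False, 0)
print(bool_match(adult, should=clauses["should"]))    # (False, 0)
```

The first result shows a document matching even though neither should clause holds, because must and filter are present. The last call drops them, and the same should clauses become mandatory.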

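Addendum: Elasticsearch automatically caches frequently used filters to speed up queries.

Filter context is in effect whenever a query clause is passed to a filter parameter, such as the filter or must_not parameters in the bool query, the filter parameter in the constant_score query, or the filter aggregation.

Tags: #Tools #Elasticsearch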
