[Cloud Computing] Elasticsearch Keyword Search: Implementing LIKE Queries

小标, 2018-12-24

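Abstract: This article shows how to implement LIKE-style keyword search in Elasticsearch by working through a concrete example. We hope it helps those learning cloud computing.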

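Background: our project needs keyword search on one field of an Elasticsearch index. Keywords mix Chinese, English, and digits, with Chinese predominating, and the search should behave like the LIKE operator in a relational database. The first approach that came to mind was the * wildcard, but in real use the query strings become quite complex and wildcards cope poorly with that: they caused slow queries and even out-of-memory errors.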

In practice a single query touches several fields, so our project uses the query_string query. The rest of this article considers only the keyword field.

Data preparation

Create the index es_test_index

PUT  127.0.0.1:9200/es_test_index
{
    "order": 0,
    "index_patterns": [
        "es_test_index"
    ],
    "settings": {
        "index": {
            "max_result_window": "30000",
            "refresh_interval": "60s",
            "number_of_shards": "3",
            "number_of_replicas": "1"
        }
    },
    "mappings": {
        "logs": {
            "_all": {
                "enabled": false
            },
            "properties": {
                "search_word": {
                    "type": "keyword"
                }
            }
        }
    }
}

Approach 1

{
    "profile":true,
    "from":0,
    "size":100,
    "query":{
        "query_string":{
            "query":"search_word:(*中国* NOT *美国* AND *VIP* AND *经济* OR *金融*)",
            "default_operator":"and"
        }
    }
}

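Using * wildcards this way is equivalent to a wildcard query, except that query_string accepts several keywords joined with AND, OR, and NOT, which makes it more flexible. The equivalent single-keyword wildcard query is: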

{
    "query": {
        "wildcard" : { "search_word" : "*中国*" }
    }
}

In our scenario every keyword has a * wildcard on both sides, and such a query is very slow because it has to iterate over every term in the index. The official documentation explains: "Matches documents that have fields matching a wildcard expression (not analyzed). Supported wildcards are *, which matches any character sequence (including the empty one), and ?, which matches any single character. Note that this query can be slow, as it needs to iterate over many terms. In order to prevent extremely slow wildcard queries, a wildcard term should not start with one of the wildcards * or ?." The documentation advises against a leading *, but a LIKE-style match needs * on both sides of the keyword, so the query is predictably slow.
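To illustrate why a leading wildcard is costly, here is a minimal Python sketch. It is not Elasticsearch code: it assumes a small sorted list (TERMS) standing in for the term dictionary of one keyword field, and the function names are hypothetical. A fixed prefix can be located by binary search and scanned until the prefix stops matching, whereas a pattern such as *中国* forces a look at every term.

```python
import bisect
import fnmatch

# Toy sorted term dictionary (assumption: stands in for one keyword field).
TERMS = sorted(["中国经济", "中国金融", "当代中国", "美国经济", "VIP中国"])

def prefix_lookup(terms, prefix):
    """Find terms starting with `prefix` via binary search on the sorted list."""
    start = bisect.bisect_left(terms, prefix)
    matches = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order lets us stop early
        matches.append(term)
    return matches

def leading_wildcard_lookup(terms, pattern):
    """A pattern like *中国* has no fixed prefix, so every term is examined."""
    examined = 0
    matches = []
    for term in terms:
        examined += 1
        if fnmatch.fnmatchcase(term, pattern):
            matches.append(term)
    return matches, examined

if __name__ == "__main__":
    print(prefix_lookup(TERMS, "中国"))
    print(leading_wildcard_lookup(TERMS, "*中国*"))
```

With only five terms the difference is invisible, but the second function always touches the whole dictionary, which is the behavior the documentation warns about.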

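In our project, users sometimes entered many keywords, and together with the other query conditions each query became very heavy, which led to a large number of timeouts. So we decided to implement the LIKE search a different way.

After studying the official documentation closely, we found that the standard tokenizer combined with match_phrase queries can do the job.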

Re-creating the index

PUT  127.0.0.1:9200/es_test_index_2

{
    "order": 0,
    "index_patterns": [
        "es_test_index_2"
    ],
    "settings": {
        "index": {
            "max_result_window": "30000",
            "refresh_interval": "60s",
            "analysis": {
                "analyzer": {
                    "custom_standard": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "char_filter": [
                            "my_char_filter"
                        ],
                        "filter": "lowercase"
                    }
                },
                "char_filter": {
                    "my_char_filter": {
                        "type": "mapping",
                        "mappings": [
                            "· => xxDOT1xx",
                            "+ => xxPLUSxx",
                            "- => xxMINUSxx",
                            "\" => xxQUOTATIONxx",
                            "（ => xxLEFTBRACKET1xx",
                            "） => xxRIGHTBRACKET1xx",
                            "& => xxANDxx",
                            "| => xxVERTICALxx",
                            "—=> xxUNDERLINExx",
                            "/=> xxSLASHxx",
                            "！=> xxEXCLAxx",
                            "=> xxDOT2xx",
                            "【=>xxLEFTBRACKET2xx",
                            "】 => xxRIGHTBRACKET2xx",
                            "`=>xxapostrophexx",
                            ".=>xxDOT3xx",
                            "#=>xxhashtagxx",
                            "，=>xxcommaxx"
                        ]
                    }
                }
            },
            "number_of_shards": "3",
            "number_of_replicas": "1"
        }
    },
    "mappings": {
        "logs": {
            "_all": {
                "enabled": false
            },
            "properties": {
                "search_text": {
                    "analyzer": "custom_standard",
                    "type": "text"
                },
                "search_word": {
                    "type": "keyword"
                }
            }
        }
    }
}

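Note that the index above defines two fields. search_word is the same as in Approach 1 and is kept so the two approaches can be compared for performance. search_text is set to type text so that it goes through an analyzer, and its analyzer is custom_standard.

custom_standard consists of:

Character filter (char_filter): a mapping char filter, which takes the raw text as a character stream and replaces certain user-defined characters with other strings. The analyzer uses the standard tokenizer, which drops most symbols, but search keywords may contain those symbols, and dropping them makes the results inaccurate. For example, a search for 红+黄 is tokenized as 红黄, so the results may include both 红+黄 and 红黄, and 红黄 is not what we want. The character filter therefore rewrites + as the string xxPLUSxx, so the + survives tokenization.

Tokenizer: standard. It works well for English; Chinese text is split into single characters.

Token filter (filter): lowercase, which lowercases the tokens produced by the tokenizer.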
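The three stages can be imitated in a few lines of Python to see why the character filter matters. This is only a rough sketch under stated assumptions: the regular expression is a simplified stand-in for the standard tokenizer (one token per CJK character, one token per run of ASCII letters or digits), CHAR_MAPPINGS copies just three of the rules from the index settings, and analyze is a hypothetical helper, not an Elasticsearch API.

```python
import re

# Subset of the article's mapping char_filter rules (symbol -> placeholder).
CHAR_MAPPINGS = {"+": "xxPLUSxx", "-": "xxMINUSxx", "&": "xxANDxx"}

# Simplified stand-in for the standard tokenizer: one token per CJK character,
# one token per run of ASCII letters/digits; everything else is dropped.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")

def analyze(text, use_char_filter=True):
    """Apply char filter, tokenizer, and lowercase filter, in that order."""
    if use_char_filter:
        for symbol, placeholder in CHAR_MAPPINGS.items():
            text = text.replace(symbol, placeholder)
    return [token.lower() for token in TOKEN_RE.findall(text)]

if __name__ == "__main__":
    print(analyze("红+黄", use_char_filter=False))  # the + is lost
    print(analyze("红+黄"))                          # the + survives as a token
    print(analyze("红黄"))
```

Without the character filter, 红+黄 and 红黄 produce the same token list, so a phrase search cannot tell them apart; with it, the placeholder token sits between the two characters.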

With the preparation done, we can query. This time we use match_phrase queries.

Approach 2:

{
    "from": 0,
    "size": 100,
    "query": {
        "query_string": {
            "query": "search_text:(\"中国\" NOT \"美国\" AND \"VIP\" AND \"经济\" OR \"金融\")",
            "default_operator": "and"
        }
    }
}
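When keywords come from user input, the quoted query string has to be assembled programmatically. The following is a minimal Python sketch of one way to do that; quote_phrase and build_phrase_query are hypothetical helpers, and the sketch only handles a flat list of required and excluded keywords joined with AND, not the full AND/OR/NOT grammar.

```python
def quote_phrase(keyword):
    """Wrap a keyword in double quotes so query_string treats it as a phrase.
    Backslashes and embedded double quotes are escaped first."""
    escaped = keyword.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def build_phrase_query(field, must, must_not=()):
    """Build a query_string request body from required and excluded keywords."""
    clauses = [quote_phrase(k) for k in must]
    clauses += [f"NOT {quote_phrase(k)}" for k in must_not]
    return {
        "query": {
            "query_string": {
                "query": f"{field}:({' AND '.join(clauses)})",
                "default_operator": "and",
            }
        }
    }

if __name__ == "__main__":
    body = build_phrase_query("search_text", ["中国", "VIP"], ["美国"])
    print(body["query"]["query_string"]["query"])
```

The resulting dictionary can be serialized with json.dumps and sent as the request body.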

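Let's see why a match_phrase query gives fuzzy matching on both sides of a keyword. (In query_string, a keyword wrapped in double quotes is run as a phrase query, which is what the request above relies on.)

A match_phrase query first analyzes the query string. Unless configured otherwise, it uses the analyzer defined for the search_text field when the index was created, custom_standard; see the official documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html if this is unclear. It then searches for the resulting terms, but keeps only documents that contain all of the terms in the same relative positions as in the query. In other words, match_phrase matches positions as well as characters. For example, suppose search_text contains 当代中国正处于高速发展时期 and we search for the keyword 中国.

At index time, search_text is split into tokens by the analyzer. The _analyze API shows the result: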

POST 127.0.0.1:9200/es_test_index_2/_analyze

{
"analyzer": "custom_standard",
"text": "当代中国正处于高速发展时期"
}

Response:

{
    "tokens": [
        {
            "token": "当",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "代",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "中",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "国",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "正",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "处",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "于",
            "start_offset": 6,
            "end_offset": 7,
            "type": "<IDEOGRAPHIC>",
            "position": 6
        },
        {
            "token": "高",
            "start_offset": 7,
            "end_offset": 8,
            "type": "<IDEOGRAPHIC>",
            "position": 7
        },
        {
            "token": "速",
            "start_offset": 8,
            "end_offset": 9,
            "type": "<IDEOGRAPHIC>",
            "position": 8
        },
        {
            "token": "发",
            "start_offset": 9,
            "end_offset": 10,
            "type": "<IDEOGRAPHIC>",
            "position": 9
        },
        {
            "token": "展",
            "start_offset": 10,
            "end_offset": 11,
            "type": "<IDEOGRAPHIC>",
            "position": 10
        },
        {
            "token": "时",
            "start_offset": 11,
            "end_offset": 12,
            "type": "<IDEOGRAPHIC>",
            "position": 11
        },
        {
            "token": "期",
            "start_offset": 12,
            "end_offset": 13,
            "type": "<IDEOGRAPHIC>",
            "position": 12
        }
    ]
}

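After analysis, search_text is split into single characters, each carrying its position. Positions are stored in the inverted index, so position-sensitive queries such as match_phrase can use them to match only documents that contain all the query terms, in the same order as in the query, with no other terms in between.

At search time the keyword 中国 is also analyzed, into the two characters 中 and 国. The match_phrase query then checks the inverted index for documents that contain both terms, with the position of 国 exactly 1 greater than the position of 中. That is exactly the behavior of a LIKE match.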
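The position check can be sketched in Python as follows. This is a toy model rather than Lucene's implementation: tokenize assumes Chinese-only text and simply emits one token per character, and build_positions and phrase_match are hypothetical names for a per-document position map and the adjacency test.

```python
from collections import defaultdict

def tokenize(text):
    """Stand-in for the analyzer on Chinese-only text: one token per character."""
    return list(text)

def build_positions(text):
    """Map each token to the list of positions where it occurs."""
    positions = defaultdict(list)
    for pos, token in enumerate(tokenize(text)):
        positions[token].append(pos)
    return positions

def phrase_match(doc_text, phrase):
    """True if the phrase tokens occur at consecutive positions in the document."""
    positions = build_positions(doc_text)
    terms = tokenize(phrase)
    if not terms or any(term not in positions for term in terms):
        return False
    # A start position works if the i-th query term sits exactly i slots later.
    return any(
        all(start + i in positions[term] for i, term in enumerate(terms))
        for start in positions[terms[0]]
    )

if __name__ == "__main__":
    print(phrase_match("当代中国正处于高速发展时期", "中国"))
    print(phrase_match("中美两国", "中国"))
```

The second call returns False even though both characters are present, because 中 and 国 are not at adjacent positions.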

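A match_phrase query does cost more than a simple match query. A match query only checks whether terms exist in the inverted index, while a match_phrase query has to compute and compare the positions of multiple, possibly repeated, terms. The Lucene nightly benchmarks show that a simple term query is about 10 times as fast as a phrase query and about 20 times as fast as a proximity query (a phrase query with slop). This cost is paid at search time, not at index time.

Usually the extra cost of match_phrase is not as scary as these numbers suggest. The gap mostly shows how fast a simple term query is. Phrase queries on typical full-text data usually finish within a few milliseconds and are perfectly usable in practice, even on a busy cluster.

In certain pathological cases phrase queries can be too expensive, but this is rare. A typical example is DNA sequences, where the same terms repeat at many positions. Using a high slop value there greatly increases the number of position calculations.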

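Now let's compare the query efficiency of the two approaches.

We use the es_test_index_2 index, where search_text is defined as in Approach 2 and search_word as in Approach 1, and load the same data into both fields.

The index holds 25302 documents, 11.3 MB.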

Approach 1: * wildcards

{
    "profile":true,
    "from":0,
    "size":100,
    "query":{
        "query_string":{
            "query":"search_word:(NOT *新品* AND *经典* OR *秒杀* NOT *预付*)",
            "fields": [],
            "type": "best_fields",
            "default_operator": "and",
            "max_determinized_states": 10000,
            "enable_position_increments": true,
            "fuzziness": "AUTO",
            "fuzzy_prefix_length": 0,
            "fuzzy_max_expansions": 50,
            "phrase_slop": 0,
            "escape": false,
            "auto_generate_synonyms_phrase_query": true,
            "fuzzy_transpositions": true,
            "boost": 1
        }
    }
}

Approach 2: match_phrase

{
    "from": 0,
    "size": 100,
    "query": {
        "query_string": {
            "query": "search_text:(NOT \"新品\" AND \"经典\" OR \"秒杀\" NOT \"预付\")",
            "fields": [],
            "type": "best_fields",
            "default_operator": "and",
            "max_determinized_states": 10000,
            "enable_position_increments": true,
            "fuzziness": "AUTO",
            "fuzzy_prefix_length": 0,
            "fuzzy_max_expansions": 50,
            "phrase_slop": 0,
            "escape": false,
            "auto_generate_synonyms_phrase_query": true,
            "fuzzy_transpositions": true,
            "boost": 1
        }
    }
}

Query results: [the timing screenshots for Approach 1 and Approach 2 are not preserved in this text]

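The two approaches differ considerably in query time, and the more keywords a query contains, the larger the gain from Approach 2. You can verify this yourself.

That solves keyword LIKE search.

Additional notes:

1.

The match_phrase query used above is an exact match: the terms must be adjacent to be found. If we want a search for 中国经济 to also return documents containing 中国当代经济, we can use the slop parameter of match_phrase (default 0). A match_phrase query with a nonzero slop is called a proximity query.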

{
    "from":0,
    "size":300,
    "query":{
        "match_phrase" : {
            "search_text" :
            {
                "query":"中国经济",
                "slop":2
            }
        }
    }
}

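The slop parameter tells match_phrase how far apart terms may be while the document still counts as a match. "How far apart" means: how many times do you need to move a term to make the query and the document match? With slop set to 2, documents containing 中国当代经济 are also returned.

In a query_string query this can be written as: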
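To make the "number of moves" idea concrete, here is a simplified Python sketch. It is an approximation of proximity matching, not Lucene's actual scorer: for each way of choosing one document position per query term, it measures how far each term is shifted relative to its offset in the query and accepts the document when the spread of those shifts is at most slop. It assumes the query terms are all distinct, and sloppy_phrase_match is a hypothetical name.

```python
from itertools import product

def sloppy_phrase_match(doc_tokens, query_tokens, slop):
    """Simplified proximity check (assumes distinct query terms)."""
    position_lists = []
    for term in query_tokens:
        found = [i for i, token in enumerate(doc_tokens) if token == term]
        if not found:
            return False  # every query term must occur in the document
        position_lists.append(found)
    for combo in product(*position_lists):
        # Shift of each term: document position minus offset within the query.
        shifts = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        if max(shifts) - min(shifts) <= slop:
            return True
    return False

if __name__ == "__main__":
    doc = list("中国当代经济")
    query = list("中国经济")
    for slop in (0, 1, 2):
        print(slop, sloppy_phrase_match(doc, query, slop))
```

In 中国当代经济 the terms 经 and 济 sit two positions later than the query expects, so the document is accepted at slop 2 and rejected at slop 0 or 1.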

{
    "from": 0,
    "size": 100,
    "query": {
        "query_string": {
            "query": "search_text:(\"中国经济\"~2)",
            "default_operator": "and"
        }
    }
}

You can also set the default slop with the phrase_slop parameter of the query_string query. For details see https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html

2.

Phrase queries can produce some unexpected results. For example:

PUT /my_index/groups/1
{
"names": [ "John Abraham", "Lincoln Smith"]
}

or

PUT /my_index/groups/1
{
"names": "John Abraham, Lincoln Smith"
}

Then we run a phrase query for Abraham Lincoln:

GET /my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": "Abraham Lincoln"
        }
    }
}

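This query matches the document above, which is not what we want. Whether names is indexed as an array of text values or as a single text value, analysis yields John Abraham Lincoln Smith (for the array form, this assumes no position gap between elements), so Abraham and Lincoln are adjacent and the phrase query matches.

To fix this, store the field as an array and set position_increment_gap in the mapping: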

DELETE /my_index/groups/

PUT /my_index/_mapping/groups
{
    "properties": {
        "names": {
            "type":                   "text",
            "position_increment_gap": 100
        }
    }
}

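The position_increment_gap setting tells Elasticsearch to increase the current term position by the given value for each new element in the array. When we index the names array again, the terms get the following positions: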

* Position 1: john

* Position 2: abraham

* Position 103: lincoln

* Position 104: smith

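Now the phrase query no longer matches the document, because abraham and lincoln are 100 positions apart. To match it you would have to add a slop of 100. The default value of position_increment_gap is 100.

The parameter can also be set when defining a custom analyzer: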
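The effect of the gap on positions can be reproduced with a short Python sketch. It is an illustration only: index_array and adjacent are hypothetical helpers, tokens are split on whitespace and lowercased, and numbering starts at 1 purely to mirror the position list above.

```python
def index_array(values, gap, first_position=1):
    """Assign positions to the tokens of each array element, skipping `gap`
    extra positions between consecutive elements."""
    positions = {}
    next_pos = first_position
    for n, value in enumerate(values):
        if n > 0:
            next_pos += gap  # jump ahead before each new array element
        for token in value.lower().split():
            positions.setdefault(token, []).append(next_pos)
            next_pos += 1
    return positions

def adjacent(positions, first, second):
    """True if `second` occurs at the position right after `first`."""
    return any(p + 1 in positions.get(second, []) for p in positions.get(first, []))

if __name__ == "__main__":
    names = ["John Abraham", "Lincoln Smith"]
    for gap in (0, 100):
        pos = index_array(names, gap)
        print(gap, pos, adjacent(pos, "abraham", "lincoln"))
```

With a gap of 0 the two names run together and abraham is immediately followed by lincoln; with a gap of 100 the adjacency disappears, so an exact phrase query cannot cross the element boundary.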

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type":      "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ],
          "position_increment_gap": 101
        }
      }
    }
  }
}