您现在的位置是:首页 > 文章详情

Elasticsearch 如何实现 SQL 语句中 Group By 和 Limit 的功能

日期:2020-07-27点击:551

作者介绍

魏彬,普翔科技 CTO,开源软件爱好者,中国第一位 Elastic 认证工程师,《Elastic日报》和 《ElasticTalk》社区项目发起人,被 elastic 中国公司授予 2019 年度合作伙伴架构师特别贡献奖。对 Elasticsearch、Kibana、Beats、Logstash、Grafana 等开源软件有丰富的实践经验,为零售、金融、保险、证券、科技等众多行业的客户提供过咨询和培训服务,帮助客户在实际业务中找准开源软件的定位,实现从 0 到 1 的落地、从 1 到 N 的拓展,产生实际的业务价值。

有 SQL 背景的同学在学习 Elasticsearch 时,面对一个查询需求,不由自主地会先思考如何用 SQL 来实现,然后再去想 Elasticsearch 的 Query DSL 如何实现。那么本篇就给大家讲一条常见的 SQL 语句如何用 Elasticsearch 的查询语言实现。

一、SQL语句

假设我们有一个汽车的数据集,每个汽车都有车型、颜色等字段,我希望获取颜色种类大于1个的前2车型。假设汽车的数据模型如下:

{ "model":"modelA", "color":"red" }

假设我们有一个 cars 表,通过如下语句创建测试数据。

INSERT INTO cars (model,color) VALUES ('A','red'); INSERT INTO cars (model,color) VALUES ('A','white'); INSERT INTO cars (model,color) VALUES ('A','black'); INSERT INTO cars (model,color) VALUES ('A','yellow'); INSERT INTO cars (model,color) VALUES ('B','red'); INSERT INTO cars (model,color) VALUES ('B','white'); INSERT INTO cars (model,color) VALUES ('C','black'); INSERT INTO cars (model,color) VALUES ('C','red'); INSERT INTO cars (model,color) VALUES ('C','white'); INSERT INTO cars (model,color) VALUES ('C','yellow'); INSERT INTO cars (model,color) VALUES ('C','blue'); INSERT INTO cars (model,color) VALUES ('D','red'); INSERT INTO cars (model,color) VALUES ('A','red');

那么实现我们需求的 SQL 语句也比较简单,实现如下:

SELECT model,COUNT(DISTINCT color) color_count FROM cars GROUP BY model HAVING color_count > 1 ORDER BY color_count desc LIMIT 2;

这条查询语句中 Group By 是按照 model 做分组, Having color_count>1 限定了车型颜色种类大于1,ORDER BY color_count desc 限定结果按照颜色种类倒序排列,而 LIMIT 2 限定只返回前3条数据。
那么在 Elasticsearch 中如何实现这个需求呢?

二、在 Elasticsearch 模拟测试数据

首先我们需要先在 elasticsearch 中插入测试的数据,这里我们使用 bulk 接口 ,如下所示:

POST _bulk {"index":{"_index":"cars","_type":"doc","_id":"1"}} {"model":"A","color":"red"} {"index":{"_index":"cars","_type":"doc","_id":"2"}} {"model":"A","color":"white"} {"index":{"_index":"cars","_type":"doc","_id":"3"}} {"model":"A","color":"black"} {"index":{"_index":"cars","_type":"doc","_id":"4"}} {"model":"A","color":"yellow"} {"index":{"_index":"cars","_type":"doc","_id":"5"}} {"model":"B","color":"red"} {"index":{"_index":"cars","_type":"doc","_id":"6"}} {"model":"B","color":"white"} {"index":{"_index":"cars","_type":"doc","_id":"7"}} {"model":"C","color":"black"} {"index":{"_index":"cars","_type":"doc","_id":"8"}} {"model":"C","color":"red"} {"index":{"_index":"cars","_type":"doc","_id":"9"}} {"model":"C","color":"white"} {"index":{"_index":"cars","_type":"doc","_id":"10"}} {"model":"C","color":"yellow"} {"index":{"_index":"cars","_type":"doc","_id":"11"}} {"model":"C","color":"blue"} {"index":{"_index":"cars","_type":"doc","_id":"12"}} {"model":"D","color":"red"} {"index":{"_index":"cars","_type":"doc","_id":"13"}} {"model":"A","color":"red"}

其中 index 为 cars,type 为 doc,所有数据与mysql 数据保持一致。大家可以在 Kibana 的 Dev Tools 中执行上面的命令,然后执行下面的查询语句验证数据是否已经成功存入。

GET cars/_search

三、Group By VS Terms/Metric Aggregation

SQL 中 Group By 语句在 Elasticsearch 中对应的是 Terms Aggregation,即分桶聚合,对应 Group By color 的语句如下所示:

GET cars/_search { "size":0, "aggs":{ "models":{ "terms":{ "field":"model.keyword" } } } }

结果如下:

{ "took": 161, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 13, "max_score": 0, "hits": [] }, "aggregations": { "models": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "A", "doc_count": 5 }, { "key": "C", "doc_count": 5 }, { "key": "B", "doc_count": 2 }, { "key": "D", "doc_count": 1 } ] } } }

我们看 aggregations 这个 key 下面的即为返回结果。
SQL 语句中还有一项是 COUNT(DISTINCT color) color_count 用于计算每个 model 的颜色数,在 Elasticsearch 中我们需要使用一个指标类聚合 Cardinality ,进行不同值计数。语句如下:

GET cars/_search { "size": 0, "aggs": { "models": { "terms": { "field": "model.keyword" }, "aggs": { "color_count": { "cardinality": { "field": "color.keyword" } } } } } }

其返回结果如下:

{ "took": 74, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 13, "max_score": 0, "hits": [] }, "aggregations": { "models": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "A", "doc_count": 5, "color_count": { "value": 4 } }, { "key": "C", "doc_count": 5, "color_count": { "value": 5 } }, { "key": "B", "doc_count": 2, "color_count": { "value": 2 } }, { "key": "D", "doc_count": 1, "color_count": { "value": 1 } } ] } } }

结果中 color_count 即为每个 model 的颜色数,但这里所有的模型都返回了,我们只想要颜色数大于1的模型,因此这里还要加一个过滤条件。

四、Having Condition VS Bucket Filter Aggregation

Having color_count > 1 在 Elasticsearch 中对应的是 Bucket Filter 聚合,语句如下所示:

GET cars/_search { "size": 0, "aggs": { "models": { "terms": { "field": "model.keyword" }, "aggs": { "color_count": { "cardinality": { "field": "color.keyword" } }, "color_count_filter": { "bucket_selector": { "buckets_path": { "colorCount": "color_count" }, "script": "params.colorCount>1" } } } } } }

返回结果如下:

{ "took": 39, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 13, "max_score": 0, "hits": [] }, "aggregations": { "models": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "A", "doc_count": 5, "color_count": { "value": 4 } }, { "key": "C", "doc_count": 5, "color_count": { "value": 5 } }, { "key": "B", "doc_count": 2, "color_count": { "value": 2 } } ] } } }

此时返回结果只包含颜色数大于1的模型,但大家会发现颜色数多的 C 不是在第一个位置,我们还需要做排序处理。

五、Order By Limit VS Bucket Sort Aggregation

ORDER BY color_count desc LIMIT 3 在 Elasticsearch 中可以使用 Bucket Sort 聚合实现,语句如下所示:

GET cars/_search { "size": 0, "aggs": { "models": { "terms": { "field": "model.keyword" }, "aggs": { "color_count": { "cardinality": { "field": "color.keyword" } }, "color_count_filter": { "bucket_selector": { "buckets_path": { "colorCount": "color_count" }, "script": "params.colorCount>1" } }, "color_count_sort": { "bucket_sort": { "sort": { "color_count": "desc" }, "size": 2 } } } } } }

返回结果如下:

{ "took": 32, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 13, "max_score": 0, "hits": [] }, "aggregations": { "models": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "C", "doc_count": 5, "color_count": { "value": 5 } }, { "key": "A", "doc_count": 5, "color_count": { "value": 4 } } ] } } }

至此我们便将 SQL 语句实现的功能用 Elasticsearch 查询语句实现了。对比 SQL 语句与 Elasticsearch 的查询语句,大家会发现后者复杂了很多,但并非无章可循,随着大家对常见语法越来越熟悉,相信一定会越写越得心应手!

声明:本文由原文《Elasticsearch如何实现 SQL语句中 Group By 和 Limit 的功能》作者“魏彬”授权转载,对未经许可擅自使用者,保留追究其法律责任的权利。


image.png

阿里云Elastic Stack】100%兼容开源ES,独有9大能力,提供免费X-pack服务(单节点价值$6000)

相关活动


更多折扣活动,请访问阿里云 Elasticsearch 官网

阿里云 Elasticsearch 商业通用版,1核2G ,SSD 20G首月免费
阿里云 Logstash 2核4G首月免费


image.png

image.png

原文链接:https://yq.aliyun.com/articles/768999
关注公众号

低调大师中文资讯倾力打造互联网数据资讯、行业资源、电子商务、移动互联网、网络营销平台。

持续更新报道IT业界、互联网、市场资讯、驱动更新,是最及时权威的产业资讯及硬件资讯报道平台。

转载内容版权归作者及来源网站所有,本站原创内容转载请注明来源。

文章评论

共有0条评论来说两句吧...

文章二维码

扫描即可查看该文章

点击排行

推荐阅读

最新文章