本篇文章为大家展示了如何用ElasticSearch实现基于标签的兴趣推荐,内容简明扼要并且容易理解,绝对能使你眼前一亮,通过这篇文章的详细介绍希望你能有所收获。
下面将通过ElasticSearch(简称ES)倒排索引的特性实现基于标签的兴趣推荐
操作系统:ubuntu 20.04
Docker version 19.03.8
ElasticSearch 7.X
Curl工具,推荐Insomnia
ES GUI工具, 推荐appbaseio/dejavu
文章索引中有字段tags,存储了文章有关的标签
每个用户都有自己的兴趣标签tags
兴趣推荐就是用兴趣标签去匹配文章的标签,用户的一个兴趣标签命中N篇文章,用户的多个兴趣标签命中M篇文章,M和N有交叉,即文章中有重复,重复出现次数最多的文章就是最贴近用户兴趣的。原理理解起来简单,使用ES的目的是解决快速查询和排序的问题。
docker环境安装单机版ES,用来测试
docker run -d --name elasticsearch -v /home/cherokee/docker-data/es-data:/usr/share/elasticsearch/data -e http.cors.enabled=true -e http.cors.allow-origin="*" -e http.cors.allow-headers=X-Requested-With,X-Auth-Token,Content-Type,Content-Length,Authorization -e http.cors.allow-credentials=true -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" successage/es-ik
在本地启动了ES服务,通过 http://localhost:9200 可以访问
创建一个名为rcmd的索引
curl --request PUT \ --url http://localhost:9200/rcmd
curl --request PUT \
--url http://localhost:9200/rcmd/_mapping \
--header 'content-type: application/json' \
--data '{
"properties": {
"tags": {
"type": "keyword",
"store": true
},
"update_time": {
"type": "date",
"store": true
}
}
}'
两个字段:
tags,文章的兴趣标签,keyword类型就是不需要全文检索,标签以数组的形式存放
update_time,更新时间,这是给兴趣推荐加一个额外的排序条件,实际项目中往往是需要结合时间和匹配度来排序的
插入一些数据
curl --request POST \
--url http://localhost:9200/rcmd/_doc \
--header 'content-type: application/json' \
--data '{
"tags": [
"布料",
"抹布",
"裤子",
"衣服",
"生活"
],
"update_time": "2020-06-01T00:02:11.030"
}'
再插入一条,同样标签,但是时间不一样,后面例子中有妙用
curl --request POST \
--url http://localhost:9200/rcmd/_doc \
--header 'content-type: application/json' \
--data '{
"tags": [
"布料",
"抹布",
"裤子",
"衣服",
"生活"
],
"update_time": "2020-07-01T00:02:11.030"
}'
curl --request POST \
--url http://localhost:9200/rcmd/_doc \
--header 'content-type: application/json' \
--data '{
"tags": [
"啤酒",
"米酒",
"饮料",
"餐饮",
"生活"
],
"update_time": "2020-06-02T00:02:11.030"
}'
curl --request POST \
--url http://localhost:9200/rcmd/_doc \
--header 'content-type: application/json' \
--data '{
"tags": [
"火锅",
"自助餐",
"外卖",
"烧烤",
"餐饮"
],
"update_time": "2020-06-03T00:02:11.030"
}'
curl --request POST \
--url http://localhost:9200/rcmd/_doc \
--header 'content-type: application/json' \
--data '{
"tags": [
"太阳",
"月亮",
"大海",
"星星",
"自然"
],
"update_time": "2020-06-01T00:02:11.030"
}'
curl --request POST \
--url http://localhost:9200/rcmd/_doc \
--header 'content-type: application/json' \
--data '{
"tags": [
"人类",
"动物",
"植物",
"地球",
"自然"
],
"update_time": "2020-06-01T00:02:11.030"
}'
curl --request POST \
--url http://localhost:9200/rcmd/_doc \
--header 'content-type: application/json' \
--data '{
"tags": [
"男人",
"女人",
"小孩",
"老人",
"人类"
],
"update_time": "2020-06-02T00:02:11.030"
}'
最终数据如下
curl --request POST \
--url http://localhost:9200/rcmd/_search \
--header 'content-type: application/json' \
--data '{
"query": {
"bool": {
"should": [
{
"constant_score": {
"boost": 1,
"filter": {
"match": {
"tags": "生活"
}
}
}
},
{
"constant_score": {
"boost": 1,
"filter": {
"match": {
"tags": "衣服"
}
}
}
},
{
"constant_score": {
"boost": 1,
"filter": {
"match": {
"tags": "火锅"
}
}
}
}
]
}
}
}'
should表达式的意义是匹配“生活”、“衣服”、“火锅”三个标签中任何一个的文章都可以返回。用constant_score查询,如果某个文章涵盖标签越多分值就越高。也就是说如果某个文章标签完全涵盖了这三个标签,那么它的分值最高的。查询结果如下:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": 2.0,
"hits": [
{
"_index": "rcmd",
"_type": "_doc",
"_id": "brQO63MBTdXKc2eArv9A",
"_score": 2.0,
"_source": {
"tags": [
"布料",
"抹布",
"裤子",
"衣服",
"生活"
],
"update_time": "2020-06-01T00:02:11.030"
}
},
{
"_index": "rcmd",
"_type": "_doc",
"_id": "b7QP63MBTdXKc2eAPf_Y",
"_score": 2.0,
"_source": {
"tags": [
"布料",
"抹布",
"裤子",
"衣服",
"生活"
],
"update_time": "2020-07-01T00:02:11.030"
}
},
{
"_index": "rcmd",
"_type": "_doc",
"_id": "cLQQ63MBTdXKc2eA6_8v",
"_score": 1.0,
"_source": {
"tags": [
"啤酒",
"米酒",
"饮料",
"餐饮",
"生活"
],
"update_time": "2020-06-02T00:02:11.030"
}
},
{
"_index": "rcmd",
"_type": "_doc",
"_id": "cbQS63MBTdXKc2eAcP-N",
"_score": 1.0,
"_source": {
"tags": [
"火锅",
"自助餐",
"外卖",
"烧烤",
"餐饮"
],
"update_time": "2020-06-03T00:02:11.030"
}
}
]
}
}
有两篇文章涵盖了其中两个标签“生活”和“衣服”,得分为2,排到了前面。这个排序基本满足了兴趣匹配的要求。
实际的项目中往往是用户的兴趣标签的权值不一样,假设用户的兴趣标签是["火锅","生活","衣服"],排在越前面的权重越高,查询的时候需要给关键词设定权重,上面的查询语句所有boost都是默认值1,现在根据需求改动权值再查询。
curl --request POST \
--url http://localhost:9200/rcmd/_search \
--header 'content-type: application/json' \
--data '{
"query": {
"bool": {
"should": [
{
"constant_score": {
"boost": 1,
"filter": {
"match": {
"tags": "生活"
}
}
}
},
{
"constant_score": {
"boost": 4,
"filter": {
"match": {
"tags": "衣服"
}
}
}
},
{
"constant_score": {
"boost": 6,
"filter": {
"match": {
"tags": "火锅"
}
}
}
}
]
}
}
}'
分别给三个词加上权重6、4、1,查询结果如下:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": 6.0,
"hits": [
{
"_index": "rcmd",
"_type": "_doc",
"_id": "cbQS63MBTdXKc2eAcP-N",
"_score": 6.0,
"_source": {
"tags": [
"火锅",
"自助餐",
"外卖",
"烧烤",
"餐饮"
],
"update_time": "2020-06-03T00:02:11.030"
}
},
{
"_index": "rcmd",
"_type": "_doc",
"_id": "brQO63MBTdXKc2eArv9A",
"_score": 5.0,
"_source": {
"tags": [
"布料",
"抹布",
"裤子",
"衣服",
"生活"
],
"update_time": "2020-06-01T00:02:11.030"
}
},
{
"_index": "rcmd",
"_type": "_doc",
"_id": "b7QP63MBTdXKc2eAPf_Y",
"_score": 5.0,
"_source": {
"tags": [
"布料",
"抹布",
"裤子",
"衣服",
"生活"
],
"update_time": "2020-07-01T00:02:11.030"
}
},
{
"_index": "rcmd",
"_type": "_doc",
"_id": "cLQQ63MBTdXKc2eA6_8v",
"_score": 1.0,
"_source": {
"tags": [
"啤酒",
"米酒",
"饮料",
"餐饮",
"生活"
],
"update_time": "2020-06-02T00:02:11.030"
}
}
]
}
}
可以看到包含“火锅”的文章排到了第一,包含“衣服”和“生活”的文章虽然两个词都命中,但是在权值的弱化之下排到了第二第三位。
curl --request POST \
--url http://localhost:9200/rcmd/_search \
--header 'content-type: application/json' \
--data '{
"query": {
"function_score": {
"query": {
"bool": {
"must": [
{
"range": {
"update_time": {
"from": "2020-06-01",
"to": "2020-08-01"
}
}
},
{
"bool": {
"should": [
{
"term": {
"tags": {
"term": "火锅",
"boost": 2
}
}
},
{
"term": {
"tags": {
"term": "衣服",
"boost": 1
}
}
},
{
"term": {
"tags": {
"term": "生活",
"boost": 1
}
}
}
]
}
}
]
}
},
"functions": [
{
"gauss": {
"update_time": {
"scale": "3d",
"origin": "2020-07-02T00:01:00.000"
}
}
}
]
}
},
"_source": {
"include": [
"tags",
"update_time"
]
},
"from": 0,
"size": 10
}'
以上是相对完整的一个查询,首先对update_time发布时间做了限制,只选择一定范围内的数据,随后是标签的匹配,多个标签匹配条件之间是"OR"的关系,标签具有不同的权重,接下来用衰减函数gauss对update_time做衰减排序,衰减函数的意义是越近越好,scale": "3d"就是以3天为一个阶梯先对数据进行排序,相同阶梯内的数据再按照标签匹配度排序。 注:gauss中的origin可以不指定 最终的查询结果:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": 3.6649413,
"hits": [
{
"_index": "rcmd",
"_type": "_doc",
"_id": "b7QP63MBTdXKc2eAPf_Y",
"_score": 3.6649413,
"_source": {
"update_time": "2020-07-01T00:02:11.030",
"tags": [
"布料",
"抹布",
"裤子",
"衣服",
"生活"
]
}
},
{
"_index": "rcmd",
"_type": "_doc",
"_id": "cbQS63MBTdXKc2eAcP-N",
"_score": 4.4511746E-28,
"_source": {
"update_time": "2020-06-03T00:02:11.030",
"tags": [
"火锅",
"自助餐",
"外卖",
"烧烤",
"餐饮"
]
}
},
{
"_index": "rcmd",
"_type": "_doc",
"_id": "cLQQ63MBTdXKc2eA6_8v",
"_score": 1.764942E-30,
"_source": {
"update_time": "2020-06-02T00:02:11.030",
"tags": [
"啤酒",
"米酒",
"饮料",
"餐饮",
"生活"
]
}
},
{
"_index": "rcmd",
"_type": "_doc",
"_id": "brQO63MBTdXKc2eArv9A",
"_score": 2.8566082E-32,
"_source": {
"update_time": "2020-06-01T00:02:11.030",
"tags": [
"布料",
"抹布",
"裤子",
"衣服",
"生活"
]
}
}
]
}
}
同样是匹配了“衣服”和“生活”的两篇文章,一篇在最前面,一篇在最后面,是因为update_time的缘故,一篇是7月1日发布的,另一篇在6月1日,不在同一时间阶梯内,日期久远的排到了后面。中间的两篇,各自匹配了一个标签,分别是“烧烤”和“生活”,两篇文章时间阶梯没有明显的区别,然而匹配“火锅”的排到了前面,是因为“火锅”的关键词加了较高的权重。 至此,我们实现了按照标签匹配文章,并且结合了时间因素和匹配度评分的兴趣推荐。
以上例子没有在超大数据环境下测试过,还没有具体的性能指标。
上述内容就是如何用ElasticSearch实现基于标签的兴趣推荐,你们学到知识或技能了吗?如果还想学到更多技能或者丰富自己的知识储备,欢迎关注亿速云行业资讯频道。
亿速云「云服务器」,即开即用、新一代英特尔至强铂金CPU、三副本存储NVMe SSD云盘,价格低至29元/月。点击查看>>
免责声明:本站发布的内容(图片、视频和文字)以原创、转载和分享为主,文章观点不代表本网站立场,如果涉及侵权请联系站长邮箱:is@yisu.com进行举报,并提供相关证据,一经查实,将立刻删除涉嫌侵权内容。
原文链接:https://my.oschina.net/waterbear/blog/4493422