How to Implement Tag-Based Interest Recommendation with Elasticsearch

Published: 2021-12-16 17:59:21 | Views: 426 | Author: 柒染 | Category: Big Data

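This article shows how to implement tag-based interest recommendation with Elasticsearch, step by step: installing a test node, defining the index, loading sample data, and building the queries.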

Introduction

The steps below use the inverted index of Elasticsearch (ES for short) to implement tag-based interest recommendation.
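To make the inverted-index idea concrete, here is a minimal in-memory sketch in Python. It is not how ES is implemented internally; the document ids, the sample tags, and the helper names `build_inverted_index` and `candidates` are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical in-memory documents: id -> list of tags (not real ES data).
docs = {
    1: ["衣服", "生活"],
    2: ["啤酒", "生活"],
    3: ["火锅", "餐饮"],
}

def build_inverted_index(docs):
    """Map each tag to the set of document ids that carry it."""
    index = defaultdict(set)
    for doc_id, tags in docs.items():
        for tag in tags:
            index[tag].add(doc_id)
    return index

def candidates(index, interests):
    """Return ids of documents sharing at least one tag with the user."""
    found = set()
    for tag in interests:
        found |= index.get(tag, set())
    return found

index = build_inverted_index(docs)
print(sorted(candidates(index, ["生活", "火锅"])))
```

Looking up a tag is a single dictionary access, which is why matching a user's tags against many articles stays cheap.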

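Prerequisites

Operating system: Ubuntu 20.04
Docker version 19.03.8
Elasticsearch 7.x

Tools used

A curl-style HTTP client; Insomnia is recommended
An ES GUI tool; appbaseio/dejavu is recommended

Installing ES

Install a single-node ES in Docker for testing: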

docker run -d --name elasticsearch \
  -v /home/cherokee/docker-data/es-data:/usr/share/elasticsearch/data \
  -e http.cors.enabled=true \
  -e http.cors.allow-origin="*" \
  -e http.cors.allow-headers=X-Requested-With,X-Auth-Token,Content-Type,Content-Length,Authorization \
  -e http.cors.allow-credentials=true \
  -p 9200:9200 -p 9300:9300 \
  -e "discovery.type=single-node" \
  successage/es-ik

The ES service is now running locally and can be reached at http://localhost:9200.

Creating the index

Create an index named rcmd:

curl --request PUT \
  --url http://localhost:9200/rcmd

Defining the mapping

curl --request PUT \
  --url http://localhost:9200/rcmd/_mapping \
  --header 'content-type: application/json' \
  --data '{
	"properties": {
		"tags": {
			"type": "keyword",
			"store": true
		},
		"update_time": {
			"type": "date",
			"store": true
		}
	}
}'

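The mapping has two fields:

tags: the article's interest tags. The keyword type is matched as exact values, with no full-text analysis. The tags are stored as an array.
update_time: the update time. It gives the recommendation an additional ranking condition; real projects usually need to rank by both time and match score.

Sample data

Insert some data: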

curl --request POST \
  --url http://localhost:9200/rcmd/_doc \
  --header 'content-type: application/json' \
  --data '{
	"tags": [
		"布料",
		"抹布",
		"裤子",
		"衣服",
		"生活"
	],
	"update_time": "2020-06-01T00:02:11.030"
}'

Insert another document with the same tags but a different time; it comes in handy in a later example:

curl --request POST \
  --url http://localhost:9200/rcmd/_doc \
  --header 'content-type: application/json' \
  --data '{
	"tags": [
		"布料",
		"抹布",
		"裤子",
		"衣服",
		"生活"
	],
	"update_time": "2020-07-01T00:02:11.030"
}'

curl --request POST \
  --url http://localhost:9200/rcmd/_doc \
  --header 'content-type: application/json' \
  --data '{
	"tags": [
		"啤酒",
		"米酒",
		"饮料",
		"餐饮",
		"生活"
	],
	"update_time": "2020-06-02T00:02:11.030"
}'

curl --request POST \
  --url http://localhost:9200/rcmd/_doc \
  --header 'content-type: application/json' \
  --data '{
	"tags": [
		"火锅",
		"自助餐",
		"外卖",
		"烧烤",
		"餐饮"
	],
	"update_time": "2020-06-03T00:02:11.030"
}'

curl --request POST \
  --url http://localhost:9200/rcmd/_doc \
  --header 'content-type: application/json' \
  --data '{
	"tags": [
		"太阳",
		"月亮",
		"大海",
		"星星",
		"自然"
	],
	"update_time": "2020-06-01T00:02:11.030"
}'

curl --request POST \
  --url http://localhost:9200/rcmd/_doc \
  --header 'content-type: application/json' \
  --data '{
	"tags": [
		"人类",
		"动物",
		"植物",
		"地球",
		"自然"
	],
	"update_time": "2020-06-01T00:02:11.030"
}'

curl --request POST \
  --url http://localhost:9200/rcmd/_doc \
  --header 'content-type: application/json' \
  --data '{
	"tags": [
		"男人",
		"女人",
		"小孩",
		"老人",
		"人类"
	],
	"update_time": "2020-06-02T00:02:11.030"
}'

The final data is as follows: [Figure: documents in the rcmd index]
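The seven inserts could also be sent in one request through the ES `_bulk` API, which takes newline-delimited JSON: one action line followed by one document line per document. The sketch below only builds that request body; the names `SAMPLE_DOCS` and `build_bulk_body` are assumptions for illustration. The body would be POSTed to http://localhost:9200/rcmd/_bulk with the header `content-type: application/x-ndjson`.

```python
import json

# The seven sample documents from the curl commands: (tags, update_time).
SAMPLE_DOCS = [
    (["布料", "抹布", "裤子", "衣服", "生活"], "2020-06-01T00:02:11.030"),
    (["布料", "抹布", "裤子", "衣服", "生活"], "2020-07-01T00:02:11.030"),
    (["啤酒", "米酒", "饮料", "餐饮", "生活"], "2020-06-02T00:02:11.030"),
    (["火锅", "自助餐", "外卖", "烧烤", "餐饮"], "2020-06-03T00:02:11.030"),
    (["太阳", "月亮", "大海", "星星", "自然"], "2020-06-01T00:02:11.030"),
    (["人类", "动物", "植物", "地球", "自然"], "2020-06-01T00:02:11.030"),
    (["男人", "女人", "小孩", "老人", "人类"], "2020-06-02T00:02:11.030"),
]

def build_bulk_body(docs):
    """Build an NDJSON body: an index action line, then the document line."""
    lines = []
    for tags, update_time in docs:
        # Empty action metadata: the target index comes from the URL.
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps({"tags": tags, "update_time": update_time},
                                ensure_ascii=False))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"

body = build_bulk_body(SAMPLE_DOCS)
print(body)
```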

Constant-score query

curl --request POST \
  --url http://localhost:9200/rcmd/_search \
  --header 'content-type: application/json' \
  --data '{
	"query": {
		"bool": {
			"should": [
				{
					"constant_score": {
						"boost": 1,
						"filter": {
							"match": {
								"tags": "生活"
							}
						}
					}
				},
				{
					"constant_score": {
						"boost": 1,
						"filter": {
							"match": {
								"tags": "衣服"
							}
						}
					}
				},
				{
					"constant_score": {
						"boost": 1,
						"filter": {
							"match": {
								"tags": "火锅"
							}
						}
					}
				}
			]
		}
	}
}'

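The should clause returns any article that matches at least one of the three tags "生活", "衣服", or "火锅". Each constant_score clause contributes a fixed score equal to its boost, so the more of these tags an article covers, the higher its score. An article that covered all three tags would score highest. The query result: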

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 2.0,
    "hits": [
      {
        "_index": "rcmd",
        "_type": "_doc",
        "_id": "brQO63MBTdXKc2eArv9A",
        "_score": 2.0,
        "_source": {
          "tags": [
            "布料",
            "抹布",
            "裤子",
            "衣服",
            "生活"
          ],
          "update_time": "2020-06-01T00:02:11.030"
        }
      },
      {
        "_index": "rcmd",
        "_type": "_doc",
        "_id": "b7QP63MBTdXKc2eAPf_Y",
        "_score": 2.0,
        "_source": {
          "tags": [
            "布料",
            "抹布",
            "裤子",
            "衣服",
            "生活"
          ],
          "update_time": "2020-07-01T00:02:11.030"
        }
      },
      {
        "_index": "rcmd",
        "_type": "_doc",
        "_id": "cLQQ63MBTdXKc2eA6_8v",
        "_score": 1.0,
        "_source": {
          "tags": [
            "啤酒",
            "米酒",
            "饮料",
            "餐饮",
            "生活"
          ],
          "update_time": "2020-06-02T00:02:11.030"
        }
      },
      {
        "_index": "rcmd",
        "_type": "_doc",
        "_id": "cbQS63MBTdXKc2eAcP-N",
        "_score": 1.0,
        "_source": {
          "tags": [
            "火锅",
            "自助餐",
            "外卖",
            "烧烤",
            "餐饮"
          ],
          "update_time": "2020-06-03T00:02:11.030"
        }
      }
    ]
  }
}

Two articles cover two of the tags, "生活" and "衣服", so they score 2 and rank first. This ordering basically meets the requirement of interest matching.
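The scores in the response can be reproduced by hand: each matching constant_score clause adds its boost, and the sum is the document score. The following Python sketch mimics that arithmetic outside ES; the short article names and the helpers `should_score` and `rank` are hypothetical, and the order of tied documents in a real ES response is not guaranteed to match.

```python
# Tags of some sample articles, keyed by hypothetical short names.
ARTICLES = {
    "clothes-june": ["布料", "抹布", "裤子", "衣服", "生活"],
    "clothes-july": ["布料", "抹布", "裤子", "衣服", "生活"],
    "drinks": ["啤酒", "米酒", "饮料", "餐饮", "生活"],
    "hotpot": ["火锅", "自助餐", "外卖", "烧烤", "餐饮"],
    "nature": ["太阳", "月亮", "大海", "星星", "自然"],
}

def should_score(doc_tags, boosts):
    """Sum the boost of every constant_score clause whose tag the doc has."""
    return float(sum(boost for tag, boost in boosts.items() if tag in doc_tags))

def rank(articles, boosts):
    """Return (name, score) pairs of matching articles, highest score first."""
    scored = [(name, should_score(tags, boosts)) for name, tags in articles.items()]
    matched = [(name, score) for name, score in scored if score > 0]
    return sorted(matched, key=lambda pair: pair[1], reverse=True)

equal_boosts = {"生活": 1, "衣服": 1, "火锅": 1}
for name, score in rank(ARTICLES, equal_boosts):
    print(name, score)
```

The "nature" article matches no clause and is dropped, just as ES returns only four hits.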

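Interest tag weights

In real projects a user's interest tags usually carry different weights. Suppose the user's interest tags are ["火锅","衣服","生活"], where tags nearer the front weigh more. The query then has to give each keyword its own weight. In the query above every boost is 1; now adjust the boosts accordingly and query again.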

curl --request POST \
  --url http://localhost:9200/rcmd/_search \
  --header 'content-type: application/json' \
  --data '{
	"query": {
		"bool": {
			"should": [
				{
					"constant_score": {
						"boost": 1,
						"filter": {
							"match": {
								"tags": "生活"
							}
						}
					}
				},
				{
					"constant_score": {
						"boost": 4,
						"filter": {
							"match": {
								"tags": "衣服"
							}
						}
					}
				},
				{
					"constant_score": {
						"boost": 6,
						"filter": {
							"match": {
								"tags": "火锅"
							}
						}
					}
				}
			]
		}
	}
}'

The three tags "火锅", "衣服", and "生活" get the weights 6, 4, and 1 respectively. The query result:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 6.0,
    "hits": [
      {
        "_index": "rcmd",
        "_type": "_doc",
        "_id": "cbQS63MBTdXKc2eAcP-N",
        "_score": 6.0,
        "_source": {
          "tags": [
            "火锅",
            "自助餐",
            "外卖",
            "烧烤",
            "餐饮"
          ],
          "update_time": "2020-06-03T00:02:11.030"
        }
      },
      {
        "_index": "rcmd",
        "_type": "_doc",
        "_id": "brQO63MBTdXKc2eArv9A",
        "_score": 5.0,
        "_source": {
          "tags": [
            "布料",
            "抹布",
            "裤子",
            "衣服",
            "生活"
          ],
          "update_time": "2020-06-01T00:02:11.030"
        }
      },
      {
        "_index": "rcmd",
        "_type": "_doc",
        "_id": "b7QP63MBTdXKc2eAPf_Y",
        "_score": 5.0,
        "_source": {
          "tags": [
            "布料",
            "抹布",
            "裤子",
            "衣服",
            "生活"
          ],
          "update_time": "2020-07-01T00:02:11.030"
        }
      },
      {
        "_index": "rcmd",
        "_type": "_doc",
        "_id": "cLQQ63MBTdXKc2eA6_8v",
        "_score": 1.0,
        "_source": {
          "tags": [
            "啤酒",
            "米酒",
            "饮料",
            "餐饮",
            "生活"
          ],
          "update_time": "2020-06-02T00:02:11.030"
        }
      }
    ]
  }
}

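The article containing "火锅" now ranks first with a score of 6. The articles containing "衣服" and "生活" hit two tags, but with the lower weights (4 + 1 = 5) they rank second and third.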
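In an application the query body would normally be generated from the user's profile rather than written by hand. Below is a minimal sketch that builds the same bool/should/constant_score structure from a list of (tag, weight) pairs. The function name `build_weighted_query` and the input format are assumptions; how the weights are derived from a user's behavior is outside the scope of this sketch.

```python
import json

def build_weighted_query(tag_boosts):
    """Build a bool/should query with one constant_score clause per tag.

    tag_boosts is a list of (tag, boost) pairs, e.g. from a user profile.
    """
    should = [
        {
            "constant_score": {
                "boost": boost,
                "filter": {"match": {"tags": tag}},
            }
        }
        for tag, boost in tag_boosts
    ]
    return {"query": {"bool": {"should": should}}}

# The weights used in the example above.
query = build_weighted_query([("火锅", 6), ("衣服", 4), ("生活", 1)])
print(json.dumps(query, ensure_ascii=False, indent=2))
```

The printed JSON can be passed as the `--data` payload of the `_search` request.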

Ranking on multiple conditions

curl --request POST \
  --url http://localhost:9200/rcmd/_search \
  --header 'content-type: application/json' \
  --data '{
	"query": {
		"function_score": {
			"query": {
				"bool": {
					"must": [
						{
							"range": {
								"update_time": {
									"gte": "2020-06-01",
									"lte": "2020-08-01"
								}
							}
						},
						{
							"bool": {
								"should": [
									{
										"term": {
											"tags": {
												"value": "火锅",
												"boost": 2
											}
										}
									},
									{
										"term": {
											"tags": {
												"value": "衣服",
												"boost": 1
											}
										}
									},
									{
										"term": {
											"tags": {
												"value": "生活",
												"boost": 1
											}
										}
									}
								]
							}
						}
					]
				}
			},
			"functions": [
				{
					"gauss": {
						"update_time": {
							"scale": "3d",
							"origin": "2020-07-02T00:01:00.000"
						}
					}
				}
			]
		}
	},
	"_source": {
		"includes": [
			"tags",
			"update_time"
		]
	},
	"from": 0,
	"size": 10
}'

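This is a fairly complete query. It first restricts update_time to a date range, so only documents within that range are considered. Next comes the tag matching: the tag clauses are combined with OR, and each tag carries its own weight. Finally the gauss decay function is applied to update_time; the idea of a decay function is "the closer, the better", so documents nearer to origin score higher. "scale": "3d" controls how fast the score falls off: with the default decay of 0.5, a document 3 days away from origin keeps half of its query score, and the factor shrinks rapidly beyond that. By default function_score multiplies the decay factor into the query score, so time and tag match are combined into a single score instead of being sorted in two passes.

Note: origin in gauss can be omitted; for a date field it then defaults to the current time.

The final query result: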

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 3.6649413,
    "hits": [
      {
        "_index": "rcmd",
        "_type": "_doc",
        "_id": "b7QP63MBTdXKc2eAPf_Y",
        "_score": 3.6649413,
        "_source": {
          "update_time": "2020-07-01T00:02:11.030",
          "tags": [
            "布料",
            "抹布",
            "裤子",
            "衣服",
            "生活"
          ]
        }
      },
      {
        "_index": "rcmd",
        "_type": "_doc",
        "_id": "cbQS63MBTdXKc2eAcP-N",
        "_score": 4.4511746E-28,
        "_source": {
          "update_time": "2020-06-03T00:02:11.030",
          "tags": [
            "火锅",
            "自助餐",
            "外卖",
            "烧烤",
            "餐饮"
          ]
        }
      },
      {
        "_index": "rcmd",
        "_type": "_doc",
        "_id": "cLQQ63MBTdXKc2eA6_8v",
        "_score": 1.764942E-30,
        "_source": {
          "update_time": "2020-06-02T00:02:11.030",
          "tags": [
            "啤酒",
            "米酒",
            "饮料",
            "餐饮",
            "生活"
          ]
        }
      },
      {
        "_index": "rcmd",
        "_type": "_doc",
        "_id": "brQO63MBTdXKc2eArv9A",
        "_score": 2.8566082E-32,
        "_source": {
          "update_time": "2020-06-01T00:02:11.030",
          "tags": [
            "布料",
            "抹布",
            "裤子",
            "衣服",
            "生活"
          ]
        }
      }
    ]
  }
}

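The two articles that both matched "衣服" and "生活" end up first and last because of update_time: one was published on July 1, about one day from origin, and the other on June 1, about a month away, where the decay factor is vanishingly small. The two articles in the middle each matched one tag, "火锅" and "生活" respectively. The "火锅" article ranks ahead for two reasons: it is one day newer, which this far from origin already changes the decay factor considerably, and "火锅" carries the higher boost. With this we have tag-based article matching that combines the time factor with the match score into one interest recommendation.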
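The time factors behind these scores can be checked with a few lines of Python. The sketch below follows the Gaussian decay formula given in the ES function_score documentation, exp(-d² / (2σ²)) with σ² = -scale² / (2·ln(decay)), assuming the default decay of 0.5 and an offset of 0. The helper names `gauss_decay` and `parse` are hypothetical, and the sketch only computes the multiplier, not the tag match score.

```python
import math
from datetime import datetime

def gauss_decay(value, origin, scale_days, decay=0.5, offset_days=0.0):
    """Gaussian decay factor for a date, with distances measured in days."""
    distance = abs((value - origin).total_seconds()) / 86400.0
    distance = max(0.0, distance - offset_days)
    sigma_sq = -scale_days ** 2 / (2.0 * math.log(decay))
    return math.exp(-distance ** 2 / (2.0 * sigma_sq))

def parse(ts):
    """Parse timestamps in the format used by the sample data."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f")

origin = parse("2020-07-02T00:01:00.000")
for ts in ["2020-07-01T00:02:11.030", "2020-06-03T00:02:11.030",
           "2020-06-02T00:02:11.030", "2020-06-01T00:02:11.030"]:
    print(ts, gauss_decay(parse(ts), origin, scale_days=3))
```

The July 1 document keeps most of its query score, while the June documents are multiplied by factors so small that their final order is driven mainly by how far each one is from origin.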

These examples have not been tested against very large data sets, so no concrete performance figures are available yet.
