What Are Term Vectors in Java and How Are They Used

Published: 2021-12-21 10:43:56  Views: 227  Author: iii  Category: Development


This article explains what term vectors are and how to use them. The techniques described here are simple and practical, so let's walk through them.

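What is a term vector?

Whenever a document is indexed, Elasticsearch stores it in the forward and inverted indexes. If a field in the index mapping also sets the term_vector parameter, Elasticsearch additionally computes and stores statistics about that field's analyzed terms: for example, how many fields the document has, the df and ttf values of each term produced by analyzing a field's value, and the position and offsets of each term. These statistics are collectively called the term vector. The term_vector parameter accepts five values:

no: do not store term vector information (the default)
yes: store only the field's terms, without position or offset information
with_positions: store terms and positions
with_offsets: store terms and offsets
with_positions_offsets: store the complete term vector, including the field's terms, positions, and offsets

Term vector information can be generated in two ways: at index time or at query time. Index-time generation builds the term vector while the document is being indexed; query-time generation computes it on demand while the request is being served. The former trades space for time, the latter trades time for space.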
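As a rough illustration of the five options, the sketch below builds the mapping fragment for a text field and rejects any value outside the list above. The helper name text_field_mapping is hypothetical and nothing here talks to a cluster; it only produces the JSON you would place under "properties".

```python
import json

# The five values listed above for the term_vector mapping parameter.
TERM_VECTOR_OPTIONS = {
    "no",
    "yes",
    "with_positions",
    "with_offsets",
    "with_positions_offsets",
}


def text_field_mapping(term_vector="no", analyzer="standard"):
    """Return the mapping fragment for one text field (hypothetical helper)."""
    if term_vector not in TERM_VECTOR_OPTIONS:
        raise ValueError(f"unsupported term_vector value: {term_vector!r}")
    mapping = {"type": "text", "analyzer": analyzer}
    if term_vector != "no":
        # "no" is the default, so the parameter is only emitted when it differs.
        mapping["term_vector"] = term_vector
    return mapping


# Index-time term vectors for one field, query-time (no setting) for another.
index_time = text_field_mapping("with_positions_offsets")
query_time = text_field_mapping()
print(json.dumps({"content": index_time, "fullname": query_time}, indent=2))
```

A field built without the parameter can still be inspected later, because its term vector is then computed at query time.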


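What are term vectors for?

A term vector is essentially a data inspection tool (think of it as a debugger). It records in detail how each field of a document was analyzed into terms: how many terms were produced, where each term sits within the field, and what its df and ttf values are. It is typically used to investigate suspected data problems. For example, when sorting or search results differ from what you expect, you can use term vectors to analyze the data by hand and track down the root cause.

Reading term vector output

Let's look at a complete term vector response and the information it contains: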

{
  "_index": "music",
  "_type": "children",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 3,
        "doc_count": 1,
        "sum_ttf": 3
      },
      "terms": {
        "elasticsearch": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "start_offset": 11,
              "end_offset": 24
            }
          ]
        },
        "hello": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 5
            }
          ]
        },
        "java": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 6,
              "end_offset": 10
            }
          ]
        }
      }
    }
  }
}

Term vector information is reported per field. A complete response has three main parts:

field statistics
term statistics
term information
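To make the layout concrete, here is a minimal sketch that walks a response of this shape and lists a field's terms in the order they appeared in the original text. The sample dictionary is a trimmed copy of the example response, and terms_in_order is a hypothetical helper, not part of any client library.

```python
def terms_in_order(response, field):
    """Return the field's terms sorted by token position."""
    tokens = []
    for term, info in response["term_vectors"][field]["terms"].items():
        # A term that occurs several times has several entries in "tokens".
        for token in info.get("tokens", []):
            tokens.append((token["position"], term))
    return [term for _, term in sorted(tokens)]


# Trimmed copy of the example response (statistics omitted for brevity).
sample = {
    "term_vectors": {
        "text": {
            "terms": {
                "elasticsearch": {"tokens": [{"position": 2, "start_offset": 11, "end_offset": 24}]},
                "hello": {"tokens": [{"position": 0, "start_offset": 0, "end_offset": 5}]},
                "java": {"tokens": [{"position": 1, "start_offset": 6, "end_offset": 10}]},
            }
        }
    }
}

print(terms_in_order(sample, "text"))
```

Sorting by position reverses the alphabetical ordering of the "terms" object and recovers the token sequence of the analyzed field.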

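field statistics

Statistics over all terms of this field across every document in the index and type. Note the scope: these figures do not describe a single document but all documents under the given index/type.

sum_doc_freq (sum of document frequencies): the sum of the df values of all terms in this field.
doc_count (document count): how many documents contain this field; some documents may not have it.
sum_ttf (sum of total term frequencies): the sum of the ttf values of all terms in this field.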
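The arithmetic behind these three figures can be sketched locally. The code below is only an approximation under stated assumptions: tokenize is a hypothetical stand-in for the standard analyzer (lowercase, split on non-word characters), and the numbers are computed over an in-memory list rather than read from Elasticsearch.

```python
import re


def tokenize(text):
    """Rough stand-in for the standard analyzer (assumption, not the real one)."""
    return re.findall(r"\w+", text.lower())


def field_statistics(docs, field):
    """Compute doc_count, sum_doc_freq and sum_ttf for one field."""
    doc_count = sum_doc_freq = sum_ttf = 0
    for doc in docs:
        tokens = tokenize(doc.get(field, ""))
        if not tokens:
            continue  # the document does not contribute to this field
        doc_count += 1
        sum_doc_freq += len(set(tokens))  # each distinct term adds 1 to its df
        sum_ttf += len(tokens)            # every occurrence adds 1 to its ttf
    return {"doc_count": doc_count, "sum_doc_freq": sum_doc_freq, "sum_ttf": sum_ttf}


docs = [
    {"content": "Love Somebody"},
    {"content": "wake me, shark me ..."},
    {"content": "brush your teeth"},
]
stats = field_statistics(docs, "content")
print(stats)
```

The second document repeats "me", which is why sum_ttf ends up larger than sum_doc_freq for this toy corpus.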

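term statistics

In the example, hello is one of the terms produced by analyzing the text field of the current document. The doc_freq and ttf values are returned only when the request sets term_statistics=true.

doc_freq (document frequency): how many documents contain this term.
ttf (total term frequency): how many times this term occurs across all documents.
term_freq (term frequency in the field): how many times this term occurs in the current document.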
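The difference between the three counters is easiest to see on a term that repeats. The following sketch computes them for a toy corpus; tokenize is again a hypothetical approximation of the standard analyzer and term_statistics is an illustrative helper, not an Elasticsearch API.

```python
import re


def tokenize(text):
    """Rough stand-in for the standard analyzer (assumption, not the real one)."""
    return re.findall(r"\w+", text.lower())


def term_statistics(docs, field, current, term):
    """Return doc_freq, ttf and term_freq of `term` for the document at index `current`."""
    per_doc = [tokenize(doc.get(field, "")).count(term) for doc in docs]
    return {
        "doc_freq": sum(1 for n in per_doc if n > 0),  # documents containing the term
        "ttf": sum(per_doc),                           # occurrences across all documents
        "term_freq": per_doc[current],                 # occurrences in the current document
    }


docs = [
    {"content": "Love Somebody"},
    {"content": "wake me, shark me ..."},
    {"content": "brush your teeth"},
]
me_stats = term_statistics(docs, "content", 1, "me")
print(me_stats)
```

Here "me" appears twice, but in a single document, so ttf and term_freq are both 2 while doc_freq stays at 1.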

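term information

This is the content of tokens in the example; tokens is an array.

position: the position of the term within the field's token sequence. If the same term occurs more than once, tokens contains one entry per occurrence.
start_offset: the character offset at which the term starts in the field's text.
end_offset: the character offset at which the term ends in the field's text.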
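Positions and offsets can be reproduced by hand for simple text. The sketch below assumes a naive tokenizer (runs of word characters, lowercased) in place of the real analyzer, and token_information is a hypothetical helper; for the input "hello java elasticsearch" it yields the same positions and offsets as the example response.

```python
import re


def token_information(text):
    """Map each term to its list of {position, start_offset, end_offset} entries."""
    terms = {}
    for position, match in enumerate(re.finditer(r"\w+", text)):
        term = match.group().lower()
        terms.setdefault(term, []).append(
            {
                "position": position,           # index in the token sequence
                "start_offset": match.start(),  # first character of the token
                "end_offset": match.end(),      # one past the last character
            }
        )
    return terms


info = token_information("hello java elasticsearch")
print(info["elasticsearch"])

# A repeated term gets one entry per occurrence.
repeated = token_information("wake me, shark me ...")
print(repeated["me"])
```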

A term vector example

Create the index music with a type named children. The content field is configured for index-time term vectors; the fullname field has no term_vector setting, so its term vectors are generated at query time.

PUT /music
{
  "mappings": {
    "children": {
      "properties": {
        "content": {
          "type": "text",
          "term_vector": "with_positions_offsets",
          "store": true,
          "analyzer": "standard"
        },
        "fullname": {
          "type": "text",
          "analyzer": "standard"
        }
      }
    }
  }
}

Add three sample documents:

PUT /music/children/1
{
  "fullname": "Jean Ritchie",
  "content": "Love Somebody"
}

PUT /music/children/2
{
  "fullname": "John Smith",
  "content": "wake me, shark me ..."
}

PUT /music/children/3
{
  "fullname": "Peter Raffi",
  "content": "brush your teeth"
}

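Inspect the term vector of the document with id 1:

GET /music/children/1/_termvectors
{
  "fields": ["content"],
  "offsets": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}

The response has the same structure as the term vector example shown earlier. Note also that querying with any of the three document ids returns the same field_statistics section, because field statistics cover all documents rather than a single one.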
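If you would rather issue this request from a script than from a console, a minimal sketch using only the Python standard library might look like the following. The base URL http://localhost:9200 and the helper name build_termvectors_request are assumptions, and POST is used here on the assumption that the endpoint accepts a request body with either GET or POST. The request object is only built, not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumed local cluster address


def build_termvectors_request(index, doc_type, doc_id, fields):
    """Build (but do not send) a _termvectors request for one document."""
    body = {
        "fields": fields,
        "offsets": True,
        "positions": True,
        "term_statistics": True,
        "field_statistics": True,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/{index}/{doc_type}/{doc_id}/_termvectors",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_termvectors_request("music", "children", "1", ["content"])
print(request.get_method(), request.full_url)
# To actually send it against a running cluster:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```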

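Common term vector usage

Beyond the standard request shown in the previous section, several parameters make term vector queries more flexible.

The doc parameter

GET /music/children/_termvectors
{
  "doc": {
    "fullname": "Peter Raffi",
    "content": "brush your teeth"
  },
  "fields": ["content"],
  "offsets": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}

This form analyzes the term vector of a document supplied in the request itself. The content of doc can be anything you like, which makes it especially handy.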

The per_field_analyzer parameter
Lets you choose which analyzer is applied to a given field for the inspection.

GET /music/children/_termvectors
{
  "doc": {
    "fullname": "Jimmie Davis",
    "content": "you are my sunshine"
  },
  "fields": ["content"],
  "offsets": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true,
  "per_field_analyzer": {
    "content": "standard"
  }
}

The filter parameter
Filters the term vector statistics that are returned.

GET /music/children/_termvectors
{
  "doc": {
    "fullname": "Jimmie Davis",
    "content": "you are my sunshine"
  },
  "fields": ["content"],
  "offsets": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true,
  "filter": {
    "max_num_terms": 3,
    "min_term_freq": 1,
    "min_doc_freq": 1
  }
}

The filter selects the term vector entries you want to see based on the term statistics. This is useful too: when exploring data, for instance, you can filter out terms that occur too rarely.
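The effect of the three filter settings can be approximated locally. The sketch below is not Elasticsearch's implementation: the statistics are made-up sample values, filter_terms is a hypothetical helper, and the tf-idf formula used to rank terms before applying max_num_terms is an illustrative assumption.

```python
import math


def filter_terms(terms, doc_count, max_num_terms=None, min_term_freq=1, min_doc_freq=1):
    """Keep terms that meet the frequency thresholds, best-scoring first."""
    scored = []
    for term, stats in terms.items():
        if stats["term_freq"] < min_term_freq or stats["doc_freq"] < min_doc_freq:
            continue  # too rare in the document or in the index
        # Illustrative tf-idf score (assumption): rarer terms rank higher.
        idf = math.log((doc_count + 1) / (stats["doc_freq"] + 1)) + 1.0
        scored.append((stats["term_freq"] * idf, term))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    kept = [term for _, term in scored]
    return kept if max_num_terms is None else kept[:max_num_terms]


# Hypothetical statistics for the terms of "you are my sunshine".
sample_terms = {
    "you": {"term_freq": 1, "doc_freq": 4},
    "are": {"term_freq": 1, "doc_freq": 3},
    "my": {"term_freq": 1, "doc_freq": 2},
    "sunshine": {"term_freq": 1, "doc_freq": 1},
}
top_three = filter_terms(sample_terms, doc_count=4, max_num_terms=3)
print(top_three)
```

Raising min_doc_freq to 2 in this sketch drops "sunshine", which is the kind of low-frequency pruning described above.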

The docs parameter
Lets you inspect several documents in a single request; how often you need it is a matter of personal habit.

GET _mtermvectors
{
  "docs": [
    {
      "_index": "music",
      "_type": "children",
      "_id": "2",
      "term_statistics": true
    },
    {
      "_index": "music",
      "_type": "children",
      "_id": "1",
      "fields": [
        "content"
      ]
    }
  ]
}

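Recommendations for using term vectors

There are two ways to obtain term vector information: configure it when the index is created, as in the example above, or have it generated directly at query time.

index-time: configured in the mapping. The term and field statistics are generated as documents are indexed. If term_vector is set to with_positions_offsets, the field's index takes roughly twice the space it would without term vectors.
query-time: no term vector information has been stored beforehand. When you request a term vector, the statistics are computed on the fly and returned to you.

Which approach to use depends on how you expect to use term vectors. Query-time generation is the more common choice: the tool exists to help locate problems, so computing the statistics on demand is enough.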

