Common Elasticsearch Operations: Mappings

Published: 2020-05-25 05:17:01  Source: Web  Views: 3511  Author: xpleaf  Category: Big Data


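Mapping comes down to whether the types of es fields are detected automatically by es or specified by us, so there are two kinds: dynamic mapping and static mapping.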

1 Dynamic mapping

1.1 Mapping rules

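JSON value	Inferred field type
null	No field is added
true or false	boolean
Floating-point number	float
Integer	long
JSON object	object
Array	Determined by the first non-null value in the array
string	date (if date detection is on and the string matches), double or long (if numeric detection is on), otherwise text with a keyword sub-field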
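
The rules in the table can be sketched in a few lines of Python. This is only an illustration, not how es implements it: infer_dynamic_type is a hypothetical helper, numeric detection is left out (it is off by default), and the regular expression is a rough stand-in for the date formats es really checks.

```python
import re

# Rough approximation of an ISO-style date such as "2018-10-27" or
# "2018-10-22T23:12:22Z"; the real es date detection uses its own format list.
_DATE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2})?)?$"
)


def infer_dynamic_type(value, date_detection=True):
    """Return the type dynamic mapping would pick, or None if no field is added."""
    if value is None:
        return None
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, float):
        return "float"
    if isinstance(value, int):
        return "long"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # The first non-null element decides the type of the whole array.
        for item in value:
            if item is not None:
                return infer_dynamic_type(item, date_detection)
        return None
    if isinstance(value, str):
        if date_detection and _DATE_RE.match(value):
            return "date"
        return "text"  # es also adds a keyword sub-field
    raise TypeError(f"not a JSON value: {value!r}")
```

Calling infer_dynamic_type("2018-10-27") gives "date", while the same call with date_detection=False gives "text", which matches the two test cases in section 1.2.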

1.2 Date detection

Date detection is enabled by default (es 5.4). A test case:

PUT myblog

GET myblog/_mapping

PUT myblog/article/1
{
  "id":1,
  "postdate":"2018-10-27"
}

GET myblog/_mapping
{
  "myblog": {
    "mappings": {
      "article": {
        "properties": {
          "id": {
            "type": "long"
          },
          "postdate": {
            "type": "date"
          }
        }
      }
    }
  }
}

With date detection turned off, the string is no longer detected as a date:

PUT myblog
{
  "mappings": {
    "article": {
      "date_detection": false
    }
  }
}

GET myblog/_mapping

PUT myblog/article/1
{
  "id":1,
  "postdate":"2018-10-27"
}

GET myblog/_mapping
{
  "myblog": {
    "mappings": {
      "article": {
        "date_detection": false,
        "properties": {
          "id": {
            "type": "long"
          },
          "postdate": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

2 Static mapping

2.1 Basic example

PUT myblog
{
  "mappings": {
    "article": {
      "properties": {
        "id":{"type": "long"},
        "title":{"type": "text"},
        "postdate":{"type": "date"}
      }
    }
  }
}

GET myblog/_mapping

PUT myblog/article/1
{
  "id":1,
  "title":"elasticsearch is wonderful!",
  "postdate":"2018-10-27"
}

GET myblog/_mapping
{
  "myblog": {
    "mappings": {
      "article": {
        "properties": {
          "id": {
            "type": "long"
          },
          "postdate": {
            "type": "date"
          },
          "title": {
            "type": "text"
          }
        }
      }
    }
  }
}

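2.2 The dynamic setting

By default, when a document is added and it contains a new field, es adds that field to the mapping as well. This can be controlled with the dynamic setting:

dynamic value	Description
true	Default; new fields are added automatically
false	New fields are ignored
strict	Strict mode; an exception is thrown when a new field is found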

PUT myblog
{
  "mappings": {
    "article": {
      "dynamic":"strict",
      "properties": {
        "id":{"type": "long"},
        "title":{"type": "text"},
        "postdate":{"type": "date"}
      }
    }
  }
}

GET myblog/_mapping

PUT myblog/article/1
{
  "id":1,
  "title":"elasticsearch is wonderful!",
  "content":"a long text",
  "postdate":"2018-10-27"
}

{
  "error": {
    "root_cause": [
      {
        "type": "strict_dynamic_mapping_exception",
        "reason": "mapping set to strict, dynamic introduction of [content] within [article] is not allowed"
      }
    ],
    "type": "strict_dynamic_mapping_exception",
    "reason": "mapping set to strict, dynamic introduction of [content] within [article] is not allowed"
  },
  "status": 400
}
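
To make the three values easier to compare, here is a small Python simulation. It is a sketch under assumptions: apply_dynamic and StrictDynamicMappingError are hypothetical names, and every new field is simply recorded as text instead of going through real type detection.

```python
class StrictDynamicMappingError(Exception):
    """Stand-in for the strict_dynamic_mapping_exception returned by es."""


def apply_dynamic(properties, doc, dynamic="true"):
    """Return the mapping properties after indexing doc under a dynamic setting."""
    updated = dict(properties)
    for field in doc:
        if field in updated:
            continue  # field already mapped, nothing to decide
        if dynamic == "strict":
            raise StrictDynamicMappingError(
                f"dynamic introduction of [{field}] is not allowed"
            )
        if dynamic == "true":
            # Simplification: a real cluster would infer the type from the value.
            updated[field] = {"type": "text"}
        # dynamic == "false": the field stays in _source but is not mapped.
    return updated


mapping = {
    "id": {"type": "long"},
    "title": {"type": "text"},
    "postdate": {"type": "date"},
}
doc = {"id": 1, "title": "elasticsearch is wonderful!",
       "content": "a long text", "postdate": "2018-10-27"}
```

With dynamic="true" the returned mapping gains a content entry, with "false" it is unchanged, and with "strict" the call raises, mirroring the error response shown above.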

3 Field types

3.1 Ordinary field types

Category	Subcategory	Types
Core types	String	string, text, keyword
	Numeric	long, integer, short, byte, double, float, half_float, scaled_float
	Date	date
	Boolean	boolean
	Binary	binary
	Range	range
Complex types	Array	array
	Object	object
	Nested	nested
Geo types	Geo-point	geo_point
	Geo-shape	geo_shape
Specialized types	IP	ip
	Completion (auto-complete)	completion
	Token count	token_count
	Attachment	attachment
	Percolator	percolator

Only the types I commonly use at work are covered below; for full details see the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/mapping.html

3.1.1 string

Deprecated since es 5.x: it can still be added, but text or keyword should be used instead.

3.1.2 text

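Used for fields that need full-text search. The field content is processed by an analyzer: before the inverted index is built, the string is split into individual terms.

In practice, text is mostly used for long-text fields such as an article's content; sorting or aggregating on such a field is clearly not very meaningful.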

3.1.3 keyword

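A keyword field can only be found by searching for its exact value, which is what sets it apart from text.

The indexed term is the field content itself, so in practice keyword is used for comparison, sorting, aggregation, and similar operations.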
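
The difference between the two types can be illustrated with a toy inverted-index sketch in Python. It rests on assumptions: analyze_text is a hypothetical helper that only lowercases and splits on non-alphanumeric characters, a much cruder process than the real standard analyzer, and index_keyword simply keeps the whole value as a single term.

```python
import re


def analyze_text(value):
    """Toy analyzer for a text field: lowercase, then split into terms."""
    return [term for term in re.split(r"[^0-9a-z]+", value.lower()) if term]


def index_keyword(value):
    """A keyword field indexes the whole value as one term, unchanged."""
    return [value]


title = "elasticsearch is wonderful!"
text_terms = analyze_text(title)      # several terms, one per word
keyword_terms = index_keyword(title)  # exactly one term: the full string

# A term lookup for a single word hits the text field only.
word_in_text = "wonderful" in text_terms
word_in_keyword = "wonderful" in keyword_terms
# The full original string hits the keyword field only.
full_in_keyword = title in keyword_terms
```

This is why a single word finds the document through a text field, while a keyword field needs the complete original value.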

3.1.4 Numeric types

For the finer points, see the official documentation; ordinary use generally meets the need.

3.1.5 date

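JSON has no date type, so by default a date in es can take one of these forms:

1. "yyyy-MM-dd" or "yyyy-MM-ddTHH:mm:ssZ"
- That is, a "yyyy-MM-dd HH:mm:ss" value has to be written as "2018-10-22T23:12:22Z": the T separator replaces the space and a time zone is appended;
2. A long number representing a timestamp in milliseconds
3. An integer representing a timestamp in seconds (this only parses when the field's format includes epoch_second; the default format is strict_date_optional_time||epoch_millis)

Internally es stores the date as a long number of milliseconds.

That is only the default behavior, of course. When setting a field's type we can also define our own date format: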

PUT myblog
{
  "mappings": {
    "article": {
      "properties": {
        "postdate":{
          "type": "date",
          "format": "yyyy-MM-dd HH:mm:ss"
        }
      }
    }
  }
}

format can also specify several date formats, separated by "||":

"format": "yyyy-MM-dd HH:mm:ss||yyyy/MM/dd HH:mm:ss"

After that, data in the defined date format can be written:

PUT myblog/article/1
{
  "postdate":"2017-09-23 23:12:22"
}

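In my own work, when a time needs to be stored, it is often converted to a millisecond timestamp first and then stored in es; when it is read back for display, it is converted to a time string again.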
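
The round trip described above might look like this in Python. It is a minimal sketch under assumptions: the helper names are hypothetical, the strptime pattern is the Python equivalent of "yyyy-MM-dd HH:mm:ss", and the string is treated as UTC (adjust this if your data is in local time).

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
PY_FORMAT = "%Y-%m-%d %H:%M:%S"  # Python counterpart of yyyy-MM-dd HH:mm:ss


def to_epoch_millis(text, fmt=PY_FORMAT):
    """Parse a time string (assumed UTC) into milliseconds since the epoch."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    # Integer division by a timedelta avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(millis, fmt=PY_FORMAT):
    """Turn milliseconds since the epoch back into a UTC time string."""
    return (EPOCH + timedelta(milliseconds=millis)).strftime(fmt)


millis = to_epoch_millis("2017-09-23 23:12:22")  # value to store in the date field
shown = from_epoch_millis(millis)                # string to display after reading
```

Because es accepts millisecond timestamps in a date field by default, the stored value needs no custom format in the mapping.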

3.1.6 boolean

Once a field's type is set to boolean, the accepted values are: true, false, "true", "false".

3.1.7 binary

The binary type accepts a Base64-encoded string.
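
For example, raw bytes can be prepared for a binary field with Python's standard base64 module. The field name blob and both helper functions are made up for this sketch; base64.b64encode is used because its output contains no line breaks.

```python
import base64
import json


def build_binary_doc(raw_bytes, field="blob"):
    """Return a JSON request body whose field holds the Base64-encoded bytes."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return json.dumps({field: encoded})


def read_binary_doc(body, field="blob"):
    """Decode the Base64 string from a JSON body back into the original bytes."""
    return base64.b64decode(json.loads(body)[field])


payload = b"\x00\x01 some binary payload \xff"
body = build_binary_doc(payload)   # send this as the document source
restored = read_binary_doc(body)   # what the client gets back from _source
```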

3.1.8 array

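es has no dedicated array type. By default any field can hold one or more values, but all values in an array must be of the same type. When data is added dynamically, the type of the first value in the array determines the type of the whole array (that is, the type of the field); mixed-type arrays are not supported. An array may contain null values, and an empty array [] is treated as a missing field. Using arrays in a document requires no configuration in advance; it is supported out of the box.

For example, add a field holding the following array: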

DELETE my_index

PUT my_index/my_type/1
{
  "lists":[
    {
      "name":"xpleaf",
      "job":"es"
    }
  ]
}

The lists field is in fact dynamically mapped as an object, and its name and job sub-fields as text:

GET my_index/my_type/_mapping

{
  "my_index": {
    "mappings": {
      "my_type": {
        "properties": {
          "lists": {
            "properties": {
              "job": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "name": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Searching it directly is supported as well:

GET my_index/my_type/_search
{
  "query": {
    "term": {
      "lists.name": {
        "value": "xpleaf"
      }
    }
  }
}

Result:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "lists": [
            {
              "name": "xpleaf",
              "job": "es"
            }
          ]
        }
      }
    ]
  }
}

3.1.9 object

A JSON object can be written to es directly, as follows:

DELETE my_index

PUT my_index/my_type/1
{
  "object":{
    "name":"xpleaf",
    "job":"es"
  }
}

The object field is in fact dynamically mapped as an object, and its name and job sub-fields as text:

{
  "my_index": {
    "mappings": {
      "my_type": {
        "properties": {
          "object": {
            "properties": {
              "job": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "name": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Searching it directly also works:

GET my_index/my_type/_search
{
  "query": {
    "term": {
      "object.name": {
        "value": "xpleaf"
      }
    }
  }
}

Result:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "object": {
            "name": "xpleaf",
            "job": "es"
          }
        }
      }
    ]
  }
}

An object is in fact flattened inside es; the one above actually becomes:

{"object.name":"xpleaf", "object.job":"es"}
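
The flattening can be reproduced with a short recursive function. This is only a sketch of the idea, with flatten as a hypothetical helper; it says nothing about how es does it internally.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict keyed by dotted field paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))  # descend into the inner object
        else:
            flat[path] = value
    return flat


source = {"object": {"name": "xpleaf", "job": "es"}}
flattened = flatten(source)
```

Here flattened holds the keys object.name and object.job, which is why the term query above can address object.name directly.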

3.1.10 nested

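The nested type is a special case of the object type that lets arrays of objects be indexed and queried independently. Lucene has no concept of inner objects, so es flattens the object hierarchy into a simple list of field names and values.

Although it is a special case of object, the type of the field is fixed as nested, which is the biggest difference from object.

So why use nested at all when object seems to be enough? Here is an example from the official documentation (https://www.elastic.co/guide/en/elasticsearch/reference/5.6/nested.html):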

Arrays of inner object fields do not work the way you may expect. Lucene has no concept of inner objects, so Elasticsearch flattens object hierarchies into a simple list of field names and values. For instance, the following document:

PUT my_index/my_type/1
{
  "group" : "fans",
  "user" : [ 
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

would be transformed internally into a document that looks more like this:

{
  "group" :        "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" :  [ "smith", "white" ]
}

The user.first and user.last fields are flattened into multi-value fields, and the association between alice and white is lost. This document would incorrectly match a query for alice AND smith:

GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "user.first": "Alice" }},
        { "match": { "user.last":  "Smith" }}
      ]
    }
  }
}

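The above shows the problem with using object directly: the document should not match that search, yet it does. The nested type keeps each object in the array independent. It indexes each object in the array as a separate hidden document, which means each nested object can be searched on its own.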

If you need to index arrays of objects and to maintain the independence of each object in the array, you should use the nested datatype instead of the object datatype. Internally, nested objects index each object in the array as a separate hidden document, meaning that each nested object can be queried independently of the others, with the nested query:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "user": {
          "type": "nested" 
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "Smith" }} 
          ]
        }
      }
    }
  }
}

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "White" }} 
          ]
        }
      },
      "inner_hits": { 
        "highlight": {
          "fields": {
            "user.first": {}
          }
        }
      }
    }
  }
}

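Indexing a document that contains 100 nested fields actually indexes 101 documents, because each nested document is indexed as a separate document. To guard against defining too many nested fields, the number of nested fields that can be defined per index is limited to 50.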
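
The Alice/Smith mismatch from the official example can be simulated in plain Python. This is a toy model with hypothetical function names: values are lowercased to imitate analysis, the object case checks each condition against the flattened multi-value fields, and the nested case requires all conditions to hold inside the same object.

```python
def flatten_object_array(field, objects):
    """Mimic how an array of plain objects becomes multi-value fields."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value.lower())
    return flat


def matches_as_object(flat, conditions):
    """bool/must on flattened fields: each condition sees the whole value list."""
    return all(value.lower() in flat.get(path, [])
               for path, value in conditions.items())


def matches_as_nested(field, objects, conditions):
    """nested query: every condition must match within one and the same object."""
    prefix = field + "."
    return any(
        all(obj.get(path[len(prefix):], "").lower() == value.lower()
            for path, value in conditions.items())
        for obj in objects
    )


users = [{"first": "John", "last": "Smith"},
         {"first": "Alice", "last": "White"}]
query = {"user.first": "Alice", "user.last": "Smith"}

object_hit = matches_as_object(flatten_object_array("user", users), query)
nested_hit = matches_as_nested("user", users, query)
```

object_hit comes out true because alice and smith both appear somewhere in the flattened lists, while nested_hit is false because no single user is Alice Smith.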

3.1.11 range

The range types and their value ranges are as follows:

Type	Range
integer_range	-2^31 to 2^31-1
float_range	32-bit IEEE 754
long_range	-2^63 to 2^63-1
double_range	64-bit IEEE 754
date_range	64-bit integer, in milliseconds

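3.2 Meta-fields

Meta-fields are the fields that describe the document itself. Their categories and descriptions are as follows:

Category	Field	Purpose
Identity meta-fields	_index	The index the document belongs to
	_uid	A composite field made of `_type` and `_id` (with the value `{type}#{id}`)
	_type	The type of the document
	_id	The id of the document
Document source meta-fields	_source	The original JSON string of the document
	_size	The size of the _source field in bytes (provided by the mapper-size plugin)
Indexing meta-fields	_all	A catch-all field that contains all the fields of the document
	_field_names	All fields in the document that contain non-null values
Routing meta-fields	_parent	Specifies the parent-child relationship between documents
	_routing	A custom routing value that routes the document to a particular shard
Custom meta-field	_meta	Used for custom metadata

For a detailed description of each field, see: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/mapping-fields.html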

4 Mapping parameters

See: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/mapping-params.html
