A Brief Introduction to Hive Views and Indexes

Published: 2021-09-15 21:47:07 Views: 122 Author: chen Category: Big Data

This article gives a brief introduction to Hive views and indexes: how to create, query, alter, and drop views, and how to create, rebuild, use, and drop indexes.

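I. Hive Views

1.1 Introduction

A view in Hive is the same concept as a view in an RDBMS: a logical representation of a set of data, essentially the result set of a SELECT statement. A view is a purely logical object with no associated storage (materialized views, introduced in Hive 3.0.0, are the exception). When a query references a view, Hive can combine the view's definition with the query, for example by pushing the query's filters down into the view.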

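1.2 Characteristics

Materialized views are not supported before Hive 3.0.0.
Views can only be queried; data cannot be loaded into them.
Creating a view only stores metadata; the underlying subquery runs only when the view is queried.
If the view definition contains ORDER BY/LIMIT and the query against the view also specifies ORDER BY/LIMIT, the clauses in the view definition are evaluated first.
Views can be nested: a view can be defined on top of another view.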

1.3 Creating a View

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name   -- view name
  [(column_name [COMMENT column_comment], ...) ]    -- column names
  [COMMENT view_comment]  -- view comment
  [TBLPROPERTIES (property_name = property_value, ...)]  -- extra properties
  AS SELECT ...;

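Notes on creating views

CREATE VIEW creates a view with the given name. If a table or view with the same name already exists, an error is raised. You can use IF NOT EXISTS to skip the error.
Dropping the base table does not drop the view; the view must be dropped manually.
Views are read-only and cannot be used as the target of LOAD / INSERT / ALTER; to change a view's metadata, use ALTER VIEW (see 1.6).
If no column names are supplied when the view is created, they are derived automatically from the SELECT statement.
A view may contain ORDER BY and LIMIT clauses. If a query that references the view also contains these clauses, the query-level clauses are evaluated after the view's clauses (and after any other operations in the query). For example, if the view specifies LIMIT 5 and the referencing query is executed as (select * from v LIMIT 10), at most 5 rows are returned.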

Prepare the data

-- Create the test table
create table default.user(
   id string,   -- primary key
   sex string,  -- gender
   name string  -- name
);
-- Load the data
insert into default.user (id, sex, name)
values ("1","男","张三"),("2","女","小花"),("3","男","赵柳"),("4","男","李嘿嘿");

Create a test view

hive (default)> create view if not exists  default.user_view as select * from default.user;
OK
id      sex     name
Time taken: 0.181 seconds
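The ORDER BY/LIMIT rule described in the notes above can be tried out with a small sketch. It assumes the default.user table from the previous step; the view name user_top2_view is made up for this illustration.

```sql
-- Hypothetical view name, for illustration only.
-- `user` is a reserved word from Hive 1.2.0 onward, so it is quoted with
-- backticks here; on older versions the quoting is harmless.
CREATE VIEW IF NOT EXISTS default.user_top2_view AS
SELECT id, sex, name
FROM default.`user`
ORDER BY id
LIMIT 2;

-- The view's LIMIT 2 is evaluated first and the query's LIMIT 10 afterwards,
-- so at most 2 rows can come back.
SELECT * FROM default.user_top2_view LIMIT 10;
```

Because the view-level clauses always run first, it is usually safer to leave ORDER BY/LIMIT out of a general-purpose view and apply them in the query instead.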

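1.4 Querying a View

-- Query the view's contents
select * from default.user_view;
-- Show the view's structure
desc default.user_view;
-- Show the view's detailed information
desc formatted default.user_view;
-- List views: there is no dedicated command here; views are listed together with tables
show tables;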
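As a sketch of two points made earlier (views can be built on other views, and column names can be supplied explicitly), the following assumes the default.user_view created in 1.3. The view name user_male_view and the column names user_id and user_name are hypothetical.

```sql
-- A view defined on top of another view, with explicit column names
-- and comments (all new names here are illustrative).
CREATE VIEW IF NOT EXISTS default.user_male_view
  (user_id COMMENT 'primary key', user_name COMMENT 'display name')
  COMMENT 'male users only'
AS
SELECT id, name
FROM default.user_view
WHERE sex = '男';

-- Query the nested view with an additional filter.
SELECT user_name FROM default.user_male_view WHERE user_id = '1';

-- EXPLAIN shows how Hive merges the view definitions with the query;
-- the exact plan depends on the Hive version and settings.
EXPLAIN SELECT user_name FROM default.user_male_view WHERE user_id = '1';
```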

1.5 Dropping a View

-- Template
DROP VIEW [IF EXISTS] [db_name.]view_name;
-- Drop the view
DROP VIEW IF EXISTS user_view;

1.6 Altering View Properties

Syntax:

ALTER VIEW [db_name.]view_name SET TBLPROPERTIES table_properties;

table_properties:
  : (property_name = property_value, property_name = property_value, ...)

Example:

alter view default.user_view set tblproperties ('name'='DSJLG','GZH'='DSJLG');

You can confirm the new properties by running desc formatted default.user_view; and checking the detailed information.
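Besides properties, the query behind a view can also be replaced. The sketch below assumes Hive 0.11 or later, where ALTER VIEW ... AS SELECT is available, and the default.user_view from 1.3. Running SHOW TBLPROPERTIES against a view is an assumption on my part, based on views being stored as table metadata; if your version rejects it, fall back to desc formatted.

```sql
-- Replace the view's defining query, listing the columns explicitly
-- instead of using SELECT *.
ALTER VIEW default.user_view AS
SELECT id, sex, name FROM default.`user`;

-- Check the resulting structure.
DESC default.user_view;

-- List the properties set in the example above (assumed to work for views).
SHOW TBLPROPERTIES default.user_view;
```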

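II. Indexes

2.1 Introduction

Hive introduced indexes in version 0.7.0. The goal of an index is to speed up queries on certain columns of a table. Without an index, a query with a predicate (such as 'WHERE table1.column = 10') loads the entire table or partition and processes every row. If an index exists on the column, only part of the files needs to be loaded and processed. Note that index support was removed in Hive 3.0.0, so this section applies only to earlier versions.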

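2.2 Index Creation Template

CREATE INDEX index_name     -- index name
  ON TABLE base_table_name (col_name, ...)  -- columns to index
  AS index_type    -- index type
  [WITH DEFERRED REBUILD]    -- defer building until the index is rebuilt
  [IDXPROPERTIES (property_name=property_value, ...)]  -- extra index properties
  [IN TABLE index_table_name]    -- name of the index table
  [ [ ROW FORMAT ...] STORED AS ...  
     | STORED BY ...
  ]   -- row format and storage format of the index table
  [LOCATION hdfs_path]  -- storage location of the index table
  [TBLPROPERTIES (...)]   -- table properties of the index table
  [COMMENT "index comment"];  -- index comment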

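2.3 Creating an Index

Using the user table created above, we create an index named user_index on the id column; the index data is stored in the index table user_index_table.

create index user_index on table user(id) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild in table user_index_table;

At this point the index table is still empty; the index has to be rebuilt before it contains any data.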
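The statement above spells out the full class name of the compact index handler. As a sketch of the other built-in index type, CREATE INDEX also accepts the shorthand 'COMPACT' and, from Hive 0.8.0, 'BITMAP', which is aimed at columns with few distinct values. The index name user_sex_bitmap_index is hypothetical, and the example only applies to Hive versions that still support indexes.

```sql
-- Bitmap index on the low-cardinality sex column (illustrative name).
-- No IN TABLE clause is given, so Hive chooses the index table name itself.
CREATE INDEX user_sex_bitmap_index
  ON TABLE `user` (sex)
  AS 'BITMAP'
  WITH DEFERRED REBUILD;

-- As with the compact index, nothing is stored until the index is rebuilt.
ALTER INDEX user_sex_bitmap_index ON `user` REBUILD;
```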

2.4 Rebuilding the Index

hive (default)> ALTER index user_index on user rebuild ;
Query ID = root_20201015081313_879ce697-a6a4-4c38-a1a9-0e72a52feb6b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1602711568359_0002, Tracking URL = http://node01:8088/proxy/application_1602711568359_0002/
Kill Command = /export/servers/hadoop-2.6.0-cdh6.14.0/bin/hadoop job  -kill job_1602711568359_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-10-15 08:13:47,425 Stage-1 map = 0%,  reduce = 0%
2020-10-15 08:13:48,546 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.66 sec
2020-10-15 08:13:49,576 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.5 sec
MapReduce Total cumulative CPU time: 2 seconds 500 msec
Ended Job = job_1602711568359_0002
Loading data to table default.user_index_table
Table default.user_index_table stats: [numFiles=1, numRows=4, totalSize=231, rawDataSize=227]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.5 sec   HDFS Read: 12945 HDFS Write: 581944 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 500 msec
OK
Time taken: 12.85 seconds

Hive launches a MapReduce job to build the index. Once the job finishes, the index table contains the data shown below. Its three columns hold the value of the indexed column, the HDFS path of the file that contains that value, and the offsets of that value within the file.

hive (default)> select * from user_index_table;
OK
user_index_table.id     user_index_table._bucketname    user_index_table._offsets
1       hdfs://node01:8020/user/hive/warehouse/user/000000_0 [0]
2       hdfs://node01:8020/user/hive/warehouse/user/000000_0 [13]
3       hdfs://node01:8020/user/hive/warehouse/user/000000_0 [26]
4       hdfs://node01:8020/user/hive/warehouse/user/000000_0 [39]
Time taken: 0.047 seconds, Fetched: 4 row(s)

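2.5 Using the Index Automatically

By default, Hive does not use an index automatically at query time even if one has been built; the relevant settings have to be enabled first. Once they are enabled, queries that filter on the indexed column use the index to optimize the query.

SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SET hive.optimize.index.filter=true;
SET hive.optimize.index.filter.compact.minsize=0;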
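A minimal way to try these settings, assuming the user table and the rebuilt user_index from the previous sections, is to run a filtered query and inspect its plan. No expected output is shown, because the plan text varies by version, and on a four-row table the index brings no measurable benefit.

```sql
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SET hive.optimize.index.filter=true;
-- minsize=0 lets the optimizer consider the index even for tiny inputs.
SET hive.optimize.index.filter.compact.minsize=0;

-- Inspect the plan to see whether the index table is read first.
EXPLAIN SELECT * FROM default.`user` WHERE id = '2';

-- Run the query itself; the filter is on the indexed id column.
SELECT * FROM default.`user` WHERE id = '2';
```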

2.6 Listing Indexes

show index on user;
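SHOW INDEX also has a FORMATTED variant that adds column headers, and it can be pointed at a specific database. A short sketch, assuming the user table lives in the default database as in the earlier examples:

```sql
-- Same listing with column headers.
SHOW FORMATTED INDEX ON `user`;

-- Name the database explicitly instead of relying on the current one.
SHOW INDEXES ON `user` IN default;
```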

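2.7 Dropping an Index

Dropping an index also drops the corresponding index table.

DROP INDEX [IF EXISTS] index_name ON table_name;

If a table that has indexes is dropped, its indexes and index tables are dropped as well. If a partition of the indexed table is dropped, the corresponding index partition is dropped too.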
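Applied to the index from 2.3, the template could look like the sketch below. The two follow-up statements are just one way to confirm that both the index and its index table are gone.

```sql
-- Drop the index created in 2.3; its index table goes with it.
DROP INDEX IF EXISTS user_index ON `user`;

-- The index should no longer be listed.
SHOW INDEX ON `user`;

-- The index table should no longer appear among the tables.
SHOW TABLES 'user_index*';
```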

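2.8 How Indexes Work

Creating an index on a column produces an index table (a physical Hive table) whose fields are the value of the indexed column, the HDFS path of the file containing that value, and the offsets of that value within the file.
When a query filters on the indexed column, Hive first generates an extra MapReduce job. Using the filter condition on the indexed column, this job selects the matching HDFS file paths and offsets from the index table and writes them to a file on HDFS. The paths and offsets in that file are then used to filter the original input files and generate new splits, which become the splits for the whole job. This is how a full table scan is avoided.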
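The two-phase process described above can also be carried out by hand, which makes the mechanism easier to see. This is a sketch that assumes the compact index and the user_index_table from 2.3 and 2.4; the directory /tmp/index_result is a hypothetical scratch path, and property names should be checked against your Hive version.

```sql
-- Phase 1: look up the matching file paths and offsets in the index table
-- and write them out. Column names that start with an underscore must be
-- quoted with backticks.
INSERT OVERWRITE DIRECTORY '/tmp/index_result'
SELECT `_bucketname`, `_offsets`
FROM default.user_index_table
WHERE id = '2';

-- Phase 2: tell the compact index input format where that result lives,
-- so that only the listed blocks of the base table become input splits.
SET hive.index.compact.file=/tmp/index_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

-- The base-table query now reads only the splits selected in phase 1.
SELECT * FROM default.`user` WHERE id = '2';
```

With hive.optimize.index.filter enabled as in 2.5, Hive performs the equivalent of these steps automatically.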

That concludes this brief introduction to Hive views and indexes. Try the statements above on your own cluster to get a feel for how they behave.
