本篇内容主要讲解“Hive的底层执行流程”,感兴趣的朋友不妨来看看。本文介绍的方法操作简单快捷,实用性强。下面就让小编来带大家学习“Hive的底层执行流程”吧!
我们以下面这个SQL为例
FROM src INSERT OVERWRITE TABLE dest_g1 SELECT src.key, sum(substr(src.value,4)) GROUP BY src.key;
整个编译过程分为六个阶段:
1.Antlr定义SQL的语法规则,完成SQL词法,语法解析,将SQL
转化为抽象语法树AST Tree
HiveLexerX,HiveParser分别是Antlr对SQL编译后自动生成的词法解析和语法解析类,在这两个类中进行复杂的解析。
例子中的AST tree为
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF src))
(TOK_INSERT (TOK_DESTINATION (TOK_TAB dest_g1))
(TOK_SELECT (TOK_SELEXPR (TOK_COLREF src key))
(TOK_SELEXPR (TOK_FUNCTION sum (TOK_FUNCTION substr (TOK_COLREF src value) 4))))
(TOK_GROUPBY (TOK_COLREF src key))))
2.遍历AST Tree,抽象出查询的基本组成单元QueryBlock
AST Tree 仍然非常复杂,不够结构化,不方便直接翻译为 MapReduce 程序, AST
Tree 转化为 QueryBlock(QB)就是将 SQL 进一部抽象和结构化。
AST Tree 生成 QueryBlock 的过程是一个递归的过程,先序遍历 AST Tree ,遇到不
同的Token 节点(理解为特殊标记),保存到相应的属性中,主要包含以下几个过程
TOK_QUERY => 创建 QB 对象,循环递归子节点
TOK_FROM => 将表名语法部分保存到 QB 对象的 aliasToTabs 等属性中
TOK_INSERT => 循环递归子节点
TOK_DESTINATION => 将输出目标的语法部分保存在 QBParseInfo 对象的nameToDest 属性中
TOK_SELECT => 分别将查询表达式的语法部分保存在 destToSelExpr 、destToAggregationExprs 、 destToDistinctFuncExprs 三个属性中
TOK_WHERE => 将 Where 部分的语法保存在 QBParseInfo 对象的destToWhereExpr 属性中
3.遍历QueryBlock,翻译为执行操作树OperatorTree
Hive 最终生成的 MapReduce 任务, Map 阶段和 Reduce 阶段均由 Operator Tree
组成。逻辑操作符,就是在 Map 阶段或者 Reduce 阶段完成单一特定的操作。
基本的操作符包括
TableScanOperator、SelectOperator、FilterOperator、JoinOperator、GroupByOperator、ReduceSinkOperator
QueryBlock 生成 Operator Tree 就是遍历上一个过程中生成的 QB 和 QBParseInfo
对象的保存
语法的属性,包含如下几个步骤:
QB#aliasToSubq => 有子查询,递归调用
QB#aliasToTabs => TableScanOperator
QBParseInfo#joinExpr => QBJoinTree => ReduceSinkOperator + JoinOperator
QBParseInfo#destToWhereExpr => FilterOperator
QBParseInfo#destToGroupby => ReduceSinkOperator +GroupByOperator
QBParseInfo#destToOrderby => ReduceSinkOperator + ExtractOperator
由于 Join/GroupBy/OrderBy 均需要在 Reduce 阶段完成,所以在生成相应操作的Operator 之前都会先生成一个 ReduceSinkOperator ,将字段组合并序列化为 Reduce Key/value,Partition Key
SQL例子翻译成OperatorTree
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
src
Reduce Output Operator
key expressions:
expr: key
type: string
sort order: +
Map-reduce partition columns:
expr: rand()
type: double
tag: -1
value expressions:
expr: substr(value, 4)
type: string
Reduce Operator Tree:
Group By Operator
aggregations:
expr: sum(UDFToDouble(VALUE.0))
keys:
expr: KEY.0
type: string
mode: partial1
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.mapred.SequenceFileOutputFormat
name: binary_table
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
/tmp/hive-zshao/67494501/106593589.10001
Reduce Output Operator
key expressions:
expr: 0
type: string
sort order: +
Map-reduce partition columns:
expr: 0
type: string
tag: -1
value expressions:
expr: 1
type: double
Reduce Operator Tree:
Group By Operator
aggregations:
expr: sum(VALUE.0)
keys:
expr: KEY.0
type: string
mode: final
Select Operator
expressions:
expr: 0
type: string
expr: 1
type: double
Select Operator
expressions:
expr: UDFToInteger(0)
type: int
expr: 1
type: double
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe
name: dest_g1
Stage: Stage-0
Move Operator
tables:
replace: true
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe
name: dest_g1
4.Logical Optimizer进行OperatorTree变换,合并不必要的
使用ReduceSinkOperator,减少shuffle数据量。大部分逻辑层优化器通过变换 OperatorTree ,合并操作符,达到减少 MapReduce Job ,减少 shuffle 数据量的目的。
5.遍历OperatorTree,翻译为Task tree
OperatorTree 转化为 Task tree的过程分为下面几个阶段
对输出表生成 MoveTask
从 OperatorTree 的其中一个根节点向下深度优先遍历
ReduceSinkOperator 标示 Map/Reduce 的界限,多个 Job 间的界限
遍历其他根节点,遇过碰到 JoinOperator 合并 MapReduceTask
生成 StatTask 更新元数据
剪断 Map 与 Reduce 间的 Operator 的关系
6.PhysicalOptimizer 对Task tree优化,生成最终的执行计划
7、执行
以上就是HiveSQL的底层执行流程
我们在开发中,可以使用下面这个语句来打印SQL语句的相关运行信息
EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query
注:我的版本是hive-1.1.0-cdh6.7.0,所以只可用三个可选属性,如果您版本比较高的话,可以去官网查阅对应属性
下面我对三种可选属性进行简单介绍
EXTENDED:打印SQL解析成AST&Operator Tree最全面的信息
hive (g6_hadoop)> explain EXTENDED insert OVERWRITE table g6_access_orc_explain select domain,count(1) num from g6_access_orc where traffic>'99900' group by domain;
OK
Explain
ABSTRACT SYNTAX TREE:
TOK_QUERY
TOK_FROM
TOK_TABREF
TOK_TABNAME
g6_access_orc
TOK_INSERT
TOK_DESTINATION
TOK_TAB
TOK_TABNAME
g6_access_orc_explain
TOK_SELECT
TOK_SELEXPR
TOK_TABLE_OR_COL
domain
TOK_SELEXPR
TOK_FUNCTION
count
1
num
TOK_WHERE
>
TOK_TABLE_OR_COL
traffic
'99900'
TOK_GROUPBY
TOK_TABLE_OR_COL
domain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-2 depends on stages: Stage-0
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: g6_access_orc
Statistics: Num rows: 260326 Data size: 188215698 Basic stats: COMPLETE Column stats: NONE
GatherStats: false
Filter Operator
isSamplingPred: false
predicate: (traffic > 99900) (type: boolean)
Statistics: Num rows: 86775 Data size: 62738325 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: domain (type: string)
outputColumnNames: domain
Statistics: Num rows: 86775 Data size: 62738325 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(1)
keys: domain (type: string)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 86775 Data size: 62738325 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 86775 Data size: 62738325 Basic stats: COMPLETE Column stats: NONE
tag: -1
value expressions: _col1 (type: bigint)
auto parallelism: false
Path -> Alias:
hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc [g6_access_orc]
Path -> Partition:
hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc
Partition
base file name: g6_access_orc
input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
properties:
COLUMN_STATS_ACCURATE true
bucket_count -1
columns cdn,region,level,time,ip,domain,url,traffic
columns.comments
columns.types string:string:string:string:string:string:string:bigint
field.delim
file.inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
location hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc
name g6_hadoop.g6_access_orc
numFiles 1
numRows 260326
rawDataSize 188215698
serialization.ddl struct g6_access_orc { string cdn, string region, string level, string time, string ip, string domain, string url, i64 traffic}
serialization.format
serialization.lib org.apache.hadoop.hive.ql.io.orc.OrcSerde
totalSize 8567798
transient_lastDdlTime 1557676635
serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
properties:
COLUMN_STATS_ACCURATE true
bucket_count -1
columns cdn,region,level,time,ip,domain,url,traffic
columns.comments
columns.types string:string:string:string:string:string:string:bigint
field.delim
file.inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
location hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc
name g6_hadoop.g6_access_orc
numFiles 1
numRows 260326
rawDataSize 188215698
serialization.ddl struct g6_access_orc { string cdn, string region, string level, string time, string ip, string domain, string url, i64 traffic}
serialization.format
serialization.lib org.apache.hadoop.hive.ql.io.orc.OrcSerde
totalSize 8567798
transient_lastDdlTime 1557676635
serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
name: g6_hadoop.g6_access_orc
name: g6_hadoop.g6_access_orc
Truncated Path -> Alias:
/g6_hadoop.db/g6_access_orc [g6_access_orc]
Needs Tagging: false
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 43387 Data size: 31368801 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
GlobalTableId: 1
directory: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain/.hive-staging_hive_2019-05-23_23-37-06_889_2210604962026719569-1/-ext-10000
NumFilesPerFileSink: 1
Statistics: Num rows: 43387 Data size: 31368801 Basic stats: COMPLETE Column stats: NONE
Stats Publishing Key Prefix: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain/.hive-staging_hive_2019-05-23_23-37-06_889_2210604962026719569-1/-ext-10000/
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
COLUMN_STATS_ACCURATE true
bucket_count -1
columns domain,num
columns.comments
columns.types string:bigint
field.delim |
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain
name g6_hadoop.g6_access_orc_explain
numFiles 1
numRows 7
rawDataSize 149
serialization.ddl struct g6_access_orc_explain { string domain, i64 num}
serialization.format |
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 156
transient_lastDdlTime 1558661108
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: g6_hadoop.g6_access_orc_explain
TotalFiles: 1
GatherStats: true
MultiFileSpray: false
Stage: Stage-0
Move Operator
tables:
replace: true
source: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain/.hive-staging_hive_2019-05-23_23-37-06_889_2210604962026719569-1/-ext-10000
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
COLUMN_STATS_ACCURATE true
bucket_count -1
columns domain,num
columns.comments
columns.types string:bigint
field.delim |
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain
name g6_hadoop.g6_access_orc_explain
numFiles 1
numRows 7
rawDataSize 149
serialization.ddl struct g6_access_orc_explain { string domain, i64 num}
serialization.format |
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 156
transient_lastDdlTime 1558661108
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: g6_hadoop.g6_access_orc_explain
Stage: Stage-2
Stats-Aggr Operator
Stats Aggregation Key Prefix: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain/.hive-staging_hive_2019-05-23_23-37-06_889_2210604962026719569-1/-ext-10000/
Time taken: 1.359 seconds, Fetched: 198 row(s)
AUTHORIZATION :打印SQL运行相关权限
hive (g6_hadoop)> explain AUTHORIZATION insert OVERWRITE table g6_access_orc_explain select domain,count(1) num from g6_access_orc where traffic>'99900' group by domain;
OK
Explain
INPUTS:
g6_hadoop@g6_access_orc
OUTPUTS:
g6_hadoop@g6_access_orc_explain
CURRENT_USER:
hadoop
OPERATION:
QUERY
AUTHORIZATION_FAILURES:
No privilege 'Update' found for outputs { database:g6_hadoop, table:g6_access_orc_explain}
No privilege 'Select' found for inputs { database:g6_hadoop, table:g6_access_orc, columnName:domain}
Time taken: 0.599 seconds, Fetched: 11 row(s)
DEPENDENCY:打印SQL输入表的相关信息
hive (g6_hadoop)> explain DEPENDENCY insert OVERWRITE table g6_access_orc_explain select domain,count(1) num from g6_access_orc where traffic>'99900' group by domain;
Explain
{"input_partitions":[],"input_tables":[{"tablename":"g6_hadoop@g6_access_orc","tabletype":"MANAGED_TABLE"}]}
Time taken: 0.135 seconds, Fetched: 1 row(s)
到此,相信大家对“Hive的底层执行流程”有了更深的了解,不妨来实际操作一番吧!这里是亿速云网站,更多相关内容可以进入相关频道进行查询,关注我们,继续学习!
亿速云「云服务器」,即开即用、新一代英特尔至强铂金CPU、三副本存储NVMe SSD云盘,价格低至29元/月。点击查看>>
免责声明:本站发布的内容(图片、视频和文字)以原创、转载和分享为主,文章观点不代表本网站立场,如果涉及侵权请联系站长邮箱:is@yisu.com进行举报,并提供相关证据,一经查实,将立刻删除涉嫌侵权内容。