这期内容当中小编将会给大家带来有关Hive常用查询命令和使用方法,文章内容丰富且以专业的角度为大家分析和叙述,阅读完这篇文章希望大家可以有所收获。
1. 将日志文件传到HDFS
```bash
hdfs dfs -mkdir /user/hive/warehouse/original_access_logs_0104
hdfs dfs -put access.log /user/hive/warehouse/original_access_logs_0104
```
检查文件是否已正确拷贝
```bash
hdfs dfs -ls /user/hive/warehouse/original_access_logs_0104
```
2. 建立Hive外部表对应于日志文件
```sql
DROP TABLE IF EXISTS original_access_logs;
CREATE EXTERNAL TABLE original_access_logs (
ip STRING,
request_time STRING,
method STRING,
url STRING,
http_version STRING,
code1 STRING,
code2 STRING,
dash STRING,
user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
'output.format.string' = "%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s")
LOCATION '/user/hive/warehouse/original_access_logs_0104';
```
3. 将TEXT表转换为PARQUET表
```sql
DROP TABLE IF EXISTS pq_access_logs;
CREATE TABLE pq_access_logs (
ip STRING,
request_time STRING,
method STRING,
url STRING,
http_version STRING,
code1 STRING,
code2 STRING,
dash STRING,
user_agent STRING,
`timestamp` int)
STORED AS PARQUET;
#ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;
#ADD JAR /opt/cloudera/parcels/CDH/lib/hive/contrib/hive-contrib-2.1.1-cdh7.3.2.jar
INSERT OVERWRITE TABLE pq_access_logs
SELECT
ip,
from_unixtime(unix_timestamp(request_time, 'dd/MMM/yyyy:HH:mm:ss z'), 'yyyy-MM-dd HH:mm:ss z'),
method,
url,
http_version,
code1,
code2,
dash,
user_agent,
unix_timestamp(request_time, 'dd/MMM/yyyy:HH:mm:ss z')
FROM original_access_logs;
```
4. 统计最多访问的5个IP
```sql
select ip, count(*) cnt
from pq_access_logs
group by ip
order by cnt desc
limit 5
```
注意观察Hive Job拆分成Map Reduce Job并执行
如何查看Hive Job执行的日志
## 演示 - 分区表
### 步骤
1. 创建分区表
```sql
DROP TABLE IF EXISTS partitioned_access_logs;
CREATE EXTERNAL TABLE partitioned_access_logs (
ip STRING,
request_time STRING,
method STRING,
url STRING,
http_version STRING,
code1 STRING,
code2 STRING,
dash STRING,
user_agent STRING,
`timestamp` int)
PARTITIONED BY (request_date STRING)
STORED AS PARQUET
;
```
2. 将日志表写入分区表,使用动态分区插入
```sql
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE partitioned_access_logs
PARTITION (request_date)
SELECT ip, request_time, method, url, http_version, code1, code2, dash, user_agent, `timestamp`, to_date(request_time) as request_date
FROM pq_access_logs
;
```
默认分区:__HIVE_DEFAULT_PARTITION__, 没有匹配上的记录会放在这个分区
3. 观察分区表目录结构
```bash
hdfs dfs -ls /user/hive/warehouse/partitioned_access_logs
```
## 演示 - 分桶表
### 步骤
1. 创建日志分桶表
按IP的第一段分桶,然后按请求时间排序
```sql
DROP TABLE IF EXISTS bucketed_access_logs;
CREATE TABLE bucketed_access_logs (
first_ip_addr INT,
request_time STRING)
CLUSTERED BY (first_ip_addr)
SORTED BY (request_time)
INTO 10 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
;
!如果DISTRIBUTE BY和SORT BY不写,则需要设置hive参数 (2.0后不用,默认为true)
SET hive.enforce.sorting = true;
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE bucketed_access_logs
SELECT cast(split(ip, '\\.')[0] as int) as first_ip_addr, request_time
FROM pq_access_logs
--DISTRIBUTE BY first_ip_addr
--SORT BY request_time
;
```
2. 观察分桶表的物理存储结构
```bash
hdfs dfs -ls /user/hive/warehouse/bucketed_access_logs/
# 猜猜有几个文件?
hdfs dfs -cat /user/hive/warehouse/bucketed_access_logs/000000_0 | head
hdfs dfs -cat /user/hive/warehouse/bucketed_access_logs/000001_0 | head
hdfs dfs -cat /user/hive/warehouse/bucketed_access_logs/000009_0 | head
# 能看出分桶的规则吗?
```
## 演示 - ORC表的压缩
1. 新建一张访问日志的ORC表,插入数据时启用压缩
```sql
DROP TABLE IF EXISTS compressed_access_logs;
CREATE TABLE compressed_access_logs (
ip STRING,
request_time STRING,
method STRING,
url STRING,
http_version STRING,
code1 STRING,
code2 STRING,
dash STRING,
user_agent STRING,
`timestamp` int)
STORED AS ORC
TBLPROPERTIES ("orc.compression"="SNAPPY");
--SET hive.exec.compress.intermediate=true;
--SET mapreduce.map.output.compress=true;
INSERT OVERWRITE TABLE compressed_access_logs
SELECT * FROM pq_access_logs;
describe formatted compressed_access_logs;
```
2. 和原来不启用压缩的Parquet表进行比对
大小
原始TXT是38 MB.
```
hdfs dfs -ls /user/hive/warehouse/pq_access_logs/
```
Parquet无压缩: 4,158,592 (4.1 MB)
```
hdfs dfs -ls /user/hive/warehouse/compressed_access_logs/
```
Orc压缩后: 1,074,404 (1.0 MB)
压缩比: 约等于5:2 (4:1 - Parquet Raw: ORC Compressed)
注意: 数据备份时建议启用压缩,数据读多的情况下,启用压缩不一定能带来查询性能提升。
上述就是小编为大家分享的Hive常用查询命令和使用方法了,如果刚好有类似的疑惑,不妨参照上述分析进行理解。如果想知道更多相关知识,欢迎关注亿速云行业资讯频道。
免责声明:本站发布的内容(图片、视频和文字)以原创、转载和分享为主,文章观点不代表本网站立场,如果涉及侵权请联系站长邮箱:is@yisu.com进行举报,并提供相关证据,一经查实,将立刻删除涉嫌侵权内容。