Lu Chunli's work notes: who says programmers can't have a touch of style?
Create a mywork database in Hive; later tests will run in this database, to avoid falling back on the default database every time.
hive> create database mywork;
OK
Time taken: 0.487 seconds
hive> show databases;
OK
default
mywork
Time taken: 0.614 seconds, Fetched: 2 row(s)
hive>
hive> use mywork;
OK
Time taken: 0.064 seconds
hive> create table student(id int, name string);
OK
Time taken: 0.519 seconds
hive>
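As an aside, CREATE DATABASE also accepts IF NOT EXISTS, a COMMENT, and an explicit LOCATION. A minimal sketch in standard HiveQL (not part of the original session; the comment text and location are illustrative):

create database if not exists mywork
comment 'sandbox database for these notes'
location '/user/hive/warehouse/mywork.db';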
Check how Hive stores this on HDFS:
[hadoop@dnode1 ~]$ hdfs dfs -ls -R /user/hive
drwxrw-rw- - hadoop hadoop 0 2015-12-08 21:37 /user/hive/warehouse
drwxrw-rw- - hadoop hadoop 0 2015-12-08 21:36 /user/hive/warehouse/mywork.db
drwxrw-rw- - hadoop hadoop 0 2015-12-08 21:36 /user/hive/warehouse/mywork.db/student
[hadoop@dnode1 ~]$
Hive supports the following data types.
Primitive types:
TINYINT 1 byte
SMALLINT 2 bytes
INT 4 bytes
BIGINT 8 bytes
BOOLEAN true/false
FLOAT 4 bytes
DOUBLE 8 bytes
STRING character string
BINARY (available from Hive 0.8.0)
TIMESTAMP (available from Hive 0.8.0)
Complex types:
arrays: ARRAY<data_type> ordered fields, all of the same type
maps: MAP<primitive_type, data_type> unordered key/value pairs
structs: STRUCT<col_name : data_type [COMMENT col_comment], ...> a group of named fields
union: UNIONTYPE<data_type, data_type, ...>
Note: ARRAY values are accessed by index (e.g. arrays[0]), MAP values by key (e.g. map['key_name']), and STRUCT fields by dot notation (e.g. structs.col_name).
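UNIONTYPE is the one complex type not exercised below. A minimal sketch using Hive's built-in create_union function, whose first argument is the tag selecting which branch of the union holds the value (the table name union_demo is made up for illustration; FROM-less SELECT assumes Hive 0.13+):

create table union_demo (u uniontype<int, string>);
select create_union(0, 65, 'sixty-five');   -- tag 0 picks the int branch: {0:65}
select create_union(1, 65, 'sixty-five');   -- tag 1 picks the string branch: {1:"sixty-five"}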
Table creation example:
hive> create table employee (
> eno int comment 'the no of employee',
> ename string comment 'name of employee',
> salary float comment 'salary of employee',
> subordinates array<string> comment 'employees managed by current employee',
> deductions map<string, float> comment 'deductions',
> address struct<province : string, city : string, street : string, zip : int> comment 'address'
> ) comment 'This is table of employee info';
OK
Time taken: 0.33 seconds
hive>
In Hive, the columns of a record, and the elements inside complex types, are separated by distinct delimiters; each line of the file corresponds to one record.
Create the file data_default.txt under ${HIVE_HOME}/data, using Hive's default delimiters.
Hive's default field delimiter is the ASCII control character \001; to declare it explicitly when creating a table, use fields terminated by '\001'. When preparing the data in vi, press Ctrl+V and then Ctrl+A to enter this control character (displayed as ^A). By the same pattern, \002 is entered with Ctrl+V, Ctrl+B, and so on.
The record's fields, in order:
1000 employee number
zhangsan employee name
5000.0 employee salary
lisi^Bwangwu subordinate employees
ptax^C200^Bpension^C200 salary deductions (taxes and the like)
shandong^Bheze^Bdingtao^B274106 home address (for a struct, only the values are given, in field order)
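Typing control characters in vi is error-prone; the same file can also be generated with printf, using octal escapes for the three delimiter levels (a sketch that reproduces the record described above):

printf '1000\001zhangsan\0015000.0\001lisi\002wangwu\001ptax\003200\002pension\003200\001shandong\002heze\002dingtao\002274106\n' > data_default.txt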
Load the data:
hive> load data local inpath 'data/data_default.txt' into table employee;
Loading data to table mywork.employee
Table mywork.employee stats: [numFiles=1, numRows=0, totalSize=83, rawDataSize=0]
OK
Time taken: 0.426 seconds
hive> select * from employee;
OK
1000 zhangsan 5000.0 ["lisi","wangwu"] {"ptax":200.0,"pension":200.0} {"province":"shandong","city":"heze","street":"dingtao","zip":274106}
Time taken: 0.114 seconds, Fetched: 1 row(s)
hive>
# Complex-type columns are queried as follows
hive> select eno, ename, salary, subordinates[0], deductions['ptax'], address.province from employee;
OK
1000 zhangsan 5000.0 lisi 200.0 shandong
Time taken: 0.129 seconds, Fetched: 1 row(s)
hive>
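Beyond direct indexing, Hive also ships built-in functions for complex types, such as size, map_keys, map_values and explode. A quick sketch against the same table:

select size(subordinates), map_keys(deductions), map_values(deductions) from employee;
-- explode turns the array into one row per element
select explode(subordinates) as subordinate from employee;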
Check the data layout on HDFS:
[hadoop@dnode1 ~]$ hdfs dfs -ls -R /user/hive/warehouse/
drwxrw-rw- - hadoop hadoop 0 2015-12-09 00:00 /user/hive/warehouse/mywork.db
drwxrw-rw- - hadoop hadoop 0 2015-12-09 00:00 /user/hive/warehouse/mywork.db/employee
-rwxrw-rw- 2 hadoop hadoop 83 2015-12-09 00:00 /user/hive/warehouse/mywork.db/employee/data_default.txt
drwxrw-rw- - hadoop hadoop 0 2015-12-08 23:03 /user/hive/warehouse/mywork.db/student
[hadoop@dnode1 ~]$ hdfs dfs -text /user/hive/warehouse/mywork.db/employee/data_default.txt
1000zhangsan5000.0lisiwangwuptax200pension200shandonghezedingtao274106
[hadoop@dnode1 ~]$
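hdfs dfs -text prints the control characters invisibly, which is why the fields appear to run together above. Piping through cat -v makes the delimiters visible (a sketch; the output should show the ^A/^B/^C markers between the field values listed earlier):

[hadoop@dnode1 ~]$ hdfs dfs -cat /user/hive/warehouse/mywork.db/employee/data_default.txt | cat -v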
Custom delimiters:
hive> create table employee_02 (
> eno int comment 'the no of employee',
> ename string comment 'name of employee',
> salary float comment 'salary of employee',
> subordinates array<string> comment 'employees managed by current employee',
> deductions map<string, float> comment 'deductions',
> address struct<province : string, city : string, street : string, zip : int> comment 'address'
> ) comment 'This is table of employee info'
> row format delimited fields terminated by '\t'
> collection items terminated by ','
> map keys terminated by ':'
> lines terminated by '\n';
OK
Time taken: 0.228 seconds
hive> load data local inpath 'data/data_employee02.txt' into table employee_02;
Loading data to table mywork.employee_02
Table mywork.employee_02 stats: [numFiles=1, totalSize=99]
OK
Time taken: 0.371 seconds
hive> select * from employee_02;
OK
1000 'zhangsan' 5000.0 ["'lisi'","'wangwu'"] {"'ptax'":200.0,"'pension'":200.0} {"province":"'shandong'","city":"'heze'","street":"'dingtao'","zip":274106}
Time taken: 0.101 seconds, Fetched: 1 row(s)
hive>
The content of data/data_employee02.txt is:
[hadoop@nnode data]$ pwd
/usr/local/hive1.2.0/data
[hadoop@nnode data]$ cat data_employee02.txt
1000 'zhangsan' 5000.0 'lisi','wangwu' 'ptax':200,'pension':200 'shandong','heze','dingtao',274106
[hadoop@nnode data]$
Note: because the text file contains single quotes, after loading into the Hive table the values are rendered with the single quotes still inside the double-quoted output; the single quote is treated as part of the key or value, which is something to watch for.
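If the quotes are unwanted, one option is to strip them at query time with regexp_replace (a sketch; cleaning the source file is usually the better fix):

select regexp_replace(ename, "'", '') as ename,
       regexp_replace(subordinates[0], "'", '') as first_subordinate
from employee_02;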
View the detailed table definition
# Table created with the default settings
hive> describe formatted employee;
OK
# col_name data_type comment
eno int the no of employee
ename string name of employee
salary float salary of employee
subordinates array<string> employees managed by current employee
deductions map<string,float> deductions
address struct<province:string,city:string,street:string,zip:int> address
# Detailed Table Information
Database: mywork
Owner: hadoop
CreateTime: Tue Dec 08 23:10:07 CST 2015
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://cluster/user/hive/warehouse/mywork.db/employee
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
comment This is table of employee info
numFiles 1
numRows 0
rawDataSize 0
totalSize 83
transient_lastDdlTime 1449590423
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.098 seconds, Fetched: 37 row(s)
# Table created with custom delimiters
hive> describe formatted employee_02;
OK
# col_name data_type comment
eno int the no of employee
ename string name of employee
salary float salary of employee
subordinates array<string> employees managed by current employee
deductions map<string,float> deductions
address struct<province:string,city:string,street:string,zip:int> address
# Detailed Table Information
Database: mywork
Owner: hadoop
CreateTime: Wed Dec 09 00:12:53 CST 2015
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://cluster/user/hive/warehouse/mywork.db/employee_02
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
comment This is table of employee info
numFiles 1
totalSize 99
transient_lastDdlTime 1449591260
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
colelction.delim ,
field.delim \t
line.delim \n
mapkey.delim :
serialization.format \t
Time taken: 0.116 seconds, Fetched: 39 row(s)
hive>
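One detail worth calling out in the output above: the key colelction.delim under Storage Desc Params is not a transcription error here; Hive itself stores the collection delimiter under that misspelled parameter name, a long-standing quirk of Hive's serde.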
Open issue:
hive> delete from student;
FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.
hive>
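DELETE and UPDATE only work under Hive's ACID transaction manager, on a bucketed ORC table declared transactional (available from Hive 0.14). A minimal sketch of the required settings and table shape (the table name student_txn is made up; in real use the settings belong in hive-site.xml):

set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

create table student_txn (id int, name string)
clustered by (id) into 2 buckets
stored as orc
tblproperties ('transactional'='true');

delete from student_txn where id = 1;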
Caution: if an SQL statement contains tab characters (pasted content, for example), problems can arise.