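Lu Chunli's work notes: who says programmers can't have an artistic side?
Accessing HDFS through the Hadoop shell and the Java API
The earlier note, "Hadoop 2.6 cluster setup", got the cluster environment up and running. This note walks through some basic HDFS operations on it.
1. Accessing HDFS from the shell
HDFS is designed for processing massive data sets, which means storing a large number of files. HDFS splits each file into blocks and stores the blocks on different DataNodes. It also provides a shell interface that hides the details of block storage; all Hadoop operations are launched through the bin/hadoop script.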
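To make the splitting concrete, here is a minimal sketch of the arithmetic only. It assumes a block size of 128 MB, which is the default value of dfs.blocksize in Hadoop 2.x; a cluster may be configured differently. The class and method names (BlockMath, blockCount, lastBlockSize) are made up for this illustration and are not part of Hadoop.

```java
public class BlockMath {
    /** Assumed block size: 128 MB, the Hadoop 2.x default for dfs.blocksize. */
    public static final long DEFAULT_BLOCK_SIZE = 128L * 1024 * 1024;

    /** Number of blocks needed to hold a file of the given size (ceiling division). */
    public static long blockCount(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        return (fileSize + blockSize - 1) / blockSize;
    }

    /** Size in bytes of the last block; every earlier block is full. */
    public static long lastBlockSize(long fileSize, long blockSize) {
        if (fileSize == 0) {
            return 0;
        }
        long rest = fileSize % blockSize;
        return rest == 0 ? blockSize : rest;
    }

    public static void main(String[] args) {
        long size = 300L * 1024 * 1024; // a hypothetical 300 MB file
        System.out.println(blockCount(size, DEFAULT_BLOCK_SIZE) + " blocks, last block "
                + lastBlockSize(size, DEFAULT_BLOCK_SIZE) + " bytes");
    }
}
```

Under this assumption a file smaller than 128 MB fits in a single block, and that block only takes up the file's actual size.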
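Running the hadoop command with no arguments prints a description of every command. The HDFS file operations live under hadoop fs (the other commands of the hadoop script are not covered here).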
[hadoop@nnode ~]$ hadoop fs
Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] <path> ...]
[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] <path> ...]
[-expunge]
[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getfacl [-R] <path>]
[-getfattr [-R] {-n name | -d} [-e en] <path>]
[-getmerge [-nl] <src> <localdst>]
[-help [cmd ...]]
[-ls [-d] [-h] [-R] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] [-l] <localsrc> ... <dst>]
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
[-setfattr {-n name [-v value] | -x name} <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touchz <path> ...]
[-usage [cmd ...]]
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|resourcemanager:port> specify a ResourceManager
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
Hadoop 2.6 marks hadoop fs as "Deprecated, use hdfs dfs instead." (I have not worked with versions before 2.6, so I did not look into which version introduced this; hadoop fs still works, though.)
[hadoop@nnode ~]$ hdfs dfs
Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] <path> ...]
[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] <path> ...]
[-expunge]
[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getfacl [-R] <path>]
[-getfattr [-R] {-n name | -d} [-e en] <path>]
[-getmerge [-nl] <src> <localdst>]
[-help [cmd ...]]
[-ls [-d] [-h] [-R] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] [-l] <localsrc> ... <dst>]
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
[-setfattr {-n name [-v value] | -x name} <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touchz <path> ...]
[-usage [cmd ...]]
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|resourcemanager:port> specify a ResourceManager
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
For example:
[hadoop@nnode ~]$ hdfs dfs -ls -R /user/hadoop
-rw-r--r-- 2 hadoop hadoop 2297 2015-06-29 14:44 /user/hadoop/20130913152700.txt.gz
-rw-r--r-- 2 hadoop hadoop 211 2015-06-29 14:45 /user/hadoop/20130913160307.txt.gz
-rw-r--r-- 2 hadoop hadoop 93046447 2015-07-18 18:01 /user/hadoop/apache-hive-1.2.0-bin.tar.gz
-rw-r--r-- 2 hadoop hadoop 4139112 2015-06-28 22:54 /user/hadoop/httpInterceptor_192.168.1.101_1_20130913160307.txt
-rw-r--r-- 2 hadoop hadoop 240 2015-05-30 20:54 /user/hadoop/lucl.gz
-rw-r--r-- 2 hadoop hadoop 63 2015-05-27 23:55 /user/hadoop/lucl.txt
-rw-r--r-- 2 hadoop hadoop 9994248 2015-06-29 14:12 /user/hadoop/scalog.txt
-rw-r--r-- 2 hadoop hadoop 2664495 2015-06-28 20:54 /user/hadoop/scalog.txt.gz
-rw-r--r-- 3 hadoop hadoop 28026803 2015-06-24 21:16 /user/hadoop/test.txt.gz
-rw-r--r-- 2 hadoop hadoop 28551 2015-05-27 23:54 /user/hadoop/zookeeper.out
[hadoop@nnode ~]$
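# The dot here means the current directory; since I am operating as the hadoop user, it is equivalent to /user/hadoop
# HDFS uses /user/{hadoop-user} as the home directory by default, but you can also create your own directories under / with the mkdir command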
[hadoop@nnode ~]$ hdfs dfs -ls -R .
-rw-r--r-- 2 hadoop hadoop 2297 2015-06-29 14:44 20130913152700.txt.gz
-rw-r--r-- 2 hadoop hadoop 211 2015-06-29 14:45 20130913160307.txt.gz
-rw-r--r-- 2 hadoop hadoop 93046447 2015-07-18 18:01 apache-hive-1.2.0-bin.tar.gz
-rw-r--r-- 2 hadoop hadoop 4139112 2015-06-28 22:54 httpInterceptor_192.168.1.101_1_20130913160307.txt
-rw-r--r-- 2 hadoop hadoop 240 2015-05-30 20:54 lucl.gz
-rw-r--r-- 2 hadoop hadoop 63 2015-05-27 23:55 lucl.txt
-rw-r--r-- 2 hadoop hadoop 9994248 2015-06-29 14:12 scalog.txt
-rw-r--r-- 2 hadoop hadoop 2664495 2015-06-28 20:54 scalog.txt.gz
-rw-r--r-- 3 hadoop hadoop 28026803 2015-06-24 21:16 test.txt.gz
-rw-r--r-- 2 hadoop hadoop 28551 2015-05-27 23:54 zookeeper.out
[hadoop@nnode ~]$
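The way relative paths land under /user/{hadoop-user} can be approximated with plain java.net.URI resolution. This is only a sketch of the behaviour seen in the two listings above, not Hadoop's own Path code; the class name HomeDirResolver and its resolve method are hypothetical, and the NameNode address hdfs://nnode:8020 is the one used later in this note.

```java
import java.net.URI;

public class HomeDirResolver {
    /**
     * Resolves a path against the user's HDFS home directory /user/<user>/.
     * Relative paths end up below the home directory; absolute paths are kept.
     */
    public static String resolve(String nameNode, String user, String path) {
        URI home = URI.create(nameNode + "/user/" + user + "/");
        return home.resolve(path).toString();
    }

    public static void main(String[] args) {
        String nn = "hdfs://nnode:8020";
        // Relative path: resolved below the home directory of the given user
        System.out.println(resolve(nn, "hadoop", "lucl.txt"));
        // Absolute path: the home directory is ignored
        System.out.println(resolve(nn, "hadoop", "/tmp/lucl.txt"));
    }
}
```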
If you are unsure how a particular hdfs command works, check its help text:
[hadoop@nnode ~]$ hdfs dfs -help ls
-ls [-d] [-h] [-R] [<path> ...] :
List the contents that match the specified file pattern. If path is not
specified, the contents of /user/<currentUser> will be listed. Directory entries are of the form:
permissions - userId groupId sizeOfDirectory(in bytes)
modificationDate(yyyy-MM-dd HH:mm) directoryName
and file entries are of the form:
permissions numberOfReplicas userId groupId sizeOfFile(in bytes)
modificationDate(yyyy-MM-dd HH:mm) fileName
-d Directories are listed as plain files.
-h Formats the sizes of files in a human-readable fashion rather than a number of bytes.
-R Recursively list the contents of directories.
[hadoop@nnode ~]$
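The entry layout described by the help text can be read back programmatically. The sketch below splits one line of -ls output into its fields; LsEntry is a hypothetical helper written for this note, and it assumes the fields are separated by whitespace exactly as in the listings above.

```java
public class LsEntry {
    public final String permissions;
    public final String replicas;   // "-" for directories
    public final String owner;
    public final String group;
    public final long size;         // in bytes
    public final String modified;   // yyyy-MM-dd HH:mm
    public final String name;

    private LsEntry(String[] f) {
        permissions = f[0];
        replicas = f[1];
        owner = f[2];
        group = f[3];
        size = Long.parseLong(f[4]);
        modified = f[5] + " " + f[6];
        name = f[7];
    }

    /** Parses one entry; the limit of 8 keeps spaces inside the file name intact. */
    public static LsEntry parse(String line) {
        String[] fields = line.trim().split("\\s+", 8);
        if (fields.length != 8) {
            throw new IllegalArgumentException("not an -ls entry: " + line);
        }
        return new LsEntry(fields);
    }

    public boolean isDirectory() {
        return permissions.startsWith("d");
    }

    public static void main(String[] args) {
        LsEntry e = parse("-rw-r--r-- 2 hadoop hadoop 63 2015-05-27 23:55 /user/hadoop/lucl.txt");
        System.out.println(e.name + " | " + e.size + " bytes | replicas=" + e.replicas
                + " | " + e.modified);
    }
}
```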
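2. Accessing HDFS through the Java API
In Hadoop, DataNodes store the data while the NameNode records where the data is stored. The components of Hadoop communicate over RPC, and the NameNode is the RPC server (dfs.namenode.rpc-address specifies the host name and port of the RPC endpoint); the FileSystem class provided by Hadoop is the abstraction of the RPC client side.
a.) Reading HDFS data through java.net.URL
For a Java program to recognize Hadoop's hdfs URL scheme, a handler factory has to be registered through URL.setURLStreamHandlerFactory(...).
This method can be called only once per Java virtual machine, so it is usually called in a static initializer block.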
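The once-per-JVM restriction can be observed without a Hadoop cluster. The sketch below registers a hypothetical in-memory handler for a made-up "demo" scheme (it stands in for FsUrlStreamHandlerFactory, which needs the Hadoop jars), then tries to register a second time; java.net.URL rejects the second call with an Error. All class and method names here are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.nio.charset.StandardCharsets;

public class UrlFactoryOnce {
    /** Hypothetical factory that serves a made-up "demo" scheme from memory. */
    static class DemoFactory implements URLStreamHandlerFactory {
        @Override
        public URLStreamHandler createURLStreamHandler(String protocol) {
            if (!"demo".equals(protocol)) {
                return null; // fall back to the JVM's built-in handlers
            }
            return new URLStreamHandler() {
                @Override
                protected URLConnection openConnection(final URL u) {
                    return new URLConnection(u) {
                        @Override
                        public void connect() { }

                        @Override
                        public InputStream getInputStream() {
                            String body = "content of " + u.getPath();
                            return new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8));
                        }
                    };
                }
            };
        }
    }

    /** Returns true if the factory was installed, false if the JVM already had one. */
    public static boolean tryInstall() {
        try {
            URL.setURLStreamHandlerFactory(new DemoFactory());
            return true;
        } catch (Error alreadyDefined) {
            return false; // java.net.URL signals a second registration with an Error
        }
    }

    /** Reads the whole content behind the URL as UTF-8 text. */
    public static String read(String url) {
        try (InputStream in = new URL(url).openStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("first install:  " + tryInstall());
        System.out.println("second install: " + tryInstall());
        System.out.println(read("demo://host/user/hadoop/lucl.txt"));
    }
}
```

This is why a library that has already claimed the factory slot prevents a later program from registering the Hadoop handler in the same JVM.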
package com.invic.hdfs;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;
/**
*
* @author lucl
* Reads a specific file on HDFS through the Java API
*
*/
public class MyHdfsOfJavaApi {
static {
/**
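* An extra URLStreamHandlerFactory must be registered so that Java recognizes Hadoop's hdfs URL scheme.
* The JVM allows this method to be called only once; if other code has already set a factory, this program cannot read data from Hadoop this way.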
*/
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
}
public static void main(String[] args) throws IOException {
String path = "hdfs://nnode:8020/user/hadoop/lucl.txt";
InputStream in = new URL(path).openStream();
OutputStream ou = System.out;
int buffer = 4096;
boolean close = false;
IOUtils.copyBytes(in, ou, buffer, close);
IOUtils.closeStream(in);
}
}
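b.) Accessing HDFS through Hadoop's FileSystem
Hadoop has an abstract notion of a file system, of which HDFS is just one implementation. The Java abstract class org.apache.hadoop.fs.FileSystem defines the file system interface in Hadoop.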
java.lang.Object
org.apache.hadoop.conf.Configured
org.apache.hadoop.fs.FileSystem
|--org.apache.hadoop.fs.FilterFileSystem
|----org.apache.hadoop.fs.ChecksumFileSystem
|------org.apache.hadoop.fs.LocalFileSystem
|--org.apache.hadoop.fs.ftp.FTPFileSystem
|--org.apache.hadoop.fs.s3native.NativeS3FileSystem
|--org.apache.hadoop.fs.RawLocalFileSystem
|--org.apache.hadoop.fs.viewfs.ViewFileSystem
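The idea that the URI scheme selects a concrete FileSystem subclass can be sketched with a plain lookup table. This is a deliberately simplified illustration, not Hadoop's lookup code; SchemeRegistry and implementationFor are hypothetical names, the class names are stored as strings so the sketch runs without the Hadoop jars, and the fallback to "file" for scheme-less paths mirrors the default fs.defaultFS value of file:///.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class SchemeRegistry {
    private static final Map<String, String> IMPLS = new HashMap<>();

    static {
        IMPLS.put("file", "org.apache.hadoop.fs.LocalFileSystem");
        IMPLS.put("hdfs", "org.apache.hadoop.hdfs.DistributedFileSystem");
        IMPLS.put("ftp", "org.apache.hadoop.fs.ftp.FTPFileSystem");
        IMPLS.put("s3n", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
        IMPLS.put("viewfs", "org.apache.hadoop.fs.viewfs.ViewFileSystem");
    }

    /** Returns the implementation class name registered for the URI's scheme. */
    public static String implementationFor(String uri) {
        String scheme = URI.create(uri).getScheme();
        if (scheme == null) {
            scheme = "file"; // assumption: no scheme means the local file system
        }
        String impl = IMPLS.get(scheme.toLowerCase(Locale.ROOT));
        if (impl == null) {
            throw new IllegalArgumentException("No FileSystem for scheme: " + scheme);
        }
        return impl;
    }

    public static void main(String[] args) {
        System.out.println(implementationFor("hdfs://nnode:8020/user/hadoop/lucl.txt"));
        System.out.println(implementationFor("/tmp/lucl.txt"));
    }
}
```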
package com.invic.hdfs;
import java.io.IOException;
import java.io.OutputStream;
import java.net.URI;
import java.util.Scanner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;
/**
*
* @author lucl
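* Accesses HDFS through the FileSystem API
* FileSystem get(Configuration): uses the core-site.xml found on the classpath; falls back to the local file system by default
* FileSystem get(URI, Configuration): the URI selects the file system to use
* FileSystem get(URI, Configuration, user): accesses the file system as the given user, which matters a great deal for security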
*/
public class MyHdfsOfFS {
private static String HOST = "hdfs://nnode";
private static String PORT = "8020";
private static String NAMENODE = HOST + ":" + PORT;
public static void main(String[] args) throws IOException {
Configuration conf = new Configuration();
String path = NAMENODE + "/user/";
/**
* The path points at the HDFS /user directory, so relative paths are looked up under the user's home directory on HDFS
*/
String user = "hadoop";
FileSystem fs = null;
try {
fs = FileSystem.get(URI.create(path), conf, user);
} catch (InterruptedException e) {
e.printStackTrace();
}
if (null == fs) {
return;
}
/**
* Create the directory recursively, including any missing parents
*/
boolean mkdirs = fs.mkdirs(new Path("invic/test/mvtech"));
if (mkdirs) {
System.out.println("Dir 'invic/test/mvtech' created successfully.");
}
/**
* Check whether the directory exists
*/
boolean exists = fs.exists(new Path("invic/test/mvtech"));
if (exists) {
System.out.println("Dir 'invic/test/mvtech' exists.");
}
/**
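* FSDataInputStream supports random access (seeking to any position).
* By default the relative path lucl.txt resolves to /user/Administrator/lucl.txt,
* because this program runs from Eclipse on Windows.
* When a user is passed as the last argument of get() above,
* the path resolves to /user/<that user>/lucl.txt instead.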
*/
FSDataInputStream in = fs.open(new Path("lucl.txt"));
OutputStream os = System.out;
int buffSize = 4098;
boolean close = false;
IOUtils.copyBytes(in, os, buffSize, close);
System.out.println("\r\nSeeking back to the start and reading the file again...");
in.seek(0);
IOUtils.copyBytes(in, os, buffSize, close);
IOUtils.closeStream(in);
/**
* Create a file
*/
FSDataOutputStream create = fs.create(new Path("sample.txt"));
create.write("This is my first sample file.".getBytes());
create.flush();
create.close();
/**
* Copy a local file to HDFS
*/
fs.copyFromLocalFile(new Path("F:\\Mvtech\\ftpfile\\cg-10086.com.csv"),
new Path("cg-10086.com.csv"));
/**
* Append to a file.
* write(byte[]) is used instead of writeChars(), which would emit two bytes
* per character and mix encodings with the text written by create() above.
*/
FSDataOutputStream append = fs.append(new Path("sample.txt"));
append.write("\r\n".getBytes());
append.write("New day, new World.".getBytes());
append.write("\r\n".getBytes());
IOUtils.closeStream(append);
/**
* Using a progress callback
*/
FSDataOutputStream progress = fs.create(new Path("progress.txt"),
new Progressable() {
@Override
public void progress() {
System.out.println("write is in progress......");
}
});
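// Read lines from the keyboard and write them to HDFS until "quit" is entered
Scanner sc = new Scanner(System.in);
System.out.print("Please type your input : ");
String name = sc.nextLine();
while (!"quit".equals(name)) {
// Skip blank input; the next line is still read below, so the loop cannot spin forever
if (null != name && !"".equals(name.trim())) {
progress.writeChars(name);
}
System.out.print("Please type your input : ");
name = sc.nextLine();
}
// Close the stream so the data is flushed and the file is finalized on HDFS
progress.close();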
/**
* List files recursively
*/
RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path(path), true);
while (it.hasNext()) {
LocatedFileStatus loc = it.next();
System.out.println(loc.getPath().getName() + "|" + loc.getLen() + "|"
+ loc.getOwner());
}
/**
* File or directory metadata: file length, block size, replication, modification time, owner and permissions
*/
FileStatus status = fs.getFileStatus(new Path("lucl.txt"));
System.out.println(status.getPath().getName() + "|" +
status.getPath().getParent().getName() + "|" + status.getBlockSize() + "|"
+ status.getReplication() + "|" + status.getOwner());
/**
* listStatus lists the files in a directory; if the argument is a file, it returns a FileStatus array of length 1
*/
fs.listStatus(new Path(path));
fs.listStatus(new Path(path), new PathFilter() {
@Override
public boolean accept(Path tmpPath) {
String tmpName = tmpPath.getName();
if (tmpName.endsWith(".txt")) {
return true;
}
return false;
}
});
// An array of paths can also be passed in; the results are merged into a single array
// fs.listStatus(Path [] files);
FileStatus [] mergeStatus = fs.listStatus(new Path[]{new Path("lucl.txt"),
new Path("progress.txt"), new Path("sample.txt")});
Path [] listPaths = FileUtil.stat2Paths(mergeStatus);
for (Path p : listPaths) {
System.out.println(p);
}
/**
* File pattern matching (globbing)
*/
FileStatus [] patternStatus = fs.globStatus(new Path("*.txt"));
for (FileStatus stat : patternStatus) {
System.out.println(stat.getPath());
}
/**
* Delete data
*/
boolean recursive = true;
fs.delete(new Path("demo.txt"), recursive);
fs.close();
}
}
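One detail of the listing above is worth isolating: FSDataOutputStream extends java.io.DataOutputStream, so writeChars() and write(byte[]) encode the same text differently. The sketch below shows the difference with a plain DataOutputStream over an in-memory buffer, so it runs without a cluster; the class name WriteCharsDemo and its helper methods are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class WriteCharsDemo {
    /** Bytes produced by writeChars: two bytes per char, high byte first. */
    public static byte[] viaWriteChars(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeChars(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Bytes produced by writing the UTF-8 encoding of the text. */
    public static byte[] viaWriteBytes(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println("writeChars: " + Arrays.toString(viaWriteChars("Hi")));
        System.out.println("write:      " + Arrays.toString(viaWriteBytes("Hi")));
    }
}
```

For ASCII text, writeChars therefore doubles the file size and interleaves zero bytes, which shows up as odd spacing when the file is printed with hdfs dfs -cat.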
c.) Accessing an HDFS cluster
package com.invic.hdfs;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.log4j.Logger;
/**
*
* @author lucl
* Accesses HDFS through the Hadoop cluster's logical nameservice
*
*/
public class MyClusterHdfs {
public static void main(String[] args) throws IOException {
System.setProperty("hadoop.home.dir", "E:\\Hadoop\\hadoop-2.6.0\\hadoop-2.6.0\\");
Logger logger = Logger.getLogger(MyClusterHdfs.class);
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://cluster");
conf.set("dfs.nameservices", "cluster");
conf.set("dfs.ha.namenodes.cluster", "nn1,nn2");
conf.set("dfs.namenode.rpc-address.cluster.nn1", "nnode:8020");
conf.set("dfs.namenode.rpc-address.cluster.nn2", "dnode1:8020");
conf.set("dfs.client.failover.proxy.provider.cluster",
"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
FileSystem fs = FileSystem.get(conf);
RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/"), true);
while (it.hasNext()) {
LocatedFileStatus loc = it.next();
logger.info(loc.getPath().getName() + "|" + loc.getLen() + "|" + loc.getOwner());
}
/*for (int i = 0; i < 500; i++) {
String str = "the sequence is " + i;
logger.info(str);
}*/
try {
Thread.sleep(10);
} catch (InterruptedException e) {
e.printStackTrace();
}
System.exit(0);
}
}
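The configuration keys in MyClusterHdfs chain together: the logical name in fs.defaultFS points at a nameservice, the nameservice lists NameNode ids, and each id has an RPC address. The sketch below walks that chain over a plain Map to show which addresses a client could try. It is a simplified illustration of the key naming only, not Hadoop's failover logic; NameserviceLookup, sampleConf and rpcAddresses are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NameserviceLookup {
    /** The same key/value pairs that MyClusterHdfs sets on its Configuration. */
    public static Map<String, String> sampleConf() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("fs.defaultFS", "hdfs://cluster");
        conf.put("dfs.nameservices", "cluster");
        conf.put("dfs.ha.namenodes.cluster", "nn1,nn2");
        conf.put("dfs.namenode.rpc-address.cluster.nn1", "nnode:8020");
        conf.put("dfs.namenode.rpc-address.cluster.nn2", "dnode1:8020");
        return conf;
    }

    /** Collects the RPC address of every NameNode id listed for the nameservice. */
    public static List<String> rpcAddresses(Map<String, String> conf, String nameservice) {
        String ids = conf.get("dfs.ha.namenodes." + nameservice);
        if (ids == null) {
            throw new IllegalArgumentException("unknown nameservice: " + nameservice);
        }
        List<String> addresses = new ArrayList<>();
        for (String id : ids.split(",")) {
            String address = conf.get("dfs.namenode.rpc-address." + nameservice + "." + id.trim());
            if (address != null) {
                addresses.add(address);
            }
        }
        return addresses;
    }

    public static void main(String[] args) {
        System.out.println(rpcAddresses(sampleConf(), "cluster"));
    }
}
```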
Notes:
System.setProperty("hadoop.home.dir", "E:\\Hadoop\\hadoop-2.6.0\\hadoop-2.6.0\\");
# Set the Hadoop home path on the first line of the main method; otherwise errors like the following may appear on Windows:
15/07/19 22:05:54 DEBUG util.Shell: Failed to detect a valid hadoop home directory
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:302)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:327)
at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:438)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:484)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at com.invic.mapreduce.wordcount.WordCounterTool.main(WordCounterTool.java:29)
15/07/19 22:05:54 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:355)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:370)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:363)
at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:438)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:484)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at com.invic.mapreduce.wordcount.WordCounterTool.main(WordCounterTool.java:29)
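Both errors come from the same lookup: Hadoop reads the hadoop.home.dir system property first and the HADOOP_HOME environment variable second, then expects bin\winutils.exe below that directory. A small pre-flight check along these lines can report the problem before any Hadoop class is loaded. This is a sketch with hypothetical names (HadoopHomeCheck, resolveHome, winutilsPath), not code from Hadoop itself.

```java
import java.io.File;

public class HadoopHomeCheck {
    /** Picks the Hadoop home: the system property wins over the environment variable. */
    public static String resolveHome(String sysProp, String envVar) {
        if (sysProp != null && !sysProp.trim().isEmpty()) {
            return sysProp;
        }
        if (envVar != null && !envVar.trim().isEmpty()) {
            return envVar;
        }
        return null;
    }

    /** Location where winutils.exe is expected below the Hadoop home. */
    public static File winutilsPath(String home) {
        return new File(new File(home, "bin"), "winutils.exe");
    }

    public static void main(String[] args) {
        String home = resolveHome(System.getProperty("hadoop.home.dir"),
                System.getenv("HADOOP_HOME"));
        if (home == null) {
            System.err.println("Neither hadoop.home.dir nor HADOOP_HOME is set.");
        } else if (!winutilsPath(home).isFile()) {
            System.err.println("Missing " + winutilsPath(home));
        } else {
            System.out.println("Hadoop home looks usable: " + home);
        }
    }
}
```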