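Accessing HDFS through the Hadoop shell and the Java API
Running the hadoop command without any arguments prints a description of every subcommand. File operations on HDFS go through hadoop fs (the other subcommands of the hadoop script are not covered here).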
[hadoop@nnode ~]$ hadoop fs
Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] <path> ...]
[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] <path> ...]
[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getfacl [-R] <path>]
[-getfattr [-R] {-n name | -d} [-e en] <path>]
[-getmerge [-nl] <src> <localdst>]
[-help [cmd ...]]
[-ls [-d] [-h] [-R] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] [-l] <localsrc> ... <dst>]
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
[-setfattr {-n name [-v value] | -x name} <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touchz <path> ...]
[-usage [cmd ...]]
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|resourcemanager:port> specify a ResourceManager
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
Hadoop 2.6 flags hadoop fs as "Deprecated, use hdfs dfs instead." (I have not used any release before 2.6, so I did not look into which version introduced the notice; hadoop fs still works.)
[hadoop@nnode ~]$ hdfs dfs
(The usage text printed here is identical to the hadoop fs listing above, including the "Usage: hadoop fs" header.)
[hadoop@nnode ~]$ hdfs dfs -ls -R /user/hadoop
-rw-r--r-- 2 hadoop hadoop 2297 2015-06-29 14:44 /user/hadoop/20130913152700.txt.gz
-rw-r--r-- 2 hadoop hadoop 211 2015-06-29 14:45 /user/hadoop/20130913160307.txt.gz
-rw-r--r-- 2 hadoop hadoop 93046447 2015-07-18 18:01 /user/hadoop/apache-hive-1.2.0-bin.tar.gz
-rw-r--r-- 2 hadoop hadoop 4139112 2015-06-28 22:54 /user/hadoop/httpInterceptor_192.168.1.101_1_20130913160307.txt
-rw-r--r-- 2 hadoop hadoop 240 2015-05-30 20:54 /user/hadoop/lucl.gz
-rw-r--r-- 2 hadoop hadoop 63 2015-05-27 23:55 /user/hadoop/lucl.txt
-rw-r--r-- 2 hadoop hadoop 9994248 2015-06-29 14:12 /user/hadoop/scalog.txt
-rw-r--r-- 2 hadoop hadoop 2664495 2015-06-28 20:54 /user/hadoop/scalog.txt.gz
-rw-r--r-- 3 hadoop hadoop 28026803 2015-06-24 21:16 /user/hadoop/test.txt.gz
-rw-r--r-- 2 hadoop hadoop 28551 2015-05-27 23:54 /user/hadoop/zookeeper.out
[hadoop@nnode ~]$
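# The dot here stands for the current directory; because I am working as the hadoop user, it resolves to /user/hadoop.
# HDFS uses /user/{hadoop-user} as each user's default home directory, but you can also create your own directories under / with the mkdir command.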
[hadoop@nnode ~]$ hdfs dfs -ls -R .
-rw-r--r-- 2 hadoop hadoop 2297 2015-06-29 14:44 20130913152700.txt.gz
-rw-r--r-- 2 hadoop hadoop 211 2015-06-29 14:45 20130913160307.txt.gz
-rw-r--r-- 2 hadoop hadoop 93046447 2015-07-18 18:01 apache-hive-1.2.0-bin.tar.gz
-rw-r--r-- 2 hadoop hadoop 4139112 2015-06-28 22:54 httpInterceptor_192.168.1.101_1_20130913160307.txt
-rw-r--r-- 2 hadoop hadoop 240 2015-05-30 20:54 lucl.gz
-rw-r--r-- 2 hadoop hadoop 63 2015-05-27 23:55 lucl.txt
-rw-r--r-- 2 hadoop hadoop 9994248 2015-06-29 14:12 scalog.txt
-rw-r--r-- 2 hadoop hadoop 2664495 2015-06-28 20:54 scalog.txt.gz
-rw-r--r-- 3 hadoop hadoop 28026803 2015-06-24 21:16 test.txt.gz
-rw-r--r-- 2 hadoop hadoop 28551 2015-05-27 23:54 zookeeper.out
[hadoop@nnode ~]$
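The way a relative path such as . or lucl.txt ends up under /user/hadoop can be mimicked with the JDK alone. The sketch below only illustrates the rule and is not Hadoop's own code: the class HomeDirResolution and its resolve helper are hypothetical names, and it assumes the home directory is /user/<user> as in the listings here.

```java
import java.net.URI;

public class HomeDirResolution {
    /**
     * Hypothetical helper: resolves a path against /user/<user>/ the way a
     * relative HDFS path is resolved against the user's home directory.
     * Absolute paths are returned unchanged.
     */
    static String resolve(String user, String path) {
        URI home = URI.create("/user/" + user + "/");
        return home.resolve(path).getPath();
    }

    public static void main(String[] args) {
        System.out.println(resolve("hadoop", "lucl.txt"));
        System.out.println(resolve("hadoop", "/tmp/other.txt"));
    }
}
```

Paths containing spaces or other characters that are illegal in a URI would need escaping first, which this sketch leaves out.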
[hadoop@nnode ~]$ hdfs dfs -help ls
-ls [-d] [-h] [-R] [<path> ...] :
List the contents that match the specified file pattern. If path is not
specified, the contents of /user/<currentUser> will be listed. Directory entries are of the form:
permissions - userId groupId sizeOfDirectory(in bytes)
modificationDate(yyyy-MM-dd HH:mm) directoryName
and file entries are of the form:
permissions numberOfReplicas userId groupId sizeOfFile(in bytes)
modificationDate(yyyy-MM-dd HH:mm) fileName
-d Directories are listed as plain files.
-h Formats the sizes of files in a human-readable fashion rather than a number of bytes.
-R Recursively list the contents of directories.
[hadoop@nnode ~]$
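The entry format described by -help ls is regular enough to split into fields. The following is a minimal sketch, assuming the output was produced without -h (so the size is a plain byte count) and that header lines such as "Found N items" have already been skipped; LsEntry is a hypothetical class, not part of Hadoop.

```java
public class LsEntry {
    final String permissions;
    final String replicas;   // "-" for directories
    final String owner;
    final String group;
    final long size;         // bytes
    final String modified;   // yyyy-MM-dd HH:mm
    final String name;

    private LsEntry(String[] f) {
        permissions = f[0];
        replicas = f[1];
        owner = f[2];
        group = f[3];
        size = Long.parseLong(f[4]);
        modified = f[5] + " " + f[6];
        name = f[7];
    }

    /** Parses one entry line; the limit of 8 keeps a name containing spaces in one piece. */
    static LsEntry parse(String line) {
        String[] f = line.trim().split("\\s+", 8);
        if (f.length < 8) {
            throw new IllegalArgumentException("not an ls entry: " + line);
        }
        return new LsEntry(f);
    }

    boolean isDirectory() {
        return permissions.startsWith("d");
    }

    public static void main(String[] args) {
        LsEntry e = parse("-rw-r--r-- 2 hadoop hadoop 63 2015-05-27 23:55 /user/hadoop/lucl.txt");
        System.out.println(e.name + " | " + e.size + " bytes | replicas=" + e.replicas);
    }
}
```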
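2. Accessing HDFS through the Java API
... specifies the host name and port of the RPC endpoint), and the FileSystem class provided by Hadoop is the abstract implementation of the RPC client in Hadoop.
a.) Reading HDFS data through java.net.URL
For a Java program to recognize Hadoop's hdfs:// URLs, a stream handler factory has to be registered through URL.setURLStreamHandlerFactory(...):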
package com.invic.hdfs;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

/**
 * @author lucl
 * Reads a specific file on HDFS through the Java API.
 */
public class MyHdfsOfJavaApi {
    static {
        /*
         * An extra URLStreamHandlerFactory is needed for Java to recognize Hadoop's hdfs:// URLs.
         * The JVM allows this method to be called only once; if some other code has already
         * registered a factory, this program cannot read data from Hadoop this way.
         */
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws IOException {
        String path = "hdfs://nnode:8020/user/hadoop/lucl.txt";
        InputStream in = new URL(path).openStream();
        OutputStream ou = System.out;
        int buffer = 4096;
        boolean close = false;
        try {
            IOUtils.copyBytes(in, ou, buffer, close);
        } finally {
            // close is false above, so the input stream is closed explicitly
            IOUtils.closeStream(in);
        }
    }
}
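The once-per-JVM restriction mentioned in the static block can be observed without Hadoop on the classpath. This is a small sketch using only the JDK; FactoryOnceDemo and the placeholder factory are assumptions made for the demonstration, with the placeholder standing in for FsUrlStreamHandlerFactory.

```java
import java.net.URL;
import java.net.URLStreamHandlerFactory;

public class FactoryOnceDemo {
    /** Returns true if the JVM rejects a second attempt to register a factory. */
    static boolean secondRegistrationRejected() {
        // Placeholder factory: returning null lets the JDK fall back to its
        // built-in protocol handlers, so other URLs keep working.
        URLStreamHandlerFactory factory = protocol -> null;
        try {
            URL.setURLStreamHandlerFactory(factory);
        } catch (Error alreadySet) {
            // Some other code registered a factory first; the demo still works.
        }
        try {
            URL.setURLStreamHandlerFactory(factory);
            return false;
        } catch (Error expected) {
            // The JDK signals the conflict with java.lang.Error, not an exception.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("second registration rejected: " + secondRegistrationRejected());
    }
}
```

Because the failure is a java.lang.Error, a catch (Exception e) around the registration would not intercept it. This is one reason the FileSystem API in the next part is usually the more practical route.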
b.) Accessing HDFS through Hadoop's FileSystem
package com.invic.hdfs;

import java.io.IOException;
import java.io.OutputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

/**
 * @author lucl
 * Implemented with the FileSystem API.
 * FileSystem get(Configuration)            reads core-site.xml from the classpath; defaults to the local file system
 * FileSystem get(URI, Configuration)       selects the file system from the URI
 * FileSystem get(URI, Configuration, user) accesses the file system as the given user, which matters for security
 */
public class MyHdfsOfFS {
    private static String HOST = "hdfs://nnode";
    private static String PORT = "8020";
    private static String NAMENODE = HOST + ":" + PORT;

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        String path = NAMENODE + "/user/";
        /*
         * Because this works against Hadoop's /user directory, relative paths are
         * looked up under the user's HDFS home directory by default.
         */
        String user = "hadoop";
        FileSystem fs = null;
        try {
            fs = FileSystem.get(URI.create(path), conf, user);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        if (null == fs) {
            return;
        }

        /* Create directories recursively. */
        boolean mkdirs = fs.mkdirs(new Path("invic/test/mvtech"));
        if (mkdirs) {
            System.out.println("Dir 'invic/test/mvtech' create success.");
        }

        /* Check whether the directory exists. */
        boolean exists = fs.exists(new Path("invic/test/mvtech"));
        if (exists) {
            System.out.println("Dir 'invic/test/mvtech' exists.");
        }

        /*
         * FSDataInputStream supports random access.
         * Without a user, lucl.txt would be looked up as /user/Administrator/lucl.txt
         * (the local OS user); because get() above was given a user, it is resolved
         * under that user's home directory instead.
         */
        FSDataInputStream in = fs.open(new Path("lucl.txt"));
        OutputStream os = System.out;
        int buffSize = 4096;
        boolean close = false;
        IOUtils.copyBytes(in, os, buffSize, close);
        in.seek(0);   // assumed step: rewind so that the second copy prints the file again
        IOUtils.copyBytes(in, os, buffSize, close);
        in.close();

        /* Create a file. */
        FSDataOutputStream create = fs.create(new Path("sample.txt"));
        create.write("This is my first sample file.".getBytes(StandardCharsets.UTF_8));
        create.close();   // release the file before it is appended to below

        /* Copy a local file to HDFS. */
        fs.copyFromLocalFile(new Path("F:\\Mvtech\\ftpfile\\cg-10086.com.csv"),
                new Path("cg-10086.com.csv"));

        /* Append to a file. Note that writeChars emits two bytes per character (UTF-16). */
        FSDataOutputStream append = fs.append(new Path("sample.txt"));
        append.writeChars("New day, new World.");
        append.close();

        /* Using a progress callback. */
        FSDataOutputStream progress = fs.create(new Path("progress.txt"),
                new Progressable() {
                    @Override
                    public void progress() {
                        System.out.println("write is in progress......");
                    }
                });

        // Read keyboard input and write it to HDFS until "quit" is typed
        Scanner sc = new Scanner(System.in);
        System.out.print("Please type your input: ");
        String name = sc.nextLine();
        while (!"quit".equals(name)) {
            if (null != name && !"".equals(name.trim())) {
                // assumed step: each non-empty line is written to progress.txt
                progress.write((name + "\n").getBytes(StandardCharsets.UTF_8));
            }
            System.out.print("Please type your input: ");
            name = sc.nextLine();
        }
        progress.close();

        /* List files recursively. */
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path(path), true);
        while (it.hasNext()) {
            LocatedFileStatus loc = it.next();
            System.out.println(loc.getPath().getName() + "|" + loc.getLen() + "|"
                    + loc.getOwner());
        }

        /* File or directory metadata: length, block size, replication, modification time, owner and permissions. */
        FileStatus status = fs.getFileStatus(new Path("lucl.txt"));
        System.out.println(status.getPath().getName() + "|" +
                status.getPath().getParent().getName() + "|" + status.getBlockSize() + "|"
                + status.getReplication() + "|" + status.getOwner());

        /* listStatus lists a directory; for a file it returns an array holding a single FileStatus. */
        fs.listStatus(new Path(path));
        fs.listStatus(new Path(path), new PathFilter() {
            @Override
            public boolean accept(Path tmpPath) {
                String tmpName = tmpPath.getName();
                if (tmpName.endsWith(".txt")) {
                    return true;
                }
                return false;
            }
        });

        // Several paths can be passed in; the results are merged into one array
        // fs.listStatus(Path [] files);
        FileStatus [] mergeStatus = fs.listStatus(new Path[]{new Path("lucl.txt"),
                new Path("progress.txt"), new Path("sample.txt")});
        Path [] listPaths = FileUtil.stat2Paths(mergeStatus);
        for (Path p : listPaths) {
            System.out.println(p);   // assumed loop body
        }

        /* Glob pattern matching. */
        FileStatus [] patternStatus = fs.globStatus(new Path("*.txt"));
        for (FileStatus stat : patternStatus) {
            System.out.println(stat.getPath());   // assumed loop body
        }

        /* Delete data. */
        boolean recursive = true;
        fs.delete(new Path("demo.txt"), recursive);

        fs.close();
    }
}
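One detail in the listing is easy to miss: the create step writes the result of getBytes(), while the append step calls writeChars. FSDataOutputStream extends java.io.DataOutputStream, so the difference can be shown with the JDK alone. The sketch below uses an in-memory stream in place of HDFS; WriteCharsDemo and its helper names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class WriteCharsDemo {
    /** Number of bytes produced by write(byte[]) with UTF-8 encoding. */
    static int bytesViaWrite(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return buf.size();
    }

    /** Number of bytes produced by writeChars, which emits each char as two bytes. */
    static int bytesViaWriteChars(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeChars(s);
        }
        return buf.size();
    }

    public static void main(String[] args) throws IOException {
        String s = "New day, new World.";
        System.out.println("write(getBytes): " + bytesViaWrite(s) + " bytes");
        System.out.println("writeChars:      " + bytesViaWriteChars(s) + " bytes");
    }
}
```

For ASCII text, writeChars doubles the size and interleaves zero bytes, so a file appended that way will look odd under hdfs dfs -cat. Writing getBytes(StandardCharsets.UTF_8) in both places keeps the file readable as plain text.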
c.) Accessing an HDFS cluster
package com.invic.hdfs;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.log4j.Logger;

/**
 * @author lucl
 * Accesses HDFS by connecting to the Hadoop cluster.
 */
public class MyClusterHdfs {
    public static void main(String[] args) throws IOException {
        System.setProperty("hadoop.home.dir", "E:\\Hadoop\\hadoop-2.6.0\\hadoop-2.6.0\\");
        Logger logger = Logger.getLogger(MyClusterHdfs.class);

        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://cluster");
        conf.set("dfs.nameservices", "cluster");
        conf.set("dfs.ha.namenodes.cluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.cluster.nn1", "nnode:8020");
        conf.set("dfs.namenode.rpc-address.cluster.nn2", "dnode1:8020");
        // Not in the original listing: an HA client normally also needs a failover proxy
        // provider, unless an hdfs-site.xml on the classpath already supplies it.
        conf.set("dfs.client.failover.proxy.provider.cluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        FileSystem fs = FileSystem.get(conf);
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/"), true);
        while (it.hasNext()) {
            LocatedFileStatus loc = it.next();
            logger.info(loc.getPath().getName() + "|" + loc.getLen() + "|" + loc.getOwner());
        }

        /*for (int i = 0; i < 500; i++) {
            String str = "the sequence is " + i;
            try {
            } catch (InterruptedException e) {
            }
        }*/
    }
}
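The conf.set calls follow a naming pattern built from the nameservice and the NameNode ids. As a rough sketch of that pattern, the helper below builds the same keys as a plain map with the JDK only. HaClientKeys is a hypothetical class. It also adds dfs.client.failover.proxy.provider.<nameservice>, which HA clients normally need when no hdfs-site.xml is on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HaClientKeys {
    /**
     * Builds client-side HA properties for one nameservice.
     * namenodes maps a NameNode id (e.g. "nn1") to its RPC address (host:port).
     */
    static Map<String, String> build(String nameservice, Map<String, String> namenodes) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("fs.defaultFS", "hdfs://" + nameservice);
        props.put("dfs.nameservices", nameservice);
        props.put("dfs.ha.namenodes." + nameservice, String.join(",", namenodes.keySet()));
        for (Map.Entry<String, String> nn : namenodes.entrySet()) {
            props.put("dfs.namenode.rpc-address." + nameservice + "." + nn.getKey(), nn.getValue());
        }
        props.put("dfs.client.failover.proxy.provider." + nameservice,
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> namenodes = new LinkedHashMap<>();
        namenodes.put("nn1", "nnode:8020");
        namenodes.put("nn2", "dnode1:8020");
        for (Map.Entry<String, String> e : build("cluster", namenodes).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Each entry of the resulting map could then be passed to Configuration.set(key, value) in place of the hand-written lines.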
System.setProperty("hadoop.home.dir", "E:\\Hadoop\\hadoop-2.6.0\\hadoop-2.6.0\\");
# Set Hadoop's home path on the first line of the main method; otherwise the following errors may be reported on Windows:
15/07/19 22:05:54 DEBUG util.Shell: Failed to detect a valid hadoop home directory
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:302)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:327)
at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:438)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:484)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at com.invic.mapreduce.wordcount.WordCounterTool.main(WordCounterTool.java:29)
15/07/19 22:05:54 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:355)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:370)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:363)
at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:438)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:484)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at com.invic.mapreduce.wordcount.WordCounterTool.main(WordCounterTool.java:29)
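Both errors come down to the same lookup: Hadoop's Shell class reads the hadoop.home.dir system property, falls back to the HADOOP_HOME environment variable, and then expects bin\winutils.exe under that directory. The "null\bin\winutils.exe" in the second message is the unset home showing through. A small pre-flight check along these lines can make the cause more obvious. It is only a sketch with hypothetical names (HadoopHomeCheck, resolveHome, hasWinutils), not Hadoop's own logic.

```java
import java.io.File;

public class HadoopHomeCheck {
    /** Picks the home directory: the system property wins, then the environment variable. */
    static String resolveHome(String sysProp, String envVar) {
        String home = (sysProp != null && !sysProp.trim().isEmpty()) ? sysProp : envVar;
        return (home == null || home.trim().isEmpty()) ? null : home;
    }

    /** True if <home>/bin/winutils.exe exists as a regular file. */
    static boolean hasWinutils(String home) {
        return home != null && new File(new File(home, "bin"), "winutils.exe").isFile();
    }

    public static void main(String[] args) {
        String home = resolveHome(System.getProperty("hadoop.home.dir"),
                System.getenv("HADOOP_HOME"));
        if (home == null) {
            System.err.println("Neither hadoop.home.dir nor HADOOP_HOME is set.");
        } else if (!hasWinutils(home)) {
            System.err.println("winutils.exe not found under " + new File(home, "bin"));
        } else {
            System.out.println("Hadoop home looks usable: " + home);
        }
    }
}
```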