This article explains what Hadoop's WritableSerialization is, and then looks at Avro as an alternative serialization system.
Hadoop has a pluggable serialization framework API. A serialization framework is represented by an implementation of Serialization.
WritableSerialization is the Serialization implementation for Writable types.
package: org.apache.hadoop.io.serializer

// Abridged outline: method bodies and the open()/close()/accept() methods are omitted.
public class WritableSerialization extends Configured implements Serialization<Writable> {

    static class WritableSerializer extends Configured implements Serializer<Writable> {
        @Override
        public void serialize(Writable w) throws IOException {
            // body omitted; the Writable writes itself to the opened output stream
        }
    }

    static class WritableDeserializer extends Configured implements Deserializer<Writable> {
        @Override
        public Writable deserialize(Writable w) throws IOException {
            // body omitted; fields are read from the opened input stream into w,
            // which lets the caller reuse an existing instance
        }
    }

    @Override
    public Serializer<Writable> getSerializer(Class<Writable> c) {
        return new WritableSerializer();
    }

    @InterfaceAudience.Private
    @Override
    public Deserializer<Writable> getDeserializer(Class<Writable> c) {
        return new WritableDeserializer(getConf(), c);
    }
}
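To make the contract behind this class concrete, here is a minimal sketch that uses only the JDK. `SimpleWritable`, `IntPair` and `WritableSketch` are hypothetical names introduced for illustration; they are not Hadoop classes, and only mirror the `write(DataOutput)` / `readFields(DataInput)` pair that `org.apache.hadoop.io.Writable` defines.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for org.apache.hadoop.io.Writable (same two-method contract).
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical value type holding two ints.
class IntPair implements SimpleWritable {
    int first;
    int second;

    IntPair() {}

    IntPair(int first, int second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }
}

class WritableSketch {
    // Serialize by letting the object write its own fields; no class name is written.
    static byte[] serialize(SimpleWritable w) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            w.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Deserialize into an instance supplied by the caller, so it can be reused.
    static <T extends SimpleWritable> T deserialize(T reuse, byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            reuse.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return reuse;
    }

    public static void main(String[] args) {
        byte[] data = serialize(new IntPair(3, 7));
        IntPair copy = deserialize(new IntPair(), data);
        System.out.println(data.length + " bytes: " + copy.first + ", " + copy.second);
        // prints "8 bytes: 3, 7"
    }
}
```

The sketch wraps `IOException` in `UncheckedIOException` only to keep the helper signatures short; real Hadoop serializers declare `IOException` directly.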
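JavaSerialization is the Serialization implementation for Serializable types. It uses standard Java Object Serialization. Although this makes it convenient to use standard Java types, Java's standard serialization is comparatively inefficient.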
package: org.apache.hadoop.io.serializer
public class JavaSerialization extends Object implements Serialization<Serializable>{}
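Why not use Java Object Serialization?
1. Not compact. Java Object Serialization writes the class name of each object into the stream; later instances of the same class only store a handle that refers to the first occurrence. Because a handle is meaningless without the earlier part of the stream, this does not suit random access, sorting, or splitting.
2. Not fast. A new instance has to be created for every object that is deserialized, which wastes allocations.
3. Extensible. Java Object Serialization has some support for evolving a type. Writable currently has no such support.
4. Interoperable. Possible in principle, but in practice only a Java implementation exists. The same is true of Writable.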
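The compactness point can be checked with the JDK alone. The following is a small sketch (the class name `SerializedSizeDemo` is made up for this example) that compares the size of one integer written with Java Object Serialization against the same value written as four raw bytes, which is how a Writable such as IntWritable stores it.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

class SerializedSizeDemo {
    // Size of one Integer written with standard Java Object Serialization.
    // The stream carries a header and a class descriptor in addition to the value.
    static int javaSerializedSize(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(Integer.valueOf(value));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Size of the same value written as a plain 4-byte int.
    static int rawSize(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        System.out.println("Java Object Serialization: " + javaSerializedSize(163) + " bytes");
        System.out.println("Raw int: " + rawSize(163) + " bytes");
    }
}
```

The overhead shrinks when many instances of one class share a stream, but only because later instances point back at the first one, which is exactly what makes the stream hard to split.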
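Avro is a language-neutral data serialization system. Schemas are defined with an interface definition language (IDL), from which native code for other languages can be generated. Avro schemas are usually written in JSON, and data is usually encoded in a binary format.
Avro has strong schema resolution capabilities: the schema used to read data does not have to be identical to the schema used to write it, so Avro supports data evolution.
Compared with other serialization systems such as Thrift and Google Protocol Buffers, Avro is said to perform better, although this depends on the workload.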
Primitive Datatype
null, boolean, int, long, float, double, bytes, string
Complex Datatype
array: an ordered collection of objects of the same type
{ "name":"myarray","type":"array", "items":"long" }
map: an unordered set of key-value pairs. Keys must be strings, so the schema only defines the value type.
{ "name":"mymap","type":"map", "values":"string" }
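record: similar to a struct, and very common in data formats. The record name must be a valid Avro name (letters, digits and underscores only), so it cannot contain a hyphen.
{ "type":"record","name":"weather_record","doc":"a weather reading.",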
"fields":[
{"name":"myint","type":"int"},
{"name":"mynull","type":"null"}
]
}
enum: a set of named values
{ "type":"enum",
"name":"title",
"symbols":["engineer","Manager","vp"]
}
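fixed: a fixed number of 8-bit unsigned bytes. The "size" attribute is required; an MD5 digest is 16 bytes.
{ "type":"fixed","name":"md5","size":16 }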
union: a union of schemas, written as a JSON array. The data must match one of the schemas in the union.
[ "int", "long", {"type":"array", "items":"long"} ]
This means the data must be an int, a long, or an array of longs.
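Avro schema evolution is not covered here.
Questions:
How does Avro sort data?
How is an Avro file made splittable?
An Avro data file consists of a file header followed by one or more file data blocks. The file header consists of:
Four bytes: ASCII 'O', 'b', 'j', followed by 1.
File metadata, including the schema.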
The 16-byte, randomly-generated sync marker for this file.
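As a small illustration of the header layout, the sketch below checks the four magic bytes at the start of a stream using only the JDK. `AvroMagicCheck` is a hypothetical helper written for this article, not part of the Avro library, and it does not parse the metadata or sync marker that follow.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

class AvroMagicCheck {
    // The four bytes that start an Avro data file: 'O', 'b', 'j', then the byte 1.
    static final byte[] MAGIC = {'O', 'b', 'j', 1};

    // Returns true if the stream starts with the Avro magic bytes.
    static boolean hasAvroMagic(InputStream in) {
        byte[] head = new byte[MAGIC.length];
        try {
            new DataInputStream(in).readFully(head);
        } catch (EOFException e) {
            return false; // fewer than four bytes available
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Arrays.equals(head, MAGIC);
    }

    public static void main(String[] args) {
        byte[] sample = {'O', 'b', 'j', 1, 0, 0};
        System.out.println(hasAvroMagic(new ByteArrayInputStream(sample))); // prints "true"
    }
}
```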
All metadata properties that start with "avro." are reserved.
avro.schema contains the schema of objects stored in the file, as JSON data (required).
avro.codec the name of the compression codec used to compress blocks, as a string. Implementations are required to support the following codecs: "null" and "deflate". If codec is absent, it is assumed to be "null". The codecs are described with more detail below.
Required Codecs
null
The "null" codec simply passes through data uncompressed.
deflate
The "deflate" codec writes the data block using the deflate algorithm as specified in RFC 1951, and typically implemented using the zlib library. Note that this format (unlike the "zlib format" in RFC 1950) does not have a checksum.
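In Java, the difference between raw deflate (RFC 1951) and the zlib format (RFC 1950) corresponds to the `nowrap` flag of `java.util.zip.Deflater` and `Inflater`. The sketch below shows one way to produce and read raw deflate data with the JDK; `RawDeflateSketch` is a name made up for this example, and it is not a copy of Avro's own codec class.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class RawDeflateSketch {
    // Compress with nowrap = true: raw deflate, no zlib header and no checksum.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress raw deflate data; the Inflater must also be created with nowrap = true.
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater(true);
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or invalid deflate data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "avro avro avro avro avro".getBytes(StandardCharsets.UTF_8);
        byte[] restored = decompress(compress(original));
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
```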
Optional Codecs
snappy
The "snappy" codec uses Google's Snappy compression library. Each compressed block is followed by the 4-byte, big-endian CRC32 checksum of the uncompressed data in the block.
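The checksum part of this rule can be sketched with the JDK's `java.util.zip.CRC32`. The example below only computes and verifies the 4-byte trailer; it does not perform Snappy compression, which needs a third-party library. `Crc32Trailer` is a hypothetical helper name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

class Crc32Trailer {
    // 4-byte, big-endian CRC32 of the uncompressed block data.
    static byte[] checksum(byte[] uncompressed) {
        CRC32 crc = new CRC32();
        crc.update(uncompressed, 0, uncompressed.length);
        // ByteBuffer is big-endian by default, so putInt gives the required byte order.
        return ByteBuffer.allocate(4).putInt((int) crc.getValue()).array();
    }

    // Check a trailer read from a file against freshly decompressed data.
    static boolean matches(byte[] uncompressed, byte[] trailer) {
        return Arrays.equals(checksum(uncompressed), trailer);
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        StringBuilder hex = new StringBuilder();
        for (byte b : checksum(data)) {
            hex.append(String.format("%02x", b & 0xff));
        }
        System.out.println(hex); // prints "cbf43926", the standard CRC-32 check value
    }
}
```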
A file data block consists of:
A long indicating the count of objects in this block.
A long indicating the size in bytes of the serialized objects in the current block, after any codec is applied.
The serialized objects. If a codec is specified, this is compressed by that codec.
The file's 16-byte sync marker.
Thus, each block's binary data can be efficiently extracted or skipped without deserializing the contents. The combination of block size, object counts, and sync markers enable detection of corrupt blocks and help ensure data integrity.
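The block layout can be illustrated with a JDK-only sketch. It frames payloads as count, size, data and sync marker, and then walks the blocks without decoding any payload. The longs use Avro's variable-length zig-zag encoding. `AvroBlockSketch` and its methods are hypothetical names; this is not the Avro library's reader, and it ignores the file header and codecs.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

class AvroBlockSketch {
    // Write a long the way Avro does: zig-zag, then 7 bits per byte, low-order group first.
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    // Inverse of writeLong.
    static long readLong(ByteBuffer in) {
        long z = 0;
        int shift = 0;
        byte b;
        do {
            b = in.get();
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (z >>> 1) ^ -(z & 1);
    }

    // Frame one block: object count, payload size in bytes, payload, sync marker.
    static byte[] block(long count, byte[] payload, byte[] sync) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(out, count);
        writeLong(out, payload.length);
        out.write(payload, 0, payload.length);
        out.write(sync, 0, sync.length);
        return out.toByteArray();
    }

    // Sum the object counts of all blocks without deserializing any payload.
    // A sync marker that does not match signals a corrupt block.
    static long countObjects(byte[] blocks, byte[] sync) {
        ByteBuffer in = ByteBuffer.wrap(blocks);
        byte[] marker = new byte[sync.length];
        long total = 0;
        while (in.hasRemaining()) {
            long count = readLong(in);
            long size = readLong(in);
            in.position(in.position() + (int) size); // skip the serialized objects
            in.get(marker);
            if (!Arrays.equals(marker, sync)) {
                throw new IllegalStateException("corrupt block: sync marker mismatch");
            }
            total += count;
        }
        return total;
    }
}
```

Because every block ends with the same 16-byte marker, a reader that starts at an arbitrary offset can scan forward to the next marker and begin there, which is what makes the format usable for splitting.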
Schemas and data can be accessed through GenericRecord or through SpecificRecord. The latter requires Avro-Tools to generate the object classes:
% hadoop jar /usr/lib/avro/avro-tools.jar compile schema /pair.avsc /home/cloudera/workspace/
The schema and its namespace are embedded in the generated class.
{
  "namespace":"com.jinbao.hadoop.hdfs.avro.compile",
  "type":"record",
  "name":"MyAvro",
  "fields":[
    { "name":"name","type":"string" },
    { "name":"age","type":"int" },
    { "name":"isman","type":"boolean" }
  ]
}
The code is as follows:
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificData;
import org.apache.avro.specific.SpecificDatumWriter;

// Generated by avro-tools from the schema above; the package comes from its namespace.
import com.jinbao.hadoop.hdfs.avro.compile.MyAvro;

public class AvroTest {
    private static String avscfile = "/home/cloudera/pair.avsc";
    private static String avrofile = "/home/cloudera/pair.avro";

    public static void main(String[] args) throws IOException {
        // schemaReadWrite();
        // WriteData();
        ReadData();
    }

    private static void schemaReadWrite() throws IOException {
        // Read the schema from the schema file.
        Schema.Parser ps = new Schema.Parser();
        Schema schema = ps.parse(new File(avscfile));
        System.out.println(schema.getName());
        System.out.println(schema.getType());
        System.out.println(schema.getDoc());
        System.out.println(schema.getFields());

        // Construct a record. The field names must match the MyAvro schema above
        // (the original used "left"/"right", which that schema does not define).
        GenericRecord datum = new GenericData.Record(schema);
        datum.put("name", "mother");
        datum.put("age", 30);
        datum.put("isman", false);

        // Write the record to an in-memory output stream.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
        Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(datum, encoder);
        encoder.flush();
        out.close();

        // Read the record back from the bytes.
        DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
        Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord record = reader.read(null, decoder);
        System.out.println(record.get("name"));
        System.out.println(record.get("age"));
    }

    public static void WriteData() throws IOException {
        Schema.Parser ps = new Schema.Parser();
        Schema schema = ps.parse(new File(avscfile));
        File file = new File(avrofile);

        // MyAvro is a generated (specific) class, so use the specific writer.
        DatumWriter<MyAvro> writer = new SpecificDatumWriter<MyAvro>(MyAvro.class);
        DataFileWriter<MyAvro> fileWriter = new DataFileWriter<MyAvro>(writer);
        fileWriter.create(schema, file);

        MyAvro datum = new MyAvro();
        for (int i = 0; i < 5; i++) {
            datum.setName("name1" + i);
            datum.setAge(10 + i);
            datum.setIsman(i % 2 == 0);
            fileWriter.append(datum);
        }
        fileWriter.close();
    }

    public static void ReadData() throws IOException {
        File file = new File(avrofile);
        // No schema is passed in: the reader uses the writer's schema stored in the file.
        DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
        DataFileReader<GenericRecord> fileReader = new DataFileReader<GenericRecord>(file, reader);
        Schema schema = fileReader.getSchema();
        System.out.println(schema);

        GenericRecord record = null;
        MyAvro datum = null;
        while (fileReader.hasNext()) {
            record = fileReader.next();
            System.out.println(record.toString());
            // Convert the GenericRecord to a SpecificRecord.
            datum = (MyAvro) SpecificData.get().deepCopy(schema, record);
            System.out.println(datum.toString());
        }

        // seek() expects a position previously returned by DataFileWriter.sync();
        // sync() moves to the next synchronization point after the given position.
        fileReader.seek(0);
        fileReader.sync(0);
        fileReader.close();
    }
}