Apache Avro: Tutorial and Course

Apache Avro: Overview

Apache Avro is a language-neutral data serialization system. The project was created by Doug Cutting (the creator of Hadoop) to address the major downside of Hadoop Writables: lack of language portability. Having a data format that can be processed by many languages (currently C, C++, C#, Java, JavaScript, Perl, PHP, Python, and Ruby) makes it easier to share datasets with a wider audience than one tied to a single language. It is also more future-proof, allowing data to potentially outlive the language used to read and write it.

But why a new data serialization system? Avro has a set of features that, taken together, differentiate it from other systems such as Apache Thrift or Google's Protocol Buffers. As in these systems and others, Avro data is described using a language-independent schema. However, unlike in some other systems, code generation is optional in Avro, which means you can read and write data that conforms to a given schema even if your code has not seen that particular schema before. To achieve this, Avro assumes that the schema is always present, at both read and write time, which makes for a very compact encoding, since encoded values do not need to be tagged with a field identifier.

Apache Avro schemas are usually written in JSON, and data is usually encoded using a binary format, but there are other options, too. There is a higher-level language called Avro IDL for writing schemas in a C-like language that is more familiar to developers. There is also a JSON-based data encoder, which, being human readable, is useful for prototyping and debugging Avro data.
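As a small illustration of a schema written in JSON, the sketch below parses a record schema with Python's standard json module and lists its fields. The User record, its namespace, and its two fields are hypothetical and chosen only for this example. No Avro library is involved.

```python
import json

# Hypothetical example schema: a record with a string field and a long field.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "long"}
  ]
}
"""

schema = json.loads(SCHEMA_JSON)

# Map each field name to its declared Avro type.
field_types = {f["name"]: f["type"] for f in schema["fields"]}
print(field_types)

# With Avro's JSON encoding, a record that contains no unions is written
# as a plain JSON object keyed by field name, which is easy to read by eye.
datum_as_json = json.dumps({"name": "Ada", "age": 36})
print(datum_as_json)
```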

The Avro specification precisely defines the binary format that all implementations must support. It also specifies many of the other features of Avro that implementations should support. One area that the specification does not rule on, however, is APIs: implementations have complete latitude in the APIs they expose for working with Avro data, since each one is necessarily language specific. The fact that there is only one binary format is significant: it lowers the barrier to implementing a new language binding, and it avoids a combinatorial explosion of languages and formats, which would harm interoperability.
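To make the binary format more concrete, here is a minimal sketch of how a record with string and long fields could be encoded by hand. It follows the encoding rules in the specification (zigzag, variable-length integers; strings as a length followed by UTF-8 bytes; record fields concatenated in schema order with no field tags). The function names and the User schema are hypothetical, and only two primitive types are handled.

```python
def encode_long(n: int) -> bytes:
    """Encode an int/long using zigzag plus variable-length bytes."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes become small unsigned values
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, high bit set means "more bytes follow"
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode a string as its UTF-8 byte length (a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


# Only the primitive types needed for this sketch.
ENCODERS = {"int": encode_long, "long": encode_long, "string": encode_string}


def encode_record(schema: dict, datum: dict) -> bytes:
    """Concatenate field encodings in schema order; no names or tags are written."""
    return b"".join(
        ENCODERS[field["type"]](datum[field["name"]]) for field in schema["fields"]
    )


# Hypothetical schema used only for illustration.
USER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "long"},
    ],
}

encoded = encode_record(USER_SCHEMA, {"name": "Ada", "age": 36})
print(encoded.hex())  # 0641646148
```

The five output bytes contain only the values. A reader needs the schema to know that the first value is a string called name and the second a long called age, which is why the encoding can be so compact.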

Avro has rich schema resolution capabilities. Within certain carefully defined constraints, the schema used to read data need not be identical to the schema that was used to write the data. This is the mechanism by which Avro supports schema evolution. For example, a new, optional field (one that declares a default value) may be added to a record by declaring it in the schema used to read the old data. New and old clients alike will be able to read the old data, while new clients can write new data that uses the new field. Conversely, if an old client sees newly encoded data, it will gracefully ignore the new field and carry on processing as it would have done with old data.
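The record-level rule described above can be sketched in a few lines. This is a simplified model, not how an Avro library is implemented: it works on an already decoded record (a Python dict) instead of resolving while decoding, and it ignores type promotion, aliases, and nested types. The schemas and the resolve_record function are hypothetical.

```python
def resolve_record(writer_schema: dict, reader_schema: dict, datum: dict) -> dict:
    """Project a record written with writer_schema onto reader_schema."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            result[name] = datum[name]        # field known to both sides
        elif "default" in field:
            result[name] = field["default"]   # new field: fall back to its default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    # Writer fields that the reader does not declare are simply skipped.
    return result


# Hypothetical old and new versions of the same record schema.
SCHEMA_V1 = {
    "type": "record",
    "name": "User",
    "fields": [{"name": "name", "type": "string"}],
}
SCHEMA_V2 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

# A new client reading old data sees the default for the added field.
old_as_new = resolve_record(SCHEMA_V1, SCHEMA_V2, {"name": "Ada"})
print(old_as_new)

# An old client reading new data ignores the field it does not know about.
new_as_old = resolve_record(
    SCHEMA_V2, SCHEMA_V1, {"name": "Grace", "email": "grace@example.com"}
)
print(new_as_old)
```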

Apache Avro specifies an object container format for sequences of objects, similar to Hadoop's sequence file. An Avro datafile has a metadata section where the schema is stored, which makes the file self-describing. Avro datafiles support compression and are splittable, which is crucial for a MapReduce data input format. In fact, support goes beyond MapReduce: data processing frameworks such as Pig, Hive, Crunch, and Spark can also read and write Avro datafiles.
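The layout of a datafile can be sketched as follows, based on the object container file section of the specification: a four-byte magic, a metadata map that carries the schema and codec, a 16-byte sync marker, and then data blocks that each end with the same marker. The helper names are hypothetical, compression is left out (codec "null"), and the single object in the block is a pre-encoded placeholder. Real code would normally use an Avro library instead.

```python
import json
import os

MAGIC = b"Obj\x01"  # ASCII 'O', 'b', 'j', followed by the byte 1


def encode_long(n: int) -> bytes:
    """Zigzag, variable-length encoding used for Avro longs."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_bytes(data: bytes) -> bytes:
    """Length-prefixed bytes; strings use the same layout over UTF-8."""
    return encode_long(len(data)) + data


def build_header(schema: dict, sync_marker: bytes, codec: bytes = b"null") -> bytes:
    """Magic, metadata map (schema and codec), then the sync marker."""
    meta = {
        "avro.schema": json.dumps(schema).encode("utf-8"),
        "avro.codec": codec,
    }
    out = bytearray(MAGIC)
    out += encode_long(len(meta))              # one map block holding every entry
    for key, value in meta.items():
        out += encode_bytes(key.encode("utf-8"))
        out += encode_bytes(value)
    out += encode_long(0)                      # a zero count ends the map
    out += sync_marker
    return bytes(out)


def build_block(encoded_objects: list, sync_marker: bytes) -> bytes:
    """Object count, payload size in bytes, payload, then the sync marker."""
    payload = b"".join(encoded_objects)
    return (
        encode_long(len(encoded_objects))
        + encode_long(len(payload))
        + payload
        + sync_marker
    )


# Hypothetical schema and one pre-encoded object, for illustration only.
schema = {
    "type": "record",
    "name": "User",
    "fields": [{"name": "name", "type": "string"}, {"name": "age", "type": "long"}],
}
sync = os.urandom(16)  # randomly generated marker, one per file
header = build_header(schema, sync)
block = build_block([b"\x06Ada\x48"], sync)
file_bytes = header + block
print(len(header), len(block))
```

Because every block ends with the file's sync marker, a reader that starts at an arbitrary byte offset can scan forward to the next marker and resume at a block boundary. That property is what makes the format splittable.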

Apache Avro can be used for RPC, too, although this isn't covered here. More information is available in the Avro specification.

Apache Avro: Further Reading