Avro

Apache Avro
Some excerpts from the book Designing Data Intensive Applications:
Apache Avro is another binary encoding format that is interestingly different from Protocol Buffers and Thrift. It was started in 2009 as a subproject of Hadoop, as a result of Thrift not being a good fit for Hadoop’s use cases.
Avro also uses a schema to specify the structure of the data being encoded. It has two schema languages: one (Avro IDL) intended for human editing, and one (based on JSON) that is more easily machine-readable.
Unlike Thrift and Protocol Buffers, Apache Avro does not use tag numbers. More excerpts from Designing Data Intensive Applications:
The writer’s schema and the reader’s schema
With Avro, when an application wants to encode some data (to write it to a file or database, to send it over the network, etc.), it encodes the data using whatever version of the schema it knows about—for example, that schema may be compiled into the application. This is known as the writer’s schema.

When an application wants to decode some data (read it from a file or database, receive it from the network, etc.), it is expecting the data to be in some schema, which is known as the reader’s schema. That is the schema the application code is relying on—code may have been generated from that schema during the application’s build process.

The key idea with Avro is that the writer’s schema and the reader’s schema don’t have to be the same—they only need to be compatible. When data is decoded (read), the Avro library resolves the differences by looking at the writer’s schema and the
reader’s schema side by side and translating the data from the writer’s schema into the reader’s schema. The Avro specification [20] defines exactly how this resolution works,...

For example, it’s no problem if the writer’s schema and the reader’s schema have their fields in a different order, because the schema resolution matches up the fields by field name. If the code reading the data encounters a field that appears in the writer’s schema but not in the reader’s schema, it is ignored. If the code reading the data expects some field, but the writer’s schema does not contain a field of that name, it is filled in with a default value declared in the reader’s schema.
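A minimal sketch of this resolution with Avro's Java API: GenericDatumReader takes both schemas and does the translation. The Person schemas and sample values below are made up for illustration.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaResolutionDemo {
  public static void main(String[] args) throws Exception {
    // Writer's schema: the version the encoding application was built with
    Schema writerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
        + "{\"name\":\"lastName\",\"type\":\"string\"},"
        + "{\"name\":\"firstName\",\"type\":\"string\"}]}");

    // Reader's schema: same fields in a different order, plus a new field with a default
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
        + "{\"name\":\"firstName\",\"type\":\"string\"},"
        + "{\"name\":\"lastName\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\",\"default\":0}]}");

    // Encode using the writer's schema
    GenericRecord person = new GenericData.Record(writerSchema);
    person.put("lastName", "Doe");
    person.put("firstName", "John");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writerSchema).write(person, encoder);
    encoder.flush();

    // Decode: the library translates from the writer's schema to the reader's schema,
    // matching fields by name and filling "age" with its declared default
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord decoded =
        new GenericDatumReader<GenericRecord>(writerSchema, readerSchema).read(null, decoder);
    System.out.println(decoded); // {"firstName": "John", "lastName": "Doe", "age": 0}
  }
}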
Avro vs Parquet
The explanation is as follows:
Avro and Parquet are both compact binary storage formats that require a schema to structure the data that is being encoded. The difference is that Avro stores data in row format and Parquet stores data in a columnar format.
In my experience, these two formats are pretty much interchangeable. In fact, Parquet natively supports Avro schemas i.e., you could send Avro data to a Parquet reader and it would work just fine.
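As a small illustration of that interoperability, the parquet-avro module can write Avro GenericRecords straight into a Parquet file. A sketch, assuming the org.apache.parquet:parquet-avro dependency is on the classpath; the Reading schema and file path are placeholders.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

Schema schema = new Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"Reading\",\"fields\":["
    + "{\"name\":\"id\",\"type\":\"string\"},"
    + "{\"name\":\"value\",\"type\":\"int\"}]}");

// Rows are built as ordinary Avro records but stored in Parquet's columnar layout
try (ParquetWriter<GenericRecord> writer =
         AvroParquetWriter.<GenericRecord>builder(new Path("readings.parquet"))
             .withSchema(schema)
             .build()) {
  GenericRecord row = new GenericData.Record(schema);
  row.put("id", "r-1");
  row.put("value", 42);
  writer.write(row);
}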
Avro Maven Plugin
It is used to generate Java code from Avro files. I moved the details to the Avro Maven Plugin post.

Maven
We include the following line:
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>${avro.version}</version>
</dependency>
If we want to use it with Apache Kafka, we also include the following line:
<dependency>
  <groupId>io.confluent</groupId>
  <artifactId>kafka-avro-serializer</artifactId>
  <version>5.5.1</version>
</dependency>
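A sketch of wiring this serializer into a producer. KafkaAvroSerializer requires a Schema Registry; the broker address, registry URL, topic name, and the someAvroRecord variable below are all placeholder assumptions.

import java.util.Properties;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
// The serializer registers the writer's schema here and embeds its id in each message
props.put("schema.registry.url", "http://localhost:8081");

try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
  // someAvroRecord is a placeholder for any Avro record, e.g. built as in the avsc examples below
  producer.send(new ProducerRecord<>("person-topic", someAvroRecord));
}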
The avsc File
The file is named ClassName.avsc. It lives under /src/main/avro because that directory is specified as the sourceDirectory in the plugin settings.

An avsc file is actually a JSON object. The explanation is as follows:
Any Avro record has the following fields:
1. type
2. namespace
3. name
4. version
5. fields
- the type field is set to record here; Avro's other complex types include enum, array, map, and fixed
- the name field specifies the class name
- the namespace field specifies the package of the class
- the version field is the version number of this file
- the fields field is an array. For each field, name, type, and doc values are entered.
As the field type, values such as int and string can be used; Avro's full set of primitive types is null, boolean, int, long, float, double, bytes, and string.

Example
We do it like this:
{
  "type": "record",
  "name": "SupplierRiskData",
  "fields": [
    {"name": "supplierId", "type": "string"},
    {"name": "riskLevel", "type": "string"},
    {"name": "rating", "type": "int"}
  ]
}
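A sketch of filling this schema without generated code, via Avro's GenericRecord API; the file path and sample values are our own assumptions.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// Parse the schema above, assumed to be saved as SupplierRiskData.avsc
Schema schema = new Schema.Parser().parse(new File("src/main/avro/SupplierRiskData.avsc"));

GenericRecord record = new GenericData.Record(schema);
record.put("supplierId", "S-1001"); // sample values for illustration
record.put("riskLevel", "HIGH");
record.put("rating", 3);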

Example
We do it like this. The class generated from this avsc file will be com.purnima.jain.avro.dto.PersonDto.
{      
  "namespace": "com.purnima.jain.avro.dto",
  "type": "record",
  "name": "PersonDto",
  "fields": [
    {
      "name": "firstName",
      "type": "string",
      "doc": "the first name of a person"
    },
    {
      "name": "lastName",
      "type": "string",
      "doc": "the last name of a person"
    }
  ]
}
To use the class, we do it like this:
PersonDto personDto = PersonDto.newBuilder()
  .setFirstName(...)
  .setLastName(...)
  .build();
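To turn the built object into Avro bytes, a sketch using SpecificDatumWriter; the stream plumbing here is our own illustration, not part of the generated class.

import java.io.ByteArrayOutputStream;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;

// Serialize personDto with the schema compiled into the generated class (the writer's schema)
ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
new SpecificDatumWriter<>(PersonDto.class).write(personDto, encoder);
encoder.flush();
byte[] avroBytes = out.toByteArray();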
Example
We do it like this:
{
  "name": "Transaction",
  "type": "record",
  "namespace": "demo.camel",
  "fields": [
    {
      "name": "userid",
      "type": "string"
    },
    {
      "name": "transactionid",
      "type": "string"
    },
    {
      "name": "transactiontype",
      "type": "string"
    },
    {
      "name": "currency",
      "type": "string"
    },
    {
      "name": "amt",
      "type": "string"
    }
  ]
}
Example
The explanation is as follows:
If we later want to add another field to the schema, we usually want to stay compatible with applications that only have an older version of the schema. In our example we add a field “data” that can either be a string or null (that is, the field data is optional).
We do it like this. Here the field named data is optional:
{
  "namespace": "at.willhaben.tech.avro",
  "type": "record",
  "name": "SomeRecord",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "data", "type": ["null","string"], "default":null}
  ]
}
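A short sketch of this compatibility in action, using the same generic API (and imports) as the writer's/reader's schema sketch above: bytes written with the old schema, which lacks data, are decoded with the new schema, and data is filled with its declared default, null.

Schema oldSchema = new Schema.Parser().parse(
    "{\"namespace\":\"at.willhaben.tech.avro\",\"type\":\"record\",\"name\":\"SomeRecord\","
    + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");
Schema newSchema = new Schema.Parser().parse(
    "{\"namespace\":\"at.willhaben.tech.avro\",\"type\":\"record\",\"name\":\"SomeRecord\","
    + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
    + "{\"name\":\"data\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

// Encode with the old (writer's) schema
GenericRecord written = new GenericData.Record(oldSchema);
written.put("name", "example");
ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
new GenericDatumWriter<GenericRecord>(oldSchema).write(written, encoder);
encoder.flush();

// Decode with the new (reader's) schema; the missing field gets its default
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
GenericRecord decoded =
    new GenericDatumReader<GenericRecord>(oldSchema, newSchema).read(null, decoder);
System.out.println(decoded.get("data")); // null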
