Apache Commons Çorbası: Apache Arrow

Giriş

Açıklaması şöyle

Apache Arrow Flight is a high-performance RPC framework designed specifically for transferring large amounts of columnar data over a network. Unlike ODBC/JDBC, it eliminates the need for intermediate serialization steps, significantly reducing transfer latency and increasing throughput.

Arrow Flight SQL

Açıklaması şöyle

Apache Arrow Flight SQL extends Arrow Flight by providing a standardized interface for SQL-based interactions with databases. This means that developers can benefit from Arrow’s high-speed data transfers while maintaining a familiar SQL interface. Unlike traditional database protocols, Flight SQL enables direct execution of SQL queries over the high-performance Arrow Flight transport layer, eliminating unnecessary serialization overhead and reducing query latency.

Apache Arrow Database Connectivity (ADBC)

Açıklaması şöyle

ADBC provides a standardized API for database interactions, making it easier for developers to query and work with databases using Arrow-native data (with/without Flight SQL).

Adoption of ADBC

DuckDB, dbt, Snowflake ADBC destekliyor.

Apache Arrow vs Apache Parquet

Açıklaması şöyle

The Apache Arrow format project began in February 2016, focusing on columnar in-memory analytics workload. Unlike file formats like Parquet or CSV, which specify how data is organized on disk, Arrow focuses on how data is organized in memory.

Maven

Şu satırı dahil ederiz

<dependency>
  <groupId>org.apache.arrow</groupId>
  <artifactId>arrow-memory</artifactId>
  <version>6.0.1</version>
</dependency>

<dependency>
  <groupId>org.apache.arrow</groupId>
  <artifactId>arrow-vector</artifactId>
  <version>6.0.1</version>
</dependency>

Örnek

Yazma için şöyle yaparız

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.*;
import org.apache.arrow.vector.ipc.*;
import org.apache.arrow.vector.util.*;

// Set up the allocator and the schema for the vector
try (RootAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
  VarCharVector vector = new VarCharVector("vector", allocator);
  ArrowWriter writer = new ArrowWriter(vector, new Schema(Collections.
    singletonList(vector.getField())))) {

  // Write data to the vector
  vector.setSafe(0, "Apache".getBytes());
  vector.setSafe(1, "Arrow".getBytes());
  vector.setSafe(2, "Java".getBytes());
  vector.setValueCount(3);

  // Write vector to a file
  try (FileOutputStream out = new FileOutputStream("arrow-data.arrow")) {
    writer.writeArrow(out.getChannel());
  }
}

Okuma için şöyle yaparız

// Now, let's read the data we just wrote
try (RootAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
  ArrowReader reader = new ArrowReader(new FileInputStream("arrow-data.arrow")
    .getChannel(), allocator)) {

  // Read schema and load the data
  reader.loadNextBatch();

  // Get the vector
  try (VarCharVector vector = (VarCharVector) reader.getVectorSchemaRoot()
    .getVector("vector")) {
    // Iterate over the values in the vector
    for (int i = 0; i < vector.getValueCount(); i++) {
      System.out.println(new String(vector.get(i)));
    }
  }
}

Apache Commons Çorbası

30 Ekim 2024 Çarşamba

Apache Arrow - Columnar Memory Format

Hiç yorum yok:

Yorum Gönder