30 Ekim 2024 Çarşamba

Apache Arrow - Columnar Memory Format

Giriş
Açıklaması şöyle
Apache Arrow is a cross-language development framework for in-memory data. It provides a standardized columnar memory format for efficient data sharing and fast analytics. Arrow employs a language-agnostic approach, designed to eliminate the need for data serialization and deserialization, improving the performance and interoperability between complex data processes and systems.
Apache Arrow vs Apache Parquet 
Açıklaması şöyle
The Apache Arrow format project began in February 2016, focusing on columnar in-memory analytics workload. Unlike file formats like Parquet or CSV, which specify how data is organized on disk, Arrow focuses on how data is organized in memory.
Maven
Şu satırı dahil ederiz
<dependency>
  <groupId>org.apache.arrow</groupId>
  <artifactId>arrow-memory</artifactId>
  <version>6.0.1</version>
</dependency>

<dependency>
  <groupId>org.apache.arrow</groupId>
  <artifactId>arrow-vector</artifactId>
  <version>6.0.1</version>
</dependency>
Örnek
Yazma için şöyle yaparız
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.*;
import org.apache.arrow.vector.ipc.*;
import org.apache.arrow.vector.util.*;

// Set up the allocator and the schema for the vector
try (RootAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
  VarCharVector vector = new VarCharVector("vector", allocator);
  ArrowWriter writer = new ArrowWriter(vector, new Schema(Collections. singletonList(vector.getField())))) {

  // Write data to the vector
  vector.setSafe(0, "Apache".getBytes());
  vector.setSafe(1, "Arrow".getBytes());
  vector.setSafe(2, "Java".getBytes());
  vector.setValueCount(3);

  // Write vector to a file
  try (FileOutputStream out = new FileOutputStream("arrow-data.arrow")) {
    writer.writeArrow(out.getChannel());
  }
}
Okuma için şöyle yaparız
// Now, let's read the data we just wrote
try (RootAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
  ArrowReader reader = new ArrowReader(new FileInputStream("arrow-data.arrow") .getChannel(), allocator)) {

  // Read schema and load the data
  reader.loadNextBatch();

  // Get the vector
  try (VarCharVector vector = (VarCharVector) reader.getVectorSchemaRoot() .getVector("vector")) {
    // Iterate over the values in the vector
    for (int i = 0; i < vector.getValueCount(); i++) {
      System.out.println(new String(vector.get(i)));
    } } }

Hiç yorum yok:

Yorum Gönder