Apache Commons Çorbası

2 Mayıs 2025 Cuma

Apache Airflow - Orchestrator

11 Nisan 2025 Cuma

Apache Doris

Giriş

Açıklaması şöyle

Apache Doris employs a typical MPP (Massively Parallel Processing) distributed architecture, tailored for high-concurrency, low-latency real-time online analytical processing (OLAP) scenarios. It comprises front-end and back-end components, leveraging multi-node parallel computing and columnar storage to efficiently manage massive datasets. This design enables Doris to deliver query results in sub-seconds, making it ideal for complex aggregations and analytical queries on large datasets.

Dialects

Açıklaması şöyle

The SQL dialect conversion feature of Apache Doris understands more than ten SQL dialects, including Presto, Trino, Hive, and ClickHouse.

Full-text Search

Açıklaması şöyle

In version 2.0, Apache Doris introduced inverted index and started to support full-text search.

29 Kasım 2024 Cuma

Apache Paimon

Giriş

Açıklaması şöyle

Apache Paimon is a new data lakehouse format that focuses on solving the challenges of streaming scenarios, but also supports batch processing.

Merge files

Açıklaması şöyle

Paimon is designed with a built-in merge mechanism, and many other optimizations for mass writes, making it more adaptable to streaming scenarios.

21 Kasım 2024 Perşembe

Apache Iceberg - Data Lake İçin Open Table Format

Giriş

Açıklaması şöyle

Apache Iceberg defines a table format that separates how data is stored from how data is queried. Any engine that implements the Iceberg integration — Spark, Flink, Trino, DuckDB, Snowflake, RisingWave — can read and/or write Iceberg data directly.

Mimari Değişiyor

Açıklaması şöyle

This changes the architecture. You don’t need to move data between systems anymore. You don’t need to reprocess or convert formats. You can process data using one engine and query it using another.

İkili Format

Açıklaması şöyle

The Future is Dual-Format
I think the long-term architecture for most databases is going to be dual-format:

1. A proprietary format, optimized for internal performance — low-latency access, in-memory workloads, transaction processing, etc.
2. An open format, like Iceberg, for interoperability — long-term storage, external access, and sharing across systems.

Iceberg vs Parquet

Açıklaması şöyle

Iceberg is a table format while Parquet is a file format. Iceberg tables on built on Parquet files. They offer different levels of abstraction.

Kim Iceberg Kullanıyor

Açıklaması şöyle

Iceberg is the most widely supported by various open-source engines, including pure query engines (e.g., Trino), New SQL databases (e.g., StarRocks, Doris), and streaming frameworks (e.g., Flink, Spark), all of which support Iceberg.

Iceberg Problems

Stream write işleminde dosyaları birleştirmek gerekiyor. Açıklaması şöyle

Iceberg faces several problems in streaming scenarios, the most serious one is the fragmentation of small files. Queries in data lakehouses rely heavily on file reads, and if a query has to scan many files at once, it will of course perform poorly.

To address this issue, an external orchestrator is required to regularly merge files.

What’s Coming for Iceberg in 2025?

Açıklama burada

1. RBAC Catalog: Fixing Permissions at Scale

2. Change Data Capture (CDC): Iceberg’s Streaming Evolution

3. Materialized Views: Simplifying Derived Data

AWS S3

Açıklaması şöyle

Without Iceberg, trying to find specific information in your raw data files on S3 can be like searching for a needle in a haystack. Tools like AWS Athena can query files, but managing the structure of your data (schema) and controlling who has access (access control) requires manual setup. Iceberg transforms your S3 buckets into well-structured, queryable datasets with proper access controls, making them compatible with any modern query engine. By layering Iceberg on top of S3, businesses gain a cohesive way to organize and make sense of sprawling data lakes, which would otherwise remain chaotic and unmanageable.

30 Ekim 2024 Çarşamba

Apache Arrow - Columnar Memory Format

Giriş

Açıklaması şöyle

Apache Arrow Flight is a high-performance RPC framework designed specifically for transferring large amounts of columnar data over a network. Unlike ODBC/JDBC, it eliminates the need for intermediate serialization steps, significantly reducing transfer latency and increasing throughput.

Arrow Flight SQL

Açıklaması şöyle

Apache Arrow Flight SQL extends Arrow Flight by providing a standardized interface for SQL-based interactions with databases. This means that developers can benefit from Arrow’s high-speed data transfers while maintaining a familiar SQL interface. Unlike traditional database protocols, Flight SQL enables direct execution of SQL queries over the high-performance Arrow Flight transport layer, eliminating unnecessary serialization overhead and reducing query latency.

Apache Arrow Database Connectivity (ADBC)

Açıklaması şöyle

ADBC provides a standardized API for database interactions, making it easier for developers to query and work with databases using Arrow-native data (with/without Flight SQL).

Adoption of ADBC

DuckDB, dbt, Snowflake ADBC destekliyor.

Apache Arrow vs Apache Parquet

Açıklaması şöyle

The Apache Arrow format project began in February 2016, focusing on columnar in-memory analytics workload. Unlike file formats like Parquet or CSV, which specify how data is organized on disk, Arrow focuses on how data is organized in memory.

Maven

Şu satırı dahil ederiz

<dependency>
  <groupId>org.apache.arrow</groupId>
  <artifactId>arrow-memory</artifactId>
  <version>6.0.1</version>
</dependency>

<dependency>
  <groupId>org.apache.arrow</groupId>
  <artifactId>arrow-vector</artifactId>
  <version>6.0.1</version>
</dependency>

Örnek

Yazma için şöyle yaparız

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.*;
import org.apache.arrow.vector.ipc.*;
import org.apache.arrow.vector.util.*;

// Set up the allocator and the schema for the vector
try (RootAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
  VarCharVector vector = new VarCharVector("vector", allocator);
  ArrowWriter writer = new ArrowWriter(vector, new Schema(Collections.
    singletonList(vector.getField())))) {

  // Write data to the vector
  vector.setSafe(0, "Apache".getBytes());
  vector.setSafe(1, "Arrow".getBytes());
  vector.setSafe(2, "Java".getBytes());
  vector.setValueCount(3);

  // Write vector to a file
  try (FileOutputStream out = new FileOutputStream("arrow-data.arrow")) {
    writer.writeArrow(out.getChannel());
  }
}

Okuma için şöyle yaparız

// Now, let's read the data we just wrote
try (RootAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
  ArrowReader reader = new ArrowReader(new FileInputStream("arrow-data.arrow")
    .getChannel(), allocator)) {

  // Read schema and load the data
  reader.loadNextBatch();

  // Get the vector
  try (VarCharVector vector = (VarCharVector) reader.getVectorSchemaRoot()
    .getVector("vector")) {
    // Iterate over the values in the vector
    for (int i = 0; i < vector.getValueCount(); i++) {
      System.out.println(new String(vector.get(i)));
    }
  }
}

18 Temmuz 2023 Salı

Apache CarbonData

Giriş

Açıklaması şöyle

Apache CarbonData is an indexed columnar data format that is developed specifically for big data scenarios where fast analytics and real-time insights are critical.

Deep Integration with Spark

Açıklaması şöyle

CarbonData has been deeply integrated with Apache Spark, providing Spark SQL’s query optimization techniques and using its Code Generation capabilities. This makes it possible to directly query CarbonData files using Spark SQL, hence giving faster and more efficient query results.

Multi-Layered Structure

Açıklaması şöyle

Apache CarbonData is structured in multiple layers, which includes the table, segment, block, and page levels. This hierarchical structure allows efficient data retrieval by skipping irrelevant data during the query execution.

Table: A table is a collection of segments, and each segment represents a set of data files.

Segment: A segment contains multiple data blocks, where each block can store a significant amount of data.

Block: A block is divided into blocklets. Each blocklet holds a series of column pages, which are organized column-wise.

Page: The page level is where the actual data is stored. The data in these pages is encoded and compressed, making data retrieval efficient.

Avro Compabitability

Giriş

Açıklamaların detayı burada

1. Geriye Uyumluluk

İleride olanları ilgilendirir. Geriye uyumluluk, en son sürüm, geriden gelen sürümü ile üretilen veriyi okuyabilir demek

BACKWARD

En son iki schema geriye doğru uyumludur.

BACKWARD_TRANSITIVE

Tüm schema'lar geriye doğru uyumludur.

2. İleriye Uyumluluk

Geriden gelenleri ilgilendirir. İleriye Uyumluluk, gerideki schema, ilerideki schema tarafından üretilen veriyi okuyabilir demek

FORWARD

En son iki schema ileriye doğru uyumludur. En son schema tarafından üretilen veriyi sondan bir önceki okuyabilir demek.

FORWARD_TRANSITIVE

Tüm schema'lar ileriye doğru uyumludur.

3. Full compatibility

FULL

En son 2 schema birbirlerinin verilerini okuyabilirler

FULL_TRANSITIVE

Herkes birbirlerinin verilerini okuyabilirler