Friday, November 29, 2024

Apache Paimon

Introduction
The explanation is as follows
Apache Paimon is a new data lakehouse format that focuses on solving the challenges of streaming scenarios, but also supports batch processing.
Merge Files
The explanation is as follows
Paimon is designed with a built-in merge mechanism, and many other optimizations for mass writes, making it more adaptable to streaming scenarios.
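Example
A minimal sketch of a streaming write into a Paimon table through Flink's Table API in Java, assuming the Paimon Flink connector is on the classpath; the catalog name, warehouse path, and table schema are illustrative assumptions, not from the original text.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PaimonStreamingWrite {
  public static void main(String[] args) {
    // Streaming-mode table environment
    TableEnvironment env = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

    // Register a Paimon catalog backed by a local warehouse directory (path is illustrative)
    env.executeSql(
        "CREATE CATALOG paimon WITH ('type' = 'paimon', 'warehouse' = 'file:///tmp/paimon')");
    env.executeSql("USE CATALOG paimon");

    // Streaming writes to a primary-key table are compacted by Paimon's built-in merge mechanism
    env.executeSql(
        "CREATE TABLE IF NOT EXISTS events (id INT, msg STRING, PRIMARY KEY (id) NOT ENFORCED)");
    env.executeSql("INSERT INTO events VALUES (1, 'hello'), (2, 'paimon')");
  }
}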


Thursday, November 21, 2024

Apache Iceberg

Introduction
The explanation is as follows
Iceberg is the most widely supported by various open-source engines, including pure query engines (e.g., Trino), New SQL databases (e.g., StarRocks, Doris), and streaming frameworks (e.g., Flink, Spark), all of which support Iceberg.
Iceberg Problems 
In streaming writes, the files need to be merged. The explanation is as follows
Iceberg faces several problems in streaming scenarios, the most serious one is the fragmentation of small files. Queries in data lakehouses rely heavily on file reads, and if a query has to scan many files at once, it will of course perform poorly.

To address this issue, an external orchestrator is required to regularly merge files. 
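Example
A minimal sketch of such a merge (compaction) job using Iceberg's Spark actions API in Java; the table name and target file size are illustrative assumptions. An external scheduler (e.g. Airflow) would run a job like this periodically.

import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.Spark3Util;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class CompactIcebergTable {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("iceberg-compaction").getOrCreate();

    // Load the Iceberg table (catalog and table names are illustrative)
    Table table = Spark3Util.loadIcebergTable(spark, "db.events");

    // Merge small files into larger ones (here: ~512 MB targets)
    RewriteDataFiles.Result result = SparkActions.get(spark)
        .rewriteDataFiles(table)
        .option("target-file-size-bytes", String.valueOf(512L * 1024 * 1024))
        .execute();

    System.out.println("Rewritten data files: " + result.rewrittenDataFilesCount());
  }
}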
What’s Coming for Iceberg in 2025?
The explanation is here
1. RBAC Catalog: Fixing Permissions at Scale
2. Change Data Capture (CDC): Iceberg’s Streaming Evolution
3. Materialized Views: Simplifying Derived Data

Wednesday, October 30, 2024

Apache Arrow - Columnar Memory Format

Introduction
The explanation is as follows
Apache Arrow is a cross-language development framework for in-memory data. It provides a standardized columnar memory format for efficient data sharing and fast analytics. Arrow employs a language-agnostic approach, designed to eliminate the need for data serialization and deserialization, improving the performance and interoperability between complex data processes and systems.
Apache Arrow vs Apache Parquet 
The explanation is as follows
The Apache Arrow format project began in February 2016, focusing on columnar in-memory analytics workload. Unlike file formats like Parquet or CSV, which specify how data is organized on disk, Arrow focuses on how data is organized in memory.
Maven
We include the following dependencies
<dependency>
  <groupId>org.apache.arrow</groupId>
  <artifactId>arrow-memory</artifactId>
  <version>6.0.1</version>
</dependency>

<dependency>
  <groupId>org.apache.arrow</groupId>
  <artifactId>arrow-vector</artifactId>
  <version>6.0.1</version>
</dependency>
Example
For writing, we do the following
import java.io.FileOutputStream;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.*;
import org.apache.arrow.vector.ipc.*;

// Set up the allocator and the vector
try (RootAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
  VarCharVector vector = new VarCharVector("vector", allocator)) {

  // Write data to the vector
  vector.setSafe(0, "Apache".getBytes());
  vector.setSafe(1, "Arrow".getBytes());
  vector.setSafe(2, "Java".getBytes());
  vector.setValueCount(3);

  // Wrap the vector in a VectorSchemaRoot and write it to a file
  try (VectorSchemaRoot root = VectorSchemaRoot.of(vector);
    FileOutputStream out = new FileOutputStream("arrow-data.arrow");
    ArrowFileWriter writer = new ArrowFileWriter(root, null, out.getChannel())) {
    writer.start();
    writer.writeBatch();
    writer.end();
  }
}
For reading, we do the following
import java.io.FileInputStream;

// Now, let's read the data we just wrote
try (RootAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
  ArrowFileReader reader = new ArrowFileReader(
      new FileInputStream("arrow-data.arrow").getChannel(), allocator)) {

  // Load the first record batch (this also reads the schema)
  reader.loadNextBatch();

  // Get the vector; it is owned by the reader's VectorSchemaRoot, so we do not close it here
  VarCharVector vector = (VarCharVector) reader.getVectorSchemaRoot().getVector("vector");

  // Iterate over the values in the vector
  for (int i = 0; i < vector.getValueCount(); i++) {
    System.out.println(new String(vector.get(i)));
  }
}

Tuesday, July 18, 2023

Apache CarbonData

Introduction
The explanation is as follows
Apache CarbonData is an indexed columnar data format that is developed specifically for big data scenarios where fast analytics and real-time insights are critical.
Deep Integration with Spark
The explanation is as follows
CarbonData has been deeply integrated with Apache Spark, providing Spark SQL’s query optimization techniques and using its Code Generation capabilities. This makes it possible to directly query CarbonData files using Spark SQL, hence giving faster and more efficient query results.
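Example
A minimal sketch of querying a CarbonData table directly with Spark SQL in Java, assuming a SparkSession configured with the CarbonData extensions; the table and column names are illustrative assumptions.

import org.apache.spark.sql.SparkSession;

public class CarbonDataQuery {
  public static void main(String[] args) {
    // Assumes spark.sql.extensions is set to org.apache.spark.sql.CarbonExtensions
    SparkSession spark = SparkSession.builder()
        .appName("carbondata-demo")
        .getOrCreate();

    // Create a CarbonData-backed table and query it directly with Spark SQL
    spark.sql("CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE) STORED AS carbondata");
    spark.sql("INSERT INTO sales VALUES (1, 9.99), (2, 19.99)");
    spark.sql("SELECT id, amount FROM sales WHERE amount > 10").show();
  }
}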
Multi-Layered Structure
The explanation is as follows
Apache CarbonData is structured in multiple layers, which includes the table, segment, block, and page levels. This hierarchical structure allows efficient data retrieval by skipping irrelevant data during the query execution.

Table: A table is a collection of segments, and each segment represents a set of data files.

Segment: A segment contains multiple data blocks, where each block can store a significant amount of data.

Block: A block is divided into blocklets. Each blocklet holds a series of column pages, which are organized column-wise.

Page: The page level is where the actual data is stored. The data in these pages is encoded and compressed, making data retrieval efficient.

Avro Compatibility

Introduction
The details of these explanations are here. A small compatibility check is sketched after the list below.

1. Backward Compatibility
This concerns the newer side. Backward compatibility means the latest version can read data produced with an earlier version.

BACKWARD
The latest two schemas are backward compatible.

BACKWARD_TRANSITIVE
All schemas are backward compatible.

2. Forward Compatibility
This concerns the older side. Forward compatibility means an older schema can read data produced by a newer schema.

FORWARD
The latest two schemas are forward compatible: data produced by the latest schema can be read by the one immediately before it.

FORWARD_TRANSITIVE
All schemas are forward compatible.

3. Full Compatibility
FULL
The latest two schemas can read each other's data.

FULL_TRANSITIVE
All schemas can read each other's data.
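Example
A minimal sketch of a backward-compatibility check using Avro's SchemaCompatibility helper; the record and field names are illustrative assumptions. The reader schema adds an age field with a default value, which is a backward-compatible change, so it can still read data written with the old schema.

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class CompatibilityCheck {
  public static void main(String[] args) {
    // Old (writer) schema
    Schema writer = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"}]}");

    // New (reader) schema: adds a defaulted field
    Schema reader = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\",\"default\":0}]}");

    // Check that the new schema can read data written with the old one
    SchemaCompatibility.SchemaPairCompatibility result =
        SchemaCompatibility.checkReaderWriterCompatibility(reader, writer);
    System.out.println(result.getType()); // COMPATIBLE
  }
}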

Sunday, June 11, 2023

Using CSV

Maven
We do the following
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-csv</artifactId>
    <version>1.8</version>
</dependency>
Gradle
We do the following
implementation "org.apache.commons:commons-csv:1.8"
With CSVPrinter, the output is written to a file or to memory, as in the sketch below.
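Example
A minimal sketch of writing a CSV file with CSVPrinter; the file name and records are illustrative assumptions.

import java.io.FileWriter;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;

public class CsvWriteExample {
  public static void main(String[] args) throws Exception {
    // Write two records under an "id,name" header
    try (FileWriter out = new FileWriter("users.csv");
      CSVPrinter printer = new CSVPrinter(out, CSVFormat.DEFAULT.withHeader("id", "name"))) {
      printer.printRecord(1, "Apache");
      printer.printRecord(2, "Commons");
    }
  }
}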

Using HttpComponents HttpClient

Introduction
1. A CloseableHttpClient object is created. The HttpClients factory class is used to create it. CloseableHttpClient is abstract and implements the HttpClient interface.
2. It sends request objects such as HttpGet and HttpPost.

Usage
The explanation is as follows
- First, create an instance of the HttpClient class. (The HttpClient uses a HttpUriRequest to send and receive data.)
- Create an instance of an HttpRequestBase subclass (e.g. HttpGet, HttpPost, HttpTrace, HttpDelete, etc.) and set the necessary headers and parameters.
- Execute the request using HttpClient's execute() method; it returns an instance of the HttpResponse class.
- Finally, extract the response content using HttpResponse's getEntity() method and process it as necessary.
Maven
We include the following dependency
<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>4.5.10</version>
</dependency>
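Example
A minimal sketch of a GET request following the steps above; the URL and header values are illustrative assumptions.

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpGetExample {
  public static void main(String[] args) throws Exception {
    // 1. Create the client via the HttpClients factory
    try (CloseableHttpClient client = HttpClients.createDefault()) {

      // 2. Build the request and set headers (URL is illustrative)
      HttpGet request = new HttpGet("https://example.com");
      request.setHeader("Accept", "text/html");

      // 3. Execute the request
      try (CloseableHttpResponse response = client.execute(request)) {

        // 4. Extract and process the response content
        HttpEntity entity = response.getEntity();
        System.out.println(response.getStatusLine().getStatusCode());
        System.out.println(EntityUtils.toString(entity));
      }
    }
  }
}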