29 Kasım 2024 Cuma

Apache Paimon

Giriş
Açıklaması şöyle
Apache Paimon is a new data lakehouse format that focuses on solving the challenges of streaming scenarios, but also supports batch processing.
Merge files
Açıklaması şöyle
Paimon is designed with a built-in merge mechanism, and many other optimizations for mass writes, making it more adaptable to streaming scenarios.


21 Kasım 2024 Perşembe

Apache Iceberg

Giriş
Açıklaması şöyle
Iceberg is the most widely supported by various open-source engines, including pure query engines (e.g., Trino), New SQL databases (e.g., StarRocks, Doris), and streaming frameworks (e.g., Flink, Spark), all of which support Iceberg.
Iceberg Problems 
Stream write işleminde dosyaları birleştirmek gerekiyor. Açıklaması şöyle
Iceberg faces several problems in streaming scenarios, the most serious one is the fragmentation of small files. Queries in data lakehouses rely heavily on file reads, and if a query has to scan many files at once, it will of course perform poorly.

To address this issue, an external orchestrator is required to regularly merge files. 
What’s Coming for Iceberg in 2025?
Açıklama burada
1. RBAC Catalog: Fixing Permissions at Scale
2. Change Data Capture (CDC): Iceberg’s Streaming Evolution
3. Materialized Views: Simplifying Derived Data