Introduction
Apache Iceberg defines a table format that separates how data is stored from how data is queried. Any engine that implements the Iceberg integration — Spark, Flink, Trino, DuckDB, Snowflake, RisingWave — can read and/or write Iceberg data directly.
The Architecture Is Changing
This changes the architecture. You don’t need to move data between systems anymore. You don’t need to reprocess or convert formats. You can process data using one engine and query it using another.
Dual Format
The Future is Dual-Format
I think the long-term architecture for most databases is going to be dual-format:
1. A proprietary format, optimized for internal performance — low-latency access, in-memory workloads, transaction processing, etc.
2. An open format, like Iceberg, for interoperability — long-term storage, external access, and sharing across systems.
Iceberg vs Parquet
Iceberg is a table format, while Parquet is a file format. Iceberg tables are built on Parquet files. The two operate at different levels of abstraction.
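To make the difference in abstraction level concrete, here is a deliberately simplified sketch of what a table format adds on top of individual files. The `Snapshot` and `Table` classes are hypothetical and are not Iceberg's API. Real Iceberg metadata has more layers (metadata files, manifest lists, manifests), which this toy model flattens into a list of snapshots that each point at a set of Parquet file paths.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One immutable version of the table (hypothetical, simplified)."""
    snapshot_id: int
    data_files: tuple  # paths of the Parquet files visible in this version


@dataclass
class Table:
    """Toy table format: an ordered log of snapshots over plain data files."""
    snapshots: list = field(default_factory=list)

    def current_files(self):
        # The latest snapshot defines what a reader sees right now.
        return self.snapshots[-1].data_files if self.snapshots else ()

    def append(self, *paths):
        # A commit never edits old snapshots; it adds a new one.
        new = Snapshot(len(self.snapshots) + 1, self.current_files() + tuple(paths))
        self.snapshots.append(new)
        return new

    def files_as_of(self, snapshot_id):
        # Older versions stay readable because their file lists are preserved.
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.data_files
        raise KeyError(snapshot_id)


table = Table()
table.append("data/a.parquet")
table.append("data/b.parquet", "data/c.parquet")
```

In this sketch the Parquet files only hold rows. Which files belong to the table, and in which version, is decided entirely by the metadata layer. That is the part a table format is responsible for.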
Who Uses Iceberg
Iceberg is the table format most widely supported by open-source engines, including pure query engines (e.g., Trino), NewSQL databases (e.g., StarRocks, Doris), and streaming frameworks (e.g., Flink, Spark).
Iceberg Problems
Streaming writes require merging (compacting) files. The explanation is as follows:
Iceberg faces several problems in streaming scenarios; the most serious is fragmentation into small files. Queries in data lakehouses rely heavily on file reads, so a query that has to scan many files at once will perform poorly.
To address this issue, an external orchestrator is required to regularly merge files.
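The merge step can be pictured as a bin-packing problem: group small files so that each group is rewritten as one file close to a target size. The following is a minimal sketch of such a planner, assuming a first-fit-decreasing strategy. The `DataFile` class, the `plan_compaction` function, and the size thresholds are all hypothetical and are not taken from Iceberg; a real compaction job also has to rewrite the data and commit a new snapshot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    size_bytes: int


def plan_compaction(files, target_bytes, small_threshold_bytes):
    """Group small files into bins whose total size stays within target_bytes."""
    # Only files below the threshold are candidates; largest first (first-fit decreasing).
    small = sorted(
        (f for f in files if f.size_bytes < small_threshold_bytes),
        key=lambda f: f.size_bytes,
        reverse=True,
    )
    bins = []  # each bin is [total_bytes, [files]]
    for f in small:
        for b in bins:
            if b[0] + f.size_bytes <= target_bytes:
                b[0] += f.size_bytes
                b[1].append(f)
                break
        else:
            # No existing bin has room, so start a new one.
            bins.append([f.size_bytes, [f]])
    # Rewriting a group of one file gains nothing, so skip those.
    return [group for _, group in bins if len(group) > 1]


MB = 1024 * 1024
# Assumed example: ten 8 MB files from a streaming writer plus one large file.
files = [DataFile(f"data/part-{i}.parquet", 8 * MB) for i in range(10)]
files.append(DataFile("data/part-big.parquet", 512 * MB))

plan = plan_compaction(files, target_bytes=64 * MB, small_threshold_bytes=32 * MB)
files_before = len(files)
files_after = files_before - sum(len(g) for g in plan) + len(plan)
```

With these assumed sizes, the ten small files collapse into two rewritten files and the large file is left alone, so a query scans 3 files instead of 11. An orchestrator would run a plan like this on a schedule, which is the "regularly merge files" step described above.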
What’s Coming for Iceberg in 2025?
1. RBAC Catalog: Fixing Permissions at Scale
2. Change Data Capture (CDC): Iceberg’s Streaming Evolution
3. Materialized Views: Simplifying Derived Data
AWS S3
Without Iceberg, trying to find specific information in your raw data files on S3 can be like searching for a needle in a haystack. Tools like AWS Athena can query files, but managing the structure of your data (schema) and controlling who has access (access control) requires manual setup. Iceberg transforms your S3 buckets into well-structured, queryable datasets with proper access controls, making them compatible with any modern query engine. By layering Iceberg on top of S3, businesses gain a cohesive way to organize and make sense of sprawling data lakes, which would otherwise remain chaotic and unmanageable.