Apache Iceberg Explained Simply

Apache Iceberg Explained Simply

Something has been quietly shifting in how serious data teams think about storing and querying large datasets. For years the conversation was about which query engine to use, Spark, Presto, Trino, Hive, and teams would pick one and build their infrastructure around it. The data lived in files on object storage and the query engine was the thing that made sense of it.

The problem with this arrangement revealed itself at scale. Different teams wanted to use different engines on the same data. Schema changes broke things in ways that were hard to predict. Reading data while someone else was writing it produced inconsistent results. Rolling back a bad transformation required restoring from a backup. Understanding what a dataset looked like last Tuesday required hoping someone had kept a copy.

Apache Iceberg is the answer the data industry converged on for most of these problems. It is not a database, not a query engine, and not a storage system. It is a table format, a specification for how to organize and track data files on object storage in a way that gives you database-like guarantees and capabilities without locking you into any particular database or query engine.

This guide explains what Iceberg is, how it works under the hood, and why it has become one of the most important pieces of the modern data lakehouse architecture, all in plain language.

The Problem Iceberg Was Built to Solve

To understand why Iceberg matters, you need to understand what life looked like before it.

The dominant way to store large datasets before Iceberg was to put files, usually Parquet or ORC files, in folders on object storage like Amazon S3 or Google Cloud Storage. A table called orders might be stored as thousands of Parquet files organized into folders by date. Hive Metastore kept track of which folders belonged to which table and query engines like Spark or Hive would scan those folders when you ran a query.

This worked reasonably well until you needed to do anything sophisticated with the data. Changing a column’s data type was risky because old files used the old type and new files used the new type and not all engines handled the mismatch gracefully. Adding a new column was manageable but dropping one was dangerous. Querying the data while a write job was halfway through could return a mix of old and new records. Deleting specific records for GDPR compliance meant rewriting entire Parquet files. Understanding the historical state of the data required external bookkeeping that most teams did not do rigorously.

These were not minor inconveniences. They were fundamental limitations that made Hive-style data lakes unreliable for use cases that needed consistency, flexibility, and auditability.

Iceberg was created at Netflix, where engineers were running into exactly these problems at massive scale across petabytes of data and hundreds of pipelines. The solution they designed and eventually open-sourced addressed the root cause rather than patching individual symptoms.

What a Table Format Actually Is

The term table format is easy to gloss over so it is worth being precise about what it means.

A table format is a specification that defines how a collection of data files on disk should be organized, tracked, and interpreted as a logical table. It answers questions like which files belong to this table right now, what schema do those files conform to, how are the files partitioned, what statistics are available about the data in each file, and what changes have been made to the table over time.

Without a table format, a collection of Parquet files on S3 is just files. With a table format like Iceberg, that same collection of files becomes a table with a schema, a history, transactional guarantees, and the ability to be queried by any engine that understands the Iceberg specification.

The critical insight is that the table format is separate from both the storage layer and the query engine. The data files live on S3, GCS, or Azure Data Lake Storage. The query engine is Spark, Trino, Flink, DuckDB, or any other engine with Iceberg support. Iceberg sits in between, providing the metadata layer that connects them and enforcing the guarantees that make the table behave reliably.

How Iceberg Works Under the Hood

Iceberg’s architecture has three layers of metadata that sit on top of the actual data files. Understanding these layers demystifies most of what Iceberg does.

The data files layer

At the bottom are the actual data files, typically Parquet, ORC, or Avro files stored on object storage. These files contain the rows of your table. Nothing about this layer is unique to Iceberg. The same Parquet files could in principle be read by any engine. What makes them an Iceberg table is the metadata structure that sits above them.

The manifest files layer

Above the data files are manifest files. A manifest file is a metadata file that lists a set of data files along with statistics about each one. For every data file it tracks, a manifest records the file’s location, its format, the number of rows it contains, and the minimum and maximum value for each column.

These column-level statistics are what enable Iceberg to skip files during query execution. If you run a query filtering for orders from January 2024 and a manifest shows that a particular data file only contains orders from March 2023, Iceberg knows it does not need to read that file at all. This file pruning happens before any data is read and dramatically reduces the amount of I/O required for selective queries on large tables.

The manifest list layer

Above the manifest files is the manifest list, sometimes called the snapshot file. A manifest list tracks all the manifests that belong to a particular snapshot of the table. Every time a write operation completes on an Iceberg table, a new snapshot is created with a new manifest list that reflects the current state of the table.

The catalog layer

At the top is the catalog, which tracks the current snapshot for each table. When a query engine opens an Iceberg table, it asks the catalog which snapshot is current, reads the manifest list for that snapshot, reads the relevant manifests, and then reads only the data files those manifests point to that pass the file pruning step.

This layered metadata structure is what enables almost everything interesting about Iceberg.

Time Travel and Snapshot Isolation

Because every write to an Iceberg table creates a new snapshot rather than overwriting the previous one, Iceberg maintains a complete history of every state the table has ever been in. This enables time travel queries.

You can query an Iceberg table as of a specific snapshot or a specific timestamp:

sql

-- Query the table as it existed at a specific point in time
SELECT * FROM orders
FOR SYSTEM_TIME AS OF '2024-01-15 09:00:00';

-- Query a specific snapshot by ID
SELECT * FROM orders
FOR VERSION AS OF 4729384756;

This is not a backup and restore operation. It is just pointing the query at an older manifest list. The historical data files are still there and the older manifest list still references them. Reading historical state is as fast as reading current state.

Time travel is valuable in several concrete scenarios. When a transformation job produces incorrect results and overwrites a table, you can query the previous state to understand what changed and recover the correct data. When a regulatory audit requires proving what data looked like at a specific point in time, you can query it directly. When a data quality issue is discovered and you need to understand when it was introduced, you can compare snapshots to isolate the change.

Old snapshots are eventually cleaned up by a process called snapshot expiration, which removes snapshot metadata and the data files that are no longer referenced by any current snapshot. Expiration is configurable and gives you control over how much history to retain.

Schema Evolution Without Breaking Things

Schema evolution is the ability to change a table’s schema over time without breaking existing queries or requiring existing data files to be rewritten. Iceberg handles schema evolution correctly in ways that Hive-style tables historically did not.

The key to Iceberg’s schema evolution is that columns are tracked by a unique integer ID rather than by name. When a query engine reads an Iceberg table, it maps column IDs in the metadata to column names in the schema. This indirection means that renaming a column is a metadata operation that updates the mapping rather than a data operation that rewrites files.

The following schema changes are fully supported in Iceberg without rewriting any data files:

Adding a new column appends it to the schema. Old files that do not contain the new column return null for that column when queried. New files written after the column addition contain the column.

Renaming a column updates the name in the schema metadata. The underlying column ID stays the same so both old and new files are read correctly under the new name.

Dropping a column removes it from the schema. The data in old files still physically contains the column but it is simply not read. No data rewriting required.

Widening a column type, changing an integer to a long for example, is supported for compatible type promotions. Old files with the narrower type are read and promoted at query time.

Reordering columns changes their position in the schema without any file modification.

For data teams that manage tables with hundreds of columns that evolve over months and years, this capability eliminates an entire class of painful migrations.

Partitioning Done Right

Partitioning is how large tables are divided into smaller groups of files to make queries faster. In a Hive-style table, partitioning is part of the directory structure. A table partitioned by date stores files in folders named date=2024-01-01, date=2024-01-02, and so on. The partition column values are encoded in the folder paths.

This works but it has a significant problem. The partition layout is baked into the physical directory structure and changing it requires rewriting all the data. If you originally partitioned by day and the table has grown large enough that you need to partition by month, you are looking at a full rewrite of potentially petabytes of data. And if someone runs a query without specifying the partition column, the engine has no way to prune files and has to scan everything.

Iceberg introduces hidden partitioning, which separates the logical partition structure from the physical file layout. You define partition transforms in the table metadata and Iceberg applies them automatically without requiring partition columns to appear in queries or in file paths.

Iceberg supports several partition transforms. identity partitions on the raw value of a column. year, month, day, and hour extract the corresponding time component from a timestamp column. bucket hashes a column value into a configurable number of buckets. truncate truncates a string or integer to a specified length or value.

sql

CREATE TABLE orders (
    order_id BIGINT,
    customer_id BIGINT,
    order_date TIMESTAMP,
    amount DECIMAL(10, 2)
)
USING iceberg
PARTITIONED BY (month(order_date));

Queries that filter on order_date automatically benefit from partition pruning even without the user specifying the partition column explicitly. Iceberg reads the partition metadata and skips any file that cannot contain matching records.

Even more powerfully, Iceberg supports partition evolution. You can change the partition scheme of a table without rewriting existing data. Old files retain their original partition layout and new files use the new layout. The query engine reads both correctly. This makes it possible to progressively change how data is organized as a table grows and query patterns evolve.

ACID Transactions and Concurrent Writes

One of the most practically important features of Iceberg is its support for serializable isolation, which prevents the data consistency problems that plague Hive-style tables when multiple jobs read and write simultaneously.

In a Hive-style table, if two jobs try to write to the same partition at the same time, you can end up with a corrupted or inconsistent state because there is no coordination between them. One job might read a partial write from the other and produce incorrect results.

Iceberg uses optimistic concurrency control. When a writer commits a transaction, it checks whether any conflicting operations have happened since it started. If a conflict is detected, the commit fails and the writer can retry. This is the same approach used by many relational databases and it ensures that concurrent writes produce correct results.

The practical consequence is that you can run multiple Spark jobs writing to the same Iceberg table simultaneously without coordinating them externally. You can read from an Iceberg table while a write job is in progress and you will always see either the complete old state or the complete new state, never a mix of the two.

This is what makes Iceberg suitable for use cases that Hive-style lakes could not reliably serve, like streaming writes from Flink alongside batch reads from Trino, or multiple teams writing to shared tables without a central write coordinator.

Row-Level Deletes and Updates

Deleting or updating specific rows in a Hive-style Parquet table requires reading the affected Parquet files, removing or modifying the relevant rows, and writing new files to replace the old ones. For tables with billions of rows spread across thousands of files, this is expensive even when only a handful of rows need to change.

Iceberg supports two approaches to row-level operations that avoid the need to rewrite entire files immediately.

Copy-on-write rewrites affected data files when a delete or update is committed. This is the simpler approach and produces clean files but has high write amplification when changes affect many files.

Merge-on-read writes small delete files that record which rows in which data files should be considered deleted. At query time, the engine reads both the data files and the delete files and applies the deletions in memory. This makes writes fast but adds overhead to reads until the table is compacted.

Both approaches make GDPR-compliant data deletion practical on data lake tables, something that was technically possible but operationally painful with Hive-style storage.

Iceberg vs Delta Lake vs Apache Hudi

Iceberg is not the only open table format. Delta Lake, created by Databricks, and Apache Hudi, created at Uber, address the same fundamental problems. Understanding how they differ helps when choosing between them.

Delta Lake is the most tightly integrated with Spark and the Databricks ecosystem. It has excellent Spark performance, strong community support, and is the natural choice for teams already invested in Databricks. Its main limitation historically was weaker support for non-Spark engines, though this has improved significantly.

Apache Hudi was optimized from the start for streaming use cases and incremental processing. It has strong support for upsert-heavy workloads and was designed specifically for the challenge of keeping large tables up to date from streaming sources. Its ecosystem support is narrower than Iceberg or Delta Lake.

Apache Iceberg has the broadest engine support of the three and the strongest commitment to being a truly open, engine-agnostic standard. Spark, Trino, Flink, Presto, DuckDB, Hive, Impala, and virtually every major query engine has Iceberg support. The specification is managed by the Apache Software Foundation with contributions from a wide range of companies including Netflix, Apple, LinkedIn, and Snowflake. For teams that want to avoid vendor lock-in and use multiple engines on the same data, Iceberg is generally the most flexible choice.

Why Data Teams Are Adopting Iceberg

The practical reasons data teams adopt Iceberg cluster around a few consistent themes.

Multi-engine flexibility is the most commonly cited reason. Teams that want to use Spark for heavy transformation, Trino for interactive queries, Flink for streaming, and DuckDB for local development can point all of them at the same Iceberg tables without any duplication or synchronization overhead.

Avoiding vendor lock-in matters increasingly as the data tool landscape has consolidated. Iceberg’s status as an open Apache project with broad industry support means that adopting it does not tie you to any particular cloud provider or commercial vendor.

Reliable data lake operations, time travel for recovery, schema evolution without migrations, ACID transactions for concurrent writes, and row-level deletes for compliance, eliminate entire categories of operational problems that Hive-style lakes require custom solutions for.

The lakehouse architecture, using object storage with a table format to get database-like capabilities at data lake scale and cost, has become the dominant pattern for large-scale analytical data storage and Iceberg is the table format most commonly chosen for new lakehouse implementations.

Apache Iceberg Cheat Sheet

FeatureWhat It DoesWhy It Matters
Snapshot isolationEvery write creates a new snapshotConsistent reads during writes
Time travelQuery any historical snapshotRecovery, auditing, debugging
Schema evolutionAdd, rename, drop columns safelyNo data rewrites on schema changes
Hidden partitioningPartition transforms in metadataNo partition columns in queries
Partition evolutionChange partitioning without rewriteAdapt layout as data grows
Row-level deletesDelete specific rows efficientlyGDPR compliance, corrections
File pruningSkip files using column statisticsFaster queries on large tables
Multi-engine supportWorks with Spark, Trino, Flink, etcNo engine lock-in
ACID transactionsOptimistic concurrency controlSafe concurrent writes
Open specificationApache-governed, vendor neutralAvoid proprietary lock-in

Common Misconceptions

Iceberg is not a database. It does not have a server process, it does not accept connections, and it does not execute queries. It is a specification for organizing metadata about data files. The query execution is handled by whichever engine you point at your Iceberg tables.

Iceberg is not a replacement for your query engine. You still need Spark, Trino, Flink, or another engine to actually read and write Iceberg tables. Iceberg defines what the tables look like and how they behave. The engine is what does the work.

Iceberg does not store data differently from Hive-style tables at the file level. The data is still Parquet, ORC, or Avro files on S3. What Iceberg adds is the metadata layer above those files that enables transactions, time travel, schema evolution, and reliable partitioning.

Migrating to Iceberg does not require rewriting your data. Most major engines support in-place migration of existing Hive-style Parquet tables to Iceberg format by creating Iceberg metadata that points to the existing files.

FAQs

What is Apache Iceberg in simple terms?

Apache Iceberg is a table format for large datasets stored on object storage like Amazon S3. It adds a metadata layer above data files that gives them database-like capabilities including ACID transactions, time travel to query historical states, schema evolution without rewriting data, and reliable performance at scale. It works with any query engine that supports the Iceberg specification including Spark, Trino, Flink, and DuckDB.

What is the difference between Apache Iceberg and Delta Lake?

Both are open table formats that solve similar problems around reliability, schema evolution, and time travel on data lake storage. Delta Lake is more tightly integrated with the Spark and Databricks ecosystem and is the natural choice for Databricks-heavy teams. Apache Iceberg has broader support across more query engines and is governed as an open Apache project with contributions from a wider range of companies. Teams prioritizing multi-engine flexibility and vendor neutrality tend to prefer Iceberg.

What is time travel in Apache Iceberg?

Time travel is the ability to query an Iceberg table as it existed at a specific point in the past. Because every write creates a new snapshot rather than overwriting old data, Iceberg maintains a complete history of the table’s state. You can query any historical snapshot using a timestamp or snapshot ID. This is useful for recovering from bad transformations, auditing data for compliance, and debugging data quality issues by comparing historical states.

Does Apache Iceberg require Spark?

No. Iceberg was designed to be engine-agnostic. Spark, Trino, Flink, Presto, DuckDB, Hive, and many other engines all support reading and writing Iceberg tables. One of Iceberg’s primary design goals is that different engines can work on the same tables simultaneously without conflicts or data duplication.

When should a data team adopt Apache Iceberg?

Iceberg is worth adopting when your data team is managing large datasets on object storage and running into limitations around schema evolution, concurrent writes, historical data access, or the need to use multiple query engines on the same data. For teams just starting out with small datasets, the operational overhead of managing Iceberg metadata may not be justified yet. For teams operating at scale with multiple pipelines touching shared datasets, Iceberg’s guarantees become increasingly valuable.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top