Strong Consistency
Strong Consistency
Strong Consistency is the most demanding correctness guarantee a distributed storage system can provide. It asserts that after any write operation completes, all subsequent read operations — from any node, in any region, at any point in time — will return the value written by that most recent write. The data behaves as if it exists on a single, perfect, infinitely reliable disk in a single location, even when it is physically stored across hundreds of servers distributed across multiple continents.
This guarantee sounds simple and intuitive. It is precisely the behavior that engineers schooled on single-node relational databases take for granted. When you execute an UPDATE statement against a PostgreSQL database and the transaction commits, the next SELECT will return the new value. Always. Without qualification. In a distributed system at cloud scale, achieving this property requires solving problems of extraordinary complexity, and the cost of that solution is real, measurable, and must be paid in every architecture decision.
Understanding Strong Consistency — what it means formally, how it is implemented, what it costs, and where it is the correct choice in data lakehouse architectures — is essential for any engineer designing data systems that multiple teams and engines will trust with critical business decisions.
Formal Definitions: A Hierarchy of Consistency Models
Distributed systems research has produced a rich taxonomy of consistency models, ranging from the weakest (eventual consistency) to the strongest (linearizability). Understanding where Strong Consistency sits within this hierarchy requires examining the two most important points on the spectrum.
Linearizability: The Strongest Practical Guarantee
Linearizability, defined by Herlihy and Wing in their 1990 paper “Linearizability: A Correctness Condition for Concurrent Objects,” is the formal definition of what most engineers mean when they say “strong consistency.” It provides two guarantees:
1. Global Total Order: All operations on a shared data object appear to execute in a single, total sequential order that is consistent across all clients and all nodes. If Client A sees that operation X happened before operation Y, then every other client in the entire distributed system also sees X before Y. There are no divergent views.
2. Real-Time Ordering: The global order respects real-world wall-clock time. If operation X completes (returns a response) before operation Y is invoked (starts), then X must appear before Y in the global order. This is the critical property that distinguishes linearizability from weaker models: it enforces that historical order is preserved, not just logical order within a single thread.
The combination of these two properties creates the fundamental illusion of a single-copy system. From the perspective of any client, the distributed system behaves as if there is exactly one copy of the data, executing one operation at a time, in real-time clock order.
Sequential Consistency: A Slightly Weaker Guarantee
Sequential Consistency, defined by Lamport in 1979, relaxes linearizability’s real-time ordering requirement. It requires that all clients see operations in the same global total order, and that this order is consistent with each individual thread’s program order (the order in which that thread issued its operations). However, it does not require that the global order reflect the real-world wall-clock time at which operations were issued from different threads.
The practical implication: two concurrent writes from different clients might appear in the global order in either order — whichever the system chooses — as long as all clients see the same chosen order. Linearizability constrains this choice by also requiring that the real-time relationship between operations is preserved.
Sequential consistency is sufficient for most multi-threaded programming correctness guarantees (most CPU memory models are sequentially consistent or stronger). For distributed systems, linearizability is the stronger, more expensive, and more commonly required model for safety-critical coordination.
Causal Consistency and Read-Your-Writes
Below sequential consistency, weaker models like Causal Consistency and Read-Your-Writes provide more limited guarantees that are sufficient for many use cases while requiring less coordination:
Read-Your-Writes (RYW): A client is guaranteed to see its own previous writes when it subsequently reads from the same key. Other clients may still see stale values. RYW is one of the minimum consistency guarantees needed for user-facing applications where a user should see the result of their own actions immediately.
Causal Consistency: Operations that are causally related (A happened because of B) are guaranteed to appear in the causal order on all clients. Concurrent, causally unrelated operations may appear in any order.
These weaker models are relevant to lakehouse architecture because different components in the system may need different consistency levels — the catalog needs stronger consistency than the analytical query result cache, for example.
The Cost of Strong Consistency
Linearizability is not free. The CAP Theorem (discussed in the Eventual Consistency article) tells us that during network partitions, a strongly consistent system must sacrifice availability. But even without partitions, strong consistency imposes measurable costs in normal operation.
Coordination Overhead
To ensure all nodes agree on the same global total order of operations, they must communicate — every write must be acknowledged by a quorum of nodes before it is considered committed, and every read may need to verify that it is seeing the most recent value by consulting a quorum. This coordination adds latency to every operation, proportional to the network round-trip time between the coordinating nodes.
In a single data center, this round-trip might be a few milliseconds. Across availability zones in the same cloud region, it may be 5–10ms. Across geographic regions, it may be 50–200ms. For a system committing thousands of transactions per second, this coordination overhead translates directly into lower throughput and higher latency per transaction.
Consensus Protocols: Paxos and Raft
The theoretical foundation for implementing strong consistency in distributed systems is consensus protocols — algorithms by which a group of processes that may fail or lose messages can agree on a single value.
Paxos, introduced by Leslie Lamport in 1989, was the first practically implementable consensus algorithm. It is notoriously difficult to understand and implement correctly, but it underlies some of the most widely deployed strongly consistent systems, including Apache Zookeeper (which is used to coordinate distributed systems like Kafka and HBase).
Raft, introduced in 2014 by Ongaro and Ousterhout as an explicit attempt to make consensus more understandable, is now more widely adopted in new systems due to its cleaner conceptual model. Raft elects a single leader node that handles all writes. The leader replicates writes to a quorum of follower nodes before acknowledging the write as committed. etcd (the backing store for Kubernetes) is the most prominent Raft-based system and is also used as the backing store for some Iceberg REST Catalog implementations.
Throughput Ceilings
Because every write must pass through a consensus round, the throughput of a strongly consistent system is bounded by the throughput of its consensus protocol — specifically, by the time it takes to complete a consensus round multiplied by the maximum number of concurrent consensus rounds the system can manage. Under high concurrency, consensus protocols serialize competing writes, creating a bottleneck that does not exist in eventually consistent systems where each node can accept writes independently.
For data lakehouse commit operations, which are typically low-frequency (perhaps dozens to hundreds per second across all concurrent ETL jobs), this throughput ceiling is rarely the binding constraint. The bottleneck is more often in the data file writing path (writing Parquet files to S3) than in the metadata commit path.
Strong Consistency in Object Storage
Amazon S3’s consistency model has evolved significantly. Before December 2020, S3 provided only eventual consistency for overwrite PUTs and DELETEs (though new PUTs were strongly consistent). This made building strongly consistent systems on S3 significantly more difficult — a write acknowledged by S3 might not be immediately visible to all readers.
Since December 2020, Amazon S3 provides strong read-after-write consistency for all operations — PUTs, DELETEs, and overwrites. This means:
- After a successful PUT, any subsequent GET or LIST for that key will return the new object.
- After a successful DELETE, any subsequent GET for that key will return a 404.
- LIST operations will include all objects that have been successfully PUT.
This change was enormously significant for the Open Table Format ecosystem. It means that the foundational file-creation operations that Iceberg and Delta use for their atomic commit protocols are now strongly consistent on S3 by default, without requiring additional DynamoDB-based locking workarounds that pre-2020 architectures required.
Google Cloud Storage and Azure Blob Storage have provided strong consistency for all operations since their inception, making them more straightforward platforms for lakehouse implementations from the start.
Strong Consistency in Data Lakehouse Catalog Implementations
The primary location where Strong Consistency is required and actively implemented in the data lakehouse ecosystem is the Catalog — the service responsible for tracking the current metadata state of every table and performing the atomic commit operations that advance the table from one version to the next.
The Atomic Compare-and-Swap as a Linearizable Operation
The core commit operation in Apache Iceberg’s REST Catalog protocol is a compare-and-swap: a writer presents the catalog with its new metadata file path and asserts that the current metadata file path is X. The catalog atomically checks this assertion and, if it holds, updates the pointer to the new path. If the pointer has already advanced to a different path, the catalog rejects the commit.
This compare-and-swap is, by definition, a linearizable operation: it appears to execute at a single instant, and its outcome (success or failure) reflects the true current state of the pointer at that instant. The catalog service that implements this operation must itself be strongly consistent.
Catalog services implement this strong consistency requirement through several mechanisms:
Relational database-backed catalogs: A catalog that stores its metadata pointer in a PostgreSQL or MySQL database with SELECT ... FOR UPDATE locking before the compare-and-swap inherits the strong consistency of the relational database’s serializable isolation mode. Every catalog commit is a serializable transaction against the backing database.
DynamoDB-backed catalogs: AWS DynamoDB’s ConditionalWrite operation provides strongly consistent, linearizable compare-and-swap semantics. An Iceberg catalog implementation can store the current metadata pointer in a DynamoDB item and use a conditional write (if current_pointer == expected_pointer, set to new_pointer) to perform the atomic catalog commit with linearizable consistency.
Zookeeper-backed catalogs: Apache Zookeeper’s setData operation with a version check provides strongly consistent compare-and-swap semantics. Hive Metastore-based Iceberg catalogs often use Zookeeper-mediated locks to achieve linearizable commit ordering.
etcd-backed catalogs: etcd provides Raft-based strong consistency. A Polaris or custom REST Catalog implementation backed by etcd gets linearizable compare-and-swap for catalog pointers as a first-class feature of the etcd data model.
Nessie’s Git-Like Consistency Model
Project Nessie takes a different and innovative approach to catalog consistency. Rather than providing a single mutable pointer that must be atomically updated, Nessie models the catalog as a Git-like version-controlled repository where every commit creates an immutable, content-addressable object.
Nessie’s consistency model is strongly consistent at the branch pointer level — the pointer from a named branch (e.g., main) to its current commit hash is updated atomically using a compare-and-swap on the branch’s current commit hash. This is linearizable.
However, Nessie’s multi-branch design introduces a form of intentional isolation between branches. The main branch has strong consistency relative to itself, and the experimental branch has strong consistency relative to itself, but transactions on main and experimental are completely isolated from each other — reads on experimental do not see uncommitted writes on main. This is not a consistency weakness; it is a deliberate design choice that enables the Git-like workflow (isolated development, merge before publish) that gives Nessie its distinctive value proposition.
S3 Tables and Object Storage Native Consistency
AWS S3 Tables (announced in 2024) provides a managed Apache Iceberg table service where the atomic commit protocol is implemented directly within the S3 service, using S3’s own strong consistency guarantees. Because the catalog commit and the data file writes both happen within the same strongly consistent object storage system, the need for an external catalog service to coordinate commit operations is eliminated for this specific deployment model.
Practical Implications for Lakehouse Architecture Design
When Strong Consistency Is Non-Negotiable
Strong consistency in the catalog is non-negotiable for any multi-writer scenario. If two Spark clusters can simultaneously attempt to commit to the same Iceberg table, the catalog’s compare-and-swap must be linearizable — otherwise, both commits might succeed, pointing the metadata to two different new states, and the table’s metadata will be in an inconsistent state where one of the commits’ data files are permanently orphaned.
This is not a theoretical concern. In production deployments, multiple ETL jobs targeting the same table is extremely common — a micro-batch streaming job appending new records and a daily batch correction job updating historical records may both be targeting the same table simultaneously. The catalog’s linearizable compare-and-swap is the sole safety mechanism preventing data corruption in this scenario.
Performance Optimization Under Strong Consistency
Because catalog commits are low-frequency relative to data file writes (a job that writes 1,000 Parquet files still only performs a single catalog commit), the latency cost of strong consistency at the catalog level is typically acceptable in production. The overwhelming majority of the pipeline’s time is spent writing data files, not committing metadata.
Where strong consistency latency becomes a concern is in extremely high-frequency streaming pipelines that commit every few seconds. A catalog backed by a geographically distributed consensus system (e.g., a Raft cluster spread across three availability zones) will add 5–10ms of consensus latency to each commit. For a pipeline committing every 5 seconds, this is 0.2% overhead — completely acceptable. For a pipeline attempting to commit hundreds of times per second, the consensus overhead becomes the binding throughput constraint and requires architectural solutions such as commit batching.
Read Consistency for Analytical Queries
An important nuance: while catalog commits must be strongly consistent (linearizable), individual read operations by analytical query engines do not necessarily need to read the latest committed state. An analyst running a business intelligence query doesn’t need their query to automatically see a commit that happened 50 milliseconds ago during their query execution. Snapshot Isolation (discussed in the ACID Transactions article) provides all the read consistency guarantees that analytical workloads require.
This means the strong consistency requirement is specifically: catalog writes (commits) must be strongly consistent; catalog reads (query planning) need only be consistent with the snapshot the query started from. This asymmetry — strong consistency for writes, snapshot consistency for reads — is the specific consistency model that modern lakehouse architectures are optimized for, and it is the reason they can deliver both correctness and analytical performance simultaneously.
The Relationship Between Strong Consistency and Idempotency
An important operational property that complements strong consistency is idempotency — the guarantee that performing the same operation multiple times has the same effect as performing it once. In distributed systems, network failures can cause a client to be uncertain whether its write was committed (the response was lost, but the write may have succeeded). The client’s only safe option is to retry.
If catalog commits are idempotent, retrying a commit that may have already succeeded is safe. Iceberg’s compare-and-swap achieves idempotency by construction: if a commit is retried with the same new metadata file path and the same expected current metadata file path, and the catalog already committed the first attempt (advancing the pointer to the new path), the retry’s compare-and-swap will fail (because the pointer is now at the new path, not the expected old path). The writer can detect this case and determine that its commit actually succeeded, without needing to issue a new commit.
This combination of strong consistency (the compare-and-swap is linearizable) and effective idempotency (retry-safe commit protocol) is what gives Iceberg’s commit protocol its production reliability in the presence of network failures and job restarts.
Conclusion
Strong Consistency is not a luxury or an over-engineering choice — it is the minimum required correctness guarantee for the catalog commit operations that govern data integrity in a multi-writer lakehouse environment. Its formal definition (linearizability) is precise: all operations appear in a single, real-time-ordered global sequence. Its implementation (consensus protocols, relational database transactions, DynamoDB conditional writes) is well-understood. Its cost (latency, coordination overhead, throughput ceilings) is real but manageable at the commit frequencies typical of data lakehouse workloads. And its absence — the corruption that results from non-linearizable catalog commits in concurrent writer scenarios — is the most catastrophic class of data engineering failure: silent, difficult to detect, and potentially irreversible without careful rollback procedures. For data engineers and architects designing catalogs and commit protocols for their lakehouse deployments, strong consistency at the catalog commit layer is the foundational guarantee from which all other data reliability properties flow.
Visual Architecture
