H

Cassandra - a decentralized structured storage system

Paper Summary: Lakshman, Avinash, and Prashant Malik. “Cassandra: a decentralized structured storage system.”, ACM SIGOPS Operating Systems Review 44.2 (2010): 35-40.

The authors present Cassandra, a scalable, fault-tolerant, and decentralized storage system designed to handle large volumes of structured data across distributed infrastructure. Originally developed at Facebook to support the Inbox Search feature for hundreds of millions of users, Cassandra addresses the shortcomings of applying Dynamo or BigTable to this system. It is highly available like Dynamo and has the data model of BigTable. Cassandra is capable of handling massive volumes of writes and supporting geographically distributed deployments. Cassandra is optimized for write-heavy workloads where high throughput and fault tolerance are critical. Unlike traditional relational databases that rely on strict consistency and centralized coordination, Cassandra relies on eventual consistency, enabling high availability and partition tolerance.

Cassandra’s data model is inspired by Bigtable and adapted for a fully decentralized architecture. Data is stored in keyspace-defined column families, each consisting of rows identified by unique keys, and each row can contain a dynamic number of columns. The model supports both simple and super column families, allowing for flexible, nested data representation. This structure enables developers to model hierarchical data without rigid schemas. To ensure scalability and fault tolerance, Cassandra uses consistent hashing to partition data across nodes in a logical ring much like Dynamo. This allows new nodes to join or leave with minimal disruption, requiring little data transfer. A node’s position in the ring is determined by a token, and Cassandra’s gossip protocol ensures all nodes maintain updated cluster state information. This decentralized approach removes the need for a central coordinator. The authors also introduce a failure detection mechanism for Cassandra based on heartbeats. This detector emits a suspicion level based on the observed heartbeat variance. For writes, Cassandra uses a commit log and memtable architecture: each write is appended to a commit log and stored in memory until it is flushed to disk. Over time, SSTables (sorted string tables) are written to disk and compacted in the background, improving read performance.

The authors design read operations to be efficient in Cassandra. It first checks the memtable, then consults SSTables from newest to oldest. Bloom filters and column-level indexes are used to avoid unnecessary disk access, while a background compaction process periodically merges SSTables. This architecture offers near lock-free reads and writes, crucial for high-concurrency workloads. The authors also demonstrate the impressive performance of Cassandra in the Inbox Search.

Cassandra’s approach does have notable trade-offs. Its eventual consistency model, while ideal for availability and partition tolerance, may not suit workloads requiring strong consistency or transactional guarantees. Cassandra’s operational complexity is non-trivial - managing replication strategies, tuning failure detectors, and handling compactions require experienced operators. Nevertheless, Cassandra’s architecture is highly feasible for distributed computing environments, particularly in cloud-native systems, due to its modular, decentralized, and scalable design. Its ability to elastically grow, tolerate faults, and support multi-datacenter replication makes it a robust choice for modern large-scale applications.