Data replication is the practice of keeping copies of a system's data geographically distributed, preferably through a non-interactive, reliable process. In a traditional RDBMS, implementing any sort of replication is a struggle because these systems were not developed with horizontal scaling in mind. Instead, they are typically backed up through a semi-manual process in which live recovery is not much of a concern. Even so, that description downplays the complexity of such a setup. For today’s globally distributed data, the older replication concepts designed for colocated systems will not suffice at geographic scale.
Today’s infrastructure requires systems that natively support active, real-time replication, achieved through transparent and simple configuration. Modern-day NoSQL databases strive to offer the ability to dictate where and how your data is replicated through easily tunable settings, along with concepts that users can easily understand.
Apache Cassandra was built with native multi-data center replication in mind, yet this capability is one of its most overlooked because this level of infrastructure has been assimilated as “tribal knowledge” within the Cassandra community. For those new to Apache Cassandra, this page highlights the simple inner workings that let Cassandra excel at multi-data center replication, by simplifying the problem to the level of a single node.
This page covers the fundamentals of Cassandra internals, multi-data center use cases, and a few caveats to keep in mind when expanding your cluster.
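As a small illustration of the "easily tunable settings" mentioned above, the sketch below builds the CQL statement that sets a per-data-center replica count with Cassandra's `NetworkTopologyStrategy`. The helper `build_keyspace_cql`, the keyspace name `app_data`, and the data center names `us_east` and `eu_west` are all hypothetical and chosen only for this example. In a real cluster the data center names must match the names reported by the cluster's snitch. The code only produces a string and does not connect to a cluster.

```python
from typing import Mapping


def build_keyspace_cql(keyspace: str, replicas_per_dc: Mapping[str, int]) -> str:
    """Build a CREATE KEYSPACE statement with one replica count per data center.

    Hypothetical helper for illustration; it does no identifier escaping.
    """
    if not replicas_per_dc:
        raise ValueError("at least one data center is required")

    # Each data center gets its own replication factor, e.g. 'us_east': 3.
    dc_options = ", ".join(
        f"'{dc}': {count}" for dc, count in replicas_per_dc.items()
    )
    options = "'class': 'NetworkTopologyStrategy', " + dc_options
    return (
        "CREATE KEYSPACE IF NOT EXISTS " + keyspace
        + " WITH replication = {" + options + "};"
    )


# Assumed topology: three replicas in one data center, two in another.
statement = build_keyspace_cql("app_data", {"us_east": 3, "eu_west": 2})
print(statement)
```

The resulting statement could be run from `cqlsh` or through a client driver. The point of the sketch is that where the data lives, and how many copies exist in each location, is expressed as a single map of options rather than as a separate replication pipeline.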
Great graphs and explanations of how data replication works within Cassandra.