When we revamped Messages in 2010 to integrate SMS, chat, email and Facebook Messages into one inbox, we built the product on open-source Apache HBase, a distributed key-value data store running on top of HDFS, and extended it to meet our requirements. At the time, HBase was chosen as the underlying durable data store because it provided the high write throughput and low-latency random read performance necessary for our Messages platform. In addition, it provided other important features, including horizontal scalability, strong consistency, and high availability via automatic failover. Since then, we’ve expanded the HBase footprint across Facebook, using it not only for point-read, online transaction processing workloads like Messages, but also for online analytical processing workloads where large data scans are prevalent. Today, in addition to Messages, HBase is used in production by other Facebook services, including our internal monitoring system, the recently launched Nearby Friends feature, search indexing, streaming data analysis, and data scraping for our internal data warehouses.
Any program that pulls data from a large HBase table containing terabytes of data spread over many nodes needs to put some thought into how that data is retrieved. Failing to do so may mean waiting for, and then processing, a lot of unnecessary data, to the point where the program (whether a single-threaded client or a MapReduce job) becomes useless. HBase’s Scan API helps here: it lets the client configure how data is retrieved, including the columns to include, the start and stop rows, and batch sizing.
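To make the effect of those scan parameters concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the HBase client API: the `scan` function, the `TABLE` data, and the `rows_per_fetch` parameter are all hypothetical names chosen for illustration. It relies only on two properties of HBase scans: rows are sorted by key, and by default the start row is inclusive while the stop row is exclusive.

```python
from bisect import bisect_left
from typing import Dict, Iterator, List, Optional, Sequence, Tuple

Row = Tuple[str, Dict[str, str]]  # (row key, {column: value})


def scan(table: Sequence[Row],
         start_row: str = "",
         stop_row: Optional[str] = None,
         columns: Optional[Sequence[str]] = None,
         rows_per_fetch: int = 2) -> Iterator[List[Row]]:
    """Yield chunks of rows with start_row <= key < stop_row.

    `table` must already be sorted by row key (toy stand-in for a real table).
    """
    keys = [key for key, _ in table]
    lo = bisect_left(keys, start_row)  # first key >= start_row
    hi = len(keys) if stop_row is None else bisect_left(keys, stop_row)
    chunk: List[Row] = []
    for key, cells in table[lo:hi]:
        if columns is not None:
            # Keep only the requested columns.
            cells = {c: v for c, v in cells.items() if c in columns}
            if not cells:
                continue  # row has none of the requested columns
        chunk.append((key, cells))
        if len(chunk) == rows_per_fetch:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly smaller, chunk


# Hypothetical sample data, sorted by row key.
TABLE: List[Row] = [
    ("user1#msg001", {"meta:from": "alice", "body:text": "hi"}),
    ("user1#msg002", {"meta:from": "bob", "body:text": "hello"}),
    ("user2#msg001", {"meta:from": "carol", "body:text": "hey"}),
    ("user3#msg001", {"meta:from": "dave"}),
]

if __name__ == "__main__":
    # Only user1's rows, only the sender column, one row per chunk.
    for chunk in scan(TABLE, start_row="user1", stop_row="user2",
                      columns=["meta:from"], rows_per_fetch=1):
        print(chunk)
```

Narrowing the row range and the column set means the rows for `user2` and `user3`, and every `body:text` cell, are never handed to the caller. That is the saving the paragraph above describes, here on a trivially small scale.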
Cassandra CLI is a useful tool for Cassandra administrators. It's a good example of how to implement a Cassandra client, and understanding the CLI internals helps us develop custom Cassandra clients or even extend the CLI tool.
At Booking.com, we have very wide replication topologies. It is not uncommon to have more than fifty (and sometimes more than a hundred) slaves replicating from the same master. When reaching this number of slaves, one must be careful not to saturate the network interface of the master. A solution exists but it has its weaknesses. We came up with an alternative approach that better fits our needs: the Binlog Server. We think that the Binlog Server can also be used to simplify disaster recovery and to ease promoting a slave as a new master after failure. Read on for more details.
Performance is one of the most interesting characteristics of an HBase cluster's behavior. Tuning it is challenging for administrators, because it requires a deep understanding not only of HBase but also of Hadoop, Java Virtual M