Netflix is open sourcing a tool called Suro that the company uses to direct data from its source to its destination in real time. More than just serving a key role in the Netflix data pipeline, though, it’s also a great example of the impressive — if not sometimes redunant — ecosystem of open source data-analysis tools making their way out of large web properties.
Netflix’s various applications generate tens of billions of events per day, and Suro collects them all before sending them on their way. Most head to Hadoop (via Amazon S3) for batch processing, while others head to Druid and ElasticSearch (via Apache Kafka) for real-time analysis. According to the Netflix blog post explaining Suro (which goes into much more depth), the company is also looking at how it might use real-time processing engines such as Storm or Samza to perform machine learning on event data.