Data science is a broad church. I am a data scientist — or so I’ve been told — but what I do is actually quite different from what other “data scientists” do. For example, there are those practicing “investigative analytics” and those implementing “operational analytics.” (I’m in the second camp.)
To solve a planning or optimization problem, some solvers tend to scale poorly: as the problem gains more variables and more constraints, they use far more RAM and CPU power. They can hit hardware memory limits at a few thousand variables and a few million constraint matches. One way their users typically work around such hardware limits is to use MapReduce. Let’s see what happens if we MapReduce a planning problem, such as the Traveling Salesman Problem.
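To make the idea concrete, here is a minimal sketch of what "MapReduce-ing" a Traveling Salesman Problem could look like. It is not the article's own experiment. The strip-based partitioning, the nearest-neighbor heuristic, and all function names are assumptions chosen for illustration: the map step solves each partition on its own, and the reduce step simply concatenates the partial tours.

```python
import math
import random


def tour_length(tour):
    """Total length of a closed tour over (x, y) points."""
    return sum(
        math.dist(tour[i], tour[(i + 1) % len(tour)]) for i in range(len(tour))
    )


def nearest_neighbor(cities):
    """Greedy heuristic: always travel to the closest unvisited city."""
    if not cities:
        return []
    remaining = list(cities)
    tour = [remaining.pop(0)]
    while remaining:
        last = tour[-1]
        closest = min(remaining, key=lambda c: math.dist(last, c))
        remaining.remove(closest)
        tour.append(closest)
    return tour


def partition(cities, parts):
    """Split cities into vertical strips by x-coordinate (hypothetical scheme)."""
    ordered = sorted(cities)
    size = max(1, math.ceil(len(ordered) / parts))
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def map_reduce_tour(cities, parts):
    """Map: solve every partition independently. Reduce: join the sub-tours."""
    sub_tours = map(nearest_neighbor, partition(cities, parts))
    return [city for sub_tour in sub_tours for city in sub_tour]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    cities = [(rng.random(), rng.random()) for _ in range(200)]
    print("single solver :", tour_length(nearest_neighbor(sorted(cities))))
    print("4 partitions  :", tour_length(map_reduce_tour(cities, 4)))
```

Each partition only ever sees its own cities, so no sub-solver can trade a city with a neighboring strip or choose where the strips connect. Comparing the two printed tour lengths for different partition counts is one way to see how much that restriction costs on a given data set.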
ZooKeeper Resilience at Pinterest. Apache ZooKeeper is an open source distributed coordination service that’s popular for use cases like service discovery, dynamic configuration management and distributed locking. While it’s versatile and useful, it has failure modes that can be hard to prepare for and recover from, and if it is used for site-critical functionality, they can have a significant impact on site availability.
This article provides an overview of tools and libraries available for embedded data analytics and statistics, both stand-alone software packages and programming languages with statistical capabilities. The authors also discuss how to combine and integrate these embedded analytics technologies to handle big data.
This weekend I had the opportunity to attend Penn State’s “Teaching and Learning with Technology” Symposium. In addition to hearing some great talks about innovation, learning analytics, and the PSU strategy on MOOCs, I was energized to pick up with data/text mining in R. Learning analytics (LA) and the future of their use have fascinated me for quite some time, and I have been eager to combine my developing R skills with data mining techniques.
Apache Spark, an in-memory data-processing framework, is now a top-level Apache project. That’s an important step for Spark’s stability as it increasingly replaces MapReduce in next-generation big data applications.
REEF stands for the Retainable Evaluator Execution Framework, and it is our approach to simplify and unify the lower layers of big data systems on modern resource managers like Apache YARN, Apache Mesos, Google Omega, and Facebook Corona. On these resource managers, REEF provides a centralized control plane abstraction that can be used to build a decentralized data plane for supporting big data systems, like those mentioned below. Special consideration is given to graph computation and machine learning applications, which require data retention on allocated resources, as they execute multiple passes over the data.
As I frequently travel in data science circles, I’m hearing more and more about a new kind of tech war: Python vs. R. I’ve lived through many tech wars in the past, e.g. Windows vs. Linux, iPhone vs. Android, etc., but this tech war seems to have a different flavor to it. What feels different in this case is that the application area is the same, namely performing work in data science where the solution often depends on the use of libraries that implement various machine learning algorithms. This being the case, the question is: which language should you adopt as a data scientist?
Those of us who have spent years studying “data smart” companies believe we’ve already lived through two eras in the use of analytics. We might call them BBD and ABD—before big data and after big data. Or, to use a naming convention matched to the topic, we might say that Analytics 1.0 was followed by Analytics 2.0. Generally speaking, 2.0 releases don’t just add some bells and whistles or make minor performance tweaks. In contrast to, say, a 1.1 version, a 2.0 product is a more substantial overhaul based on new priorities and technical possibilities. When large numbers of companies began capitalizing on vast new sources of unstructured, fast-moving information—big data—that was surely the case.