Data can come in many shapes and forms, and can be described in many ways. Statistics like the mean and standard deviation of a sample provide descriptions of some of its important qualities. Less commonly used statistics such as skewness and kurtosis provide additional perspective into the data’s profile.
However, sometimes we can provide a much neater description for data by stating that a sample comes from a given distribution, which not only tells us things like the average value that we should expect, but effectively gives us the data’s “recipe” so that we can compute all sorts of useful information from it. As part of my summer internship at Cloudera, I added implementations to Apache Spark’s MLlib library of various statistical tests that can help us draw conclusions regarding how well a distribution fits data. Specifically, the implementations pertain to the Spark JIRAs SPARK-8598 and SPARK-8884.
In this post, I’ll offer an overview of the first two tests and take the 1-sample variant out for a spin on some simulated data.
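As a rough illustration of what a 1-sample goodness-of-fit test measures, here is a minimal sketch in plain Python rather than Spark. It assumes the Kolmogorov-Smirnov statistic, a common test of this kind; the text above does not name the tests, so treat that choice as an assumption. The helper names `normal_cdf` and `ks_statistic` are hypothetical and are not part of MLlib.

```python
import math
import random


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, computed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(sample, cdf):
    """1-sample, two-sided Kolmogorov-Smirnov statistic.

    Returns the largest vertical gap between the empirical CDF of
    `sample` and the theoretical `cdf`.
    """
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must not be empty")
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check
        # the gap on both sides of the jump.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Simulated data (assumption for illustration): one sample that really is
# standard normal, and one exponential sample that is not.
random.seed(42)
normal_sample = [random.gauss(0.0, 1.0) for _ in range(1000)]
exp_sample = [random.expovariate(1.0) for _ in range(1000)]

d_normal = ks_statistic(normal_sample, normal_cdf)
d_exp = ks_statistic(exp_sample, normal_cdf)
print("KS statistic, normal sample vs N(0,1):     ", d_normal)
print("KS statistic, exponential sample vs N(0,1):", d_exp)
```

A small statistic means the empirical distribution stays close to the hypothesized one; the exponential sample produces a much larger gap because none of its values fall below zero. Turning the statistic into a p-value requires the Kolmogorov distribution, which this sketch leaves out.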
Containing twenty-four design patterns and ten related guidance topics, this guide articulates the benefit of applying patterns by showing how each piece can fit into the big picture of cloud application architectures. It also discusses the benefits and considerations for each pattern. Most of the patterns have code samples or snippets that show how to implement the patterns using the features of Microsoft Azure. However, the majority of topics described in this guide are equally relevant to all kinds of distributed systems, whether hosted on Azure or on other cloud platforms.
Pivotal announced that we donated the Pivotal HAWQ core to the Apache Software Foundation (ASF) and it is now an officially incubating project. Apache HAWQ (incubating) is a redesign of the HAWQ architecture to enable greater elasticity to meet the requirements of a growing user base. With the addition of YARN support and its acceptance as an Apache project, HAWQ is now more than ever a truly Hadoop Native SQL Engine. This blog is a technical primer for the background and architecture of Apache HAWQ.
The Google transport networking crew (QUIC, TCP, etc.) deserve a shout out for identifying and fixing a nearly decade-old Linux kernel TCP bug that I think will have an outsized impact on performance and efficiency for the Internet.
This is an exciting time to be studying (Deep) Machine Learning, or Representation Learning, or for lack of a better term, simply Deep Learning!
Deep Learning is rapidly emerging as one of the most successful and widely applicable sets of techniques across a range of applications (vision, language, speech, computational biology, robotics, etc.), leading to some pretty significant commercial success.
This course will expose students to cutting-edge research — starting from a refresher in basics of neural networks, to recent developments. The emphasis will be on student-led paper presentations and discussions. Each “module” will begin with instructor lectures to present context and background material.
This is my second article about Apache Spark architecture and today I will be more specific and tell you about the shuffle, one of the most interesting topics in the overall Spark design. The previous part was mostly about general Spark architecture and its memory management. It can be accessed here. What is the shuffle [...]
A few days ago, we announced the release of Spark 1.5. This release contains major under-the-hood changes that improve Spark’s performance, usability, and operational stability. Besides these changes, we have been continuously improving the DataFrame API. In this blog post, we’d like to highlight three major improvements to the DataFrame API in Spark 1.5, which are:
New built-in functions; time intervals; and an experimental user-defined aggregation function (UDAF) interface.