As representatives of this open, community-led effort, we are very proud to announce the first release of the new and improved Apache Hive, version 0.11. This substantial release embodies the work of a wide group of people from Microsoft, Facebook, Yahoo, SAP and others.
As promised, we delivered phase 1 of the Stinger Initiative in late spring. This release is another proof point that the open community can innovate at a rate unequaled by any proprietary vendor. As part of phase 1 we promised windowing, new data types, the Optimized RC (ORC) file format and base optimizations to the Hive query engine, and the community has delivered these key features.
As someone who trained as a statistician, I've always struggled with that title. I love the rigor and insight that Statistics brings to data analysis, but let's face it: Statistics — the name — has always had a bit of a branding problem.
GraphLab-the-company wants to capitalize on the success of GraphLab-the-open-source-project by building a commercial product for applying advanced machine learning to massive graph datasets, describing its platform, at a high level, as “Hadoop, but for graphs.” The company promises to continue actively supporting the open-source project.
I've been impressed in recent months by the number and quality of free data science/machine learning books available online. I don't mean free as in someone paid for a PDF version of an O'Reilly book and then posted it online for others to use/steal; I mean genuinely published books with a free online version sanctioned by the publisher. That is, "the publisher has graciously agreed to allow a full, free version of my book to be available on this site."
In the data world today, "big" dominates. But sometimes you don't need big. You need a small dose of exactly the right data. Data that bear precisely on the question at hand, that you understand deeply, and that you can trust. If such data are already at hand, great. But frequently they are not. And then, nothing beats a well-conceived, -designed, -controlled, -executed, and -analyzed experiment. Companies need to make sure experimentation is included in their "data toolkits," learn when to use it, and develop the skills to conduct effective experiments.
JobTracker.app is a Mac menu bar app interface to the Hadoop JobTracker. It provides Growl/Notification Center notices of starting, completed, and failed jobs and gives easy access to the detail pages of those jobs.
Although the science of prediction continues to improve, the work of making predictions in criminal justice is plagued by persistent shortcomings. Some stem from unfamiliarity with scientific strategies or an over-reliance on timeworn — but unreliable — prediction habits. If prediction in criminal justice is to take full advantage of the strength of these new tools, practitioners, analysts, researchers and others must avoid some commonplace mistakes and pitfalls in how they make predictions.
The story of how data became big starts many years before the current buzz around big data. As early as seventy years ago we find the first attempts to quantify the growth rate in the volume of data, or what has popularly been known as the “information explosion” (a term first used in 1941, according to the Oxford English Dictionary). The following are the major milestones in the history of sizing data volumes, plus other “firsts” in the evolution of the idea of “big data” and observations pertaining to the data or information explosion.
Forbes published this chart based on Wikibon data: It’s an $18 billion industry heading to $50 billion in five years, according to tech researchers at Wikibon. Make note of the names in the inner circle.
The big data market is still taking shape. But soon (though not very soon), we’ll see some clear segments with leaders and challengers. And then we will see a lot of acquisitions and mergers.
Cloudera believes that the future of Hadoop is as a Platform for Big Data that will complement, not replace, existing data management systems, enabling new ways of interacting with large and diverse data sets. Last week, for example, Cloudera announced the general availability of Cloudera Impala, the industry’s first and only open source interactive SQL framework for the Hadoop platform. Through innovations like Impala, Hadoop presents exciting new opportunities for the enterprise.
PernixData, a San Jose, Calif.-based storage software provider, is gearing up for the software-defined storage race. The startup is leveraging server-side flash in the hopes that the technology will give it a leg up on its competitors in the traditional enterprise information technology market.
This command-line toolkit helps extract text-based data from various sources. For example, the command "html2text http://nytimes.com | text2people" extracts the text of the New York Times front page and pipes it through a filter that keeps only people's names.
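As a rough illustration of what such an extract-then-filter pipeline does, here is a minimal Python sketch. It does not use the real html2text or text2people tools; the TextExtractor class, the sample HTML, and the naive capitalized-name regex are all hypothetical stand-ins for illustration only.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from HTML, roughly what an html2text stage emits."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>, whose text we drop

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def extract_people(text):
    """Crude stand-in for a text2people stage: capitalized First Last pairs."""
    return re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text)

# A toy page instead of fetching http://nytimes.com
html = "<html><body><p>Barack Obama spoke with Angela Merkel.</p><script>x=1</script></body></html>"
parser = TextExtractor()
parser.feed(html)
text = " ".join(parser.chunks)
print(extract_people(text))  # → ['Barack Obama', 'Angela Merkel']
```

A real name-extraction filter would use a trained named-entity recognizer rather than a regex, but the shape of the pipeline (extract text, then filter it) is the same.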
The community team at Revolution Analytics has just updated its list of resources for learning about R on the Web, including a list of the top 3 resources for absolute beginners getting started with R.
Today I am happy to announce a new suite of online statistics calculators, which I am hereby christening Evan's Awesome A/B Tools. I am calling these tools awesome because they are intuitive, visual, and easy-to-use. Unlike other online statistical calculators you've probably seen, they'll help you understand what's going on "under the hood" of common statistical tests, and by providing ample visual context, they make it easy for you to explain p-values and confidence intervals to your boss. (And they're free!)
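To make concrete what goes on "under the hood" of such a calculator, here is a minimal sketch of a two-sided two-proportion z-test, one common way A/B significance calculators compute a p-value. The function name and the example numbers are illustrative, not taken from Evan's tools.

```python
import math

def two_proportion_test(x1, n1, x2, n2):
    """Two-sided two-proportion z-test.
    x1, x2: conversion counts; n1, n2: sample sizes per arm."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # pooled rate under the null hypothesis
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF (via math.erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 20% vs 15% conversion, 1,000 visitors per arm
z, p = two_proportion_test(200, 1000, 150, 1000)
print(round(z, 2), round(p, 4))  # significant at the 5% level
```

Visual calculators like these typically shade the rejection region of this normal curve, which is what makes the p-value easy to explain to a non-statistician.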
Noam Ross recently shared a very useful guide to speeding up your R code.
- Get a bigger computer (for example, rent an instance on the Amazon cloud for a few cents an hour)
- Use parallel programming techniques
- Use the R byte-compiler
- Profile and benchmark your code
- Use high-performance packages (like xts, for time series)
- And lastly, rewrite your code to use more efficient constructs
One other tip that can have some great performance benefits is linking R to parallel BLAS libraries (Revolution R does this by default). For more details on how to speed up your R code read Noam's excellent guide, linked below.
Noam Ross: FasteR! HigheR! StrongeR! - A Guide to Speeding Up R Code for Busy People
For 2013 and beyond, experts anticipate the advent of the Chief Data Officer, a role meant to help companies understand when business units should look for answers in the company's data and to treat that data as a strategic asset.