Big Data and Hadoop
Hadoop-related Big Data News and Articles
Curated by Cho Dong Hwan

An Overview of Cascading

Paco Nathan is a Data Scientist at Concurrent in SF and a committer on the Cascading.org open source project. In this video he will introduce Cascading, then examine the concept of a "workflow" as an abstraction for integrating Hadoop with other systems. We'll show new features including support for SQL-92, PMML, plus an application manager. This presentation was given on February 12th at the Nokia offices in Chicago, IL.

The accompanying slides are available on SlideShare: slideshare.net/pacoid/chicago-hadoop-users-group-enterprise-data-workflows

Extending the Data Warehouse with Hadoop

Cloudera believes that the future of Hadoop is as a Platform for Big Data that will complement, not replace, existing data management systems, enabling new ways of interacting with large and diverse data sets. Last week, for example, Cloudera announced the general availability of Cloudera Impala, the industry’s first and only open source interactive SQL framework for the Hadoop platform. Through innovations like Impala, Hadoop presents exciting new opportunities for the enterprise.

Cho Dong Hwan's insight:

"Extending" Not "Replacing"

Rescooped by Cho Dong Hwan from Corporate Challenge of Big Data

Introducing: Project Open Data | The White House

Last week, President Obama launched the Administration's new Open Data Policy and Executive Order aimed at ensuring that data released by the government will be as accessible and useful as possible.

Project Open Data is an online, public repository intended to foster collaboration and promote the continual improvement of the Open Data Policy.


Via Ian Sykes

The Next Big Thing in Big Data: People Analytics

By combining data from both real and virtual worlds, we can now understand behavior at a previously unimaginable scale.

When we use data to uncover the workplace behaviors that make people effective, happy, creative, experts, leaders, followers, early adopters, and so on, we are using “people analytics.”

Jacek Bugajski's curator insight, May 18, 5:31 AM

People Analytics - hmmm... Great idea for companies ;) 

Software Defined Storage Startup PernixData Raises $20M

PernixData, a San Jose, Calif.-based storage software provider, is gearing up for the software-defined storage race. The startup is leveraging server-side flash in the hopes that the technology will give it a leg up on its competitors in the traditional enterprise information technology market.

Data Science Toolkit

This command-line toolkit helps extract text-based data from various sources. For example, the command "html2text http://nytimes.com | text2people" extracts the text from the front page of the New York Times and pipes it into a filter that keeps only people's names.

Happy Hacking! http://www.datasciencetoolkit.org/developerdocs

Top 3 R resources for beginners

The community team at Revolution Analytics has just updated this list of resources to learn about R on the Web. Included is this list of the top 3 resources for absolute beginners getting started with R:

 

Announcing Evan's Awesome A/B Tools

Today I am happy to announce a new suite of online statistics calculators, which I am hereby christening Evan's Awesome A/B Tools. I am calling these tools awesome because they are intuitive, visual, and easy-to-use. Unlike other online statistical calculators you've probably seen, they'll help you understand what's going on "under the hood" of common statistical tests, and by providing ample visual context, they make it easy for you to explain p-values and confidence intervals to your boss. (And they're free!)

How Reed Hastings' busy 2005 winter vacation led Netflix to embrace big data

The Netflix CEO thought he could develop a better recommendation algorithm than his engineers. He failed, and the episode has shaped the way the company looks at data ever since.

A guide to speeding up R code

Noam Ross recently shared a very useful guide to speeding up your R code. 

Get a bigger computer (for example, renting an instance on the Amazon cloud for a few cents an hour)
Use parallel programming techniques
Using the R byte-compiler
Profiling and benchmarking your code
Using high-performance packages (like xts, for time series)
And lastly, rewriting your code to use more efficient constructs

One other tip that can have some great performance benefits is linking R to parallel BLAS libraries (Revolution R does this by default). For more details on how to speed up your R code read Noam's excellent guide, linked below.

Noam Ross: FasteR! HigheR! StrongeR! - A Guide to Speeding Up R Code for Busy People

 
Friends don't let friends calculate p-values (without fully understanding them)

The Chief Data Officer Rises

For 2013 and beyond, experts are anticipating the advent of the role of Chief Data Officer to better understand when business units should be looking for answers in the company's data, treating data as a strategic asset.

Correlation does not equal causation. So is correlation enough?

In the Wired article "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete," Chris Anderson discusses how big data is affecting the scientific method. The scientific method is based on forming hypotheses and designing experiments to prove or disprove them. With massive amounts of data available, do scientists still need to follow this process?

[Recap] Outlier - Wikipedia

In statistics, an outlier is an observation that is numerically distant from the rest of the data. Grubbs defined an outlier as:

An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs.

Outliers can occur by chance in any distribution, but they are often indicative either of measurement error or that the population has a heavy-tailed distribution. In the former case one wishes to discard them or use statistics that are robust to outliers, while in the latter case they indicate that the distribution has high kurtosis and that one should be very cautious in using tools or intuitions that assume a normal distribution. A frequent cause of outliers is a mixture of two distributions, which may be two distinct sub-populations, or may indicate 'correct trial' versus 'measurement error'; this is modeled by a mixture model.
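
As a minimal sketch of what "statistics that are robust to outliers" can look like in practice, the Python snippet below flags points using the median and the median absolute deviation (MAD) rather than the mean and standard deviation. The function name flag_outliers, the sample values, and the 3.5 cutoff with the 0.6745 scaling constant (a commonly used modified z-score convention) are assumptions chosen for illustration; none of them come from the excerpt above.

```python
import statistics

def flag_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`."""
    med = statistics.median(values)
    # Median absolute deviation: a spread estimate that a few extreme points barely move.
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # a MAD of zero leaves the score undefined, so flag nothing
    # 0.6745 rescales the MAD so scores are comparable to z-scores for normal data.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical measurements with one suspicious reading.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0]
outliers = flag_outliers(sample)
print(outliers)
```

In this sample the single extreme reading pulls the mean to about 12.1 while the median stays at 10.0, which is why the median-based score isolates that reading. Whether a flagged point should be discarded or treated as evidence of a heavy-tailed distribution is still the judgment call described above.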

Linfox crunches big data to keep trucks on time

Logistics giant Linfox is embarking on a big data-crunching exercise that will give its control centres the ability to predict hazards and help drivers navigate around them. 

The company is using an SAP HANA in-memory analytics database engine to crawl about 12 million real-time records generated by telematics equipment on a subset of its 5000-plus truck fleet.

10 things to know about linear model summary in R

R makes it easy to fit a linear model to your data. The hard part is knowing whether the model you've built is worth keeping and, if so, figuring out what to do next.

This is a post about linear models in R, how to interpret lm results, and common rules of thumb to help side-step the most common mistakes.

NYC announces new risk-based (data-mined) fire inspections

MAYOR BLOOMBERG AND FIRE COMMISSIONER CASSANO ANNOUNCE NEW RISK-BASED FIRE INSPECTIONS CITYWIDE BASED ON DATA MINED FROM CITY RECORDS 

Statistics vs Data Science vs BI (Just for Fun)

As someone who trained as a statistician, I've always struggled with that title. I love the rigor and insight that Statistics brings to data analysis, but let's face it: Statistics — the name — has always had a bit of a branding problem.

Dell takes SharePlex to Hadoop and beyond

Dell Software's (formerly Quest's) SharePlex replication tool for Oracle now works with Hadoop...or anything else that can talk to a JMS queue.

GraphLab picks up $6.75m from Madrona and NEA to bolster its ‘Hadoop for graphs’

GraphLab-the-company wants to capitalize on the success of GraphLab-the-open-source-project by building a commercial product for applying advanced machine-learning to massive graph datasets, referring to its platform as a “Hadoop but for graphs” on a high level. The company promises to continue actively supporting the open-source project.

Free Datascience books

I've been impressed in recent months by the number and quality of free datascience/machine learning books available online. I don't mean free as in some guy paid for a PDF version of an O'Reilly book and then posted it online for others to use/steal, but I mean genuine published books with a free online version sanctioned by the publisher. That is, "the publisher has graciously agreed to allow a full, free version of my book to be available on this site."

In a Big Data World, Don't Forget Experimentation

In the data world today, "big" dominates. But sometimes you don't need big. You need a small dose of exactly the right data. Data that bear precisely on the question at hand, that you understand deeply, and that you can trust. If such data are already at hand, great. But frequently they are not. And then, nothing beats a well-conceived, -designed, -controlled, -executed, and -analyzed experiment. Companies need to make sure experimentation is included in their "data toolkits," learn when to use it, and develop the skills to conduct effective experiments.

Tracking Hadoop Jobs from Your Mac: There’s an App for That

JobTracker.app is a Mac menu bar app interface to the Hadoop JobTracker. It provides Growl/Notification Center notices of starting, completed, and failed jobs and gives easy access to the detail pages of those jobs.

 
The Pitfalls of Prediction | National Institute of Justice

Although the science of prediction continues to improve, the work of making predictions in criminal justice is plagued by persistent shortcomings. Some stem from unfamiliarity with scientific strategies or an over-reliance on timeworn — but unreliable — prediction habits. If prediction in criminal justice is to take full advantage of the strength of these new tools, practitioners, analysts, researchers and others must avoid some commonplace mistakes and pitfalls in how they make predictions.

A Very Short History Of Big Data

The story of how data became big starts many years before the current buzz around big data. Already seventy years ago we encounter the first attempts to quantify the growth rate in the volume of data or what has popularly been known as the “information explosion” (a term first used in 1941, according to the Oxford English Dictionary). The following are the major milestones in the history of sizing data volumes plus other “firsts” in the evolution of the idea of “big data” and observations pertaining to data or information explosion.

Crisis Maps: Harnessing the Power of Big Data to Deliver Humanitarian Assistance

Crisis-mapping technology has emerged in the past five years as a tool to help humanitarian organizations deliver assistance to victims of civil conflicts and natural disasters. Crisis-mapping platforms display eyewitness reports submitted via e-mail, text message, and social media. The reports are then plotted on interactive maps, creating a geospatial record of events in real time.
