EEDSP
Digital Signal Processing, Data Analytics, Big Data, HPC, Deep Learning, GPGPU
Curated by Shiwon Cho
Scooped by Shiwon Cho
Scoop.it!

Data Science 101: SparkR - Interactive R Programs at Scale - insideBIGDATA


SparkR is an open source R package developed at U.C. Berkeley AMPLab that allows data scientists to analyze large data sets and interactively run jobs on them from the R shell.


Building Lambda Architecture with Spark Streaming

The versatility of Apache Spark’s API for both batch/ETL and streaming workloads brings the promise of lambda architecture to the real world.

Few things help you concentrate like a last-minute change to a major project.

One time, after working with a customer for three weeks to design and implement a proof-of-concept data ingest pipeline, the customer’s chief architect told us:

“You know, I really like the design – I like how data is validated on arrival. I like how we store the raw data to allow for exploratory analysis while giving the business analysts pre-computed aggregates for faster response times. I like how we automatically handle data that arrives late and changes to the data structure or algorithms.”

“But,” he continued, “I really wish there was a real-time component here. There is a one-hour delay between the point when data is collected until it’s available in our dashboards. I understand that this is to improve efficiency and protect us from unclean data. But for some of our use cases, being able to react immediately to new data is more important than being 100% certain of data validity.”


How-to: Use IPython Notebook with Apache Spark


The developers of Apache Spark have given thoughtful consideration to Python as a language of choice for data analysis. They have developed the PySpark API for working with RDDs in Python, and further support using the powerful IPython shell instead of the built-in Python REPL.

 

4 reasons why Spark could jolt Hadoop into hyperdrive

Apache Spark might push MapReduce to the back burner faster than some people might like, but it will also boost the overall Hadoop ecosystem. The project’s co-creator Matei Zaharia explains why Spark is so popular now and where it fits into the big data ecosystem.

Apache Spark Resource Management and YARN App Models


A concise look at the differences between how Spark and MapReduce manage cluster resources under YARN

The most popular Apache YARN application after MapReduce itself is Apache Spark. At Cloudera, we have worked hard to stabilize Spark-on-YARN (SPARK-1101), and CDH 5.0.0 added support for Spark on YARN clusters.

In this post, you’ll learn about the differences between the Spark and MapReduce architectures, why you should care, and how they run on the YARN cluster ResourceManager.


The Apache Software Foundation Announces Apache™ Spark™ as a Top-Level Project : The Apache Software Foundation Blog


Spark Release 0.8.0 | Apache Spark


Apache Spark 0.8.0 is a major release that includes many new capabilities and usability improvements. 


Seven reasons why I like Spark - O'Reilly Radar

A large portion of this week's AMP Camp at UC Berkeley is devoted to an introduction to Spark - an open source, in-memory, cluster computing framework.

How-to: Translate from MapReduce to Apache Spark

Venerable MapReduce has been Apache Hadoop’s work-horse computation paradigm since its inception. It is ideal for the kinds of work for which Hadoop was originally designed: large-scale log processing, and batch-oriented ETL (extract-transform-load) operations.

As Hadoop’s usage has broadened, it has become clear that MapReduce is not the best framework for all computations. Hadoop has made room for alternative architectures by extracting resource management into its own first-class component, YARN. And so, projects like Impala have been able to use new, specialized non-MapReduce architectures to add interactive SQL capability to the platform, for example.

Today, Apache Spark is another such alternative, and many say it will succeed MapReduce as Hadoop’s general-purpose computation paradigm. But if MapReduce has been so useful, how can it suddenly be replaced? After all, there is still plenty of ETL-like work to be done on Hadoop, even if the platform now has other real-time capabilities as well.



6 sparkling features of Apache Spark! - Big Data Analytics News

What is Apache Spark? Why is there such serious buzz about it? If you are in the big data analytics business, should you really care about Spark? Hopefully this post will help answer some of the questions that may have come to mind lately. Apache Spark is a powerful...

The lab that created Spark wants to speed up everything, including cures for cancer


AMPLab, the University of California, Berkeley, research group responsible for making Spark a household name in big data, has a lot more tricks up its sleeve. They range from databases to machine learning, and even include tools that could help treat cancer.


Spark Release 1.0.0 | Apache Spark


Spark 1.0.0 is a major release marking the start of the 1.X line. This release brings both a variety of new features and strong API compatibility guarantees throughout the 1.X line. Spark 1.0 adds a new major component, Spark SQL, for loading and manipulating structured data in Spark. It includes major extensions to all of Spark’s existing standard libraries (ML, Streaming, and GraphX) while also enhancing language support in Java and Python. Finally, Spark 1.0 brings operational improvements including full support for the Hadoop/YARN security model and a unified submission process for all supported cluster managers.


Spark for Data Science: A Case Study - Hortonworks

I’m a pretty heavy Unix user and I tend to prefer doing things the Unix Way™, which is to say, composing many small command line oriented utilities. With composability comes power and with specialization comes simplicity. Sometimes, though, when two utilities are used together all the time, it makes sense to have either:

A utility that specializes in a very common use-case
One utility to provide basic functionality from another utility

For example, one thing that I find myself doing a lot of is searching a directory recursively for files that contain an expression:
Despite the fact that you can do this, specialized utilities, such as ack, have come up to simplify this style of querying. Turns out, there’s also power in not having to consult the man pages all the time. Another example is the interaction between uniq and sort. uniq presumes sorted data.…
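
The remark that uniq presumes sorted data can be illustrated with a small sketch. This is not code from the article. It assumes a hypothetical helper named `uniq` that imitates the Unix utility by collapsing only adjacent duplicates (using Python's `itertools.groupby`), and a made-up list of words as input.

```python
from itertools import groupby


def uniq(lines):
    """Collapse runs of adjacent equal items, in the spirit of Unix `uniq`.

    Hypothetical helper for illustration; it only compares neighbours,
    so duplicates that are not adjacent survive.
    """
    return [key for key, _group in groupby(lines)]


# Made-up sample input with non-adjacent duplicates.
words = ["spark", "hadoop", "spark", "spark", "yarn", "hadoop"]

# Without sorting, only the adjacent "spark", "spark" run is collapsed.
unsorted_result = uniq(words)

# Sorting first makes all duplicates adjacent, so each value appears once.
sorted_result = uniq(sorted(words))

print(unsorted_result)
print(sorted_result)
```

Sorting before deduplicating is the same reason `sort` and `uniq` are so often piped together on the command line.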

As MapReduce fades, Apache Spark is now a top-level project

Apache Spark, an in-memory data-processing framework, is now a top-level Apache project. That’s an important step for Spark’s stability as it increasingly replaces MapReduce in next-generation big data applications.

Apache Spark for Big Analytics

By Thomas Dinsmore, Director of Product Management at Revolution Analytics. The emergence of Apache Spark is a key development for Big Analytics in 2013.

Databricks raises $14M from Andreessen Horowitz, wants to take on MapReduce with Spark

A team of professors behind the open source Spark and Shark in-memory big data projects has raised $13.9 million to commercialize the products via a company called Databricks.