Big Data Engineering
Topic includes articles on Big Data Analytics
Curated by JN
Scooped by JN

Elastic Scaling in Kafka Streams

In another post in a series on Kafka Streams, we focus on the elasticity and scalability of the new stream processing library.
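The elasticity comes from Kafka's consumer-group protocol: a Streams application scales out by starting more instances with the same application.id, and the library rebalances the input partitions across them. A minimal sketch of such an application, assuming illustrative topic names and a local broker (not code from the post itself):

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

object ElasticStreamsApp extends App {
  val props = new Properties()
  // All instances sharing this application.id form one consumer group;
  // Kafka rebalances the input partitions across them, which is what
  // makes a Streams application elastically scalable.
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "elastic-demo")      // assumed name
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, classOf[Serdes.StringSerde].getName)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, classOf[Serdes.StringSerde].getName)

  // Trivial topology: copy records from an input topic to an output topic.
  val builder = new StreamsBuilder()
  builder.stream[String, String]("input-topic").to("output-topic")    // assumed topics

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```

Starting a second copy of this process splits the input partitions between the two instances; stopping one hands its partitions back. That rebalancing is the elasticity the post examines.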
Rescooped by JN from I can explain it to you, but I can't understand it for you.

Apache Spark and Machine Learning

Big Data has been with us for many years as the answer to the challenges posed by today's massive datasets. The initial technologies were disruptive compared to legacy stacks, but they are now showing their age; in particular, their poor usability is slowing their adoption in the broader market. Meanwhile, data science is now understood to be the discipline that underpins good data management and processing, which brings new problems to the table by shifting the need from ETL to recurrent or stream processing. Apache Spark has emerged with a new, disruptive model that lets any kind of business work easily with distributed technologies and process its big or fast data.

This seminar covers the underlying concepts of the Apache Spark project in depth. Although the model is simpler than in other technologies, it is still essential to grasp the ideas and features of Apache Spark that allow a business to unleash the power of its infrastructure and its data. The focus is on concrete, reproducible examples run interactively from the Spark Notebook. Not only is Spark Core extensively dissected, but so are the streaming and machine learning layers that are part of the overall project. Spark is an important piece of modern architecture, but it cannot cover the whole pipeline on its own, so the seminar also tackles the Spark ecosystem, including integration with the Apache projects Kafka, Cassandra, and Mesos.
by Andy Petrella
Via Riaz Khan
Rescooped by JN from I can explain it to you, but I can't understand it for you.

Implementing Hadoop's Input and Output Format in Spark

In this post, we will discuss how to implement Hadoop input and output formats in Spark. To follow the concepts explained here, it is best to have some basic knowledge of Apache Spark; we recommend going through the following posts first: Beginners Guide to Spark and Spark RDDs in Scala.
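As a rough sketch of the idea (the paths are illustrative assumptions, and the code below is not taken from the post itself), Spark can read and write through Hadoop's mapreduce-API format classes:

```scala
import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HadoopFormatsDemo extends App {
  val sc = new SparkContext(new SparkConf().setAppName("hadoop-formats").setMaster("local[*]"))

  // Read with a Hadoop InputFormat: keys are byte offsets, values are lines.
  val lines = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///tmp/input") // assumed path
    .map { case (_, text) => text.toString }

  // Write with a Hadoop OutputFormat: the RDD must be a pair RDD matching
  // the format's key/value types.
  lines.map(l => (NullWritable.get(), new Text(l)))
    .saveAsNewAPIHadoopFile[TextOutputFormat[NullWritable, Text]]("hdfs:///tmp/output") // assumed path

  sc.stop()
}
```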
Via Riaz Khan
Rescooped by JN from I can explain it to you, but I can't understand it for you.

Monitoring Apache Spark Streaming: Understanding Key Metrics - DZone Big Data

Swaroop Ramachandra explains in the second part of his series how to understand key metrics for performance and health in Apache Spark.
Via Riaz Khan
Rescooped by JN from I can explain it to you, but I can't understand it for you.

Using Apache Spark SQL to Explore S&P 500, and Oil Stock Prices

This post will use Apache Spark SQL and DataFrames to query, compare, and explore S&P 500 and oil stock prices.
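The general shape of that kind of exploration, as a hedged sketch; the file name, column names, and schema below are assumptions, not the post's actual data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StockExploration extends App {
  val spark = SparkSession.builder.appName("stocks").master("local[*]").getOrCreate()

  // Assumed CSV layout: date, symbol, close
  val prices = spark.read.option("header", "true").option("inferSchema", "true")
    .csv("stocks.csv") // assumed file

  prices.createOrReplaceTempView("prices")

  // Average closing price per symbol, via SQL...
  spark.sql("SELECT symbol, avg(close) AS avg_close FROM prices GROUP BY symbol").show()

  // ...and the same query through the DataFrame API.
  prices.groupBy("symbol").agg(avg("close").as("avg_close")).show()

  spark.stop()
}
```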
Via Riaz Khan
Rescooped by JN from I can explain it to you, but I can't understand it for you.

Introduction to Apache Kafka: Distributed Systems, Scalable Architecture, and Messaging Queues - DZone Integration

In my previous blog, I wrote about distributed systems and why we choose this path given today's requirements. One of the most important parts of a scalable architecture is a messaging system, used for communication between application components, log aggregation, event handling, and so on. There are standards that try to describe different protocols, but I will focus on the architecture. Typically, we use publish-subscribe message brokers to handle our message queues, and this is a great way to start. The logic behind this is that everything is an event, and all components either produce or consume events. It is always easy to expand functionality, to add more publishers or subscribers, and even to base an architecture on events (event-driven architecture). There is plenty of material describing these concepts, so I will not go into detail.

The problem I'm focusing on is scalability and performance. One of the crucial parts of any system is data ingestion, especially when it has peaks (as it usually does) and the number of messages is moderate to high (more than 20k/s). In this case, typical publish-subscribe broker systems lack performance and easily become hot spots. Some broker implementations do scale, but there are conceptual problems. These are the problems that engineers at LinkedIn solved with Apache Kafka: they addressed the shortcomings of a typical publish-subscribe system by implementing it as a commit log.

What does this mean? Consider RabbitMQ, which is a great implementation of the messy AMQP standard and works well as a distributed message broker. It is robust, provides good overall performance, and its cluster is transparent to the client. It really does a good job for the use case it was built for. The key concept is what requires a change: RabbitMQ (like most message brokers) presumes by design that consumers are mostly online, and messages waiting in a queue to be consumed are held opaquely. It is simply not designed to persist large amounts of messages on the broker.

Apache Kafka, by contrast, was designed to partition and persist large amounts of messages regardless of whether the consumers are online. Kafka presumes that producers generate messages at such a rate that they can be thought of as a stream. The main point is not to throttle producers because consumers fail to consume data fast enough, but to provide a buffer between the flood of events and the system's consumers. Consumers can then process events at their own pace: online, in batches, or even offline. One key advantage is that Kafka partitions messages by topic across nodes and provides ordered delivery within a partition. The AMQP standard requires exactly one producer, channel, exchange, queue, and consumer for ordered delivery, which breaks the philosophy of having no single point of failure.

An important consequence of decoupling producers and consumers is that messages can be read on demand and at a consumer's pace. Kafka provides the ability to replay events, which enables fallback scenarios in fast data stacks with a NoETL approach. This is possible because messages are persisted in Kafka for a configurable period of time, whether or not they have been consumed. [Image from the official Kafka documentation.] A consumer's position in a partition (the offset) is saved in the cluster, but the consumer is responsible for changing this position. At any given time, a consumer can decide to replay all the events in a partition.

Partitioning itself provides a great way to scale across multiple nodes, but one partition must fit on a single node; topics, on the other hand, are partitioned and can span multiple nodes. Messages are consumed in two different ways: as a queue, where we have a pool of consumers and each message is consumed by one of them, or as publish-subscribe, where messages are broadcast to all consumers. Kafka provides a couple of important guarantees. Message ordering is preserved per producer, and consumers see messages in the same sequence as they were produced. Replication is defined by the replication factor, with fault tolerance for up to N-1 server failures. Kafka can guarantee "at least once" delivery semantics per partition.

Looking at performance, Kafka can sustain one million messages produced per second on just a couple of nodes while keeping durability and ordered partitioning of data. This performance is considered high; only a few top companies have requirements beyond it.

One of the downsides is complexity, as in any other distributed system. Kafka requires ZooKeeper to keep its nodes in sync. I'm not a fan of single points of failure in distributed systems, which ZooKeeper is, especially when we introduce additional complexity to solve single-instance problems but then rely on something like ZooKeeper to keep things running. Being distributed, Kafka has failover mechanisms: if the master node goes down, one of the existing nodes is automatically voted in and promoted to master. Scalability is one of the key features, and this is where Kafka excels; with fault tolerance of RF-1 (replication factor) nodes, it is a great choice for the backbone of a data-intensive system.

When opting for Kafka, keep in mind that there aren't many drivers outside the JVM stack at this point, and that you need to run a cluster of nodes to benefit from fault-tolerant replication. If you need to push large messages, or if simplicity and ease of use are what you are after, you should consider one of the lightweight brokers. But if you need reliability and performance at scale and are pushing large amounts of data through your system, then Kafka is the perfect choice.
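The replay behavior described above shows up directly in the consumer API: the offset is just a position the consumer controls, so rewinding it replays the log. A minimal sketch using the Java client from Scala (broker, group, and topic names are assumptions):

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.jdk.CollectionConverters._

object ReplayDemo extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // assumed broker
  props.put("group.id", "replay-demo")             // assumed group
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

  val consumer = new KafkaConsumer[String, String](props)
  val partition = new TopicPartition("events", 0)  // assumed topic
  consumer.assign(Collections.singletonList(partition))

  // Rewind the offset: the broker retains messages for the configured
  // period regardless of consumption, so this replays them all.
  consumer.seekToBeginning(Collections.singletonList(partition))

  for (record <- consumer.poll(Duration.ofSeconds(5)).asScala)
    println(s"${record.offset}: ${record.value}")

  consumer.close()
}
```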
Via Riaz Khan
Rescooped by JN from I can explain it to you, but I can't understand it for you.

Getting Started With Heron on Apache Mesos and Apache Kafka

Heron has been open sourced, woo! Heron is Twitter's distributed stream computation system for running Storm-compatible topologies in production.
A Heron topology is a directed acyclic graph used to process streams of data. Heron topologies consist of two basic components, spouts and bolts, which are connected via streams of tuples. A simple topology wiring a spout to a bolt is sketched below.
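A hedged sketch of what such a topology looks like in code, written against the Storm-compatible API that Heron runs (package names follow Apache Storm; the spout and bolt classes here are invented for illustration):

```scala
import java.util.{Map => JMap}
import org.apache.storm.spout.SpoutOutputCollector
import org.apache.storm.task.TopologyContext
import org.apache.storm.topology.base.{BaseBasicBolt, BaseRichSpout}
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import org.apache.storm.tuple.{Fields, Tuple, Values}
import scala.collection.mutable
import scala.util.Random

// Spout: the source of the stream; emits one random word per tuple.
class WordSpout extends BaseRichSpout {
  private var out: SpoutOutputCollector = _
  private val words = Array("heron", "kafka", "mesos")
  override def open(conf: JMap[String, AnyRef], ctx: TopologyContext, c: SpoutOutputCollector): Unit = out = c
  override def nextTuple(): Unit = { Thread.sleep(100); out.emit(new Values(words(Random.nextInt(words.length)))) }
  override def declareOutputFields(d: OutputFieldsDeclarer): Unit = d.declare(new Fields("word"))
}

// Bolt: a processing node; here it keeps a running count per word.
class CountBolt extends BaseBasicBolt {
  private val counts = mutable.Map.empty[String, Long].withDefaultValue(0L)
  override def execute(t: Tuple, c: BasicOutputCollector): Unit = {
    val w = t.getStringByField("word")
    counts(w) += 1
    println(s"$w -> ${counts(w)}")
  }
  override def declareOutputFields(d: OutputFieldsDeclarer): Unit = () // terminal bolt, no output stream
}

object WordCountTopology extends App {
  // The DAG: spout -> bolt, with tuples routed so the same word
  // always reaches the same bolt instance.
  val builder = new TopologyBuilder
  builder.setSpout("words", new WordSpout, 2)
  builder.setBolt("counts", new CountBolt, 4).fieldsGrouping("words", new Fields("word"))
  // builder.createTopology() would then be submitted to the Heron/Storm cluster.
}
```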
Via Riaz Khan
Rescooped by JN from Technology Innovations

From Relational into Kafka - Ingest Tips

Overview of tools for migrating data from relational databases like MySQL, Postgres and Oracle to Apache Kafka.
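To make the task concrete, here is a deliberately naive, hand-rolled sketch of the core data movement (connection details, table, and topic are assumptions); the tools the article surveys do this incrementally and far more robustly:

```scala
import java.sql.DriverManager
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object TableToKafka extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // assumed broker
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)

  // Assumed MySQL database and table layout: users(id, email).
  val conn = DriverManager.getConnection("jdbc:mysql://localhost/shop", "user", "pass")
  val rs = conn.createStatement().executeQuery("SELECT id, email FROM users")

  // One Kafka message per row, keyed by primary key so rows with the
  // same key land in the same partition (preserving per-key ordering).
  while (rs.next()) {
    producer.send(new ProducerRecord[String, String]("users", rs.getString("id"), rs.getString("email")))
  }

  producer.flush()
  producer.close()
  conn.close()
}
```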

Via Tony Shan
Rescooped by JN from I can explain it to you, but I can't understand it for you.

Apache Kafka – What Is It And Does It Compare To Amazon Kinesis?

What is Apache Kafka? Apache Kafka is an open-source, distributed, scalable publish-subscribe messaging system, maintained by the Apache Software Foundation. The code is written in Scala and was initially developed at LinkedIn. It was open-sourced in 2011 and became a top-level Apache project. The project has the intention ...
Via Riaz Khan
Rescooped by JN from Social Network Analysis #sna

A Visual Guide to Graph Traversal Algorithms

A visual guide to Graph Traversal Algorithms
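For readers who prefer code to animation, a minimal breadth-first traversal over an adjacency-list graph; the guide itself covers BFS, DFS, and their variants interactively:

```scala
import scala.collection.immutable.Queue

object Traversal extends App {
  // Adjacency-list representation of a small directed graph.
  val graph: Map[Int, List[Int]] = Map(
    1 -> List(2, 3), 2 -> List(4), 3 -> List(4), 4 -> Nil
  )

  // Breadth-first search: visit nodes in order of distance from the start.
  def bfs(start: Int): List[Int] = {
    @annotation.tailrec
    def loop(queue: Queue[Int], visited: Set[Int], order: List[Int]): List[Int] =
      queue.dequeueOption match {
        case None => order.reverse
        case Some((node, rest)) if visited(node) => loop(rest, visited, order)
        case Some((node, rest)) =>
          val next = graph.getOrElse(node, Nil).filterNot(visited)
          loop(rest.enqueueAll(next), visited + node, node :: order)
      }
    loop(Queue(start), Set.empty, Nil)
  }

  println(bfs(1)) // List(1, 2, 3, 4)
}
```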

Via ukituki
Scooped by JN

3 Essential To-Dos for Successful Enterprise Adoption of Hadoop

Taking on Hadoop is the strongest step toward a successful big data endeavor. But what does it take to be successful with Hadoop?
Scooped by JN

Now available - The Forrester Wave™: Big Data Streaming Analytics, Q1 2016

The Forrester Wave™: Big Data Streaming Analytics, Q1 2016, reports “Streaming analytics are critical to building contextual insights for Internet of Things, Mobile, Web, and Enterprise Applications.”...
Scooped by JN

What is Hadoop?

Dig into this breakdown of Hadoop components to gain an understanding of just how flexible the open source Hadoop framework is for performing big data analytics.
Rescooped by JN from Learning*Education*Technology

An introduction to Spark Streaming | Opensource.com

A guide to Apache Spark Streaming.
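The canonical first Spark Streaming program, as a hedged sketch (the socket source host and port are assumptions):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount extends App {
  val conf = new SparkConf().setAppName("streaming-wc").setMaster("local[2]")
  // Micro-batch interval: the stream is processed as a series of 5-second RDDs.
  val ssc = new StreamingContext(conf, Seconds(5))

  // Assumed source: text lines from a socket (e.g. `nc -lk 9999`).
  val lines = ssc.socketTextStream("localhost", 9999)
  val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
  counts.print()

  ssc.start()
  ssc.awaitTermination()
}
```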

Via Skip Zalneraitis
Rescooped by JN from Technology Innovations

Apache Spark: 3 Promising Use-Cases - InformationWeek

Spark is the shiny new thing in big data, but how will it stand out? Here's a look at fog computing, cloud computing, and streaming data-analysis scenarios.

Via Tony Shan
Rescooped by JN from I can explain it to you, but I can't understand it for you.

Monitoring Apache Spark: Why Is It Challenging? - DZone Big Data

Swaroop Ramachandra presents the first part of his series on monitoring in Apache Spark and the need for a monitoring program at three levels.
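One concrete hook at the application level is Spark's SparkListener interface; a hedged sketch of logging per-task run time through it (the metric choice is illustrative, not the article's):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

object ListenerDemo extends App {
  val spark = SparkSession.builder.appName("listener-demo").master("local[*]").getOrCreate()

  // Application-level monitoring hook: called once per finished task.
  spark.sparkContext.addSparkListener(new SparkListener {
    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
      val m = taskEnd.taskMetrics
      if (m != null)
        println(s"stage ${taskEnd.stageId} task run time: ${m.executorRunTime} ms")
    }
  })

  // Any job will now report its tasks.
  spark.sparkContext.parallelize(1 to 1000).map(_ * 2).count()
  spark.stop()
}
```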
Via Riaz Khan
Rescooped by JN from I can explain it to you, but I can't understand it for you.

Using MapR, Mesos, Marathon, Docker, and Apache Spark to Deploy and Run Your First Jobs and Containers

This blog post describes the steps for deploying Mesos, Marathon, Docker, and Spark on a MapR cluster, and for running your first jobs and containers.
Via Riaz Khan
Rescooped by JN from I can explain it to you, but I can't understand it for you.

A Beginner's Guide to Apache Kafka - DZone Integration

A bare bones, bare necessities guide to what Apache Kafka can do and why it is popular.
Via Riaz Khan
Rescooped by JN from I can explain it to you, but I can't understand it for you.

Neha Narkhede: Large-Scale Stream Processing with Apache Kafka

In her presentation "Large-Scale Stream Processing with Apache Kafka" at QCon New York 2016, Neha Narkhede introduces Kafka Streams, a new feature of Kafka for processing streaming data. According to Narkhede, stream processing has become popular because unbounded datasets can be found in many places; it is no longer a niche problem like, for example, machine learning. By Ralph Winzinger
Via Riaz Khan
Rescooped by JN from Social Network Analysis #sna

Eigencentrality based on dissimilarity measures reveals central nodes in complex networks

One of the most important problems in complex network theory is locating the entities that are essential or play a central role within the network.
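For context, classical eigenvector centrality scores each node by the corresponding entry of the adjacency matrix's principal eigenvector; the paper's contribution is to build the matrix from dissimilarity measures instead. A toy power-iteration sketch of the classical version (the graph is invented):

```scala
object Eigencentrality extends App {
  // Toy symmetric adjacency matrix for a 4-node graph.
  val adj: Array[Array[Double]] = Array(
    Array(0, 1, 1, 0),
    Array(1, 0, 1, 0),
    Array(1, 1, 0, 1),
    Array(0, 0, 1, 0)
  ).map(_.map(_.toDouble))

  // Power iteration: repeatedly multiply by the adjacency matrix and
  // normalize; the vector converges to the principal eigenvector,
  // whose entries are the nodes' eigencentrality scores.
  def eigencentrality(a: Array[Array[Double]], iters: Int = 100): Array[Double] = {
    val n = a.length
    var x = Array.fill(n)(1.0 / n)
    for (_ <- 1 to iters) {
      val y = Array.tabulate(n)(i => (0 until n).map(j => a(i)(j) * x(j)).sum)
      val norm = math.sqrt(y.map(v => v * v).sum)
      x = y.map(_ / norm)
    }
    x
  }

  println(eigencentrality(adj).mkString(", ")) // node 2 (the best connected) scores highest
}
```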
Via ukituki
Scooped by JN

Caravel: Airbnb’s data exploration platform — Airbnb Engineering & Data Science

By Maxime Beauchemin
Scooped by JN

Picking the Right SQL-on-Hadoop Tool for the Job

SQL is, arguably, the biggest workload many organizations run on their Hadoop clusters. And there's good reason why: the combination of a familiar interface...
Scooped by JN

Spark: The operating system for big data analytics

See why Spark’s technical advancements make it a true operating system for big data analytics.