5G, IoT, Big Data, Analytics, AI & Cloud

The Modern Data Platform

“The Modern Data Platform” starts with a look back that lays the groundwork for the challenges of today.

Al Sedghi's insight:
Here are some of the main factors for getting the Big Data Analytics platform right: 

1. Raging speed 

Opportunities around data are greater than ever. Business users and customers expect results almost instantly, but meeting those expectations can be challenging, especially with legacy systems. Speed is not the only factor in executing a Big Data analytics strategy, but it is at the top of the list. I was working with a customer running queries on a 10-terabyte data set; with that solution, a query would take 48 hours to come back with an answer, and after 48 hours the question is almost moot. There is no benefit to an answer that arrives after the window for acting on it has closed.

2. Massive—and growing—capacity

Your Big Data analytics solution must be able to handle huge quantities of data, and it must also be able to grow organically as that data increases. You must be able to scale your database in line with your data growth, and do it in a way that is transparent to the data consumer or analyst. A modern analytics solution has very little downtime, if any at all; capacity and compute expansion happen in the background.

3. Easy integration with legacy tools 

A significant part of an analytics strategy is ensuring that it works with what you have, but also identifying which tools must be replaced, and when. Many organizations have made investments in older tools, such as extract, transform, load (ETL) tools. It is of course important to support those legacy tools, but at scale, as the need for data and analysis grows, you may find that scaling those ETL solutions becomes costly. It might make more sense to re-tool your ETL with a more modern, more parallel solution, as the sketch below illustrates.
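To make the "more parallel" point concrete, here is a minimal sketch of what a re-tooled ETL job might look like in PySpark. The paths, file layout, and column names are illustrative assumptions, not part of the original article.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Illustrative paths and column names; adjust for your environment.
spark = SparkSession.builder.appName("parallel-etl-sketch").getOrCreate()

# Extract: read raw CSV files in parallel across the cluster.
raw = spark.read.option("header", True).csv("hdfs:///raw/orders/*.csv")

# Transform: cleanse and enrich without pulling data to a single node.
orders = (raw
    .filter(F.col("order_total").isNotNull())
    .withColumn("order_total", F.col("order_total").cast("double"))
    .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd")))

# Load: write a columnar, partitioned copy for downstream analytics.
orders.write.mode("overwrite").partitionBy("order_date").parquet("hdfs:///warehouse/orders")

spark.stop()

Because every step runs as distributed DataFrame operations, the same job scales out by adding nodes rather than by rewriting the pipeline.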

4. Working well with Hadoop

For many organizations, this open-source Big Data framework has become synonymous with Big Data analytics. But Hadoop alone is not enough. At the end of the day, Hadoop is a batch processing system: when I start a job to analyze data, it goes into a queue and finishes when it finishes. When you are dealing with high-concurrency analytics, Hadoop shows its weaknesses. What is needed is a way to harness the advantages of Hadoop without incurring its performance penalties and potential disruptions.

5. Support for data scientists

Enterprises should help their most expert, and most in-demand, data workers by investing in tools that allow them to conduct more robust analysis on larger sets of data. What is important is moving toward a solution where data scientists can work on the data in place, in the database. Today, if they have SQL Server, they typically pull a subset or sample of the data out of the database, transfer it to their local machine, and run their analysis there. If they can run statistical models in-database instead, they are no longer sampling, and they get their answers much faster. It is a significantly more efficient process.
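As a simple illustration of the "work on the data in place" idea, the sketch below pushes a statistical summary into the database and fetches only the result, instead of pulling rows to a laptop and sampling. The connection string, table, and columns are hypothetical; full in-database modeling would go further than these aggregates.

import pyodbc

# Hypothetical DSN, credentials, and table; adjust for your environment.
conn = pyodbc.connect("DSN=analytics;UID=analyst;PWD=secret")

# Push the computation to where the data lives and return only summaries.
query = """
    SELECT region,
           COUNT(*)           AS n,
           AVG(order_total)   AS mean_total,
           STDEV(order_total) AS sd_total
    FROM dbo.Sales
    GROUP BY region
"""
for region, n, mean_total, sd_total in conn.execute(query):
    print(region, n, mean_total, sd_total)

conn.close()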

6. Advanced analytics features

As organizations move toward predictive analytics, they demand more from their data technology. It goes beyond reporting, and beyond aggregates of the data in the data warehouse. You may need to run complex queries against the data in your database: predictive, geospatial, and sentiment-focused.
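For the predictive case, here is a minimal sketch of fitting and scoring a model next to the data with Spark ML. The tables, feature columns (assumed numeric), and label are hypothetical, used only to show the shape of a predictive workload that goes beyond warehouse aggregates.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("predictive-sketch").getOrCreate()

# Hypothetical warehouse tables of historical and upcoming demand.
history = spark.read.parquet("hdfs:///warehouse/demand_history")
upcoming = spark.read.parquet("hdfs:///warehouse/demand_next_week")

# Assemble numeric feature columns into the vector format Spark ML expects.
assembler = VectorAssembler(inputCols=["price", "promo_flag", "day_of_week"],
                            outputCol="features")
train = assembler.transform(history)

# Fit a simple regression model close to the data, then score new rows.
model = LinearRegression(featuresCol="features", labelCol="units_sold").fit(train)
scored = model.transform(assembler.transform(upcoming))
scored.select("sku", "prediction").show()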

Taming Big Data with Spark Streaming for Real-time Data Processing

Learn how Spark Streaming capabilities help handle big and fast data challenges through stream processing by letting developers write streaming jobs.
Al Sedghi's insight:
Spark is a technology well worth considering and learning about. It has a growing open-source community and is the most active Apache project at the moment. 

Spark offers a faster and more general data processing platform: it lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop. Other reasons to consider Spark are that it is highly scalable, simpler, and modular, which could be why it is being adopted by key players such as Amazon, eBay, Netflix, Uber, Pinterest, and Yahoo.

Last year, Spark overtook Hadoop by completing the 100 TB Daytona GraySort contest 3x faster on one tenth the number of machines, and it also became the fastest open-source engine for sorting a petabyte. Spark also lets you write code faster, as you have over 80 high-level operators at your disposal, as the short example below shows.
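A minimal sketch of those high-level operators in PySpark: a word count over log files that would otherwise be a full MapReduce job. The input path is an assumption for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("operators-sketch").getOrCreate()

# A handful of high-level operators: read, split, explode, group, count.
lines = spark.read.text("hdfs:///data/logs/*.txt")
counts = (lines
    .select(F.explode(F.split(F.col("value"), "\\s+")).alias("word"))
    .groupBy("word")
    .count()
    .orderBy(F.desc("count")))
counts.show(20)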

Additional major features of Spark include: 

- Provides APIs in Scala, Java, and Python, with support for other languages (such as R) on the way 

- Integrates well with the Hadoop ecosystem and data sources (HDFS, Amazon S3, Hive, HBase, Cassandra, etc.)

- Can run on clusters managed by Hadoop YARN or Apache Mesos, and can also run standalone

Overall, Spark simplifies the challenging and compute-intensive task of handling high volumes of real-time or archived data, both structured and unstructured, while integrating complex capabilities such as machine learning and graph algorithms.
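To give a feel for the streaming side specifically, here is a minimal Spark Streaming sketch that counts words arriving on a socket in 5-second micro-batches. The host, port, and batch interval are illustrative; in practice the source would more likely be Kafka, Kinesis, or HDFS.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Assumes a text source on localhost:9999 (e.g. started with `nc -lk 9999`).
sc = SparkContext(appName="streaming-sketch")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's word counts to the driver log

ssc.start()
ssc.awaitTermination()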