Big Data Technology, Semantics and Analytics
Trends, success and applications for big data including the use of semantic technology
Curated by Tony Agresta
Scooped by Tony Agresta!

What's the Scoop on Hadoop?

If you are an investor in the field of Big Data, you must have heard the terms “Big Data” and “Hadoop” a million times.  Big Data pundits use the terms interchangeably and conversations might lead you to believe that...
Tony Agresta's insight:

"Hadoop is not great for low latency or ad-hoc analysis and it’s terrible for real-time analytics."

In a webcast today with Matt Aslett from 451 Research and Justin Makeig from MarkLogic, a wealth of information was presented about Hadoop, including how it's used today and how MarkLogic extends it. When the video becomes available, I'll post it; in the meantime, the quote above from the Forbes article echoes what the speakers discussed.

Today, Hadoop is used to store, process and integrate massive amounts of structured and unstructured data, and is typically part of a database architecture that may include relational databases, NoSQL, search and even graph databases. Organizations can bulk load data into the Hadoop Distributed File System (HDFS) and process it with MapReduce. YARN is a newer technology that's starting to gain traction; it enables multiple applications to run on top of HDFS and process data in many ways, but it's still early stage.
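The bulk-load-and-process pattern is easiest to picture with a word count, the canonical MapReduce example. The sketch below is a toy, single-process simulation in plain Python, with a `sorted` call standing in for Hadoop's shuffle/sort phase; a real job would run the mapper and reducer across the cluster, for example via Hadoop Streaming.

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce phase: sum the counts for each word; pairs arrive grouped
    # by key, which Hadoop's shuffle/sort phase guarantees
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

lines = ["big data big hadoop", "hadoop data"]
shuffled = sorted(mapper(lines))   # stand-in for the shuffle/sort phase
counts = dict(reducer(shuffled))
print(counts)  # {'big': 2, 'data': 2, 'hadoop': 2}
```

The same mapper/reducer contract scales from this toy input to cluster-sized data because each phase only sees a stream of key/value pairs.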

What's missing? Real-time applications. And that's an understatement: reliability and security have also been question marks, as has limited support for SQL-based analytics. Complex configuration makes Hadoop difficult to apply.

MarkLogic allows users to deploy an Enterprise NoSQL database into an existing Hadoop implementation and offers many advantages including:

  • Real time access to your data
  • Less data movement
  • Mixed workloads within the same infrastructure
  • Cost effective long term storage
  • The ability to leverage your existing infrastructure

Since all of your MarkLogic data, including indexes, can be stored in HDFS, you can combine local storage for active, real-time results with lower-cost tiered storage (HDFS) for data that's less relevant or needs additional processing. MarkLogic lets you partition your data, then rebalance and migrate partitions interactively.
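The tiering decision itself is simple to sketch: partitions touched recently stay on fast local storage, older ones migrate to cheaper HDFS-backed storage. The partition names, dates and 30-day threshold below are invented for illustration; MarkLogic's actual tiering is driven by its own configuration, not this code.

```python
from datetime import date, timedelta

def assign_tier(last_accessed, today, hot_days=30):
    # Recently used partitions stay on fast local storage ("local");
    # the rest can migrate to cheaper HDFS-backed storage ("hdfs")
    return "local" if today - last_accessed <= timedelta(days=hot_days) else "hdfs"

# Hypothetical partitions: (name, date last accessed)
partitions = [
    ("q1-2013", date(2013, 3, 31)),
    ("q2-2013", date(2013, 6, 10)),
    ("current", date(2013, 7, 18)),
]

today = date(2013, 7, 19)
plan = {name: assign_tier(d, today) for name, d in partitions}
print(plan)  # {'q1-2013': 'hdfs', 'q2-2013': 'hdfs', 'current': 'local'}
```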

What does this mean for you? You can optimize cost, performance and availability while also satisfying the needs of the business: real-time analytics, alerting and enterprise search. You can take data offline and bring it back instantly, since it's already indexed. You can still process your data with batch programs in Hadoop, but now all of this runs on a shared infrastructure.

To learn more about MarkLogic and Hadoop, visit this Resource Center

When the video is live, I'll send a link out.

Bryan Borda's curator insight, July 19, 2013 11:39 AM

Excellent information on advantages to using NoSQL technology with a Hadoop infrastructure.  Take advantage of the existing Hadoop environment by adding powerful NoSQL features to enhance the value.

Scooped by Tony Agresta!

Big Data “Hype” Coming To An End | SiliconANGLE

Tony Agresta's insight:

"Organizations have fascinating ideas, but they are disappointed with a difficulty in figuring out reliable solutions," writes Sicular of Gartner.


"Their disappointment applies to more advanced cases of sentiment analysis, which go beyond traditional vendor offerings.  Difficulties are also abundant when organizations work on new ideas, which depend on factors that have been traditionally outside of their industry competence, e.g. linking a variety of unstructured data sources.”


Today, organizations are coming to the realization that free or low-cost open source technology for handling big data requires intense development cycles that burn time and money. Solving demanding challenges in these four areas has proven difficult:


  • Search & Discovery
  • Content Delivery
  • Analytics and Information Products
  • Data Consolidation


Organizations need to work with proven technology that's reliable and durable: technology that handles ACID transactions, enterprise security, high availability, replication, real-time indexing and alerting, without having to write 10,000+ lines of code.


Major financial institutions, healthcare payors, government agencies, media giants, energy companies, and state and local organizations have standardized on big data technology proven to increase developer productivity, create new revenue streams and address mission-critical operations in a post-9/11 era.


Adrian Carr's curator insight, February 11, 2013 11:11 AM

IT does it again. Build a technology up until we start to believe it will solve all the world's problems. It generates huge "science projects", and then everything comes tumbling down. Finally a voice of reason says... maybe we set expectations unrealistically... One more trough of disillusionment!

Scooped by Tony Agresta!


Tony Agresta's insight:

Followers may be interested in this white paper from The Bloor Group, which summarizes the differences between database technologies. It's meaty.

Here are a few additional points that Bloor has written about MarkLogic's Enterprise NoSQL approach:

  • MarkLogic is also a true transactional database. Most NoSQL databases have compromised the ACID (Atomicity, Consistency, Isolation and Durability) properties that are important for transaction processing. MarkLogic is fully equipped to be a transactional database, and if you simply wanted to use it for order processing, there would be no problem in doing so.
  • The database has been built to enable rapid search of its content in a similar manner to the way that Google’s search capabilities have been built to enable rapid search of the Internet.
  • As some of MarkLogic’s implementations have grown to above the single petabyte level, fast search of massive amounts of data is one of its most important features. To enable its search capability MarkLogic indexes everything on ingest; not just the data, but also the XML metadata. This provides it with the ability to search both text and structure. For example, you might want to quickly find someone’s phone number from a collection of emails.
  • With MarkLogic you could pose a query such as: “Find all emails sent by Jonathan Jones, sort in reverse order by time and locate the latest email that contains a phone number in its signature block.”
  • You may be able to deduce from this that MarkLogic knows what an email is, knows how to determine who the sender is, knows what a signature block is and knows how to identify a phone number within the signature block. If you were looking for a mobile phone number, you would simply add the word "mobile" in front of "phone number". It should be clear from this that very few databases could handle such a query, because most databases are straitjacketed by whatever version of SQL they implement and, even if it were possible to bend SQL to formulate this kind of query, most databases cannot dig into the data structures they hold in the way that MarkLogic can.
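To make that query concrete, here is a rough stand-in in plain Python rather than MarkLogic's actual query language (XQuery over indexed documents): emails pre-parsed into fields, a filter on sender, a regex over the signature block, and a sort by time. The sample emails and phone-number format are invented for illustration.

```python
import re

# Hypothetical pre-parsed email documents; a document database with
# structural indexes can filter on these fields without scanning bodies
emails = [
    {"from": "Jonathan Jones", "sent": "2013-07-10T09:00",
     "body": "Meeting notes attached.\n--\nJonathan Jones\nAnalyst"},
    {"from": "Jonathan Jones", "sent": "2013-07-18T14:30",
     "body": "Final report.\n--\nJonathan Jones\nTel: 555-0142"},
    {"from": "Ann Smith", "sent": "2013-07-19T08:00",
     "body": "Hello.\n--\nAnn Smith\nTel: 555-0199"},
]

PHONE = re.compile(r"\b\d{3}-\d{4}\b")

def signature(body):
    # Treat everything after the "--" delimiter as the signature block
    return body.partition("\n--\n")[2]

# "Find all emails sent by Jonathan Jones ... latest one with a phone
# number in its signature block"
hits = [e for e in emails
        if e["from"] == "Jonathan Jones" and PHONE.search(signature(e["body"]))]
latest = max(hits, key=lambda e: e["sent"])
print(latest["sent"])  # 2013-07-18T14:30
```

The point of the example is structural awareness: the query reasons about senders and signature blocks as first-class parts of the document, not as opaque text.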

With the release of MarkLogic 6 last fall, MarkLogic also added SQL support through integration with Tableau and Cognos, in-database analytic functions, JSON support, Java and REST APIs and more. For more information on this release, you can go here:

No comment yet.
Scooped by Tony Agresta!

The role of the Data Scientist in Big Data | TechRepublic

The role of the Data Scientist can be wide-ranging while critical to large-scale Big Data efforts. Will Kelly peers into the role of the Data Scientist...
Tony Agresta's insight:

As I read this article, a few additional points came to mind. As Data Scientists form a data plan, they typically take an inventory of available data, including any "dark data": unstructured data the organization is not using today. The inventory will likely span disparate sources, and the analytical advantages of integrating them could improve the chances of exceeding business goals. During the inventory process, Data Scientists need to assess how complete each field of data is. How dirty is it? Do you have what you think you'll need to achieve your objectives, or do you need to collect new information?

The data inventory is directly tied to analysis of the data, which is, in turn, tied to your goals. One way to think about this: for each goal the business has, form a set of questions to be resolved through big data analysis. Resolving these questions will prove or disprove the hypothesis you have formulated. The data visualizations you perform, the predictive models you build and the dashboards you derive should all support resolving those questions, and therefore the hypothesis you are testing.

A data inventory is essential. Listing a set of questions you want to resolve in support of organizational goals is essential. Using a variety of analytical approaches to answer these questions will help you create and manage a complete big data program.
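The field-completeness check described above takes only a few lines; the records and fields below are invented for illustration.

```python
def completeness(records):
    # Fraction of records carrying a usable (non-empty) value per field
    fields = {f for r in records for f in r}
    return {f: sum(1 for r in records if r.get(f) not in (None, "")) / len(records)
            for f in sorted(fields)}

# Hypothetical inventory sample: one clean field, two patchy ones
records = [
    {"name": "Acme Corp", "sector": "Energy",  "phone": "555-0101"},
    {"name": "Beta LLC",  "sector": "",        "phone": None},
    {"name": "Gamma Inc", "sector": "Finance", "phone": ""},
]

scores = completeness(records)
print(scores)  # name fully populated; sector 2/3 complete; phone only 1/3
```

Running this kind of profile per source early in the inventory tells you which fields can support the questions you plan to answer and which need new collection efforts.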

No comment yet.