BigData NoSql and Data Stuff
Topics about big data, nosql, machine learning, hadoop, elasticsearch, solr, databases, crawling
Curated by Alex Kantone
Scooped by Alex Kantone

Analyzing Twitter Data using Datasift, MongoDB and Pig


If you followed our recent postings on the updated Oracle Information Management Reference Architecture, one of the key concepts we talk about is the “data reservoir”. This is a pool of additional data that you can add to your data warehouse, typically stored on Hadoop or NoSQL databases, where you store unstructured, semi-structured or unprocessed structured data in large volume and at low cost. Adding a data reservoir gives you the ability to leverage the types of data sources that previously were thought of as too “messy”, too high-volume to store for any serious amount of time, or requiring processing or storage by tools that aren’t in the usual relational data warehouse toolset.

Scooped by Alex Kantone

ZooKeeper - The King of Coordination

Let's explore Apache ZooKeeper, a distributed coordination service for distributed systems. Needless to say, there are plenty of use cases! At Found, for example, we use ZooKeeper extensively for discovery, resource allocation, leader election and high priority notifications. In this article, we'll introduce you to this King of Coordination and look closely at how we use ZooKeeper at Found.

 

What Is ZooKeeper?

ZooKeeper is a coordination service for distributed systems. By providing a robust implementation of a few basic operations, ZooKeeper simplifies the implementation of many advanced patterns in distributed systems.
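To make that concrete, here is a minimal sketch of two of those basic patterns using the third-party Python client kazoo (an assumption on my part; the article does not prescribe a client): registering a service instance as an ephemeral znode for discovery, and running a leader election. Paths and addresses are hypothetical.

    # Minimal ZooKeeper coordination sketch with the kazoo Python client.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")   # hypothetical ensemble address
    zk.start()

    # Service discovery: register this instance as an ephemeral, sequential znode.
    # The node disappears automatically if the process dies or loses its session.
    zk.ensure_path("/services/search")
    zk.create("/services/search/node-", b"10.0.0.5:9200",
              ephemeral=True, sequence=True)
    print(zk.get_children("/services/search"))

    # Leader election: only one participant at a time runs the callback.
    def lead():
        print("I am the leader; doing high-priority work")

    election = zk.Election("/election/search-indexer", "worker-1")
    election.run(lead)   # blocks until elected, then calls lead()

    zk.stop()

Because the registration znode is ephemeral, it vanishes with the session, which is what makes discovery and failover patterns so simple to build on top of ZooKeeper.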

ZooKeeper as a Distributed File System
ZooKeeper as a Message Queue
How Does It Work?
What We Don’t Use ZooKeeper For
Rescooped by Alex Kantone from Programming Stuffs

How to Become a Data Scientist


Summary:  If you are wondering how to become a Data Scientist or what that title really means, try these insights.

I got started in data science way back.  I’ve been a commercial predictive modeler since 2001 and as naming trends have changed I now identify myself as a Data Scientist.  No one gave me this title.  But by observing the literature, the job listings, and my peers in the field it was clear that Data Scientist communicated most clearly what my knowledge and experience have led me to become.

These days you can get a degree in data science so you can show your diploma that certifies your credentials.  But these are relatively new so, with all due respect, if you only recently got your degree you are still a beginner.  Those of us who use this title today most likely came from combination backgrounds of business, hard science, computer science, operations research, and statistics.

What you call yourself is one thing but what your employer or client is looking for can be quite a different kettle of fish.  A lot has been written about data scientists being as elusive as unicorns.  Not being a unicorn I’d say this sets the bar pretty high.  Additionally, as I’ve perused the job listings it is equally true that the title is used so loosely and with such little understanding that an ad for data scientist may actually describe an entry level analyst and some ads for analysts are looking for polymath data scientists. 

All of this confusion over what we’re called and what we actually do can make you downright schizophrenic.  This makes it all the more complicated to answer the frequent inquiries I get from folks still in school or early in their career about how to become a data scientist.

Imagine my surprise and delight when in the space of a week two publications came across my desk that not only cast new light and understanding on this question but also have helped me understand that there is not just one definition of data scientist, but a reasoned argument (based on statistical analysis) that there are in fact four types.

Four Types of Data Scientists

The information here comes from the O’Reilly paper “Analyzing the Analyzers” by Harris, Murphy, and Vaisman, 2013.  My hat’s off to these folks for their insightful survey and conclusions drawn by statistical analysis of those results.  This is a must read.  I was able to download this at no charge from http://www.oreilly.com/data/free/analyzing-the-analyzers.csp.

There are 40 pages of good analysis here, so this will be only the highest-level summary.  In short, they conclude there are four types of Data Scientists, differentiated not so much by breadth of knowledge, which is similar, but by their depth in specific areas and how each type prefers to interact with data science problems.

Data Businesspeople

Data Creatives

Data Developers

Data Researchers

By evaluating 22 specific skills and multi-part self-identification statements, the authors cluster respondents and generalize according to these descriptions.  I am betting you will recognize yourself in one of these categories.

Data Businesspeople are those that are most focused on the organization and how data projects yield profit. They were most likely to rate themselves highly as leaders and entrepreneurs, and the most likely to have reported managing an employee. They were also quite likely to have done contract or consulting work, and a substantial proportion have started a business. Although they were the least likely to have an advanced degree among respondents, they were the most likely to have an MBA. But Data Businesspeople definitely have technical skills and were particularly likely to have undergraduate Engineering degrees. And they work with real data — about 90% report at least occasionally working on gigabyte-scale problems. 

Data Creatives.  Data scientists can often tackle the entire soup-to-nuts analytics process on their own: from extracting data, to integrating and layering it, to performing statistical or other advanced analyses, to creating compelling visualizations and interpretations, to building tools to make the analysis scalable and broadly applicable. We think of Data Creatives as the broadest of data scientists, those who excel at applying a wide range of tools and technologies to a problem, or creating innovative prototypes at hackathons — the quintessential Jack of All Trades. They have substantial academic experience with about three-quarters having taught classes and presented papers. Common undergraduate degrees were in areas like Economics and Statistics. Relatively few Data Creatives have a PhD. As the group most likely to identify as a Hacker they also had the deepest Open Source experience with about half contributing to OSS projects and about half working on Open Data projects.

Data Developer.  We think of Data Developers as people focused on the technical problem of managing data — how to get it, store it, and learn from it. Our Data Developers tended to rate themselves fairly highly as Scientists, although not as highly as Data Researchers did. This makes sense particularly for those closely integrated with the Machine Learning and related academic communities. Data Developers are clearly writing code in their day-to-day work. About half have Computer Science or Computer Engineering degrees.  More Data Developers land in the Machine Learning/ Big Data skills group than other types of data scientist.

Data Researchers.  One of the interesting career paths that leads to a title like “data scientist” starts with academic research in the physical or social sciences, or in statistics. Many organizations have realized the value of deep academic training in the use of data to understand complex processes, even if their business domains may be quite different from classic scientific fields. The majority of respondents whose top Skills Group was Statistics ended up in this category. Nearly 75% of Data Researchers have published in peer-reviewed journals and over half have a PhD.

What Does this Mean for Someone Seeking to Enter the Field?

So if I am a young person seeking to enter Data Science how are these descriptions useful?  It’s possible that you could train and develop an emphasis that would lead you into the Researcher, Developer, or Creative roles.  It is less likely that education alone will put you on the Businesspeople track which implies experiences in business, not just education.  But here’s what’s interesting.  According to Harris, Murphy, and Vaisman it’s not the skills that are different but the way we choose to emphasize them in our approach to Data Science problems.  Here’s their chart.

Scooped by Alex Kantone

100 Search Engines For Academic Research


Back in 2010, we shared with you 100 awesome search engines and research resources in our post: 100 Time-Saving Search Engines for Serious Scholars. It’s been an incredible resource, but now, it’s time for an update. Some services have moved on, others have been created, and we’ve found some new discoveries, too. Many of our original 100 are still going strong, but we’ve updated where necessary and added some of our new favorites, too. Check out our new, up-to-date collection to discover the very best search engine for finding the academic results you’re looking for.

 

General

Need to get started with a broader search? These academic search engines are great resources.

iSEEK Education: iSeek is an excellent targeted search engine, designed especially for students, teachers, administrators, and caregivers. Find authoritative, intelligent, and time-saving resources in a safe, editor-reviewed environment with iSEEK.
RefSeek: With more than 1 billion documents, web pages, books, journals, newspapers, and more, RefSeek offers authoritative resources in just about any subject, without all of the mess of sponsored links and commercial results.
Virtual LRC: The Virtual Learning Resources Center has created a custom Google search, featuring only the best of academic information websites. This search is curated by teachers and library professionals around the world to share great resources for academic projects.
Academic Index: This scholarly search engine and web directory was created just for college students. The websites in this index are selected by librarians, teachers, and educational consortia. Be sure to check out their research guides for history, health, criminal justice, and more.
BUBL LINK: If you love the Dewey Decimal system, this Internet resource catalog is a great resource. Search using your own keywords, or browse subject areas with Dewey subject menus.
Digital Library of the Commons Repository: Check out the DLC to find international literature including free and open access full-text articles, papers, and dissertations.
OAIster: Search the OAIster database to find millions of digital resources from thousands of contributors, especially open access resources.
Internet Public Library: Find resources by subject through the Internet Public Library’s database.
Infomine: The Infomine is an incredible tool for finding scholarly Internet resource collections, especially in the sciences.
Microsoft Academic Search: Microsoft’s academic search engine offers access to more than 38 million different publications, with features including maps, graphing, trends, and paths that show how authors are connected.
Google Correlate: Google’s super cool search tool will allow you to find searches that correlate with real-world data.
Wolfram|Alpha: Using expert-level knowledge, this search engine doesn’t just find links; it answers questions, does analysis, and generates reports.
Scooped by Alex Kantone

Why Apache Spark is a Crossover Hit for Data Scientists


Spark is a compelling multi-purpose platform for use cases that span investigative, as well as operational, analytics.

Data science is a broad church. I am a data scientist — or so I’ve been told — but what I do is actually quite different from what other “data scientists” do. For example, there are those practicing “investigative analytics” and those implementing “operational analytics.” (I’m in the second camp.)

 

Data scientists performing investigative analytics use interactive statistical environments like R to perform ad-hoc, exploratory analytics in order to answer questions and gain insights. By contrast, data scientists building operational analytics systems have more in common with engineers. They build software that creates and queries machine-learning models that operate at scale in real-time serving environments, using systems languages like C++ and Java, and often use several elements of an enterprise data hub, including the Apache Hadoop ecosystem.

And there are subgroups within these groups of data scientists. For example, some analysts who are proficient with R have never heard of Python or scikit-learn, or vice versa, even though both provide libraries of statistical functions that are accessible from a REPL (Read-Evaluate-Print Loop) environment.

A World of Tradeoffs

It would be wonderful to have one tool for everyone, and one architecture and language for investigative as well as operational analytics. If I primarily work in Java, should I really need to know a language like Python or R in order to be effective at exploring data? Coming from a conventional data analyst background, must I understand MapReduce in order to scale up computations? The array of tools available to data scientists tells a story of unfortunate tradeoffs:

R offers a rich environment for statistical analysis and machine learning, but it has some rough edges when performing many of the data processing and cleanup tasks that are required before the real analysis work can begin. As a language, it’s not similar to the mainstream languages developers know.
Python is a general purpose programming language with excellent libraries for data analysis like Pandas and scikit-learn. But like R, it’s still limited to working with an amount of data that can fit on one machine.
It’s possible to develop distributed machine learning algorithms on the classic MapReduce computation framework in Hadoop (see Apache Mahout). But MapReduce is notoriously low-level and difficult to express complex computations in.
Apache Crunch offers a simpler, idiomatic Java API for expressing MapReduce computations. But still, the nature of MapReduce makes it inefficient for iterative computations, and most machine learning algorithms have an iterative component.

And so on. There are both gaps and overlaps between these and other data science tools. Coming from a background in Java and Hadoop, I do wonder with envy sometimes: why can’t we have a nice REPL-like investigative analytics environment like the Python and R users have? That’s still scalable and distributed? And has the nice distributed-collection design of Crunch? And can equally be used in operational contexts?

Scooped by Alex Kantone

Sparse Index: MongoDB


Sparse indexes are not the same as unique indexes, and not every unique index is sparse. There are situations where we want to create a unique index on a key/field which is not present in all documents. Wherever the key is not present, its value will be treated as NULL. If more than one document has a NULL value for the key we want to make unique, that violates the unique key rule. In such situations we can make use of a sparse index and enforce the unique key only on those documents that actually contain the key.
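For illustration, here is a minimal PyMongo sketch of that pattern; the collection and field names are hypothetical.

    # Sparse unique index sketch with PyMongo.
    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://localhost:27017")
    users = client.testdb.users

    users.insert_many([
        {"name": "alice", "email": "alice@example.com"},
        {"name": "bob"},      # no "email" field
        {"name": "carol"},    # also no "email" field
    ])

    # A plain unique index would fail here: both "bob" and "carol" would be
    # indexed with a null email, violating uniqueness. A sparse unique index
    # simply skips documents that do not contain the field.
    users.create_index([("email", ASCENDING)], unique=True, sparse=True)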

Scooped by Alex Kantone

How 5 Natural Language Processing APIs Stack Up


The world is awash in digital data. The challenge: making sense of that data. To tackle that challenge, a growing number of companies are turning to natural language processing technology to understand and monetize their data.

Natural language processing, or NLP, refers to a field of technology focused on the application of algorithms and mathematical models to analyze human language. Its use has grown sharply as companies grapple with data volumes that make it virtually impossible to perform data analysis using techniques that require significant human involvement. Popular uses of NLP include content classification, sentiment analysis and automated summarization. For instance, media organizations may use NLP-based platforms to categorize, tag and summarize content, and many brands commonly employ tools that use NLP to determine if the social media buzz around their marketing campaigns is positive or negative.

Fortunately, what is a technically complicated field of computing is now accessible to even the smallest of businesses thanks to the existence of companies that provide NLP as a service. This article explores and compares five of the leading NLP service providers that offer API integration.

These service providers were selected based on the following criteria:

A live NLP-focused API offering that gives users access to at least several common low-level NLP functions.
Availability of public documentation and pricing information.
Self-serve registration/subscription.



Scooped by Alex Kantone

Big Data Benchmark - Redshift, Hive, Shark, Impala, Tez


Several analytic frameworks have been announced in the last year. Among them are inexpensive data-warehousing solutions based on traditional Massively Parallel Processor (MPP) architectures (Redshift), systems which impose MPP-like execution engines on top of Hadoop (Impala, HAWQ) and systems which optimize MapReduce to improve performance on analytical workloads (Shark, Stinger/Tez). This benchmark provides quantitative and qualitative comparisons of five systems. It is entirely hosted on EC2 and can be reproduced directly from your computer.

Redshift - a hosted MPP database offered by Amazon.com based on the ParAccel data warehouse. We tested Redshift on HDDs.
Hive - a Hadoop-based data warehousing system. (v0.12)
Shark - a Hive-compatible SQL engine which runs on top of the Spark computing framework. (v0.8.1)
Impala - a Hive-compatible* SQL engine with its own MPP-like execution engine. (v1.2.3)
Stinger/Tez - Tez is a next generation Hadoop execution engine currently in development. (v0.2.0)

This remains a work in progress and will evolve to include additional frameworks and new capabilities. We welcome contributions.

What this benchmark is not

This benchmark is not intended to provide a comprehensive overview of the tested platforms. We are aware that by choosing default configurations we have excluded many optimizations. The choice of a simple storage format, compressed SequenceFile, omits optimizations included in columnar formats such as ORCFile and Parquet. For now, we've targeted a simple comparison between these systems with the goal that the results are understandable and reproducible.

What is being evaluated?

This benchmark measures response time on a handful of relational queries: scans, aggregations, joins, and UDFs, across different data sizes. Keep in mind that these systems have very different sets of capabilities. MapReduce-like systems (Shark/Hive) target flexible and large-scale computation, supporting complex User Defined Functions (UDFs), tolerating failures, and scaling to thousands of nodes. Traditional MPP databases are strictly SQL compliant and heavily optimized for relational queries. The workload here is simply one set of queries that most of these systems can complete.

Scooped by Alex Kantone

MongoDB and Indexing: Best Practices


Back in May last year, we talked about a set of best practices for indexing and MongoDB, focusing on performance. Since then MongoDB 2.6 has arrived and it’s a good time to update that list.

One Index per Query

MongoDB can only use one index in any one query operation. Although MongoDB 2.6 has introduced index intersection, which allows more than one index to be used (specifically two indexes currently), there are a number of caveats on their use and one index, whether singular or compound, is still easier to manage and optimise around.

One “Multi-Value” Operator per Query

There’s a range of selectors we call “multi-value” operators. They can return records that match a variety of different values. They include $gt, $gte, $lt and $lte, $in and $nin, $ne, $not and $near. If you need to use one, use one and no more. This is because the “multi-value” operators can return an unpredictable number of values and are more likely to be scanned to be evaluated – the more you have to scan, the slower the query.
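A hedged PyMongo sketch of both guidelines follows: one compound index designed for the query shape, and only a single multi-value operator in the query itself. The collection, fields and values are illustrative only.

    # Compound index plus a single range operator, sketched with PyMongo.
    from pymongo import MongoClient, ASCENDING, DESCENDING

    orders = MongoClient().shop.orders

    # One compound index built for the query below: equality fields first,
    # the range field last, so the whole query can be served by one index.
    orders.create_index([("status", ASCENDING),
                         ("customer_id", ASCENDING),
                         ("created_at", DESCENDING)])

    # Equality matches on status/customer_id plus a single $gte range operator.
    cursor = orders.find({
        "status": "shipped",
        "customer_id": 42,
        "created_at": {"$gte": "2014-01-01"},
    }).sort("created_at", DESCENDING)

    print(cursor.explain())   # inspect which index the planner chose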

Scooped by Alex Kantone

Java Clients for Elasticsearch

One of the important aspects of Elasticsearch is that it is programming language independent. All of the APIs for indexing, searching and monitoring can be accessed using HTTP and JSON, so it can be integrated in any language that has those capabilities. Nevertheless Java, the language Elasticsearch and Lucene are implemented in, is very dominant. In this post I would like to show you some of the options for integrating Elasticsearch with a Java application.

 

The Native Client

The obvious first choice is to look at the client Elasticsearch provides natively. Unlike other solutions there is no separate jar file that just contains the client API; instead you integrate the whole Elasticsearch application. Partly this is caused by the way the client connects to Elasticsearch: it doesn’t use the REST API but connects to the cluster as a cluster node. This node normally doesn’t contain any data but is aware of the state of the cluster.

The node client integrates with your Elasticsearch cluster

On the right side we can see two normal nodes, each containing two shards. Each node of the cluster, including our application’s client node, has access to the cluster state as indicated by the cylinder icon. That way, when requesting a document that resides on one of the shards of Node 1 your client node already knows that it has to ask Node 1. This saves a potential hop that would occur when asking Node 2 for the document that would then route your request to Node 1 for you.

Creating a client node in code is easy. You can use the NodeBuilder to get access to the Client interface. This then has methods for all of the API functionality, e.g. for indexing and searching data.

Scooped by Alex Kantone

Why Microsoft's Azure Machine Learning is such a big deal


Microsoft made two big announcements on Monday, showcasing two different aspects of its changes. While developer access to new builds of Internet Explorer would have been a showstopper on most days, the biggest news came an hour or so later with the unveiling of Azure Machine Learning (ML).


Dig down into the heart of any big cloud service, whether it's Amazon, Google, or Bing, and you'll find machine learning. It's one of the technologies that powers a new generation of cloud scale AI-based technologies, using big data to learn how to respond to a wide range of inputs. A powerful set of tools, machine learning simplifies what were once complex programming tasks -- while allowing you to build software that can make inferences from your data sources.

But making machine learning easy to use is surprisingly difficult. In the past, building learning systems took time and required significant computing resources. It required data scientists to build the training sets and to guide the learning system through its first steps. It was a complex and expensive process.

Scooped by Alex Kantone

Elasticsearch’s behavior under various types of network failure.


Elasticsearch is a distributed search engine, built around Apache Lucene–a well-respected Java indexing library. Lucene handles the on-disk storage, indexing, and searching of documents, while ElasticSearch handles document updates, the API, and distribution. Documents are written to collections as free-form JSON; schemas can be overlaid onto collections to specify particular indexing strategies.

As with many distributed systems, Elasticsearch scales in two axes: sharding and replication. The document space is sharded–sliced up–into many disjoint chunks, and each chunk allocated to different nodes. Adding more nodes allows Elasticsearch to store a document space larger than any single node could handle, and offers quasilinear increases in throughput and capacity with additional nodes. For fault-tolerance, each shard is replicated to multiple nodes. If one node fails or becomes unavailable, another can take over. There are additional distinctions between nodes which can process writes, and those which are read-only copies–termed “data nodes”–but this is primarily a performance optimization.

Because index construction is a somewhat expensive process, Elasticsearch provides a faster, more strongly consistent database backed by a write-ahead log. Document creation, reads, updates, and deletes talk directly to this strongly-consistent database, which is asynchronously indexed into Lucene. Search queries lag behind the “true” state of Elasticsearch records, but should eventually catch up. One can force a flush of the transaction log to the index, ensuring changes written before the flush are made visible.
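As a rough illustration of those points, here is a sketch with the official Python client (elasticsearch-py): shard and replica settings at index creation, then a forced refresh/flush so a freshly written document becomes visible to search. The index and field names are made up, and exact call signatures vary slightly between client versions.

    # Sharding, replication and refresh/flush sketch with elasticsearch-py.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    es.indices.create(index="articles", body={
        "settings": {"number_of_shards": 5, "number_of_replicas": 1}
    })

    es.index(index="articles", id="1", body={"title": "Jepsen and Elasticsearch"})

    # Without a refresh, the document may not show up in searches until the
    # next periodic refresh; flush additionally persists the transaction log.
    es.indices.refresh(index="articles")
    es.indices.flush(index="articles")

    print(es.search(index="articles",
                    body={"query": {"match": {"title": "jepsen"}}}))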

But this is Jepsen, where nothing works the way it’s supposed to. Let’s give this system’s core assumptions a good shake and see what falls out!

Scooped by Alex Kantone

Using the Power of Real-Time Distributed Search with ElasticSearch


The Internet is a place where everyone in the world can find any information they want. But with billions of documents available on the web, how is it possible to find exactly what we want in seconds or less?

For this purpose, special programs called ‘search engines’ are developed, using many algorithms for analyzing and stemming text, building indexes and searching for query terms. In the Java world, one of the most popular open source libraries is Lucene from Apache. It is a high performance, reliable and widely used full-featured Information Retrieval library written in Java. On top of it a few servers are built, such as Solr, ElasticSearch and others.

Nowadays most companies are trying to move all computation into the cloud, and search is no exception. In this article I would like to consider ElasticSearch, which, besides many other features, was designed from the start to work in the cloud and is quite successful in accomplishing that mission.

Scooped by Alex Kantone

NoSQL Job Trends: August 2014


Yes, it is already September, but I am only a day late for the NoSQL installment of the August job trends. In this update, I am splitting the graphs into two as I include more products. So, for the NoSQL job trends, we will be looking at Cassandra, Redis, Couchbase, SimpleDB, CouchDB, MongoDB, HBase, Riak, Neo4j and MarkLogic. I am including Neo4j for two reasons. First, it is a graph database, which means it might have a different trend than the other solutions. Also, I have been seeing more mention of it in blog posts. The second new inclusion is MarkLogic, again for multiple reasons. First, I use it in my day job, so I am being somewhat selfish. Second, it is an XML-native document database and not open-source. Third, MarkLogic is making a push for semantics (meaning storage of RDF) and native JavaScript. So this installment is really a baseline for the MarkLogic trend. If I am missing popular options, or if you want to see what the trends are for other NoSQL solutions, please let me know.

First, we look at the long-term trends from Indeed for our first 5, MongoDB, Cassandra, Redis, SimpleDB and HBase:

Scooped by Alex Kantone

Machine Learning: Sentiment Analysis Algorithm Tutorial in JavaScript


This article demonstrates a simple but effective sentiment analysis algorithm built on top of the Naive Bayes classifier I demonstrated in the last ML in JS article. I’ll go over some basic sentiment analysis concepts and then discuss how a Naive Bayes classifier can be modified for sentiment analysis. If you’re just looking for the summary and code demonstration, jump down.

Introduction

Compared to the other major machine learning tasks, sentiment analysis has surprisingly little written about it. This is in part because sentiment analysis (and natural language in general) is difficult, but I also suspect that many people opt to monetize their sentiment algorithms rather than publishing them.

Sentiment analysis is highly applicable to enterprise business (“alert me any time someone writes something negative about one of our products anywhere on the internet!”), and sentiment analysis is fresh enough that there’s a lot of “secret sauce” still out there. And now that the world is run on social sites like Twitter, Facebook, Yelp, and Amazon (Reviews), there are literally billions of data points that can be analyzed every single day. So it’s hard to blame anyone for trying to get a piece of that pie!

Today we’ll discuss an “easy” but effective sentiment analysis algorithm. I put “easy” in quotes because our work today will build upon the Naive Bayes classifier we talked about last time; but this will only be “easy” if you’re familiar with those concepts already–you may want to review that article before continuing.

There are better, more sophisticated algorithms than the one we’ll develop today, but this is a great place to start. As always, I’ll follow up with additional articles that cover more advanced topics in sentiment analysis in the future.

I found a nice, compact data set for this experiment. It consists of 10,662 sentences from movie reviews, labeled “positive” and “negative” (and split evenly). The primary goal today is to figure out how to tweak the Naive Bayes classifier so that it works well for sentiment analysis.
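The article’s own implementation is in JavaScript; as a language-neutral companion, here is a compact Python sketch of the underlying idea: a Naive Bayes classifier over sentences labeled positive or negative, with deliberately minimal tokenisation and add-one smoothing. The training sentences are invented examples, not the movie-review data set.

    # Minimal Naive Bayes sentiment sketch (illustrative, not the article's code).
    import math
    from collections import Counter, defaultdict

    class NaiveBayesSentiment:
        def __init__(self):
            self.word_counts = defaultdict(Counter)   # label -> word frequencies
            self.label_counts = Counter()              # label -> document count
            self.vocab = set()

        def train(self, text, label):
            self.label_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

        def classify(self, text):
            total_docs = sum(self.label_counts.values())
            best_label, best_score = None, float("-inf")
            for label in self.label_counts:
                # log prior plus log likelihoods with add-one smoothing
                score = math.log(self.label_counts[label] / total_docs)
                denom = sum(self.word_counts[label].values()) + len(self.vocab)
                for word in text.lower().split():
                    score += math.log((self.word_counts[label][word] + 1) / denom)
                if score > best_score:
                    best_label, best_score = label, score
            return best_label

    clf = NaiveBayesSentiment()
    clf.train("a gripping and heartfelt film", "positive")
    clf.train("dull plot and wooden acting", "negative")
    print(clf.classify("heartfelt acting and a gripping plot"))   # -> positive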

Scooped by Alex Kantone

CIO Infographic: Big Data Through the Eyes of Your IT Staff


Big data is more than a mere technology concept or by-product of our information-driven world — it is a vision of how things could be ... or perhaps should be. To help CIOs gain the clarity they need to accomplish this vision, we sought the opinions of those most familiar with the best practices and pitfalls of big data deployment — IT staff.

Scooped by Alex Kantone

Text Mining & Graph Databases - Two Technologies that Work Well Together

Graph databases, also known as triplestores, have a very powerful capability – they can store hundreds of billions of semantic facts (triples) from any subject imaginable.  The number of free semantic facts on the market today from sources such as DbPedia, GeoNames and others is high and continues to grow every day.   Some estimates have this total between 150 and 200 billion right now.   As a result, Linked Open Data can be a good source of information with which to load your graph databases.
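As a small illustration of pulling Linked Open Data, here is a hedged Python sketch using the SPARQLWrapper library against the public DBpedia endpoint; the endpoint and query are examples only, not part of Ontotext’s described workflow.

    # Fetching a handful of semantic facts from DBpedia with SPARQLWrapper.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?company ?label WHERE {
            ?company a dbo:Company ;
                     rdfs:label ?label .
            FILTER (lang(?label) = "en")
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    for row in sparql.query().convert()["results"]["bindings"]:
        # Each row is a fact you could load into your own graph database.
        print(row["company"]["value"], "-", row["label"]["value"])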

Linked Open Data is one source of data. When does it become really powerful? When you create your own semantic triples from your own data and use them in conjunction with linked open data to enrich your database. This process, commonly referred to as text mining, extracts the salient facts from free flowing text and typically stores the results in some database. With this done, you can analyze your enriched data, visualize it, aggregate it and report on it. In a recent project Ontotext undertook on behalf of FIBO (Financial Industry Business Ontology), we enhanced the FIBO ontologies with Linked Open Data, allowing us to query company names and stock prices at the same time to show the lowest trading prices for all public stocks in North America in the last 50 years. To do this, we needed to combine semantic data sources, something that’s easy to do with the Ontotext Semantic Platform.

We have found that the optimal way to apply text mining is in conjunction with a graph database.  Many of our customers use our Text Mining to do just that.

Some vendors only sell graph databases and leave it up to you to figure out how to mine the text.  Other vendors only sell the text mining part and leave it up to you to figure out where to store them.  At Ontotext, we support both along with other semantic products and services to build a complete solution.   What do we do to extract text from documents?

I’ll explain this in layman’s terms that everyone can understand.  In reviewing countless diagrams and descriptions about how text mining works, I like to boil it down to a basic 5 step process.   Text mining purists  can surely add to this discussion and we encourage you to.  At the most basic level, here’s what happens…

See more at: http://www.ontotext.com/text-mining-graph-databases-work-well-together/

Scooped by Alex Kantone

Where There’s Spark There’s Fire: The State of Apache Spark in 2014

Alex Kantone's insight:

With the second Spark Summit behind us, we wanted to take a look back at our journey since 2009 when Spark, the fast and general engine for large-scale data processing, was initially developed. It has been exciting and extremely gratifying to watch Spark mature over the years, thanks in large part to the vibrant, open source community that latched onto it and busily began contributing to make Spark what it is today.

The idea for Spark first emerged in the AMPLab (AMP stands for Algorithms, Machines, and People) at the University of California, Berkeley. With its significant industry funding and exposure, the AMPlab had a unique perspective on what is important and what issues exist among early adopters of big data. We had worked with most of the early users of Hadoop and consistently saw the same issues arise. Spark itself started as the solution to one such problem—speeding machine learning applications on clusters, which machine learning researchers in the lab were having trouble doing using Hadoop. However, we soon realized that we could easily cover a much broader set of applications.

 

The Vision

When we worked with early Hadoop users, we saw that they were all excited about the scalability of MapReduce. However, as soon as these users began using MapReduce, they needed more than the system could offer. First, users wanted faster data analysis—instead of waiting tens of minutes to run a query, as was required with MapReduce’s batch model, they wanted to query data interactively, or even continuously in real-time. Second, users wanted more sophisticated processing, such as iterative machine learning algorithms, which were not supported by the rigid, one-pass model of MapReduce.

At this point, several systems had started to emerge as point solutions to these problems, e.g., systems that ran only interactive queries, or only machine learning applications. However, these systems were difficult to use with Hadoop, as they would require users to learn and stitch together a zoo of different frameworks to build pipelines. Instead, we decided to try to generalize the MapReduce model to support more types of computation in a single framework.

We achieved this using only two simple extensions to the model. First, we added support for storing and operating on data in memory—a key optimization for the more complex, iterative algorithms required in applications like machine learning, and one that proved shrewd with the continued drop in memory prices. Second, we modeled execution as general directed acyclic graphs (DAGs) instead of the rigid model of map-and-reduce, which led to significant speedups even on disk. With these additions we were able to cover a wide range of emerging workloads, matching and sometimes exceeding the performance of specialized systems while keeping a single, simple unified programming model.
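A minimal PySpark sketch of those two extensions follows: caching a dataset in memory across iterations, and letting chained transformations form the execution DAG. The input path and the toy gradient-descent loop are illustrative only.

    # Iterative computation over a cached RDD with PySpark.
    from pyspark import SparkContext

    sc = SparkContext(appName="iterative-example")

    # Load once and keep in memory; later passes reuse the cached data
    # instead of re-reading from disk.
    values = (sc.textFile("hdfs:///data/values.txt")   # hypothetical input, one number per line
                .map(float)
                .cache())

    n = values.count()
    estimate = 0.0
    for _ in range(10):
        # Simple gradient descent toward the mean: each pass launches a new
        # job over the cached RDD; the chained map/reduce steps form the DAG.
        gradient = values.map(lambda v: estimate - v).reduce(lambda a, b: a + b) / n
        estimate -= 0.5 * gradient

    print("estimated mean:", estimate)
    sc.stop()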

This decision allowed, over time, new functionality such as Shark (SQL over Spark), Spark Streaming (stream processing), MLlib (efficient implementations of machine learning algorithms), and GraphX (graph computation over Spark) to be built. These modules in Spark are not separate systems, but libraries that users can combine together into a program in powerful ways. Combined with the more than 80 basic data manipulation operators in Spark, they make it dramatically simpler to build big data applications compared to previous, multi-system pipelines. And we have sought to make them available in a variety of programming languages, including Java, Scala, and Python (available today) and soon R.

Scooped by Alex Kantone

CouchBase vs CouchDB vs MongoDB


In my last article I spoke about MongoDB, one of the most popular NoSQL databases, so now I’ll present CouchDB, another great NoSQL database that has some amazing features, and look at how it compares to MongoDB and Couchbase.
 
CouchDB is written in Erlang, a language highly optimized for concurrency, distribution and fault tolerance. Besides being fast, the name and logo hint that the database was designed with ease of use in mind. The focus was to increase developers’ productivity by providing easy-to-use tools.

Scooped by Alex Kantone

Predictive Analytics Comparisons, Data Mining 2014


Predictive analytics is concerned with trawling through historical data to find useful patterns which might be used in the future. As such it employs data mining techniques to find the patterns, and once found and verified they are applied via some scoring mechanism, where each new event is scored in some way (e.g. new loan applicants are scored as suitable or not). The data mining platforms compared in this article represent the most common alternatives many organizations will consider. The analysis is high level, and not a feature by feature comparison – which is fairly pointless in our opinion. The five criteria used to compare the products are:

Capability – the breadth of the offering.
Integration – how well the analytics environment integrates with data, production applications and management controls.
Extensibility – very important and a measure of how well a platform can be extended functionally and how well it scales.
Productivity – the support a platform offers for productive work.
Value – likely benefits versus costs.

This form of analysis creates some surprises, but you need to look at the full review to see why a particular offering does so well.

Scooped by Alex Kantone

ElasticSearch Query: Performance Optimisation


In one of my previous posts on Elasticsearch, I shared my understanding of Elasticsearch configurations and best practices. That was mostly from an indexing perspective. There are several tweaks one can use to optimise query performance as well. Improving querying time can be even more challenging than trying to improve indexing times. Let's see why querying is more of a challenge:

Queries can go on while index is getting updated
Different queries would need different strategies for optimisations
There are far more configurations that impact query performance:
    Query syntax/clauses used
    Index schema
    Elasticsearch configurations
    RAM, CPU, Network, IO

And there are times when you need to fire 2 or more queries in succession to get certain results back from ES. I have had one such scenario recently where I needed to fire 3 queries to ES and make sure that the response times were always less than a second. The 3 queries in question were related to each other in the sense that query 2 uses the output of query 1 and query 3 uses the output of query 2. For my use case, one of the queries was simple, while the other two were more complex as they had aggregations, stats, filters etc.
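To show what that chaining looks like in practice, here is a rough sketch of the pattern with the Python Elasticsearch client; the index names, fields and aggregations below are placeholders rather than the real queries.

    # Three chained Elasticsearch queries: each one feeds the next.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Query 1: a simple lookup that yields a set of ids.
    resp1 = es.search(index="index_a", body={
        "query": {"term": {"group_id": 1001}},
        "_source": ["item_id"], "size": 1000
    })
    item_ids = [hit["_source"]["item_id"] for hit in resp1["hits"]["hits"]]

    # Query 2: filter on the ids from query 1 and aggregate.
    resp2 = es.search(index="index_b", body={
        "size": 0,
        "query": {"terms": {"item_id": item_ids}},
        "aggs": {"by_owner": {"terms": {"field": "owner_id", "size": 50}}}
    })
    owner_ids = [b["key"] for b in resp2["aggregations"]["by_owner"]["buckets"]]

    # Query 3: stats over documents belonging to those owners.
    resp3 = es.search(index="index_c", body={
        "size": 0,
        "query": {"terms": {"owner_id": owner_ids}},
        "aggs": {"price_stats": {"stats": {"field": "price"}}}
    })
    print(resp3["aggregations"]["price_stats"])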

As outlined above, there are several things that can prevent an optimal response time. Also, to safely say that a desired response time has been achieved, one needs to test and test right. A poor testing method would lead to misleading performance statistics. Below are details of my testing methodology and tweaks that led to sub-second response times for 3 queries.

ElasticSearch Cluster and Indexes
5 Machines in the cluster
5 Shards per index
250 GB EBS volume on each machine to hold indexes
Indexes are stored as compressed
No indexing takes place while testing (my use case asks for indexing in batch once a day)
3 indexes
    Index A: with 24+ million records (used in 1st query). All integer fields. 4 fields.
    Index B: with 90+ million records (used in 2nd query). All integers. 3 fields.
    Index C: with 340K records (used in 3rd query). String, Integer and Date fields; only few fields used in querying.
Different machine types to hold ES indexes: m3.large to c3.4xlarge
RAM: different sizes for tests, starting from 4GB to 15GB given to ES instance.
Rescooped by Alex Kantone from Programming Stuffs

NoSQL Performance Benchmarks: Cassandra vs HBase vs MongoDB vs Redis vs MySQL


Apache Cassandra is a leading NoSQL database platform for online applications.  By offering benefits of continuous availability, high scalability & performance, strong security, and operational simplicity —  while lowering overall cost of ownership — Cassandra has become a proven choice for both technical and business stakeholders.

When compared to other database platforms such as HBase, MongoDB, Redis, MySQL and many others, Cassandra delivers higher performance under heavy workloads.

The following  benchmark tests provide a graphical, ‘at a glance’ view of how these platforms compare under different scenarios.


End Point Benchmark Configuration and Results

University of Toronto NoSQL Database Performance

Netflix Benchmarking Cassandra Scalability on AWS

 

End Point Benchmark Configuration and Results Summary

End Point, a database and open source consulting company, benchmarked the top NoSQL databases — Apache Cassandra, Apache HBase, and MongoDB — using a variety of different workloads on Amazon Web Services EC2 instances. This is an industry-standard platform for hosting horizontally scalable services such as the three NoSQL databases that were tested. In order to minimize the effect of AWS CPU and I/O variability, End Point performed each test 3 times on 3 different days. New EC2 instances were used for each test run to further reduce the impact of any “lame instance” or “noisy neighbor” effect on any one test.

A summary of the workload analysis is available below. For a review of the entire testing process with testing environment configuration details, the benchmarking NoSQL databases white paper by End Point is available.

Scooped by Alex Kantone

Get Started: Ambari for provisioning, managing and monitoring Hadoop


Ambari is 100% open source and included in HDP, greatly simplifying installation and initial configuration of Hadoop clusters. In this article we’ll be running through some installation steps to get started with Ambari. Most of the steps here are covered in the main HDP documentation here.

The first order of business is getting Ambari Server itself installed. There are different approaches to this, but for the purposes of this short tour, we’ll assume Ambari is already installed on its own dedicated node somewhere or on one of the nodes on the (future) cluster itself. Instructions can be found under the installation steps linked above. Once Ambari Server is running, the hard work is actually done. Ambari  simplifies cluster install and initial configuration with a wizard interface, taking care of it with but a few clicks and decisions from the end user. Hit http://<server_you_installed_ambari>:8080 and log in with admin/admin. Upon logging in, we are greeted with a user-friendly, wizard interface. Welcome to Apache Ambari! Name that cluster and let’s get going.

Scooped by Alex Kantone

10 of the most useful cloud databases


IDC predicts that big data is growing at an annual rate of 60% for structured and unstructured data. Businesses need to do something with all that data, and traditionally databases have been the answer. With cloud technology, providers are rolling out more ways to host those databases in the public cloud, freeing users from having to dedicate their own hardware to these databases, while providing the ability to scale the databases to large capacities. "This is a really huge market given all the data out there," says Jeff Kelly, a big data expert at research firm Wikibon. "The cloud is going to be the destination for a lot of this big data moving forward."

Some concerns remain for what some call database as a service (DBaaS), specifically around sensitive information being stored in the cloud and around cloud outages. But still, an emerging market of cloud database services and tools seems to be picking up steam. Here, Network World looks at 10 cloud database tools. Some of these are providers of direct relational, SQL or NoSQL databases, while others are niche focused on various open source databases. Please note this list is not meant to be exhaustive, as some big players, like Oracle, HP and EMC/VMware are still rounding out their cloud-based products and strategies for these tools.

Scooped by Alex Kantone

Facebook Apollo NoSQL Database


Facebook’s latest project is a NoSQL database called Apollo that provides online low latency hierarchical storage.

The details of the database project were revealed at QCon New York on Wednesday by Jeff Johnson, a software engineer in Facebook’s Core Data group. He described Apollo as a distributed database built around strong consistency, using Paxos-style quorum protocols.

Paxos is a family of quorum consensus protocols, originally defined for deriving a single agreed result from a number of possibilities on a network of unreliable processors. It can be used in replicated databases to overcome the problems caused if distributed servers fail. In order for an update to be accepted it must be voted on by a majority of servers within a shard, and updates are only completed when they make their way to a majority of servers.
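A toy Python sketch of that majority rule follows; it is purely an illustration of the quorum arithmetic, not Facebook's actual protocol.

    # Quorum write sketch: an update is accepted only if a majority of the
    # replicas in a shard acknowledge it.
    def majority(replica_count):
        return replica_count // 2 + 1

    def quorum_write(replicas, send_update):
        acks = sum(1 for replica in replicas if send_update(replica))
        return acks >= majority(len(replicas))

    # With 5 replicas, 3 acknowledgements are required; here only 2 nodes
    # respond, so the update is rejected and must be retried.
    replicas = ["node1", "node2", "node3", "node4", "node5"]
    alive = {"node1", "node3"}
    accepted = quorum_write(replicas, lambda r: r in alive)
    print(majority(len(replicas)), accepted)   # -> 3 False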

Distributed databases suffer from a problem described using the CAP or Brewer’s theorem, which states that a distributed database can’t achieve the following all at the same time:

Consistency of data across nodes
Availability in the sense of always responding to requests
Partition tolerance in working even if part of the network is unavailable