When you use the analytical process known as discovery, I recommend that you look for tools and environments that allow you to connect to NoSQL platforms.
Tony Agresta's insight:
The convergence of data visualization and NoSQL is becoming a hotter topic every day. We're at the very beginning of this movement as organizations integrate many forms of data with technology to visualize relationships and detect patterns across and within data sets. There aren't many vendors that do this well today and demand is growing. Some organizations are trying to achieve big data visualization through data science as a service. Some software companies have created connectors to NoSQL (and other) data sources to reach this goal. As you would expect, deployment options run the gamut.
Examples of companies that offer data visualization generated from a variety of data sources, including NoSQL, are Centrifuge Systems, which displays results in the form of relationship graphs; Pentaho, which provides a full array of analytics including data visualization and predictive analytics; and Tableau, which supports dozens of data sources along with great charting and other forms of visualization. Regardless of which you choose (and there are others), the process you apply to select and analyze the data will be important.
In the article, John L Myers discusses some of the challenges users face with data discovery technology (DDT). Since DDT operates from the premise that you don’t know all the answers in advance, it’s more difficult to pinpoint the sources needed in the analysis. Analysts discover insights as they navigate through the data visualizations. This challenge isn’t too distant from what predictive modelers face as they decide what variables they want to feed into models. They oftentimes don’t know what the strongest predictors will be, so they apply their experience to carefully select data. They sometimes transform specific fields, allowing an attribute to exhibit greater explanatory power. BI experts have long struggled with the same issue as they try to decide what metrics and dashboards will be most useful to the business.
Here are some guidelines that may help you solve the problem. They can be used to plan your approach to data analysis.
Start by writing down a hypothesis you want to prove before you connect to specific sources. What do you want to explore? What do you want to prove? In some cases, you'll want to prove many things. That's fine. Write down your top ones.
For each hypothesis create a list of specific questions you want to ask the data that could prove or disprove the hypothesis. You may have 20 or 30 questions for each hypothesis.
Find the data sources that have the data you need to answer the questions. What data will you need to arrive at a conclusion?
Begin to profile each field to see how complete the data is. In other words, take an inventory of the data, checking whether there are missing values or data quality errors and whether the values make the specific source a good one. This may point back to changes in data collection needed by your current systems or processes.
Go a layer deeper in your charting and profiling beyond histograms to show relationships between variables you believe will be helpful as you attempt to answer your list of questions and prove or disprove your hypothesis. Show some relationships between two or more variables using heat maps, cross tabs and drill charts.
Reassess your original hypothesis. Do you have the necessary data? Or do you need to request additional types of data?
Once you are set on the inventory of data and you have the tools to connect to those sources, create a set of visualizations to answer each of the questions. In some cases, it may take 4 or 5 visualizations for each question. Sometimes, you will be able to answer the question with one visualization.
Assemble the results for each question to prove or disprove the hypothesis. You should arrive at a nice storyboard approach that, when assembled in the right order, allows you to articulate the steps in the analysis and draw conclusions needed to run your business.
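The profiling and cross tab steps in the guidelines above can be sketched in a few lines of Python. This is a minimal illustration only, not any vendor's method: the sample data, the column names and the helper functions (`load_rows`, `profile_completeness`, `cross_tab`) are all assumptions made up for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; in practice this would come from your own source.
SAMPLE_CSV = """customer_id,region,segment,revenue
1,East,SMB,1200
2,West,,800
3,East,Enterprise,
4,,SMB,450
5,West,SMB,610
"""

def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per record."""
    return list(csv.DictReader(io.StringIO(text)))

def profile_completeness(rows):
    """Return the share of non-empty values for each field."""
    if not rows:
        return {}
    total = len(rows)
    return {
        field: sum(1 for row in rows if (row[field] or "").strip()) / total
        for field in rows[0]
    }

def cross_tab(rows, field_a, field_b):
    """Count co-occurrences of two fields, skipping records missing either one."""
    counts = Counter()
    for row in rows:
        a = (row[field_a] or "").strip()
        b = (row[field_b] or "").strip()
        if a and b:
            counts[(a, b)] += 1
    return counts

rows = load_rows(SAMPLE_CSV)
completeness = profile_completeness(rows)
region_by_segment = cross_tab(rows, "region", "segment")

for field, share in completeness.items():
    print(f"{field}: {share:.0%} complete")
for (region, segment), n in sorted(region_by_segment.items()):
    print(f"{region} / {segment}: {n}")
```

A field with low completeness is a signal to revisit the source or the data collection process before building visualizations on top of it.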
If you take these steps upfront and work with a tool that allows you to easily connect to a variety of data sources, you can quickly test your theory, profile and adjust the variables used in your analysis and create meaningful results the organization can use. But if you go into the exercise without any data planning, without any goals in mind, you are bound to waste cycle times trying to decide what to include in your analysis and what not to include. Granted, you won't be able to account for every data analysis issue your department or company has. The purpose of this exercise is to frame the questions you want to ask of the data in support of a more directed approach to data visualization.
Intelligence-led decisions should be well received by your cohorts and applied more readily with this type of upfront planning. The steps you take to analyze the data will run more smoothly. You will be able to explain and better defend the data visualization path you've taken to arrive at conclusions. In other words, the story will be clearer when you present it.
Consider the types of visualizations supported by the analytics technology when you do this. Will you need temporal analysis? Will you require relationship graphs that show connections between people, events, organizations and more? Do you need geospatial visualizations to prove your hypothesis? A little bit of planning when using data discovery and NoSQL technology will go a long way in meeting your analytical needs.
Big Oil Drills Into Big Data (Wall Street Journal blog): Big Oil is the latest industry to turn to Big Data software to shave costs.
Tony Agresta's insight:
Now here's a compelling reason to apply big data technology: down for 2 days, down $1 million. Big Oil needs to collect and analyze massive amounts of data generated by equipment to reduce downtime and anticipate outages. It's critical in Big Oil, both offshore and onshore.
A related application that is not referenced in the post is the use of data visualization to identify the locations and connection points for replacement parts. By analyzing the inventory of parts using network graphs and maps, analysts can identify locations and the shortest paths that need to be taken to optimize delivery times.
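As a rough sketch of the shortest path idea, the example below runs Dijkstra's algorithm over a small network of locations. The locations, the travel times and the `shortest_path` helper are hypothetical and exist only to illustrate the technique; a real analysis would load the network from inventory and logistics data.

```python
import heapq

# Hypothetical network: travel time in hours between parts locations and sites.
ROUTES = {
    "Warehouse": {"Port": 6, "Yard": 8},
    "Port": {"Warehouse": 6, "Rig A": 3},
    "Yard": {"Warehouse": 8, "Well B": 2},
    "Rig A": {"Port": 3},
    "Well B": {"Yard": 2},
}

def shortest_path(graph, source, target):
    """Dijkstra's algorithm: return (total_cost, path), or (inf, []) if unreachable."""
    queue = [(0, source, [source])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in settled:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []

cost, path = shortest_path(ROUTES, "Warehouse", "Rig A")
print(cost, " -> ".join(path))
```

The same function can be called once per candidate stocking location to compare delivery times to a given site.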
If you have not had the chance to review some of the free sources of big data that can enhance your content applications, take a look at the Linked Open Data Graph. It's updated daily and you can learn more by searching for the CKAN API. This graph represents tens of billions of semantic facts about a diverse set of topics. These facts have been used to enhance many content driven web sites allowing users to learn more about music, geography, populations and much more.
Here are some key findings in this report vis a vis MarkLogic:
Features — MarkLogic's offering includes replication, rollback, automated failover, point-in-time recovery, backup/restore, backup to Amazon S3, JSON, Hadoop Distributed File System use, parallelized ingest, role-based security, full text search, geospatial, converter for MongoDB, RDF and SPARQL.
Solid customer base — We estimate over 235 commercial customers, 5,000 licenses and strong financial backing.
Customer satisfaction — Survey ranked MarkLogic high for the experience of doing business with it.
What's interesting is that these findings are all related - you can't achieve a license install base of this magnitude with extraordinary levels of customer satisfaction unless you have enterprise features. With MarkLogic, users don't need to build these features since they already exist. What's the end result? - Time savings, security, more information products, higher levels of customer satisfaction and a competitive advantage in the market.
State Street's David Saul argues big data is better when it's smart data.
Tony Agresta's insight:
Banking, like many industries, faces challenges in the area of data consolidation. Addressing this challenge can require the use of semantic technology to accomplish the following:
A common taxonomy across banking divisions allowing everyone to speak the same language
Applications that integrate data including structured data with unstructured data and semantic facts about trading instruments, transactions that pose risk and derivatives
Ways to search all of the data instantly and represent results using different types of analysis, data visualization or through relevance rankings that highlight risk to the bank.
"What's needed is a robust data governance structure that puts underlying meaning to the information. You can have the technology and have the standards, but within your organization, if you don't know who owns the data, who's responsible for the data, then you don't have good control."
Some organizations have built data governance taxonomies to identify the important pieces of data that need to be surfaced in rich semantic applications focused on risk or CRM, for example. Taxonomies and ontologies describe how data is classified and the relationships between the types of data. In turn, they can be used to create facts about the data which can be stored in modern databases (enterprise NoSQL) and used to drive smart applications.
Lee Fulmer, a London-based managing director of cash management for JPMorgan Chase says the creation of [data governance] standards is paramount for fueling adoption, because even if global banks can work out internal data issues, they still have differing regulatory regimes across borders that will require that the data be adapted.
"The big paradigm shift that we need, that would allow us to leverage technology to improve how we do our regulatory agenda in our banking system. If we can come up with a set of standards where we do the same sanction reporting, same format, same data transcription, same data transmission services, to the Canadians, to the Americans, to the British, to the Japanese, it would reduce a huge amount of costs in all of our banks."
Semantic technology is becoming an essential way to govern data, create a common language, build rich applications and, in turn, reduce risk, meet regulatory requirements and reduce costs.
The "semantic Web" is hugely important to tomorrow's business. Do not underestimate its significance: It truly changes everything. Embrace it, or risk extinction.
But what is it? And what does it mean for your business?
Tony Agresta's insight:
Semantic Search is transforming the way businesses operate. In a short period of time, organizations will be focused on this; many already are today. The ability to search on semantic facts (Bruce lives in New Rochelle, NY) while also searching documents AND values at the same time yields search results that power rich content applications, increase visitor traffic, enhance product branding, help catch bad guys and a lot more.
I've been spending a lot of time reviewing examples of these applications for commercial and government websites and must admit, they are very, very compelling. They provide context-based search results that make the site sticky. They deliver information in real time. Look for more articles and posts on this subject to come out next week. I'll focus on how Semantic Search is being used as a transformational technology in business and government and highlight some of the new capabilities supported in MarkLogic 7.
In talking about Big Data so much, are we neglecting the important things that you can do with Small Data? Maybe, but... probably not. Looking beyond the hype…
Tony Agresta's insight:
Great post from Kirk Borne. I'm especially fond of "Association Discovery", an effective approach to identify important networks of people that could be having a positive (or negative) impact on your organization while also detecting co-occurring combinations of attributes that can be used to improve rules-based triggers and alerts.
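To make "co-occurring combinations of attributes" concrete, here is a minimal sketch that counts how often pairs of attributes appear together and keeps the pairs above a support threshold. The event data, attribute names and the `frequent_pairs` helper are invented for illustration; this is simple pair counting, not a full association rule mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: each is the set of attributes observed on one event.
EVENTS = [
    {"new_account", "wire_transfer", "foreign_ip"},
    {"new_account", "wire_transfer"},
    {"wire_transfer", "foreign_ip"},
    {"new_account", "wire_transfer", "foreign_ip"},
    {"card_payment"},
]

def frequent_pairs(events, min_support):
    """Return attribute pairs whose share of all events is at least min_support."""
    pair_counts = Counter()
    for attributes in events:
        # Sort so each pair is always counted under the same key.
        for pair in combinations(sorted(attributes), 2):
            pair_counts[pair] += 1
    total = len(events)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }

pairs = frequent_pairs(EVENTS, min_support=0.5)
for pair, support in sorted(pairs.items()):
    print(pair, f"{support:.0%}")
```

Pairs that clear the threshold are candidates for a rule or alert; an analyst would still need to judge whether the combination is meaningful.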
Technology companies are rushing into predictive search, developing apps like Google Now that process digital clues to anticipate what users want to know.
Tony Agresta's insight:
Worth reading to better understand how predictive analysis is converging with search technology to create a new breed of applications that analyzes digital interaction and attributes about you and then translates the results into personalized alerts.
Not all data is created equal. Some is active and absolutely essential in solving real time problems. And some may be needed in the future but, for now, can occupy fewer resources while keeping your costs down. This presentation discusses how organizations can optimize cost, performance and availability using tiered storage with Hadoop and MarkLogic. It demonstrates how you can have the best of both worlds - real time access to mission critical data AND the ability to immediately activate your long tail data stored in HDFS as needed.
The benefits include less data movement, less ETL, the ability to index your data once, selectively mount data for real time usage and cost effective storage options.
Today, more than ever, organizations need the flexibility to manage their data efficiently. Why not store MarkLogic data directly in the Hadoop File System, apply Map Reduce to operate on that data through batch processing and then mount any portion of that data in MarkLogic for real time access? Mixing real time and batch workloads allows you to manage your low density, active data in an enterprise environment with replication and high availability while also maintaining all other data in Hadoop. This video is worth watching.
Use these five use cases to spark your thinking about how to combine big data and visualization tools in your enterprise.
Tony Agresta's insight:
One form of data visualization that is underutilized by sales and marketing professionals is a relationship graph which shows you connections between people, places, things, events...any attributes you want to see in the graph. This form of visualization has long been used by the intelligence community to find bad guys and identify fraud networks. But it also has practical applications in sales and marketing.
Let's say you're trying to improve your lead conversion process and accelerate sales cycles. Wouldn't it be important to analyze relationships between campaigns, qualified leads created, the business development people that created the leads and how fast each lead progressed through sales stages?
Imagine a network graph that showed the campaigns, business development people that worked the lead pool, qualified leads and the number of opportunities created. Imagine if components of the graph (nodes) were scaled based on the amount of money spent on each campaign, the number of leads each person worked and the value of each opportunity. Your eye would be immediately drawn to a number of insights.
You could quickly see which campaigns provided the most bang for your buck - the ones with relatively low cost and high qualified lead production. You could quickly see which business development reps generated a high volume of qualified leads and how many turned into real opportunities. Now imagine if you could play the creation of the graph over time. You could see when campaigns started to generate qualified leads. How long did it take? How soon could sales expect to get qualified leads? Should your campaign planning cycles change? Are your more expensive campaigns having the impact you expected? Is this all happening fast enough to meet sales targets?
This form of data visualization is easier to apply than you think. There are tools on the market that allow you to connect to CSV files exported from your CRM system and draw the graph in seconds. As data visualization becomes more common in business, sales and marketing professionals will start to use this approach to measure performance of campaigns and employees while better understanding influencing factors in each stage of the sales cycles.
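As a minimal sketch of the data preparation behind such a graph, the example below reads a CSV export and derives edges between campaigns and reps, plus the metrics that could be used to scale the nodes. The export layout, column names and helper functions are assumptions for illustration; a real CRM export will look different, and drawing the graph itself is left to whichever visualization tool you use.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CRM export; the column names are assumptions for illustration.
CRM_EXPORT = """campaign,campaign_cost,rep,lead_id,became_opportunity
Webinar,5000,Ana,L1,yes
Webinar,5000,Ana,L2,no
Webinar,5000,Ben,L3,yes
Trade show,20000,Ben,L4,no
Trade show,20000,Ana,L5,yes
"""

def build_graph(text):
    """Build campaign-to-rep edge weights and per-campaign sizing metrics."""
    edges = defaultdict(int)          # (campaign, rep) -> qualified leads worked
    campaign_cost = {}                # campaign -> spend, used to scale the node
    opportunities = defaultdict(int)  # campaign -> leads that became opportunities
    for row in csv.DictReader(io.StringIO(text)):
        edges[(row["campaign"], row["rep"])] += 1
        campaign_cost[row["campaign"]] = float(row["campaign_cost"])
        if row["became_opportunity"] == "yes":
            opportunities[row["campaign"]] += 1
    return edges, campaign_cost, opportunities

def cost_per_opportunity(campaign_cost, opportunities):
    """Spend divided by opportunities created; lower means more return per dollar."""
    return {
        campaign: cost / opportunities[campaign]
        for campaign, cost in campaign_cost.items()
        if opportunities[campaign] > 0
    }

edges, campaign_cost, opportunities = build_graph(CRM_EXPORT)
efficiency = cost_per_opportunity(campaign_cost, opportunities)

for (campaign, rep), leads in sorted(edges.items()):
    print(f"{campaign} -> {rep}: {leads} lead(s)")
for campaign, value in sorted(efficiency.items()):
    print(f"{campaign}: {value:.0f} per opportunity")
```

Adding a lead creation date column would allow the same edges to be grouped by period, which is what playing the graph over time requires.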
IBM is introducing new data discovery software that enables business users to visually interact with and apply advanced analytics to their data without any specialized skills to get deeper insights about their business. The new software will help close the analytics skills gap that makes current data discovery tools inaccessible for the everyday business user and make it possible to go from raw information to answers hidden deep within structured and unstructured information in minutes.
It was bound to happen and IBM seems headed in the right direction: Predictive Analytics and Data Visualization converge in the cloud. In the post-9/11 era, analysts recognized that revealing insights required the human mind to explore data in an unconstrained manner. If they had the chance to interact with disparate data sets, visualize that data in different forms and navigate to connection points directed by their experience, they could quickly pinpoint relationships that matter.
Today, groundbreaking approaches in intelligence analysis have their foundations built on unconstrained data discovery. Insurance organizations are applying interactive data visualization to uncover patterns that clearly point to fraud and collusion. eCommerce organizations are using these techniques to examine both employee and vendor behavior as they connect the dots highlighting networks of interest.
This revolution in analysis is only just beginning. Imagine what can be accomplished when predictive models and rules are applied to big data in real time yielding more focused data sets for discovery.
Four years ago I spoke with a global top 10 bank that applied predictive models to detect fraudulent transactions. When I asked if they had combined this approach with data visualization to pick up on any error in the models, they responded that their analysts couldn't use those tools because they were too complex. They couldn't identify fraud networks using relationship graphs. After nearly $2 billion in fines, I wonder if they are rethinking this approach? The fact is, they could have detected money laundering among their account holders by joining the results of predictive analysis with human data discovery.
Within 5 years, I would be surprised if every major bank, insurance company, retailer and healthcare organization wasn't following in the footsteps of the intelligence community. As these analytic methods converge, the criminal's chances of hiding the truth diminish dramatically.
This paper includes a great set of definitions around the subject of predictive analytics including a number of important applications - fraud prevention, location tracking, targeted advertising, law enforcement and intelligence. For anyone interested in predictive analytics, I suggest scanning topic headlines to identify areas of focus. Many of the fundamentals of predictive analytics are described in this paper.
Triple stores are gaining in popularity. This article does a nice job of describing what triple stores are and how they differ from graph databases. But there isn't much in the article on how triple stores are used. So here goes:
Some organizations are finding that when they apply a combination of semantic facts (triples) with other forms of unstructured and structured data, they can build extremely rich content applications. In some cases, content pages are constructed dynamically. The context based applications deliver targeted, relevant results creating a unique user experience. Single unified architectures that can store and search semantic facts, documents and values at the same time require fewer IT and data processing resources resulting in shorter time to market. Enterprise grade technology provides the security, replication, availability, role based access and the assurance no data is lost in the process. Real time indexing provides instant results.
Other organizations are using triple stores and graph databases to visually show connections useful in uncovering intelligence about their data. These tools connect to triple stores and NoSQL databases easily, allowing users to configure graphs to show how the data is connected. There's wide applicability for this but common use cases include identifying fraud and money laundering networks, counter-terrorism, social network analysis, sales performance, cyber security and IT asset management. The triples, documents and values provide the fuel for the visualization engine, allowing for comprehensive data discovery and faster business decisions.
Other organizations focus on semantic enrichment and then ingest resulting semantic facts into triplestores to enhance the applications mentioned above. Semantic enrichment extracts meaning from free flowing text and identifies triples.
Today, the growth in open data - pre-built triple stores - is allowing organizations to integrate semantic facts to create richer content applications. There are hundreds of sources of triple stores that contain tens of billions of triples, all free.
What's most important about these approaches? Your organization can easily integrate all forms of data in a single unified architecture. The data is driving smart websites, rich search applications and powerful approaches to data visualization. This is worth looking at more closely since the end results are more customers, lower costs, greater insights and happier users.
Over the past decades, organisations have been working with relational databases to store their structured data. In the big data era, however, these types of databases are not sufficient anymore.
Tony Agresta's insight:
"The amount of available NoSQL databases is growing rapidly and currently there are, as this website shows, over 150 of them. One of the more well known is MarkLogic and recently they announced MarkLogic 7, an Enterprise NoSQL that shows the vast possibilities of NoSQL databases for organisations."
One of many new features in MarkLogic 7 is Semantics including support for a triple store, SPARQL to query the triples, a triples index and cache for enhanced performance and updated APIs for developers.
With this release, Enterprise NoSQL has taken another huge step forward. Why is this so important? For the first time ever, a single unified architecture exists allowing organizations to query and apply documents, others types of unstructured data, structured data (values, for example) and semantic facts at the same time. In other words, composite queries can be built which return any combination of data allowing developers to create very rich content applications.
From the article: "Semantic triples enable relationships between pieces of data and are related more closely to the way humans think. If you combine them with Linked Open Data (facts that are freely available and in a form that is easily consumed by machines) or information from DBPedia (Wikipedia but in a structured format understandable by machines), these triples suddenly receive a meaning and give data the context required to be valuable in a semantic environment. As the founder of MarkLogic, Christopher Lindblad, explained during the summit: “Data is not information, what you have to do to get from data to information is add context.”
We have already seen customers developing SPARQL queries that extract semantic facts alongside documents to graph the connections using relationship graphs. "Context" takes on new meaning when analysts can see how people are connected to places, publications, events, organizations and much more.
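To show what querying triples for connections looks like in principle, here is a small in-memory sketch in Python. It is not SPARQL and not MarkLogic's API; the facts, predicate names and the `match` and `coauthors` helpers are hypothetical. The wildcard arguments play roughly the role that variables play in a SPARQL triple pattern.

```python
# Hypothetical facts expressed as (subject, predicate, object) triples.
TRIPLES = [
    ("Bruce", "livesIn", "New Rochelle"),
    ("Bruce", "authorOf", "Report 17"),
    ("Dana", "authorOf", "Report 17"),
    ("Dana", "memberOf", "Acme Corp"),
    ("Report 17", "mentions", "Acme Corp"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

def coauthors(triples, person):
    """Find people who authored at least one work in common with the given person."""
    works = {o for _, _, o in match(triples, subject=person, predicate="authorOf")}
    return sorted(
        {
            s
            for s, _, o in match(triples, predicate="authorOf")
            if o in works and s != person
        }
    )

authored = match(TRIPLES, predicate="authorOf")
bruce_coauthors = coauthors(TRIPLES, "Bruce")
print(authored)
print(bruce_coauthors)
```

The matched triples are already in the shape a graphing tool expects: each one is an edge from subject to object, labeled with the predicate.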
"Semantic search allows you to perform a combination of queries ranging from text queries, document queries, range queries or pose a query that goes over multiple data sources. It can return results that might not even contain the exact term you used in a query, but which is very closely linked to what you are looking for and therefore still relevant. It is a new way to find what you are looking for and in combination with Enterprise NoSQL the new way to understand and find corporate data."
There are some very good customer applications cited in the article by Mark van Rijmenam.
If you want to read more about this technology, I suggest you look at the review recently done by Gartner entitled Magic Quadrant
You write a query with great care, and excitedly hit the "enter" button, only to see a bunch of gobbledygook spit out on the screen.
Tony Agresta's insight:
"The real power of this approach becomes evident when one considers the hugely disparate nature of information on the Internet. An RDF powered application can build links between different pieces of data, and effectively 'learn' from the connections created by the semantic triples. This is the big (and as yet unrealized) pipe dream of the semantic Web."
Customers can now use a combination of documents, values and semantic facts in the form of triples to create very rich content applications. The semantics world will appreciate the fact that linked open triples can be imported into MarkLogic's triple store or they can use semantic facts and taxonomies that already exist in their organization to populate the triple store. On the other hand, the document world will appreciate that semantic facts already embedded in documents or created from authoring tools can be used to populate the triple store. Those already using semantic enrichment products can create triples from free flowing text and now apply MarkLogic to ingest, manage, search and create rich applications using those facts. We have already seen some early access users take a simple taxonomy, create triples, run SPARQL queries and then graph the connections between authors and publications using a third party graphing tool.
The applications for this use case in intelligence, fraud analysis, anti-money laundering, cyber security and other areas are top of mind for most organizations today. The beauty of the MarkLogic approach is that the relationships in the facts can be combined with documents and values. When combined with 3rd party relationship graphing features, users can reveal hidden insights that can only be discovered using this approach.
If big data is a big party, Hadoop would be at the center and be surrounded by Hive, Giraph, YARN, NoSQL, and the other exotic technologies that generate so much excitement.
Tony Agresta's insight:
MarkLogic continues to enhance its approach to NoSQL, further confirming it's the adult at the party. MarkLogic 7 includes enhancements to enterprise search as well as Tiered Storage and integration with Hadoop.
MarkLogic Semantics, also part of MarkLogic 7, provides organizations with the ability to enhance the content experience for users by including an even richer experience that includes semantic facts, documents and values in the same search experience. By doing this, organizations can surface semantic facts stored in MarkLogic when users are searching for a topic or person of interest. For example, if a user searches all unstructured data on a topic, facts about authors, publication dates, related articles and other facts about the topic would be part of the search results.
This could be applied in many ways. Intelligence analysts may be interested in facts about people of interest. Fraud and AML analysts could be interested in facts about customers with unusual transaction behavior. Life Sciences companies may want to include documents, facts about the drug manufacturing process and values about pharma products as part of the search results.
Today, traditional search applications are being replaced by smarter, content rich semantic search. This addition to MarkLogic continues to confirm that all of this can be done within a single, unified architecture saving organizations development time, money and resources while delivering enterprise grade technology used in the most mission critical applications today.
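The idea of returning documents, values and semantic facts in one search can be sketched conceptually as follows. This is plain Python over in-memory data, not MarkLogic's query API; the documents, facts and the `composite_search` function are assumptions made up for the example.

```python
# Hypothetical content store: documents (text plus a structured value)
# and triples (semantic facts about those documents).
DOCUMENTS = {
    "doc1": {"text": "Quarterly review of offshore drilling costs", "year": 2013},
    "doc2": {"text": "Drilling safety procedures", "year": 2009},
    "doc3": {"text": "Marketing plan", "year": 2013},
}
FACTS = [
    ("doc1", "author", "Dana"),
    ("doc1", "relatedTo", "doc2"),
    ("doc2", "author", "Bruce"),
]

def composite_search(term, min_year):
    """Combine a text match, a range filter on a value and a lookup of related facts."""
    results = []
    for doc_id, doc in sorted(DOCUMENTS.items()):
        text_matches = term.lower() in doc["text"].lower()
        in_range = doc["year"] >= min_year
        if text_matches and in_range:
            facts = [(p, o) for s, p, o in FACTS if s == doc_id]
            results.append({"id": doc_id, "facts": facts})
    return results

hits = composite_search("drilling", min_year=2010)
print(hits)
```

Each hit carries its facts along with it, so the result page can show authors and related articles next to the matching document.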
Search, as we know it, is dead. In the new semantic Web, engaging content results in more online visitors, who are more easily converted into customers. Sounds great, but how do you play the new "semantic search" game?
Tony Agresta's insight:
Semantic Search allows companies to create a very rewarding web experience for visitors replete with relevant, context-based facts on exactly what they are looking for. Imagine if you were searching for information about singers and you arrived at a website that included artist bios, album reviews, recordings, related URLs and more. Websites like this do exist (BBC Music) powered by the movement in linked open data. Today, tens of millions of facts about subjects as diverse as geographic names, drugs, financial investment projects, government and DBpedia are available, for free, to enhance your website search experience.
Businesses and local governments are building new, information based web sites that surface these facts creating differentiation for themselves, new revenue streams and sticky sites that continue to increase in rankings.
What makes them unique? The foundation of this approach is rooted in the ability to display documents (unstructured data), semantic triples (facts about the content expressed in the form of subject-predicate-object) and values (structured data) on the site at the same time. In some cases, this is done dynamically where content pages are assembled in real time.
What makes this challenging? Often the data needed to power this type of rich content experience is stored in disparate silos making it difficult to query. The data needed to create the rich content experience is hard to access and can't be delivered at speeds required. Fortunately, enterprise NoSQL databases are designed to consolidate data, efficiently deliver content, create new information products and analyze the results. If you're interested in some of the possibilities in this area, there's a good video on the subject here: Semantic Technologies in MarkLogic
The first 20 minutes of this video define semantic triples and provide examples of how they can be used to create the rich experiences described by Forbes and in this post.
Great paper that covers how you can make Hadoop really powerful. Not all data is created equal. Some is needed in real time. Some requires less expensive storage options. Some you may need to quickly migrate from HDFS to MarkLogic. This paper is the perfect road map to understand how you can unleash the power of Hadoop. No registration required to download a copy.
The BBC documentary follows people who mine Big Data, including the Los Angeles Police Department (LAPD), which uses data to predict crime. It's proven that historical patterns can be used to predict future behavior. With a database of over 13 million crimes spanning 80 years and real time continuous updates, the LAPD has applied mathematical algorithms and pattern recognition to identify crime hotspots. Targeted police work has resulted in a 26% decrease in burglaries and a 12% decrease in property crimes.
How does this work? In the same way that earthquake aftershocks can be predicted, data miners analyzed historical crime statistics including location and timing. They found patterns in the big data crime landscape. By tracking the history, timing and location of crimes, they revealed that the probability another crime would occur in certain locales was higher. They discovered patterns in the data. In this case, the rate of crime and geospatial distribution of events were excellent predictors of future behavior including pinpointing small geographic areas which they used to direct police resources.
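A very simplified way to see how location history points to hotspots is to bin past incidents into grid cells and rank the cells by count. The sketch below does only that; it is plain density counting, not the aftershock-style model described in the documentary, and the coordinates, cell size and function names are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical incidents as (latitude, longitude) pairs.
INCIDENTS = [
    (34.0522, -118.2437),
    (34.0525, -118.2431),
    (34.0529, -118.2440),
    (34.1015, -118.3269),
    (34.0521, -118.2435),
]

def grid_cell(lat, lon, cell_size=0.01):
    """Map a coordinate to the integer index of the grid cell that contains it."""
    return (math.floor(lat / cell_size), math.floor(lon / cell_size))

def hotspots(incidents, cell_size=0.01, top_n=1):
    """Return the top_n grid cells with the most incidents, with their counts."""
    counts = Counter(grid_cell(lat, lon, cell_size) for lat, lon in incidents)
    return counts.most_common(top_n)

top = hotspots(INCIDENTS)
for cell, count in top:
    print(cell, count)
```

Adding a timestamp to each incident and weighting recent events more heavily would be one step toward the temporal side of the analysis.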
Today, these predictive aftershocks are becoming more accurate through the use of real time data feeds, alerts, geospatial analysis and temporal analysis. Over 150 cities in the US are starting to apply these techniques allowing police officers to anticipate, focus, apprehend and therefore lower risk.
Thanks to KD Nuggets for providing the link to the BBC video which is very well done.
Here's a nice list of bloggers, tweeters and general influencers in the Big Data space. The post makes the following key points worth noting:
You can now reach out to the influencer using Twitter, email, phone or any other appropriate way with the increased conviction that follows from knowing that you are being highly relevant to them. In fact – in most cases they are likely to thank you for bringing the relevant material to their attention and in many cases they will share their “find” with others.
The net effect of this solution is that your evangelists spend time on being relevant and building relationships with influencers – rather than spending time looking for opportunities to engage; and just as a sales team that works off a steady stream of hot leads performs better than one that has to find their own leads, your evangelists will help win significantly more hearts, minds and market share.