When you use the analytical process known as discovery, I recommend that you look for tools and environments that allow you to connect to NoSQL platforms.
The convergence of data visualization and NoSQL is becoming a hotter topic every day. We're at the very beginning of this movement as organizations integrate many forms of data with technology to visualize relationships and detect patterns across and within data sets. There aren't many vendors that do this well today, and demand is growing. Some organizations are trying to achieve big data visualization through data science as a service. Some software companies have created connectors to NoSQL (and other) data sources to reach this goal. As you would expect, deployment options run the gamut.
Examples of companies that offer data visualization generated from a variety of data sources, including NoSQL, are Centrifuge Systems, which displays results in the form of relationship graphs; Pentaho, which provides a full array of analytics including data visualization and predictive analytics; and Tableau, which supports dozens of data sources along with strong charting and other forms of visualization. Regardless of which you choose (and there are others), the process you apply to select and analyze the data will be important.
In the article, John L. Myers discusses some of the challenges users face with data discovery technology (DDT). Since DDT operates from the premise that you don't know all the answers in advance, it's more difficult to pinpoint the sources needed in the analysis. Analysts discover insights as they navigate through the data visualizations. This challenge isn't too distant from what predictive modelers face as they decide what variables to feed into their models. They often don't know what the strongest predictors will be, so they apply their experience to carefully select data. They sometimes transform specific fields so an attribute exhibits greater explanatory power. BI experts have long struggled with the same issue as they try to decide which metrics and dashboards will be most useful to the business.
Here are some guidelines that may help you solve the problem and plan your approach to data analysis.
- Start by writing down a hypothesis you want to prove before you connect to specific sources. What do you want to explore? What do you want to prove? In some cases, you'll want to prove many things. That's fine. Write down your top ones.
- For each hypothesis create a list of specific questions you want to ask the data that could prove or disprove the hypothesis. You may have 20 or 30 questions for each hypothesis.
- Find the data sources that have the data you need to answer the questions. What data will you need to arrive at a conclusion?
- Begin to profile each field to see how complete the data is. In other words, take an inventory of the data, checking for missing values, data quality errors, or characteristics that make a specific source a good one. This may point back to changes needed in how your current systems or processes collect data.
- Go a layer deeper in your charting and profiling. Move beyond histograms to show relationships between variables you believe will be helpful as you attempt to answer your list of questions and prove or disprove your hypothesis. Show relationships between two or more variables using heat maps, cross tabs, and drill charts.
- Reassess your original hypothesis. Do you have the necessary data? Or do you need to request additional types of data?
- Once you are set on the inventory of data and you have the tools to connect to those sources, create a set of visualizations to answer each of the questions. In some cases, it may take four or five visualizations per question. Sometimes you will be able to answer the question with a single visualization.
- Assemble the results for each question to prove or disprove the hypothesis. You should arrive at a nice storyboard approach that, when assembled in the right order, allows you to articulate the steps in the analysis and draw conclusions needed to run your business.
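The profiling and cross-tab steps above can be sketched in a few lines of code. This is a minimal illustration, not a definitive implementation: the record fields (`region`, `plan`, `churned`) and the sample rows are made up for the example, standing in for documents pulled from your NoSQL source.

```python
from collections import Counter

# Hypothetical sample rows; in practice these would be documents
# fetched from your NoSQL data source via a connector.
records = [
    {"region": "east", "plan": "pro",   "churned": "no"},
    {"region": "east", "plan": "basic", "churned": "yes"},
    {"region": "west", "plan": "pro",   "churned": None},
    {"region": "west", "plan": None,    "churned": "no"},
]

def profile(rows):
    """Inventory each field: what fraction of rows have a value for it?"""
    fields = {key for row in rows for key in row}
    total = len(rows)
    return {
        field: sum(1 for row in rows if row.get(field) is not None) / total
        for field in fields
    }

def cross_tab(rows, a, b):
    """Count co-occurrences of two fields, skipping rows with missing values."""
    return Counter(
        (row[a], row[b])
        for row in rows
        if row.get(a) is not None and row.get(b) is not None
    )

completeness = profile(records)                 # completeness per field
pairs = cross_tab(records, "region", "churned")  # counts per (region, churned) pair
```

A completeness score well below 1.0 for a field you planned to rely on is exactly the signal, from the profiling step, that you may need to request additional data or fix collection upstream; the pair counts are the raw material for a heat map or cross-tab chart in whichever visualization tool you've chosen.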
If you take these steps upfront and work with a tool that allows you to easily connect to a variety of data sources, you can quickly test your theory, profile and adjust the variables used in your analysis, and create meaningful results the organization can use. But if you go into the exercise without any data planning or goals in mind, you are bound to waste cycles deciding what to include in your analysis and what to leave out. Granted, you won't be able to account for every data analysis issue your department or company has. The purpose of this exercise is to frame the questions you want to ask of the data in support of a more directed approach to data visualization.
Intelligence-led decisions should be well received by your cohorts and applied more readily with this type of upfront planning. The steps you take to analyze the data will run more smoothly. You will be able to explain and better defend the data visualization path you've taken to arrive at conclusions. In other words, the story will be clearer when you present it.
Consider the types of visualizations supported by the analytics technology when you do this. Will you need temporal analysis? Will you require relationship graphs that show connections between people, events, organizations and more? Do you need geospatial visualizations to prove your hypothesis? A little bit of planning when using data discovery and NoSQL technology will go a long way in meeting your analytical needs.