The third part in a series on the data science ecosystem looks at the applications that turn data into insights or models.
Probably because this category has the most ongoing research, there is quite a rich collection of open-source modeling and insights tools. R is an essential tool for most data scientists and works both as a programming language and an interactive environment for exploring data. Octave is a free, open-source alternative to MATLAB that works very well. Julia is becoming increasingly popular for technical computing. Stanford has an NLP library that has tools for most standard language processing tasks. Scikit-learn, a machine learning package for Python, is becoming very powerful and has implementations of most standard modeling and machine learning algorithms.
In the end, data application tools are what make data scientists incredibly valuable to any organization. They're the exact thing that allows a data scientist to make powerful suggestions, uncover hidden trends and provide tangible value. But these tools simply don't work unless you have good data and unless you enrich, blend and clean that data.
In the first three sections of the Twitter data collection tutorial, I demonstrated how to collect tweets using both R and Python and how to store these tweets first as JSON files then having R parse them into a .csv file. The .csv file works well, but tweets don’t always make good flat .csv files, since not every tweet contains the same fields or the same structure. Some of the data is well nested into the JSON object. It is possible to write a parser that has a field for each possible subfield, but this might take a while to write and will create a rather large .csv file or SQL database.
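The flattening problem described above can be sketched in a few lines of Python. This is only an illustration, not the tutorial's own parser: the two sample records, their field names, and the `flatten` helper are all made up for the example. Nested objects become dotted column names, and the CSV header is the union of every key seen, so tweets with missing fields simply get empty cells.

```python
import csv
import io
import json


def flatten(obj, parent_key="", sep="."):
    """Flatten nested dicts into one dict with dotted keys (hypothetical helper)."""
    flat = {}
    for key, value in obj.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            # Lists and scalars are kept as-is; csv writes them via str()
            flat[full_key] = value
    return flat


# Two invented records with different shapes, standing in for real tweets
raw_lines = [
    '{"id": 1, "text": "hello", "user": {"screen_name": "a", "followers_count": 10}}',
    '{"id": 2, "text": "world", "user": {"screen_name": "b"}, "place": {"country": "NL"}}',
]

rows = [flatten(json.loads(line)) for line in raw_lines]

# Union of all keys, sorted so the column order is stable
fieldnames = sorted({key for row in rows for key in row})

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The trade-off mentioned in the text shows up directly: every distinct subfield adds a column, so a large and varied collection of tweets produces a wide, sparse file.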
Project Big Data is an interactive tool which enables you to visualize and explore the funding patterns of over 600 companies in the Big Data ecosystem! It is based on the work I did for Project C (which you can see and read about here). The list of companies and their classification into categories is based on a dozen published sources and rough text analytics of the Crunchbase database. Crunchbase is a curated, crowd-sourced database of over 285k companies.
As for the data, there are 645 public and private companies in the data set, from Teradata and IBM to Actuate and Zoomdata. I began by harvesting data from Crunchbase by using their free API with Python. As of September, Crunchbase had 1250 funding events for 410 of the companies on my list. I've grouped these companies into 18 categories, allowing you to compare peers as well as trends across categories. Some of the categories are broken down further. For example, the tool allows you to differentiate between cloud-based and on-premises solutions or SQL vs. NoSQL databases. I gathered additional data from a variety of sources. For example, LinkedIn was used to find the number of employees.
OPENACCESS Workbook: Project Big Data v1.0
Google open sourced a new package for the R statistical computing software that’s designed to help users infer whether a particular action really did cause subsequent activity. Google has been using the tool, called CausalImpact, to measure AdWords campaigns but it has broader appeal.
Via Dominique Cardon
Google announced on Tuesday a new open source tool that can help data analysts decide if changes to products or policies resulted in measurable change, or if the change would have happened anyway. The tool, called CausalImpact, is a package for the R statistical computing software, and Google details it in a blog post.
According to blog post author Kay H. Brodersen, Google uses the tool — created it, in fact — primarily for quantifying the effectiveness of AdWords campaigns. However, he noted, the same method could be used to gauge everything from whether adding a new feature caused an increase in app downloads to questions involving events in medical, social or political science.
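To make the question "would the change have happened anyway?" concrete, here is a deliberately simplified Python sketch. It is not the CausalImpact method and does not use the R package; it swaps in a plain least-squares fit against a control series. All numbers and names (`control_pre`, `outcome_post`, and so on) are invented for illustration. The idea is to learn the relationship between the outcome and an unaffected control before the change, use it to predict what the outcome would have been afterwards, and treat the gap between observed and predicted values as the estimated effect.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    return slope, mean_y - slope * mean_x


# Hypothetical numbers: a control series assumed to be unaffected by the
# change, and the outcome of interest, before and after the change.
control_pre = [10, 12, 11, 13, 12]
outcome_pre = [21, 25, 23, 27, 25]
control_post = [14, 13, 15]
outcome_post = [34, 32, 36]

# Learn the pre-change relationship, then predict the counterfactual
slope, intercept = fit_line(control_pre, outcome_pre)
predicted_post = [slope * c + intercept for c in control_post]

# Observed minus predicted is the estimated effect at each post-change point
pointwise_effect = [obs - pred for obs, pred in zip(outcome_post, predicted_post)]
average_effect = sum(pointwise_effect) / len(pointwise_effect)
print(f"estimated average effect: {average_effect:.2f}")
```

A real analysis would also need uncertainty estimates around the counterfactual, which this sketch leaves out entirely.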
The Guardian recently published an article linking to a database of 250 million events. Sounds too good to be true, but as I'm writing a PhD on recent Russian memory events, I was excited to try it out. I downloaded the data, generously made available by Kalev Leetaru of the University of Illinois, and got going. It's a large 650 MB zip file (4.6 GB uncompressed!), and this is apparently the abbreviated version. Consequently this early stage of the analysis was dominated by eager anticipation, as the Cambridge University internet did its thing.
Meanwhile I headed over to David Masad's writeup on getting started with GDELT in Python.
I hope I find the time to write a one-page survival guide for UNIX, Python and Perl. Here's one for R. The links to core data science concepts are below - I ne…
Via Pierre Levy
Here's the list:
- Practical illustration of Map-Reduce (Hadoop-style), on real data
- A synthetic variance designed for Hadoop and big data
- Fast Combinatorial Feature Selection with New Definition of Predict...
- A little known component that should be part of most data science a...
- 11 Features any database, SQL or NoSQL, should have
- Clustering idea for very large datasets
- Hidden decision trees revisited
- Correlation and R-Squared for Big Data
- Marrying computer science, statistics and domain expertize
- New pattern to predict stock prices, multiplies return by factor 5
- What Map Reduce can't do
- Excel for Big Data
- Fast clustering algorithms for massive datasets
- Source code for our Big Data keyword correlation API
- The curse of big data
- How to detect a pattern? Problem and solution
- Interesting Data Science Application: Steganography
A conversation about data-driven innovation is possible now because new technologies have made it easier and cheaper to collect, store, analyze, use, and disseminate data. But while the potential for vastly more data-driven innovation exists, many organizations have been slow to adopt these technologies. Policymakers around the world should do more to spur data-driven innovation in both the public and private sectors, including by supporting the development of human capital, encouraging the advancement of innovative technology, and promoting the availability of data itself for use and reuse.
From books to videos to online tutorials -- most free! -- here are plenty of ideas to burnish your R knowledge.
These websites, videos, blogs, social media / communities, software and books/ebooks can help you do more with R; the favorites are listed in bold.
I provide an overview of the data science workflow and highlight some challenges that data scientists face in their work.
Via João Greno Brogueira
@luiy. Great article about #DataScience: the workflow design, methods and problematics.
What do data scientists do at work, and what challenges do they face?
This post provides an overview of the modern data science workflow, adapted from Chapter 2 of my Ph.D. dissertation, Software Tools to Facilitate Research Programming.
The Data Science Workflow
The figure below shows the steps involved in a typical data science workflow. There are four main phases, shown in the dotted-line boxes: preparation of the data, alternating between running the analysis and reflection to interpret the outputs, and finally dissemination of results in the form of written reports and/or executable code.
Table of Contents
This book’s contents are freely available as PDF files. When you click on a chapter title below, you will be taken to a webpage for that chapter. The page contains links for a PDF of that chapter and for any sample Python code and data that chapter requires. Please let me know if you see an error in the book, if some part of the book is confusing, or if you have some other comment. I will use these to revise the chapters.
Chapter 1: Introduction
Finding out what data mining is and what problems it solves. What you will be able to do when you finish this book.
Chapter 2: Get Started with Recommendation Systems
Introduction to social filtering. Basic distance measures including Manhattan distance, Euclidean distance, and Minkowski distance. Pearson Correlation Coefficient. Implementing a basic algorithm in Python.
Chapter 3: Implicit ratings and item-based filtering
A discussion of the types of user ratings we can use. Users can explicitly give ratings (thumbs up, thumbs down, 5 stars, or whatever) or they can rate products implicitly–if they buy an mp3 from Amazon, we can view that purchase as a ‘like’ rating.
Chapter 4: Classification
In previous chapters we used people’s ratings of products to make recommendations. Now we turn to using attributes of the products themselves to make recommendations. This approach is used by Pandora among others.
Chapter 5: Further Explorations in Classification
A discussion on how to evaluate classifiers including 10-fold cross-validation, leave-one-out, and the Kappa statistic. The k Nearest Neighbor algorithm is also introduced.
Chapter 6: Naïve Bayes
An exploration of Naïve Bayes classification methods. Dealing with numerical data using probability density functions.
Chapter 7: Naïve Bayes and unstructured text
This chapter explores how we can use Naïve Bayes to classify unstructured text. Can we classify Twitter posts about a movie as to whether the post was a positive review or a negative one? (new version coming November 2013)
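The distance measures named under Chapter 2 above are short enough to sketch directly. The following Python is an illustration written for this summary, not code from the book; the two rating vectors and the names `alice` and `bob` are invented. Manhattan and Euclidean distance are the order-1 and order-2 cases of Minkowski distance, and the Pearson correlation coefficient measures whether two users' ratings move together even if one rates consistently higher.

```python
import math


def minkowski(a, b, r):
    """Minkowski distance of order r; r=1 is Manhattan, r=2 is Euclidean."""
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1 / r)


def manhattan(a, b):
    return minkowski(a, b, 1)


def euclidean(a, b):
    return minkowski(a, b, 2)


def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom else 0.0


# Invented ratings of the same four items by two users
alice = [5, 3, 4, 4]
bob = [3, 1, 2, 2]

print("Manhattan:", manhattan(alice, bob))
print("Euclidean:", euclidean(alice, bob))
print("Pearson:  ", pearson(alice, bob))
```

In this made-up example the two users are far apart by either distance, yet their ratings are perfectly correlated, because one user simply rates every item two points lower.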
Since the release of Mining the Social Web, 2E in late October of last year, I have mostly focused on creating supplemental content about Twitter data. This seemed like a natural starting point, given that the first chapter of the book is a gentle introduction to data mining with Twitter’s API, coupled with the inherent openness of accessing and analyzing Twitter data (in comparison to other data sources that are a little more restrictive). Twitter’s IPO late last year also put the spotlight on Twitter, which provided some good opportunities to opine on Twitter’s underlying data model that can be interpreted as an interest graph.
TubeKit is a toolkit for creating YouTube crawlers. It allows one to build one's own crawler that can crawl YouTube based on a set of seed queries and collect up to 16 different attributes.
TubeKit assists in all the phases of this process, from database creation to giving access to the collected data with browsing and searching interfaces. In addition to creating crawlers, TubeKit also provides several tools to collect a variety of data from YouTube, including video details and user profiles.
Slides for Chapters 1-5 of the 3rd edition can be found here.
Slides for Chapters 6-8 of the 3rd edition can be found here.
These archives contain .pdf files as well as .odp files in Open Document Format that were generated using OpenOffice 2.0. Note that there are several free office programs now that can read .odp files. There is also a plug-in for Word made by Sun for reading this format. Corresponding information is on this Wikipedia page.
Preface and Table of Contents
Chapter 1. Data Mining
Chapter 2. Map-Reduce and the New Software Stack
Chapter 3. Finding Similar Items
Chapter 4. Mining Data Streams
Chapter 5. Link Analysis
Chapter 6. Frequent Itemsets
Chapter 7. Clustering
Chapter 8. Advertising on the Web
Chapter 9. Recommendation Systems
Chapter 10. Mining Social-Network Graphs
Chapter 11. Dimensionality Reduction
Chapter 12. Large-Scale Machine Learning
Download Full Book :
Recently, Plugged In to Gnip partner Qlik and international relief organization Medair hosted a hackathon focused on using social data to inform global disaster response.
At Gnip, we’re always excited to hear about groups and individuals who are using social data in unique ways to improve our world. We were recently fortunate enough to support this use of social data for humanitarian good first-hand. Along with Plugged In to Gnip partner Qlik and international relief organization Medair, we hosted a hackathon focused on global disaster response.
The hackathon took place during Qlik’s annual partner conference in Orlando and studied social content from last year’s Typhoon Haiyan. Historical Twitter data from Gnip was paired with financial information from Medair to give participants the opportunity to create new analytic tools on Qlik’s QlikView.Next BI platform. The Twitter data set specifically included Tweets from users in the Philippines for the two-week period around Typhoon Haiyan in November of 2013. The unique combination of data and platform allowed the hackathon developers to dissect and visualize a massive social data set with the goal of uncovering new insights that could be applied in future natural disasters.
Today we’re introducing a pilot project we’re calling Twitter Data Grants, through which we’ll give a handful of research institutions access to our public and historical data. Wi......
With more than 500 million Tweets a day, Twitter has an expansive set of data from which we can glean insights and learn about a variety of topics, from health-related information such as when and where the flu may hit to global events like ringing in the new year. To date, it has been challenging for researchers outside the company who are tackling big questions to collaborate with us to access our public, historical data. Our Data Grants program aims to change that by connecting research institutions and academics with the data they need.
If you’d like to participate, submit a proposal here no later than March 15th. For this initial pilot, we’ll select a small number of proposals to receive free datasets. We can do this thanks to Gnip, one of our certified data reseller partners. They are working with us to give selected institutions free and easy access to Twitter datasets. In addition to the data, we will also be offering opportunities for the selected institutions to collaborate with Twitter engineers and researchers.
Find statistics, consumer survey results and industry studies
Statista is the world’s largest statistics portal. Providing you with access to relevant data from over 18,000 sources, our focus is firmly based on professional, clear, quick and consistent results. Our customized search query form provides you with a list of statistics, studies and reports relating to your search request within a matter of seconds, kick-starting your research.
Today we are going to talk about the wide spectrum of functions and methods that we can use to visualize dendrograms in R. You can check an extended version of this post with the complete reproduci...
A quick reminder: a dendrogram (from Greek dendron=tree, and gramma=drawing) is nothing more than a tree diagram that practitioners use to depict the arrangement of the clusters produced by hierarchical clustering.
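To see what a dendrogram actually encodes, it helps to look at the merge sequence produced by hierarchical clustering. The sketch below is a minimal, illustrative Python implementation of single-linkage agglomerative clustering on one-dimensional points. It is not taken from the original post (which works in R), and the sample points and the `single_linkage` helper are invented. Each merge records the two clusters joined and the distance at which they were joined; a dendrogram draws exactly these joins, with the distance as the height of each branch point.

```python
def single_linkage(points):
    """Agglomerative clustering on 1-D points; returns the merges in order.

    Each merge is (cluster_a, cluster_b, distance), where the distance between
    two clusters is the smallest gap between any pair of their members.
    """
    clusters = [(p,) for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = tuple(sorted(clusters[i] + clusters[j]))
        # Replace the two merged clusters with their union
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Invented sample points
points = [1.0, 2.0, 6.0, 7.5, 20.0]
merges = single_linkage(points)
for left, right, height in merges:
    print(f"join {left} and {right} at height {height}")
```

Reading the output from top to bottom corresponds to reading a dendrogram from the leaves up to the root: the last merge, at the greatest height, joins everything into a single tree.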
Text Mining is an interdisciplinary field combining techniques from linguistics, computer science and statistics to build tools that can efficiently retrieve and extract information from digital text. Over the last few decades, there has been increasing interest in text mining research because of the potential commercial and academic benefits this technology might enable. However, as with the promises of many new technologies, the benefits of text mining are still not clear to most academic researchers.
This situation is now poised to change for several reasons. First, the rate of growth of the scientific literature has now outstripped the ability of individuals to keep pace with new publications, even in a restricted field of study. Second, text-mining tools have steadily increased in accuracy and sophistication to the point where they are now suitable for widespread application. Finally, the rapid increase in availability of digital text in an Open Access format now permits text-mining tools to be applied more freely than ever before.
Over the last few years, we've heard a lot about how
“I think the big science journalism story of 2014 will be the brain drain from science to industry ‘data science,’” Fred writes. “Up until a few years ago, at least in my field, the best grad students got jobs as professors, and the less successful grad students took jobs in industry. It is now the reverse. It’s a real trend, and it’s a big deal. One reason is that science tends not to reward the graduate students who are best at developing good software, which is exactly what science needs right now…
“Another reason, especially important for me, is the quality of research in academia and in industry. In academia, the journals tend to want the most interesting results and are not so concerned about whether the results are true. In industry data science, [your] boss just wants the truth. That’s a much more inspiring environment to work in. I like writing code and analyzing data. In industry, I can do that for most of the day. In academia, it seems like faculty have to spend most of their time writing grants and responding to emails.”