Your new post is loading...
Your new post is loading...
Current selected tag: datascience. Clear.
The Guardian recently published an article linking to a database of 250 million events. Sounds too good to be true, but as I'm writing a PhD on recent Russian memory events, I was excited to try it out. I downloaded the data, generously made available by Kalev Leetaru of the University of Illinois, and got going. It's a large 650mb zip file (4.6gb uncompressed!), and this is apparently the abbreviated version. Consequently this early stage of the analysis was dominated by eager anticipation, as the Cambridge University internet did its thing.
Meanwhile I headed over to David Masad's writeup on getting started with GDELT in python
I hope I find the time to write a one-page survival guide for UNIX, Python and Perl. Here's one for R. The links to core data science concepts are below - I ne…
Via Pierre Levy
Here's the list:
- Practical illustration of Map-Reduce (Hadoop-style), on real data
- A synthetic variance designed for Hadoop and big data
- Fast Combinatorial Feature Selection with New Definition of Predict...
- A little known component that should be part of most data science a...
- 11 Features any database, SQL or NoSQL, should have
- Clustering idea for very large datasets
- Hidden decision trees revisited
- Correlation and R-Squared for Big Data
- Marrying computer science, statistics and domain expertize
- New pattern to predict stock prices, multiplies return by factor 5
- What Map Reduce can't do
- Excel for Big Data
- Fast clustering algorithms for massive datasets
- Source code for our Big Data keyword correlation API
- The curse of big data
- How to detect a pattern? Problem and solution
- Interesting Data Science Application: Steganography
A conversation about data-driven innovation is possible now because new technologies have made it easier and cheaper to collect, store, analyze, use, and disseminate data. But while the potential for vastly more data-driven innovation exists, many organizations have been slow to adopt these technologies. Policymakers around the world should do more to spur data-driven innovation in both the public and private sectors, including by supporting the development of human capital, encouraging the advancement of innovative technology, and promoting the availability of data itself for use and reuse.
From books to videos to online tutorials -- most free! -- here are plenty of ideas to burnish your R knowledge.
These websites, videos, blogs, social media / communities, software andbooks/ebooks can help you do more with R; the favorites are listed in bold.
I provide an overview of the data science workflow and highlight some challenges that data scientists face in their work.
Via João Greno Brogueira
@luiy. Great article about #DataScience: the workflow design, methods and problematics.
What do data scientists do at work, and what challenges do they face?
This post provides an overview of the modern data science workflow, adapted from Chapter 2 of my Ph.D. dissertation, Software Tools to Facilitate Research Programming.The Data Science Workflow
The figure below shows the steps involved in a typical data science workflow. There are four main phases, shown in the dotted-line boxes: preparation of the data, alternating between running the analysis andreflection to interpret the outputs, and finally dissemination of results in the form of written reports and/or executable code.
Table of Contents
This book’s contents are freely available as PDF files. When you click on a chapter title below, you will be taken to a webpage for that chapter. The page contains links for a PDF of that chapter and for any sample Python code and data that chapter requires. Please let me know if you see an error in the book, if some part of the book is confusing, or if you have some other comment. I will use these to revise the chapters.
Chapter 1: Introduction
Finding out what data mining is and what problems it solves. What will you be able to do when you finish this book.
Chapter 2: Get Started with Recommendation Systems
Introduction to social filtering. Basic distance measures including Manhattan distance, Euclidean distance, and Minkowski distance. Pearson Correlation Coefficient. Implementing a basic algorithm in Python.
Chapter 3: Implicit ratings and item-based filtering
A discussion of the types of user ratings we can use. Users can explicitly give ratings (thumbs up, thumbs down, 5 stars, or whatever) or they can rate products implicitly–if they buy an mp3 from Amazon, we can view that purchase as a ‘like’ rating.
Chapter 4: Classification
In previous chapters we used people’s ratings of products to make recommendations. Now we turn to using attributes of the products themselves to make recommendations. This approach is used by Pandora among others.
Chapter 5: Further Explorations in Classification
A discussion on how to evaluate classifiers including 10-fold cross-validation, leave-one-out, and the Kappa statistic. The k Nearest Neighbor algorithm is also introduced.
Chapter 6: Naïve Bayes
An exploration of Naïve Bayes classification methods. Dealing with numerical data using probability density functions.
Chapter 7: Naïve Bayes and unstructured text
This chapter explores how we can use Naïve Bayes to classify unstructured text. Can we classify twitter posts about a movie as to whether the post was a positive review or a negative one? (new version coming November 2013)
Recently Plugged In to Gnip partner, Qlik, and international relief organization, Medair, hosted a hackathon focused on using social data to inform global disaster response.
At Gnip, we’re always excited to hear about groups and individuals who are using social data in unique ways to improve our world. We were recently fortunate enough to support this use of social data for humanitarian good first-hand. Along with Plugged In to Gnip partner, Qlik, and international relief organization, Medair, we hosted a hackathon focused on global disaster response.
The hackathon took place during Qlik’s annual partner conference in Orlando and studied social content from last year’s Typhoon Haiyan. Historical Twitter data from Gnip was paired with financial information from Medair to give participants the opportunity to create new analytic tools on Qlik’s QlikView.Next BI platform. The Twitter data set specifically included Tweets from users in the Philippines for the two week period around Typhoon Haiyan in November of 2013. The unique combination of data and platform allowed the hackathon developers to dissect and visualize a massive social data set with the goal of uncovering new insights that could be applied in future natural disasters.
Today we’re introducing a pilot project we’re calling Twitter Data Grants, through which we’ll give a handful of research institutions access to our public and historical data. Wi......
With more than 500 million Tweets a day, Twitter has an expansive set of data from which we can glean insights and learn about a variety of topics, from health-related information such as when and where the flu may hit to global events like ringing in the new year. To date, it has been challenging for researchers outside the company who are tackling big questions to collaborate with us to access our public, historical data. Our Data Grants program aims to change that by connecting research institutions and academics with the data they need.
If you’d like to participate, submit a proposal here no later than March 15th. For this initial pilot, we’ll select a small number of proposals to receive free datasets. We can do this thanks to Gnip, one of our certified data reseller partners. They are working with us to give selected institutions free and easy access to Twitter datasets. In addition to the data, we will also be offering opportunities for the selected institutions to collaborate with Twitter engineers and researchers.
Find statistics, consumer survey results and industry studies
Statista is the world’s largest statistics portal. Providing you with access to relevant data from over 18,000 sources, our focus is firmly based on professional, clear, quick and consistent results. Our customized search query form provides you with a list of statistics, studies and reports relating to your search request within a matter of seconds –kick-starting your research.
Today we are going to talk about the wide spectrum of functions and methods that we can use to visualize dendrograms in R. You can check an extended version of this post with the complete reproduci...
A quick reminder: a dendrogram (from Greek dendron=tree, and gramma=drawing) is nothing more than a tree diagram that practitioners use to depict the arrangement of the clusters produced by hierarchical clustering.
Text Mining is an interdisciplinary field combining techniques from linguistics, computer science and statistics to build tools that can efficiently retrieve and extract information from digital text. Over the last few decades, there has been increasing interest in text mining research because of the potential commercial and academic benefits this technology might enable. However, as with the promises of many new technologies, the benefits of text mining are still not clear to most academic researchers.
This situation is now poised to change for several reasons. First, the rate of growth of the scientific literature has now outstripped the ability of individuals to keep pace with new publications, even in a restricted field of study. Second, text-mining tools have steadily increased in accuracy and sophistication to the point where they are now suitable for widespread application. Finally, the rapid increase in availability of digital text in an Open Access format now permits text-mining tools to be applied more freely than ever before.
Over the last few years, we've heard a lot about how
“I think the big science journalism story of 2014 will be the brain drain from science to industry ‘data science,’” Fred writes. “Up until a few years ago, at least in my field, the best grad students got jobs as professors, and the less successful grad students took jobs in industry. It is now the reverse. It’s a real trend, and it’s a big deal. One reason is that science tends not to reward the graduate students who are best at developing good software, which is exactly what science needs right now…
“Another reason, especially important for me, is the quality of research in academia and in industry. In academia, the journals tend to want the most interesting results and are not so concerned about whether the results are true. In industry data science, [your] boss just wants the truth. That’s a much more inspiring environment to work in. I like writing code and analyzing data. In industry, I can do that for most of the day. In academia, it seems like faculty have to spend most of their time writing grants and responding to emails.”