Public Datasets - Open Data -
Scooped by luiy

Quantifying Memory: Mapping the #GDELT data in #R (and some Russian protests, too) | #opendata #datascience

luiy's insight:

The Guardian recently published an article linking to a database of 250 million events. Sounds too good to be true, but as I'm writing a PhD on recent Russian memory events, I was excited to try it out. I downloaded the data, generously made available by Kalev Leetaru of the University of Illinois, and got going. It's a large 650 MB zip file (4.6 GB uncompressed!), and this is apparently the abbreviated version. Consequently this early stage of the analysis was dominated by eager anticipation, as the Cambridge University internet did its thing.

Meanwhile I headed over to David Masad's writeup on getting started with GDELT in Python.

Rescooped by luiy from Big Data, Cloud and Social everything

17 short #tutorials all data scientists should read (and practice) | #datascience

I hope I find the time to write a one-page survival guide for UNIX, Python and Perl. Here's one for R. The links to core data science concepts are below - I ne…

Via Pierre Levy
luiy's insight:

Here's the list:

- Practical illustration of Map-Reduce (Hadoop-style), on real data
- A synthetic variance designed for Hadoop and big data
- Fast Combinatorial Feature Selection with New Definition of Predict...
- A little known component that should be part of most data science a...
- 11 Features any database, SQL or NoSQL, should have
- Clustering idea for very large datasets
- Hidden decision trees revisited
- Correlation and R-Squared for Big Data
- Marrying computer science, statistics and domain expertize
- New pattern to predict stock prices, multiplies return by factor 5
- What Map Reduce can't do
- Excel for Big Data
- Fast clustering algorithms for massive datasets
- Source code for our Big Data keyword correlation API
- The curse of big data
- How to detect a pattern? Problem and solution
- Interesting Data Science Application: Steganography

patrick Anguet's curator insight, March 3, 4:25 AM

A good introduction to the pillars of a data scientist's skill set. Do not forget the mathematical and business skills.

Louis Joseph's comment, March 5, 3:11 PM
The image seems to be the wrong illustration ;-)
Rescooped by luiy from Big Data, Cloud and Social everything

Center for Data #Innovation » #Data Innovation 101


Via Pierre Levy
luiy's insight:

A conversation about data-driven innovation is possible now because new technologies have made it easier and cheaper to collect, store, analyze, use, and disseminate data. But while the potential for vastly more data-driven innovation exists, many organizations have been slow to adopt these technologies. Policymakers around the world should do more to spur data-driven innovation in both the public and private sectors, including by supporting the development of human capital, encouraging the advancement of innovative technology, and promoting the availability of data itself for use and reuse.

Arent van 't Spijker's curator insight, February 13, 4:09 AM

They must be very busy: the Center for Data Innovation formulates and promotes pragmatic public policies designed to enable data-driven innovation in the public and private sector, create new economic opportunities, and improve quality of life.

Rescooped by luiy from Data is big

60+ R resources to improve your data skills | #datascience #dataviz

From books to videos to online tutorials -- most free! -- here are plenty of ideas to burnish your R knowledge.

Via ukituki
luiy's insight:

These websites, videos, blogs, social media / communities, software and books/ebooks can help you do more with R; the favorites are listed in bold.

Rescooped by luiy from SNA - Social Network Analysis ... and more.

#DataScience Workflow: Overview and Challenges | #methods #research

I provide an overview of the data science workflow and highlight some challenges that data scientists face in their work.

Via João Greno Brogueira
luiy's insight:

@luiy. Great article about #DataScience: workflow design, methods, and challenges.

What do data scientists do at work, and what challenges do they face?

This post provides an overview of the modern data science workflow, adapted from Chapter 2 of my Ph.D. dissertation, Software Tools to Facilitate Research Programming.

The Data Science Workflow

The figure below shows the steps involved in a typical data science workflow. There are four main phases, shown in the dotted-line boxes: preparation of the data, alternating between running the analysis and reflection to interpret the outputs, and finally dissemination of results in the form of written reports and/or executable code.

Rescooped by luiy from Big Data, IoT and other stuffs
Scoop.it!

A Programmer's Guide to #DataMining | #OpenBook #DataScience


Via Joaquín Herrero Pintado, Toni Sánchez
luiy's insight:

Table of Contents

This book’s contents are freely available as PDF files. When you click on a chapter title below, you will be taken to a webpage for that chapter. The page contains links for a PDF of that chapter and for any sample Python code and data that chapter requires. Please let me know if you see an error in the book, if some part of the book is confusing, or if you have some other comment. I will use these to revise the chapters.

Chapter 1: Introduction

Finding out what data mining is and what problems it solves, and what you will be able to do when you finish this book.

Chapter 2: Get Started with Recommendation Systems

Introduction to social filtering. Basic distance measures including Manhattan distance, Euclidean distance, and Minkowski distance. Pearson Correlation Coefficient. Implementing a basic algorithm in Python.

Chapter 3: Implicit ratings and item-based filtering

A discussion of the types of user ratings we can use. Users can explicitly give ratings (thumbs up, thumbs down, 5 stars, or whatever) or they can rate products implicitly: if they buy an mp3 from Amazon, we can view that purchase as a ‘like’ rating.

Chapter 4: Classification

In previous chapters we used people’s ratings of products to make recommendations. Now we turn to using attributes of the products themselves to make recommendations. This approach is used by Pandora among others.

Chapter 5: Further Explorations in Classification

A discussion on how to evaluate classifiers including 10-fold cross-validation, leave-one-out, and the Kappa statistic. The k Nearest Neighbor algorithm is also introduced.

Chapter 6: Naïve Bayes

An exploration of Naïve Bayes classification methods. Dealing with numerical data using probability density functions.

Chapter 7: Naïve Bayes and unstructured text

This chapter explores how we can use Naïve Bayes to classify unstructured text. Can we classify Twitter posts about a movie as to whether the post was a positive review or a negative one? (new version coming November 2013)
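
The distance measures named under Chapter 2 can be illustrated with a short sketch. This is not code from the book: the function names and the two rating vectors are hypothetical, chosen only to show how Manhattan and Euclidean distance fall out of Minkowski distance as special cases, and how the Pearson correlation coefficient compares two users' ratings.

```python
import math


def minkowski(a, b, r):
    """Minkowski distance of order r between two equal-length rating vectors."""
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1 / r)


def manhattan(a, b):
    """Manhattan distance is the Minkowski distance with r = 1."""
    return minkowski(a, b, 1)


def euclidean(a, b):
    """Euclidean distance is the Minkowski distance with r = 2."""
    return minkowski(a, b, 2)


def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


# Hypothetical ratings by two users for the same four items.
alice = [4.0, 2.0, 5.0, 1.0]
bob = [5.0, 1.0, 4.0, 2.0]

print(manhattan(alice, bob))
print(euclidean(alice, bob))
print(pearson(alice, bob))
```

A smaller distance means more similar tastes, while a Pearson value closer to 1 means the two users rank items in a similar way even if one of them rates consistently higher.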

Intriguing Networks's curator insight, December 8, 2013 2:48 PM

Cheers, thanks for this. Handy for all budding DH students.

Scooped by luiy

Hacking to Improve #Disaster Response with Qlik, Medair and Gnip | #opendata #datascience

Recently Plugged In to Gnip partner, Qlik, and international relief organization, Medair, hosted a hackathon focused on using social data to inform global disaster response.
luiy's insight:

At Gnip, we’re always excited to hear about groups and individuals who are using social data in unique ways to improve our world. We were recently fortunate enough to support this use of social data for humanitarian good first-hand. Along with Plugged In to Gnip partner, Qlik, and international relief organization, Medair, we hosted a hackathon focused on global disaster response.

 

The hackathon took place during Qlik’s annual partner conference in Orlando and studied social content from last year’s Typhoon Haiyan. Historical Twitter data from Gnip was paired with financial information from Medair to give participants the opportunity to create new analytic tools on Qlik’s QlikView.Next BI platform. The Twitter data set specifically included Tweets from users in the Philippines for the two-week period around Typhoon Haiyan in November 2013. The unique combination of data and platform allowed the hackathon developers to dissect and visualize a massive social data set with the goal of uncovering new insights that could be applied in future natural disasters.

Scooped by luiy

Introducing Twitter #Data Grants | Twitter for Researchers | #datascience

Today we’re introducing a pilot project we’re calling Twitter Data Grants, through which we’ll give a handful of research institutions access to our public and historical data. Wi......
luiy's insight:

With more than 500 million Tweets a day, Twitter has an expansive set of data from which we can glean insights and learn about a variety of topics, from health-related information such as when and where the flu may hit to global events like ringing in the new year. To date, it has been challenging for researchers outside the company who are tackling big questions to collaborate with us to access our public, historical data. Our Data Grants program aims to change that by connecting research institutions and academics with the data they need.

 

If you’d like to participate, submit a proposal here no later than March 15th. For this initial pilot, we’ll select a small number of proposals to receive free datasets. We can do this thanks to Gnip, one of our certified data reseller partners. They are working with us to give selected institutions free and easy access to Twitter datasets. In addition to the data, we will also be offering opportunities for the selected institutions to collaborate with Twitter engineers and researchers.

 

Scooped by luiy

Statista - The Statistics Portal | #opendata #datascience

Find statistics, consumer survey results and industry studies from over 18,000 sources on over 60,000 topics on the internet's leading statistics database
luiy's insight:

Statista is the world’s largest statistics portal. Providing you with access to relevant data from over 18,000 sources, our focus is firmly based on professional, clear, quick and consistent results. Our customized search query form provides you with a list of statistics, studies and reports relating to your search request within a matter of seconds, kick-starting your research.

Scooped by luiy

7+ ways to plot dendrograms in R | #Clustering #DataScience

Today we are going to talk about the wide spectrum of functions and methods that we can use to visualize dendrograms in R. You can check an extended version of this post with the complete reproduci...
luiy's insight:

A quick reminder: a dendrogram (from Greek dendron=tree, and gramma=drawing) is nothing more than a tree diagram that practitioners use to depict the arrangement of the clusters produced by hierarchical clustering.
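
To make the link between hierarchical clustering and the tree concrete, here is a minimal sketch in Python rather than in the R functions the post covers. It assumes single-linkage clustering on one-dimensional values, which the post does not specify; the function name and sample values are hypothetical. Each recorded merge corresponds to one junction in a dendrogram, and the merge distance is the height at which that junction would be drawn.

```python
def single_linkage(points):
    """Agglomerative single-linkage clustering on 1-D points.

    Returns the merge history as (left_cluster, right_cluster, height)
    tuples, where clusters are tuples of point indices.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    abs(points[a] - points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical sample values to cluster.
values = [1.0, 2.0, 6.0, 7.5, 20.0]
merges = single_linkage(values)
for left, right, height in merges:
    print(left, right, height)
```

Reading the merge history from first to last gives the dendrogram from its leaves up to the root: nearby values join at low heights, and the outlier joins last at the greatest height.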

Scooped by luiy

Announcing the PLOS Text Mining Collection | #DataScience #OpenAccess

luiy's insight:

Text Mining is an interdisciplinary field combining techniques from linguistics, computer science and statistics to build tools that can efficiently retrieve and extract information from digital text. Over the last few decades, there has been increasing interest in text mining research because of the potential commercial and academic benefits this technology might enable. However, as with the promises of many new technologies, the benefits of text mining are still not clear to most academic researchers.

This situation is now poised to change for several reasons. First, the rate of growth of the scientific literature has now outstripped the ability of individuals to keep pace with new publications, even in a restricted field of study. Second, text-mining tools have steadily increased in accuracy and sophistication to the point where they are now suitable for widespread application. Finally, the rapid increase in availability of digital text in an Open Access format now permits text-mining tools to be applied more freely than ever before.

Scooped by luiy

Are “ #BigData ” Sucking Scientific Talent into Big Business? | #datascience

Over the last few years, we've heard a lot about how
luiy's insight:

“I think the big science journalism story of 2014 will be the brain drain from science to industry ‘data science,’” Fred writes. “Up until a few years ago, at least in my field, the best grad students got jobs as professors, and the less successful grad students took jobs in industry. It is now the reverse. It’s a real trend, and it’s a big deal. One reason is that science tends not to reward the graduate students who are best at developing good software, which is exactly what science needs right now…

 

“Another reason, especially important for me, is the quality of research in academia and in industry. In academia, the journals tend to want the most interesting results and are not so concerned about whether the results are true. In industry data science, [your] boss just wants the truth. That’s a much more inspiring environment to work in. I like writing code and analyzing data. In industry, I can do that for most of the day. In academia, it seems like faculty have to spend most of their time writing grants and responding to emails.”
