Public Datasets - Open Data

Scooped by luiy

The #datascience ecosystem, part 3: Data applications | #openSource #tools

The third part in a series on the data science ecosystem looks at the applications that turn data into insights or models.
luiy's insight:
Open-source tools

Probably because this category has the most ongoing research, there is quite a rich collection of open-source modeling and insights tools. R is an essential tool for most data scientists and works both as a programming language and an interactive environment for exploring data. Octave is a free, open-source alternative to MATLAB that works very well. Julia is becoming increasingly popular for technical computing. Stanford's NLP library provides tools for most standard language-processing tasks. scikit-learn, a machine learning package for Python, is becoming very powerful and has implementations of most standard modeling and machine learning algorithms.
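
As a concrete taste of the last of these, here is a minimal scikit-learn sketch; the bundled iris dataset and the random-forest model are illustrative choices, not something the article prescribes:

```python
# Minimal scikit-learn sketch: cross-validate a random forest on the
# bundled iris dataset. Illustrative only; a real project would add
# feature engineering and more careful validation.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated accuracy
scores = cross_val_score(clf, X, y, cv=5)
print("mean accuracy: %.3f" % scores.mean())
```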


In the end, data application tools are what make data scientists so valuable to any organization. They are what allow a data scientist to make powerful suggestions, uncover hidden trends, and provide tangible value. But these tools simply don't work unless you have good data, and unless you enrich, blend, and clean that data.
Scooped by luiy

Collecting #Twitter Data: Storing Tweets in #MongoDB | #bigdata #NoSQL

luiy's insight:

In the first three sections of the Twitter data collection tutorial, I demonstrated how to collect tweets using both R and Python, how to store these tweets as JSON files, and then how to parse them with R into a .csv file. The .csv file works well, but tweets don't always make good flat .csv files, since not every tweet contains the same fields or the same structure, and much of the data is deeply nested within the JSON object. It is possible to write a parser with a field for each possible subfield, but this would take a while to write and would create a rather large .csv file or SQL database.
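
A document store sidesteps that parser. Below is a minimal sketch, assuming tweets saved one JSON object per line in a local file, of loading them into MongoDB with pymongo; the file, database, and collection names are made up for illustration:

```python
# Sketch: store line-delimited tweet JSON in MongoDB, which keeps the
# nested fields intact instead of flattening them into .csv columns.
# "tweets.json" and the database/collection names are hypothetical.
import json
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
collection = client.twitter_db.tweets

with open("tweets.json") as f:
    for line in f:
        # Each nested JSON document is stored as-is; no per-subfield
        # parser is needed.
        collection.insert_one(json.loads(line))

# Nested fields can then be queried directly with dot notation
for doc in collection.find({"user.lang": "en"}).limit(5):
    print(doc["user"]["screen_name"], doc.get("text", ""))
```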

Scooped by luiy

Project #BigData. Expanding on Project C to look at a different use case | #datascience #opendata

luiy's insight:

Project Big Data is an interactive tool that lets you visualize and explore the funding patterns of over 600 companies in the Big Data ecosystem! It is based on the work I did for Project C (which you can see and read about here). The list of companies and their classification into categories is based on a dozen published sources and rough text analytics of the Crunchbase database. Crunchbase is a curated, crowd-sourced database of over 285k companies.

 

As for the data, there are 645 public and private companies in the data set, from Teradata and IBM to Actuate and Zoomdata. I began by harvesting data from Crunchbase using their free API with Python. As of September, Crunchbase had 1,250 funding events for 410 of the companies on my list. I've grouped these companies into 18 categories, allowing you to compare peers as well as trends across categories. Some of the categories are broken down further: for example, the tool allows you to differentiate between cloud-based and on-premise solutions, or SQL vs. NoSQL databases. I gathered additional data from a variety of sources; for example, LinkedIn was used to find the number of employees.
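
For flavor, here is the general shape of that harvesting step in Python with the requests library; the endpoint, parameters, and field names below are hypothetical placeholders, not the actual Crunchbase API contract:

```python
# Illustrative harvesting loop: pull company records from a REST API.
# BASE, the query parameters, and the JSON field names are hypothetical,
# not the real Crunchbase API.
import requests

API_KEY = "YOUR_KEY"                 # placeholder credential
BASE = "https://api.example.com/v1"  # placeholder, not Crunchbase's URL

def fetch_company(permalink):
    resp = requests.get("%s/companies/%s" % (BASE, permalink),
                        params={"api_key": API_KEY}, timeout=30)
    resp.raise_for_status()
    return resp.json()

record = fetch_company("zoomdata")
# Collect funding events for later aggregation
for event in record.get("funding_rounds", []):
    print(event.get("round_code"), event.get("raised_amount"))
```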

 

 

OPENACCESS Workbook: Project Big Data v1.0 

https://public.tableausoftware.com/download/workbooks/ProjectBigDatav1_0?format=html

 

Rescooped by luiy from Politique des algorithmes

Google has #open sourced a #tool for inferring cause from correlations | #algorithms #datascience

Google open-sourced a new package for the R statistical computing software that's designed to help users infer whether a particular action really did cause subsequent activity. Google has been using the tool, called CausalImpact, to measure AdWords campaigns, but it has broader appeal.

Via Dominique Cardon
luiy's insight:

Google announced on Tuesday a new open-source tool that can help data analysts decide if changes to products or policies resulted in measurable change, or if the change would have happened anyway. The tool, called CausalImpact, is a package for the R statistical computing software, and Google details it in a blog post.

 

According to blog post author Kay H. Brodersen, Google uses the tool — created it, in fact — primarily for quantifying the effectiveness of AdWords campaigns. However, he noted, the same method could be used to gauge everything from whether adding a new feature caused an increase in app downloads to questions involving events in medical, social or political science.
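
CausalImpact itself fits a Bayesian structural time-series model in R, but the underlying idea can be sketched in a few lines of Python: fit the pre-intervention relationship between a control series and the affected series, forecast the counterfactual, and take the difference. This is a deliberately simplified stand-in, not the package's method, and the data is simulated:

```python
# Simplified stand-in for the CausalImpact idea (the real package uses
# a Bayesian structural time-series model): estimate a counterfactual
# from a control series and difference it out. All data is simulated.
import numpy as np

rng = np.random.default_rng(0)
control = 100 + rng.normal(0, 1, 120).cumsum()
treated = 0.8 * control + rng.normal(0, 1, 120)
treated[90:] += 5.0                    # simulated intervention at t=90

pre, post = slice(0, 90), slice(90, 120)

# Fit treated ~ a + b*control on the pre-period only
b, a = np.polyfit(control[pre], treated[pre], 1)
counterfactual = a + b * control[post]

effect = treated[post] - counterfactual
print("average effect: %.2f, cumulative effect: %.2f"
      % (effect.mean(), effect.sum()))
```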

 

http://google.github.io/CausalImpact/

 

 

Scooped by luiy

Quantifying Memory: Mapping the #GDELT data in #R (and some Russian protests, too) | #opendata #datascience

luiy's insight:

The Guardian recently published an article linking to a database of 250 million events. It sounds too good to be true, but as I'm writing a PhD on recent Russian memory events, I was excited to try it out. I downloaded the data, generously made available by Kalev Leetaru of the University of Illinois, and got going. It's a large 650MB zip file (4.6GB uncompressed!), and this is apparently the abbreviated version. Consequently, this early stage of the analysis was dominated by eager anticipation, as the Cambridge University internet did its thing.

 

Meanwhile, I headed over to David Masad's writeup on getting started with GDELT in Python.
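
For anyone following along, the loading step looks roughly like this with pandas; GDELT event files are tab-delimited with no header row, and the column positions and names below are assumed from the GDELT 1.0 codebook, so verify them against the codebook before relying on this:

```python
# Sketch: load a slice of a GDELT daily events file. Column positions
# and names are assumed from the GDELT 1.0 codebook -- check before use.
import pandas as pd

cols = ["GLOBALEVENTID", "SQLDATE", "Actor1Name", "Actor2Name",
        "EventCode", "ActionGeo_Lat", "ActionGeo_Long"]

df = pd.read_csv("20130401.export.CSV", sep="\t", header=None,
                 usecols=[0, 1, 6, 16, 26, 53, 54], names=cols)

# CAMEO root code 14 covers protest events
protests = df[df["EventCode"].astype(str).str.startswith("14")]
print(len(protests), "protest events")
```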

Rescooped by luiy from Big Data Analysis in the Clouds

17 short #tutorials all data scientists should read (and practice) | #datascience

I hope I find the time to write a one-page survival guide for UNIX, Python and Perl. Here's one for R. The links to core data science concepts are below - I ne…

Via Pierre Levy
luiy's insight:

Here's the list:

 

- Practical illustration of Map-Reduce (Hadoop-style), on real data (see the word-count sketch after this list)

- A synthetic variance designed for Hadoop and big data

- Fast Combinatorial Feature Selection with New Definition of Predict...

- A little known component that should be part of most data science a...

- 11 Features any database, SQL or NoSQL, should have

- Clustering idea for very large datasets

- Hidden decision trees revisited

- Correlation and R-Squared for Big Data

- Marrying computer science, statistics and domain expertise

- New pattern to predict stock prices, multiplies return by factor 5

- What Map Reduce can't do

- Excel for Big Data

- Fast clustering algorithms for massive datasets

- Source code for our Big Data keyword correlation API

- The curse of big data

- How to detect a pattern? Problem and solution

- Interesting Data Science Application: Steganography
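
Since Map-Reduce comes up several times in the list, here is the word-count sketch referenced above: a pure-Python toy showing the map, shuffle, and reduce stages, not Hadoop itself.

```python
# Toy word-count in the map/reduce style: map emits (word, 1) pairs,
# the shuffle groups pairs by key, and reduce sums each group.
from itertools import groupby
from operator import itemgetter

docs = ["big data is big", "map reduce maps and reduces"]

# Map: emit (word, 1) for every word in every document
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group pairs by key (grouping requires a sort first)
mapped.sort(key=itemgetter(0))

# Reduce: sum the counts within each group
counts = {word: sum(count for _, count in group)
          for word, group in groupby(mapped, key=itemgetter(0))}
print(counts)  # {'and': 1, 'big': 2, 'data': 1, ...}
```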

patrick Anguet's curator insight, March 3, 2014 7:25 AM

A good introduction to understanding the pillars of data science. Do not forget the mathematical and business skills.

Louis Joseph's comment, March 5, 2014 6:11 PM
The image seems to be the wrong illustration ;-)
Rescooped by luiy from Big Data Analysis in the Clouds

Center for Data #Innovation » #Data Innovation 101


Via Pierre Levy
luiy's insight:

A conversation about data-driven innovation is possible now because new technologies have made it easier and cheaper to collect, store, analyze, use, and disseminate data. But while the potential for vastly more data-driven innovation exists, many organizations have been slow to adopt these technologies. Policymakers around the world should do more to spur data-driven innovation in both the public and private sectors, including by supporting the development of human capital, encouraging the advancement of innovative technology, and promoting the availability of data itself for use and reuse.

Arent van 't Spijker's curator insight, February 13, 2014 7:09 AM

They must be very busy: the Center for Data Innovation formulates and promotes pragmatic public policies designed to enable data-driven innovation in the public and private sector, create new economic opportunities, and improve quality of life.

Rescooped by luiy from Data is big

60+ R resources to improve your data skills | #datascience #dataviz

From books to videos to online tutorials -- most free! -- here are plenty of ideas to burnish your R knowledge.

Via ukituki
luiy's insight:

These websites, videos, blogs, social media / communities, software and books/ebooks can help you do more with R; the favorites are listed in bold.

 

 

Rescooped by luiy from SNA - Social Network Analysis ... and more.

#DataScience Workflow: Overview and Challenges | #methods #research

I provide an overview of the data science workflow and highlight some challenges that data scientists face in their work.

Via João Greno Brogueira
luiy's insight:

@luiy. Great article about #DataScience: the workflow design, methods, and challenges.

 

What do data scientists do at work, and what challenges do they face?

 

This post provides an overview of the modern data science workflow, adapted from Chapter 2 of my Ph.D. dissertation, Software Tools to Facilitate Research Programming.

The Data Science Workflow

The figure in the original post shows the steps involved in a typical data science workflow. There are four main phases, shown in the post's dotted-line boxes: preparation of the data, alternating between running the analysis and reflection to interpret the outputs, and finally dissemination of results in the form of written reports and/or executable code.

Rescooped by luiy from Big Data, IoT and other stuffs

A Programmer's Guide to #DataMining | #OpenBook #DataScience


Via Joaquín Herrero Pintado, Toni Sánchez
luiy's insight:

Table of Contents

 

This book's contents are freely available as PDF files. When you click on a chapter title below, you will be taken to a webpage for that chapter. The page contains links for a PDF of that chapter and for any sample Python code and data that chapter requires. Please let me know if you see an error in the book, if some part of the book is confusing, or if you have some other comment. I will use these comments to revise the chapters.

 

Chapter 1: Introduction

 

Finding out what data mining is and what problems it solves. What you will be able to do when you finish this book.

 

Chapter 2: Get Started with Recommendation Systems

 

Introduction to social filtering. Basic distance measures including Manhattan distance, Euclidean distance, and Minkowski distance. The Pearson Correlation Coefficient. Implementing a basic algorithm in Python (see the distance-measure sketch after this table of contents).

 

Chapter 3: Implicit ratings and item-based filtering

 

A discussion of the types of user ratings we can use. Users can explicitly give ratings (thumbs up, thumbs down, 5 stars, or whatever), or they can rate products implicitly: if they buy an mp3 from Amazon, we can view that purchase as a 'like' rating.

Chapter 4: Classification

 

In previous chapters we used people's ratings of products to make recommendations. Now we turn to using attributes of the products themselves to make recommendations. This approach is used by Pandora, among others.

 

Chapter 5: Further Explorations in Classification

 

A discussion on how to evaluate classifiers including 10-fold cross-validation, leave-one-out, and the Kappa statistic. The k Nearest Neighbor algorithm is also introduced.

 

Chapter 6: Naïve Bayes

 

An exploration of Naïve Bayes classification methods. Dealing with numerical data using probability density functions.

 

Chapter 7: Naïve Bayes and unstructured text

 

This chapter explores how we can use Naïve Bayes to classify unstructured text. Can we classify Twitter posts about a movie as to whether the review was positive or negative? (new version coming November 2013)
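
As a taste of the Chapter 2 material, here is the distance-measure sketch mentioned above: Minkowski distance (Manhattan at r=1, Euclidean at r=2) and the Pearson correlation coefficient over the ratings two users share. It is written from the standard formulas, not copied from the book's code, and the ratings are toy data:

```python
# Distance and similarity measures over sparse user-rating dicts,
# computed only on the items both users have rated.
from math import sqrt

def minkowski(a, b, r):
    """Minkowski distance (r=1 Manhattan, r=2 Euclidean)."""
    shared = [k for k in a if k in b]
    return sum(abs(a[k] - b[k]) ** r for k in shared) ** (1.0 / r)

def pearson(a, b):
    """Pearson correlation coefficient over shared ratings."""
    shared = [k for k in a if k in b]
    n = len(shared)
    if n == 0:
        return 0.0
    xs = [a[k] for k in shared]
    ys = [b[k] for k in shared]
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    denom = (sqrt(sum(x * x for x in xs) - sx * sx / n)
             * sqrt(sum(y * y for y in ys) - sy * sy / n))
    return (sxy - sx * sy / n) / denom if denom else 0.0

alice = {"Blues Traveler": 3.5, "Norah Jones": 4.5, "Phoenix": 5.0}
bob = {"Blues Traveler": 2.0, "Norah Jones": 3.0, "Phoenix": 5.0}
print(minkowski(alice, bob, 1))  # Manhattan
print(minkowski(alice, bob, 2))  # Euclidean
print(pearson(alice, bob))
```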

Intriguing Networks's curator insight, December 8, 2013 5:48 PM

Cheers, thanks for this; handy for all budding DH students.

Scooped by luiy

Twitter Data Mining Round Up | #python #ddj #openaccess

luiy's insight:

Since the release of Mining the Social Web, 2E in late October of last year, I have mostly focused on creating supplemental content centered on Twitter data. This seemed like a natural starting point, given that the first chapter of the book is a gentle introduction to data mining with Twitter's API, coupled with the inherent openness of accessing and analyzing Twitter data (in comparison to other data sources that are a little more restrictive). Twitter's IPO late last year also focused the spotlight on Twitter, which provided some good opportunities to opine on Twitter's underlying data model, which can be interpreted as an interest graph.

Scooped by luiy

TubeKit: A YouTube #Crawling Toolkit | #datascience #tools #bigdata



luiy's insight:

TubeKit is a toolkit for creating YouTube crawlers. It allows one to build one's own crawler that can crawl YouTube based on a set of seed queries and collect up to 16 different attributes.

 

TubeKit assists in all the phases of this process, from database creation through to giving access to the collected data via browsing and searching interfaces. In addition to creating crawlers, TubeKit also provides several tools to collect a variety of data from YouTube, including video details and user profiles.
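
This is not TubeKit itself, but the seed-query idea can be illustrated with the current YouTube Data API v3 search endpoint (TubeKit was built against the older YouTube API, so treat this as a conceptual sketch):

```python
# Conceptual seed-query crawler using the YouTube Data API v3 search
# endpoint; TubeKit itself predates this API. API_KEY is a placeholder.
import requests

API_KEY = "YOUR_KEY"
SEEDS = ["open data", "data science"]

def search_videos(query, max_results=10):
    resp = requests.get(
        "https://www.googleapis.com/youtube/v3/search",
        params={"part": "snippet", "q": query,
                "maxResults": max_results, "key": API_KEY},
        timeout=30)
    resp.raise_for_status()
    return resp.json().get("items", [])

for seed in SEEDS:
    for item in search_videos(seed):
        video_id = item["id"].get("videoId")  # absent for channel hits
        if video_id:
            # A real crawler would store these attributes in a database
            print(video_id, item["snippet"]["title"])
```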

Scooped by luiy

#DataMining: Practical Machine Learning #Tools and Techniques | #Weka #datascience #openaccess

luiy's insight:

Teaching material

 

Slides for Chapters 1-5 of the 3rd edition can be found here.

Slides for Chapters 6-8 of the 3rd edition can be found here.

 

These archives contain .pdf files as well as .odp files in Open Document Format that were generated using OpenOffice 2.0. Note that there are several free office programs now that can read .odp files. There is also a plug-in for Word made by Sun for reading this format. Corresponding information is on this Wikipedia page.

Rescooped by luiy from Data is big

Mining of Massive Datasets | #datascience #freebook


Via ukituki
luiy's insight:

Preface and Table of Contents

Chapter 1. Data Mining

Chapter 2. Map-Reduce and the New Software Stack

Chapter 3. Finding Similar Items

Chapter 4. Mining Data Streams

Chapter 5. Link Analysis

Chapter 6. Frequent Itemsets

Chapter 7. Clustering

Chapter 8. Advertising on the Web

Chapter 9. Recommendation Systems

Chapter 10. Mining Social-Network Graphs

Chapter 11. Dimensionality Reduction

Chapter 12. Large-Scale Machine Learning

 

Download the full book:

http://infolab.stanford.edu/~ullman/mmds/book.pdf

ukituki's curator insight, August 28, 2014 6:22 PM

The book is based on Stanford Computer Science course CS246: Mining Massive Datasets (and CS345A: Data Mining).

Scooped by luiy

Hacking to Improve #Disaster Response with Qlik, Medair and Gnip | #opendata #datascience

Recently, Plugged In to Gnip partner Qlik and international relief organization Medair hosted a hackathon focused on using social data to inform global disaster response.
luiy's insight:

At Gnip, we’re always excited to hear about groups and individuals who are using social data in unique ways to improve our world. We were recently fortunate enough to support this use of social data for humanitarian good first-hand. Along with Plugged In to Gnip partner, Qlik, and international relief organization, Medair, we hosted a hackathon focused on global disaster response.

 

The hackathon took place during Qlik’s annual partner conference in Orlando and studied social content from last year’s Typhoon Haiyan. Historical Twitter data from Gnip was paired with financial information from Medair to give participants the opportunity to create new analytic tools on Qlik’s QlikView.Next BI platform. The Twitter data set specifically included Tweets from users in the Philippines for the two week period around Typhoon Haiyan in November of 2013. The unique combination of data and platform allowed the hackathon developers to dissect and visualize a massive social data set with the goal of uncovering new insights that could be applied in future natural disasters.

Scooped by luiy

Introducing Twitter #Data Grants | Twitter for Researchers | #datascience

Today we’re introducing a pilot project we’re calling Twitter Data Grants, through which we’ll give a handful of research institutions access to our public and historical data. Wi......
luiy's insight:

With more than 500 million Tweets a day, Twitter has an expansive set of data from which we can glean insights and learn about a variety of topics, from health-related information such as when and where the flu may hit to global events like ringing in the new year. To date, it has been challenging for researchers outside the company who are tackling big questions to collaborate with us to access our public, historical data. Our Data Grants program aims to change that by connecting research institutions and academics with the data they need.

 

If you’d like to participate, submit a proposal here no later than March 15th. For this initial pilot, we’ll select a small number of proposals to receive free datasets. We can do this thanks to Gnip, one of our certified data reseller partners. They are working with us to give selected institutions free and easy access to Twitter datasets. In addition to the data, we will also be offering opportunities for the selected institutions to collaborate with Twitter engineers and researchers.

 

Scooped by luiy

Statista - The Statistics Portal | #opendata #datascience

Find statistics, consumer survey results and industry studies from over 18,000 sources on over 60,000 topics on the internet's leading statistics database.
luiy's insight:

Statista is the world's largest statistics portal. Providing you with access to relevant data from over 18,000 sources, our focus is firmly based on professional, clear, quick and consistent results. Our customized search query form provides you with a list of statistics, studies and reports relating to your search request within a matter of seconds, kick-starting your research.

Scooped by luiy

7+ ways to plot dendrograms in R | #Clustering #DataScience

Today we are going to talk about the wide spectrum of functions and methods that we can use to visualize dendrograms in R. You can check an extended version of this post with the complete reproduci...
luiy's insight:

A quick reminder: a dendrogram (from Greek dendron=tree, and gramma=drawing) is nothing more than a tree diagram that practitioners use to depict the arrangement of the clusters produced by hierarchical clustering.
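
The post itself covers R functions, but as a language-neutral illustration of the same concept, here is a minimal hierarchical-clustering dendrogram with SciPy and matplotlib, on made-up data:

```python
# Minimal dendrogram sketch with SciPy (the post covers R equivalents).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (10, 2)),
               rng.normal(5, 1, (10, 2))])   # two toy clusters

Z = linkage(X, method="ward")   # agglomerative clustering
dendrogram(Z)
plt.title("Hierarchical clustering dendrogram")
plt.show()
```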

Scooped by luiy

Announcing the PLOS Text Mining Collection | #DataScience #OpenAccess

luiy's insight:

Text Mining is an interdisciplinary field combining techniques from linguistics, computer science and statistics to build tools that can efficiently retrieve and extract information from digital text. Over the last few decades, there has been increasing interest in text mining research because of the potential commercial and academic benefits this technology might enable. However, as with the promises of many new technologies, the benefits of text mining are still not clear to most academic researchers.

This situation is now poised to change for several reasons. First, the rate of growth of the scientific literature has now outstripped the ability of individuals to keep pace with new publications, even in a restricted field of study. Second, text-mining tools have steadily increased in accuracy and sophistication to the point where they are now suitable for widespread application. Finally, the rapid increase in availability of digital text in an Open Access format now permits text-mining tools to be applied more freely than ever before.

Scooped by luiy

Are “#BigData” Sucking Scientific Talent into Big Business? | #datascience

Over the last few years, we've heard a lot about how
luiy's insight:

“I think the big science journalism story of 2014 will be the brain drain from science to industry ‘data science,’” Fred writes. “Up until a few years ago, at least in my field, the best grad students got jobs as professors, and the less successful grad students took jobs in industry. It is now the reverse. It’s a real trend, and it’s a big deal. One reason is that science tends not to reward the graduate students who are best at developing good software, which is exactly what science needs right now…

 

“Another reason, especially important for me, is the quality of research in academia and in industry. In academia, the journals tend to want the most interesting results and are not so concerned about whether the results are true. In industry data science, [your] boss just wants the truth. That’s a much more inspiring environment to work in. I like writing code and analyzing data. In industry, I can do that for most of the day. In academia, it seems like faculty have to spend most of their time writing grants and responding to emails.”
