Data hacking
Curating data science tools
Curated by Claudia Mihai

Scooped by Claudia Mihai

The First Interactive Network and Graph Data Repository with Interactive Graph Analytics and Visualization

A network and graph data repository containing hundreds of real-world networks and benchmark datasets. This large, comprehensive collection of network graph data is useful for making significant research findings and provides benchmark data sets for machine learning and network science. All data sets can be easily downloaded in a standard, consistent format. We have also built a multi-level interactive graph analytics engine for visualizing the structure of networks, along with many global graph statistics and local node-level properties.
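To make "global graph statistics" and "local node-level properties" concrete, here is a minimal Python sketch that computes a few of each from an edge list. It is not the repository's analytics engine. The function name `graph_stats`, the choice of statistics, and the toy edge list are all assumptions made for illustration.

```python
from collections import defaultdict


def graph_stats(edges):
    """Compute a few global and node-level statistics for an undirected graph.

    `edges` is an iterable of (u, v) pairs; self-loops are ignored.
    This is an illustrative sketch, not the repository's own engine.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue
        adjacency[u].add(v)
        adjacency[v].add(u)

    n_nodes = len(adjacency)
    n_edges = sum(len(nbrs) for nbrs in adjacency.values()) // 2
    # Density: fraction of possible undirected edges that are present.
    density = 2 * n_edges / (n_nodes * (n_nodes - 1)) if n_nodes > 1 else 0.0

    degree = {node: len(nbrs) for node, nbrs in adjacency.items()}

    # Local clustering coefficient: share of a node's neighbor pairs
    # that are themselves connected.
    clustering = {}
    for node, nbrs in adjacency.items():
        k = len(nbrs)
        if k < 2:
            clustering[node] = 0.0
            continue
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adjacency[a])
        clustering[node] = 2 * links / (k * (k - 1))

    return {
        "nodes": n_nodes,
        "edges": n_edges,
        "density": density,
        "degree": degree,
        "clustering": clustering,
    }


# Hypothetical toy network: a triangle a-b-c with a pendant node d.
stats = graph_stats([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
```

Density and the node and edge counts describe the whole graph, while degree and clustering are reported per node, which mirrors the global versus local split in the description above.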

Scooped by Claudia Mihai

Landscape of open source data tools

Open-data-landscape: A (comprehensive) collection of open source tools used by the data community.

Scooped by Claudia Mihai

New algorithm identifies data subsets that will yield the most reliable predictions

Much artificial-intelligence research addresses the problem of making predictions based on large data sets. An obvious example is the recommendation engines at retail sites like Amazon and Netflix.

But some types of data are harder to collect than online click histories: information about geological formations thousands of feet underground, for instance. And in other applications, such as trying to predict the path of a storm, there may just not be enough time to crunch all the available data.

Dan Levine, an MIT graduate student in aeronautics and astronautics, and his advisor, Jonathan How, the Richard Cockburn Maclaurin Professor of Aeronautics and Astronautics, have developed a new technique that could help with both problems. For a range of common applications in which data is either difficult to collect or too time-consuming to process, the technique can identify the subset of data items that will yield the most reliable predictions. So geologists trying to assess the extent of underground petroleum deposits, or meteorologists trying to forecast the weather, can make do with just a few, targeted measurements, saving time and money.

Scooped by Claudia Mihai

Introducing tidyr

tidyr is a new package that makes it easy to “tidy” your data. Tidy data is data that’s easy to work with: it’s easy to munge (with dplyr), visualise (with ggplot2 or ggvis) and model (with R’s hundreds of modelling packages). The two most important properties of tidy data are:

Each column is a variable.
Each row is an observation.

Arranging your data in this way makes it easier to work with because you have a consistent way of referring to variables (as column names) and observations (as row indices). When you use tidy data and tidy tools, you spend less time worrying about how to feed the output from one function into the input of another, and more time answering your questions about the data.

To tidy messy data, you first identify the variables in your dataset, then use the tools provided by tidyr to move them into columns. tidyr provides three main functions for tidying your messy data: gather(), separate() and spread().
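The three operations can be sketched in plain Python to show what each one does to the shape of the data. This is not tidyr's API or implementation. The functions below are loose analogues that operate on lists of dictionaries, and their parameter names and the sample data are assumptions made for illustration.

```python
def gather(rows, key, value, columns):
    """Wide to long: emit one row per (input row, gathered column)."""
    out = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in columns}
        for col in columns:
            out.append({**kept, key: col, value: row[col]})
    return out


def separate(rows, column, into, sep):
    """Split one string column into several columns on a separator."""
    out = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        new_row.update(zip(into, row[column].split(sep)))
        out.append(new_row)
    return out


def spread(rows, key, value):
    """Long to wide: the inverse of gather."""
    grouped = {}
    for row in rows:
        # Identify each output row by every column except key and value.
        ident = tuple((k, v) for k, v in row.items() if k not in (key, value))
        grouped.setdefault(ident, dict(ident))[row[key]] = row[value]
    return list(grouped.values())


# Hypothetical messy data: the column names encode a variable (the arm).
messy = [
    {"name": "ann", "treatment_a": 5, "treatment_b": 7},
    {"name": "bob", "treatment_a": 3, "treatment_b": 4},
]

long_rows = gather(messy, "key", "score", ["treatment_a", "treatment_b"])
tidy = separate(long_rows, "key", ["kind", "arm"], "_")
wide_again = spread(long_rows, "key", "score")
```

After `gather` and `separate`, each row holds one observation (one person, one arm, one score), and `spread` restores the original wide layout.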

Scooped by Claudia Mihai

How to stand out in academic scientific research

This post is aimed at young academic scientists, particularly post-docs. Please note, I am a biologist, so some of my recommendations may be specific to this field.

Scooped by Claudia Mihai

BIDMach machine learning toolkit

The BID Data Suite is a collection of hardware, software and design patterns that enable fast, large-scale data mining at very low cost.

The elements of the suite are:

Hardware. The data engine that balances storage, CPU and GPU acceleration for typical data mining workloads.
Software. BIDMat, an interactive matrix library that integrates CPU and GPU acceleration and novel computational kernels. BIDMach, a machine learning system that includes very efficient model optimizers and mixing strategies.
Scaling Up. Butterfly Mixing, a communication strategy that hides the latency of frequent model updates needed by fast optimizers for clusters. Sparse AllReduce, an efficient MapReduce-like primitive for scalable communication of power-law data.


Scooped by Claudia Mihai

Machine Learning Done Wrong

Statistical modeling is a lot like engineering.

In engineering, there are various ways to build a key-value storage, and each design makes a different set of assumptions about the usage pattern. In statistical modeling, there are various algorithms to build a classifier, and each algorithm makes a different set of assumptions about the data.

When dealing with small amounts of data, it’s reasonable to try as many algorithms as possible and to pick the best one since the cost of experimentation is low. But as we hit “big data”, it pays off to analyze the data upfront and then design the modeling pipeline (pre-processing, modeling, optimization algorithm, evaluation, productionization) accordingly.

Scooped by Claudia Mihai

Cytoscape.js an open-source graph theory library

Cytoscape.js is an open-source graph theory library written in JavaScript. You can use Cytoscape.js for graph analysis and visualisation.

Cytoscape.js allows you to easily display and manipulate rich, interactive graphs. Because Cytoscape.js allows the user to interact with the graph and the library allows the client to hook into user events, Cytoscape.js is easily integrated into your webapp, especially since Cytoscape.js supports both desktop browsers, like Chrome, and mobile browsers, like on the iPad. Cytoscape.js includes all the gestures you would expect out-of-the-box, including pinch-to-zoom, box selection, panning, et cetera.

Cytoscape.js also has graph analysis in mind: The library contains a slew of useful functions in graph theory. You can use Cytoscape.js headlessly on Node.js to do graph analysis in the terminal or on a web server.

Cytoscape.js is an open-source project, and anyone is free to contribute. For more information, refer to the GitHub README.

The library was developed at the Donnelly Centre at the University of Toronto. It is the successor of Cytoscape Web.

Scooped by Claudia Mihai

rBlocks

rBlocks is an attempted port of ipythonblocks to R, to provide a fun and visual tool to explore data structures and control flow.

Scooped by Claudia Mihai

Hidden Markov Models in R

The general idea of an HMM is easy enough to understand: one observes some time series or stochastic process and imagines that it has been generated by an unobserved or "hidden" Markov process. However, the details of formulating and fitting an HMM involve some specialized knowledge, and the sophisticated tools available to develop an HMM in R can add a further level of complexity. Joe’s presentation helps a beginner to dive right in. He briefly states what HMMs are all about, presents some practical examples, and then goes on to show how to use the functions in the very powerful depmixS4 package to fit an HMM to a time series of S&P 500 returns.
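The "hidden process generates what you observe" idea can be shown with the forward algorithm, which scores an observation sequence under a given HMM. The sketch below is plain Python, not depmixS4, and it does not fit anything. The two hidden states, the discrete "up"/"down" observations, and every probability are made-up assumptions.

```python
def forward_likelihood(obs, start, trans, emit):
    """Probability of an observation sequence under a discrete HMM.

    obs   : list of observation indices
    start : start[s] = P(first hidden state is s)
    trans : trans[p][s] = P(next state is s | current state is p)
    emit  : emit[s][o] = P(observing o | hidden state is s)
    """
    n_states = len(start)
    # alpha[s] = P(observations so far, current hidden state is s)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            emit[s][o] * sum(alpha[p] * trans[p][s] for p in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical parameters: state 0 = "calm market", state 1 = "turbulent market";
# observation 0 = "up day", observation 1 = "down day".
START = [0.5, 0.5]
TRANS = [[0.9, 0.1],
         [0.2, 0.8]]
EMIT = [[0.7, 0.3],
        [0.4, 0.6]]

likelihood = forward_likelihood([0, 0, 1, 1], START, TRANS, EMIT)
```

Fitting an HMM, as the presentation does for S&P 500 returns, means searching for the parameters that maximize this kind of likelihood. That estimation step is not shown here.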

Scooped by Claudia Mihai

Metacademy: a package manager for knowledge

In recent years, there’s been an explosion of free educational resources that make high-level knowledge and skills accessible to an ever-wider group of people. In your own field, you probably have a good idea of where to look for the answer to any particular question. But outside your areas of expertise, sifting through textbooks, Wikipedia articles, research papers, and online lectures can be bewildering (unless you’re fortunate enough to have a knowledgeable colleague to consult). What are the key concepts in the field, how do they relate to each other, which ones should you learn, and where should you learn them?

Scooped by Claudia Mihai

New to Machine Learning? Avoid these three mistakes

Machine learning (ML) is one of the hottest fields in data science. Ever since ML entered the mainstream through Amazon, Netflix, and Facebook, people have been giddy about what they can learn from their data. However, modern machine learning (i.e. not the theoretical statistical learning that emerged in the 70s) is very much an evolving field, and despite its many successes we are still learning what exactly ML can do for data practitioners. I gave a talk on this topic earlier this fall at Northwestern University and I wanted to share these cautionary tales with a wider audience.

Flavio Barros's curator insight, April 5, 11:50 PM

The three most common mistakes in applying ML methods. Worth a read.

Scooped by Claudia Mihai

CSV Fingerprint: Spot errors in your data at a glance

CSV Fingerprint: Spot errors in your data at a glance http://t.co/EyfyuCHFhh

Scooped by Claudia Mihai

Agent Based Models and RNetLogo

If I had to pick just one application to be the “killer app” for the digital computer I would probably choose Agent Based Modeling (ABM). Imagine creating a world populated with hundreds, or even thousands of agents, interacting with each other and with the environment according to their own simple rules. What kinds of patterns and behaviors would emerge if you just let the simulation run? Could you guess a set of rules that would mimic some part of the real world? This dream is probably much older than the digital computer, but according to Jan Thiele’s brief account of the history of ABMs that begins his recent paper, R Marries NetLogo: Introduction to the RNetLogo Package in the Journal of Statistical Software, academic work with ABMs didn’t really take off until the late 1990s.
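A toy model shows how little is needed to get started with the idea. The sketch below is plain Python and uses neither NetLogo nor RNetLogo. The exchange rule, the function name, and all parameter values are assumptions chosen only to illustrate agents that follow one simple local rule.

```python
import random


def run_exchange_model(n_agents=100, steps=10_000, start_wealth=10, seed=0):
    """Toy agent-based model: agents hand single units of wealth to each other.

    Rule (an illustrative assumption): at every step one randomly chosen
    agent gives one unit to another randomly chosen agent, provided the
    giver has anything left to give.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    wealth = [start_wealth] * n_agents
    for _ in range(steps):
        giver, receiver = rng.sample(range(n_agents), 2)
        if wealth[giver] > 0:
            wealth[giver] -= 1
            wealth[receiver] += 1
    return wealth


wealth = run_exchange_model()
richest, poorest = max(wealth), min(wealth)
```

Every agent starts out identical and total wealth never changes, so any spread between `richest` and `poorest` at the end comes purely from the repeated interactions. That is the kind of pattern one looks for when letting a simulation run.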

Now, people are using ABMs for serious studies in economics, sociology, ecology, socio-psychology, anthropology, marketing and many other fields. No less of a complexity scientist than Doyne Farmer (of Dynamic Systems and Prediction Company fame) has argued in Nature for using ABMs to model the complexity of the US economy, and has published on using ABMs to drive investment models. In the following clip of a 2006 interview, Doyne talks about building ABMs to explain the role of subprime mortgages in the Housing Crisis. (Note that when asked about how one would calibrate such a model, Doyne explains the need to collect massive amounts of data on individuals.)

Scooped by Claudia Mihai

The Only Probability Cheatsheet You'll Ever Need

Handy resource for #datascience: a super-condensed probability cheat sheet http://t.co/BdcgkAdgpi

Yaser Helmy's curator insight, July 21, 10:49 AM

Although I have been a practicing data scientist for years now, I have actually understood some concepts from this sheet!

Loved it.

Scooped by Claudia Mihai

Forget the Wisdom of Crowds; Neurobiologists Reveal the Wisdom of the Confident

The wisdom of crowds breaks down when people are biased. Now researchers have discovered a simple method of removing this bias: just listen to the most confident.

Rescooped by Claudia Mihai from Code Hacks

Code as a Research Object

Archive your GitHub code repository to figshare and receive a citable DOI.


Via Alin Velea

Scooped by Claudia Mihai

The Two Longest-Lasting Computer Programs Are Mortal Enemies

In a world where both software and hardware frequently become obsolete right on release, two rival programs can stake a claim to being among the longest-lived applications of all time. Both programs are about to enter their fifth decades. Both programs are text editors, for inputting and editing code, data files, raw HTML Web pages, and anything else. And they are mortal enemies.
Their names are Emacs and Vi (styled by programmers as “vi”). These editors are legendary and ancient, no exaggeration. Both date back to at least 1976, making them older than the vast majority of people currently using them. Both programs are text editors, which means they are not WYSIWYG (what you see is what you get)—unlike, say, word processors like Microsoft Word, they do not format your words onscreen. Programming is very different from word processing, and the basic goal of Emacs and Vi—fast editing of source code (and any other text files)—has yet to become obsolete. Both have been in ongoing development for almost 40 years.

Scooped by Claudia Mihai

Beaker Notebook - The data scientist's lab notebook

Beaker is a code notebook that allows you to analyze, visualize, and document data using multiple programming languages. Beaker's plugin-based polyglot architecture enables you to seamlessly switch between languages in your documents and add support for your favorite languages that we've missed.

Scooped by Claudia Mihai

A thorough guide to SQLite database operations in Python

After I wrote the initial teaser article “SQLite – Working with large data sets in Python effectively” about how awesome SQLite databases are via sqlite3 in Python, I wanted to delve a little bit more into the SQLite syntax and provide you with some more hands-on examples.
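For readers who have not used sqlite3 before, a minimal sketch of the basic workflow might look like this. It is not taken from the linked guide. The table name, column names, and sample rows are assumptions for illustration, and the database is kept in memory so that nothing is written to disk.

```python
import sqlite3

# ":memory:" creates a throwaway database that lives only in RAM.
conn = sqlite3.connect(":memory:")

# Hypothetical schema for illustration.
conn.execute(
    "CREATE TABLE measurements ("
    "id INTEGER PRIMARY KEY, label TEXT NOT NULL, value REAL NOT NULL)"
)

# Parameterized inserts: the ? placeholders keep values out of the SQL string.
rows = [("a", 1.5), ("b", 2.5), ("a", 4.0)]
conn.executemany("INSERT INTO measurements (label, value) VALUES (?, ?)", rows)
conn.commit()

# Aggregate per label; fetchall() returns a list of tuples.
totals = conn.execute(
    "SELECT label, SUM(value) FROM measurements GROUP BY label ORDER BY label"
).fetchall()

conn.close()
```

Passing a file path instead of `":memory:"` to `sqlite3.connect` stores the same database on disk, which is the usual setup for working with larger data sets.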

Rescooped by Claudia Mihai from Code Hacks

Starting to Demo the Wolfram Language

We’re getting closer to the first official release of the Wolfram Language—so I am starting to demo it more publicly.
Here’s a short video demo I just made. It’s amazing to me how much of this is based on things I hadn’t even thought of just a few months ago. Knowledge-based programming is going to be much bigger than I imagined…


Via Alin Velea