Data is big
"The future is here. It's just not evenly distributed yet." - William Gibson     :::: Follow this topic for fresh resources and ideas related to Data Science, Machine Learning, Algorithms and #bigdata ::::
Curated by ukituki
Scooped by ukituki

Popular software skills in Data Science job postings

ukituki's insight:

Association rule mining was used to find which skills occur together.

R, Python, and SQL are the top three skills found. Java continues to be a favorite programming language. Interestingly, SQL trumps Hadoop in the skill list.
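
As a rough illustration of how association rule mining finds skills that occur together, the sketch below computes support and confidence for pairs of skills. The postings are made-up toy data and the function names (`support`, `confidence`, `pair_rules`) are hypothetical; this is not the analysis behind the linked post.

```python
from itertools import combinations

# Hypothetical toy data: each set holds the skills named in one job posting.
postings = [
    {"r", "sql"},
    {"python", "sql"},
    {"r", "python", "sql"},
    {"java"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def pair_rules(transactions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent skill pairs."""
    skills = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(skills, 2):
        s = support({a, b}, transactions)
        if s >= min_support:
            # Each frequent pair yields two directed rules: a -> b and b -> a.
            rules.append((a, b, s, confidence({a}, {b}, transactions)))
            rules.append((b, a, s, confidence({b}, {a}, transactions)))
    return rules


for antecedent, consequent, s, c in pair_rules(postings):
    print(f"{antecedent} -> {consequent}: support={s:.2f}, confidence={c:.2f}")
```

A real analysis would typically use a dedicated implementation (for example, Apriori) to handle itemsets larger than pairs; the pairwise loop here only keeps the idea visible.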

Scooped by ukituki

Sorting Algorithms

ukituki's insight:

Visualization of the sorting algorithms.

Scooped by ukituki

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

We present a model that generates free-form natural language descriptions of image regions. Our model leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between text and visual data. 

ukituki's insight:
Our approach is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate the effectiveness of our alignment model with ranking experiments on Flickr8K, Flickr30K and COCO datasets, where we substantially improve on the state of the art. We then show that the sentences created by our generative model outperform retrieval baselines on the three aforementioned datasets and a new dataset of region-level annotations. 

Scooped by ukituki

10 Lessons Learned from Building Machine Learning Systems

10 lessons we learned at Netflix over years of building real-life, large-scale machine learning systems that impact a product and its users.

Scooped by ukituki

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? [pdf]

Scooped by ukituki

Architecting Predictive Algorithms for Machine Learning (Channel 9)

CDP-B240: Machine learning is one of the newest tools in a Data Scientist’s arsenal.

Rescooped by ukituki from Social Network Analysis #sna

R Graph Catalog

ukituki's curator insight, November 1, 10:26 AM

This catalog is a complement to “Creating More Effective Graphs” by Naomi Robbins. All graphs were produced using the R language and the add-on package ggplot2, written by Hadley Wickham. The gallery is maintained by Joanna Zhao and Jennifer Bryan.

 
Scooped by ukituki

10 Online Big Data Courses

The explosion of hype around the term big data ushered in a rabid desire in companies big and small to get their hands on employees with a data science skill set. For evidence, you need look no furthe

Scooped by ukituki

hts with regressors

The hts package for R allows for forecasting hierarchical and grouped time series data. The idea is to generate forecasts for all series at all levels of aggregation without imposing the aggregation constraints, and then to reconcile the forecasts so they satisfy the aggregation constraints. (An introduction to reconciling hierarchical and grouped time series is available in this Foresight paper.)
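
To make the reconciliation step concrete, here is a minimal Python sketch for the simplest possible hierarchy: one total and its bottom-level series. It assumes plain (unweighted) least-squares reconciliation, which may differ from the weighting the hts package applies, and the function name `reconcile_two_level` is hypothetical.

```python
def reconcile_two_level(total_forecast, bottom_forecasts):
    """Least-squares reconciliation for a hierarchy with one total and n bottom series.

    Finds the forecasts closest (in squared error) to the independent base
    forecasts, subject to the constraint total == sum(bottoms). For this
    simple hierarchy the solution has a closed form: the discrepancy between
    the total forecast and the summed bottom forecasts is split into n + 1
    equal parts, and each bottom series receives one part.
    """
    discrepancy = total_forecast - sum(bottom_forecasts)
    adjustment = discrepancy / (len(bottom_forecasts) + 1)
    bottoms = [b + adjustment for b in bottom_forecasts]
    # The reconciled total is defined as the sum of the reconciled bottoms,
    # so the aggregation constraint holds by construction.
    return sum(bottoms), bottoms


# Hypothetical base forecasts that do not add up: 40 + 50 + 8 = 98, not 110.
total, bottoms = reconcile_two_level(110, [40, 50, 8])
print(total, bottoms)
```

Deeper hierarchies and grouped series need the general matrix formulation; the closed form above only covers the single-level case.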

 
Rescooped by ukituki from Data hacking

rBlocks

rBlocks is an attempted port of ipythonblocks to R, to provide a fun and visual tool to explore data structures and control flow.
Via Claudia Mihai

Rescooped by ukituki from Data hacking

The Only Probability Cheatsheet You'll Ever Need

Handy resource for #datascience: a super-condensed probability cheat sheet http://t.co/BdcgkAdgpi
Via Claudia Mihai
Yaser Helmy's curator insight, July 21, 10:49 AM

Although I have been a practicing data scientist for years now, I have actually understood some concepts from this sheet!

 

Loved it.

Rescooped by ukituki from Social Network Analysis #sna

Data science with F#: Social network analysis - Twitter case study by @evelgab

ukituki's curator insight, November 23, 8:22 AM

In this session we will work through the whole process of social network analysis: from downloading connections using the Twitter REST-based API to implementing our own PageRank algorithm, which finds the most central Twitter accounts. In the process you’ll see how we can use F# type providers to access data and harness the power of the statistical language R to run some machine learning algorithms.

At the end, you’ll know how to run your own analysis on data from Twitter and how to use data science tools to gain insights from social networks.
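
For readers who want a feel for the PageRank step before watching the session, here is a minimal Python sketch using power iteration (the session itself uses F#). The account names and the `follows` graph are hypothetical, and the sketch assumes an edge from A to B means "A follows B", so rank flows toward followed accounts.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a dict mapping each node to the nodes it links to."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for source, targets in links.items():
            if targets:
                # Each node divides its damped rank equally among its targets.
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        converged = sum(abs(new_rank[v] - rank[v]) for v in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical follower graph: "a" and "b" follow "c"; "c" follows "a".
follows = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(follows)
print(max(ranks, key=ranks.get))
```

In this toy graph the account followed by the most others ends up with the highest score, which is the sense of "most central" used above.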

Scooped by ukituki

“Recommender Systems in R” by Tamas Jambor @Sky

Talk by Tamas Jambor, Data Scientist @Sky, at the Data Science London (@ds_ldn) meetup on 12/02/2013

Scooped by ukituki

The ensurer package (validation inside pipes) #rstats

Guest post by Stefan Holst Milton Bache on the ensurer package. If you use R in a production environment, you have most likely experienced that some circum

Scooped by ukituki

Analysis of large time-series data in OpenTSDB

ukituki's insight:

In recent years, the quantity of time-series data generated in a wide variety of domains has grown consistently. Analyzing time-series datasets at a massive scale is one of the biggest challenges that data scientists face. This thesis focuses on the implementation of a tool for analyzing large time-series data. It describes a way to analyze the data stored by OpenTSDB, an open-source, distributed, and scalable time-series database. It has become a challenge for statisticians and data scientists to analyze such massive data sets with the same level of comprehensive detail as is possible for smaller analyses. Currently available tools for time-series analysis are time- and memory-consuming. Moreover, no single tool specializes in providing an efficient implementation of time-series analysis through the MapReduce programming model at massive scale. For these reasons, we have designed an efficient, distributed computing framework: R2Time. R2Time integrates R, the open-source project for statistical computing and visualization, with OpenTSDB [1] and RHIPE [2], based on the MapReduce framework for the distributed processing of large data sets across a cluster. It creates a programming environment for data scientists by integrating R and HBase. This thesis describes the architecture of the R2Time framework. The usefulness of the framework is verified by a performance analysis based on carefully chosen types of statistical analysis for time-series data. As the time-series data size and the complexity of the statistical functions increase, we observed supralinear behavior in the performance of the R2Time framework. The performance of the framework is also verified by a performance analysis based on different configuration settings. Configuration settings such as scan cache and batch size play a vital role in the performance of time-series analysis.

Scooped by ukituki

Computers Are Writing Novels, But Do You Really Want To Read Them?

It’s 10pm, November 30th, 2013. An author, aiming to finish a novel in November, takes up his laptop and begins typing furiously. By midnight, he’s completed I Got a Alligator for a Pet.  

Scooped by ukituki

How to become a data scientist in 8 easy steps: the #infographic

This post was written by the team behind DataCamp, the online interactive learning platform for data science.   After being dubbed “sexiest job of the 21st Century” by Harvard Business Review, data scientists have stirred the interest of the general public. Many people are intrigued by this job, namely because the name has an interesting […]

Scooped by ukituki

Machine Learning - complete course notes

ukituki's insight:

The following notes represent a complete, standalone interpretation of Stanford's machine learning course presented by Professor Andrew Ng and originally posted on the ml-class.org website during the fall 2011 semester. The topics covered are shown below, although for a more detailed summary see lecture 19. The only content not covered here is the Octave/MATLAB programming.

 
Scooped by ukituki

My Commonly Done ggplot2 graphs

In my last post, I discussed how ggplot2 is not always the answer to the question “How should I plot this” and that base graphics were still very useful. Why Do I use ggplot2 then? The overall question still remains: why (do I) use ggplot2? ggplot2 vs lattice For one, ggplot2 replaced the lattice package […]

Scooped by ukituki

TweetNLP: Twitter Natural Language Processing

ukituki's insight:

Twitter data has recently become one of the favorite datasets of Natural Language Processing (NLP) researchers. Besides its enormous size, Twitter data has other unique qualities: it comprises real-life conversations, uniform length (140 characters), rich variety, and a real-time data stream. Advanced analytics on Twitter data requires going beyond individual words and parsing sentences into syntactic representations to develop a better contextual understanding of tweet content. This can now be done conveniently with the tools developed by Prof. Noah Smith and his team at Carnegie Mellon University.

Rescooped by ukituki from Data hacking

Hidden Markov Models in R

The general idea of an HMM is easy enough to understand: one observes some time series or stochastic process and imagines that it has been generated by an unobserved or "hidden" Markov process. However, the details of formulating and fitting an HMM involve some specialized knowledge, and the sophisticated tools available to develop an HMM in R can add an additional level of complexity. Joe’s presentation helps a beginner to dive right in. He briefly states what HMMs are all about, presents some practical examples, and then goes on to show how to use the functions in the very powerful depmixS4 package to fit an HMM to a time series of S&P 500 returns.
Via Claudia Mihai
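
To illustrate the "hidden process generates what we observe" idea, the sketch below implements the forward algorithm, which computes the likelihood of an observed sequence under a given HMM. It is written in Python rather than R, it only evaluates a fixed model (it does not fit one, which is what depmixS4 is used for in the presentation), and the two-regime parameters are invented for illustration.

```python
def forward_likelihood(observations, start, transition, emission):
    """Likelihood of a non-empty observation sequence under a discrete HMM.

    start[s]          probability that the hidden chain begins in state s
    transition[p][s]  probability of moving from hidden state p to state s
    emission[s][o]    probability that hidden state s emits observation o
    """
    n_states = len(start)
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = [start[s] * emission[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emission[s][obs] * sum(alpha[p] * transition[p][s] for p in range(n_states))
            for s in range(n_states)
        ]
    # Summing over the final hidden state marginalizes the hidden path away.
    return sum(alpha)


# Hypothetical two-regime model: state 0 = "calm", state 1 = "turbulent".
# Observation 0 = "up day", observation 1 = "down day".
start = [0.5, 0.5]
transition = [[0.9, 0.1], [0.1, 0.9]]
emission = [[0.8, 0.2], [0.3, 0.7]]

print(forward_likelihood([0, 0, 1, 1], start, transition, emission))
```

Fitting an HMM amounts to searching for the parameters that maximize this likelihood over the observed series, typically with the EM (Baum-Welch) algorithm.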

Rescooped by ukituki from Data hacking

Agent Based Models and RNetLogo

If I had to pick just one application to be the “killer app” for the digital computer I would probably choose Agent Based Modeling (ABM). Imagine creating a world populated with hundreds, or even thousands of agents, interacting with each other and with the environment according to their own simple rules. What kinds of patterns and behaviors would emerge if you just let the simulation run? Could you guess a set of rules that would mimic some part of the real world? This dream is probably much older than the digital computer, but according to Jan Thiele’s brief account of the history of ABMs that begins his recent paper, R Marries NetLogo: Introduction to the RNetLogo Package in the Journal of Statistical Software, academic work with ABMs didn’t really take off until the late 1990s. Now, people are using ABMs for serious studies in economics, sociology, ecology, socio-psychology, anthropology, marketing and many other fields. No less of a complexity scientist than Doyne Farmer (of Dynamic Systems and Prediction Company fame) has argued in Nature for using ABMs to model the complexity of the US economy, and has published on using ABMs to drive investment models. In the following clip of a 2006 interview, Doyne talks about building ABMs to explain the role of subprime mortgages in the Housing Crisis. (Note that when asked about how one would calibrate such a model, Doyne explains the need to collect massive amounts of data on individuals.)
Via Claudia Mihai