Things about R
Articles about the R Project for Statistical Computing
Scooped by Roberto Rösler

Multivariate Forecasting in Tableau with R

Since version 8.0 it is very easy to generate forecasts in Tableau using exponential smoothing. But in some cases you may want to enrich your forecasts with external variables. For example you may have the government’s forecast for population growth, your own hiring plans, upcoming holidays, planned marketing activities… which could all have varying levels…

Intro to the data.table Package

R provides a helpful data structure called the “data frame” that gives the user an intuitive way to organize, view, and access data. Many of the functions that you would use to read in external files (e.g. read.csv) or connect to databases (RMySQL) will return a data frame structure by default. While there are other important data structures, such as the vector, list and matrix, the data frame winds up being at the heart of many operations, not the least of which is aggregation.
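To make the aggregation point concrete, here is a minimal base R sketch. The `sales` data frame and its column names are invented for illustration and do not come from the linked article.

```r
# Toy data frame; the column names are assumptions made for this example.
sales <- data.frame(
  region = c("north", "south", "north", "south", "north"),
  units  = c(10, 4, 7, 6, 3)
)

# Aggregate: total units per region, using base R's aggregate().
totals <- aggregate(units ~ region, data = sales, FUN = sum)
print(totals)
```

With data.table loaded and `sales` converted via `as.data.table()`, the same grouping would typically be written as `sales[, .(units = sum(units)), by = region]`.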

Roberto Rösler's insight:
Personally, I prefer the dplyr stack (dplyr, tidyr, magrittr), but if you are working on a performance edge case, you should give data.table a try (the code is much faster and uses less memory).

Missing Value Treatment

Missing values in data are a common phenomenon in real-world problems. Knowing how to handle missing values effectively is a required step to reduce bias and to produce powerful...
Roberto Rösler's insight:
I found the explanation a little sloppy, but I like the R examples for each missing-value approach.

Improving Adaboosting with decision stumps in R

Adaboosting is proven to be one of the most effective class prediction algorithms. It mainly consists of an ensemble of simpler models (known as “weak learners”) that, although not very effective individually, perform very well when combined. The process by which these weak learners are combined is, however, more complex than simply averaging results.

Roberto Rösler's insight:
If you look up machine learning on the internet these days, it sometimes looks as if there is no method other than something based on deep learning. But this blog post gives a nice explanation of a very powerful ensemble learning method (boosting), modified to also work with larger datasets.

Bootstrap Evaluation of Clusters

The goal of cluster analysis is to group the observations in the data into clusters such that every datum in a cluster is more similar to other datums in the same cluster than it is to datums in other clusters. This is an analysis method of choice when annotated training data is not readily available. In this article, based on chapter 8 of Practical Data Science with R, the authors discuss one approach to evaluating the clusters that are discovered by a chosen clustering method.

Roberto Rösler's insight:

Also relates to http://www.win-vector.com/blog/2016/02/finding-the-k-in-k-means-by-parametric-bootstrap/


ggplot2 2.0.0

I’m very pleased to announce the release of ggplot2 2.0.0. I know I promised that there wouldn’t be any more updates, but while working on the 2nd edition of the ggplot2 book, I just couldn’t stop myself from fixing some long standing problems. On the scale of ggplot2 releases, this one is huge with over one hundred fixes and improvements. This might break some of your existing code (although I’ve tried to minimise breakage as much as possible), but I hope the new features make up for any short term hassle. 

Turning numbers into stories: Accessing APIs from R (and a little R programming)

APIs are the driving force behind data mash-ups. It is APIs that allow machines to access data programmatically – that is automatically from within a program – to make use of API provided functionalities and data. Without APIs much of today’s Web 2.0, Apps and data applications would be outright impossible.

This post is about using APIs with R. As an example, we’ll use the EU’s EurLex database API as provided by Buhl Rassmussen. This API is a good example of the APIs you might find in the wild. Of course, there are the APIs of large vendors, like Google or Facebook, that are well thought out and well documented. But then there is the vast majority of smaller APIs for special applications that often lack structure or documentation. Nevertheless, these APIs often provide access to valuable resources.

Roberto Rösler's insight:

Nice article with a good introduction to APIs in general. Also worth reading: the quickstart vignette of the httr package itself at https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html.


Why bother with magrittr

I’ve seen R users swooning over the magrittr package for a while now, but I couldn’t make heads or tails of all these scary %>% symbols. Finally I had time for a closer look, and it seems potentially handy indeed. Here’s the idea and a simple toy example.
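As a rough illustration of the piping idea, the sketch below uses base R's native pipe `|>` (available from R 4.1) as a stand-in for magrittr's `%>%`, so no package is needed. It is not the toy example from the linked post, and the vector `x` is made up.

```r
x <- c(1, 4, 9, 16)

# Nested calls have to be read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# The same computation as a left-to-right pipeline.
# Base R's native pipe is used here; with magrittr loaded, %>% could replace |>.
piped <- x |> sqrt() |> mean() |> round(1)

print(nested)
print(piped)
```

Both forms compute the same value; the piped version simply lists the steps in the order they are applied.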

Roberto Rösler's insight:

The magrittr package is a huge step forward for data analysis in R, especially data preparation. The code becomes much more structured, with a clear and readable workflow. This post gives a nice example ...


10 Top Tips For Becoming A Better Coder!

We polled our consultants for their tips on how to be more effective at writing R code; and here are the top 10! 10. Don't pre-emptively tidy. Spending time on formatting as you go along gives you a cause to procrastinate, plus it's hard to make things consistent. Instead of worrying about it, let a


Machine Master: RuleFit: When disassembled trees meet Lasso

The RuleFit algorithm from Friedman and Popescu is an interesting regression and classification approach that uses decision rules in a linear model.

RuleFit is not a completely new idea, but it combines a bunch of algorithms in a clever way. RuleFit consists of two components: The first component produces "rules" and the second component fits a linear model with these rules as input (hence the name "RuleFit"). The cool thing about the algorithm is that the produced model is highly interpretable, because the decision rules have an easily understandable format, but you still have a flexible enough approach to capture complex interactions and get a good fit.


7+ ways to plot dendrograms in R | Gaston Sanchez

Today we are going to talk about the wide spectrum of functions and methods that we can use to visualize dendrograms in R. You can check an extended version of this post with the complete reproducible code in R in this Rpub.

A quick reminder: a dendrogram (from Greek dendron=tree, and gramma=drawing) is nothing more than a tree diagram that practitioners use to depict the arrangement of the clusters produced by hierarchical clustering.
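For orientation, the most basic route from data to a dendrogram in base R looks roughly like this. The choice of the built-in `mtcars` data, the first ten rows, and average linkage are assumptions made for the sketch, not taken from the linked post.

```r
# Hierarchical clustering on a small, scaled subset of the built-in mtcars data.
hc <- hclust(dist(scale(mtcars[1:10, ])), method = "average")

# Convert to a dendrogram object, which offers more plotting options than hclust.
dend <- as.dendrogram(hc)

# Horizontal layout leaves more room for long leaf labels.
plot(dend, horiz = TRUE)
```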

Roberto Rösler's insight:

I have just been searching for some tutorial on ways to visualize results from hierarchical clustering in R. This posting was really helpful. Especially method 6) is well suited for plotting large labels (rules in my case).


A million ways to connect R and Excel

In quantitative finance both R and Excel are the basic tools for any type of analysis. Whenever one has to use Excel in conjunction with R, there are many ways to approach the problem and many solutions. It depends on what you really want to do and the size of the dataset you’re dealing with. I list some possible connections in the table below.

Roberto Rösler's insight:

Good overview of how to interact with Excel.


Working with Sessionized Data 2: Variable Selection

In our previous post in this series, we introduced sessionization, or converting log data into a form that’s suitable for analysis. We looked at basic considerations, like dealing with time, choosing an appropriate dataset for training models, and choosing appropriate (and achievable) business goals. In that previous example, we sessionized the data by considering all possible aggregations (window widths) of the data as features. Such naive sessionization can quickly lead to very wide data sets, with potentially more features than you have datums (and collinear features, as well). In this post, we will use the same example, but try to select our features more intelligently.
Roberto Rösler's insight:

Second article about working with sessionized data - nice recap of feature selection techniques in this context.


Simulating from the Bivariate Normal Distribution in R

Here are five different ways to simulate random samples from a bivariate Normal distribution with a given mean and covariance matrix.
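One standard way to do this in base R is via the Cholesky factor of the covariance matrix; whether it is among the five in the linked post is not stated in the excerpt. The mean vector, covariance matrix, sample size and seed below are arbitrary values chosen for the sketch.

```r
set.seed(42)  # arbitrary seed so the run is reproducible

mu    <- c(1, -2)                          # assumed mean vector
Sigma <- matrix(c(2, 0.8, 0.8, 1), 2, 2)   # assumed covariance matrix
n     <- 10000

# chol() returns an upper-triangular U with t(U) %*% U equal to Sigma.
U <- chol(Sigma)

# Each row of Z is a pair of independent standard normals;
# the rows of Z %*% U then have covariance Sigma.
Z <- matrix(rnorm(2 * n), nrow = n, ncol = 2)
X <- sweep(Z %*% U, 2, mu, "+")   # shift every row by the mean vector

print(colMeans(X))  # should be close to mu
print(cov(X))       # should be close to Sigma
```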

Profiling with RStudio and profvis

“How can I make my code faster?” If you write R code, then you’ve probably asked yourself this question. A profiler is an important tool for doing this: it records how the computer spends its time, and once you know that, you can focus on the slow parts to make them faster. The preview releases of RStudio now have integrated support for profiling R code and for visualizing profiling data. R itself has long had a built-in profiler, and now it’s easier than ever to use the profiler and interpret the results.


Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow

This past January, we (Hadley and Wes) met and discussed some of the systems challenges facing the Python and R open source communities. In particular, we wanted to see if there were some opportunities to collaborate on tools for improving interoperability between Python, R, and external compute and storage systems.


One thing that struck us was that while R’s data frames and Python’s pandas data frames utilize very different internal memory representations, they share a very similar semantic model. In both R and pandas, data frames are lists of named, equal-length columns, which can be numeric, boolean, date-and-time, categorical (factors), or string. Every column can have missing values.
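On the R side, this semantic model can be inspected directly in base R. The data frame below is invented for illustration, with one column per type named above.

```r
# A made-up data frame with one column of each kind mentioned above.
df <- data.frame(
  id     = 1:3,
  score  = c(2.5, NA, 4.0),
  passed = c(TRUE, FALSE, NA),
  when   = as.Date(c("2016-03-01", "2016-03-02", NA)),
  group  = factor(c("a", "b", "a")),
  label  = c("x", NA, "z"),
  stringsAsFactors = FALSE
)

print(is.list(df))                                  # a data frame is a list ...
print(names(df))                                    # ... of named columns ...
print(lengths(df))                                  # ... that all have the same length
print(sapply(df, function(col) class(col)[1]))      # one type per column
print(colSums(is.na(df)))                           # missing values per column
```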


Around this time, the open source community had just started the new Apache Arrow project, designed to improve data interoperability for systems dealing with columnar tabular data.


In discussing Apache Arrow in the context of Python and R, we wanted to see if we could use the insights from feather to design a very fast file format for storing data frames that could be used by both languages. Thus, the Feather format was born.


Computing Classification Evaluation Metrics in R

Evaluation metrics are the key to understanding how your classification model performs when applied to a test dataset. In what follows, we present a tutorial on how to compute common metrics that are often used in evaluation, in addition to metrics generated from random classifiers, which help in justifying the value added by your predictive model, especially in cases where the common metrics suggest otherwise. Contents: Creating the Confusion Matrix; Accuracy; Per-class Precision, Recall, and F-1; Macro-averaged Metrics; One-vs-all Matrices; Average Accuracy; Micro-averaged Metrics; Evaluation on Highly Imbalanced Datasets; Majority-class Metrics; Random-guess...
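A few of the listed metrics can be sketched in base R as follows. The `actual` and `predicted` labels are made up, and the code is a simplified illustration rather than the tutorial's own.

```r
# Made-up true and predicted labels for a three-class problem.
lv        <- c("a", "b", "c")
actual    <- factor(c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c"), levels = lv)
predicted <- factor(c("a", "a", "b", "b", "c", "c", "c", "c", "a", "c"), levels = lv)

# Confusion matrix: rows are actual classes, columns are predicted classes.
cm <- table(Actual = actual, Predicted = predicted)

accuracy  <- sum(diag(cm)) / sum(cm)      # share of correct predictions
precision <- diag(cm) / colSums(cm)       # per class: correct / predicted as that class
recall    <- diag(cm) / rowSums(cm)       # per class: correct / actually that class
f1        <- 2 * precision * recall / (precision + recall)
macro_f1  <- mean(f1)                     # unweighted mean over classes

print(cm)
print(round(rbind(precision, recall, f1), 3))
print(c(accuracy = accuracy, macro_f1 = macro_f1))
```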

Roberto Rösler's insight:
Great recap of evaluation metrics for classification. Most of them are part of the mlr library in R, which gives easy access to them (https://mlr-org.github.io/mlr-tutorial/release/html/performance/index.html) plus all other parts of data analysis (data prep, modeling, tuning, ...).

Matching of Pandas (Python) to dplyr (R) for data wrangling

Roberto Rösler's insight:
 

This Jupyter notebook compares pandas and dplyr. The comparison is just on syntax (verbiage), not performance. Whether you're an R user looking to switch to pandas (or the other way around), I hope this guide will help ease the transition.

We'll work through the introductory dplyr vignette to analyze some flight data.


Fun with ddR: Using Distributed Data Structures in R

A few weeks ago, we revealed ddR (Distributed Data-structures in R), an exciting new project started by R-Core, Hewlett Packard Enterprise, and others that provides a fresh new set of computational primitives for distributed and parallel computing in R. The package sets the seed for what may become a standardized and easy way to write parallel algorithms in R, regardless of the computational engine of choice. In designing ddR, we wanted to keep things simple and familiar. We expose only a small number of new user functions that are very close...

Roberto Rösler's insight:

Looking forward to when the Spark backend becomes available ...


Deep Learning with MXNetR

Deep learning has been an active field of research for some years, with breakthroughs in image and language understanding, among other areas. However, there has not yet been a good deep learning package in R that offers state-of-art deep learning models and the real GPU support to do fast training on these models.

In this post, we introduce MXNetR, an R package that brings fast GPU computation and state-of-art deep learning to the R community. MXNet allows you to flexibly configure state-of-art deep learning models backed by the fast CPU and GPU back-end. This post will cover the following topics:

Train your first neural network in five minutes; use MXNet for Handwritten Digits Classification Competition; classify real world images using state-of-art deep learning models.
Roberto Rösler's insight:

I'm currently using H2O for deep learning tasks in R, but this looks like a possibly mature alternative. Sadly, there is no real "quick start" like in H2O (http://h2o-release.s3.amazonaws.com/h2o/rel-tibshirani/3/index.html#R) because the installation looks much more complicated on a Windows system.


Introducing Distributed Data-structures in R

Due to R’s popularity as a data mining tool, many Big Data systems expose an R based interface to users. However, these interfaces are custom, non-standard, and difficult to learn.

Earlier in the year, we hosted a workshop on distributed computing in R. You can read about the event here. A brief summary of the workshop is: well-known R contributors from industry, academia, and R-core members discussed whether we can standardize the interface for distributed computing. It should encourage people to write portable distributed applications in R.

Roberto Rösler's insight:

I would like to see H2O as a possible backend ...


Integrating Python and R into a Data Analysis Pipeline – Part 1

This post kicks everything off by:

covering the reasons why you may want to include both languages in a pipeline; introducing ways of running R and Python from the command line; and showing how you can accept inputs as arguments and write outputs to various file formats.
Roberto Rösler's insight:

Running R from the command line is something I don't do that often. But this post presents a good and easy tutorial on it (including passing parameters).
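The R half of such a pipeline might look roughly like the script below. The file name `col_means.R`, the helper `col_means()` and the two-argument convention are hypothetical choices for this sketch, not taken from the linked post.

```r
# Hypothetical script, e.g. saved as col_means.R and run from a shell as:
#   Rscript col_means.R input.csv output.csv

# Read a CSV, compute the mean of every numeric column, write the result as CSV.
col_means <- function(in_path, out_path) {
  dat   <- read.csv(in_path)
  num   <- dat[sapply(dat, is.numeric)]   # keep numeric columns only
  means <- data.frame(
    column = names(num),
    mean   = unname(colMeans(num, na.rm = TRUE))
  )
  write.csv(means, out_path, row.names = FALSE)
  invisible(means)
}

# commandArgs(trailingOnly = TRUE) returns only the arguments after the script name.
args <- commandArgs(trailingOnly = TRUE)
if (length(args) == 2) {
  col_means(args[1], args[2])
} else {
  message("usage: Rscript col_means.R <input.csv> <output.csv>")
}
```

Keeping the work in a function and the argument handling at the bottom means the same file can also be sourced and tested from an interactive session.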


Cheatsheets - Shiny, ggplot2, dplyr, ...

The cheat sheets below make it easy to learn about and use some of our favorite packages. From time to time, we will add new cheat sheets to the gallery.

Roberto Rösler's insight:

Incredibly useful resource when working with the corresponding packages ...


Vtreat: designing a package for variable treatment

When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again:

Missing values (NA or blanks); problematic numerical values (Inf, NaN, sentinel values like 999999999 or -1); valid categorical levels that don’t appear in the training data (especially when there are rare levels, or a large number of levels); invalid values.
Roberto Rösler's insight:

This package contains a lot of features for data preprocessing, especially for handling missing values, rare values (e.g. a factor level that appears just a couple of times), and so on.

I like the idea that you first generate a kind of template for the data transformation, which can afterwards be applied to new data.
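The design-then-apply idea can be sketched in a few lines of base R. The functions `design_plan()` and `apply_plan()` below are hypothetical stand-ins written for this note; they are not vtreat's API and cover only a small part of what the package does.

```r
# Learn a simple treatment "template" from training data:
# per-column means for numeric columns, known levels for categorical columns.
design_plan <- function(train) {
  num_cols <- names(train)[sapply(train, is.numeric)]
  cat_cols <- names(train)[sapply(train, function(x) is.character(x) || is.factor(x))]
  list(
    means  = lapply(train[num_cols], function(x) mean(x[is.finite(x)])),
    levels = lapply(train[cat_cols], function(x) unique(as.character(x[!is.na(x)])))
  )
}

# Apply the stored template to any data frame with the same columns.
apply_plan <- function(plan, data) {
  for (col in names(plan$means)) {
    bad <- !is.finite(data[[col]])            # TRUE for NA, NaN, Inf and -Inf
    data[[paste0(col, "_isBAD")]] <- bad      # keep an indicator of what was replaced
    data[[col]][bad] <- plan$means[[col]]     # fill with the training mean
  }
  for (col in names(plan$levels)) {
    x <- as.character(data[[col]])
    x[is.na(x) | !(x %in% plan$levels[[col]])] <- "other"   # unseen or missing level
    data[[col]] <- x
  }
  data
}

train <- data.frame(x = c(1, NA, 3, Inf), g = c("a", "b", NA, "a"), stringsAsFactors = FALSE)
new   <- data.frame(x = c(NaN, 5), g = c("z", "b"), stringsAsFactors = FALSE)

plan    <- design_plan(train)   # built once, from training data only
treated <- apply_plan(plan, new)
print(treated)
```

Because the plan is computed from the training data alone, new data is treated with exactly the same replacement values and known levels.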


A Simple Intro to Bayesian Change Point Analysis

The purpose of this post is to demonstrate change point analysis by stepping through an example of the technique in R presented in Rizzo's excellent, comprehensive, and very mathy book, Statistical...
Roberto Rösler's insight:

Nice post showing change point detection from scratch using a Bayesian approach.

I also tested most of the mentioned packages in my blog post http://things-about-r.tumblr.com/post/106806522699/change-point-detection-in-time-series-with-r-and.

 
