Statistical tools for data analysis and visualization.
Regression analysis consists of a set of machine learning methods that allow us to predict a continuous outcome variable (y) based on the value of one or multiple predictor variables (x). Briefly, the goal of a regression model is to build a mathematical equation that defines y as a function of the x variables. This equation can then be used to predict the outcome (y) from new values of the predictor variables (x).
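For a single predictor, the "mathematical equation" the excerpt describes is just a line fitted by ordinary least squares, which has a closed-form solution. A minimal sketch in Python, on made-up data (the values are purely illustrative):

```python
# Simple linear regression: fit y = b0 + b1*x by least squares.
def fit_simple_regression(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope: covariance of x and y divided by variance of x
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x  # intercept
    return b0, b1

x = [1, 2, 3, 4, 5]                     # predictor (made up)
y = [2.1, 4.0, 6.2, 7.9, 10.1]          # outcome (made up)
b0, b1 = fit_simple_regression(x, y)
prediction = b0 + b1 * 6                # predict y for a new x value
```

With multiple predictors the same idea generalizes to a matrix solution, which is what R's `lm` or scikit-learn's `LinearRegression` compute for you.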
In today's digital age, data comes in many forms. Many of the more common file types, like CSV, XLSX, and plain text (TXT), are easy to access and manage. Yet sometimes the data we need is locked away in a less accessible format, such as a PDF. If you have ever found yourself in this dilemma, fret not: pdftools has you covered. In this post, you will learn how to use pdftools to extract text from a PDF, use the stringr package to manipulate strings of text, and create a tidy data set.
Previously, I wrote a blog post on machine learning in R with the caret package. In this post, I will use the scikit-learn library in Python. As in the R post, we will predict power output given a set of environmental readings from various sensors in a natural gas-fired power generation plant.
The autoplotly package is an extension built on top of ggplot2, plotly, and ggfortify that automatically generates interactive visualizations, in plotly and ggplot2 styles, for many popular statistical results supported by the ggfortify package. The generated visualizations can also be easily extended using ggplot2 and plotly syntax while remaining interactive.
We are proud to announce the beta release series of JupyterLab, the next-generation web-based interface for Project Jupyter. Project Jupyter exists to develop open-source software, open standards, and services for interactive and reproducible computing.
Since 2011, the Jupyter Notebook has been our flagship project for creating reproducible computational narratives. The Jupyter Notebook enables users to create and share documents that combine live code with narrative text, mathematical equations, visualizations, interactive controls, and other rich output. It also provides building blocks for interactive computing with data: a file browser, terminals, and a text editor.
I was working on monthly power demand in the Indian state of Telangana and used the Holt-Winters methodology in R to produce forecasts. The data, taken from the CEA website, starts in June 2014, when the state of Telangana was formed, so data is available only from that point onward.
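The post uses R's Holt-Winters implementation; the simplest member of that family, simple exponential smoothing, can be sketched in a few lines to show the core idea (the full Holt-Winters method adds trend and seasonal components on top of this). The series below is made up, not the Telangana demand data:

```python
# Simple exponential smoothing: the level is an exponentially weighted
# average of past observations; alpha controls how fast old data is forgotten.
def simple_exp_smoothing(series, alpha):
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level  # the one-step-ahead forecast

forecast = simple_exp_smoothing([10, 12, 11, 13, 12, 14], alpha=0.5)
```

R's `HoltWinters()` extends this recursion with separate smoothing equations for trend and seasonality, which is what makes it suitable for monthly demand data with an annual pattern.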
Time series data are data points collected sequentially over a period of time at regular intervals. Time series analysis means analyzing the available data to find the pattern or trend in it, in order to predict future values, which in turn supports more effective and better-optimized business decisions.
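One of the simplest ways to expose the trend the excerpt mentions is a moving average, which smooths out short-term fluctuations. A minimal sketch on a made-up series:

```python
# Trailing moving average: each output value is the mean of the last
# `window` observations, which smooths noise and reveals the trend.
def moving_average(series, window):
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

series = [3, 4, 5, 4, 6, 7, 6, 8]   # made-up observations over time
trend = moving_average(series, window=3)
```

Note that the smoothed series is shorter than the original by `window - 1` points, since the first full window only closes at the third observation.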
Imagine you were to buy a car. Would you just go to a store and buy the first one you see? No, right? You usually consult a few people around you, take their opinions, add your own research, and then make the final decision. Let’s take a simpler scenario: whenever you go to a movie, don’t you ask your friends for reviews (unless, of course, it stars one of your favorite actresses)?
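That "ask several friends and go with the majority" intuition is exactly how ensemble classifiers like random forests combine their individual trees. A minimal sketch of the voting step, with made-up votes standing in for tree predictions:

```python
from collections import Counter

def majority_vote(predictions):
    # The ensemble's answer is whatever most individual models predicted.
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical "friends" (weak models) reviewing a movie:
votes = ["good", "bad", "good"]
decision = majority_vote(votes)
```

A real random forest adds two twists: each tree is trained on a bootstrap sample of the data, and each split considers only a random subset of features, so the "friends" give genuinely independent opinions.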
A classification algorithm defines a set of rules to identify the category or group of an observation. There are various classification algorithms available, such as logistic regression, LDA, QDA, random forest, and SVM. Here I am going to discuss logistic regression, LDA, and QDA. A classification model is evaluated with a confusion matrix: a table of predicted true/false values against actual true/false values.
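The confusion matrix the excerpt describes has four cells; tallying them directly makes the definitions concrete. A minimal sketch on made-up labels:

```python
# Count the four cells of a binary confusion matrix:
# TP = predicted 1, actually 1;  FP = predicted 1, actually 0;
# FN = predicted 0, actually 1;  TN = predicted 0, actually 0.
def confusion_counts(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn

actual    = [1, 0, 1, 1, 0]   # made-up true labels
predicted = [1, 0, 0, 1, 0]   # made-up model output
tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
```

Metrics like precision (tp / (tp + fp)) and recall (tp / (tp + fn)) fall straight out of the same four counts.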
Makridakis Competitions (also known as the M Competitions or M-Competitions) are a series of competitions organized by teams led by forecasting researcher Spyros Makridakis, intended to evaluate and compare the accuracy of different forecasting methods. So far, three competitions have taken place: M1 (1982), M2 (1993), and M3 (2000). The fourth competition, M4, is set to take place very soon, in 2018.
The year is coming to an end. I did not write nearly as much as I had planned to. But I’m hoping to change that next year, with more tutorials around Reinforcement Learning, Evolution, and Bayesian Methods coming to WildML! And what better way to start than with a summary of all the amazing things that happened in 2017? Looking back through my Twitter history and the WildML newsletter, the following topics repeatedly came up. I’ll inevitably miss some important milestones, so please let me know in the comments!
An overview of simple outlier detection methods and their combination using the dplyr and ruler packages. One of the most crucial steps in data analysis is to identify and account for outliers: observations whose nature differs essentially from that of most other observations. Their presence can lead to untrustworthy conclusions. The most complicated part of this task is defining the notion of an “outlier”; after that, it is straightforward to identify outliers in given data. Many techniques have been developed for outlier detection, the majority of them for numerical data. This post describes the most basic ones, along with their application using the dplyr and ruler packages.
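The post itself works in R with dplyr and ruler; the most basic numerical rule it alludes to, the z-score rule, can be sketched language-agnostically in a few lines of Python (the data and threshold below are made up for illustration):

```python
# Z-score rule: flag points more than `threshold` standard deviations
# from the mean. This encodes one concrete "notion of outlier".
def zscore_outliers(values, threshold=3.0):
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) / sd > threshold]

data = [10, 11, 9, 10, 12, 10, 50]   # made-up sample with one suspicious point
flagged = zscore_outliers(data, threshold=2.0)
```

Note how much the result depends on the threshold: with the conventional threshold of 3, this same sample yields no outliers at all, because the extreme point inflates the standard deviation it is measured against.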
Machine Learning and Regression. Machine Learning (ML) is a field of study that gives a machine the capability to understand data and to learn from it. ML is not only about analytics modeling; it is end-to-end modeling that broadly involves the following steps: defining the problem statement; collecting data; exploring, cleaning, and transforming the data; building the analytics model; and creating a dashboard and deploying the model.

Time series data mining in R. In this tutorial, I will show you one use case for applying time series representations effectively: clustering time series, specifically clustering consumers by their electricity load. By clustering electricity consumers, we can extract typical load profiles, improve the accuracy of subsequent electricity consumption forecasting, detect anomalies, or monitor a whole smart grid (a grid of consumers) (Laurinec et al. (2016), Laurinec and Lucká (2016)). I will show you the first use case: the extraction of typical electricity load profiles with the K-medoids clustering method.
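K-medoids works like K-means, except the cluster centers must be actual observations, so each "typical load profile" is a real consumer's profile rather than a synthetic average. The tutorial does this in R on full load curves; here is a deliberately tiny 1-D Python sketch of the algorithm on made-up load values (real profiles would be vectors compared with a proper distance):

```python
# Minimal K-medoids: alternate between assigning points to the nearest
# medoid and choosing, within each cluster, the member that minimizes
# total distance to its cluster-mates.
def kmedoids(points, k, iters=10):
    medoids = points[:k]  # naive initialization for the sketch
    for _ in range(iters):
        clusters = {m: [] for m in medoids}
        for p in points:
            nearest = min(medoids, key=lambda m: abs(p - m))
            clusters[nearest].append(p)
        medoids = [min(c, key=lambda cand: sum(abs(cand - q) for q in c))
                   for c in clusters.values()]
    return sorted(medoids)

load = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]   # made-up daily peak loads, two groups
typical = kmedoids(load, k=2)
```

The returned medoids are members of the input data, which is exactly why the method yields interpretable "typical" profiles.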
Extreme Gradient Boosting is among the hottest libraries in supervised machine learning these days. It supports various objective functions, including regression, classification, and ranking. It has gained much popularity and attention recently as it was the algorithm of choice for many winning teams of a number of machine learning competitions. Previously I showed how to do Extreme Gradient Boosting with R and in this post, I will show how to do it with Python.
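The core idea behind gradient boosting, which XGBoost implements, is an additive model: each round fits a weak learner to the residuals of the current ensemble and adds a shrunken copy of it. Below is a deliberately trivial sketch of that loop, with the weak learner replaced by the mean of the residuals (a stand-in for the regression trees XGBoost actually fits); data and learning rate are made up:

```python
# Gradient boosting skeleton: repeatedly fit a weak learner to the
# residuals and add it to the ensemble with a learning-rate shrinkage.
def boost(y, rounds=10, learning_rate=0.5):
    pred = [0.0] * len(y)
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        # Trivial weak learner: a constant (the residual mean). With this
        # learner every prediction converges toward the overall mean of y.
        learner_output = sum(residuals) / len(residuals)
        pred = [pi + learning_rate * learner_output for pi in pred]
    return pred

pred = boost([1.0, 2.0, 3.0])
```

Swapping the constant learner for shallow decision trees, and adding regularization and clever split-finding, is essentially what turns this skeleton into XGBoost.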
There are many different methods for identifying outliers, and a lot of them are available in R. But are outliers a matter of opinion? Do all methods give the same results? Articles on outlier methods use a mixture of theory and practice. Theory is all very well, but outliers are outliers because they don’t follow theory. Practice involves testing methods on data, sometimes with data simulated based on theory, but better with ‘real’ datasets. A method can be considered successful if it finds the outliers we all agree on, but do we all agree on which cases are outliers?
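The "do all methods give the same results?" question is easy to answer by experiment: two standard rules applied to the same made-up sample can disagree. A small Python sketch (the post works in R, but the rules are language-agnostic):

```python
# Rule 1: z-score rule (flag points > `threshold` standard deviations from the mean).
def zscore_outliers(values, threshold=3.0):
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [v for v in values if abs(v - mean) / sd > threshold]

# Rule 2: boxplot/IQR rule (flag points beyond Q1 - k*IQR or Q3 + k*IQR).
def iqr_outliers(values, k=1.5):
    s = sorted(values)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]   # crude quartiles, fine for a sketch
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < lo or v > hi]

data = [10, 11, 9, 10, 12, 10, 50]   # made-up sample
# On this data, the 3-sigma rule finds nothing (the extreme point inflates
# the standard deviation), while the boxplot rule flags 50.
```

So the answer is no: even these two textbook methods disagree on a seven-point sample, which is the post's point in miniature.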
Principal Component Analysis (PCA) is an unsupervised learning technique used to reduce the dimensionality of data with minimal loss of information. PCA is used in applications like face recognition and image compression.
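At its core, PCA is an eigendecomposition of the covariance matrix of centered data: the eigenvectors give the new axes, and each eigenvalue's share of the total tells you how much variance that axis preserves. A minimal NumPy sketch on a small made-up 2-D dataset:

```python
import numpy as np

# Made-up 2-D data with strongly correlated columns.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

Xc = X - X.mean(axis=0)               # center each column
cov = np.cov(Xc.T)                    # 2x2 covariance matrix
vals, vecs = np.linalg.eigh(cov)      # eigenvalues ascending for symmetric matrices
order = np.argsort(vals)[::-1]        # sort components by variance, descending
explained = vals[order] / vals.sum()  # fraction of variance per component
Z = Xc @ vecs[:, order[:1]]           # project onto the first principal component
```

Because the two columns are highly correlated, the first component captures well over 90% of the variance, so reducing from two dimensions to one loses very little information, which is the whole point.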
This post is a follow-up to my earlier post, Deep Learning from First Principles in Python, R and Octave - Part 1. In the first part, I implemented logistic regression, in vectorized Python, R, and Octave, with a wannabe neural network (a neural network with no hidden layers). In this second part, I implement a regular, but somewhat primitive, neural network (a neural network with just one hidden layer). The second part implements classification of manually created datasets, where the different clusters of the two classes are not linearly separable.
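The forward pass of such a one-hidden-layer network is two matrix multiplications with a nonlinearity in between; that nonlinearity is what lets it separate clusters a plain logistic regression cannot. A minimal NumPy sketch with random weights (shapes and activations are common choices, not necessarily the post's exact ones):

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    Z1 = X @ W1 + b1
    A1 = np.tanh(Z1)               # hidden layer: tanh activation
    Z2 = A1 @ W2 + b2
    A2 = 1 / (1 + np.exp(-Z2))     # output layer: sigmoid, for binary classes
    return A2                      # probabilities in (0, 1)

np.random.seed(0)                  # reproducible random weights
X = np.random.randn(4, 2)          # 4 samples, 2 input features
W1 = np.random.randn(2, 3); b1 = np.zeros(3)   # 3 hidden units
W2 = np.random.randn(3, 1); b2 = np.zeros(1)   # 1 output unit
probs = forward(X, W1, b1, W2, b2)
```

Training then consists of backpropagating the loss gradient through these same two layers to update W1, b1, W2, b2, which is what the post implements in all three languages.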
Integrating R Notebooks and R Shiny with Tableau enables us to take advantage of the various statistical analysis and machine learning packages in R. In this short blog post, we will see how to integrate Tableau with R through R Notebooks and Shiny. This approach lets us include descriptive, inferential, and predictive analytics in our Tableau story/dashboard. The data I am using is reading-test data from the Programme for International Student Assessment (PISA).
In machine learning, there’s something called the “No Free Lunch” theorem. In a nutshell, it states that no one algorithm works best for every problem, and it’s especially relevant for supervised learning (i.e. predictive modeling). For example, you can’t say that neural networks are always better than decision trees, or vice versa. There are many factors at play, such as the size and structure of your dataset. As a result, you should try many different algorithms for your problem, while using a holdout “test set” of data to evaluate performance and select the winner.
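The "try many algorithms, score them on held-out data, keep the winner" procedure is simple to make concrete. A minimal Python sketch, using two deliberately trivial stand-in classifiers and made-up data:

```python
# Holdout model selection: score every candidate on data it never saw,
# then pick the one with the best holdout accuracy.
def holdout_select(models, X, y, test_frac=0.5):
    n_test = int(len(X) * test_frac)
    X_test, y_test = X[-n_test:], y[-n_test:]   # hold out the tail of the data
    scores = {name: sum(m(x) == t for x, t in zip(X_test, y_test)) / n_test
              for name, m in models.items()}
    winner = max(scores, key=scores.get)
    return winner, scores

# Two hypothetical "algorithms" (real candidates would be trained models):
models = {"always_one": lambda x: 1,
          "threshold":  lambda x: int(x > 0.5)}
X = [0.1, 0.4, 0.6, 0.9, 0.2, 0.8]   # made-up feature values
y = [0, 0, 1, 1, 0, 1]               # made-up labels
winner, scores = holdout_select(models, X, y)
```

In practice you would shuffle before splitting (or use cross-validation), but the principle is the same: the winner is chosen on data it was never fitted to.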
Let’s admit it: the whole world has been going crazy over Bitcoin. Bitcoin (BTC), the first cryptocurrency (in fact, the first digital currency to solve the double-spend problem), introduced by Satoshi Nakamoto, has become bigger than well-established firms (and even a few countries). So, a lot of Bitcoin enthusiasts and investors are looking to keep track of its daily price to better read the market and make moves accordingly. This tutorial helps an R user build his/her own daily Bitcoin price tracker using three packages: Coindeskr, Shiny, and Dygraphs.
Google recently launched its internal tool for collaborating on data science code. The project, called Google Colaboratory (g.co/colab), is based on the Jupyter open-source project and is integrated with Google Drive. Colaboratory allows users to work on Jupyter notebooks as easily as working on Google Docs or Sheets.
“You don’t perceive objects as they are. You perceive them as you are.” “Your interpretation of physical objects has everything to do with the historical trajectory of your brain, and little to do with the objects themselves.” “The brain generates its own reality, even before it receives information coming in from the eyes and the other senses. This is known as the internal model.”
Now that Twitter allows 280 characters, the code for some drawings I have made can fit in a tweet. In this post I have compiled a few of them. The first one is a cardioid inspired by string art.
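The classic string-art cardioid construction is short in any language: place n points on a circle and draw a chord from point i to point 2i (mod n); the chords' envelope traces a cardioid. The post's R one-liner is not reproduced here, but the same construction sketched in Python:

```python
import math

n = 200  # number of points on the circle

def point(k):
    # k-th of n equally spaced points on the unit circle
    angle = 2 * math.pi * k / n
    return (math.cos(angle), math.sin(angle))

# String-art cardioid: chord from point i to point 2i (mod n).
segments = [(point(i), point((2 * i) % n)) for i in range(n)]
```

Feeding `segments` to any plotting library draws the figure; replacing the multiplier 2 with 3 or 4 gives the nephroid and its higher-order cousins.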
We used AI to search for planets in NASA Kepler data, and found two new planets, and the first 8planet solar system outside of our own, in the process. For thousands of years, people have looked up at the stars, recorded observations, and noticed patterns. Some of the first objects early astronomers identified were planets, which the Greeks called “planētai,” or “wanderers,” for their seemingly irregular movement through the night sky. Centuries of study helped people understand that the Earth and other planets in our solar system orbit the sun—a star like many others. Today, with the help of technologies like telescope optics, space flight, digital cameras, and computers, it’s possible for us to extend our understanding beyond our own sun and detect planets around other stars. Studying these planets—called exoplanets—helps us explore some of our deepest human inquiries about the universe. What else is out there? Are there other planets and solar systems like our own?
