# Skills and tools for making and presenting environmental decisions

Curated by Graham Smith

## Trace Digitising in QGIS - Lutra Consulting

Following the popularity of the AutoTrace plugin, we received a number of requests for additional features and to incorporate the tool into the main Advanced Digitising tools in QGIS. Funding for the work was generously raised with the help of the community in 2015, and the feature landed in QGIS 2.14. Read more if you’d like to learn to use these new tools.

**How to use Trace Digitising.** Check out the Trace Digitising in QGIS documentation for detailed instructions on how to use the tool.

**Key features.** The trace tool:

- uses Dijkstra’s shortest path algorithm to find traceable routes
- can trace routes over multiple distinct features
- can be used with Advanced Digitising tools (e.g. reshaping)
- can be enabled and disabled by pressing T on your keyboard while digitising
- is fast and easy to use

(Two videos in the original post demonstrate the tracing tool and the use of tracing while reshaping a polygon.)

**Sponsors.** The project was kindly sponsored by the QGIS community. Special thanks to all those who contributed towards the development of this tool: The Royal Borough of Windsor and Maidenhead, Neath Port Talbot County Borough Council, Ujaval Gandhi, Surrey Heath Borough Council, Matias Arnold, Northumberland National Park Authority, Buccleuch Estates Limited, and Countryscape.

## Mapping Abundance in Streams in R

I haven’t worked with raster or spatial polygon data in R much before, but I want to create maps to show the results of a spatiotemporal model. Specifically, I want to depict the changes in abundance of fish in a stream network over time and space. I decided to play around with various packages for manipulating spatial data, then use base plot and ggplot2 functions to make the maps. Rather than post the code and output here, it was easier to publish it to RPubs and link to it: http://rpubs.com/djhocking/155740. The scale, colors, projection, and background will have to be adjusted based on the purpose of the map, but hopefully the code and explanations will help people get started. This was intended as a learning tool for me rather than strictly as a tutorial, but I figured I would share it in case it proves useful to others. Here are some links to additional information I found useful along the way:

- Add a scale bar
- Add a north arrow and scale bar
- R GIS Tutorial
- R CRS System by NCEAS
- ggplot2 mapping in R by Zev Ross (he has tons of great info on his site!)
- ggplot2 cheat sheet
- rasterVis (haven’t tried it yet, but it looks very useful)
- Raster data in R by NEON
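As a taste of what the RPubs document covers, a minimal sketch of the basic pattern, reading a vector layer and drawing it with ggplot2, might look like the following. The shapefile path and layer name are hypothetical, and the rgdal/fortify route shown here is only one of several ways to do this:

```r
library(rgdal)    # readOGR() for reading vector data
library(ggplot2)  # fortify() and geom_path() for plotting

# Hypothetical stream-network shapefile; substitute your own data source
streams <- readOGR(dsn = "data", layer = "stream_network")

# Convert the SpatialLinesDataFrame to a plain data frame for ggplot2
streams_df <- fortify(streams)

# Draw the network; colors, projection, and scale would be adjusted
# to the purpose of the map, as the post notes
ggplot(streams_df, aes(x = long, y = lat, group = group)) +
  geom_path(colour = "steelblue") +
  coord_equal() +
  theme_minimal()
```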

## Data for teaching – real, fake, fictional


## Image attribution in presentations


## 50% Draft of Forthcoming Book Available


## QGIS Server Simple Browser Plugin

A new server plugin provides quick browsing of a QGIS project and an OpenLayers-based layers preview. Today I’m releasing the first version of the QGIS Server Simple Browser Plugin, a simple server plugin that generates a browsable table of contents of the project’s layers and a link to an OpenLayers map.

**How it works.** The plugin adds an XSL stylesheet to the GetProjectSettings XML response; the generated HTML is rendered as a browsable table of contents (screenshot in the original post).

**The openlayers format.** The map preview is generated by adding a new application/openlayers FORMAT option to GetMap requests. The generated map automatically fits to the layer’s extent and has basic GetFeatureInfo capabilities.

**Limitations.** The current version only supports EPSG:4326, which must be available (enabled) on the server.

**Source code and download.** The plugin is available on the official repository as ServerSimpleBrowser; the code is on GitHub.
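For illustration, a GetMap request using the new format option might look like the sketch below. The server path, project file, and layer name are placeholders of my own; only the FORMAT=application/openlayers option comes from the plugin’s description, and the exact parameter set your server requires may differ:

```
http://example.com/cgi-bin/qgis_mapserv.fcgi
    ?MAP=/path/to/project.qgs
    &SERVICE=WMS&VERSION=1.3.0&REQUEST=GetMap
    &LAYERS=mylayer&CRS=EPSG:4326
    &FORMAT=application/openlayers
```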

## Statistics And The Law: The Prosecutor’s Fallacy | DC's Improbable Science

This post arose from a recent meeting at the Royal Society, organised by Julie Maxton to discuss the application of statistical methods to legal problems. I found myself sitting next to an Appeal Court judge who wanted more explanation of the ideas. Here it is.

**Some preliminaries.** The paper that I wrote recently was a review of the problems associated with the interpretation of screening tests and tests of significance. It doesn’t allude to legal problems explicitly, though the problems are the same in principle. It is open access, at http://rsos.royalsocietypublishing.org/content/1/3/140216

In that paper I was interested in the false positive rate (also known as the false discovery rate) in tests of significance. It turned out to be alarmingly large, which has serious consequences for the credibility of the scientific literature. In legal terms, the false positive rate means the proportion of cases in which, on the basis of the evidence, a suspect is found guilty when in fact they are innocent. That has even more serious consequences.

Although most of what I want to say can be said without much algebra, it is worth getting two things clear before we start.

(1) To get any understanding, it’s essential to understand the rules of probability, and in particular the idea of conditional probabilities. One source is my old book, Lectures on Biostatistics (now free); the account on pages 19 to 24 gives a pretty simple (I hope) description of what’s needed. Briefly, a vertical line is read as “given”, so Prob(evidence | not guilty) means the probability that the evidence would be observed given that the suspect was not guilty.

(2) Another potential confusion in this area is the relationship between odds and probability. The relationship can be illustrated by an example. If the probability of being right-handed is 0.9, then the probability of not being right-handed is 0.1. That means that 9 people out of 10 are right-handed and 1 person in 10 is not; in other words, for every person who is not right-handed there are 9 who are. Thus the odds that a randomly selected person is right-handed are 9 to 1. In symbols,

$$\text{probability} = \frac{\text{odds}}{1 + \text{odds}}$$

In the example, the odds on being right-handed are 9 to 1, so the probability of being right-handed is 9 / (1 + 9) = 0.9. Conversely,

$$\text{odds} = \frac{\text{probability}}{1 - \text{probability}}$$

In the example, the probability of being right-handed is 0.9, so the odds of being right-handed are 0.9 / (1 − 0.9) = 0.9 / 0.1 = 9 (to 1).

With these preliminaries out of the way, we can proceed to the problem.

**The legal problem.** The first problem lies in the fact that the answer depends on Bayes’ theorem. Although that was published in 1763, statisticians are still arguing about how it should be used to this day. In fact, whenever it’s mentioned, statisticians tend to revert to internecine warfare and forget about the user. Bayes’ theorem can be stated in words as follows:

$$\text{posterior odds ratio} = \text{prior odds ratio} \times \text{likelihood ratio}$$

“Posterior odds ratio” means the odds that the person is guilty, relative to the odds that they are innocent, in the light of the evidence, and that’s clearly what one wants to know. The “prior odds” are the odds that the person was guilty before any evidence was produced, and that is the really contentious bit.
Sometimes the need to specify the prior odds has been circumvented by using the likelihood ratio alone but, as shown below, that isn’t a good solution. The analogy with the use of screening tests to detect disease is illuminating.

**Screening tests.** A particularly straightforward application of Bayes’ theorem is in screening people to see whether or not they have a disease. It turns out, in many cases, that screening gives many more wrong results (false positives) than right ones. That’s especially true when the condition is rare (the prior odds that an individual suffers from the condition are small). The process of screening for disease has a lot in common with the screening of suspects for guilt, and it matters because false positives in court are disastrous. The screening problem is dealt with in sections 1 and 2 of my paper, and on this blog (and here). A bit of animation helps the slides, so you may prefer the YouTube version (it deals with screening tests up to 8’45”).

The rest of my paper applies similar ideas to tests of significance. In that case the prior probability is the probability that there is in fact a real effect or, in the legal case, the probability that the suspect is guilty before any evidence has been presented. This is the slippery part of the problem, both conceptually and because it’s hard to put a number on it. But the examples below show that to ignore it, and to use the likelihood ratio alone, could result in many miscarriages of justice.

In the discussion of tests of significance, I took the view that it is not legitimate (in the absence of good data to the contrary) to assume any prior probability greater than 0.5: to do so would presume you know the answer before any evidence was presented. In the legal case a prior probability of 0.5 would mean assuming that there was a 50:50 chance that the suspect was guilty before any evidence was presented. A 50:50 probability of guilt before the evidence is known corresponds to a prior odds ratio of 1 (to 1). If that were true, the likelihood ratio would be a good way to represent the evidence, because the posterior odds ratio would be equal to the likelihood ratio. It could be argued that 50:50 represents some sort of equipoise, but in the example below it is clearly too high, and if the prior odds are less than 50:50, use of the likelihood ratio runs a real risk of convicting an innocent person.

The following example is modified slightly from section 3 of a book chapter by Mortera and Dawid (2008). Philip Dawid is an eminent statistician who has written a lot about probability and the law, and he’s a member of the legal group of the Royal Statistical Society. My version of the example removes most of the algebra and uses different numbers.

**Example: the island problem.** The “island problem” (Eggleston 1983, Appendix 3) is an imaginary example that provides a good illustration of the uses and misuses of statistical logic in forensic identification. A murder has been committed on an island, cut off from the outside world, on which 1001 (= N + 1) inhabitants remain. The forensic evidence at the scene consists of a measurement, x, on a “crime trace” characteristic, which can be assumed to come from the criminal. It might, for example, be a bit of the DNA sequence from the crime scene. Say, for the sake of example, that the probability of a random member of the population having characteristic x is P = 0.004 (i.e. 0.4%), so the probability that a random member of the population does not have the characteristic is 1 − P = 0.996.
The mainland police arrive and arrest a random islander, Jack. It is found that Jack matches the crime trace. There is no other relevant evidence. How should this match evidence be used to assess the claim that Jack is the murderer? We shall consider three arguments that have been used to address this question. The first is wrong; the second and third are right. (For illustration, we have taken N = 1000, P = 0.004.)

**(1) Prosecutor’s fallacy.** Prosecuting counsel, arguing according to his favourite fallacy, asserts that the probability that Jack is guilty is 1 − P, or 0.996, and that this proves guilt “beyond a reasonable doubt”. The reasoning runs: the probability that Jack would show characteristic x if he were not guilty is 0.4%, i.e. Prob(Jack has x | not guilty) = 0.004, so the probability of guilt must be 1 − 0.004 = 0.996. But 0.004 is Prob(evidence | not guilty), which is not what we want. What we need is the probability that Jack is guilty given the evidence, Prob(Jack is guilty | Jack has characteristic x). To mistake the latter for the former is the prosecutor’s fallacy, or the error of the transposed conditional.

Dawid gives an example that makes the distinction clear: “As an analogy to help clarify and escape this common and seductive confusion, consider the difference between ‘the probability of having spots, if you have measles’, which is close to 1, and ‘the probability of having measles, if you have spots’, which, in the light of the many alternative possible explanations for spots, is much smaller.”

**(2) Defence counter-argument.** Counsel for the defence points out that, while the guilty party must have characteristic x, he isn’t the only person on the island to have this characteristic. Among the remaining N = 1000 innocent islanders, 0.4% have characteristic x, so the number who have it will be NP = 1000 × 0.004 = 4. Hence the total number of islanders who have this characteristic must be 1 + NP = 5. The match evidence means that Jack must be one of these 5 people, but does not otherwise distinguish him from the other members of the group. Since just one of these is guilty, the probability that it is Jack is 1/5, or 0.2: very far from being “beyond all reasonable doubt”.

**(3) Bayesian argument.** The probability of having characteristic x (the evidence) would be Prob(evidence | guilty) = 1 if Jack were guilty, but if Jack were not guilty it would be 0.4%, i.e. Prob(evidence | not guilty) = P. Hence the likelihood ratio in favour of guilt, on the basis of the evidence, is

$$LR = \frac{\text{Prob(evidence } | \text{ guilty)}}{\text{Prob(evidence } | \text{ not guilty)}} = \frac{1}{P} = 250$$

In words, the evidence would be 250 times more probable if Jack were guilty than if he were innocent. While this seems strong evidence in favour of guilt, it still does not tell us what we want to know, namely the probability that Jack is guilty in the light of the evidence, Prob(guilty | evidence), or, equivalently, the odds of guilt relative to the odds of innocence given the evidence. To get that we must multiply the likelihood ratio by the prior odds on guilt, i.e. the odds on guilt before any evidence is presented. It’s often hard to get a numerical value for this, but in our artificial example it is possible.
We can argue that, in the absence of any other evidence, Jack is no more nor less likely to be the culprit than any other islander, so the prior probability of guilt is 1/(N + 1), corresponding to prior odds on guilt of 1/N. We can now apply Bayes’ theorem to obtain the posterior odds on guilt:

$$\text{posterior odds} = \text{prior odds} \times LR = \left( \frac{1}{N} \right) \times \left( \frac{1}{P} \right) = 0.25$$

Thus the odds of guilt in the light of the evidence are 4 to 1 against. The corresponding posterior probability of guilt is

$$\text{Prob(guilty } | \text{ evidence)} = \frac{1}{1 + NP} = \frac{1}{1 + 4} = 0.2$$

This is quite small, certainly no basis for a conviction. This result is exactly the same as that given by the defence counter-argument above. That argument was simpler than the Bayesian argument: it didn’t explicitly use Bayes’ theorem, though the theorem was implicit in the reasoning. The advantage of the defence argument is that it looks simpler; the advantage of the explicitly Bayesian argument is that it makes the assumptions clearer.

**In summary.** The prosecutor’s fallacy suggested, quite wrongly, that the probability that Jack was guilty was 0.996. The likelihood ratio was 250, which also seems to suggest guilt, but it doesn’t give us the probability that we need. In stark contrast, the defence counsel’s argument, and equivalently the Bayesian argument, gave the probability of Jack’s guilt as 0.2, or odds of 4 to 1 against guilt. The potential for wrong conviction is obvious.

**Conclusions.** Although this argument uses an artificial example that is simpler than most real cases, it illustrates some important principles.

(1) The likelihood ratio is not a good way to evaluate evidence, unless there is good reason to believe that there is a 50:50 chance that the suspect is guilty before any evidence is presented.

(2) In order to calculate what we need, Prob(guilty | evidence), we need a numerical value for how common possession of characteristic x (the evidence) is in the whole population of possible suspects (a reasonable value might be estimated in the case of DNA evidence). We also need to know the size of that population. In the island example this was 1000, but in general it would be hard to pin down, and any answer might well be contested by an advocate who understood the problem.

These arguments lead to four conclusions.

(1) If a lawyer uses the prosecutor’s fallacy, (s)he should be told that it’s nonsense.

(2) If a lawyer advocates conviction on the basis of the likelihood ratio alone, (s)he should be asked to justify the implicit assumption that there was a 50:50 chance that the suspect was guilty before any evidence was presented.

(3) If a lawyer uses the defence counter-argument or, equivalently, the version of the Bayesian argument given here, (s)he should be asked to justify the estimates of the numerical value given to the prevalence of x in the population (P) and the numerical value of the size of this population (N). A range of values of P and N could be used, to provide a range of possible values of the final result, the probability that the suspect is guilty in the light of the evidence.

(4) The example used here is the simplest possible case. For more complex cases it would be advisable to ask a professional statistician. Some reliable people can be found at the Royal Statistical Society’s section on Statistics and the Law.
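Since everything in the island example is numeric, the whole argument can be checked in a few lines of R. This is just a sanity check of the numbers above, not code from the original post:

```r
# Island problem: reproducing the numbers from the text
N <- 1000                 # innocent islanders
P <- 0.004                # prevalence of characteristic x

prior_odds <- 1 / N       # Jack a priori no more likely to be guilty than anyone else
LR         <- 1 / P       # Prob(evidence | guilty) / Prob(evidence | not guilty) = 250

posterior_odds <- prior_odds * LR                 # 0.25, i.e. 4 to 1 against guilt
posterior_prob <- posterior_odds / (1 + posterior_odds)
posterior_prob                                    # 0.2, the same as 1 / (1 + N * P)
```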
If you do ask a professional statistician, and they present you with a lot of mathematics, you should still ask these questions about precisely what assumptions were made, and ask for an estimate of the range of uncertainty in the value of Prob(guilty | evidence) which they produce.

**Postscript: real cases.** Another paper by Philip Dawid, Statistics and the Law, is interesting because it discusses some recent real cases: for example, the wrongful conviction of Sally Clark because of the wrong calculation of the statistics for Sudden Infant Death Syndrome.

On Monday 21 March 2016, Dr Waney Squier was struck off the medical register by the General Medical Council because they claimed that she misrepresented the evidence in cases of Shaken Baby Syndrome (SBS). This verdict was questioned by many lawyers, including Michael Mansfield QC and Clive Stafford Smith, in a letter, “General Medical Council behaving like a modern inquisition”. The latter has already written “This shaken baby syndrome case is a dark day for science – and for justice”.

The evidence for SBS is based on the existence of a triad of signs (retinal bleeding, subdural bleeding and encephalopathy). It seems likely that these signs will be present if a baby has been shaken, i.e. Prob(triad | shaken) is high. But this is irrelevant to the question of guilt: for that we need Prob(shaken | triad). As far as I know, the data needed to calculate what matters are just not available. It seems that the GMC may have fallen for the prosecutor’s fallacy. Or perhaps the establishment won’t tolerate arguments. One is reminded, once again, of the definition of clinical experience: “Making the same mistakes with increasing confidence over an impressive number of years” (from A Sceptic’s Medical Dictionary by Michael O’Donnell, BMJ Publishing, 1997).

**Appendix (for nerds): two forms of Bayes’ theorem.** The form of Bayes’ theorem given at the start is expressed in terms of odds ratios. The same rule can be written in terms of probabilities (this was the form used in the appendix of my paper). For those interested in the details, it may help to define these two forms explicitly. In terms of probabilities, the probability of guilt in the light of the evidence (what we want) is

$$\text{Prob(guilty } | \text{ evidence)} = \text{Prob(evidence } | \text{ guilty)} \times \frac{\text{Prob(guilty)}}{\text{Prob(evidence)}}$$

In terms of odds ratios, the odds on guilt, given the evidence (which is what we want), are

$$\frac{\text{Prob(guilty } | \text{ evidence)}}{\text{Prob(not guilty } | \text{ evidence)}} = \left( \frac{\text{Prob(guilty)}}{\text{Prob(not guilty)}} \right) \left( \frac{\text{Prob(evidence } | \text{ guilty)}}{\text{Prob(evidence } | \text{ not guilty)}} \right)$$

or, in words,

$$\text{posterior odds of guilt} = \text{prior odds of guilt} \times \text{likelihood ratio}$$

This is the precise form of the equation that was given in words at the beginning. A derivation of the equivalence of these two forms is sketched in a document which you can download.

**Follow-up, 23 March 2016.** It’s worth pointing out the following connection between the legal argument above and tests of significance.

(1) The likelihood ratio works only when there is a 50:50 chance that the suspect is guilty before any evidence is presented (so the prior probability of guilt is 0.5 or, equivalently, the prior odds ratio is 1).
(2) The false positive rate in significance testing is close to the P value only when the prior probability of a real effect is 0.5, as shown in section 6 of the P value paper.

However, there is another twist in the significance-testing argument. The statement above is right if we take as a positive result any P < 0.05. If we want to interpret a value of P = 0.047 in a single test then, as explained in section 10 of the P value paper, we should restrict attention to only those tests that give P close to 0.047. When that is done the false positive rate is 26% even when the prior is 0.5 (and much bigger than 30% if the prior is smaller; see the extra figure). That justifies the assertion that if you claim to have discovered something because you have observed P = 0.047 in a single test, there is a chance of at least 30% that you’ll be wrong. Is there, I wonder, any legal equivalent of this argument?
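The first statement can be illustrated with a small simulation. This sketch is mine, not the paper’s code, and the sample size and effect size are arbitrary choices; with a prior probability of 0.5 that a real effect exists, the false positive rate among results declared significant at P < 0.05 comes out near the significance level, while the stricter “P close to 0.047” conditioning described above is what pushes the rate far higher:

```r
# Simulate many two-sample t-tests; in half of them a real effect exists
# (prior probability 0.5). Among the "significant" results, count how many
# come from experiments with no real effect (the false positive rate).
set.seed(20160323)
n_sim <- 20000
n     <- 16      # observations per group (arbitrary)
delta <- 1       # true effect size, in SD units, when an effect exists (arbitrary)

real_effect <- runif(n_sim) < 0.5
p_values <- vapply(real_effect, function(has_effect) {
  x <- rnorm(n)
  y <- rnorm(n, mean = if (has_effect) delta else 0)
  t.test(x, y)$p.value
}, numeric(1))

significant <- p_values < 0.05
mean(!real_effect[significant])   # false positive rate among significant tests
```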

## Mapping Birds with Choroplethr


## R 3.2.4 is released


## First steps with Non-Linear Regression in R

(This article was first published on DataScience+, and kindly contributed to R-bloggers.)

Drawing a line through a cloud of points (i.e. doing a linear regression) is the most basic analysis one may do. Sometimes it fits the data well, but in many situations the relationships between variables are not linear. In that case one may follow three different routes: (i) try to linearize the relationship by transforming the data, (ii) fit polynomial or complex spline models to the data, or (iii) fit non-linear functions to the data. As you may have guessed from the title, this post is dedicated to the third option.

**What is non-linear regression?** In non-linear regression the analyst specifies a function with a set of parameters to fit to the data. The most basic way to estimate such parameters is non-linear least squares (function nls in R), which approximates the non-linear function with a linear one and iteratively tries to find the best parameter values (wiki). A nice feature of non-linear regression in an applied context is that the estimated parameters have a clear interpretation (Vmax in a Michaelis-Menten model is the maximum rate), which would be harder to get from linear models on transformed data, for example.

**Fit non-linear least squares.** First example, using the Michaelis-Menten equation:

```r
# simulate some data
set.seed(20160227)
x <- seq(0, 50, 1)
y <- ((runif(1, 10, 20) * x) / (runif(1, 0, 10) + x)) + rnorm(51, 0, 1)
# for simple models nls finds good starting values for the parameters,
# even if it throws a warning
m <- nls(y ~ a * x / (b + x))
# get some estimate of goodness of fit
cor(y, predict(m))
## [1] 0.9496598
# plot
plot(x, y)
lines(x, predict(m), lty = 2, col = "red", lwd = 3)
```

(Plot not reproduced here.)

**The issue of finding the right starting values.** Finding good starting values is very important in non-linear regression, to allow the model algorithm to converge. If you set starting parameter values completely outside the range of potential values, the algorithm will either fail or return nonsensical parameters, for example a growth rate of 1000 when the actual value is 1.04. The best way to find correct starting values is to “eyeball” the data: plot them and, based on your understanding of the equation, read off approximate starting values for the parameters.

```r
# simulate some data, this time without a priori knowledge of the parameter values
y <- runif(1, 5, 15) * exp(-runif(1, 0.01, 0.05) * x) + rnorm(51, 0, 0.5)
# visually estimate some starting parameter values
plot(x, y)
# from this graph set approximate starting values
a_start <- 8                     # param a is the y value when x = 0
b_start <- 2 * log(2) / a_start  # b is the decay rate
# fit the model
m <- nls(y ~ a * exp(-b * x), start = list(a = a_start, b = b_start))
# get some estimate of goodness of fit
cor(y, predict(m))
## [1] 0.9811831
# plot the fit
lines(x, predict(m), col = "red", lty = 2, lwd = 3)
```

(Plot not reproduced here.)

**Using self-starting functions.** It is very common for different scientific fields to use different parametrizations (i.e. different equations) for the same model. One example is the logistic population growth model; in ecology we use the following form:

$$N_{t} = \frac{K N_{0} e^{r t}}{K + N_{0} \left( e^{r t} - 1 \right)}$$

with $N_{t}$ being the number of individuals at time $t$, $r$ the population growth rate, and $K$ the carrying capacity.
We can re-write this as a differential equation:

$$\frac{dN}{dt} = R N \left( 1 - \frac{N}{K} \right)$$

```r
library(deSolve)
# simulate some population growth from the logistic equation,
# then estimate the parameters using nls
log_growth <- function(Time, State, Pars) {
  with(as.list(c(State, Pars)), {
    dN <- R * N * (1 - N / K)
    return(list(c(dN)))
  })
}
# the parameters for the logistic growth
pars <- c(R = 0.2, K = 1000)
# the initial numbers
N_ini <- c(N = 1)
# the time steps at which to evaluate the ODE
times <- seq(0, 50, by = 1)
# solve the ODE
out <- ode(N_ini, times, log_growth, pars)
# add some random variation to it
N_obs <- out[, 2] + rnorm(51, 0, 50)
# numbers cannot go lower than 1
N_obs <- ifelse(N_obs < 1, 1, N_obs)
# plot
plot(times, N_obs)
```

This part was just to simulate some data with random error; now comes the tricky part, estimating the starting values. R has a built-in function to estimate starting values for the parameters of a logistic equation (SSlogis), but it uses the following parametrization:

$$N_{t} = \frac{\alpha}{1 + e^{\frac{xmid - t}{scale}}}$$

```r
# find starting values for the parameters of the SSlogis parametrization
SS <- getInitial(N_obs ~ SSlogis(times, alpha, xmid, scale),
                 data = data.frame(N_obs = N_obs, times = times))
```

We use the function getInitial, which gives some initial guesses for the parameter values based on the data. We pass it a self-starting model (SSlogis), which takes as arguments an input vector (the t values where the function will be evaluated) and the unquoted names of the three parameters of the logistic equation. However, as SSlogis uses a different parametrization, we need a bit of algebra to go from the self-starting values returned by SSlogis to the parameters of the equation we want to use:

```r
# convert from the SSlogis parametrization to ours
K_start  <- SS["alpha"]
R_start  <- 1 / SS["scale"]
N0_start <- SS["alpha"] / (exp(SS["xmid"] / SS["scale"]) + 1)
# the formula for the model
log_formula <- formula(N_obs ~ K * N0 * exp(R * times) / (K + N0 * (exp(R * times) - 1)))
# fit the model
m <- nls(log_formula, start = list(K = K_start, R = R_start, N0 = N0_start))
# estimated parameters
summary(m)
```

```
Formula: N_obs ~ K * N0 * exp(R * times)/(K + N0 * (exp(R * times) - 1))

Parameters:
    Estimate Std. Error t value Pr(>|t|)
K  1.012e+03  3.446e+01  29.366   <2e-16 ***
R  2.010e-01  1.504e-02  13.360   <2e-16 ***
N0 9.600e-01  4.582e-01   2.095   0.0415 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 49.01 on 48 degrees of freedom

Number of iterations to convergence: 1
Achieved convergence tolerance: 1.537e-06
```

```r
# get some estimate of goodness of fit
cor(N_obs, predict(m))
## [1] 0.9910316
# plot
lines(times, predict(m), col = "red", lty = 2, lwd = 3)
```

(Plot not reproduced here.)

That was a bit of a hassle, getting from the SSlogis parametrization to our own, but it was worth it! In a later post we will see how to go beyond non-linear least squares and embrace maximum likelihood estimation methods, which are way more powerful and reliable: they allow you to build any model that you can imagine.

## The difference between a confidence interval and a capture percentage


## After 150 Years, the ASA Says No to p-values


## Principal Component Analysis using R

**Curse of dimensionality.** One of the most common problems in data-analytics work such as recommendation engines and text analytics is high-dimensional, sparse data. We often face situations where we have a large set of features and few data points, or data with very high-dimensional feature vectors. In such scenarios, fitting a model to the dataset results in lower predictive power of the model; this is often termed the curse of dimensionality. In general, adding more data points or decreasing the feature space, also known as dimensionality reduction, reduces the effects of the curse of dimensionality. In this blog we discuss principal component analysis (PCA), a popular dimensionality reduction technique. PCA is a useful statistical method that has found application in a variety of fields and is a common technique for finding patterns in data of high dimension.

**Principal component analysis.** Consider the following scenario: the data we want to work with take the form of a matrix A of dimension m × n, where $A_{i,j}$ represents the value of the i-th observation of the j-th variable. Each of the m rows thus corresponds to an observation described by an n-dimensional vector of variable values. If n is very large, it is often desirable to reduce the number of variables to a smaller number, say k (as in the image in the original post), while losing as little information as possible.

Mathematically speaking, PCA is a linear orthogonal transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. When applied, the algorithm linearly transforms the original input space to a lower-dimensional output space, with the objective of minimising the amount of information/variance lost by discarding the remaining dimensions. PCA thus lets us discard the directions along which the data show little variance.

Technically speaking, PCA uses an orthogonal projection of highly correlated variables onto a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. The transformation is defined in such a way that the first principal component has the largest possible variance, accounting for as much of the variability in the data as possible. Each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. In the figure from the original post (not reproduced here), u1 and u2 are principal components: u1 accounts for the highest variance in the dataset, and u2 accounts for the next highest and is orthogonal to u1.

For today’s post we use the crimtab dataset available in R: data on 3000 male criminals over 20 years old undergoing their sentences in the chief prisons of England and Wales. The 42 row names (“9.4”, “9.5”, …) correspond to midpoints of intervals of finger lengths, whereas the 22 column names (“142.24”, “144.78”, …) correspond to (body) heights of the 3000 criminals; see also below.
```r
head(crimtab)
##     142.24 144.78 147.32 149.86 152.4 154.94 157.48 160.02 162.56 165.1 167.64 170.18 172.72 175.26 177.8 180.34
## 9.4      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
## 9.5      0      0      0      0     0      1      0      0      0     0      0      0      0      0     0      0
## 9.6      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
## 9.7      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
## 9.8      0      0      0      0     0      0      1      0      0     0      0      0      0      0     0      0
## 9.9      0      0      1      0     1      0      1      0      0     0      0      0      0      0     0      0
##     182.88 185.42 187.96 190.5 193.04 195.58
## 9.4      0      0      0     0      0      0
## 9.5      0      0      0     0      0      0
## 9.6      0      0      0     0      0      0
## 9.7      0      0      0     0      0      0
## 9.8      0      0      0     0      0      0
## 9.9      0      0      0     0      0      0

dim(crimtab)
## [1] 42 22

str(crimtab)
## 'table' int [1:42, 1:22] 0 0 0 0 0 0 1 0 0 0 ...
## - attr(*, "dimnames")=List of 2
##  ..$ : chr [1:42] "9.4" "9.5" "9.6" "9.7" ...
##  ..$ : chr [1:22] "142.24" "144.78" "147.32" "149.86" ...

sum(crimtab)
## [1] 3000

colnames(crimtab)
##  [1] "142.24" "144.78" "147.32" "149.86" "152.4"  "154.94" "157.48" "160.02" "162.56" "165.1"  "167.64" "170.18" "172.72" "175.26" "177.8"  "180.34"
## [17] "182.88" "185.42" "187.96" "190.5"  "193.04" "195.58"
```

Let us use apply() on the crimtab dataset column-wise to calculate the variance, to see how much each variable varies:

```r
apply(crimtab, 2, var)
```

We observe that column “165.1” contains the maximum variance in the data. Now we apply PCA using prcomp():

```r
pca <- prcomp(crimtab)
```

Note: the components of the resulting pca object correspond to the standard deviations and the rotation. From the standard deviations we can observe that the first principal component explains most of the variation, followed by the remaining components. The rotation contains the principal component loadings matrix, which gives the contribution of each variable along each principal component.

Let’s plot all the principal components and see how the variance is accounted for by each component:

```r
par(mar = rep(2, 4))
plot(pca)
```

Clearly the first principal component accounts for the maximum information. Let us interpret the results using a biplot, which shows the contribution of each variable along the two leading principal components:

```r
# the code below flips the directions of the biplot; without these two lines
# the plot will be a mirror image of the one shown in the original post
pca$rotation <- -pca$rotation
pca$x <- -pca$x
biplot(pca, scale = 0)
```

In the resulting biplot (not reproduced here) we can see the two principal components (PC1 and PC2) of the crimtab dataset. The red arrows represent the loading vectors, which show how the feature space varies along the principal component vectors. From the plot we can see that the first principal component, PC1, places roughly equal weight on three features: 165.1, 167.64, and 170.18. This means that these three features are more correlated with one another than with the 160.02 and 162.56 features. The second principal component, PC2, places more weight on 160.02 and 162.56 than on the three features 165.1, 167.64, and 170.18, which are less correlated with them.

So now that we understand how to run PCA and how to interpret the principal components, where do we go from here? How do we apply the reduced-variable dataset? We shall answer these questions in the next post.
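As an aside not in the original post, the proportion of variance explained by each component, which the post reads off the scree plot, can also be computed directly from the prcomp object:

```r
# proportion of variance explained by each principal component
pve <- pca$sdev^2 / sum(pca$sdev^2)
round(pve, 3)

# summary() reports the same information: standard deviations,
# proportion of variance, and cumulative proportion
summary(pca)
```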

## QGIS 3.0 plans


## QGIS 2.14 ‘Essen’ is released!


## Good news for QGIS MapInfo users

So, some good news for QGIS users who also need or want to use MapInfo: QGIS, via GDAL 2.0, can support MapInfo TAB file editing. All older versions of GDAL supported read and/or write, but not both. MapInfo TAB editing has been supported since GDAL 2, but until now QGIS had only been built against GDAL 1.x. GDAL 2.x is now the default GDAL release in OSGeo4W. From Jurgen:

> 2.0.2 is now the default GDAL in OSGeo4W and the nightlies (qgis-ltr-dev, qgis-rel-dev and qgis-dev) already picked it up. With the next release the regular packages (2.14 and 2.8) will also be updated to use it.

Even if you don’t want to make the full switch to QGIS, you can now use both pieces of software and edit in both. Note that QGIS still only supports a single geometry type per layer, so if you open a mixed TAB file you will get the geometry type selector; you can load the layer three times if you need all three geometry types.
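For R users following this scoop, a similar check of what your own GDAL build supports can be run from R. This is an aside rather than part of the original post, and it assumes the rgdal package is installed:

```r
# Which GDAL is R linked against, and is the MapInfo driver available?
library(rgdal)
getGDALVersionInfo()                 # e.g. reports a GDAL 2.x version string
drv <- ogrDrivers()
drv[drv$name == "MapInfo File", ]    # the 'write' column shows write support
```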

## OS X XQuartz Vulnerability Test Using R

(This article was first published on R – rud.is, and kindly contributed to R-bloggers.)

It’s usually a good thing when my #rstats and infosec worlds collide. Unfortunately, this time it’s a script that R folk running on OS X can use to see if they are using a version of XQuartz that has a nasty vulnerability in the framework it uses to auto-update. If this test comes back with the warning, try to refrain from using XQuartz on insecure networks until the developers fix the issue.

UPDATE: Thanks to a gist prodding by @bearloga, here’s a script to scan all your applications for the vulnerability:

```r
library(purrr)
library(dplyr)
library(XML)

read_plist   <- safely(readKeyValueDB)
safe_compare <- safely(compareVersion)

apps <- list.dirs(c("/Applications", "/Applications/Utilities"), recursive = FALSE)

# if you have something further than this far down that's bad you're on your own
for (i in 1:4) {
  moar_dirs <- grep("app$", apps, value = TRUE, invert = TRUE)
  if (length(moar_dirs) > 0) {
    apps <- c(apps, list.dirs(moar_dirs, recursive = FALSE))
  }
}
apps <- unique(grep("app$", apps, value = TRUE))

pb <- txtProgressBar(0, length(apps), style = 3)

suppressWarnings(map_df(1:length(apps), function(i) {

  x <- apps[i]
  setTxtProgressBar(pb, i)

  is_vuln <- FALSE
  version <- ""

  app_name <- sub("\\.app$", "", basename(x))
  app_loc  <- sub("^/", "", dirname(x))

  # plists that may record the bundled Sparkle framework version
  to_look <- c(sprintf("%s/Contents/Frameworks/Autoupdate.app/Contents/Info.plist", x),
               sprintf("%s/Contents/Frameworks/Sparkle.framework/Versions/A/Resources/Info.plist", x),
               sprintf("%s/Contents/Frameworks/Sparkle.framework/Versions/A/Resources/Autoupdate.app/Contents/Info.plist", x))

  is_there <- map_lgl(c(sprintf("%s/Contents/Frameworks/Sparkle.framework/", x), to_look),
                      file.exists)

  has_sparkle <- any(is_there)
  to_look <- to_look[which(is_there[-1])]

  # read the Sparkle version strings that are actually present
  discard(map_chr(to_look, function(x) {
    read_plist(x)$result$CFBundleShortVersionString %||% NA_character_
  }), is.na) -> vs

  # anything older than Sparkle 1.16.1 is flagged as vulnerable
  if (any(map_dbl(vs, function(v) { safe_compare(v, "1.16.1")$result %||% -1 }) < 0)) {
    is_vuln <- TRUE
    version <- vs[1]
  }

  data_frame(app_loc, app_name, has_sparkle, is_vuln, version)

})) -> app_scan_results

close(pb)

select(arrange(filter(app_scan_results, has_sparkle), app_loc, app_name), -has_sparkle)
```

## Introducing pdftools – A fast and portable PDF extractor


## Practical tips for bioinformatics work

(This article was first published on One Tip Per Day, and kindly contributed to R-bloggers.)

The tips below are based on lessons I learnt from making mistakes during my years of research. They are purely personal opinion, and the order doesn’t mean anything. If you think I should include something else, please comment below.

1. Always set a seed number when you run tools with a random option (e.g. bedtools shuffle, random, etc.). You (or your boss, one day) will want your work to be reproducible.
2. Set your own temporary folder (via --tmp, $TMPDIR, etc., depending on your program). By default many tools, e.g. sort, use the system /tmp as the temporary folder, which may have a limited quota that is not enough for your big NGS data.
3. Always use valid names for your variables and for the columns and rows of data frames. Otherwise you can hit unexpected problems; e.g. a ‘-’ in a name will be converted to ‘.’ in R unless you specify check.names=F.
4. Always make a README file for each data folder. For why and how, read this: http://stackoverflow.com/questions/2304863/how-to-write-a-good-readme
5. Always comment your code properly, for yourself and for others, as you will very likely read your own ugly code again.
6. Always back up your code promptly, using GitHub, svn, Time Machine, or simply copy/paste, whatever works.
7. Always clean up intermediate or unnecessary data: you can easily shock your boss (and yourself) with how much data you generate, most of it probably useless.
8. Don’t save to *.sam if you can use *.bam, and always compress your fastq (and other large plain-text files) as much as you can; the same applies to any format with a compressed alternative. You cannot imagine how much “digital garbage” you will otherwise generate.
9. Use parallelism as much as you can, e.g. “LC_ALL=C sort --parallel=24 --buffer-size=5G” for sorting (https://www.biostars.org/p/66927/), since multi-core CPUs/GPUs are common nowadays.
10. When a project is completed, remember to clean up the project folder, including removing unnecessary code/data/intermediate files, and archive the project (e.g. burn a CD). You never know when you, your boss, or your collaborators will need the data again.
11. Make your code as sustainable as you can. Remember the three major features of OOP: inheritance, encapsulation, polymorphism. (URL)
12. When you learn a tip from others via a Google search, record the URL for future reference and to acknowledge the other person’s credit. This applies to this post too, of course.
13. Keep learning, or you will soon be left behind. Just like NGS techniques, computing skills are evolving quickly; always catch up with new skills, tools, and techniques.
14. When you learn tips from others, remember to share what you have learned with the community as well; that is how the community grows healthily.
15. Last but not least, stand up and move around after sitting for 1-2 hours. This is especially important for bioinformaticians, who usually sit in front of a computer for hours. Only good health can sustain a long career. More reading: https://www.washingtonpost.com/news/wonk/wp/2015/06/02/medical-researchers-have-figured-out-how-much-time-is-okay-to-spend-sitting-each-day/
## Upload plots as PNG file to your wordpress

(This article was first published on R – Networkx, and kindly contributed to R-bloggers.)

**Synopsis.** Note! This post is an addition to “Create blog posts from RStudio to WordPress”. About a week ago I tried to add my blog to R-Bloggers, and I thought everything was in order. But this week I got a mail from Tal Galili (site administrator of R-Bloggers) with the message that my blog uses base64 images in my feed. This type of image is created, as the default option, when knitr builds a standalone HTML document. It would be great if I could solve it with PNG instead of base64 images, and even better if I could do so while still posting from RStudio to WordPress. And so I did. In this post I will explain how to publish a WordPress post with PNG files uploaded to your blog.

**How to recognise a base64 image.** Check the RSS feed of your blog and search for data:image/png;base64; it should look something like the screenshot in the original post (thx Tal for the image).

**Setting your options.** To upload PNG files to your blog you first need to set some knitr options. As I described in my earlier post, you have to set your login parameters to log in to your WordPress blog. For the upload of files to WordPress you also need to set the upload.fun option of knitr, which takes a filename as input and returns a link to the image for the RMarkdown file. Let’s set the login parameters and the upload.fun option. (In my own source these lines are commented out, because I cannot post with the dummy credentials.)

```r
# Load libraries
library(RWordPress)
library(knitr)

# Login parameters
options(WordPressLogin = c(your_username = "your_password"),
        WordPressURL = "http://your.blog.com/xmlrpc.php")

# Upload your plots as PNG files to your blog
opts_knit$set(upload.fun = function(file) {
  library(RWordPress)
  uploadFile(file)$url
})
```

After setting the above credentials you are ready to upload a post with PNG images to your blog.

**Example of posting your PNG image.** For completeness, here is a PNG image created within R:

```r
data(iris)
pairs(iris, col = iris$Species)
```

After this I can post to my blog with knit2wpCrayon:

```r
knit2wpCrayon("r2blog_with_png.Rmd",
              title = "Upload plots as PNG file to your wordpress",
              categories = c("R", "Programming"),
              publish = FALSE, upload = TRUE)
```

(Again, this call is commented out in the source because the post cannot publish itself.) The code can also be found on my GitHub.