Following the popularity of the AutoTrace plugin, we received a number of requests for additional features and for incorporating the tool into the main Advanced Digitising tools in QGIS. Funding for the work was generously raised with the help of the community in 2015, and the feature landed in QGIS 2.14. Read more if you’d like to learn to use these new tools.

How to use Trace Digitising

Check out the Trace Digitising documentation in QGIS for detailed instructions on how to use the tool.

Key features

The trace tool:
- uses Dijkstra’s shortest path algorithm to find traceable routes
- can trace routes over multiple distinct features
- can be used with Advanced Digitising tools (e.g. reshaping)
- can be enabled and disabled by pressing T on your keyboard while digitising
- is fast and easy to use

The following video demonstrates how to use the tracing tool. This video shows how you can make use of tracing while reshaping a polygon.

Sponsors

The project was kindly sponsored by the QGIS community. Special thanks to all those who contributed towards the development of this tool:
- The Royal Borough of Windsor and Maidenhead
- Neath Port Talbot County Borough Council
- Ujaval Gandhi
- Surrey Heath Borough Council
- Matias Arnold
- Northumberland National Park Authority
- Buccleuch Estates Limited
- Countryscape
I haven’t worked with raster or spatial polygon data in R much before, but I want to create maps to show the results of a spatiotemporal model. Specifically, I want to depict the changes in abundance of fish in a stream network over time and space. I decided to play around with various packages for manipulating spatial data, then use base `plot` and `ggplot2` functions to make the maps. Rather than post the code and output here, it was easier to publish it to RPubs and link to it: http://rpubs.com/djhocking/155740

The scale, colors, projection, and background will have to be adjusted based on the purpose of the map, but hopefully the code and explanations will help people get started. This was intended as a learning tool for me, rather than strictly as a tutorial, but I figured I would share it in case it proved useful to others. Here are some links to additional information I found useful along the way:
- Add a scale bar
- Add a north arrow and scale bar
- R GIS Tutorial
- R CRS System by NCEAS
- ggplot2 mapping in R by Zev Ross (he has tons of great info on his site!)
- ggplot2 cheat sheet
- rasterVis – haven’t tried it yet but it looks very useful
- Raster data in R by NEON
There is a push for teachers and students to use real data in learning statistics. In this post I am going to address the benefits and drawbacks of different sources of real data, and make a case for the use of good fictional data as part of a statistical programme. Here is a video introducing our fictional data set of 180 or 240 dragons, so you know what I am referring to.

Real collected, real database, trivial, fictional

There are two main types of real data. There is the real data that students themselves collect, and there is real data in a dataset, collected by someone else and available in its entirety. There are also two main types of unreal data. The first is trivial and lacking in context, useful only for teaching mathematical manipulation. The second is what I call fictional data, which is usually based on real-life data but has some extra advantages, so long as it is skilfully generated. Poorly generated fictional data, as often found in case studies, is very bad for teaching.

Focus

When deciding what data to use for teaching statistics, it matters what it is that you are trying to teach. If you are simply teaching how to add up 8 numbers and divide the result by 8, then you are not actually doing statistics, and trivial fake data will suffice. Statistics only exists when there is a context. If you want to teach about the statistical enquiry process, then having the students genuinely involved at each stage of the process is a good idea. If you are particularly wanting to teach about fitting a regression line, you generally want to have multiple examples for students to use. And it would be helpful for there to be at least one linear relationship.

I read a very interesting article in “Teaching Children Mathematics” entitled “Practical Problems: Using Literature to Teach Statistics”. The authors, Hourigan and Leavy, used a children’s book to generate data on the number of times different characters appeared.
But what I liked most was that they addressed the need for a “driving question”. In this case the question was provided by a preschool teacher who could only afford to buy one puppet for the book, and wanted to know which character appears the most in the story. The children practised collecting data as the story was read aloud. They collected their own data to analyse. Let’s have a look at the different pros and cons of student-collected data, provided real data, and high-quality fictional data.

Collecting data

When we want students to experience the process of collecting real data, they need to collect real data. However, real data collection is time consuming, and probably not necessary every year. Student data collection can be simulated by a program such as The Islands, which I wrote about previously. Data students collect themselves is much more likely to have errors in it, or be “dirty” (which is a good thing). When students are only given clean datasets, such as those usually provided with textbooks, they do not learn the skills of deciding what to do with an errant data point. Fictional databases can also have dirty data generated into them. The fictional inhabitants of The Islands sometimes lie, and often refuse to give consent for data collection on them.

Motivation

I have heard that after a few years of school, graphs about cereal preference, number of siblings and type of pet get a little old. These topics, relating to the students, are motivating at first, but often there is no purpose to the investigation other than to get data for a graph. Students need to move beyond their own experience and are keen to try something new. Data provided in a database can be motivating, if carefully chosen. There are opportunities to use databases that encourage awareness of social justice, the environment and politics. Fictional data must be motivating or there is no point!
We chose dragons as a topic for our first set of fictional data, as dragons are interesting to boys and girls of most ages.

A meaningful question

Here I refer again to that excellent article that talks about a driving question. There needs to be a reason for analysing the data. Maybe there is concern about food provided at the tuck shop, with healthy alternatives. Or can the question be tied into another area of the curriculum, such as which type of bean plant grows faster? Or can we increase the germination rate of seeds? The Census@school data has the potential for driving questions, but they probably need to be helped along. For existing datasets the driving question used by students might not be the same as the one (if any) driving the original collection of the data. Sometimes that is because the original purpose is not ‘motivating’ for the students or not at an appropriate level. If you can’t find or make up a motivating, meaningful question, the database is not appropriate. For our fictional dragon data, we have developed two scenarios – vaccinating for Pacific Draconian flu, and building shelters to make up for the deforestation of the island. With the vaccination scenario, we need to know about behaviour and size. For the shelter scenario we need to make decisions based on size, strength, behaviour and breath type. There is potential for a number of other scenarios that will also create driving questions.

Getting enough data

It can be difficult to get enough data for effects to show up. When students are limited to their class or family, this limits the number of observations. Only some databases have enough observations in them. There is no such problem with fictional databases, as you can just generate as much data as you need! There are special issues with regard to teaching about sampling, where you would want a large database with constrained access, like the Islands data, or the use of cards.
Variables

A problem with the data students collect is that it tends to be categorical, which limits the types of analysis that can be used. In databases, it can also be difficult to find measurement-level data. In our fictional dragon database, we have height, strength and age, which all take numerical values. There are also four categorical variables. The Islands database has a large number of variables, both categorical and numerical.

Interesting effects

Though it is good for students to understand that quite often there is no interesting effect, we would like students to have the satisfaction of finding interesting effects in the data, especially at the start. Interesting effects can be particularly exciting if the data is real, and they can apply their findings to the real-world context. Student-collected data is risky in terms of finding any noticeable relationships. It can be disappointing to do a long and involved study and find no effects. Databases from known studies can provide good effects, but unfortunately the variables with no effect tend to be left out of the databases, giving a false sense that there will always be effects. When we generate our fictional data, we make sure that the relationships we would like are there, with enough interaction and noise. This is a highly skilled process, honed by decades of making up data for student assessment at university. (Guilty admission.)

Ethics

There are ethical issues to be addressed in the collection of real data from people the students know. Informed consent should be granted, and there needs to be thorough vetting. Young students (and not so young) can be damagingly direct in their questions. You may need to explain that it can be upsetting for people to be asked if they have been beaten or bullied. When using fictional data that may appear real, such as the Islands data, it is important for students to be aware that the data is not real, even though it is based on real effects.
This was one of the reasons we chose to build our first database on dragons, as we hope that will remove any concerns about whether the data is real or not! The following table summarises the post.

| | Real data collected by the students | Real existing database | Fictional data (The Islands, Kiwi Kapers, Dragons, Desserts) |
|---|---|---|---|
| Data collection | Real experience | Nil | Sometimes |
| Dirty data | Always | Seldom | Can be controlled |
| Motivating | Can be | Can be | Must be! |
| Enough data | Time consuming, difficult | Hard to find | Always |
| Meaningful question | Sometimes; can be trivial | Can be difficult | Part of the fictional scenario |
| Variables | Tend towards nominal | Often too few variables | Generate as needed |
| Ethical issues | Often | Usually fine | Need to manage reality |
| Effects | Unpredictable | Can be obvious or trivial, or difficult | Can be managed |
Do you provide attribution for images in your lectures and presentations? If you don’t, here are some reasons why you should. They say a picture is worth a thousand words. Apparently that’s not enough for a citation. As a PhD student, I see PowerPoint-type presentations regularly – in departmental seminars, in lab meetings, in classes I’m a TA for, and in classes in which I am a student. These presentations typically include beautiful photographic illustrations, usually of wildlife. I’ve noticed recently that, whether the presenter is a professor, postdoc, graduate or undergraduate student, the vast majority of the time an image is shown on the screen there is no credit given to the photographer (this and all that follows applies equally to drawings and other illustrations, but I’ll use photography as an example here). I see this happen again and again, and it makes me angry. Here’s why. A good wildlife photograph costs something to produce, including skill, time, and money. It’s increasingly easy to take a decent photo with an iPhone or other cell phone camera. But the beautiful wildlife photographs scientists are downloading from the internet to illustrate their presentations are not typically casual iPhone snaps. They are more often the result of hours of hard work by a talented photographer who shot hundreds of frames before capturing just the right moment. Their equipment likely cost hundreds, if not thousands, of dollars to purchase. They may have spent more money to get to a remote location to find their subject. Maybe they get paid to do this work, but often they do not. A good wildlife photograph is rich with data. At a minimum it provides information about the morphology of the organism it depicts, in a much more meaningful way than words alone ever could. But it might show an animal engaged in a specific behaviour, or illustrate some kind of interaction between organisms, or otherwise tell a story about natural history.
At the same time, it may provide information about the context in which a behaviour occurs, perhaps showing the abiotic environment of the organism, or other members of its community. A photographic illustration of my study organism is, to me, a valuable thing. When I use one in a presentation, I think it is worth writing the name of the photographer under the image. It costs me almost nothing, except a bit of room on the slide or on the photograph itself. If done with a bit of care, the photo credit need not distract the viewer. But it gives them the opportunity to take note of the creator of the image, if they are so inclined, and look them up later on. If the photographer is a professional (or even if not) and this leads to a sale, that little photo credit might be worth quite a lot. But that is not the main reason to provide attribution. A photograph is someone else’s work, and in science, when we use someone else’s work, we cite it appropriately. I do not understand why many scientists don’t make the small effort to include photo credits in their presentations. I have a few non-mutually exclusive hypotheses (if you can think of others, please do let me know):

1. They are lazy. They can’t be bothered to look up who made the image and write down their name.
2. They have aesthetic objections to adding photo credits on the grounds that they add “clutter” to slides (this ought to apply to figure citations too, though).
3. They are unaware of copyright law*, how Creative Commons image licenses work, and that using other people’s work without attribution is a form of plagiarism. (*Occasionally I see presentations full of non-credited images that slap a “copyright Dr. X” on the bottom of every slide, and I’m not sure whether this constitutes evidence for or against this hypothesis.)
4. They are aware of the above, but they don’t think a photograph is worth citing.
At this point, I should probably say a bit about copyright and image licenses, in case the first part of hypothesis 3 is correct. If you want to use an image that’s under copyright in a presentation (and unless otherwise indicated, it’s safest to assume any image you find online is under copyright), the best way to avoid infringement is to contact the copyright owner (usually the creator) directly and ask permission. In my experience, they usually say yes, as long as you provide proper attribution. However, the presentations I am talking about are typically for the purpose of education – lectures, class presentations, lab meetings, or departmental seminars. In Canada and the US, it’s probably not copyright infringement to use an image without prior permission if the use is educational (this falls under the “fair dealing” exception in Canadian copyright law and “fair use” in the US, but it’s important to note that not all use of copyrighted materials for educational purposes is necessarily fair use – copyright law is complex). Anyway, if you are using an image without attribution for educational purposes, you will not likely be found guilty of copyright infringement, provided it was actually fair use. With Creative Commons licensed images (images you find on Wikipedia are typically in this category, or public domain), things are a bit different. These images are under copyright, but depending on the license, they may be used without permission under certain circumstances. They may require that use be non-commercial, or that the image is not altered in any way, and they all have explicit requirements about attribution. Most of the time I’ve seen folks using Creative Commons images, they are violating the terms of the license by not providing proper attribution. Again, you’re very unlikely to be taken to court for such a violation in a lecture or lab meeting talk, but that doesn’t mean it’s not wrong.
Here’s a nice guide from a Canadian university on how to appropriately cite public domain and Creative Commons images. So, in general there are no actual consequences for not abiding by copyright laws when you are using images for educational purposes. There is little risk that the copyright owner will ever see your presentation, let alone take you to court over it. But that’s not really the point. In writing, when you present ideas that are not your own, you provide attribution with a citation. In presentations, people often include figures that they did not produce themselves, typically from published papers. These are almost invariably accompanied by a citation. It would be wrong (it would be plagiarism) to present someone else’s figure as if it were your own work. An image is a figure too, containing visual, rather than numerical, data. It is also wrong (it is also plagiarism) to present someone else’s image as if it were your own work. And if you are giving a talk, I would argue that everything in it should be your own work unless otherwise indicated. Perhaps at this point some readers may be thinking something along the lines of, “well, everyone takes images off the Internet for their presentations, and everyone understands that those images are not the work of the presenter – they are just illustrations.” If this is your position, I urge you to reconsider. What makes a data figure worth more than an image, such that the former deserves attribution but the latter does not? Does your institution’s academic honesty policy make exemptions for plagiarism of images because everyone does it? I’m honestly interested in what arguments there might be against providing credit for photographs – I can’t think of any good ones myself. If none of the above is a convincing argument that it’s a good idea to provide attribution for images in talks, I hope this final argument will be.
In my experience, instructors and TAs expect undergraduates to provide attribution for work that is not their own in both writing and presentations. If they do not, it’s considered academic dishonesty, and (at least in theory) has severe consequences. I regularly hear instructors and TAs bemoaning the fact that students do not take plagiarism and academic honesty seriously. Some spend time explicitly setting out the rules for using other people’s work in writing and in presentations (I show my students how to find and properly use Creative Commons licensed images before they give presentations, and I plan to write a post about how to do this soon), and even then some students apparently do not get the message. But consider the example being set by the instructors and TAs who regularly present slides using images without attribution. How can we expect students to take our messages about plagiarism and attribution seriously if they see that their instructors don’t bother with attribution in their own presentations? To sum up, if you aren’t already, please provide credit for images (and any figures) you use in your presentations. Here are three good reasons:

1. The photographer who created the image (or the scientist who created the figure) deserves credit for their work.
2. Not doing so constitutes plagiarism, if not copyright infringement.
3. Doing so sets a good example for students.
(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers.) As I’ve mentioned here a couple of times, I am in the midst of writing a book, From Linear Models to Machine Learning: Regression and Classification, with Examples in R. As has been my practice with past books, I have now placed a 50% rough draft of the book on the Web. You will see even from this partial version that I take a very different approach than do the many existing books on the subject. Hopefully you will agree that it is not just different, but better. The book draws a lot on my regtools package of R tools. Suggestions, praise, criticisms etc. would be highly appreciated.
A new server plugin provides quick browsing of a QGIS project and an OpenLayers-based layers preview. Today I’m releasing the first version of the QGIS Server Simple Browser Plugin, a simple Server plugin that generates a browsable table of contents of the project’s layers and a link to an OpenLayers map.

How it works

The plugin adds an XSL stylesheet to the GetProjectSettings XML response; the generated HTML looks like this:

The openlayers format

The map preview is generated by adding a new application/openlayers FORMAT option to GetMap requests. The generated map automatically fits to the layer’s extent and has basic GetFeatureInfo capabilities.

Limitations

The current version only supports EPSG:4326, which must be available (enabled) on the server.

Source code and download

The plugin is available on the official repository: ServerSimpleBrowser. The code is on GitHub.
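As a rough sketch of what such a preview request might look like: the standard WMS GetMap parameters plus the plugin’s new FORMAT value. The server URL, project path, and layer name below are hypothetical placeholders, not values from the plugin’s documentation.

```python
from urllib.parse import urlencode

# Hypothetical QGIS Server endpoint; MAP points at a project file on the server.
base = "http://example.com/cgi-bin/qgis_mapserv.fcgi"
params = {
    "MAP": "/data/project.qgs",          # hypothetical project path
    "SERVICE": "WMS",
    "VERSION": "1.3.0",
    "REQUEST": "GetMap",
    "LAYERS": "rivers",                  # hypothetical layer name
    "CRS": "EPSG:4326",                  # the only CRS the current version supports
    "FORMAT": "application/openlayers",  # the new preview format added by the plugin
}
url = base + "?" + urlencode(params)
print(url)
```

Requesting this URL in a browser would return the OpenLayers preview page rather than a rendered image.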
This post arose from a recent meeting at the Royal Society. It was organised by Julie Maxton to discuss the application of statistical methods to legal problems. I found myself sitting next to an Appeal Court Judge who wanted more explanation of the ideas. Here it is.

Some preliminaries

The paper that I wrote recently was a review of the problems associated with the interpretation of screening tests and tests of significance. It doesn’t allude to legal problems explicitly, though the problems are the same in principle. It is open access, at http://rsos.royalsocietypublishing.org/content/1/3/140216

In that paper I was interested in the false positive rate (also known as the false discovery rate) in tests of significance. It turned out to be alarmingly large. That has serious consequences for the credibility of the scientific literature. In legal terms, the false positive rate means the proportion of cases in which, on the basis of the evidence, a suspect is found guilty when in fact they are innocent. That has even more serious consequences.

Although most of what I want to say can be said without much algebra, it would perhaps be worth getting two things clear before we start.

(1) The rules of probability. To get any understanding, it’s essential to understand the rules of probabilities and, in particular, the idea of conditional probabilities. One source would be my old book, Lectures on Biostatistics (now free); the account on pages 19 to 24 gives a pretty simple (I hope) description of what’s needed. Briefly, a vertical line is read as “given”, so Prob(evidence | not guilty) means the probability that the evidence would be observed given that the suspect was not guilty.

(2) Another potential confusion in this area is the relationship between odds and probability. The relationship between the probability of an event occurring and the odds on the event can be illustrated by an example.
If the probability of being right-handed is 0.9, then the probability of not being right-handed is 0.1. That means that 9 people out of 10 are right-handed, and one person in 10 is not. In other words, for every person who is not right-handed there are 9 who are right-handed. Thus the odds that a randomly selected person is right-handed are 9 to 1. In symbols this can be written

\[ \mathrm{probability = \frac{odds}{1 + odds}} \]

In the example, the odds on being right-handed are 9 to 1, so the probability of being right-handed is 9 / (1 + 9) = 0.9. Conversely,

\[ \mathrm{odds = \frac{probability}{1 - probability}} \]

In the example, the probability of being right-handed is 0.9, so the odds of being right-handed are 0.9 / (1 − 0.9) = 0.9 / 0.1 = 9 (to 1). With these preliminaries out of the way, we can proceed to the problem.

THE LEGAL PROBLEM

The first problem lies in the fact that the answer depends on Bayes’ theorem. Although that was published in 1763, statisticians are still arguing about how it should be used to this day. In fact, whenever it’s mentioned, statisticians tend to revert to internecine warfare, and forget about the user. Bayes’ theorem can be stated in words as follows:

\[ \text{posterior odds ratio} = \text{prior odds ratio} \times \text{likelihood ratio} \]

“Posterior odds ratio” means the odds that the person is guilty, relative to the odds that they are innocent, in the light of the evidence, and that’s clearly what one wants to know. The “prior odds” are the odds that the person was guilty before any evidence was produced, and that is the really contentious bit. Sometimes the need to specify the prior odds has been circumvented by using the likelihood ratio alone, but, as shown below, that isn’t a good solution. The analogy with the use of screening tests to detect disease is illuminating.

Screening tests

A particularly straightforward application of Bayes’ theorem is in screening people to see whether or not they have a disease.
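The two conversions between odds and probability are easy to check numerically; here is a minimal sketch (the function names are mine, not from the post):

```python
def odds_to_prob(odds):
    """Convert odds (e.g. 9, meaning 9 to 1) to a probability."""
    return odds / (1 + odds)

def prob_to_odds(prob):
    """Convert a probability to odds (to 1)."""
    return prob / (1 - prob)

# The right-handedness example: odds of 9 to 1 <-> probability 0.9.
print(odds_to_prob(9))    # 0.9
print(prob_to_odds(0.9))  # ~9 (floating-point rounding may show 9.000000000000002)
```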
It turns out, in many cases, that screening gives a lot more wrong results (false positives) than right ones. That’s especially true when the condition is rare (the prior odds that an individual suffers from the condition are small). The process of screening for disease has a lot in common with the screening of suspects for guilt. It matters because false positives in court are disastrous. The screening problem is dealt with in sections 1 and 2 of my paper, or on this blog (and here). A bit of animation helps the slides, so you may prefer the YouTube version (it deals with screening tests up to 8’45”). The rest of my paper applies similar ideas to tests of significance. In that case the prior probability is the probability that there is in fact a real effect, or, in the legal case, the probability that the suspect is guilty before any evidence has been presented. This is the slippery bit of the problem, both conceptually and because it’s hard to put a number on it. But the examples below show that to ignore it, and to use the likelihood ratio alone, could result in many miscarriages of justice. In the discussion of tests of significance, I took the view that it is not legitimate (in the absence of good data to the contrary) to assume any prior probability greater than 0.5. To do so would presume you know the answer before any evidence was presented. In the legal case a prior probability of 0.5 would mean assuming that there was a 50:50 chance that the suspect was guilty before any evidence was presented. A 50:50 probability of guilt before the evidence is known corresponds to a prior odds ratio of 1 (to 1). If that were true, the likelihood ratio would be a good way to represent the evidence, because the posterior odds ratio would be equal to the likelihood ratio.
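To see why rarity produces so many false positives in screening, here is a small numerical sketch. The prevalence, sensitivity and specificity below are illustrative numbers of my own, not figures from the paper:

```python
# Illustrative screening numbers (my own assumptions, not from the paper):
prevalence = 0.01    # 1% of those screened actually have the condition
sensitivity = 0.90   # Prob(test positive | has condition)
specificity = 0.95   # Prob(test negative | no condition)

true_pos = sensitivity * prevalence
false_pos = (1 - specificity) * (1 - prevalence)

# Probability that a positive result is a true positive (Bayes' theorem):
ppv = true_pos / (true_pos + false_pos)
print(round(ppv, 3))  # ~0.154: most positive results are false positives
```

Even with a fairly accurate test, a positive result here is wrong about 85% of the time, simply because the condition is rare. The same arithmetic applies to a suspect drawn from a population in which guilt is rare.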
It could be argued that 50:50 represents some sort of equipoise, but in the example below it is clearly too high, and if it is less than 50:50, use of the likelihood ratio runs a real risk of convicting an innocent person. The following example is modified slightly from section 3 of a book chapter by Mortera and Dawid (2008). Philip Dawid is an eminent statistician who has written a lot about probability and the law, and he’s a member of the legal group of the Royal Statistical Society. My version of the example removes most of the algebra and uses different numbers.

Example: The island problem

The “island problem” (Eggleston 1983, Appendix 3) is an imaginary example that provides a good illustration of the uses and misuses of statistical logic in forensic identification. A murder has been committed on an island, cut off from the outside world, on which 1001 (= N + 1) inhabitants remain. The forensic evidence at the scene consists of a measurement, x, on a “crime trace” characteristic, which can be assumed to come from the criminal. It might, for example, be a bit of the DNA sequence from the crime scene. Say, for the sake of example, that the probability of a random member of the population having characteristic x is P = 0.004 (i.e. 0.4%), so the probability that a random member of the population does not have the characteristic is 1 − P = 0.996. The mainland police arrive and arrest a random islander, Jack. It is found that Jack matches the crime trace. There is no other relevant evidence. How should this match evidence be used to assess the claim that Jack is the murderer? We shall consider three arguments that have been used to address this question. The first is wrong; the second and third are right. (For illustration, we have taken N = 1000, P = 0.004.)
(1) Prosecutor’s fallacy

Prosecuting counsel, arguing according to his favourite fallacy, asserts that the probability that Jack is guilty is 1 − P, or 0.996, and that this proves guilt “beyond a reasonable doubt”. The probability that Jack would show characteristic x if he were not guilty is 0.4%, i.e. Prob(Jack has x | not guilty) = 0.004. The fallacy is to treat this as if it were Prob(not guilty | Jack has x), so that the probability of guilt appears to be 1 − 0.004 = 0.996. But Prob(evidence | not guilty) is not what we want. What we need is the probability that Jack is guilty, given the evidence, Prob(Jack is guilty | Jack has characteristic x). To mistake the former for the latter is the prosecutor’s fallacy, or the error of the transposed conditional. Dawid gives an example that makes the distinction clear: “As an analogy to help clarify and escape this common and seductive confusion, consider the difference between ‘the probability of having spots, if you have measles’, which is close to 1, and ‘the probability of having measles, if you have spots’, which, in the light of the many alternative possible explanations for spots, is much smaller.”

(2) Defence counter-argument

Counsel for the defence points out that, while the guilty party must have characteristic x, he isn’t the only person on the island to have this characteristic. Among the remaining N = 1000 innocent islanders, 0.4% have characteristic x, so the number who have it will be NP = 1000 × 0.004 = 4. Hence the total number of islanders that have this characteristic must be 1 + NP = 5. The match evidence means that Jack must be one of these 5 people, but does not otherwise distinguish him from any of the other members of the group. Since just one of these is guilty, the probability that this is Jack is thus 1/5, or 0.2 – very far from being “beyond all reasonable doubt”.
(3) Bayesian argument

The probability of having characteristic x (the evidence) would be Prob(evidence | guilty) = 1 if Jack were guilty, but if Jack were not guilty it would be 0.4%, i.e. Prob(evidence | not guilty) = P. Hence the likelihood ratio in favour of guilt, on the basis of the evidence, is

\[ LR = \frac{\text{Prob(evidence} \mid \text{guilty)}}{\text{Prob(evidence} \mid \text{not guilty)}} = \frac{1}{P} = 250 \]

In words, the evidence would be 250 times more probable if Jack were guilty than if he were innocent. While this seems strong evidence in favour of guilt, it still does not tell us what we want to know, namely the probability that Jack is guilty in the light of the evidence: Prob(guilty | evidence), or, equivalently, the odds ratio – the odds of guilt relative to the odds of innocence, given the evidence. To get that we must multiply the likelihood ratio by the prior odds on guilt, i.e. the odds on guilt before any evidence is presented. It’s often hard to get a numerical value for this, but in our artificial example it is possible. We can argue that, in the absence of any other evidence, Jack is no more nor less likely to be the culprit than any other islander, so the prior probability of guilt is 1/(N + 1), corresponding to prior odds on guilt of 1/N. We can now apply Bayes’ theorem to obtain the posterior odds on guilt:

\[ \text{posterior odds} = \text{prior odds} \times LR = \left( \frac{1}{N} \right) \times \left( \frac{1}{P} \right) = 0.25 \]

Thus the odds of guilt in the light of the evidence are 4 to 1 against. The corresponding posterior probability of guilt is

\[ \text{Prob(guilty} \mid \text{evidence)} = \frac{1}{1 + NP} = \frac{1}{1 + 4} = 0.2 \]

This is quite small – certainly no basis for a conviction. This result is exactly the same as that given by the defence counter-argument (see above). That argument was simpler than the Bayesian argument; it didn’t explicitly use Bayes’ theorem, though it was implicit in the argument.
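The three arguments about the island problem come down to a few lines of arithmetic, which can be checked directly (a sketch; the variable names are mine):

```python
N = 1000   # innocent islanders besides the suspect
P = 0.004  # Prob(a random person has characteristic x)

# Prosecutor's fallacy: transposes the conditional.
prosecutor = 1 - P                             # 0.996 -- not Prob(guilty | evidence)

# Defence counter-argument: Jack is one of 1 + N*P matching islanders.
defence = 1 / (1 + N * P)                      # 1/5 = 0.2

# Bayesian argument: prior odds 1/N times likelihood ratio 1/P.
likelihood_ratio = 1 / P                       # 250
posterior_odds = (1 / N) * likelihood_ratio    # 0.25, i.e. 4 to 1 against guilt
bayes = posterior_odds / (1 + posterior_odds)  # 0.2, same as the defence argument

print(prosecutor, defence, bayes)
```

The defence and Bayesian routes give identical answers, 0.2, while the prosecutor’s figure of 0.996 answers the wrong question entirely.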
The advantage of the defence counterargument is that it looks simpler. The advantage of the explicitly Bayesian argument is that it makes the assumptions clearer. In summary The prosecutor’s fallacy suggested, quite wrongly, that the probability that Jack was guilty was 0.996. The likelihood ratio was 250, which also seems to suggest guilt, but it doesn’t give us the probability that we need. In stark contrast, the defence counsel’s argument, and equivalently the Bayesian argument, put the probability of Jack’s guilt at 0.2, or odds of 4 to 1 against guilt. The potential for wrongful conviction is obvious. Conclusions Although this argument uses an artificial example that is simpler than most real cases, it illustrates some important principles. (1) The likelihood ratio is not a good way to evaluate evidence, unless there is good reason to believe that there is a 50:50 chance that the suspect is guilty before any evidence is presented. (2) In order to calculate what we need, Prob(guilty | evidence), we need a numerical value for how common the possession of characteristic x (the evidence) is in the whole population of possible suspects (a reasonable value might be estimated in the case of DNA evidence). We also need to know the size of that population. In the case of the island example, this was 1000, but in general that would be hard to establish, and any answer might well be contested by an advocate who understood the problem. These arguments lead to four conclusions. (1) If a lawyer uses the prosecutor’s fallacy, (s)he should be told that it’s nonsense. (2) If a lawyer advocates conviction on the basis of the likelihood ratio alone, (s)he should be asked to justify the implicit assumption that there was a 50:50 chance that the suspect was guilty before any evidence was presented.
(3) If a lawyer uses the defence counterargument, or, equivalently, the version of the Bayesian argument given here, (s)he should be asked to justify the estimates of the numerical value given to the prevalence of x in the population (P) and the numerical value of the size of this population (N). A range of values of P and N could be used, to provide a range of possible values of the final result, the probability that the suspect is guilty in the light of the evidence. (4) The example that was used is the simplest possible case. For more complex cases it would be advisable to ask a professional statistician. Some reliable people can be found at the Royal Statistical Society’s section on Statistics and the Law. If you do ask a professional statistician, and they present you with a lot of mathematics, you should still ask these questions about precisely what assumptions were made, and ask for an estimate of the range of uncertainty in the value of Prob(guilty | evidence) which they produce. Postscript: real cases Another paper by Philip Dawid, Statistics and the Law, is interesting because it discusses some recent real cases: for example the wrongful conviction of Sally Clark because of the wrong calculation of the statistics for Sudden Infant Death Syndrome. On Monday 21 March, 2016, Dr Waney Squier was struck off the medical register by the General Medical Council because they claimed that she misrepresented the evidence in cases of Shaken Baby Syndrome (SBS). This verdict was questioned by many lawyers, including Michael Mansfield QC and Clive Stafford Smith, in a letter, “General Medical Council behaving like a modern inquisition”. The latter has already written “This shaken baby syndrome case is a dark day for science – and for justice”. The evidence for SBS is based on the existence of a triad of signs (retinal bleeding, subdural bleeding and encephalopathy). It seems likely that these signs will be present if a baby has been shaken, i.e. Prob(triad | shaken) is high.
But this is irrelevant to the question of guilt. For that we need Prob(shaken | triad). As far as I know, the data to calculate what matters are just not available. It seems that the GMC may have fallen for the prosecutor’s fallacy. Or perhaps the establishment won’t tolerate arguments. One is reminded, once again, of the definition of clinical experience: "Making the same mistakes with increasing confidence over an impressive number of years." (from A Sceptic’s Medical Dictionary by Michael O’Donnell, BMJ Publishing, 1997). Appendix (for nerds): two forms of Bayes’ theorem The form of Bayes’ theorem given at the start is expressed in terms of odds ratios. The same rule can be written in terms of probabilities. (This was the form used in the appendix of my paper.) For those interested in the details, it may help to define explicitly these two forms. In terms of probabilities, the probability of guilt in the light of the evidence (what we want) is \[ \text{Prob(guilty} \mid \text{evidence}) = \text{Prob(evidence} \mid \text{guilty}) \frac{\text{Prob(guilty)}}{\text{Prob(evidence)}} \] In terms of odds ratios, the odds on guilt, given the evidence (which is what we want), is \[ \frac{ \text{Prob(guilty} \mid \text{evidence})} {\text{Prob(not guilty} \mid \text{evidence})} = \left ( \frac{ \text{Prob(guilty)}} {\text{Prob(not guilty)}} \right ) \left ( \frac{ \text{Prob(evidence} \mid \text{guilty})} {\text{Prob(evidence} \mid \text{not guilty})} \right ) \] or, in words, \[ \text{posterior odds of guilt} = \text{prior odds of guilt} \times \text{likelihood ratio} \] This is the precise form of the equation that was given in words at the beginning. A derivation of the equivalence of these two forms is sketched in a document which you can download. FOLLOWUP 23 March 2016 It’s worth pointing out the following connection between the legal argument (above) and tests of significance.
(1) The likelihood ratio works only when there is a 50:50 chance that the suspect is guilty before any evidence is presented (so the prior probability of guilt is 0.5, or, equivalently, the prior odds ratio is 1). (2) The false positive rate in significance testing is close to the P value only when the prior probability of a real effect is 0.5, as shown in section 6 of the P value paper. However there is another twist in the significance testing argument. The statement above is right if we take as a positive result any P < 0.05. If we want to interpret a value of P = 0.047 in a single test, then, as explained in section 10 of the P value paper, we should restrict attention to only those tests that give P close to 0.047. When that is done the false positive rate is 26% even when the prior is 0.5 (and much bigger than 30% if the prior is smaller – see extra Figure). That justifies the assertion that if you claim to have discovered something because you have observed P = 0.047 in a single test, then there is a chance of at least 30% that you’ll be wrong. Is there, I wonder, any legal equivalent of this argument?
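Point (2) above is easy to illustrate with a few lines of R. This is a sketch of the standard false-positive-rate arithmetic only; the threshold of 0.05 and the power of 0.8 are my assumed values, not numbers from the paper:

```r
# Fraction of "positive" (P < alpha) results that are false positives,
# as a function of the prior probability that a real effect exists
false_positive_rate <- function(prior, alpha = 0.05, power = 0.8) {
  false_pos <- alpha * (1 - prior)   # true nulls wrongly declared positive
  true_pos  <- power * prior         # real effects correctly detected
  false_pos / (false_pos + true_pos)
}

false_positive_rate(prior = 0.5)   # about 0.059: close to alpha when the prior is 0.5
false_positive_rate(prior = 0.1)   # about 0.36: far larger for implausible hypotheses
```

Note this is the simple version, counting any P < 0.05 as positive; the P = 0.047 argument described above gives a higher rate still.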
(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers) After releasing my course Mapmaking in R with Choroplethr last December I received an interesting email from Dr. Herb Wilson, a biologist at Colby College. Herb studies the Red-breasted Nuthatch, which lives throughout the United States and Canada. He asked if it was possible to use choroplethr to map the location of these birds. Herb’s data was a list of (region, value) pairs, where the regions are US States and Canadian Provinces. At that time it was possible to use the function ?admin1_choropleth to map the birds in US States or Canadian Provinces, but not both simultaneously. So I created a new function for him, ?admin1_region_choropleth, which solves this exact problem. The code is now on CRAN. Below is a tutorial on how to use it. The Code To get the latest version of choroplethr, simply install it from CRAN and check its version:

install.packages("choroplethr")
packageVersion("choroplethr")
[1] ‘3.5.0’

If you see a lower version (for example 3.4.0), then you need to wait a day or two until your CRAN mirror updates. The new function is called admin1_region_choropleth. You can see its built-in help like this:

?admin1_region_choropleth

The Data The bird count data comes from the annual Christmas Bird Count run by the National Audubon Society. I have put it in a github repository which you can access here.
Once you download the file you can read it in like this:

library(readr)
rbn = read_csv("~/Downloads/rbn.csv")
head(rbn)

Source: local data frame [6 x 8]

  State AdminRegion Count_yr SpeciesNumber NumberByPartyHours  Year ReportingCounts ReportingObservers
  (chr)       (chr)    (int)         (int)              (dbl) (int)           (int)              (int)
1    AK      alaska       63             1             0.0128  1962               1                  1
2    AK      alaska       64             2             0.0233  1963               1                  2
3    AK      alaska       70             6             0.0513  1969               2                  8
4    AK      alaska       71             4             0.0313  1970               1                  7
5    AK      alaska       72             2             0.0187  1971               2                 18
6    AK      alaska       73             3             0.0328  1972               2                 13

If we wanted to map the NumberByPartyHours in 2013 we could start like this:

library(dplyr)
rbn2013 = rbn %>%
  rename(region = AdminRegion, value = NumberByPartyHours) %>%
  filter(Year == 2013 & !region %in% c("northwest territories", "alaska"))

We rename the columns to region and value because choroplethr requires columns with those names. We filter out Alaska and the Northwest Territories because they visually throw off the map a bit, and might look nicer as insets. Making the Map To make the map, simply pass the new data frame to the function admin1_region_choropleth:

library(choroplethr)
library(ggplot2)
admin1_region_choropleth(rbn2013,
                         title = "2013 Red-breasted Nuthatch Sightings",
                         legend = "Sightings By Party Hours") + coord_map()

And now the visual pattern of the data is clear. Within North America, the Red-breasted Nuthatch has been seen mostly in the northwest and northeast. Updating the Course This is the third update I’ve made to choroplethr since launching my course Mapmaking in R with Choroplethr last December. (You can learn about the other updates here and here.) My plan is to update the course with video lectures that demonstrate this material soon. But that will probably have to wait until I finish production of my next course, which I hope to announce soon. The post Mapping Birds with Choroplethr appeared first on AriLamstein.com. To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.
R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more... If you got this far, why not subscribe for updates from the site? Choose your flavor: email, twitter, RSS, or facebook...
(This article was first published on R – R-statistics blog, and kindly contributed to R-bloggers) R 3.2.4 (codename “Very Secure Dishes”) was released today. You can get the latest binary versions from here (or the .tar.gz source code from here). The full list of new features and bug fixes is provided below. Upgrading to R 3.2.4 on Windows If you are using Windows you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

install.packages("installr") # install
setInternet2(TRUE)
installr::updateR() # updating R.

Running “updateR()” will detect if there is a new R version available, and if so it will download+install it (etc.). There is also a step by step tutorial (with screenshots) on how to upgrade R on Windows, using the installr package. I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package – you are invited to open an issue in the github page. NEW FEATURES install.packages() and related functions now give a more informative warning when an attempt is made to install a base package. summary(x) now prints with less rounding when x contains infinite values. (Request of PR#16620.) provideDimnames() gets an optional unique argument. shQuote() gains type = "cmd2" for quoting in cmd.exe in Windows. (Response to PR#16636.) The data.frame method of rbind() gains an optional argument stringsAsFactors (instead of only depending on getOption("stringsAsFactors")). smooth(x, *) now also works for long vectors. tools::texi2dvi() has a workaround for problems with the texi2dvi script supplied by texinfo 6.1. It extracts more error messages from the LaTeX logs when in emulation mode. UTILITIES R CMD check will leave a log file ‘build_vignettes.log’ from the rebuilding of vignettes in the ‘.Rcheck’ directory if there is a problem, and always if environment variable _R_CHECK_ALWAYS_LOG_VIGNETTE_OUTPUT_ is set to a true value.
DEPRECATED AND DEFUNCT Use of SUPPORT_OPENMP from header ‘Rconfig.h’ is deprecated in favour of the standard OpenMP define _OPENMP. (This has been the recommendation in the manual for a while now.) The make macro AWK which is long unused by R itself but recorded in file ‘etc/Makeconf’ is deprecated and will be removed in R 3.3.0. The C header file ‘S.h’ is no longer documented: its use should be replaced by ‘R.h’. BUG FIXES kmeans(x, centers = <1-row>) now works. (PR#16623) Vectorize() now checks for clashes in argument names. (PR#16577) file.copy(overwrite = FALSE) would signal a successful copy when none had taken place. (PR#16576) ngettext() now uses the same default domain as gettext(). (PR#14605) array(.., dimnames = *) now warns about non-list dimnames and, from R 3.3.0, will signal the same error for invalid dimnames as matrix() has always done. addmargins() now adds dimnames for the extended margins in all cases, as always documented. heatmap() evaluated its add.expr argument in the wrong environment. (PR#16583) require() etc now give the correct entry of lib.loc in the warning about an old version of a package masking a newer required one. The internal deparser did not add parentheses when necessary, e.g. before [] or [[]]. (Reported by Lukas Stadler; additional fixes included as well.) as.data.frame.vector(*, row.names=*) no longer produces ‘corrupted’ data frames from row names of incorrect length, but rather warns about them. This will become an error. url connections with method = "libcurl" are destroyed properly. (PR#16681) withCallingHandlers() now (again) handles warnings even during S4 generic’s argument evaluation. (PR#16111) deparse(..., control = "quoteExpressions") incorrectly quoted empty expressions. (PR#16686) format()ting date-time objects ("POSIX[cl]?t") could segfault or recycle wrongly. (PR#16685) plot.ts(, las = 1) now does use las. saveRDS(*, compress = "gzip") now works as documented.
(PR#16653) (Windows only) The Rgui front end did not always initialize the console properly, and could cause R to crash. (PR#16998) dummy.coef.lm() now works in more cases, thanks to a proposal by Werner Stahel (PR#16665). In addition, it now works for multivariate linear models ("mlm", manova) thanks to a proposal by Daniel Wollschlaeger. The as.hclust() method for "dendrogram"s failed often when there were ties in the heights. reorder() and midcache.dendrogram() now are non-recursive and hence applicable to somewhat deeply nested dendrograms, thanks to a proposal by Suharto Anggono in PR#16424. cor.test() now calculates very small p values more accurately (affecting the result only in extreme, not statistically relevant, cases). (PR#16704) smooth(*, do.ends=TRUE) did not always work correctly in R versions between 3.0.0 and 3.2.3. pretty(D) for date-time objects D now also works well if range(D) is (much) smaller than a second. In the case of only one unique value in D, the pretty range now is more symmetric around that value than previously. Similarly, pretty(dt) no longer returns a length 5 vector with duplicated entries for Date objects dt which span only a few days. The figures in help pages such as ?points were accidentally damaged, and did not appear in R 3.2.3. (PR#16708) available.packages() sometimes deleted the wrong file when cleaning up temporary files. (PR#16712) The X11() device sometimes froze on Red Hat Enterprise Linux 6. It now waits for MapNotify events instead of Expose events, thanks to Siteshwar Vashisht. (PR#16497) [dpqr]nbinom(*, size=Inf, mu=.) now works as limit case, for ‘dpq’ as the Poisson. (PR#16727) pnbinom() no longer loops infinitely in border cases. approxfun(*, method="constant") and hence ecdf(), which calls the former, now correctly “predict” NaN values as NaN. summary.data.frame() now displays NAs in Date columns in all cases.
(PR#16709) To leave a comment for the author, please follow the link and comment on their blog: R – R-statistics blog.
(This article was first published on DataScience+, and kindly contributed to R-bloggers) Drawing a line through a cloud of points (i.e. doing a linear regression) is the most basic analysis one may do. Sometimes it fits the data well, but in some (many) situations the relationships between variables are not linear. In this case one may follow three different ways: (i) try to linearize the relationship by transforming the data, (ii) fit polynomial or complex spline models to the data or (iii) fit nonlinear functions to the data. As you may have guessed from the title, this post will be dedicated to the third option. What is nonlinear regression? In nonlinear regression the analyst specifies a function with a set of parameters to fit to the data. The most basic way to estimate such parameters is to use a nonlinear least squares approach (function nls in R), which basically approximates the nonlinear function using a linear one and iteratively tries to find the best parameter values (wiki). A nice feature of nonlinear regression in an applied context is that the estimated parameters have a clear interpretation (Vmax in a Michaelis-Menten model is the maximum rate), which would be harder to get using linear models on transformed data, for example. Fit nonlinear least squares First example using the Michaelis-Menten equation:

#simulate some data
set.seed(20160227)
x <- seq(0, 50, 1)
y <- ((runif(1, 10, 20) * x) / (runif(1, 0, 10) + x)) + rnorm(51, 0, 1)
#for simple models nls finds good starting values for the parameters even if it throws a warning
m <- nls(y ~ a * x / (b + x))
#get some estimation of goodness of fit
cor(y, predict(m))
[1] 0.9496598
#plot
plot(x, y)
lines(x, predict(m), lty = 2, col = "red", lwd = 3)

Here is the plot: The issue of finding the right starting values Finding good starting values is very important in nonlinear regression to allow the model algorithm to converge.
If you set starting parameter values completely outside of the range of potential parameter values the algorithm will either fail or return nonsensical parameters, like for example returning a growth rate of 1000 when the actual value is 1.04. The best way to find correct starting values is to “eyeball” the data: plot them and, based on the understanding that you have of the equation, find approximate starting values for the parameters.

#simulate some data, this time without a priori knowledge of the parameter values
y <- runif(1, 5, 15) * exp(-runif(1, 0.01, 0.05) * x) + rnorm(51, 0, 0.5)
#visually estimate some starting parameter values
plot(x, y)
#from this graph set approximate starting values
a_start <- 8 #param a is the y value when x=0
b_start <- 2 * log(2) / a_start #b is the decay rate
#model
m <- nls(y ~ a * exp(-b * x), start = list(a = a_start, b = b_start))
#get some estimation of goodness of fit
cor(y, predict(m))
[1] 0.9811831
#plot the fit
lines(x, predict(m), col = "red", lty = 2, lwd = 3)

Here is the plot: Using self-starting functions It is very common for different scientific fields to use different parametrizations (i.e. different equations) for the same model; one example is the logistic population growth model, which in ecology we use in the following form: $$ N_{t} = \frac{K N_{0} e^{r t}}{K + N_{0} (e^{r t} - 1)} $$ with \(N_{t}\) being the number of individuals at time \(t\), \(r\) being the population growth rate and \(K\) the carrying capacity.
We can rewrite this as a differential equation: $$ dN/dt = R N (1 - N/K) $$

library(deSolve)
#simulating some population growth from the logistic equation and estimating the parameters using nls
log_growth <- function(Time, State, Pars) {
  with(as.list(c(State, Pars)), {
    dN <- R * N * (1 - N / K)
    return(list(c(dN)))
  })
}
#the parameters for the logistic growth
pars <- c(R = 0.2, K = 1000)
#the initial numbers
N_ini <- c(N = 1)
#the time steps to evaluate the ODE
times <- seq(0, 50, by = 1)
#the ODE
out <- ode(N_ini, times, log_growth, pars)
#add some random variation to it
N_obs <- out[, 2] + rnorm(51, 0, 50)
#numbers cannot go lower than 1
N_obs <- ifelse(N_obs < 1, 1, N_obs)
#plot
plot(times, N_obs)

This part was just to simulate some data with random error; now comes the tricky part: estimating the starting values. R has a built-in function to estimate starting values for the parameters of a logistic equation (SSlogis), but it uses the following equation: $$ N_{t} = \frac{\alpha}{1 + e^{(xmid - t)/scale}} $$

#find the parameters for the equation
SS <- getInitial(N_obs ~ SSlogis(times, alpha, xmid, scale), data = data.frame(N_obs = N_obs, times = times))

We use the function getInitial, which gives some initial guesses about the parameter values based on the data. We pass to this function a self-starting model (SSlogis) which takes as arguments an input vector (the t values where the function will be evaluated) and the unquoted names of the three parameters of the logistic equation. However, as SSlogis uses a different parametrization, we need a bit of algebra to go from the estimated self-starting values returned by SSlogis to the ones in the equation we want to use.
#we used a different parametrization
K_start <- SS["alpha"]
R_start <- 1 / SS["scale"]
N0_start <- SS["alpha"] / (exp(SS["xmid"] / SS["scale"]) + 1)
#the formula for the model
log_formula <- formula(N_obs ~ K * N0 * exp(R * times) / (K + N0 * (exp(R * times) - 1)))
#fit the model
m <- nls(log_formula, start = list(K = K_start, R = R_start, N0 = N0_start))
#estimated parameters
summary(m)

Formula: N_obs ~ K * N0 * exp(R * times)/(K + N0 * (exp(R * times) - 1))

Parameters:
    Estimate Std. Error t value Pr(>|t|)    
K  1.012e+03  3.446e+01  29.366   <2e-16 ***
R  2.010e-01  1.504e-02  13.360   <2e-16 ***
N0 9.600e-01  4.582e-01   2.095   0.0415 *  
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 49.01 on 48 degrees of freedom
Number of iterations to convergence: 1
Achieved convergence tolerance: 1.537e-06

#get some estimation of goodness of fit
cor(N_obs, predict(m))
[1] 0.9910316
#plot
lines(times, predict(m), col = "red", lty = 2, lwd = 3)

Here is the plot: That was a bit of a hassle to get from the SSlogis parametrization to our own, but it was worth it! In a next post we will see how to go beyond nonlinear least squares and embrace maximum likelihood estimation methods, which are way more powerful and reliable: they allow you to build any model that you can imagine. To leave a comment for the author, please follow the link and comment on their blog: DataScience+.
(This article was first published on The 20% Statistician, and kindly contributed to R-bloggers) I was reworking a lecture on confidence intervals I’ll be teaching, when I came across a perfect real-life example of a common error people make when interpreting confidence intervals. I hope everyone (Harvard Professors, Science editors, my bachelor students) will benefit from a clear explanation of this misinterpretation of confidence intervals. Let’s assume a Harvard professor and two Science editors make the following statement: If you take 100 original studies and replicate them, then “sampling error alone should cause 5% of the replication studies to “fail” by producing results that fall outside the 95% confidence interval of the original study.”* The formal meaning of a confidence interval is that 95% of the confidence intervals should, in the long run, contain the true population parameter. See Kristoffer Magnusson’s excellent visualization, where you can see how 95% of the confidence intervals include the true population value. Remember that confidence intervals are a statement about where future confidence intervals will fall. Single confidence intervals are not a statement about where the means of future samples will fall. The percentage of means in future samples that falls within a single confidence interval is called the capture percentage. The percentage of future means that fall within a single unbiased confidence interval depends upon which single confidence interval you happened to observe, but in the long run, a 95% CI has an 83.4% capture percentage (Cumming & Maillardet, 2006). In other words, in a large number of unbiased original studies, 16.6% (not 5%) of replication studies will observe a parameter estimate that falls outside of a single confidence interval. (Note that this percentage assumes an equal sample size in the original and replication study – if sample sizes differ, you would need to simulate the capture percentages for each study.)
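The 83.4% figure is easy to check yourself. Here is a minimal simulation sketch (my own code with IQ-style parameters, not the script from the post):

```r
set.seed(1)
n <- 20; mu <- 100; sigma <- 15; nsim <- 100000

# one 'original' study and its 95% confidence interval for the mean
orig <- rnorm(n, mu, sigma)
ci <- mean(orig) + c(-1, 1) * qt(0.975, n - 1) * sd(orig) / sqrt(n)

# means of many replication studies of the same sample size
rep_means <- replicate(nsim, mean(rnorm(n, mu, sigma)))

# capture percentage: % of future sample means inside this single CI
100 * mean(rep_means >= ci[1] & rep_means <= ci[2])
```

For any single original CI the result can be far above or below 83.4%; that value is what you get when averaging over many original studies.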
Let’s experience this through simulation. Run the entire R script available at the bottom of this post. This script will simulate a single sample with a true population mean of 100 and standard deviation of 15 (the mean and SD of an IQ test), and create a plot. Samples drawn from this true population will show variation, as you can see from the mean and standard deviation of the sample in the plot. The black dotted line illustrates the true mean of 100. The orange area illustrates the 95% confidence interval around the sample mean, and 95% of orange bars will contain the black dotted line. For example: The simulation also generates a large number of additional samples, after the initial one that was plotted. The simulation returns the number of confidence intervals from these simulations that contain the mean (which should be 95% in the long run). The simulation also returns the % of sample means from future studies that fall within the 95% CI of the original study. This is the capture percentage. It differs from (and is typically lower than) the confidence level. Q1: Run the simulations multiple times (the 100000 simulations take a few seconds). Look at the output you will get in the R console. For example: “95.077 % of the 95% confidence intervals contained the true mean” and “The capture percentage for the plotted study, or the % of values within the observed confidence interval from 88.17208 to 103.1506 is: 82.377 %”. While running the simulations multiple times, look at the confidence interval around the sample mean, and relate this to the capture percentage. Which statement is true? A) The further the sample mean in the original study is from the true population mean, the lower the capture percentage. B) The further the sample mean in the original study is from the true population mean, the higher the capture percentage. C) The wider the confidence interval around the mean, the higher the capture percentage.
D) The narrower the confidence interval around the mean, the higher the capture percentage. Q2: Simulations in R are randomly generated, but you can make a specific simulation reproducible by setting the seed of the random generation process. Copy-paste “set.seed(123456)” to the first line of the R script, and run the simulation. The sample mean should be 108 (see the picture below). This is a clear overestimate of the true population parameter. Indeed, just by chance, this simulation yielded a result that is significantly different from the null hypothesis (the mean IQ of 100), even though it is a Type 1 error. Such overestimates are common in a literature rife with publication bias. A recent large-scale replication project revealed that even for studies that replicated (according to a p < 0.05 criterion), the effect sizes in the original studies were substantially inflated. Given the true mean of 100, many sample means should fall to the left of the orange bar, and this percentage is clearly much larger than 5%. What is the capture percentage in this specific situation where the original study yielded an upwardly biased estimate? A) 95% (because I believe Harvard Professors and Science editors over you and your simulations!) I always find it easier to see how statistics work, if you can simulate them. I hope this example makes it clear what the difference between a confidence interval and a capture percentage is. * This is a hypothetical statement. Any similarity to commentaries that might be published in Science in the future is purely coincidental. To leave a comment for the author, please follow the link and comment on their blog: The 20% Statistician.
(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers) Sadly, the concept of p-values and significance testing forms the very core of statistics. A number of us have been pointing out for decades that p-values are at best underinformative and often misleading. Almost all statisticians agree on this, yet they all continue to use it and, worse, teach it. I recall a few years ago, when Frank Harrell and I suggested that R place less emphasis on p-values in its output, there was solid pushback. One can’t blame the pusher-backers, though, as the use of p-values is so completely entrenched that R would not be serving its users well with such a radical move. And yet, wonder of wonders, the American Statistical Association has finally taken a position against p-values. I never thought this would happen in my lifetime, or in anyone else’s, for that matter, but I say, Hooray for the ASA! To illustrate the problem, consider one of the MovieLens data sets, consisting of user ratings of movies. There are 949 users. Here is an analysis in which I regress average rating per user against user age and gender:

> head(uu)
  userid age gender      occup   zip  avg_rat
1      1  24      0 technician 85711 3.610294
2      2  53      0      other 94043 3.709677
3      3  23      0     writer 32067 2.796296
4      4  24      0 technician 43537 4.333333
5      5  33      0      other 15213 2.874286
6      6  42      0  executive 98101 3.635071
> q <- lm(avg_rat ~ age + gender, data = uu)
> summary(q)
...
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.4725821  0.0482655  71.947  < 2e-16 ***
age         0.0033891  0.0011860   2.858  0.00436 ** 
gender      0.0002862  0.0318670   0.009  0.99284    
...
Multiple R-squared: 0.008615, Adjusted R-squared: 0.006505

Woohoo! Double-star significance on age! P-value of only 0.004! Age is a highly significant predictor of movie ratings! Older people give higher ratings! Well, no. A 10-year age difference corresponds to only a 0.03 difference in ratings – quite minuscule in light of the fact that ratings take values between 1 and 5.
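The same large-sample behaviour is easy to reproduce with simulated data. The following sketch is not the MovieLens data; the slope and noise level are invented so that the true effect is trivially small:

```r
set.seed(1)
n <- 100000
age <- runif(n, 18, 70)
rating <- 3.5 + 0.003 * age + rnorm(n, 0, 0.45)  # tiny true slope

fit <- summary(lm(rating ~ age))
fit$coefficients["age", "Pr(>|t|)"]  # a minuscule p-value...
fit$r.squared                        # ...yet age explains only about 1% of the variance
```

With n this large, a slope worth about 0.03 rating points per decade sails past any significance threshold.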
The problem is that with large samples, significance tests pounce on tiny, unimportant departures from the null hypothesis, in this case H0: βage = 0, and ironically declare this unimportant result “significant.” We have the opposite problem with small samples: the power of the test is low, and we will announce that there is “no significant effect” when in fact we may have too little data to know whether the effect is important. In addition, there is the hypocrisy aspect. Almost no null hypotheses are true in the real world, so performing a significance test on them is absurd and bizarre. Speaking of hypocrisy: as noted above, instructors of statistics courses all know of the above problems, and yet teach testing anyway, with little or (likely) no warning about this dangerous method. Those instructors also do testing in their own work. My hat is off to the ASA for finally taking some action. To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist.
Curse of Dimensionality: One of the most common problems in data analytics tasks such as recommendation engines and text analytics is high-dimensional, sparse data. We often face situations where we have a large set of features but few data points, or where the feature vectors are very high-dimensional. In such scenarios, fitting a model to the dataset results in a model with low predictive power. This is often termed the curse of dimensionality. In general, adding more data points or shrinking the feature space (also known as dimensionality reduction) reduces its effects. In this blog we will discuss principal component analysis, a popular dimensionality reduction technique. PCA is a useful statistical method that has found application in a variety of fields and is a common technique for finding patterns in high-dimensional data.

Principal component analysis: Consider the following scenario. The data we want to work with is an m×n matrix A, shown below, where Ai,j represents the value of the ith observation of the jth variable. Each of the n variables thus corresponds to an m-dimensional column vector. If the number of variables is very large, it is often desirable to reduce it to a smaller number, say k variables as in the image below, while losing as little information as possible. Mathematically speaking, PCA is a linear orthogonal transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
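The curse of dimensionality mentioned above has a concrete geometric symptom that is easy to demonstrate: as the number of dimensions grows, pairwise distances between random points concentrate around a common value, so notions like "nearest neighbour" become nearly meaningless. A quick sketch (Python; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
spread = {}
for d in (2, 10, 1000):
    X = rng.random((500, d))                        # 500 random points in the unit cube [0,1]^d
    dist = np.linalg.norm(X[:250] - X[250:], axis=1)  # 250 pairwise distances
    spread[d] = dist.std() / dist.mean()            # relative spread of the distances
    print(f"d={d:5d}  relative spread of distances = {spread[d]:.3f}")
```

The relative spread shrinks sharply as d grows, which is one reason models struggle in high-dimensional feature spaces and why reducing dimension first often helps.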
The algorithm, when applied, linearly transforms the n-dimensional input space to a k-dimensional (k < n) output space, with the objective of minimizing the amount of information/variance lost by discarding the remaining (n − k) dimensions. PCA allows us to discard the variables/features that have low variance.

Technically speaking, PCA uses an orthogonal projection of highly correlated variables onto a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. The transformation is defined in such a way that the first principal component has the largest possible variance: it accounts for as much of the variability in the data as possible. Each succeeding component in turn has the highest variance possible subject to being orthogonal to the preceding components. In the image above, u1 and u2 are the principal components: u1 accounts for the highest variance in the dataset, and u2 accounts for the next highest variance and is orthogonal to u1.

For today's post we use the crimtab dataset available in R: data on 3000 male criminals over 20 years old undergoing their sentences in the chief prisons of England and Wales. The 42 row names ("9.4", "9.5", ...) correspond to midpoints of intervals of finger lengths, whereas the 22 column names ("142.24", "144.78", ...) correspond to (body) heights of the 3000 criminals; see also below.
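The transformation described above boils down to centring the data and eigendecomposing its covariance matrix. Here is an illustrative sketch (Python, on a made-up 100×3 matrix with two correlated columns; this is a toy example, not the crimtab data used below):

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy data: two strongly correlated variables plus one independent noisy one
t = rng.normal(size=100)
A = np.column_stack([t, t + 0.1 * rng.normal(size=100), rng.normal(size=100)])

A_c = A - A.mean(axis=0)                 # centre each variable
C = np.cov(A_c, rowvar=False)            # 3 x 3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # reorder so PC1 has the largest variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = A_c @ eigvecs                   # the data in the new (principal) coordinates
print("variance along each PC:", np.round(eigvals, 3))
```

The columns of eigvecs are the orthogonal loading vectors (what prcomp() calls the rotation), and the variance of the scores along each component equals the corresponding eigenvalue, with PC1 largest by construction.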
head(crimtab)
    142.24 144.78 147.32 149.86 152.4 154.94 157.48 160.02 162.56 165.1 167.64 170.18 172.72 175.26 177.8 180.34
9.4      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
9.5      0      0      0      0     0      1      0      0      0     0      0      0      0      0     0      0
9.6      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
9.7      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
9.8      0      0      0      0     0      0      1      0      0     0      0      0      0      0     0      0
9.9      0      0      1      0     1      0      1      0      0     0      0      0      0      0     0      0
    182.88 185.42 187.96 190.5 193.04 195.58
9.4      0      0      0     0      0      0
9.5      0      0      0     0      0      0
9.6      0      0      0     0      0      0
9.7      0      0      0     0      0      0
9.8      0      0      0     0      0      0
9.9      0      0      0     0      0      0

dim(crimtab)
[1] 42 22

str(crimtab)
 'table' int [1:42, 1:22] 0 0 0 0 0 0 1 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:42] "9.4" "9.5" "9.6" "9.7" ...
  ..$ : chr [1:22] "142.24" "144.78" "147.32" "149.86" ...

sum(crimtab)
[1] 3000

colnames(crimtab)
 [1] "142.24" "144.78" "147.32" "149.86" "152.4"  "154.94" "157.48" "160.02" "162.56" "165.1"  "167.64" "170.18" "172.72" "175.26" "177.8"  "180.34"
[17] "182.88" "185.42" "187.96" "190.5"  "193.04" "195.58"

Let us use apply() on the crimtab dataset column-wise to calculate the variance, to see how much each variable varies:

apply(crimtab, 2, var)

We observe that column "165.1" contains the maximum variance in the data. Applying PCA using prcomp():

pca <- prcomp(crimtab)

Note: the resultant components of the pca object from the above code are the standard deviations and the rotation. From the standard deviations we can observe that the first principal component explains most of the variation, followed by the other PCs. Rotation contains the principal component loadings matrix, which gives the contribution of each variable along each principal component.

Let's plot all the principal components and see how the variance is accounted for by each component:

par(mar = rep(2, 4))
plot(pca)

Clearly the first principal component accounts for the maximum information. Let us interpret the results of the PCA using a biplot. A biplot is used to show the proportions of each variable along the two principal components.
# The code below flips the signs of the loadings and scores; if we do not
# include these two lines the plot will be a mirror image of the one below.
pca$rotation <- -pca$rotation
pca$x <- -pca$x
biplot(pca, scale = 0)

The output of the preceding code is as follows: In the preceding image, known as a biplot, we can see the two principal components (PC1 and PC2) of the crimtab dataset. The red arrows represent the loading vectors, which show how the feature space varies along the principal component vectors. From the plot we can see that the first principal component, PC1, places more or less equal weight on three features: 165.1, 167.64, and 170.18. This means that these three features are more correlated with one another than with the 160.02 and 162.56 features. In the second principal component, PC2 places more weight on 160.02 and 162.56 than on the three features 165.1, 167.64, and 170.18, which are less correlated with them.

So by now we understand how to run PCA and how to interpret the principal components. Where do we go from here? How do we apply the reduced-variable dataset? We shall answer these questions in our next post.
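As a footnote to the walk-through above: the claim that the first component "accounts for maximum information" is usually quantified as the proportion of variance explained, i.e. each component's squared standard deviation over the total. A sketch with hypothetical values standing in for prcomp()$sdev (illustrative numbers, not the actual crimtab results):

```python
import numpy as np

# Hypothetical component standard deviations (playing the role of prcomp()$sdev)
sdev = np.array([5.0, 2.0, 1.0, 0.5])
prop = sdev**2 / np.sum(sdev**2)        # proportion of variance per component
print(np.round(prop, 3))                # -> [0.826 0.132 0.033 0.008]
print(np.round(np.cumsum(prop), 3))     # cumulative: choose k where this is "enough"
```

The cumulative sum is the usual guide for choosing how many components k to keep, e.g. enough components to cover 95% of the total variance.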

Ok, so a quick spoiler here: there is no QGIS 3.0 ready yet, nor will there be a QGIS 3.0 for some time. This article provides a bit more detail on the plans for QGIS 3.0. A few weeks ago I wrote about some of the considerations for the 3.0 release, so you may want to read that first before continuing with this article, as I do not cover the same ground here. A lot of consideration has gone into deciding what the approach will be for the development of QGIS 3.0. Unfortunately the first PSC vote regarding which proposal to follow was a split decision (4 for, 3 against, 1 abstention and 1 suggestion for an alternative in the discussion). During our PSC meeting this week we re-tabled the topic and eventually agreed on Jürgen Fischer's proposal (Jürgen is a QGIS PSC member and the QGIS release manager) by a much more unanimous margin of 8 for, 1 neutral and 1 absent. Jürgen's proposal is largely similar to Proposal 2 described in my previous posting. I want to make some special notes here about our discussion and subsequent decision, which will hopefully help to clarify the thinking behind our decision for other interested observers. First let me lay out Jürgen's plan in his own words:

"My preferred approach would still be:

Do a Qt5/PyQt5/Python3 branch in parallel, actually work on it until it's ready, make it master and release it as 3.0
Meantime keep working on master (2.x) and keep releasing them every 4 months as usual

Everyone can work on the branch (s)he wants (or is hired to), but needs to consider whether (s)he wants to do it (or spend funds on):

only for 2.x: knowing that it will be released soon, but might become unusable because platforms drop support for stuff it depends on sooner or later
only for 3.x: not knowing when that will ever release
or for both: knowing that work needs to be done twice.

People adding features to the master branch will be responsible for ensuring that their work gets merged to the 3.0 branch.
As PSC we should maintain the environment for people to do something for QGIS – but we cannot tell them to – so we don't have resources we can actually plan with, and that means we can either release something when the big thing is ready, or release what we have at fixed intervals." – Jürgen Fischer

What follows are some further details and clarifications of our preferred approach.

Why do parallel development? Parallel development of 3.0 while maintaining our master branch with 2.x code in it has advantages and disadvantages. First the advantages:

If we encounter major technical difficulties / release blockers in the 3.0 branch, it will not impact our normal four-monthly release cycle.

Our binary build systems (Linux, Windows and OS X binaries) will be unaffected until 3.0 is ready. It is very likely that building 3.0 binaries is going to present difficulties on each platform. For example, OSGeo4W has no Python3 and Qt5 packages yet. Someone needs to see to the creation of the required packages as a separate exercise from the actual development of a version of QGIS that will take advantage of these updated libraries. We don't yet know what problems may be encountered in preparing these.

"Don't break what already works": we have a working and relatively stable master branch and we don't want to do a 'cowboy stunt' and break it. We prefer to wait until the 3.0 branch is mature, has passing tests and is known to work well before merging it into master and treating it as the best we currently have.

Of course nothing in life is completely easy; there are also some disadvantages:

Some developers may feel that running two mainstream branches is a dilution of effort. To counter this, our public recommendation is that after 2.16 comes out, all QGIS contributors are strongly encouraged to provide their patches against the 3.0 branch.

Any features applied to the master branch are not guaranteed to be part of the 3.0 release.
Regular merging of master into the 3.0 branch may prove more and more difficult over time as the two branches diverge. Again, we will strongly encourage developers submitting new features after the 2.16 release to do so against the 3.0 branch.

The 3.0 branch won't have an automated build system for nightly binaries in the beginning. Since we expect that the initial branch of 3.0 will break these anyway, having a separate branch is actually an advantage here, as it will give binary packagers some time to get their build systems up to speed.

To clarify things for developers wondering where to commit their work: we discourage people from writing new features in master after 2.16 is released and rather ask them to make their changes in the 3.0 branch. Only those who really need to see their features in the next 2.18 release would have to dual-commit.

Isn't it better to work on 3.0 in the master branch? Some queries have been raised about whether it would be better to work on 3.0 in the master branch and relegate the 2.x code base to a side branch (instead of our intended approach, which is to keep master tracking 2.x until 3.0 is ready and then merge it into master). We feel that keeping master tracking the 2.x code base is more conservative – it will not break existing packaging / build systems, and if there is any major hitch in 3.0 development the release process will continue unabated based on the 2.x code set. While 3.0 is under development, package builders will have time to figure out the packaging process while still keeping the regular nightly builds against 2.x running. The implication of this is that 2.18 may contain only bug fixes which were applied to the 2.x branch and no significant new features.

The schedule will not be fixed: One thing that we want to make really clear (and this was a key point in our many discussions) is that there will be no fixed release date for QGIS 3.0.
There are several reasons for this. As a steering committee, we can only point the QGIS ship in a given direction; our power to actually make work happen is extremely limited. This is because we are a community made up largely of volunteer developers or developers working on a commission basis for third-party clients. We have no say in how these contributors spend their time. We also do not yet know which (if any) major technical issues will be encountered during the development of 3.0; any such issues could very well delay the roll-out of QGIS 3.0. Instead our plan is to make the 2.16 release and then effectively move all developer effort to the 3.0 branch as best we can (through close liaison with our developer community).

To clarify things for those wondering when they can give 3.0 to their users: the actual release date for 3.0 is indeterminate, but the general aim is still to encourage everyone to get it ready for a year from now. Remember that as an open source community we cannot directly ensure that project timelines are met, since our developer force is largely volunteer-based or works according to their own companies' agendas.

Will 3.0 be a Long Term Release (LTR)? It is our recommendation that we wait until 3.2 is ready before designating a release an LTR – there are going to be large changes in the code base for 3.0 and we would rather stabilise things a bit before applying the LTR label.

Looking forward to 3.0: Personally I am very much looking forward to the release of QGIS 3.0 – it represents another huge milestone in our project, it affords us a great opportunity to clear a lot of cruft out of our code base and APIs, and it will arm us with a set of modern new libraries that will see us through the next several years. Rock on QGIS 3.0! QGIS PSC Chairman
QGIS is a user-friendly Open Source Geographic Information System that runs on Linux, Unix, Mac OS X, and Windows. We are very pleased to announce the release of QGIS 2.14 'Essen'. Essen was the host city of our developer meet-ups in October 2012 and 2014.

Long Term Release: This is a special release since it is designated an 'LTR' (Long Term Release). LTR releases will be supported with backported bug fixes for one year, and will be in permanent feature freeze (i.e. no new features will be added, only bug fixes and trivial updates). Note that we are in discussion to extend the term of our LTR releases to two years, but for technical reasons we will not do this until QGIS 3.2. The purpose of LTR releases is to provide a stable and less frequently changing platform for enterprises and organizations that do not want to deal with updating user skills, training materials etc. more than once per year. The success of the LTR is very much down to you, our beloved users – we need your support to help fund bug fixes, and to make sure your support contracts with providers specify that any bug fixes done on your behalf are applied to the LTR branch as well as our normal development branch. If an LTR is important to you, please consider directly supporting the QGIS project, or encourage your commercial provider to use the LTR as a basis for your enterprise solution, so that everyone may benefit from a stable platform that is being continuously improved and refined. Note that for users and organizations that like to live on the frontier, our regular four-monthly releases will continue unabated.

New Features in QGIS 2.14 'Essen': If you are upgrading from QGIS 2.8 (our previous LTR version) you will find a great many new features in this release. ( http://qgis.org/en/site/forusers/visualchangelog214/) We encourage you to peruse the changelogs for the intermediate non-LTR 2.10 and 2.12 releases, as QGIS 2.14 includes all features published in those releases too.
( http://qgis.org/en/site/forusers/visualchangelog212/) ( http://qgis.org/en/site/forusers/visualchangelog210/)

Note that 2.14 first enters the regular package repositories and will not immediately replace 2.8 in the LTR package repositories. That will happen when 2.16 is released. Whenever new features are added to software they introduce the possibility of new bugs – if you encounter any problems with this release, please file a ticket on the QGIS Bug Tracker. ( http://hub.qgis.org)

The source code and binaries for Windows, Debian and Ubuntu are already available via the large download link on our home page: http://qgis.org. More packages will follow as soon as the package maintainers finish their work. Please revisit the page if your platform is not available yet.

Thanks: We would like to thank the developers, documenters, testers and all the many folks out there who volunteer their time and effort (or fund people to do so). From the QGIS community we hope you enjoy this release! If you wish to donate time, money or otherwise get involved in making QGIS more awesome, please wander along to qgis.org and lend a hand!

QGIS is supported by donors and sponsors. A current list of donors who have made financial contributions large and small to the project can be seen on our donors list. If you would like to become an official project sponsor, please visit our sponsorship page for details. ( http://qgis.org/en/site/about/sponsorship.html) Current Sponsors of QGIS: Sponsoring QGIS helps us to fund our six-monthly developer meetings, maintain project infrastructure and fund bug-fixing efforts. A complete list of current sponsors is provided below – our very great thank you to all of our sponsors!
SILVER AGH University of Science and Technology, Krakow, Poland
SILVER Sourcepole AG, Switzerland
SILVER GAIA mbH, Germany
SILVER Office of Public Works, Flood Risk Management and Data Management Section, Ireland
SILVER Land Vorarlberg, Austria
BRONZE 2D3D.GIS, France
BRONZE Cawdor Forestry, United Kingdom
BRONZE ChameleonJohn, United States
BRONZE Chartwell Consultants Ltd., Canada
BRONZE Gis3W, Italy
BRONZE Dr. Kerth + Lampe Geo-Infometric GmbH, Germany
BRONZE Gaia3D, Inc., South Korea
BRONZE GeoSynergy, Australia
BRONZE GFI – Gesellschaft für Informationstechnologie mbH, Germany
BRONZE GKG Kassel (Dr.-Ing. Claas Leiner), Germany
BRONZE HostingFacts.com, Estonia
BRONZE Lutra Consulting, United Kingdom
BRONZE MappingGIS, Spain
BRONZE Nicholas Pearson Associates, United Kingdom
BRONZE QGIS Polska, Poland
BRONZE Royal Borough of Windsor and Maidenhead, United Kingdom
BRONZE TerreLogiche, Italy
BRONZE Trage Wegen vzw, Belgium
BRONZE Urbsol, Australia
BRONZE GIS Support, Poland
BRONZE ADLARES GmbH, Germany
BRONZE WhereGroup GmbH & Co. KG, Germany
BRONZE www.molitec.it, Italy
BRONZE www.argusoft.de, Germany
BRONZE Customer Analytics, USA

QGIS is Free software and you are under no obligation to pay anything to use it – in fact we want to encourage people far and wide to use it regardless of what your financial or social status is – we believe empowering people with spatial decision-making tools will result in a better society for all of humanity. If you are able to support QGIS, you can donate using this link: http://qgis.org/en/site/getinvolved/donations.html Happy QGISing! Regards, The QGIS Team!
So, some good news for QGIS users who also need/want to use MapInfo: QGIS, via GDAL 2.0, can support MapInfo TAB file editing. All older versions of GDAL only supported either read or write, but not both. MapInfo TAB editing has been supported since GDAL 2, but up until this point QGIS has only been built against GDAL 1.x. GDAL 2.x is now the default GDAL release in OSGeo4W. From Jürgen:

2.0.2 is now the default GDAL in OSGeo4W and the nightlies (qgis-ltr-dev, qgis-rel-dev and qgis-dev) already picked it up. With the next release the regular packages (2.14 and 2.8) will also be updated to use it

Even if you don't want to make the switch to full QGIS, you can now use both bits of software and edit in both. QGIS will still only support a single geometry type per layer, so if you open a mixed TAB file you will get the geometry type selector. You can load the layer three times if you need the three different geometry types.
(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

It's usually a good thing when my #rstats and infosec worlds collide. Unfortunately, this time it's a script that R folk running on OS X can use to see if they are using a version of XQuartz that has a nasty vulnerability in the framework it uses to auto-update. If this test comes back with the warning, try to refrain from using XQuartz on insecure networks until the developers fix the issue.

UPDATE Thanks to a gist prodding by @bearloga, here's a script to scan all your applications for the vulnerability (the assignment arrows and minus signs below have been restored from a garbled copy of the original script):

library(purrr)
library(dplyr)
library(XML)

read_plist   <- safely(readKeyValueDB)
safe_compare <- safely(compareVersion)

apps <- list.dirs(c("/Applications", "/Applications/Utilities"), recursive=FALSE)

# if you have something further than this far down that's bad you're on your own
for (i in 1:4) {
  moar_dirs <- grep("app$", apps, value=TRUE, invert=TRUE)
  if (length(moar_dirs) > 0) { apps <- c(apps, list.dirs(moar_dirs, recursive=FALSE)) }
}
apps <- unique(grep("app$", apps, value=TRUE))

pb <- txtProgressBar(0, length(apps), style=3)

suppressWarnings(map_df(1:length(apps), function(i) {

  x <- apps[i]
  setTxtProgressBar(pb, i)

  is_vuln <- FALSE
  version <- ""

  app_name <- sub("\\.app$", "", basename(x))
  app_loc  <- sub("^/", "", dirname(x))

  to_look <- c(sprintf("%s/Contents/Frameworks/Autoupdate.app/Contents/Info.plist", x),
               sprintf("%s/Contents/Frameworks/Sparkle.framework/Versions/A/Resources/Info.plist", x),
               sprintf("%s/Contents/Frameworks/Sparkle.framework/Versions/A/Resources/Autoupdate.app/Contents/Info.plist", x))

  is_there <- map_lgl(c(sprintf("%s/Contents/Frameworks/Sparkle.framework/", x), to_look), file.exists)
  has_sparkle <- any(is_there)

  to_look <- to_look[which(is_there[-1])]

  discard(map_chr(to_look, function(x) {
    read_plist(x)$result$CFBundleShortVersionString %||% NA
  }), is.na) -> vs

  if (any(map_dbl(vs, function(v) { safe_compare(v, "1.16.1")$result %||% -1 }) < 0)) {
    is_vuln <- TRUE
    version <- vs[1]
  }

  data_frame(app_loc, app_name, has_sparkle, is_vuln, version)

})) -> app_scan_results

close(pb)

select(arrange(filter(app_scan_results, has_sparkle), app_loc, app_name), -has_sparkle)

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.
(This article was first published on rOpenSci Blog - R, and kindly contributed to R-bloggers)

Scientific articles are typically locked away in PDF format, a format designed primarily for printing but not so great for searching or indexing. The new pdftools package allows for extracting text and metadata from PDF files in R. From the extracted plain text one could find articles discussing a particular drug or species name, without having to rely on publishers providing metadata or on paywalled search engines.

The pdftools package slightly overlaps with the Rpoppler package by Kurt Hornik. The main motivation behind developing pdftools was that Rpoppler depends on glib, which does not work well on Mac and Windows. The pdftools package uses the Poppler C++ interface together with Rcpp, which results in a lighter and more portable implementation.

Installing pdftools: On Windows and Mac the binary packages can be installed directly from CRAN:

install.packages("pdftools")

Installation on Linux requires the poppler development library. On Debian/Ubuntu:

sudo apt-get install libpoppler-cpp-dev

On Fedora or CentOS:

sudo yum install poppler-cpp-devel

If you want to install the package from source on Mac OS X you need brew:

brew install poppler

That's it.

Getting started: The ?pdftools manual page shows a brief overview of the main utilities. The most important function is pdf_text, which returns a character vector of length equal to the number of pages in the PDF. Each string in the vector contains a plain text version of the text on that page.

library(pdftools)
download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
txt <- pdf_text("1403.2805.pdf")

# first page text
cat(txt[1])

# second page text
cat(txt[2])

In addition, the package has some utilities to extract other data from the PDF file. The pdf_toc function shows the table of contents, i.e. the section headers which PDF readers usually display in a menu on the left.
It looks pretty in JSON:

# Table of contents
toc <- pdf_toc("1403.2805.pdf")

# Show as JSON
jsonlite::toJSON(toc, auto_unbox = TRUE, pretty = TRUE)

Other functions provide information about fonts, attachments and metadata such as the author, creation date or tags.

# Author, version, etc
info <- pdf_info("1403.2805.pdf")

# Table with fonts
fonts <- pdf_fonts("1403.2805.pdf")

Bonus feature: rendering PDF. A bonus feature on most platforms is rendering of PDF files to bitmap arrays. The Poppler library provides all the functionality needed to implement a complete PDF reader, including graphical display of the content. In R we can use pdf_render_page to render a page of the PDF into a bitmap, which can be stored as e.g. PNG or JPEG.

# renders pdf to bitmap array
bitmap <- pdf_render_page("1403.2805.pdf", page = 1)

# save bitmap image
png::writePNG(bitmap, "page.png")
jpeg::writeJPEG(bitmap, "page.jpeg")
webp::write_webp(bitmap, "page.webp")

This feature is still experimental and currently does not work on Windows.

Limitations: Data scientists are often interested in data from tables. Unfortunately the PDF format is pretty dumb and does not have a notion of a table (unlike, for example, HTML). Tabular data in a PDF file is nothing more than strategically positioned lines and text, which makes it difficult to extract the raw data.

txt <- pdf_text("http://arxiv.org/pdf/1406.4806.pdf")

# some tables
cat(txt[18])
cat(txt[19])

pdftools usually does a decent job of retaining the positioning of table elements when converting from PDF to text. But the output is still very dependent on the formatting of the original PDF table, which makes it very difficult to write a generic table extractor. With a little creativity, though, you might be able to parse the table data from the text output of a given paper.

Jeroen Ooms joins team rOpenSci! A message from the team: We are happy to announce that Jeroen Ooms has joined the rOpenSci crew!
Jeroen is a prolific programmer and author of numerous widely used packages. At rOpenSci, he will continue to work on developing awesome packages and infrastructural software for improving the scientific tool chain. To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog - R.
(This article was first published on One Tip Per Day, and kindly contributed to R-bloggers)

The tips below are based on lessons I learnt from making mistakes during my years of research. It's purely personal opinion, and the order doesn't mean anything. If you think I should include something else, please comment below.

Always set a seed when you run tools with a random option, e.g. bedtools shuffle, random etc. You (or your boss one day) will want your work to be reproducible.

Set your own temporary folder (via --tmp, $TMPDIR etc., depending on your program). By default, many tools, e.g. sort, use the system /tmp as the temporary folder, which may have a limited quota that is not enough for your big NGS data.

Always use valid names for your variables and for the columns and rows of a data frame. Otherwise, it can bring up unexpected problems; e.g. a '-' in a name will be converted to '.' in R unless you specify check.names=F.

Always make a README file for the folder of your data. For why and how, read this: http://stackoverflow.com/questions/2304863/how-to-write-a-good-readme

Always comment your code properly, for yourself and for others, as you will very likely read your ugly code again.

Always back up your code in a timely fashion, using GitHub, SVN, Time Machine, or simply copy/paste, whatever.

Always clean up intermediate or unnecessary data, as you can easily shock your boss and yourself by generating so much data (and perhaps most of it useless). Don't save into *.sam if you can use *.bam. Always zip your FASTQ (and other large plain-text files) as much as you can. This applies to other file formats too if you can use a compressed one, as you cannot imagine how much data (aka "digital garbage") you will generate soon.

Use parallelism as much as you can, e.g. "LC_ALL=C sort --parallel=24 --buffer-size=5G" for sorting ( https://www.biostars.org/p/66927/), as multi-core CPUs/GPUs are common nowadays.

When a project is completed, remember to clean up your project folder, incl.
removing unnecessary code/data/intermediate files, and burning a CD for the project. You never know when you, your boss or your collaborators will need the data again.

Make your code as sustainable as possible. Remember the three major features of OOP: inheritance, encapsulation, polymorphism. (URL)

When you learn some tips from others via a Google search, remember to list the URL for future reference and also to acknowledge others' credit. This applies to this post, of course.

Keep learning, otherwise you will soon be out of date. Just like the rapid development of NGS techniques, computing skills are also evolving quickly. Always catch up with the new skills/tools/techniques.

When you learn some tips from others, remember to share something you learned with the community as well, as that's how the community grows healthily.

Last but not least, stand up and move around after sitting for 1-2 hours. This is especially important for us bioinformaticians, who usually sit in front of a computer for hours. Only good health can sustain a long career.

More reading: https://www.washingtonpost.com/news/wonk/wp/2015/06/02/medical-researchers-have-figured-out-how-much-time-is-okay-to-spend-sitting-each-day/

To leave a comment for the author, please follow the link and comment on their blog: One Tip Per Day.
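The first tip above, always set a seed, is cheap insurance. A minimal illustration of why it matters (Python here for brevity; the same idea applies to R's set.seed() or a tool's seed option):

```python
import random

def shuffled(seed):
    random.seed(seed)           # fix the RNG state before any random operation
    items = list(range(10))
    random.shuffle(items)
    return items

print(shuffled(42))
print(shuffled(42) == shuffled(42))  # same seed -> identical "random" result
```

Without the seed call, every run produces a different ordering, and a result you reported last month (a shuffled genome background, a random train/test split) can never be regenerated exactly.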
(This article was first published on R – Networkx, and kindly contributed to R-bloggers)

Synopsis. Note! This post is an addition to "Create blog posts from RStudio to WordPress". About a week ago I tried to add my blog to R-Bloggers, and I thought everything was in order. But this week I got a mail from Tal Galili (site administrator of R-Bloggers) with the message that my blog uses base64 images in its feed. This type of image is created, as the default option, when knitr renders a standalone HTML document. It would be great if I could solve it with PNG instead of base64 images, and even better if I could do so while posting from RStudio to WordPress. And so I did. In this post I will explain how to write your WordPress post with PNG files and upload it to your blog.

How to recognize a base64 image: Well, you can check the RSS feed of your blog and search for data:image/png;base64, and it should look something like this (thx Tal for the image):

Setting your options: To upload PNG files to your blog you need to set some knitr options first. As I described in my earlier post, you have to set your login parameters to log in to your WordPress blog. For the upload of files to WordPress you need to set the upload.fun option of knitr. This takes the filename as input and returns a link to the image in the R Markdown file. Let's set the login parameters and the upload.fun option. I've hashed them out because I cannot post with the dummy credentials. Following the earlier post, you can add the upload.fun option after your login credentials and you are good to go.

# Load libraries
library(RWordPress)
library(knitr)

# Login parameters
options(WordPressLogin=c(your_username="your_password"),
        WordPressURL="http://your.blog.com/xmlrpc.php")

# Upload your plots as png files to your blog
opts_knit$set(upload.fun = function(file){library(RWordPress); uploadFile(file)$url;})

After setting the above credentials you are ready to upload a post with PNG image(s) to your blog.
For completeness, I will include a PNG image in this post, created within R.

Example of posting your PNG image

data(iris)
pairs(iris, col = iris$Species)

After this I could post to my blog with knit2wpCrayon:

knit2wpCrayon("r2blog_with_png.Rmd",
              title = "Upload plots as PNG file to your wordpress",
              categories = c("R", "Programming"),
              publish = FALSE, upload = TRUE)

Again, I commented this out because the post cannot publish itself. This code can also be found on my GitHub. To leave a comment for the author, please follow the link and comment on their blog: R – Networkx.
(This article was first published on RStudio Blog, and kindly contributed to R-bloggers) I’m very pleased to announce the release of ggplot2 2.1.0, scales 0.4.0, and gtable 0.2.0. These are a set of relatively minor updates that fix a whole bunch of little problems that crept in during the last big update. The most important changes are described below. When mapping an aesthetic to a constant, the default guide title is the name of the aesthetic (i.e. “colour”), not the value (i.e. “loess”). This is a really handy technique for labelling individual layers:

ggplot(mpg, aes(displ, 1 / hwy)) +
  geom_point() +
  geom_smooth(method = "lm", aes(colour = "lm"), se = FALSE) +
  geom_smooth(aes(colour = "loess"), se = FALSE)

stat_bin() (which powers geom_histogram() and geom_freqpoly()) has been overhauled to use the same algorithm as ggvis. This has considerably better parameters and defaults thanks to the work of Randall Pruim. Changes include: Better arguments and a better algorithm for determining the origin. You can now specify either boundary (i.e. the position of the left or right side) or the center of a bin. origin has been deprecated in favour of these arguments. drop is deprecated in favour of pad, which adds extra 0-count bins at either end, as is needed for frequency polygons. geom_histogram() defaults to pad = FALSE, which considerably improves the default limits for the histogram, especially when the bins are big. The default algorithm does a (somewhat) better job of picking nice widths and origins across a wider range of input data. You can see the impact of these changes on the following two histograms:

ggplot(diamonds, aes(carat)) + geom_histogram(binwidth = 1)
ggplot(diamonds, aes(carat)) + geom_histogram(binwidth = 1, boundary = 0)

All layer functions (geom_*() and stat_*()) now have a consistent argument order: data, mapping, then geom/stat/position, then ..., then layer-specific arguments, then common layer arguments.
This might break some code if you were relying on partial name matching, but in the long term it should make ggplot2 easier to use. In particular, you can now set the n parameter in geom_density2d() without it partially matching na.rm. For geoms with both colour and fill, alpha once again only affects fill. alpha was changed to modify both colour and fill in 2.0.0, but I’ve reverted it to the old behaviour because it was causing pain for quite a few people. You can see a full list of changes in the release notes. To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.
Adding an R plot (as image) to Excel

After creating a new sheet, it is possible to put into this sheet a picture of a graph created in R:

library(ggplot2)
fileGraph <- paste(outDir, 'graph.png', sep = '/')
png(filename = fileGraph, width = 800, height = 600)
ozone.plot <- ggplot(dtAir, aes(x = Day, y = Ozone)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~Month, nrow = 1)
print(ozone.plot)
invisible(dev.off())
addImage(exc2, fileGraph, 'OzonePlot', TRUE)
saveWorkbook(exc2)

The command addImage inserts the PNG image of the graph just saved into the named region OzonePlot. The result is shown in the next figure:

Extra functions of XLConnect

In this brief article we focus our attention not on the Excel layout but on the sheet contents. For example, we can automatically create an Excel calculator with an input sheet and an output sheet, managing all information in R. XLConnect also has features such as defining cell styles (data formats, borders, background fill colors...), hiding/unhiding sheets, specifying column width and row height, merging cells, autofilters, style actions (using templates) and behavior with error cells. Stay tuned to learn the rest!
(This article was first published on DataScience+, and kindly contributed to R-bloggers) In this post I will present a simple way to export your regression results (or output) from R into Microsoft Word. Previously, I wrote a tutorial on how to create Table 1 with study characteristics and export it into Microsoft Word. These posts are especially useful for researchers who prepare their manuscript for publication in peer-reviewed journals.

Get the results from Cox Regression Analysis

As an example to illustrate this post, I will compute a survival analysis. Survival analysis is a set of statistical methods for analyzing data where the outcome variable is the time until the occurrence of an event. The event can be the occurrence of a disease, death, etc. In R we compute the survival analysis with the survival package. The function for Cox regression analysis is coxph(). I will use the veteran data which come with the survival package.

## Load survival package
library(survival)
# Load veteran data
data(veteran)
# Show first 6 rows
head(veteran)
  trt celltype time status karno diagtime age prior
1   1 squamous   72      1    60        7  69     0
2   1 squamous  411      1    70        5  64    10
3   1 squamous  228      1    60        3  38     0
4   1 squamous  126      1    60        9  63    10
5   1 squamous  118      1    70       11  65    10
6   1 squamous   10      1    20        5  49     0
# Data description
help(veteran, package = "survival")
trt: 1=standard 2=test
celltype: 1=squamous, 2=smallcell, 3=adeno, 4=large
time: survival time
status: censoring status
karno: Karnofsky performance score (100=good)
diagtime: months from diagnosis to randomisation
age: in years
prior: prior therapy 0=no, 1=yes

Now let's say that we are interested in the risk of dying (status) for the different cell types (celltype) and treatment (trt) when we adjust for the other variables (karno, age, prior, diagtime).
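As a minimal sketch of the response object used by coxph(), Surv() simply pairs each survival time with its censoring status (the survival package ships with standard R distributions):

```r
# Surv() builds the survival response used on the left-hand side of the
# coxph() formula: survival times, with "+" marking censored observations
library(survival)
data(veteran)
s <- Surv(veteran$time, veteran$status)
class(s)  # "Surv"
head(s)   # the first six survival times from the veteran data
```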
This is the model:

# Fit the Cox model
fit = coxph(Surv(time, status) ~ age + celltype + prior + karno + diagtime + trt, data = veteran)

And the output:

summary(fit)
Call:
coxph(formula = Surv(time, status) ~ age + celltype + prior +
    karno + diagtime + trt, data = veteran)

  n= 137, number of events= 128

                        coef  exp(coef)   se(coef)      z Pr(>|z|)
age               -8.706e-03  9.913e-01  9.300e-03 -0.936  0.34920
celltypesmallcell  8.616e-01  2.367e+00  2.753e-01  3.130  0.00175 **
celltypeadeno      1.196e+00  3.307e+00  3.009e-01  3.975 7.05e-05 ***
celltypelarge      4.013e-01  1.494e+00  2.827e-01  1.420  0.15574
prior              7.159e-03  1.007e+00  2.323e-02  0.308  0.75794
karno             -3.282e-02  9.677e-01  5.508e-03 -5.958 2.55e-09 ***
diagtime           8.132e-05  1.000e+00  9.136e-03  0.009  0.99290
trt                2.946e-01  1.343e+00  2.075e-01  1.419  0.15577
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

                  exp(coef) exp(-coef) lower .95 upper .95
age                  0.9913     1.0087    0.9734    1.0096
celltypesmallcell    2.3669     0.4225    1.3799    4.0597
celltypeadeno        3.3071     0.3024    1.8336    5.9647
celltypelarge        1.4938     0.6695    0.8583    2.5996
prior                1.0072     0.9929    0.9624    1.0541
karno                0.9677     1.0334    0.9573    0.9782
diagtime             1.0001     0.9999    0.9823    1.0182
trt                  1.3426     0.7448    0.8939    2.0166

Concordance= 0.736 (se = 0.03)
Rsquare= 0.364 (max possible= 0.999)
Likelihood ratio test= 62.1 on 8 df, p=1.799e-10
Wald test            = 62.37 on 8 df, p=1.596e-10
Score (logrank) test = 66.74 on 8 df, p=2.186e-11

As we see, there are "a lot" of results. In a manuscript we often report only the hazard ratio and 95% confidence interval, and only for the variables of interest. In this case I am interested in the cell types and treatment. Note: I will not comment on the regression coefficients since that is not the aim of this post.
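To see how the exp(coef) column relates to the coef column above: a hazard ratio is just the exponential of the regression coefficient. For example, for celltypesmallcell:

```r
# Hazard ratio = exp(coefficient); e.g. the smallcell coefficient 8.616e-01
exp(0.8616)            # ~ 2.367, matching the exp(coef) column above
round(exp(0.8616), 3)  # 2.367
```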
Prepare the table by creating the columns

# Prepare the columns
HR <- round(exp(coef(fit)), 2)
CI <- round(exp(confint(fit)), 2)

# Name the columns of CI
colnames(CI) <- c("Lower", "Higher")

# Bind columns together as a data set
table2 <- as.data.frame(cbind(HR, CI))
table2
                    HR Lower Higher
age               0.99  0.97   1.01
celltypesmallcell 2.37  1.38   4.06
celltypeadeno     3.31  1.83   5.96
celltypelarge     1.49  0.86   2.60
prior             1.01  0.96   1.05
karno             0.97  0.96   0.98
diagtime          1.00  0.98   1.02
trt               1.34  0.89   2.02

Select variables of interest from the table

As I mentioned earlier, I am interested in only 2 variables (cell type and treatment). With the code below I select those variables.

# Select variables you want to present in the table
table2 <- table2[c("celltypesmallcell", "celltypeadeno", "celltypelarge", "trt"), ]
table2
                    HR Lower Higher
celltypesmallcell 2.37  1.38   4.06
celltypeadeno     3.31  1.83   5.96
celltypelarge     1.49  0.86   2.60
trt               1.34  0.89   2.02

Format the table

In the manuscript we present the confidence intervals within brackets. Therefore, with the code below I add the brackets and a dash.

# Add brackets and dash for later use in the table
table2$a <- "("; table2$b <- "-"; table2$c <- ")"

# Order the columns
table2 <- table2[, c("HR", "a", "Lower", "b", "Higher", "c")]
table2
                    HR a Lower b Higher c
celltypesmallcell 2.37 (  1.38 -   4.06 )
celltypeadeno     3.31 (  1.83 -   5.96 )
celltypelarge     1.49 (  0.86 -   2.60 )
trt               1.34 (  0.89 -   2.02 )

Finalize the table and make it ready for Microsoft Word

The table is almost ready; now I will merge everything into one column using the package tidyr with the function unite().
# Merge all columns into one
library(tidyr)
table2 = unite(table2, "HR (95%CI)", c(HR, a, Lower, b, Higher, c), sep = "", remove = TRUE)

# Add a space between the estimates of HR and CI
table2[, 1] <- gsub("\\(", " (", table2[, 1])
table2
                        HR (95%CI)
celltypesmallcell 2.37 (1.38-4.06)
celltypeadeno     3.31 (1.83-5.96)
celltypelarge      1.49 (0.86-2.6)
trt               1.34 (0.89-2.02)

Export Table from R to Microsoft Word

To export the table from R to Microsoft Word I use the function FlexTable() from the package ReporteRs. I found a very good script on StackOverflow to achieve this task; I am sharing the code below. (Credits to the author on StackOverflow.)

# Load the packages
library(ReporteRs)
library(magrittr)

# The script
docx( ) %>%
  addFlexTable(table2 %>%
    FlexTable(header.cell.props = cellProperties(background.color = "#003366"),
              header.text.props = textBold(color = "white"),
              add.rownames = TRUE) %>%
    setZebraStyle(odd = "#DDDDDD", even = "#FFFFFF")) %>%
  writeDoc(file = "table2.docx")

This is the table in Microsoft Word: If you have any comment or feedback feel free to post a comment below.

Related Post
Learn R by Intensive Practice
Learn R from the Ground Up
Table 1 and the Characteristics of Study Population
Learn R From Scratch – Part 3
Learn R From Scratch – Part 2

To leave a comment for the author, please follow the link and comment on their blog: DataScience+.
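As an aside on the post above, the bracket columns and the unite() step can be collapsed into a single sprintf() call in base R. This is a hypothetical alternative, shown here with the rounded estimates from the post's table rather than the fitted model:

```r
# Hypothetical one-step formatting of "HR (95% CI)" strings with sprintf(),
# using the rounded estimates from the table above
hr <- c(2.37, 3.31, 1.49, 1.34)
lo <- c(1.38, 1.83, 0.86, 0.89)
hi <- c(4.06, 5.96, 2.60, 2.02)
sprintf("%.2f (%.2f-%.2f)", hr, lo, hi)
# first element: "2.37 (1.38-4.06)"
```

A side benefit of %.2f is that trailing zeros are kept (e.g. "2.60" rather than "2.6").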
(This article was first published on R – TRinker's R Blog, and kindly contributed to R-bloggers) Several years back I wrote a two-part blog series in response to seeing questions about plotting and reordering on list servs, talkstats.com, and stackoverflow. Part I discussed the basics of reordering plots by reordering factor levels. The essential gist was: if you catch yourself using “rearrange”/”reorder” and “plot” in a question, think... factor & levels. Part II undertook reordering as a means of more easily seeing patterns in layouts such as bar plots & dot plots. Well, there is at least one case in which reordering factor levels doesn't help to reorder a plot. This post describes this ggplot2-based problem and outlines the way to overcome it. You can get just the code here.

The Problem

In a stacked ggplot2 plot the fill ordering is not controlled by factor levels. Such plots include stacked bar and area plots. Here is a demonstration of the problem.

Load Packages

if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, ggplot2)

Generate Data

Here I generate a data set containing a time series element (Month), counts (Count), and a leveling variable (Level). The counts are transformed to proportions, and the Level variable is converted to a leveled factor with the order "High", "Medium", "Low". This leveling is key to the problem, as it will be used as the fill variable. It is here that reordering the factor levels will not work to reorder the plot.

dat <- data_frame(
  Month = rep(sort(month.abb), each = 3),
  Count = sample(10000:60000, 36),
  Level = rep(c("High", "Low", "Medium"), 12)
) %>%
  mutate(
    Level = factor(Level, levels = c("High", "Medium", "Low")),
    Month = factor(Month, levels = month.abb)
  ) %>%
  group_by(Month) %>%
  mutate(Prop = Count/sum(Count))

Plot a Stacked Area Plot

Next we generate the area plot. The accompanying plot demonstrates the problem.
Notice that the legend is ordered according to the factor levels in the Level variable ("High", "Medium", "Low"), yet the plot fill ordering is not in the expected order (it is "Medium", "Low", "High"). I arranged the factor levels correctly, but the plot fill ordering is not correct. How then can I correctly order a stacked ggplot2 plot?

dat %>%
  ggplot(aes(x = as.numeric(Month), y = Prop)) +
  geom_area(aes(fill = Level), position = 'stack') +
  scale_x_continuous(breaks = 1:12, labels = month.abb) +
  scale_fill_brewer(palette = "YlOrBr")

The Solution

Reorder the Stacked Area Plot

It seems ggplot2 orders the plot by the order in which the levels are consumed. That means we need to reorder the data itself (the rows), not the factor levels, in order to reorder the plot. I use the arrange function from the dplyr package to reorder the data so that ggplot2 will encounter the data levels in the correct order and thus plot as expected. Note that base R's order can be used to reorder the data rows as well. In the plot we can see that the plot fill ordering now matches the legend and factor level ordering as expected.

dat %>%
  arrange(desc(Level)) %>%
  ggplot(aes(x = as.numeric(Month), y = Prop)) +
  geom_area(aes(fill = Level), position = 'stack') +
  scale_x_continuous(breaks = 1:12, labels = month.abb) +
  scale_fill_brewer(palette = "YlOrBr")

This blog post has outlined a case where reordering the factor levels does not reorder the plot and how to address the issue. To leave a comment for the author, please follow the link and comment on their blog: R – TRinker's R Blog.
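The row-reordering idea from the post above (arrange(desc(Level)), or base R's order()) can be sketched with a tiny standalone data frame: it is the row order, not the factor levels, that changes.

```r
# Minimal sketch: reordering rows changes the order in which the levels are
# encountered, while the factor levels themselves stay fixed
df <- data.frame(Level = factor(c("Medium", "Low", "High"),
                                levels = c("High", "Medium", "Low")))
as.character(df$Level)    # "Medium" "Low" "High"   (original row order)

# base R equivalent of dplyr::arrange(df, desc(Level))
df2 <- df[order(df$Level, decreasing = TRUE), , drop = FALSE]
as.character(df2$Level)   # "Low" "Medium" "High"   (rows reordered)
levels(df2$Level)         # "High" "Medium" "Low"   (levels unchanged)
```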
by Verena Haunschmid

Since I have a cat tracker, I wanted to do some analysis of the behavior of my cats. I have shown how to do some of these things here.

Data Collection

The data was collected using the Tractive GPS Pet Tracker over a period of about one year, from January 2014 to November 2014 (with breaks). From March to November I additionally took notes in an Excel sheet about which cat was carrying the tracker.

Libraries you need

If you want to reproduce my example you need the following libraries: XML, plyr, xlsx, devtools, leaflet (installed via devtools::install_github("rstudio/leaflet")).

Loading the data into R

There are some methods to read .gpx files in R, but since I just wanted to use it for this one specific file, I created my own method:

readTrackingFile <- function(filename) {
  library(XML)
  library(plyr)
  xmlfile <- xmlParse(filename)
  xmltop <- xmlRoot(xmlfile)
  tracking <- ldply(xmlToList(xmltop[['trk']][['trkseg']]), function(x) { data.frame(x) })
  tracking <- data.frame("ele" = tracking$ele[seq(1, nrow(tracking), 2)],
                         "time" = as.character(tracking$time[seq(1, nrow(tracking), 2)]),
                         "lat" = tracking$.attrs[seq(1, nrow(tracking), 2)],
                         "lon" = tracking$.attrs[seq(2, nrow(tracking), 2)])
  tracking$ele <- as.numeric(levels(tracking$ele))[tracking$ele]
  tracking$lat <- as.numeric(levels(tracking$lat))[tracking$lat]
  tracking$lon <- as.numeric(levels(tracking$lon))[tracking$lon]
  time_pattern <- "%Y-%m-%dT%H:%M:%SZ"
  tracking$time <- strptime(as.character(tracking$time), time_pattern)
  tracking$min <- 60*tracking$time$hour + tracking$time$min
  message(paste("read", nrow(tracking), "tracking points"))
  return(tracking)
}

Then I used this method to read the tracks:

track <- readTrackingFile("../../data/LJWSIZUT.gpx")

And made a rudimentary plot to see where the data was:

# showed that some points were far off
plot(track$lon, track$lat, pch = 19, cex = 0.5)
track <- track[track$lat > 30, ]

Since we also used our tracker to track our vacation, I had to filter that out:

time_pattern <- "%Y-%m-%dT%H:%M:%SZ"
vacation_start <- strptime("2015-07-23T04:00:00Z", time_pattern)
vacation_end <- strptime("2015-08-04T22:00:00Z", time_pattern)
track_cat <- track[track$time < vacation_start | track$time > vacation_end, ]

To be able to distinguish the tracks, I loaded the Excel file I use to take notes and matched the dates in the Excel file with the ones in my data.frame:

cats <- read.xlsx("../tractive/data/Katzen1.xlsx", sheetIndex = 1, header = FALSE, stringsAsFactors = FALSE)
names(cats) <- c("TrackingDate", "Cat")
cats <- cats[!is.na(cats$Cat), ]
time_pattern <- "%d. %B %Y"
cats$TrackingDate <- strptime(paste(cats$TrackingDate, "2015"), format = time_pattern)

# add cat name
track_cat$cat <- "Unknown"
for (i in 1:nrow(track_cat)) {
  cat_idx <- which((cats$TrackingDate$mday == track_cat[i, "time"]$mday) &
                   (cats$TrackingDate$mon + 1 == track_cat[i, "time"]$mon + 1) &
                   (cats$TrackingDate$year + 1900 == track_cat[i, "time"]$year + 1900))
  if (length(cat_idx) == 1) {
    track_cat[i, "cat"] <- cats[cat_idx, "Cat"]
  }
}

After the vacation I did not take notes anymore, because Teddy was the only one who used the tracker:

track_cat[track_cat$time > vacation_end, "cat"] <- "Teddy"

Since there were some points far off, and I noticed those also had a very wrong elevation, I decided it would make sense to remove all points whose elevation deviated too far from the typical value. I guess there are more scientific ways to do this, but it did the trick:

track_cat_teddy <- track_cat[abs(track_cat$ele - 423) < 30 & track_cat$cat == "Teddy", ]

Create a map

To create the map I used the leaflet package. Since it did not work with the rainbow() function (I guess it can only take colors as hex values), I defined a vector with colors. I also defined a vector with the months I had data for. This could of course be done in a nicer way (without hard-coding the colors and the months), but since I currently only have this data set and don't need to create anything dynamically, I'll stick with the easiest way.
monthCol <- c("blue", "green", "yellow", "purple", "brown", "orange")
month <- c("March", "May", "June", "July", "August", "November")

I use the two methods leaflet() and addTiles() without any parameters, which creates a default world map with tiles. If you are wondering about %>%, this is the piping operator from the package magrittr.

catMap <- leaflet() %>% addTiles()

Then I loop over the unique months. Using <- assigns to catMap a new map with a new polyline added. The last line, catMap, renders the map in the Viewer tab of RStudio.

for (i in 1:length(unique(track_cat_teddy$time$mon))) {
  m <- unique(track_cat_teddy$time$mon)[i]
  catMap <- catMap %>%
    addPolylines(track_cat_teddy[track_cat_teddy$time$mon == m, "lon"],
                 track_cat_teddy[track_cat_teddy$time$mon == m, "lat"],
                 col = monthCol[i], group = m)
}
catMap

You might note that I added a parameter group to the method and assigned it the value of the month. This value is used to add a control to the map where you can select/deselect certain months. Unfortunately the legend does not display the color of the line.

catMap %>% addLayersControl(
  overlayGroups = month,
  options = layersControlOptions(collapsed = FALSE))

You can now use these checkboxes to select/deselect different polylines.

Where to use this map

You can do some useful things with this map. You can use R Shiny to create a nice app that includes your map. You can also export the HTML from the Viewer window by clicking "Export" > "Save as Web Page...".

Code and further links

The code is available in my GitHub repo. I recently published my collected data on my own blog. The other data is also available in a GitHub repo.
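One small refinement of the hard-coded monthCol/month pairing above: a named vector keeps each colour attached to its month, so the two vectors cannot drift out of sync. A sketch using the same values:

```r
# Sketch: a named colour vector instead of two parallel vectors
monthCol <- c("blue", "green", "yellow", "purple", "brown", "orange")
month <- c("March", "May", "June", "July", "August", "November")
colFor <- setNames(monthCol, month)
colFor["June"]            # named element "yellow"
unname(colFor["August"])  # "brown"
```

Inside the plotting loop, colFor[monthName] would then replace the positional monthCol[i] lookup.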
