Find Poisson probability (PDF and CDF) using free, online calculator: quick, easy, and accurate. Statistical table from Stat Trek.
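As a rough sketch of what such a calculator computes, the two quantities can be written in a few lines of Python. The function names here are illustrative and not taken from Stat Trek. For a discrete distribution like the Poisson, the "PDF" is usually called a probability mass function.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_cdf(k, lam):
    """P(X <= k): sum of the mass function from 0 up to k."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

# Hypothetical example: on average 2 events per interval.
p_exactly_3 = poisson_pmf(3, 2.0)
p_at_most_3 = poisson_cdf(3, 2.0)
print(p_exactly_3, p_at_most_3)
```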
Scoop.it!
The mosaic Package: Project MOSAIC is sponsoring work on an R package to facilitate teaching modeling, statistics, and calculus using R. The mosaic package is available on CRAN (the Comprehensive R Archive Network) and via GitHub....
Curve sketching for calculus ⇇
> Supportive
▣ Plotting, Derivatives, and Integrals for Teaching Calculus in ⒭ ▣
Bonus ◁
Scoop.it!
Cross-validation is a process by which a method that works for one sample of a population is checked for validity by applying the method to another sample from the same population. See also ◕
Surprisingly, many statisticians see cross-validation as something data miners do, but not a core statistical technique.
...It might be helpful to summarize the role of cross-validation in statistics...
Cross-validation is primarily a way of measuring the predictive performance of a statistical model.
Every statistician knows that the model fit statistics are not a good guide to how well a model will predict: high R² does not necessarily mean a good model.
It is easy to overfit the data by including too many degrees of freedom and so inflate R² and other fit statistics. For example, in a simple polynomial regression I can just keep adding higher order terms and so get better and better fits to the data. But the predictions from the model on new data will usually get worse as higher order terms are added...
Cross validation is a model evaluation method that is better than residuals. The problem with residual evaluations is that they do not give an indication of how well the learner will do when it is asked to make new predictions for data it has not already seen.
One way to overcome this problem is to not use the entire data set when training a learner. Some of the data is removed before training begins. Then when training is done, the data that was removed can be used to test the performance of the learned model on "new" data. This is the basic idea for a whole class of model evaluation methods called cross validation.
╔ The holdout method is the simplest kind of cross validation. The data set is separated into two sets, called the training set and the testing set. The function approximator fits a function using the training set only. Then the function approximator is asked to predict the output values for the data in the testing set (it has never seen these output values before).
The errors it makes are accumulated as before to give the mean absolute test set error, which is used to evaluate the model. The advantage of this method is that it is usually preferable to the residual method and takes no longer to compute. However, its evaluation can have a high variance.
The evaluation may depend heavily on which data points end up in the training set and which end up in the test set, and thus the evaluation may be significantly different depending on how the division is made. ◑
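The holdout idea can be sketched in plain Python. Everything here is an assumption chosen for illustration: the synthetic data, the 30% test fraction, and the straight-line model standing in for the "function approximator". Repeating the split with different seeds shows how much the estimate can move with the division of the data.

```python
import random

def fit_line(xs, ys):
    """Closed-form least squares fit of y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def holdout_error(xs, ys, test_fraction=0.3, seed=0):
    """Train on one part of the data, return mean absolute error on the rest."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(xs) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    return sum(abs(ys[i] - (a + b * xs[i])) for i in test_idx) / n_test

# Synthetic data (illustrative only): a noisy straight line.
noise = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 3.0 * x + noise.gauss(0, 1) for x in xs]

# Same data, five different train/test divisions.
errors = [holdout_error(xs, ys, seed=s) for s in range(5)]
print(errors)
```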
╚ K-fold cross validation is one way to improve over the holdout method. The data set is divided into k subsets, and the holdout method is repeated k times.
Each time, one of the k subsets is used as the test set and the other k-1 subsets are put together to form a training set. Then the average error across all k trials is computed.
The advantage of this method is that it matters less how the data gets divided. Every data point gets to be in a test set exactly once, and gets to be in a training set k-1 times.
The variance of the resulting estimate is reduced as k is increased. The disadvantage of this method is that the training algorithm has to be rerun from scratch k times, which means it takes k times as much computation to make an evaluation.
A variant of this method is to randomly divide the data into a test and training set k different times. The advantage of doing this is that you can independently choose how large each test set is and how many trials you average over. ◑
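A minimal sketch of the k-fold procedure described above, again with an assumed straight-line model and made-up data. Setting k equal to the number of data points gives the leave-one-out case.

```python
import random

def fit_line(xs, ys):
    """Closed-form least squares fit of y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def k_fold_error(xs, ys, k=5, seed=0):
    """Average the held-out mean absolute error over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    # k disjoint subsets; every point lands in exactly one test fold.
    folds = [idx[i::k] for i in range(k)]
    fold_errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        err = sum(abs(ys[i] - (a + b * xs[i])) for i in test_idx) / len(test_idx)
        fold_errors.append(err)
    return sum(fold_errors) / k

# Synthetic data (illustrative only).
noise = random.Random(1)
xs = [i / 10 for i in range(40)]
ys = [1.0 + 2.0 * x + noise.gauss(0, 0.5) for x in xs]

cv5 = k_fold_error(xs, ys, k=5)
loo = k_fold_error(xs, ys, k=len(xs))  # k = N: leave-one-out
print(cv5, loo)
```

The model is refit from scratch in every pass of the loop, which is the k-fold computational cost the text mentions.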
╝ Leave-one-out cross validation is K-fold cross validation taken to its logical extreme, with K equal to N, the number of data points in the set. That means that N separate times, the function approximator is trained on all the data except for one point and a prediction is made for that point. As before the average error is computed and used to evaluate the model.
The evaluation given by leave-one-out cross validation error (LOO-XVE) is good, but at first pass it seems very expensive to compute.
Fortunately, locally weighted learners can make LOO predictions just as easily as they make regular predictions. That means computing the LOOXVE takes no more time than computing the residual error and it is a much better way to evaluate models. We will see shortly that Vizier relies heavily on LOOXVE to choose its metacodes. ◑
╬ Improve Your Model Performance using Cross Validation (in Python and R) ◔
╍ Example ◓
Supportive
Easy Cross Validation in R with `modelr`◒
Cross-validation for Generalized Linear Models◗
Package ‘cvTools’ ◖
□ More R thingies
╠ Hey! Watch a video and take it further from there◂
Bonuses ┢ Cross-Validation for Predictive Analytics Using R ●
┡ How To Estimate Model Accuracy in R Using The Caret Package ○
┫ Bayesian networks and cross-validation ◉
Mhd.Shadi Khudr's insight:
Cross-Validation in plain English?
( ͡° ͜ʖ ͡°)( ͡° ͜ʖ ͡°)( ͡° ͜ʖ ͡°)( ͡° ͜ʖ ͡°)( ͡° ͜ʖ ͡°)( ͡° ͜ʖ ͡°)( ͡° ͜ʖ ͡°)( ͡° ͜ʖ ͡°)
Scoop.it!
"Expressing yourself in R", Hadley Wickham, Rice University.
This seminar series features dynamic professionals sharing their industry experience and cutting edge research within the human-computer interaction (HCI) field.
Each week, a unique collection of technologists, artists, designers, and activists will discuss a wide range of current and evolving topics pertaining to HCI...
Scoop.it!
The degrees of freedom (DF) are the amount of information your data provide that you can "spend" to estimate the values of unknown population parameters, and calculate the variability of these estimates.
This value is determined by the number of observations in your sample and the number of parameters in your model.
Increasing your sample size provides more information about the population, and thus increases the degrees of freedom in your data.
Note that adding parameters to your model (by increasing the number of terms in a regression equation, for example) "spends" information from your data, and lowers the degrees of freedom available to estimate the variability of the parameter estimates.
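A small numeric illustration of "spending" information, assuming the common convention that the remaining degrees of freedom are the number of observations minus the number of estimated parameters. The helper name and the sample values are made up for the example.

```python
import math
import statistics

def residual_df(n_observations, n_parameters):
    """Degrees of freedom left after estimating n_parameters from the data."""
    if n_parameters >= n_observations:
        raise ValueError("not enough observations for this many parameters")
    return n_observations - n_parameters

sample = [4.0, 7.0, 6.0, 3.0, 5.0]
n = len(sample)
mean = sum(sample) / n                   # estimating the mean spends 1 df
sum_sq = sum((x - mean) ** 2 for x in sample)
df = residual_df(n, 1)                   # n - 1 left for the variance
variance = sum_sq / df

# The standard library's sample variance uses the same n - 1 divisor.
print(df, variance, math.isclose(variance, statistics.variance(sample)))
```

Under the same convention, a regression with an intercept and two slopes fitted to 30 observations would leave `residual_df(30, 3)` degrees of freedom for estimating the error variance.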
Supportive
⌘ How to find them?
◉Towards an intuitive explanation!
Scoop.it!

Scooped by Mhd.Shadi Khudr 
Visualization of orthogonal (disjoint) or overlapping datasets is a common task in bioinformatics.
Few tools exist to automate the generation of extensively-customizable, high-resolution Venn and Euler diagrams in the R statistical environment.
To fill this gap the authors of this paper introduce VennDiagram, an R package that enables the automated generation of highly-customizable, high-resolution Venn diagrams with up to four sets and Euler diagrams with up to three sets.
Highly Supportive:
A Venn diagram is an illustration of the relationships between and among sets, groups of objects that share something in common. Usually, Venn diagrams are used to depict set intersections (denoted by an upside-down letter U).
This type of diagram is used in scientific and engineering presentations, in theoretical mathematics, in computer applications, and in statistics.
☟
DanteR?!
♒Bonus I:
Exact and Approximate Area-proportional Circular Venn and Euler Diagrams http://bit.ly/1CmoEeM
♒Bonus II:
VennPlex–A Novel Venn Diagram Program for Comparing and Visualizing Datasets with Differentially Regulated Datapoints
Post Image: http://1.usa.gov/1CmmvzG
Scooped by Mhd.Shadi Khudr 
In quantitative finance both R and Excel are the basic tools for any type of analysis.
Whenever one has to use Excel in conjunction with R, there are many ways to approach the problem and many solutions.
It depends on what you really want to do and the size of the dataset you’re dealing with. I list some possible connections in the table below.
RExcel is an add-in for Microsoft Excel. It allows access to the statistics package R from within Excel...
The Excel add-in RExcel.xla allows using R from within Excel. The package additionally contains some Excel workbooks demonstrating different techniques for using R in Excel.
Scooped by Mhd.Shadi Khudr 
Learn statistics in a practical, experimental way, through statistical programming with R, using examples from the health sciences. We will take you on a journey from basic concepts of statistics to examples from the health science research frontier.
Audit this course for free and have complete access to all of the course material, tests, and the online discussion forum. You decide what and how much you want to do...
Do you want to learn how to harvest health science data from the internet? Do you want to understand the world through data analysis? Start by exploring statistics with R!
In this course you will learn the basics of R, a powerful open source statistical programming language. Why has R become the tool of choice in bioinformatics, the health sciences and many other fields?
One reason is surely that it’s powerful and that you can download it for free right now. But more importantly, it’s supported by an active user community.
In this course you will learn how to use peer reviewed packages for solving problems at the frontline of health science research.
Commercial actors just can’t keep up implementing the latest algorithms and methods.
When algorithms are first published, they are already implemented in R. Join us in a gold digging expedition. Explore statistics with R.
Scooped by Mhd.Shadi Khudr 
Easy to use, lightweight statistics program.
Statist is a small and portable statistics program written in C.
It is terminal-based, but can utilise Gnuplot for plotting purposes.
It is simple to use and can be run in scripts.
Big datasets are handled reasonably well on small machines.
Download:
http://wald.intevation.org/frs/?group_id=12
Further Info:
Documentation:
Post Image: http://bit.ly/1iPA5WK
Scooped by Mhd.Shadi Khudr 
Interquartile range (IQR) is the difference between the third and the first quartiles in descriptive statistics.
Make use of this free online calculator to find the interquartile range from the set of observed numerical data (values).
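For readers who prefer to compute it directly, here is a small sketch using Python's standard library. Quartile conventions differ between tools, so the result may not match a given online calculator. The `method` argument below selects between two of the conventions that `statistics.quantiles` offers.

```python
import statistics

def interquartile_range(values, method="exclusive"):
    """Q3 - Q1, with quartiles from statistics.quantiles (n=4 cut points)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method=method)
    return q3 - q1

# Illustrative data only.
observations = [1, 2, 3, 4, 5, 6, 7, 8]
print(interquartile_range(observations))                      # exclusive convention
print(interquartile_range(observations, method="inclusive"))  # inclusive convention
```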
Supportive:
Post Image: http://bit.ly/150q7gh
Scooped by Mhd.Shadi Khudr 
Part of the "Data Science" Specialization »
Learn how to program in R and how to use R for effective data analysis. This is the second course in the Johns Hopkins Data Science Specialization.
In this course you will learn how to program in R and how to use R for effective data analysis.
You will learn how to install and configure software necessary for a statistical programming environment and describe generic programming language concepts as they are implemented in a high-level statistical language.
The course covers practical issues in statistical computing which includes programming in R, reading data into R, accessing R packages, writing R functions, debugging, profiling R code, and organizing and commenting R code.
Topics in statistical data analysis will provide working examples.
Post Image: http://bit.ly/1t0N91m
Scooped by Mhd.Shadi Khudr 
Misconception: In a very large data set, strong correlations supplant the need for causality; decisions made using Big Data analytics are quick, inexpensive and accurate.
Truth: Data analytics does not equate to the right decisions, nor does Big Data imply big ideas; a healthy dose of human intuition, experience and skepticism is needed to turn data into answers.
What have Rumpelstiltskin and Pygmalion Done With the Data?
Many years ago when I was pursuing an advanced degree in computational mathematics (arguably computer science at its geekiest), a professor offered two insights on mathematical modeling that turned out to be astonishingly prescient in these days of Big Data.
Here is my paraphrasing of his advice in the context of Big Data and my coinage of the terms Rumpelstiltskin Fallacy and Pygmalion Effect:
.
.
.
The Big Data Promise...
Data: More the Merrier?...
Substituting Causation with Correlation...
Big Data Does Not Equate Big Ideas...
.
.
.
Let’s face it, making decisions based on analytics alone is like driving a car blindfolded while taking directions from someone looking out the rear window.
To stay ahead of the competition, executives need to create a management culture that breeds natural skepticism towards conformity, foments instinctive scrutiny against Big Data results, and encourages decision-making using a mixture of metrics, experience and intuition.
Manage the talents, not the data.
Scooped by Mhd.Shadi Khudr 
If statements can be very useful in R, as they are in any programming language. Often, you want to make choices and take action dependent on a certain value.
Defining a choice in your code is pretty simple: If this condition is true, then carry out a certain task. Many programming languages let you do that with exactly those words: if . . . then. R makes it even easier:
You can drop the word then and specify your choice in an if statement.
Highly Supportive
↳ Nested if-else statement in R
↬ If Statement with more than one condition
Scooped by Mhd.Shadi Khudr 
Sometimes one needs to make two samples comparable, i.e. to compare measured values of a sample with respect to their (relative) position in the distribution. An often used aid is the z-transform, which converts the values of a sample into z-scores:
$$z_i = \frac{x_i - \bar{x}}{s}$$
with
$z_i$ ... z-transformed sample observations
$x_i$ ... original values of the sample
$\bar{x}$ ... sample mean
$s$ ... standard deviation of the sample
The z-transform is also called standardization or autoscaling. z-scores become comparable by measuring the observations in multiples of the standard deviation of that sample. The mean of a z-transformed sample is always zero. If the original distribution is a normal one, the z-transformed data belong to a standard normal distribution (μ=0, σ=1).
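A short sketch of the transform in Python. It assumes the sample standard deviation (n - 1 divisor), which is what `statistics.stdev` computes. The data values are invented for the example.

```python
import statistics

def z_scores(values):
    """Standardize values: subtract the sample mean, divide by the sample sd."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]

# Two samples on very different scales become directly comparable.
heights_cm = [160.0, 170.0, 180.0]
weights_kg = [55.0, 70.0, 85.0]
print(z_scores(heights_cm))
print(z_scores(weights_kg))
```

After the transform, the middle observation of each sample sits at zero and the outer ones at one standard deviation on either side, regardless of the original units.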
Useful background
> Super Supportive ✋
Scooped by Mhd.Shadi Khudr 
RSeek is a custom search engine that can help you find information on the official website, the CRAN, the archives of the mailing lists, the documentation of R and even selected websites. It is more effective than a simple Google search. ◌
Highly Supportive
R Site Search
This search will allow you to search the contents of the R functions, package vignettes, and task views.
Nabble R
R Multisite search
Useful Help 3 Available Packages
┢ Table of available packages, sorted by date of publication
┡ Table of available packages, sorted by name
Scooped by Mhd.Shadi Khudr 
Principal Components Analysis (PCA). What is it?
It is a way of identifying patterns in data, and expressing the data in such a way as to highlight their similarities and differences.
Since patterns in data can be hard to find in data of high dimension, where the luxury of graphical representation is not available, PCA is a powerful tool for analysing data.
The other main advantage of PCA is that once you have found these patterns in the data, you can compress the data, i.e. by reducing the number of dimensions, without much loss of information.
This technique is used in image compression, as we will see in a later section. General Tutorial
▷ PCA is a way of simplifying a complex multivariate dataset. It helps to expose the underlying sources of variation in the data. URL
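To make the idea concrete, here is a deliberately small sketch of PCA for two variables, written with the standard library only. It relies on the closed-form eigendecomposition of a 2x2 covariance matrix, so it illustrates the principle rather than replacing a general-purpose routine. The function name and data are assumptions for the example.

```python
import math

def pca_2d(points):
    """PCA for (x, y) pairs via the 2x2 sample covariance matrix."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Sample covariance entries (n - 1 divisor).
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    var_pc1, var_pc2 = half_trace + spread, half_trace - spread
    # Direction of the leading eigenvector (first principal component).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(theta), math.sin(theta))
    # One-dimensional "compressed" coordinates: projection onto pc1.
    scores = [(x - mean_x) * pc1[0] + (y - mean_y) * pc1[1] for x, y in points]
    return {
        "pc1": pc1,
        "variances": (var_pc1, var_pc2),
        "explained_ratio": var_pc1 / (var_pc1 + var_pc2),
        "scores": scores,
    }

# Illustrative data: y is roughly twice x.
data = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.2), (3.0, 5.8), (4.0, 8.1)]
result = pca_2d(data)
print(result["pc1"], result["explained_ratio"])
```

When the explained ratio is close to 1, keeping only the scores along the first component loses little information, which is the compression property described above.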
↑
↑
☟
▼
Factor analysis is similar to principal component analysis, in that factor analysis also involves linear combinations of variables. Different from PCA, factor analysis is a correlation-focused approach seeking to reproduce the intercorrelations among variables, in which the factors “represent the common variance of variables, excluding unique variance.”
See also this info
▼
See also this link
On the emergence of the MFA
▼
Network component analysis (NCA) takes advantage of partial network connectivity knowledge and is able to reconstruct regulatory signals and the weighted connectivity strength. In contrast, traditional methods such as PCA and ICA depend on statistical assumptions and cannot reconstruct regulatory signals or connectivity strength. Source
▼
■ On the relation between PCA and K-means clustering
Highly Important Comparative Note
△
◎Applications in computational biology
An obvious application of PCA is to explore high-dimensional data sets, as outlined above. Most often, three-dimensional visualizations are used for such explorations, and samples are either projected onto the components, as in the examples here, or plotted according to their correlation with the components.
As much information will typically be lost in two- or three-dimensional visualizations, it is important to systematically try different combinations of components when visualizing a data set.
As the principal components are uncorrelated, they may represent different aspects of the samples. This suggests that PCA can serve as a useful first step before clustering or classification of samples.
⇛ Support:
☛ What is Sparse Principal Component Analysis?
☝
⇢ The pcaPP R Package See also this URL
Well, are robust methods really any better? ↰
If so, then PCA or SPCA or NSPCA?
✍(◔◡◔)
➽ Bonus: PCA Explained Visually ⇖
http://setosa.io/ev/principalcomponentanalysis/
PCA technique is useful to emphasize variation and bring out strong patterns in a dataset. It's often used to make data easy to explore and visualize.
➻ Addendum 1
How many PCAs to use and other cool stuff...
➻ Addendum 2
Loadings vs eigenvectors in PCA: when to use one or another?
➻ Addendum 3
>> Further reading:
Scooped by Mhd.Shadi Khudr 
RASP (Reconstruct Ancestral State in Phylogenies) is a tool for inferring ancestral state using S-DIVA (Statistical dispersal-vicariance analysis), Lagrange (DEC), Bayes-Lagrange (S-DEC), BayArea, BBM (Bayesian Binary MCMC), BayesTraits and ChromEvol.
Scooped by Mhd.Shadi Khudr 
This calculator is free to use and is designed for biologists, ecologists, teachers, and students needing to quickly calculate the biodiversity indexes of an ecosystem.
First, enter the number of species, and then enter the name you wish to give the species, if available, and the given populations for each of the species—in any given order.
The script will return the Simpson and Shannon-Wiener values (among almost two dozen others) for the given data...
¶ Supportive Calculators:
♣ On Simpson Index:
A measure that accounts for both richness and proportion (percent) of each species is the Simpson's diversity index. It has been a useful tool to terrestrial and aquatic ecologists for many years and will help us understand the profile of biofilm organisms and their colonization pattern in the Inner Harbor.
The index, first developed by Simpson in 1949, has been defined three different ways in published ecological research. The first step for all three is to calculate Pi, which is the number of a given species divided by the total number of organisms observed.
♣ On Shannon Index:
This diversity measure came from information theory and measures the order (or disorder) observed within a particular system. In ecological studies, this order is characterized by the number of individuals observed for each species in the sample plot (e.g., biofilm on an acrylic disc).
It has also been called the Shannon index and the Shannon-Weaver index. Similar to the Simpson index, the first step is to calculate Pi for each category (e.g., species). You then multiply this number by the log of the number. While you may use any base, the natural log is commonly used (ln). The index is computed from the negative sum of these numbers.
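The two recipes above translate into a few lines of Python. The text does not spell out the three Simpson variants, so the sketch assumes the three forms commonly seen: D as the sum of the squared proportions, 1 - D, and 1 / D. The species counts are invented.

```python
import math

def proportions(counts):
    """Pi for each species: its count divided by the total count."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]

def shannon_index(counts):
    """Negative sum of Pi * ln(Pi), using the natural log."""
    return -sum(p * math.log(p) for p in proportions(counts))

def simpson_indices(counts):
    """Three commonly used forms built from D = sum of Pi squared (assumed here)."""
    d = sum(p * p for p in proportions(counts))
    return {"D": d, "1-D": 1 - d, "1/D": 1 / d}

# Hypothetical sample plot with four species.
counts = [40, 30, 20, 10]
print(shannon_index(counts))
print(simpson_indices(counts))
```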
♣♣ Important Definitions ♣♣
► Biodiversity:
Biological diversity, or biodiversity, is a term that is becoming more and more heard, yet few people really know what it is. There are many definitions for it, but there are two that will be given here.
The first is from the Convention on Biological Diversity, also known as the Rio Summit: "'Biological diversity' means the variability among living organisms from all sources including, inter alia, terrestrial, marine and other aquatic ecosystems and the ecological complexes of which they are part; this includes diversity within species, between species and of ecosystems."
The Canadian Biodiversity Strategy defines it as "…the variety of species and ecosystems on Earth and the ecological processes of which they are a part". It is often simply used as a catch-all term for nature. No definition is perfect; as with life itself, it's a bit nebulous and there are always exceptions.
► Biodiversity Indices
A Biodiversity Index gives scientists a concrete, uniform way to talk about and compare the biodiversity of different areas. Learn how to calculate this number yourself.
►Species Richness
Species Richness is the number of species present in a sample, community, or taxonomic group.
Species richness is one component of the concept of species diversity, which also incorporates evenness.
► Species Evenness
Evenness is the relative abundance of species. It refers to the evenness of distribution of individuals among species in a community. In other words, species evenness refers to how close in numbers the species in an environment are.
♣ Supportive Info:
Post Image: http://bit.ly/1FK0wI0
Scooped by Mhd.Shadi Khudr 
"Opinions surely differ a little on this, but in my opinion (and I suspect the opinions of others on this board), R is your best bet"...
➽ Super Supportive:
http://bit.ly/1ibfvig
➭ How best to learn R?
http://bit.ly/1gi9Mqe
➲➲ Time to Tweak it ➲➲
The fastest way to learn a new programming language
Post Image: http://bit.ly/1ibeR4n
Scooped by Mhd.Shadi Khudr 
Predictive analytics makes predictions about the unknown future using data mining and predictive modeling. Process, software and industry applications of predictive analytics.
Predictive analytics encompasses a variety of statistical techniques from modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events.
In business, predictive models exploit patterns found in historical and transactional data to identify risks and opportunities.
Models capture relationships among many factors to allow assessment of risk or potential associated with a particular set of conditions, guiding decision making for candidate transactions.
Predictive analytics is used in actuarial science, marketing, financial services, insurance, telecommunications, retail, travel, healthcare, pharmaceuticals and other fields.
>> Supportive:
>> Bonus:
"It's hard to make predictions, especially when they are about the future" is a quote usually attributed to American baseball legend Yogi Berra.
>> Not a good start when discussing predictive analytics...
>> What Are Predictive Analytics?!
>> The Traditional View...
>> Predicting the Present...
>> Shaping The Future...
>> A Better Approach...
>> Words of Warning...
http://bit.ly/1unDnE5
Post Image: http://bit.ly/1rx1m3M
Scooped by Mhd.Shadi Khudr 
There's a small but growing number of women who are single mothers by choice—and the narrative of single motherhood isn't complete without them...
Yet again, single mothers are in the news. The most recent Shriver Report has a list of statistics that make the plight of single motherhood seem quite daunting—numbers that say they are more likely to live with regret and at the height of poverty, struggling so much more than those with partners by their sides...
◐ But the research doesn’t always tell you the full story ◑
The Shriver report
◔ Stats on Gingerbread:
Gingerbread works to tackle the stigma around single parents by dispelling myths and labels.
◔ Rise of the single-parent family
◔ Single Motherhood Increases Dramatically For Certain Demographics, Census Bureau Reports
◔ The Mysterious and Alarming Rise of Single Parenthood in America
◔ Children in single-parent families by race
>> Complementary:
Scooped by Mhd.Shadi Khudr 
Postgraduate students from non-statistical disciplines often have trouble designing their first experiment, survey or observational study, particularly if their supervisor does not have a statistical background.
Such students often present their results to a statistical consultant hoping that a suitable analysis will rescue a poorly designed study.
Unfortunately, it is often too late by that stage.
A statistical consultant is best able to help a student who has some grasp of statistics.
It is appropriate to use the Web to deliver training when required and that is the mechanism used in this project to encourage postgraduate students to develop statistical thinking in their research.
Statistical Thinking is taught in terms of the PPDSA cycle and students are encouraged to use other Web resources and books to expand their knowledge of statistical concepts and techniques...
Post Image: http://bit.ly/1vbodof
Scooped by Mhd.Shadi Khudr 
STATS Indiana focuses on data for actionable use by Hoosier government, business, education, nonprofits, health organizations and anyone needing to understand “how many, how much, how high or low” for their community.
With nearly 1 million page views and more than 300,000 visits each year, STATS Indiana has won multiple awards from national organizations.
Because of its unique state government/public university partnership and its wide-ranging data and tools, it is frequently cited as a “data jewel in Indiana’s crown.”
STATS Indiana has become Indiana’s information utility and the heart of the Information for Indiana data dissemination channel.
It provides convenient access to data for geographic areas in Indiana and across the nation because we think context and the ability to compare areas on all measures is crucial.
The original catalyst for a statewide, digitally accessible database began with the Indiana Business Research Center at Indiana University's Kelley School of Business, but has received major support from the State of Indiana since the 1980s, becoming an outstanding example of the creative partnership that can occur between state agencies and state-funded research institutions.
>> About the Data
The data on STATS Indiana are provided by more than 100 federal and state agencies, along with commercial or private data sources.
The STATS Indiana database also powers Hoosiers by the Numbers, the Stats House and dozens of local and regional websites throughout Indiana.
We add value to these data in the form of calculations, graphs, comparisons of time or geography, time series and maps.
At STATS Indiana, timeliness and accuracy are both critical:
We work daily to ensure the data on STATS Indiana are updated as they are released from the source agencies — we don’t let new data sit in a queue waiting for a “scheduled” quarterly update. To help users know what to expect and when, we maintain a release calendar.
Each topic has a landing page that provides the data as well as metadata. These "About the Data" pages provide the essentials users need, including info on frequency, the specific source agency, geographic coverage, years of availability and any caveats related to the data.
>> About the Data
http://www.stats.indiana.edu/data_calendar/whats_new.asp
>> Special Toolkit
http://www.stats.indiana.edu/tools/index.asp
Scooped by Mhd.Shadi Khudr 
GRASS GIS, commonly referred to as GRASS (Geographic Resources Analysis Support System), is a free and open source Geographic Information System (GIS) software suite used for geospatial data management and analysis, image processing, graphics and maps production, spatial modeling, and visualization.
GRASS GIS is currently used in academic and commercial settings around the world, as well as by many governmental agencies and environmental consulting companies.
It is a founding member of the Open Source Geospatial Foundation (OSGeo).