What is statistics? – Charts – Boxplot

Boxplot: A Little History
The "box and whisker plot", or simply the Boxplot, was first used in 1970 by John W. Tukey, a very important mathematician. He wanted to produce a graph that would summarize the properties of a Continuous Distribution.

Boxplot: What is it?
A classic Boxplot includes:
i) A rectangular Box.
ii) The Median, which is drawn as a line inside the Box. If a Normal distribution is represented, this line is drawn in the center of the Box.
iii) A Whisker, shaped like the letter "T", drawn on each of the two opposite sides of the Box (above and below it in a vertical Boxplot).
The Boxplot can be drawn Horizontally or Vertically. Tukey's Boxplot has been modified in various ways in order to show additional statistics, such as the Arithmetic Mean.

Boxplot: What Information is shown?
A Classic Boxplot, as seen in the following figure, includes statistics about the dispersion of a dataset and, thus, about its shape. Specifically, it shows information about:
i) Median value (2nd Quartile): It is represented by a line inside the Box. This line separates the data of a variable exactly in half: 50% of the values are before this line and 50% of the values are after it. The position of this line shows whether most values are nearer to the 1st or to the 3rd Quartile.
ii) Quartiles: the central 50% of the values: The rectangular Box represents the values that lie between the 1st Quartile (Q1) and the 3rd Quartile (Q3). That is, the Bottom Side of the Box represents Q1 and the Upper Side of the Box represents Q3. In other words, the Box represents the Interquartile Range (IQR = Q3 - Q1), so the central 50% of the values of a Continuous variable is visually represented by this Box. Note that the length of the other two sides of the Box, those that do not carry whiskers, is arbitrary.
iii) Quartiles: the outer 50% of the values: 25% of the values of a variable lie before the Bottom Side of the Box (Q1), and another 25% lie after the Upper Side of the Box (Q3).
iv) Whiskers: The Bottom Whisker reaches down to 1.5*IQR below the 1st Quartile (Q1 - 1.5*IQR) and the Upper Whisker reaches up to 1.5*IQR above the 3rd Quartile (Q3 + 1.5*IQR). In Tukey's convention each whisker ends at the most extreme data value that still lies within these limits. Some modified Boxplots replace these "Whisker" values with the 2nd and 98th percentiles.
v) Outliers: The values that lie below the Bottom Whisker limit or above the Upper Whisker limit (below Q1 - 1.5*IQR or above Q3 + 1.5*IQR) can be described as outliers, that is, extreme values. These outliers are drawn as circles. In some modified Boxplots, the values that lie even further out (below Q1 - 3*IQR or above Q3 + 3*IQR) are drawn as a Star. These limits are also called "Fences".
vi) The minimum and maximum values of a variable can be suggested by a Boxplot too. Some modified Boxplots replace the values of the Whiskers with the minimum and maximum values of the variable.

Boxplot: Example figure
The figure below presents the "five" statistical points of the Boxplot as well as their relation to the Curve of a Standard Normal Distribution. The "gray star" shows where a possible outlier / extreme value would appear, IF such a value existed. Note that the Standard Normal Distribution, as drawn here, presents no Outliers / extreme values. It is symmetrical and therefore its Boxplot is also symmetrical: both whiskers are at the same distance from the Box and the Median is exactly in the center of the Box.
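The quantities a Boxplot displays can be computed directly. The following is a minimal sketch, not a definitive implementation: the sample values are hypothetical, the function name `boxplot_summary` is made up for illustration, and the quartiles use the "inclusive" method of Python's `statistics.quantiles`, which is only one of several quartile conventions (other tools may give slightly different Q1 and Q3).

```python
import statistics


def boxplot_summary(values):
    """Return the quartiles, fences, whisker ends and outliers of a dataset."""
    data = sorted(values)
    # Quartile conventions differ between tools; "inclusive" is an assumption here.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Tukey-style whiskers end at the most extreme values inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    # Values beyond 3*IQR are the ones some Boxplots draw as a star.
    far_outliers = [x for x in data if x < q1 - 3 * iqr or x > q3 + 3 * iqr]
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers, "far_outliers": far_outliers,
    }


# Hypothetical sample with one clearly extreme value (30).
sample = [2, 4, 5, 6, 7, 8, 9, 10, 11, 30]
summary = boxplot_summary(sample)
for name, value in summary.items():
    print(name, value)
```

With this sample the value 30 falls above both Q3 + 1.5*IQR and Q3 + 3*IQR, so it would be drawn as an extreme point rather than covered by the upper whisker.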
Boxplot: Modifications
Tukey's Boxplot has been modified in a number of ways in order to present additional properties of the dataset:
i) The Traditional Boxplot, as it was described by Tukey.
ii) Variable Width Boxplot: The size of the sample defines the width of the Box.
iii) Notched Boxplot: The notches emphasize the Median (they are commonly drawn as a confidence interval around it).
iv) Violin plot: The outline (the yellow one in the figure) shows the Probability Density (pdf) of the dataset / group.
v) Vase plot: The outline (the yellow one in the figure) shows the Modality of the dataset: Unimodal or Bimodal?
vi) Bean plot: Each black line is an individual observation of the dataset, and the width of a line can indicate duplicate values. The big line shows the mean.
What is statistics? – Charts – Bar chart

Bar Chart: Definition
A Bar Chart or Bar Graph shows visually the size of the levels or Categories of a Qualitative Variable. This makes it easier to compare these categories and to draw useful conclusions.

Bar Chart: Bars
Each bar can represent a Main Category as well as its subcategories. Subcategories can be presented either one upon another on a single bar, or side by side, always grouped by Main Category. Space exists between the Main Categories. This space signifies that a Discrete (non-continuous) Variable is displayed, and it is the main visual difference from a Histogram. The bars can be either Horizontal (starting from the Y axis) or Vertical (starting from the X axis). One of the axes displays the names or levels of the Categories, and the other axis displays the Discrete values of the variable. The Height of each bar is proportional to the size of the Main Category or of the subcategory it represents.

Bar Chart: A little History
A type of Bar chart was first used by Nicole Oresme (14th century) and by Sir Isaac Newton (17th century) to present visually the Laws of Motion: velocity and acceleration per unit of time. In the 18th century, Joseph Priestley presented a timeline of Biographies that was a type of Bar Graph. William Playfair (18th century) presented a Bar chart of the same type that we use today. It showed the imports and exports between Scotland and various countries for one year.

The Bar Chart that was presented by William Playfair (18th century)
Bar Chart: Horizontal vs Vertical bars
There are two main types of Bar chart based on how the bars are positioned:
i) Horizontal Bar Chart: The bars are positioned horizontally, parallel to the X axis.
ii) Vertical Bar Chart or Column Chart: The bars are positioned vertically (standing), parallel to the Y axis.

Bar Chart: Subcategories presentation
The type of a Bar chart can also differ based on how a Main Category and its subcategories are presented:
i) Stacked Bar Chart: The subcategories of a Main Category are presented in a single bar, one upon another, like Lego bricks. The size/height of the bar then shows the total size of the Main Category. It can present either the absolute size (f) or the Relative size (f%) of each subcategory. If the relative size is presented, then the total size of every bar/Main Category equals 100%.
ii) One in front of the other according to height: A bar can be presented in front of another bar, in ascending order of height/size. The subcategories are then told apart visually, e.g. by different colors. The Main Categories are presented with space between them.
iii) Side by Side: The subcategories of a Main Category are presented side by side, grouped under this Main Category. The Main Categories/grouped bars are presented with space between them.

Vertical Bar Chart: Example Ι: Side by Side
This Bar graph compares 470 cats by size, big cats (Red Bars) versus small cats (Blue Bars), and by color: Black, White, Brown, two colors, and Multicolor (fake data):
i) Multicolor is the largest color category (190 cats), made up of 100 Small cats and 90 Big cats.
ii) Big Brown cats are the least represented category.
Horizontal Stacked Bar Chart: Example ΙΙ
The following Horizontal Stacked Bar Chart presents the number of bottles of Organic ("Biological") wine (Orange bar) versus Non-organic wine (Deep Blue bar) that were sold in a region over four months, from January to April (fake data):
i) The total number of bottles sold in February is the highest of these four months.
ii) Organic wine sold most in March and least in April.

Horizontal Stacked (100%) Bar Chart: Example ΙII
The following Horizontal Stacked (%) Bar Chart compares the percentage of Green Tea (Green bars) versus Other types of Tea (Light Blue bars) among the Cold and Hot cups of Tea that were ordered in a Restaurant, e.g. in August (fake data):
i) Almost 40% of the clients who ordered a Cold Tea chose Green Tea and 60% chose another type of Tea.
ii) 20% of the clients who ordered a Hot Tea chose Green Tea and 80% chose another type of Tea.
iii) The total quantity of Green Tea cups cannot be compared with the total quantity of Other Tea cups, and Hot cannot be compared with Cold, because both bars are expressed in %.
iv) For example, clients may have ordered 200 Hot Green Tea cups versus 800 Hot Other Tea cups, and only 4 Cold Green Tea cups versus 6 Cold Other Tea cups.
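The point of the last example can be checked with a few lines of arithmetic. The sketch below uses the fake counts quoted above (200/800 hot cups, 4/6 cold cups); the dictionary layout and the function name `to_percent` are only illustrative assumptions.

```python
# Fake order counts taken from the example above.
orders = {
    "Hot": {"Green": 200, "Other": 800},
    "Cold": {"Green": 4, "Other": 6},
}


def to_percent(counts):
    """Convert the absolute counts of one bar into shares that add up to 100%."""
    total = sum(counts.values())
    return {name: 100 * value / total for name, value in counts.items()}


shares = {bar: to_percent(counts) for bar, counts in orders.items()}
totals = {bar: sum(counts.values()) for bar, counts in orders.items()}

for bar in orders:
    print(bar, shares[bar], "total cups:", totals[bar])
```

Both bars reach 100% in the stacked (%) chart, although one stands for 1000 cups and the other for only 10. The absolute totals have to be reported separately if they matter.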
Type of Research – Focus Group

Focus Group: Definition
A Focus group is usually a small group of participants who were selected to take part in the discussion of a particular topic. Focus groups are used in many settings, such as Social and Marketing research. The Focus group is a qualitative research method.

Focus Group: A little History
The phrase "Focus group" was coined by the Austrian-American Ernest Dichter, who specialized in Marketing Psychology. He formed a focus group of children who watched television advertisements and videotaped their reactions to these ads. The results of this Focus group led to the invention of "Barbie". He specifically noted that: "What they wanted was someone sexy looking, someone that they wanted to grow up to be like," "Long legs, big breasts, glamorous."

Focus Group: Properties
The members of a Focus Group are usually selected on the basis of specific personal characteristics. For example, a company has developed a new organic ("biological") product and would like to find out how to build its marketing strategy. To obtain this information, it must find participants who are sensitive to organic products in personality / attitude. These participants can form a focus group for this aim.

Focus Group: Single-way or Two-way
The Single-way Focus group, or traditional Focus Group, is a group of selected people who participate in the discussion of a particular topic. A Two-way Focus group consists of the Traditional Focus Group plus an additional group of selected people who actively observe and discuss the behaviors and attitudes of the Traditional Focus Group.

Focus Group: Two Moderators
A Focus group can have two Moderators. The first Moderator is responsible for keeping the discussion progressing smoothly. The second Moderator is responsible for checking that all topics of interest are covered.
Focus Group: Dueling Moderators
A Focus group can have two Moderators who deliberately take opposite sides in order to develop the group dynamics further.

Focus Group: Respondent moderator
A moderator can select a respondent from the group members to act temporarily as Moderator, in order to develop the group dynamics further.

Focus Group: Conformity
A main disadvantage of Focus groups, and of all types of groups in general, is conformity. That is, the members of a group can converge on the same opinions. The personality of each member, as well as how the moderators steer the group discussion, can influence the level of conformity.
Type of Variables – Dependent and Independent Variables

Variable: Defined based on how it will be used/treated
Variables can be divided into Dependent variables and Independent variables based on how a researcher treats them. Note that the scale on which a variable has been measured does not determine whether it can be treated as a Dependent or as an Independent variable, but it does determine the statistical methods with which it can be analyzed.

Dependent variables: Definition
The Dependent variable is the presumed result that a researcher expects to see after the manipulation of other factors, usually of the Independent variable.

Independent Variables: Definition
An Independent variable is a variable that can influence the status of another variable, the Dependent variable. It is the variable that the researcher manipulates in order to see the presumed effect on the Dependent variable. In summary, the Independent variable is the presumed cause and the Dependent variable is the presumed effect or result. That is, the effects of the Independent variable on the Dependent variable are studied.

Example I
A researcher is interested in studying the effects of Tea and Coffee (Independent variable / manipulated) on the Quality of Sleep of participants (Dependent variable / measured outcome).

Example ΙΙ
A researcher is interested in studying whether different types of music (Independent variable / manipulated) can influence the hours that you sleep (Dependent variable / measured outcome).

Extraneous Variables
Extraneous Variables are those variables that a researcher failed to control or to include in his/her research design but that can influence the final outcome. That is, Extraneous Variables can alter the influence of the Independent variable on the Dependent variable.
Extraneous Variables: Special case: Confounding variable
When the levels of the Independent variable vary together with some Extraneous Variable, this variable is called a Confounding Variable.

Extraneous Variables: Special case: Control variable
When the Experimenter examines the relationship between the Independent and the Dependent variable and keeps some other variables constant, these variables are called Control variables.

Extraneous Variables: Example Ι
A researcher is interested in studying the effects of Tea and Coffee (Independent variable / manipulated) on the Quality of Sleep of participants (Dependent variable / measured outcome). Here, Extraneous Variables can be the mood of a participant, what he/she has eaten or drunk, or how many hours he/she slept before the experiment. These Extraneous Variables may have affected his/her Quality of Sleep. That is, Tea or Coffee may have had an effect on the Quality of Sleep, but these Extraneous Variables may also have influenced it.

Extraneous Variables: Example ΙΙ
A researcher is interested in studying whether different types of music (Independent variable / manipulated) can influence the hours that you sleep (Dependent variable / measured outcome). The sound level or the hearing quality of the participants can be Extraneous Variables that may also have affected the hours of sleep. That is, a higher sound level or limited hearing may have tired some participants more or less than others, and thus their sleep was influenced.

Confounding variable: Example explained
If the Experimenter had divided the sample into age levels / categories, e.g. young vs old, then the older participants may have had limited hearing, or may have been more prone to disturbed sleep, than the younger ones. Hearing capability is then a confounding variable of this experiment.

Control variable: Example explained
If the Experimenter would like to test the Quality of Sleep by manipulating the type of Drink taken before sleep, while keeping constant e.g. the room temperature and the noise inside and outside the building, then these variables are called control variables.

As it appears, a researcher may fail to take into consideration all the extraneous factors / variables that can influence the effect of the Independent variable on the Dependent variable in an experiment. A well-designed experiment will try to control most of these factors / variables.
What is statistics? – Standard Normal Table – z scores – Rules – Part III

Explanation on Rules of z / Φ(z) calculations
Many publications and Internet sites give some rules for Φ(z). These rules are based on the Symmetry of the Standard Normal Distribution. In order to explain these rules fully:

Explanation of Starting and End point of Φ(z) / cdf in General
- The Starting point is the point where the "Area under the Curve" begins.
- The End point is the point where the "Area under the Curve" finishes.
- The Starting point and the End point are points on the Horizontal X axis.
- The Starting point and the End point define Φ(z).
- The values of Φ(z) start from Zero "0" and increase up to One "1".
- The Total "Area under the Curve" is Equal to One "1".
- The Total "Area under the Curve" has Negative and Positive Infinity as its Starting and End point.
- Usually, Standard Normal Tables include z values from -4 to +4.

Explanation of Starting and End point of Φ(z) in Standard Normal Tables
- The Starting point of Φ(z) for the cell values of a given Table can DIFFER from the Starting point of a related Φ(z) question.
- The Starting point of Φ(z) on the X Axis may differ between publications: i) Negative Infinity (most cases) or ii) Positive Infinity.
- A publication may state implicitly or explicitly the "starting point" of the Table's Φ(z) cell values, as well as that of the related Φ(z) question.
- Some Normal (z / Φ(z)) Tables present only the Positive z values, from 0.00 to +4.00, and their corresponding Φ(z) values, from 0.50 to 1.

Explanation of Starting and End point of Φ(z) in Questions
- Note that some authors use the symbols "big Z" or "big X", which both refer to the Horizontal Axis of the Standard Normal Distribution.
- Therefore, the "small z" or the "small x" refers to some specific value on this axis.
- Intervals on the X Axis of the Standard Normal Distribution can be denoted with the symbols "a" and "b".
Explanation of Probability symbolism
- The notation $P(Z \le z)$ requests the probability, a Φ(z) value, that has as:
i) Starting point, Negative Infinity.
ii) End point, any point that is higher than Negative Infinity.
iii) The End point is included in the calculation.
- The notation $P(Z > z)$ requests the probability that has as:
i) Starting point, Positive Infinity.
ii) End point, any point that is lower than Positive Infinity.
iii) The End point is not included in the calculation.

The following rules are based on z / Φ(z) tables that have Negative Infinity as their Starting point
- IF the starting point of Φ(z) in your Table is Positive Infinity,
- then the guides for RULE I must be replaced by the guides of RULE II, and
- the guides for RULE II must be replaced by the guides of RULE I.
- The Table included in this page contains both Negative and Positive z values and their corresponding Φ(z) values, with Negative Infinity as starting point. Note that the "Left Table" includes the Negative z values and the "Right Table" the Positive z values. The cells inside the Table are the Φ(z) values (Probabilities).

For more information about how a Standard Normal Table is used, please see Part II.
Rule I
What is the Probability of a "Z value" being lower than or equal to a specific "z value"?
- The starting point of the Question is Negative Infinity and the end point is any value higher than Negative Infinity.

A) Tables that provide both Positive and Negative z values – Φ(z)
Probabilities for Positive and Negative z values: $P(Z \le z) = \Phi(z)$ or $P(Z \le -z) = \Phi(-z)$
- The shorthand "Φ(z)=z" or "Φ(-z)=-z" that some sources use actually means "Φ(z) or Φ(-z) = Corresponding Table cell value".
- The corresponding Φ(z) / Φ(-z) value of a Positive or Negative z value can be found by: a) taking the given z value and then b) finding the corresponding Φ(z) value in the Table.

B) Tables that provide ONLY the Positive z values – Φ(z)
Probabilities for Positive z values: $P(Z \le z) = \Phi(z)$
- The shorthand "Φ(z)=z" actually means "Φ(z) = Corresponding Table cell value".
- The corresponding Φ(z) value of a Positive z value can be found by: a) taking the given z value and then b) finding the corresponding Φ(z) value.
Probabilities for Negative z values: $P(Z \le -z) = \Phi(-z) = 1 - \Phi(z)$
- This actually means "Φ(-z) = 1 - Φ(z) = 1 - Corresponding Table cell value".
- The corresponding Φ(-z) value of a Negative z value can be found by: a) treating this "Negative z value" as a positive one, b) finding the corresponding Φ(z) value and c) subtracting this Φ(z) value from One (1). Remember, the total "Area under the Curve" is equal to one (1).

Rule II
What is the Probability of a "Z value" being Higher than a specific "z value"?
- The starting point of the Question is Positive Infinity and the end point is any value lower than Positive Infinity.

A) Tables that provide both Positive and Negative z values – Φ(z)
Probabilities for Positive and Negative z values: $P(Z > z) = 1 - \Phi(z)$ or $P(Z > -z) = 1 - \Phi(-z)$
- The shorthand "1-Φ(z)=1-z" or "1-Φ(-z)=1-(-z)" actually means "1 - Φ(z or -z) = 1 - Corresponding Table cell value".
- The probability for a Positive or Negative z value can be found by: a) taking the given z value, b) finding the corresponding Φ(z) value and c) subtracting this Φ(z) value from One (1).

B) Tables that provide ONLY Positive z values – Φ(z)
Probabilities for Positive z values: $P(Z > z) = 1 - \Phi(z)$
- The shorthand "1-Φ(z)=1-z" actually means "1 - Φ(z) = 1 - Corresponding Table cell value".
- The probability for a Positive z value can be found by: a) taking the given z value, b) finding the corresponding Φ(z) value and c) subtracting this Φ(z) value from One (1).
Probabilities for Negative z values: $P(Z > -z) = 1 - \Phi(-z) = \Phi(z)$
- This actually means "1 - Φ(-z) = Φ(z) = Corresponding Table cell value".
- The probability for a Negative z value can be found by: a) treating this "Negative z value" as a positive one and then b) finding the corresponding Φ(z) value.

Rule ΙΙΙ
What is the Probability of a "Z value" being Higher than a specific "z value" (a) and Lower than or Equal to another specific "z value" (b)?
- The Starting and the End point of the Question can be any TWO points "a" and "b" on the X axis, representing Z values, with a lower than b. Therefore: $P(a < Z \le b) = \Phi(b) - \Phi(a)$
- This expression consists of TWO main parts:
a) $P(Z > a)$, which has been fully analyzed in Rule ΙΙ.
b) $P(Z \le b)$, which has been fully analyzed in Rule Ι.
c) Find the Φ(z) values of the points "a" and "b" from the Table, as in Rules I and II.
d) Subtract the Lowest Φ(z) value, Φ(a), from the Highest Φ(z) value, Φ(b), that is: "Φ(b) - Φ(a)".
The result, as in all the previous cases, shows the Percentage of the Data represented by the "Area under the Curve" between these two points (a and b) of a Random Variable X.
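The three rules can be checked numerically without a printed table. The sketch below is only an illustration: it computes Φ(z) from the error function in Python's `math` module (the identity Φ(z) = ½·(1 + erf(z/√2)) stands in for the table lookup), and the function names are made up for this example.

```python
import math


def phi(z):
    """Standard normal cdf, the area under the curve from -infinity up to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_lower_or_equal(z):
    """Rule I: P(Z <= z) = phi(z)."""
    return phi(z)


def p_higher(z):
    """Rule II: P(Z > z) = 1 - phi(z)."""
    return 1.0 - phi(z)


def p_between(a, b):
    """Rule III: P(a < Z <= b) = phi(b) - phi(a), with a < b."""
    return phi(b) - phi(a)


print(round(p_lower_or_equal(1.96), 4))   # area to the left of z = 1.96
print(round(p_higher(1.96), 4))           # area to the right of z = 1.96
print(round(p_between(-1.0, 1.0), 4))     # area between z = -1 and z = +1

# Symmetry used with tables that list only positive z values:
print(round(phi(-1.0), 4), round(1.0 - phi(1.0), 4))
```

The last line shows the symmetry that Rule I (B) and Rule II (B) rely on: the value for a negative z equals one minus the value for the same z taken as positive.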
Quartiles, Deciles and other weird things of Statistics. What do they mean, especially for grouped data? How can they be calculated and interpreted?
Method A: The Median of the Median: Odd Cases: Example
Let's say we have a dataset of nine (n=9) numbers in ascending order (a prerequisite for finding the Median value / Quartile positions), whose middle value is 11.
i) Because the dataset includes an Odd number of values, the Median equals the value of the middle number of the set, which is 11.
ii) Every Median splits a dataset into two subsets: 50% of the values are left of the Median and the other 50% are right of the Median.
iii) We therefore have 4 values left of "11" (Median) and another 4 values right of "11" (Median). The two middle values of the left subset are 7 and 10, and the two middle values of the right subset are 13 and 15.
iv) In the next step, we must find the Median of each Arithmetic subset, which is the Arithmetic Mean (average) of its two middle numbers, because each subset in this example contains an Even number of values.
v) Therefore, the Medians of these two Arithmetic subsets are 8.5 (Q1 position) and 14 (Q3 position), respectively:
- Left Arithmetic subset: Middle values 7 and 10. Therefore: $Q_1 = (7 + 10) / 2 = 8.5$
- Right Arithmetic subset: Middle values 13 and 15. Therefore: $Q_3 = (13 + 15) / 2 = 14$
vi) The Median value of each Arithmetic subset corresponds to the value in the Q1 and Q3 position, respectively. The InterQuartile Range can then be referred to as "8.5 to 14".
vii) The value of the InterQuartile Range is found by subtracting the value in the Q1 position from the value in the Q3 position. Therefore, the value of the InterQuartile Range in this case is: $IQR = Q_3 - Q_1 = 14 - 8.5 = 5.5$
Method A: The Median of the Median: Even Cases
This method is entirely based on the Median. Note that the Median is always found in the Q2 position, and that every Median / Q2 position splits any given Arithmetic set in half. The value of the InterQuartile Range can be found with the following steps:
i) Find the Median value of the given Arithmetic set.
- If the Arithmetic set includes an Even number of values, e.g. 10, 12 or 100, calculate the Arithmetic Mean (average) of the two middle values.
- If the Arithmetic set includes an Odd number of values, e.g. 9, 11 or 101, the middle number is the Median of the set.
ii) The Median / Q2 position virtually produces two new Arithmetic subsets that contain the same number of values. That is, 50% of the values of the Main Arithmetic set are Left of the Median and 50% are Right of the Median. In this method, the Median itself is never included in these two new Arithmetic subsets.
iii) Find the Median value of each of these two new Arithmetic subsets, with the same rule as in step (i).
- If a subset contains an Odd number of values (e.g. a Main set of 10 values gives subsets of 5 values), its middle number is its Median.
- If a subset contains an Even number of values (e.g. a Main set of 9 values gives subsets of 4 values), calculate the Arithmetic Mean (average) of its two middle values. This value is the Median of the subset.
iv) These two new values correspond to the values in the 1st Quartile position and in the 3rd Quartile position of the Main Arithmetic set.
v) Finally, to find the value of the InterQuartile Range, subtract the value in the Q1 position from the value in the Q3 position.
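The steps of Method A can be sketched in a few lines of code. This is an illustration under stated assumptions, not a reference implementation: the function names are made up, and both datasets are hypothetical. The first one was simply chosen so that its subsets have the middle values 7, 10 and 13, 15.

```python
def median(sorted_values):
    """Median of an already sorted list (steps i and iii)."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]                           # odd size: the middle value
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2  # even size: average of the two


def iqr_median_of_medians(values):
    """Return (Q1, Q3, IQR) using Method A; needs at least two values."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                                   # values left of the Median
    upper = data[half + 1:] if n % 2 == 1 else data[half:]  # the Median itself is left out
    q1 = median(lower)
    q3 = median(upper)
    return q1, q3, q3 - q1


# Hypothetical odd-sized dataset (n = 9) with Median 11.
odd_case = [3, 7, 10, 10, 11, 12, 13, 15, 18]
print(iqr_median_of_medians(odd_case))

# Hypothetical even-sized dataset (n = 10): each subset has 5 values.
even_case = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(iqr_median_of_medians(even_case))
```

Other quartile definitions, such as the formula-based Method B, can return different Q1 and Q3 values for the same data, which is why the method used should always be stated.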
Calculating the InterQuartile Range
There are two main methods for finding the value of the InterQuartile Range: one based on the Median (Method A) and one based on a Statistical Formula (Method B). The two methods may produce different results. Therefore, you must state which method you used to find the value of the IQR.
What is a Quartile:
i) While Percentiles produce 100 equal pieces by means of 99 dividing positions, Quartiles produce four (4) equal pieces by means of 3 dividing positions.
ii) Therefore, the position indicated by the first Quartile (Q1) implies that 25% of the data are before this position, while 75% are after it.
iii) Another 25% of the data are between the position indicated by Q1 and the position indicated by Q2.
iv) Another 25% of the data are between the position indicated by Q2 and the position indicated by Q3.
v) The final 25% of the data are after the position indicated by Q3. That is: 4*25% = 100%.
Watches and Statistical Formulas.... Can Watches be used as Infographics for Statistical Formulas? Measurements of Central Tendency for Grouped Data
It is amazing how many things you can Learn about what is A (Statistical) Range!!!

The largest study to date on the relationship between mental disorders and creativity, which examined most of the population of Sweden, appears to confirm that scientists, artists and other people who practice...
Via Ioannis
What is statistics? – Graph figures – Histogram

Histogram: Etymology
The Histogram was first introduced by Karl Pearson in 1891. Pearson is said to have coined the word "histogram" from the two words "historical diagram", which describes the function of a histogram: it is a graph figure used to display past data. Another, less likely, explanation is that the word "Histogram" is the product of two Greek words: "histos", which means "web" (literally "anything set upright", from histasthai "to stand"), and "-gram", which means "something written" (Online Etymology Dictionary).

Histogram: Definition
A Histogram is a chart that consists of bars called bins. A histogram tells graphically the story of "how many" values fall in a specified range of a dataset. For example, in a dataset that runs from 1 to 10, the range 1.3 to 2.3 can include 20 values and the range 2.4 to 3.4 can include 40 values, while the width of the ranges stays the same. Each value belongs to only one range.

Histogram: What its bins can represent
There is no space between the bins, which shows that a continuous (quantitative) variable is depicted. Each bar/bin represents a range. The raw values of the dataset appear on the X axis. The height of a bin shows the Frequency, that is, the number of values that fall in its range. The Frequency or the Relative Frequency is depicted on the Y axis.

Histogram: Usages
The Histogram is one of the most used graphical tools. It can be used to check the type of a distribution: how many modes does a dataset have, one or several? It can also be used to check the dispersion and shape of a dataset: what Skewness or Kurtosis does it have? It is common practice to overlay the curve of the distribution on a histogram. In that way, quick information on the shape and dispersion of a dataset / distribution is provided.
Histogram: How to calculate the width of ranges / bins and How many Ranges to divide your dataset
The most common approach is for all the ranges of a histogram to have equal width. That is, the dataset is divided into a number of ranges that can vary BUT that all have the same width, e.g. 5 points. Note that as the width of the ranges increases for a given dataset, the number of ranges that represent it decreases, and as the width of the ranges narrows, the number of ranges increases. A variable that is represented by too few bins (ranges), so that each range is very wide relative to the size of the dataset, can hide the real shape of the distribution. Note also that some ranges can contain zero (0) values, in which case an empty space appears in that position instead of a bin.

Histogram: How to divide your dataset in ranges
Multiple rules of thumb exist that relate the size of a Continuous Variable to the "optimal" number of ranges and the "optimal" width of these ranges (bins). Such formulas are given below:
i) Number of ranges from a given width: $k = \lceil (\max - \min) / W \rceil$
ii) Square root option (Tukey & Mosteller, 1977): $k = \sqrt{n}$
iii) Sturges' formula (1926): $k = \lceil \log_2 n + 1 \rceil$
iv) Rice Rule (Terrell & Scott, 1985): $k = \lceil (2n)^{1/3} \rceil$ *
*Note that most websites present the formula without the parentheses, which produces results that are not in agreement with most of the other formulas.
v) Wichard's rule (2008)
vi) Scott's Rule (1979): $W = 3.49\,\sigma\,n^{-1/3}$
vii) Freedman–Diaconis rule (1981): $W = 2\,\mathrm{IQR}\,n^{-1/3}$
viii) Bendat & Piersol Rule (1966): $k = 1.87\,(n-1)^{0.4}$
ix) Doane's Rule (1976): $k = 1 + \log_2 n + \log_2\left(1 + |\gamma| / \sigma_\gamma\right)$, where $\sigma_\gamma = \sqrt{6(n-2) / ((n+1)(n+3))}$ and $|\gamma|$ is the absolute value (signs are omitted) of the skewness.
x) Cochran (1954)
xi) Rule of Twelve: Any random continuous variable can be represented by 12 ranges.

Comments
Note that the formulas denoted with $k$ give the number of ranges, and thus the number of bins, while the formulas denoted with $W$ give the width of the ranges. $n$ is the size of a variable. $\mathrm{IQR}$ is the value of the interquartile range. $\sigma$ denotes the standard deviation of the sample (sd). $\lceil\ \rceil$ shows the rounding of a decimal number up to the nearest whole number, e.g. 2.10 –> 3.

Histogram: Example
For the given example, we use the Galton dataset, which contains the Height of 928 children in inches. Height is a continuous variable. Here, the measurement unit of Height has been converted from inches to cm.

Real limits of a Range
The real limits of e.g. the first range are not the values 156 and 160 but the values 155.5 and 160.5. These are the real lower and upper limits of this Range. This method ensures that when data are grouped in ranges, each range represents a unique set of numbers. In the following table, the height of these children has been grouped into seven (7) ranges. The real limits of these ranges are presented too.

Height range (cm)   Real range (cm)   Simple Frequency (F)   Cumulative Frequency   Relative Frequency (F%)   Relative Cumulative Frequency
156-160             155.5-160.5       44                     44                     4.7%                      4.7%
161-165             160.5-165.5       59                     103                    6.4%                      11.1%
166-170             165.5-170.5       165                    268                    17.8%                     28.9%
171-175             170.5-175.5       258                    526                    27.8%                     56.7%
176-180             175.5-180.5       266                    792                    28.7%                     85.4%
181-185             180.5-185.5       105                    897                    11.3%                     96.7%
186-190             185.5-190.5       31                     928                    3.3%                      100%

Histogram and types of Frequency
The table presents (i) the Simple Frequency (F) and (ii) the Cumulative Frequency, as well as the Relative Frequency (F%) of (iii) the Simple and (iv) the Cumulative Frequencies.

Frequencies
i) The Simple Frequency tells "how many values" fall in a range, e.g. the range 156-160 contains 44 height values.
ii) The Relative Frequency is the ratio of the Simple Frequency of a given range over the Total Simple Frequency of the heights.
- For example, the Simple Frequency of the third Height Range (166-170) is 165, so its Relative Frequency is: $165 / 928 = 17.8\%$.
iii) The Cumulative Frequency is produced by adding (i) the Simple Frequency of a given Range PLUS (ii) the Simple Frequencies of all the Ranges before it.
For example, the Cumulative Frequency of the third Height Range (166-170) is: $44 + 59 + 165 = 268$.
iv) Finally, the Relative Cumulative Frequency is produced by adding (i) the Relative Frequency of a given Range PLUS (ii) the Relative Frequencies of all the Ranges before it.
The next table presents the Number of Ranges into which a dataset can be divided according to the given formulas (see the Roman numerals to find the corresponding formula) for various sample sizes (928, 400, 100, 50, 20), assuming that these samples have the same properties as the original sample (n=928). Note that the minimum and maximum Heights were needed in formula (i).
Formula | n=928 | n=400 | n=100 | n=50 | n=20
(MAX=187.2, MIN=156.7)
ii | 30 | 20 | 10 | 7 | 4
iii | 11 | 10 | 8 | 7 | 6
iv | 13 | 10 | 6 | 5 | 4
v (Kurt=2.66) | 11 | 10 | 8 | 7 | 6
vi (σ=6.5) | w=2.33 / 14 | w=3.08 / 10 | w=4.89 / 7 | w=6.16 / 5 | w=8.34 / 4
vii (IQR=10.32) | w=2.12 / 15 | w=2.80 / 11 | w=4.45 / 7 | w=5.60 / 6 | w=7.60 / 5
viii | 29 | 21 | 12 | 9 | 6
ix (γ=0.088) | σγ=0.08 / 12 | σγ=0.12 / 10 | σγ=0.24 / 8 | σγ=0.34 / 7 | σγ=0.47 / 6
x | 14 | 9 | 4 | 3 | 2
In the case of vi and vii, the formulas produce the width of the range, e.g. 2.33, and the Number of Ranges is then calculated with formula (i), e.g. 14. This is presented in the table as: w=2.33 / 14. All final results have been rounded up or down to whole numbers. We can see that the formulas suggest dividing the dataset into:
i) 30 to 11 ranges of equal width when the sample size is n=928
ii) 6 to 2 ranges of equal width when the sample size is n=20.
The following picture shows all the calculations for all the previous formulas when the given sample is equal to 928.
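The calculations for a few of these rules can be sketched in Python. This is a minimal sketch that assumes the common textbook forms of the square root, Sturges, Scott and Freedman–Diaconis rules; the function name `bin_rules` and its parameters are illustrative and not part of the original post. The inputs are the values quoted in the table (n=928, σ=6.5, IQR=10.32, MIN=156.7, MAX=187.2).

```python
import math

def bin_rules(n, sd, iqr, lowest, highest):
    """Suggested bin counts (k) and widths (w) from four rules of thumb.

    Assumes the usual textbook form of each rule; widths are turned
    into a bin count by dividing the data span by the width and rounding up.
    """
    span = highest - lowest
    scott_w = 3.49 * sd / n ** (1 / 3)   # Scott (1979): bin width
    fd_w = 2 * iqr / n ** (1 / 3)        # Freedman-Diaconis (1981): bin width
    return {
        "sqrt_k": round(math.sqrt(n)),             # square root option
        "sturges_k": math.ceil(1 + math.log2(n)),  # Sturges (1926)
        "scott_w": round(scott_w, 2),
        "scott_k": math.ceil(span / scott_w),
        "fd_w": round(fd_w, 2),
        "fd_k": math.ceil(span / fd_w),
    }

result = bin_rules(n=928, sd=6.5, iqr=10.32, lowest=156.7, highest=187.2)
for name, value in result.items():
    print(name, value)
```

Under these assumptions the sketch reproduces the n=928 column of the table for rows ii, iii, vi and vii.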
Histogram versus Bar chart
When we have a qualitative variable, a bar chart can be used instead of a histogram. Note that the bars of a bar chart usually do not touch each other, which shows that a non-continuous variable is represented.
Sources
Karl Pearson
Francis Galton: Dataset with the Heights of Children and Parents
Table with relevant Range formulas
Doane's rule and others
Histogram: wiki
Histogram etymology
Poisson Probability Distribution: A little History
The Poisson Probability Distribution is named after Siméon Denis Poisson (1781–1840), who was exploring the probability of wrongful trial convictions. Later, Ladislaus Bortkiewicz (1898) found that "the number of Prussian soldiers that died because they were kicked by a horse" also follows a Poisson distribution!
Poisson Probability Distribution: Theoretical Definition
The Poisson Probability Distribution is a Discrete Probability Distribution which represents the number of random events that happen in a fixed interval of time or space (volume, area, or distance), when the average rate of occurrence is known. It is closely related to the Binomial Probability Distribution: when both the occurrence and the non-occurrence of the same event are counted in the same time or space intervals, the counts follow a Binomial distribution, which the Poisson distribution approximates when the event is rare. An example is the number of pink bicycles that pass by your house (or not) in relation to the total number of bicycles that pass by your house in e.g. one hour. The Poisson distribution is therefore suitable for describing rare events whose probability of occurrence is still high enough to be counted in standard intervals. For this reason it is also called the "Law of Rare events".
Poisson Probability Distribution: Applications
The Poisson Probability Distribution is applied in numerous scientific fields such as Telecommunications, Astronomy (the rate of incoming photons in telescopes), Biology (the number of mutations per DNA piece/length), Insurance, Seismography and many others.
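The link between rare Binomial events and the Poisson distribution can be illustrated numerically. The post does not give the probability formulas, so this sketch assumes the standard Poisson and Binomial probability mass functions; the rate of 2 pink bicycles per hour and the 1000 bicycles are hypothetical numbers chosen only for illustration.

```python
import math

def poisson_pmf(k, lam):
    """P(exactly k events) when the average number of events is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def binomial_pmf(k, n, p):
    """P(exactly k successes) in n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

lam = 2.0        # hypothetical: 2 pink bicycles per hour on average
n = 1000         # hypothetical: 1000 bicycles pass in that hour
p = lam / n      # so each bicycle is pink with a small probability

for k in range(5):
    print(k, round(poisson_pmf(k, lam), 4), round(binomial_pmf(k, n, p), 4))
```

With many trials and a small probability per trial, the two columns printed by the loop stay very close to each other.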
What is statistics? – Standard Normal Table – z scores – Part II
Φ(z) values and Z – Standard Normal Distribution Table
As already mentioned, the Standard Normal Distribution (pdf) is perfectly symmetric around its Mean. Its Mean is equal to Zero (0) and its SD is equal to one (1). Moreover, a z value shows the distance of a Raw score from its Mean, expressed in Population SD units (σ).
[Figure: Normal Distribution examples with percentages]
The Bell Curve represents the f(x) – pdf – function and the “Area under the Curve” represents the Φ(z) – cdf – function. The Normal Table or the z / Φ(z) table is used to transform a z value to Φ(z) value.
[Figure: Standard Normal Distribution, pdf and cdf (Cumulative Density Function)]
How the z / Φ(z) table works
For this page, a short edition of the table was created. This short z table has index values from −4 to +4 in steps of 0.5. The full edition of the z / Φ(z) table, in steps of 0.01, can be found here, in PDF form.
Table explanation A z / Φ(z) table usually includes:
i) The values of Φ(z) – cdf, which are found inside the Table:
- These values show how much data is contained in a specified area of Φ(z) "under the curve".
- The total "Area under the curve" equals 1.
- These values can also be expressed as percentages.
- These values increase as the Area Coverage increases.
- In this Table, the Φ(z) Area starts at the leftmost point on the X axis, $-\infty$.
- As the "Area under the Curve" extends its Coverage towards $+\infty$, its value approaches 1 (100%).
- The middle point is Zero "0", so this point splits the total "Area under the curve" in Half.
- Therefore, 50% of this Area is to the Left and 50% is to the Right of this point.
- Due to the Symmetry of this Area, the properties of one Half can be reflected onto the other Half.
[Figure: Standard Normal Distribution pdf, area under the curve and symmetry]
ii) The z values, arranged in a special way:
- The first part of the z value is found in the First column (the gray one), which holds the first digit and the first decimal digit.
- The second decimal digit of the z value is found in the First Row (the gray one).
- The intersection of these two values gives the corresponding Φ(z) – cdf value for the z value.
Note that the "Left Table" includes the negative z values and the "Right Table" the positive z values.
[Table: short Z table, z values from −4 to +4]
z / Φ(z) Table: Example
Note that an "Area" cannot be described by a single point; it needs at least two points. Let's say that you need to find the value that describes the "Area under the curve", that is, the Φ(z) value, between $-\infty$ and +3.55, always expressed in SD units: 3.55σ. Also, let's assume that the "Area under the curve" describes the data of a random variable X.
[Figure: Standard Normal Distribution pdf, area under the curve example]
- The first part of the z value is found in the First column: you must find the value "+3.5".
- The second decimal digit of the z value is found in the First Row: you must find the value ".05".
- Then, you look at the cell where these two values intersect, which gives Φ(+3.55) = .9998.
[Figure: z table lookup example for z = +3.55]
Conclusion
- This result shows that the "Area under the curve" from $-\infty$ to +3.55 on the X axis includes 99.98% of the total data.
- The remaining 0.02% of the data lies in the "Area under the curve" from +3.55 to $+\infty$ on the X axis.
Notes
- The "Area under the curve" from $-\infty$ to 0 includes 50% of the values, and
- the "Area under the curve" from 0 to $+\infty$ includes the other 50% of the values.
- This can be seen by finding the intersection of the "0" and ".00" values in the Table.
- You can then conclude that the "Area under the curve" from 0 to +3.55 on the X axis includes $99.98\% - 50\% = 49.98\%$ of the total data.
- Note that the 0 point is not included in the 49.98% because it was already included in the "50%".
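The table lookup in this example can be cross-checked without a printed table. The sketch below assumes the standard identity that expresses Φ(z) through the error function, which is available in Python's `math` module; the helper name `phi` is illustrative.

```python
import math

def phi(z):
    """Standard Normal cdf: area under the curve from -infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

area_left = phi(3.55)             # area from -infinity to +3.55
area_right = 1.0 - area_left      # area from +3.55 to +infinity
area_0_to_z = area_left - phi(0)  # area from 0 to +3.55

print(round(area_left, 4))    # 0.9998
print(round(area_right, 4))   # 0.0002
print(round(area_0_to_z, 4))  # 0.4998
```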
[Figure: Standard Normal Distribution pdf, area under the curve and symmetry (2)]
Z values / Φ(z) Table for the Standard Normal Distribution from −4 to +4 in steps of .01
What is statistics? – Normal Distribution – Bell curve
Normal Distribution: Definition
The Normal Distribution, also called the "Bell Curve" or the Gaussian Distribution / curve, is a Continuous Probability Distribution which is symmetrical around its Arithmetic Mean and has the shape of a "Bell". That is, it is wide in the middle and narrow in its tails.
Normal Distribution: A Little History
Abraham de Moivre (1667–1754) and Carl Friedrich Gauss (1777–1855) were the first scientists to study mathematical functions that produce this type of Distribution. Abraham de Moivre was a Mathematician / Statistician who studied the rate of Mortality over people's age in order to calculate profits from their annual payments. The results of this function produced normally distributed data. Nowadays, insurance companies make use of this function. Carl Friedrich Gauss, a very talented person, invented the "Gaussian" function, which shows how the distribution of arbitrarily selected real numbers which are constants can produce a special distribution, the "Normal Distribution". Moreover, he studied the random errors produced in various measurements and found that they were normally distributed. For example, the electronic noise in electrical circuits produces a Normal Distribution. Therefore, the Normal Distribution is sometimes also called the Error Distribution.
Normal Distribution: Density of Distribution: Definition
In a graph, the Density of a Distribution can be defined as the height of the curve, which describes some randomly selected measurements of an event, above the Horizontal "X" axis. The Density of a Distribution refers to how much data is contained under a given part of the curve. Therefore, where points of this curve lie higher above the axis, the density of the distribution is higher, and more data is contained in the corresponding interval under the curve.
Normal Distribution: pdf and cdf
Here, we must have a clear understanding of what the Probability Density Function (pdf) and the Cumulative Probability Function (cdf) are. Note that the pdf of a continuous random variable is the derivative of its cdf. In both functions, the X axis represents the values of the variable, usually in some specified intervals. The Y axis represents the probability density for the pdf and the cumulative probability for the cdf.
Probability Density Function
The pdf represents the relative distribution of frequency of a continuous random variable, and for the Normal Distribution it has a Bell shape. Note that the "area under the probability curve" is equal to 1, or in other words, to 100%. A single point (x) has a Probability of "0", because it "covers" a zero area: $P(X = x) = 0$. Therefore, intervals are used in order to provide meaningful answers to questions based on the pdf, e.g. $P(a \le X \le b)$. It can answer questions that use phrases such as: "What is the probability that a value of this variable is lower than, higher than, or between some values for an event?".
Cumulative Probability Function
The cdf can be defined for any random variable, and for the Normal Distribution it has a Sigmoid shape. Here, instead of areas, the actual points on the X axis are used in relation to their cumulative probabilities. It shows the probability that this variable (X) takes a value (x) below or equal to some specified number/event. Therefore, it can answer questions of this type: "What is the probability that people will have a Height below or equal to 1.90 m?".
Normal Distribution: Usefulness
The Normal Distribution or Gaussian Distribution is used in various scientific fields such as: i) Image Processing (Blurring), ii) the Behavior of Gases, iii) Communications (Signaling). iv) It can also describe the distribution of many natural phenomena, such as measurements of Height and Weight.
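The relationship between the pdf and the cdf can be tried out with `statistics.NormalDist` from the Python standard library. This is only a sketch: the Mean of 1.70 m and Standard Deviation of 0.10 m are hypothetical values for an adult Height distribution and do not come from the post.

```python
from statistics import NormalDist

# Hypothetical adult Height distribution (in metres), for illustration only.
heights = NormalDist(mu=1.70, sigma=0.10)

# cdf question: probability of a Height below or equal to 1.90 m.
p_below = heights.cdf(1.90)

# Interval question: probability of a Height between 1.40 m and 1.90 m,
# obtained as the difference of two cdf values.
p_between = heights.cdf(1.90) - heights.cdf(1.40)

# The pdf is the derivative of the cdf: compare the pdf at the Mean
# with a numerical slope of the cdf at the same point.
h = 1e-5
slope = (heights.cdf(1.70 + h) - heights.cdf(1.70 - h)) / (2 * h)

print(round(p_below, 4), round(p_between, 4))
print(round(slope, 4), round(heights.pdf(1.70), 4))
```

The second line of output shows two matching numbers, which is the "pdf is the derivative of the cdf" statement checked numerically at one point.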
Normal Distribution: Example
If we use the Height Distribution of an Adult Population, which tends to be Normally Distributed, we expect that:
i) Very few people will have a Height shorter than e.g. 1.40 m, and even fewer people will have a Height shorter than 1.10 m.
ii) Very few people will have a Height taller than e.g. 1.90 m, and even fewer people will have a Height taller than e.g. 2.20 m.
iii) The majority of this population will have a Height between e.g. 1.40 m and 1.90 m.
That is the reason for the "bell curve" shape. The extreme values are in the tails of the distribution, which have a lower probability of occurring and thus lower density, while the "popular" measurements are placed in the middle of the Bell Curve, which is associated with higher probability as well as higher density ("more space" exists).
The Graph Figure below presents the dataset of Galton, who measured the height of 928 children. The Red curve shows the distribution of the first 50 measurements, while the Green Curve shows the Height distribution of the whole sample. Note that the measurements of Height form a Normal Distribution. These two Gaussian curves differ only in the size of the sample used. The Red one has a low density (small number of data) and the Green one has a higher density (more data).
[Figure: Children Height Distributions. The Green one includes 928 Height observations (original dataset), while the Red one includes the first 50 Height observations (in cm) of the original dataset.]
The graph figure below presents three Normal Distributions that differ in Mean and Standard Deviation. The Blue one has a Mean of 7 and a Standard Deviation of 0.2. The Yellow one has a Mean of 3 and a Standard Deviation of 0.4. Finally, the Red one has a Mean of 9 and a Standard Deviation of 0.3.
Normal Distribution: Central Limit Theorem
As the number of randomly and independently selected observations that describe some event increases, the distribution of their mean (or sum) gets closer and closer to a Normal Distribution.
In practice, if your randomly taken data, with each observation independent of the others, have a shape that is close to a Normal Distribution, then you may suggest that the underlying population of your data is normally distributed, and the properties of the Normal Distribution can be applied to your sample too. The Central Limit Theorem supports this for quantities that arise as sums or averages of many independent observations. Note that as Population we define the total set of observations that exist and to which the statistical sample is related. For example, all UK women (above 40) constitute the population for a study that investigates the effects of menopause.
Normal Distribution: Assumptions
Randomness in sample selection means that every participant in the population of statistical interest has the same probability of being selected. For example, UK women (above 40) are selected randomly from Demographic lists instead of selecting the first e.g. 70.
Independence of observations means that the manipulation of one observation or participant will not influence another observation or participant. For example, the Menstrual cycles of women who work or live in the same place on a daily basis can begin on the same day. Therefore, these observations cannot be considered random and independent. The same is true of TV Channel choices for people who live in the same house but have only one TV set. All these persons are "forced" to watch a specific TV channel. This problem is also known as Autocorrelation. It was first mentioned by Galton in 1888 and is also known as Galton's Problem.
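The Central Limit Theorem described above can be seen in a small simulation. This is a sketch under stated assumptions: the observations are drawn from a uniform distribution on [0, 1] (a deliberately non-Normal choice), and the function name `sample_means` and the sample sizes are illustrative.

```python
import random
import statistics

random.seed(0)  # fixed seed so that the run is repeatable

def sample_means(sample_size, repeats=2000):
    """Means of `repeats` independent samples of uniform(0, 1) observations."""
    return [
        statistics.fmean(random.random() for _ in range(sample_size))
        for _ in range(repeats)
    ]

for size in (1, 5, 30):
    means = sample_means(size)
    # As the sample size grows, the means cluster more tightly around 0.5
    # and their histogram looks more and more like a Bell Curve.
    print(size, round(statistics.fmean(means), 3), round(statistics.stdev(means), 3))
```

For uniform observations, the spread of the sample means is expected to shrink roughly as $\sqrt{1/(12n)}$, where n is the sample size.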
Standard Normal Distribution or Z Distribution: Properties
When the independent observations of a Random variable X are so many that their number approaches Infinity, its distribution is a Normal one, its Arithmetic Mean is equal to $\mu = 0$, and its Standard Deviation is equal to $\sigma = 1$, then this Distribution is called the Standard Normal Distribution, Z Distribution, or Typical Distribution, and it can be denoted as $N(0, 1)$. Note that the Normal Distribution, in general, is denoted as $N(\mu, \sigma^2)$. As mentioned, when researchers studied such variables on a large scale using almost all available population data (e.g. Height data for all men from Military files in a specific Country), and after rescaling the Mean to "0" and the Standard Deviation to "1", a "Standard Normal Distribution" was produced. Note that this Distribution is the "Golden Standard" for all Normal Distributions. Studying the properties of the "Standard Normal Distribution" therefore helps us to understand the properties of every Normal Distribution.
The tails of the Standard Normal Distribution
Its "tails", that is, both ends of the Standard Normal Distribution, get closer and closer to the Horizontal X Axis without ever touching it.
Central Tendency
The Arithmetic Mean, Median, and Mode, which are measurements of Central Tendency, have exactly the same value in the Standard Normal Distribution. Moreover, if we draw a Straight Line from the Highest point of the Bell Curve down to the value of the Mean on the Horizontal X Axis, then 50% of the data will lie before this line and 50% after it, always inside the Probability Bell Curve.
The rule of thumb: 68-95-99.7-99
The rule of thumb 68-95-99.7-99 is based on the fact that there is symmetry around the Arithmetic Mean in the Standard Normal Distribution.
Cumulatively, the percentage of data that lies to the right and left of the Arithmetic Mean, always inside the Probability Bell Curve, is:
i) within one (1) Standard Deviation: 68.2%
ii) within two (2) Standard Deviations: 95.4%
iii) within three (3) Standard Deviations: 99.7%
iv) within four (4) Standard Deviations: 99.99%
In other words, the Probability that a value "x" lies within 1, 2, 3, or 4 Standard Deviations of the Arithmetic Mean in the Standard Normal Distribution is:
i) $P(-1 \le x \le 1) \approx 0.682$
ii) $P(-2 \le x \le 2) \approx 0.954$
iii) $P(-3 \le x \le 3) \approx 0.997$
iv) $P(-4 \le x \le 4) \approx 0.9999$
[Figure: Standard Normal Distribution and Symmetry (pdf)]
Probability (pdf) and Cumulative (cdf) Density functions for the General Normal and Standard Normal Distribution
Here, it must be said that the only difference in these two functions between the General Normal Distribution and the Standard Normal Distribution is that the Arithmetic Mean is replaced with zero "0" and the Standard Deviation is replaced with one (1), therefore:
Probability Density function (pdf)
- for EVERY Normal Distribution: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
- for the Standard Normal Distribution: $f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}$
Cumulative Density function (cdf)
- for the General Normal Distribution: $F(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{(t-\mu)^2}{2\sigma^2}}\, dt$
- for the Standard Normal Distribution: $\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-\frac{t^2}{2}}\, dt$
[Figure: Standard Normal Distribution: Cumulative Density function (cdf)]
Symbol Explanation
We have already explained the symbol $\mu$ for the Arithmetic Mean and the symbol $\sigma$ for the Standard Deviation (of the Population). The symbol $\pi$ denotes the ratio of the Circumference of a circle to its diameter; it is a constant approximately equal to 3.14159. The symbol $e$ is another constant, also called Euler's number, approximately equal to 2.71828. Both constants are essential! Note that they have an infinite number of decimals, which mathematicians are still computing!
The following graph figure shows that the Bell curve is described by the pdf and the area under the curve is described by the cdf. As already said, the pdf of a continuous random variable is the derivative of its cdf.
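The percentages of this rule of thumb can be recomputed directly. The sketch below assumes the standard identity $P(-k \le z \le k) = \operatorname{erf}(k/\sqrt{2})$ for the Standard Normal Distribution; the function name `coverage` is illustrative.

```python
import math

def coverage(k):
    """Probability that a Standard Normal value lies within k SDs of the Mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3, 4):
    print(f"within {k} SD: {coverage(k):.4%}")
```

The computed values are slightly more precise than the rounded percentages of the rule of thumb, for example about 68.27% instead of 68.2% within one Standard Deviation.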
Resources
Abraham de Moivre
Normal Distribution
Gauss Function
Carl Friedrich Gauss
Euler's Number: e
Galton Height Dataset explanation
Galton's Problem
Method B: The Statistical Formula
The following formula does not depend on whether the Arithmetic dataset has an Odd or Even number of values. However, the Statistical Formula will be used for both Arithmetic series (n=10 and n=9) in order to be able to compare the results of both methods. Note that the Statistical Formula depends only on the number of values that exist in a given dataset. The following datasets have been arranged in ascending order (a prerequisite for finding Quartile positions):
The (1st) Arithmetic set with an Even number of values (n=10) , , , , , , , , , .
The (2nd) Arithmetic set with an Odd number of values (n=9) , , , , , , , ,
i) The first step is to find the exact position of Q1 and Q3 in each Arithmetic set. In order to find these positions, we must apply the relevant Statistical formula in each case: $P = \frac{q}{m}\,(n + 1)$
Explanation
$P$ is the exact position that Q1 and Q3 have in a given dataset. $n$ is replaced by the total number of values that exist in a given dataset. Here, it is replaced with 9 or 10, respectively. $q$ is replaced by the position we are searching for. Here, it is replaced by 1 (Q1) and 3 (Q3), for each Quartile, respectively, for both datasets. $m$ is replaced by the total number of Arithmetic pieces that are produced, here, by Quartiles. Quartiles always produce 4 Arithmetic pieces, therefore it is replaced by 4.
Results for the 1st Arithmetic set (n=10)
For the positions of Q1 and Q3, we have as a result: $\frac{1}{4}(10 + 1) = 2.75$ and $\frac{3}{4}(10 + 1) = 8.25$
ii) The next step is to find what values correspond to these positions in the given Arithmetic set.
Positions of 2.75 and 8.25
Position 2.75 lies between the 2nd and 3rd values of the Arithmetic set, that is, between the values 7 and 10. Position 8.25 lies between the 8th and 9th values of the Arithmetic set, that is, between the values 15 and 15. In order to find what values correspond to these positions, these steps must be followed:
a) Between these two values, we subtract the lower number from the higher number. That is: $10 - 7 = 3$ and $15 - 15 = 0$
b1) For the Q1 position: we multiply the result by 0.75 or $\frac{3}{4}$, both forms are equivalent. That is: $3 \times 0.75 = 2.25$
b2) For the Q3 position: we multiply the result by 0.25 or $\frac{1}{4}$, both forms are equivalent. That is: $0 \times 0.25 = 0$
c) We add the produced result to the lower of the two given values. That is: $7 + 2.25 = 9.25$ and $15 + 0 = 15$
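The steps of Method B can be sketched as a short function. The position formula used here, $q\,(n+1)/4$, is inferred from the positions 2.75 and 8.25 quoted above. The dataset is hypothetical: only its 2nd, 3rd, 8th and 9th values (7, 10, 15, 15) are taken from the text, and the remaining values are made up so that the example can run.

```python
def quartile_by_position(sorted_values, q, pieces=4):
    """Value at the q-th quantile position, interpolating between neighbours.

    The position is counted from 1 and computed as q * (n + 1) / pieces.
    """
    n = len(sorted_values)
    position = q * (n + 1) / pieces
    if position < 1 or position > n:
        raise ValueError("position falls outside the dataset")
    whole = int(position)            # e.g. 2 for position 2.75
    fraction = position - whole      # e.g. 0.75 for position 2.75
    lower = sorted_values[whole - 1]
    if fraction == 0:
        return lower
    upper = sorted_values[whole]
    # steps a) to c): difference, times the fraction, added to the lower value
    return lower + fraction * (upper - lower)

# Hypothetical dataset (n=10), already in ascending order.
data = [5, 7, 10, 10, 11, 13, 14, 15, 15, 18]

q1 = quartile_by_position(data, 1)
q3 = quartile_by_position(data, 3)
print(q1, q3, q3 - q1)
```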
Method A: The Median of the Median: Even Cases: Example
Here, an example is given (n=10) for Even cases. The dataset that we will use in this example is the following one. It has already been arranged in ascending order (a prerequisite for finding the Median value / Quartile positions): , , , , , , , , , .
i) Because we have a dataset with an Even number of values, two middle numbers exist, which are: 11 and 13.
ii) Therefore, the Median value is found by calculating the Arithmetic mean (average) of these two values, which is 12: $\frac{11 + 13}{2} = 12$.
iii) Every Median splits a dataset in half in such a way that 50% of the values are left of the Median and the other 50% of the values are right of the Median. Therefore:
iv) we have 5 values left of "12" (Median) and another 5 values right of "12" (Median). Therefore, we have: Left of "12" (Median): , , , , Right of "12" (Median): , , , , .
v) In the next step, we must find the Median of each Arithmetic subset, which will be the same number as the Middle value of each Arithmetic subset, because each subset of this example contains an Odd number of values. Therefore:
-i) in the Left Arithmetic subset, the middle value is 10, that is, the Median equals 10 (Q1 position)
-ii) in the Right Arithmetic subset, the middle value is 15, that is, the Median equals 15 (Q3 position)
vi) The Median value of each Arithmetic subset corresponds to the value that exists in the Q1 and Q3 positions, respectively. The InterQuartile Range can then be referred to as "10-15". The value of the InterQuartile Range can be found by subtracting the value that exists in the Q1 position from the value that exists in the Q3 position, which equals 5: $Q3 - Q1 = 15 - 10 = 5$, and thus: $IQR = 5$.
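Method A can be sketched with `statistics.median` from the Python standard library. The dataset is hypothetical: it is chosen only so that its middle values are 11 and 13 and the medians of its halves are 10 and 15, as in the example. For an Odd number of values, this sketch leaves the Median itself out of both halves, which is one common convention and an assumption here.

```python
from statistics import median

def quartiles_by_halves(sorted_values):
    """Return (Q1, Median, Q3) using the median of each half of the data."""
    n = len(sorted_values)
    half = n // 2
    lower_half = sorted_values[:half]
    # For an Odd n, skip the middle value so that both halves have equal size.
    upper_half = sorted_values[half:] if n % 2 == 0 else sorted_values[half + 1:]
    return median(lower_half), median(sorted_values), median(upper_half)

# Hypothetical dataset (n=10), already in ascending order.
data = [5, 7, 10, 10, 11, 13, 14, 15, 15, 18]

q1, med, q3 = quartiles_by_halves(data)
iqr = q3 - q1
print(q1, med, q3, iqr)
```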
What is InterQuartile Range: Definition
The value of the InterQuartile Range (IQR: Q1 – Q3) is the result we get when the value that exists in the First Quartile position is subtracted from the value that exists in the Third Quartile position. The InterQuartile Range (IQR) is also known as the midspread or middle fifty. It shows the statistical dispersion in an Arithmetic set. It is also used for the construction of the Boxplot, a Statistical Graph figure.
What are -iles: Theoretical Definition
You can define whatever -iles you want, e.g.:
i) Tertiles occupy 2 positions that result in 3 equal pieces. Each piece includes 33% of the total information or data.
ii) Quartiles occupy 3 positions that result in 4 equal pieces. Each piece includes 25% of the total information or data.
iii) Deciles occupy 9 positions that result in 10 equal pieces. Each piece includes 10% of the total information or data.
iv) Percentiles occupy 99 positions that result in 100 equal pieces. Each piece includes 1% of the total information or data.
v) Permilles occupy 999 positions that result in 1000 equal pieces. Each piece includes 1‰ or 0.1% of the total information or data.
You must also keep in mind that these -iles exist: centile or percentile (99), vigintile (19), duodecile (11), decile (9), nonile (8), octile (7), septile (6), sextile (5), quintile (4), quartile (3), and tercile or tertile (2). Not all of these -iles are convenient to use in statistics.
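The relation between the number of positions and the number of pieces can be checked with `statistics.quantiles` from the Python standard library, which returns the cut points that split the data into `n` pieces. The data below are hypothetical (the whole numbers from 1 to 1000) and serve only as an illustration.

```python
from statistics import quantiles

data = list(range(1, 1001))  # hypothetical data: 1, 2, ..., 1000

cut_counts = {}
for name, pieces in [("tertiles", 3), ("quartiles", 4),
                     ("deciles", 10), ("percentiles", 100)]:
    cuts = quantiles(data, n=pieces)  # the positions that separate the pieces
    cut_counts[name] = len(cuts)
    print(name, "pieces:", pieces, "positions:", len(cuts))
```

In every case the number of positions is one less than the number of pieces.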
What is statistics? – Percentiles, Quartiles, Quantiles, and Deciles
The Cake!
It is your Birthday and your mum bought you a BIG cake! You are expecting 10 Birthday guests. Your mum cut the cake into ten pieces without asking for your advice.
Order!
We can assume that each piece of Cake is exactly the same, let's say, in length. However, the number of cookies that each piece includes differs. Let's say that while you are waiting for your guests to arrive, you have put your cake pieces in ascending order based on the number of cookies they have. Therefore, your first guest will receive the piece of cake with the fewest cookies and your last guest will receive the piece that is richest in cookies.
Frequency
In Statistics, Frequency can be defined as any number that indicates how many times "something", such as a value, word, or category, occurs in a set. It is denoted as F (Frequency). There is the Simple Frequency, the Relative Frequency, and the Cumulative Frequency, as well as their combinations, e.g. the Cumulative Relative Frequency.
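The different types of Frequency can be computed from the Simple Frequencies alone. The sketch below reuses the Simple Frequencies of the seven Height ranges from the Galton table earlier in this document; the variable names are illustrative.

```python
from itertools import accumulate

ranges = ["156-160", "161-165", "166-170", "171-175",
          "176-180", "181-185", "186-190"]
simple = [44, 59, 165, 258, 266, 105, 31]   # Simple Frequency (F) per range

total = sum(simple)                          # total number of heights
cumulative = list(accumulate(simple))        # running sum of F
relative = [f / total for f in simple]       # F divided by the total
relative_cumulative = [c / total for c in cumulative]

for row in zip(ranges, simple, cumulative, relative, relative_cumulative):
    name, f, c, rf, rc = row
    print(f"{name}  F={f}  cumulative={c}  relative={rf:.1%}  rel. cumulative={rc:.1%}")
```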
