Section 6 (4 credits)
Making Sense of Data – Statistics
Statistics is about “making sense of data”. From the earliest days people have been statisticians. There are patterns in weather, animal behaviour, seasons, food sources and trends. All of this is data, and noticing where the stars align and when the fishing is better is just a matter of time and observation. All driven by survival and hunger, no doubt. In today's world mankind has measured just about as much as can be measured and data flows in torrents.
The Maori Fishing Calendar shows the connections between the moon, the month and likely fishing results.
All the result of looking and noticing over time.
The website Our World in Data processes enormous amounts of data and produces graphs on just about any topic you choose. The Covid graphs for individual countries have been a popular visit for many people.
This graph shows several countries' daily infections during the past 18 months. Individual countries can be selected and the time period changed.
Go to this website and take a look, have a play and see what you are interested in. The graph showing World infections clearly shows the "Third Wave" and Delta.
Statistics is about “making sense of data”.
That single statement will drive everything that follows. The image in the header holds a strong clue to statistical success also. It is all about looking and of course, noticing.
WARNING! Be wary of false statements using statistics. A nice example is "The longer your foot, the better you are at Mathematics". At first glance this statement is funny and then it seems ridiculous. How can this be? A little pondering will recall that little 5-year-old children with little feet are not very good at maths, but by Year 10, or age 15, they can excel. All that time a child grows and the feet get longer. Hence the apparent connection between two unrelated things. "Cause and Effect" or "Effect and Cause", depending on how you look at it, are very difficult to establish so always be wary. The statement could also have been "As a child grows taller so do the stories they tell..." but there could be more truth in that.
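The foot-length trap can be sketched with made-up numbers. The short Python example below invents 500 children aged 5 to 15, lets both foot length and maths score depend on age only, and then measures how strongly feet and maths line up. Every number, name and formula in it is an assumption chosen for illustration, not real data.

```python
import random

random.seed(1)  # fixed seed so the made-up data is the same every run

def correlation(xs, ys):
    """Pearson correlation: near +1 means the two rise together, near 0 means no straight-line link."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical children: age drives BOTH foot length and maths score.
ages = [random.randint(5, 15) for _ in range(500)]
foot_cm = [14 + 0.9 * age + random.gauss(0, 1) for age in ages]
maths = [5 * age + random.gauss(0, 8) for age in ages]

all_children = correlation(foot_cm, maths)

# Now look only at the 10-year-olds, so age can no longer do the work.
ten = [(f, m) for a, f, m in zip(ages, foot_cm, maths) if a == 10]
ten_year_olds = correlation([f for f, _ in ten], [m for _, m in ten])

print(f"All ages:    {all_children:.2f}")
print(f"Age 10 only: {ten_year_olds:.2f}")
```

Across all ages the link looks strong. Inside a single age group it should mostly disappear, which is the clue that age was the hidden driver all along.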
Firstly some language. Like all studies, knowing the language is knowing the subject. Know and use mathematical and statistical terms, correctly!
Data (pronounced dah-ta or day-ta, your call). In everyday use the plural is data as well (strictly, a single piece of data is a datum), just like sheep: one sheep, ten sheep.
If you think of a cat, for example, and then try to describe it, you might say it has 4 legs, a tail, it has black fur, is female, weighs about 4 kilograms, has a 20cm tail, has green eyes, is very friendly, and so on. These are the properties of the cat. These are the data that describe a cat. You could notice it likes liver but not kidney, mince, a nice mouse to eat, milk, does not like dogs, likes cat-nip. We are only talking about this cat. Each property is a variable, or something that can change.
A variable contains data which can be numerical (numbers) or categorical (words). A collection of variables and the data from a number of cats, say 200, is called a database. Numerical data can be added and multiplied and still make sense; categorical data are words and describe.
A number pretending to be a number pretending to be real.
Is a Telephone Number numerical or categorical?
Examples of numerical data above are the number of legs, the cat’s weight and the length of its tail.
Examples of categorical data are the gender, the colour, the eye colour and the friendliness.
There are thousands of databases available about everything you might imagine. It is making sense of this data that occupies a study in statistics. Telephone numbers do not make sense when added so must be categorical. A typical telephone number now is built up by [country code][region code][town code][house code] but all that changed with cell phones.
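The split between numerical and categorical can be sketched in a few lines of Python. The cat below is the one described earlier, the owner's telephone number is made up, and the names `cat` and `kind` are assumptions for this illustration only.

```python
# One hypothetical cat, stored as variable -> value.
cat = {
    "legs": 4,
    "weight_kg": 4.0,
    "tail_cm": 20,
    "sex": "female",
    "fur_colour": "black",
    "eye_colour": "green",
    "phone": "07 555 0123",  # made-up owner's number, stored as text
}

def kind(value):
    """Numbers are numerical; anything stored as words or text is categorical."""
    # bool is excluded because Python counts True/False as numbers.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "numerical"
    return "categorical"

kinds = {name: kind(value) for name, value in cat.items()}
for name, k in kinds.items():
    print(f"{name:12} {k}")

# Adding or multiplying weights still makes sense: two such cats weigh 8 kg.
two_cats_kg = cat["weight_kg"] * 2
```

Storing the telephone number as text is a deliberate choice: it stops anyone adding or averaging it by accident.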
Describe three numerical and three categorical variables and give examples of the data they might contain.
Not all cats are the same. They vary. The amount of variation is interesting and very important. A manufacturer of nuts and bolts, however, needs a very small variation in sizes.
Cats are different colours, weights and genders. Some are friendly and some are not, the tail length varies, and what they eat varies a lot. Variation causes statements in statistics to be what I call “floppy”. A statement like “This sample data suggests the typical weight of a cat is between 2kg and 3kg” is much more meaningful than “The mean weight of a cat is 2.54kg”. Start using the word "typical" rather than "mean" when talking about a population.
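A "floppy" statement can be built from a sample in a few lines. This is a minimal sketch using twelve invented cat weights. The "middle half" here is found the simple way: sort the weights, then set aside the lightest quarter and the heaviest quarter.

```python
# Hypothetical sample of 12 cat weights in kg (invented for illustration).
weights = [2.4, 1.8, 2.9, 2.2, 4.6, 2.0, 2.6, 2.5, 3.1, 2.1, 2.7, 2.3]

mean_kg = sum(weights) / len(weights)

# A simple "middle half": sort, then set aside the lightest and heaviest quarter.
ordered = sorted(weights)
quarter = len(ordered) // 4
middle_half = ordered[quarter:len(ordered) - quarter]
low, high = middle_half[0], middle_half[-1]

print(f"The mean weight of a cat is {mean_kg:.2f}kg")
print(f"This sample suggests a typical cat weighs between {low}kg and {high}kg")
```

The one very heavy cat pulls the mean upwards but has no effect on the typical range, which is one reason the range is the safer thing to report.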
Give an example of a variable that might have a large variation in the data and one that might have a small variation in the data.
Write statements to describe the typical item you are talking about.
The Earth's climate has varied a lot and we are currently in a hotter-than-usual cycle.
Measures of Middle
The traditional statistics course makes a lot of use of measures such as mean, median, and mode. “Typical” has largely replaced mean and median in early studies and is mentioned as a range of values to convey an idea of variation. Mode is almost never used except by people who sell shoes.
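The three measures can be compared using Python's built-in `statistics` module. The shoe sizes below are invented for illustration, in keeping with the shoe seller mentioned above.

```python
import statistics

# Hypothetical shoe sizes sold in one morning (invented for illustration).
sizes = [7, 8, 8, 9, 9, 9, 9, 10, 10, 11, 14]

mean_size = statistics.mean(sizes)      # add them all up, divide by how many
median_size = statistics.median(sizes)  # the middle value once sorted
mode_size = statistics.mode(sizes)      # the value that turns up most often

print(f"mean {mean_size:.2f}, median {median_size}, mode {mode_size}")
```

The shoe seller cares about the mode because that is the size to stock up on. The mean here is not even a size anyone can buy.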
Describe what a mean, a median and the middle 50% are.
Graphs of Data
There are some excellent sense-making graphs now used in statistics and they are easy to make. The days of bar graphs, histograms and pie charts are pretty much gone. Sense-making graphs appeared in the 1970s and included the two shown above: the "dot plot" and the "box and whisker". A dot plot is simply a number line with dots placed, and stacked, to represent numerical data. The shape and spread of the dots are important. The box and whisker shows the 25% groups, or quartiles, of the data. From the left: Minimum (Min), Lower Quartile (LQ), Median (Med), Upper Quartile (UQ) and Maximum (Max). The Middle 50%, from LQ to UQ, is the "typical value" and the "width of the box" is the spread or variation in the sample. A sample that is big enough and randomly selected is representative of the population, so what it shows can be taken as a good guide to the truth about the population. Statistics studies all those factors to improve the truth or reliability of the sample.
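The five values a box and whisker graph is drawn from can be worked out as sketched below. The student weights are invented, and `five_number_summary` is a hypothetical helper name. Different calculators and software use slightly different rules for quartiles, so small differences from a hand calculation are normal.

```python
import statistics

def five_number_summary(data):
    """Return (Min, LQ, Med, UQ, Max) for a list of numbers."""
    lq, med, uq = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), lq, med, uq, max(data)

# Hypothetical student weights in kg (invented for illustration).
weights = [48, 52, 55, 57, 58, 60, 61, 63, 66, 70, 85]

minimum, lq, med, uq, maximum = five_number_summary(weights)
box_width = uq - lq  # the spread of the middle 50%

print(f"Min {minimum}, LQ {lq}, Med {med}, UQ {uq}, Max {maximum}")
print(f"A typical student weighs between {lq}kg and {uq}kg")
```

The box is drawn from LQ to UQ with a line at the median, and the whiskers reach out to the minimum and maximum.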
Another modern graph is “stem and leaf”. There are hundreds of ways to display data and the graphs that make the most sense are the best!
Find out what a stem and leaf graph looks like, how it is made and if it might be useful.
In the following box and whisker graph for the males, label and read the value of the minimum, the LQ, the median, the UQ and the maximum.
In the box and whisker graph…
• is there more variation in the males or females?
• what is the difference in the medians of the two genders?
• what is a sensible range of values that describes a typical male, female and student weight?
It is a bit blurry but can be read.
A Trout Example
Variation is what makes statistics hard. Variation happens in all manufacturing and where anything to do with measurement is involved. Measurement was the original cause of the study of statistics as engineers tried to manage machining and manufacturing variation. They figured a way to calculate the average distance any data point was from the mean and called this standard deviation. They then invented standard error and mathematicians developed all that into a confusion called statistics.
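The "average distance from the mean" idea can be worked through in a few lines. The bolt lengths below are invented. Note that standard deviation, as it is usually defined, squares each distance before averaging and then takes the square root, so it comes out a little larger than the plain average distance.

```python
import statistics

# Hypothetical bolt lengths in mm from one machine (invented for illustration).
lengths = [49.8, 50.1, 50.0, 49.9, 50.3, 50.0, 49.7, 50.2]

mean_len = statistics.mean(lengths)

# Average distance from the mean, ignoring direction.
avg_distance = sum(abs(x - mean_len) for x in lengths) / len(lengths)

# Standard deviation squares the distances first, so big misses count for more.
std_dev = statistics.pstdev(lengths)

print(f"mean {mean_len:.2f}mm")
print(f"average distance from the mean {avg_distance:.3f}mm")
print(f"standard deviation {std_dev:.3f}mm")
```

`pstdev` treats the eight bolts as the whole group. For a sample taken from a bigger batch, `statistics.stdev` is the usual choice.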
Spread is another name for variation and becomes obvious when data is graphed. Here is a “dot plot” of a sample of trout. Each dot represents a trout. There are two female trout near 3kg and the smallest trout is a male at about 1.1kg.
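A dot plot really is just dots stacked on a number line, and a rough text version takes only a few lines. This is a minimal sketch: the five weights are invented (they are not the trout sample), and `dot_plot` is a hypothetical helper name.

```python
from collections import Counter

def dot_plot(values, step=0.1):
    """Return the rows of a text dot plot, plus the left and right ends of the line."""
    # Count how many values land on each tick of the number line.
    ticks = Counter(round(v / step) for v in values)
    lo, hi = min(ticks), max(ticks)
    rows = []
    for level in range(max(ticks.values()), 0, -1):  # tallest stack first
        rows.append("".join("•" if ticks[t] >= level else " "
                            for t in range(lo, hi + 1)))
    rows.append("-" * (hi - lo + 1))  # the number line itself
    return rows, lo * step, hi * step

# Invented weights in kg, for illustration only.
weights = [1.1, 1.4, 1.4, 1.5, 2.0]
rows, left, right = dot_plot(weights)
print("\n".join(rows))
print(f"{left:.1f}kg to {right:.1f}kg")
```

Each column is one tenth of a kilogram, and two values that land on the same tick stack on top of each other.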
Task 6: Look at the trout dot plot sample and list 5 things that could be used to compare the two groups of dots.
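The spread of the females is from about 1.2kg to 3.1kg and the spread of the males is from 1.1kg to about 2.5kg. Most of the females are between 1.25 and 2.1kg and most of the males are between 1.4 and 2.5kg. So there may be more variation in the size of the males: where most of the fish sit, the width is 2.5 – 1.4 = 1.1kg for the males compared to 2.1 – 1.25 = 0.85kg for the females. The full spread tells a different story, 3.1 – 1.2 = 1.9kg for the females compared to 2.5 – 1.1 = 1.4kg for the males, but that is stretched by the two big females.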
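The widths can be checked with a few subtractions. The sketch below uses only the readings quoted above, and keeps the full spread separate from the range where most fish sit, so that like is compared with like. The names `full`, `most` and `width` are assumptions for this illustration.

```python
# Readings quoted in the text, in kg (read by eye from the dot plot).
full = {"female": (1.2, 3.1), "male": (1.1, 2.5)}
most = {"female": (1.25, 2.1), "male": (1.4, 2.5)}

def width(low_high):
    """Width of a range given as (low, high), rounded to 2 decimal places."""
    low, high = low_high
    return round(high - low, 2)

for group in ("female", "male"):
    print(f"{group:6}  full spread {width(full[group])}kg, "
          f"where most sit {width(most[group])}kg")
```

Which group looks more varied depends on which width is used, so it pays to say which one is being compared.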
When comparing two groups using a numerical measure it is usual to look at
• Shape
• Middle 50%
• Spread
• Oddities
When saying something about each of these a concise literacy strategy is “WWW”.
• W1 • What am I talking about?
• W2 • Where is it?
• W3 • What does it mean?
For example, the shape (W1) of both female and male trout samples is one bump (W1). The bump (W1) is between about 1.25 and 2.25 kg (W2) for both. This means most trout are within this 1kg range in our sample (W3).
The Middle 50% (15) of the female trout sample (30) is between 1.5 and 2.5kg. The middle 50% of the male trout sample is between 1.7 and 2.4kg. This means a typical female and a typical male trout would be within these ranges.
The spread is explained above and may suggest there is slightly more variation in the male sample than the female sample.
There are no oddities in this data. There are two big female trout but their weight is usual, not odd.
All statistics should be undertaken in an investigation. Modern statistical study uses the term “statistical investigation cycle”. This is summed up as PPDAC. Problem, Plan, Data, Analysis, Conclusion. You are a Data Detective when you use this cycle.
What is missing so far is a question or problem that might need to be answered. Comparative questions “compare” a numerical variable for a categorical variable with two groups. The categorical variable is gender, male and female. The numerical variable is weight. The population that the trout were selected from was all of the trout caught and weighed in during the Lake Taupo Fishing Competition held in April 1993.
A suitable question for this data is “I wonder if the weight of a female trout is typically more than the weight of a male trout in the 1993 Lake Taupo Fishing Competition data?”
Relationship questions involve two numerical variables. I wonder if there is a relationship between the weight of a trout and the length of a trout?
Over 1200 fish were caught and weighed in during that 1993 competition, and taking a sample from the 1200 seems a good option to save a bit of time in figuring out an answer to that question. So how big should a sample be? [Just of interest, there are about 1.4 million catchable trout in Lake Taupo, which is of course the true "population".]
As a general rule of thumb a sample of about 20 to 50 in size works for measures on animal populations. A bigger sample does help to give a clearer and more reliable picture but it costs more. It is a good idea to explore sample size to see what happens to the variation.
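Exploring sample size can be done by experiment. The sketch below builds a stand-in population of 1200 invented weights (not the real competition data), takes three random samples at each of several sizes, and prints the middle 50% of each sample. The centre of 1.9kg and the spread of 0.4kg are assumptions made up for this example.

```python
import random
import statistics

random.seed(1993)  # fixed seed so the run can be repeated

# Stand-in population: 1200 invented trout weights in kg, NOT the real data.
population = [round(random.gauss(1.9, 0.4), 2) for _ in range(1200)]

def middle_half(sample):
    """Return (LQ, UQ), the range holding the middle 50% of the sample."""
    lq, _, uq = statistics.quantiles(sample, n=4)
    return lq, uq

for size in (5, 20, 50, 200):
    # Take three separate random samples of this size and compare them.
    boxes = [middle_half(random.sample(population, size)) for _ in range(3)]
    text = ",  ".join(f"{lq:.2f} to {uq:.2f}" for lq, uq in boxes)
    print(f"n = {size:3}:  {text}")
```

Small samples tend to give boxes that jump around from one sample to the next, while larger samples tend to settle down. Past a certain size the extra effort buys very little extra clarity.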
The only remaining step is to answer the question. This data suggests that the female trout are not bigger than the male trout, so back in the population of the 1993 Lake Taupo Fishing Competition it is expected that the same result would be evident.
In the text and examples above identify each of the PPDAC steps for the trout example.
To gain 4 Credits for this section, screenshot or photograph your answers, or otherwise email a copy to firstname.lastname@example.org for registering (first time), checking and getting feedback. This is intended to be a painless process and all questions are accepted. Once you have been awarded the 4 credits a fee of $5 will be requested for this section.
Well done, four sections over! On to Section 7! Is the course fun? Use the Navigator to find the next section.