Section 6 (4 credits)

Making Sense of Data – Statistics

Statistics is about “making sense of data”.

That single statement will drive everything that follows. First, some language.

Data

If you think of a cat, for example, and then try to describe it, you might say it has 4 legs, has black fur, is female, weighs about 4 kilograms, has a 20cm tail, has green eyes, is very friendly, and so on. These are the properties of the cat. They are the data that describe the cat. You could also notice that it likes liver, mince, milk, cat-nip and a nice mouse to eat, but does not like kidney or dogs. We are only talking about this one cat. Each property is a variable.

Data can be numerical (numbers) or categorical (words). Each piece of data is named by a variable. A collection of variables, together with the data from a number of cats, say 200, is called a database.

Examples of numerical data above are the number of legs, the cat’s weight and the length of its tail.

Examples of categorical data are the gender, the fur colour, the eye colour and the friendliness.
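The cat description can be written down as a small record to show the two kinds of data side by side. This is only a sketch: the variable names (weight_kg, fur_colour and so on) and the helper kind_of_data are made up for illustration.

```python
# One record describing a single cat; the variable names are illustrative only.
cat = {
    "legs": 4,                        # numerical
    "weight_kg": 4.0,                 # numerical
    "tail_length_cm": 20,             # numerical
    "gender": "female",               # categorical
    "fur_colour": "black",            # categorical
    "eye_colour": "green",            # categorical
    "friendliness": "very friendly",  # categorical
}


def kind_of_data(value):
    """Classify a value as numerical (a number) or categorical (a word)."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "numerical"
    return "categorical"


numerical = [name for name, value in cat.items() if kind_of_data(value) == "numerical"]
categorical = [name for name, value in cat.items() if kind_of_data(value) == "categorical"]

print("Numerical variables:", numerical)
print("Categorical variables:", categorical)
```

A database of 200 cats would simply be a list of 200 records like this one, all sharing the same variables.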

There are thousands of databases available about everything you might imagine. It is making sense of this data that occupies a study in statistics.

Describe three numerical and three categorical variables and give examples of the data they might contain.

Variation

Not all cats are the same. They vary. They differ in colour, weight and gender. Some are friendly and some are not, the tail length varies, and what they eat varies a lot. Variation causes statements in statistics to be what I call “floppy”. A statement like “This sample data suggests the typical weight of a cat is between 2kg and 3kg” is much more meaningful than “The mean weight of a cat is 2.54kg”.

Give an example of a variable that might have a large variation in the data and one that might have a small variation in the data.

Measures of Middle

The traditional statistics course makes a lot of use of measures such as mean, median, and mode. “Typical” has largely replaced mean and median in early studies and is mentioned as a range of values to convey an idea of variation. Mode is almost never used.

Describe what a mean, a median and a mode are.
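Once you have written your own descriptions, a short calculation can be used to check them. The sketch below uses Python's built-in statistics module on a made-up list of cat weights; the numbers are invented for illustration and do not come from any real sample.

```python
import statistics

# Made-up cat weights in kg, for illustration only.
weights = [2.1, 2.4, 2.4, 2.5, 2.6, 2.8, 3.0, 3.1, 4.9]

mean = statistics.mean(weights)      # add all the values, divide by how many there are
median = statistics.median(weights)  # the middle value once the data is in order
mode = statistics.mode(weights)      # the value that occurs most often

print(f"mean   = {mean:.2f} kg")
print(f"median = {median} kg")
print(f"mode   = {mode} kg")
```

Notice that the one heavy cat (4.9kg) pulls the mean up to about 2.87kg while the median stays at 2.6kg. This is one reason a single number can be a poor description of “typical”.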

Graphs of Data

There are some excellent sense-making graphs now used in statistics and they are easy to make. The days of bar graphs are pretty much gone.

Today “dot plots” and “box and whisker” graphs are common. Another modern graph is “stem and leaf”. There are hundreds of ways to display data and the graphs that make the most sense are the best!

The graph below is a box and whisker graph. The box contains the middle 50% of the data, so it pretty much represents the typical value and indicates the spread as well. The heavy bar in the middle is the median. This pair of box and whisker graphs shows that half of the male students, all those above the median, are heavier than ¾ of all the female students, that is, those in the female box and below. The data in a box and whisker graph is located by quarters: the minimum, the lower quartile (LQ), the median, the upper quartile (UQ) and the maximum. In the box and whisker graph for the males, label and estimate the value of the minimum, the LQ, the median, the UQ and the maximum.

In the box and whisker graph…

• is there more variation in the males or females?

• what is the difference in the medians of the two genders?

• what is a sensible range of values that describes a typical student weight?
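The five values that locate a box and whisker graph can also be calculated directly from the data. The sketch below is one way to do it, assuming two made-up lists of student weights (they are not read from the graph) and a helper, five_number_summary, invented here. Different textbooks and programs use slightly different rules for quartiles, so small differences from a hand calculation are normal.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, LQ, median, UQ, maximum) for a list of numbers."""
    # quantiles(..., n=4) gives the three cut points that split the data into quarters.
    lq, median, uq = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), lq, median, uq, max(values)


# Made-up weights in kg, for illustration only.
male_weights = [52, 58, 61, 64, 68, 70, 73, 77, 94]
female_weights = [45, 50, 53, 55, 58, 60, 62, 66, 75]

male = five_number_summary(male_weights)
female = five_number_summary(female_weights)

print("males   (min, LQ, median, UQ, max):", male)
print("females (min, LQ, median, UQ, max):", female)

# The width of the box (UQ - LQ) is one way to compare variation.
print("male box width:  ", male[3] - male[1], "kg")
print("female box width:", female[3] - female[1], "kg")
print("difference in medians:", male[2] - female[2], "kg")
```

The box runs from the LQ to the UQ, so those two values also give a sensible range for a typical weight in each group.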

A Trout Example

Variation is what makes statistics hard. Variation happens in all manufacturing where anything to do with measurement is involved. Measurement was the original cause of the study of statistics as engineers tried to manage machining and manufacturing variation. They introduced terms such as standard error.

Spread is another name for variation and becomes obvious when data is graphed. Here is a “dot plot” of a sample of trout. Each dot represents a trout. There are two female trout near 3kg and the smallest trout is a male at about 1.1kg. The spread of the females is from about 1.2kg to 3.1kg and the spread of the males is from 1.1kg to about 2.5kg. Most of the females are between 1.25 and 2.1kg and most of the males are between 1.4 and 2.5kg. Comparing like with like, most of the males cover 2.5 – 1.4 = 1.1kg while most of the females cover 2.1 – 1.25 = 0.85kg, so there may be more variation in the size of the males. The full spread tells a different story, 3.1 – 1.2 = 1.9kg for the females compared to 2.5 – 1.1 = 1.4kg for the males, because of the two big females.
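When comparing spread it helps to compare like with like: full spread against full spread, and “most of the fish” against “most of the fish”. The short sketch below does the subtraction for the approximate values quoted above. The names full_spread, most_fish and width are just labels chosen for this example.

```python
# Approximate values in kg, as quoted in the text from the dot plot.
full_spread = {"female": (1.2, 3.1), "male": (1.1, 2.5)}
most_fish = {"female": (1.25, 2.1), "male": (1.4, 2.5)}


def width(low_high):
    """Size of a range: the highest value minus the lowest value."""
    low, high = low_high
    return round(high - low, 2)


for group in ("female", "male"):
    print(
        f"{group}: full spread {width(full_spread[group])} kg, "
        f"most fish within {width(most_fish[group])} kg"
    )
```

The two measures need not agree. A couple of very large or very small fish stretch the full spread without changing where most of the dots sit.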

Look at the trout dot plot sample and list 5 things that could be used to compare the two groups of dots.

Analysis

When comparing two groups using a numerical measure it is usual to look at

• Shape

• Middle 50%

• Oddities

When saying something about each of these a concise literacy strategy is “WWW”.

• W1: What am I talking about?

• W2: Where is it?

• W3: What does it mean?

For example, the shape (W1) of both female and male trout samples is one bump (W1). The bump (W1) is between about 1.25 and 2.25 kg (W2) for both. This means most trout are within this range in our sample (W3).

The middle 50% (15 fish) of the female trout sample (30 fish) is between 1.5 and 2.5kg. The middle 50% of the male trout sample is between 1.7 and 2.4kg. This means a typical female trout and a typical male trout would each be within the range for their group.

The spread is explained above and may suggest there is slightly more variation in the male sample than the female sample.

There are no oddities in this data. There are two big female trout but their weight is not unusual, so they are not odd.

All statistics should be undertaken as part of an investigation. Modern statistical study uses the term “statistical investigation cycle”. This is summed up as PPDAC (Problem, Plan, Data, Analysis, Conclusion). You are a data detective when you use this cycle. What is missing so far is a question that might need to be answered. Comparative questions compare a numerical variable across the two groups of a categorical variable. Here the categorical variable is gender, with the groups male and female, and the numerical variable is weight. The population that the trout were selected from was all of the trout caught and weighed in during the Lake Taupo Fishing Competition held in April 1993.

A suitable question for this data is “I wonder if the weight of a female trout is typically more than the weight of a male trout in the 1993 Lake Taupo Fishing Competition data?”
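A comparative question like this one always has the same shape: split the data by the categorical variable, then compare a measure of the numerical variable for each group. The sketch below shows that shape with a handful of made-up (gender, weight) records. They are NOT the competition data, and the helper medians_by_group is invented for this example.

```python
import statistics
from collections import defaultdict

# Made-up records of (gender, weight in kg); not the real competition data.
trout = [
    ("female", 1.2), ("female", 1.6), ("female", 1.9), ("female", 2.4), ("female", 3.1),
    ("male", 1.1), ("male", 1.7), ("male", 2.0), ("male", 2.3), ("male", 2.5),
]


def medians_by_group(records):
    """Split the records by the categorical variable and find each group's median."""
    groups = defaultdict(list)
    for group, value in records:
        groups[group].append(value)
    return {group: statistics.median(values) for group, values in groups.items()}


medians = medians_by_group(trout)
for group, median in medians.items():
    print(f"{group}: median weight {median} kg")
```

A difference in medians on its own does not answer the question. It has to be read alongside the middle 50% and the spread of each group.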

Sample

Over 1200 fish were caught and weighed in that year, so taking a sample seems a good option to save a bit of time in answering that question. So how big should a sample be?

As a general rule of thumb a sample of about 20 to 50 in size works for measures on animal populations. A bigger sample does help to give a clearer and more reliable picture but it costs more. It is a good idea to explore sample size to see what happens to the variation.
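One way to explore sample size is to take many samples of each size and see how much the sample median moves around. The sketch below does this with a made-up population of 1200 weights standing in for the real data. The weights are generated at random between 1.0 and 3.2kg purely for illustration, and the helper sample_medians is invented for this example.

```python
import random
import statistics


def sample_medians(population, sample_size, repeats, seed=1):
    """Take `repeats` random samples and return the median of each one."""
    rng = random.Random(seed)  # fixed seed so the exploration can be repeated
    return [
        statistics.median(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]


# A made-up population of 1200 weights in kg; NOT the real competition data.
population_rng = random.Random(0)
population = [round(population_rng.uniform(1.0, 3.2), 2) for _ in range(1200)]

for size in (5, 20, 50, 200):
    medians = sample_medians(population, size, repeats=100)
    moved = max(medians) - min(medians)
    print(f"sample size {size:>3}: sample medians range over about {moved:.2f} kg")
```

Try changing the sample sizes and the number of repeats. The interesting thing to watch is how quickly the extra reliability tails off compared with the extra cost of a bigger sample.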

The only remaining step is to answer the question. This data suggests that the female trout are not bigger than the male trout, so back in the population of the 1993 Lake Taupo Fishing Competition it is expected that the same result would be evident.