**Section 6 (4 credits)**

**Making Sense of Data – Statistics**

Statistics is about **“making sense of data”**.

That single statement will drive everything that follows. First, some language.

**Data**

If you think of a cat, for example, and then try to describe it, you might say it has 4 legs, has black fur, is female, weighs about 4 kilograms, has a 20cm tail, has green eyes, is very friendly, and so on. These are the **properties** of the cat. They are the data that describe this cat. You could also notice that it likes liver, mince, milk, cat-nip and a nice mouse to eat, but does not like kidney or dogs. We are only talking about this one cat. Each property is a **variable**.

Data can be numerical (numbers) or categorical (words). Each piece of data is named by a variable. A collection of variables and the data from a number of cats, say 200, is called a database.

Examples of numerical data above are the number of legs, the cat’s weight and the length of its tail.

Examples of categorical data are the gender, the fur colour, the eye colour and the friendliness.
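One way to picture variables and data is to write the cat down as a record. The sketch below is only an illustration in Python: the variable names (`weight_kg`, `fur_colour` and so on) and the helper `kind_of_variable` are made up for this example, and the values come from the cat described above.

```python
# One cat written as a record: each key is a variable, each value is its data.
# The values are taken from the cat described in the text.
cat = {
    "legs": 4,                        # numerical
    "weight_kg": 4.0,                 # numerical
    "tail_length_cm": 20,             # numerical
    "gender": "female",               # categorical
    "fur_colour": "black",            # categorical
    "eye_colour": "green",            # categorical
    "friendliness": "very friendly",  # categorical
}

def kind_of_variable(value):
    """Classify a value as numerical (a number) or categorical (a word)."""
    # bool is excluded because Python treats True/False as integers.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "numerical"
    return "categorical"

kinds = {name: kind_of_variable(value) for name, value in cat.items()}
for name, kind in kinds.items():
    print(f"{name}: {kind}")
```

A database of 200 cats would simply be a list of 200 records like this one, all sharing the same variables.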

There are thousands of databases available about everything you might imagine. Making sense of this data is what a study of statistics is about.

*Task 1*

*Describe three numerical and three categorical variables and give examples of the data they might contain.*

**Variation**

Not all cats are the same. They vary. They are different colours, weights and genders. Some are friendly and some are not, the tail length varies, and what they eat varies a lot. Variation causes statements in statistics to be what I call “floppy”. A statement becomes more like “This sample data suggests the typical weight of a cat is between 2kg and 3kg”, which is much more meaningful than “The mean weight of a cat is 2.54kg”.

*Task 2*

*Give an example of a variable that might have a large variation in the data and one that might have a small variation in the data.*

**Measures of Middle**

The traditional statistics course makes a lot of use of measures such as the mean, median and mode. In early studies “typical” has largely replaced mean and median, and it is given as a range of values to convey an idea of variation. Mode is almost never used.
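As a rough illustration of these measures, the sketch below uses Python's built-in `statistics` module on a small set of cat weights. The weights are invented for the example, and using the quartiles for the “typical” range is just one reasonable choice of range, not the only one.

```python
import statistics

# Hypothetical cat weights in kg, invented for illustration only.
weights_kg = [2.1, 2.4, 2.4, 2.6, 2.9, 3.0, 4.1]

mean_weight = statistics.mean(weights_kg)      # total weight divided by the number of cats
median_weight = statistics.median(weights_kg)  # middle value once the weights are sorted
mode_weight = statistics.mode(weights_kg)      # the weight that occurs most often

# One way to express "typical" as a range: lower quartile to upper quartile.
lower_q, _, upper_q = statistics.quantiles(weights_kg, n=4)

print(f"mean {mean_weight:.2f} kg, median {median_weight} kg, mode {mode_weight} kg")
print(f"a typical cat in this sample weighs between {lower_q} kg and {upper_q} kg")
```

In this made-up sample the one heavy cat pulls the mean above the median, which is part of why a range of typical values can say more than a single number.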

*Task 3*

*Describe what a mean, a median and a mode are.*

**Graphs of Data**

There are some excellent sense-making graphs now used in statistics and they are easy to make. The days of bar graphs are pretty much gone.

Today “dot plots” and “box and whisker” graphs
are common.
Another modern graph is “stem and leaf”. There are hundreds of
ways to display
data and the graphs that make the most sense are the best!

The graph below is a box and whisker graph. The box contains the middle 50% of the data, so it pretty much represents the typical value and indicates the spread as well. The heavy bar in the middle is the median. This pair of box and whisker graphs shows that half of the male students, all those above the median, are heavier than ¾ of all the female students, that is, those in the box and below. The data in a box and whisker graph is located by quarters: the lower quartile (LQ), the upper quartile (UQ) and, of course, the minimum, the maximum and the median.
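The five values that locate a box and whisker graph can be worked out directly from the data. The Python below is a minimal sketch: the function name `five_number_summary` and the student weights are assumptions made for this example, not the data behind the graph in this section.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, LQ, median, UQ, maximum) for a list of numbers."""
    lq, median, uq = statistics.quantiles(data, n=4)  # the three quartile cut points
    return (min(data), lq, median, uq, max(data))

# Hypothetical student weights in kg, invented for illustration only.
male_weights_kg = [55, 60, 62, 65, 68, 70, 74, 80, 90]

minimum, lq, median, uq, maximum = five_number_summary(male_weights_kg)
print("whisker:", minimum, "to", lq)
print("box:    ", lq, "to", uq, "with the median bar at", median)
print("whisker:", uq, "to", maximum)
```

Quartiles can be calculated in slightly different ways, so other tools may give slightly different LQ and UQ values for the same data.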

*Task 4*

*In the box and whisker graph for the males, label and estimate the value of the minimum, the LQ, the median, the UQ and the maximum.*

*Task 5*

*In the box and whisker graph…*

*• is there more variation in the males or females?*

*• what is the difference in the medians of the two genders?*

*• what is a sensible range of values that describes a typical student weight?*

**A Trout Example**

Variation is what makes statistics hard.
Variation happens
in all manufacturing where anything to do with measurement is
involved.
Measurement was the original cause of the study of statistics as
engineers
tried to manage machining and manufacturing variation. They
introduced terms
such as standard error.

Spread is another name for variation and becomes obvious when data is graphed. Here is a “dot plot” of a sample of trout. Each dot represents a trout. There are two female trout near 3kg and the smallest trout is a male at about 1.1kg.
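A dot plot is simple enough to sketch in plain text. The Python below is a minimal illustration, not the tool used to draw the graph in this section: the function `dot_plot_rows`, the 0.25kg grouping and the weights are all assumptions made for the example.

```python
from collections import Counter

def dot_plot_rows(values, step=0.25):
    """Return one text row per group of width `step`, with one dot per value."""
    # Round each value to its nearest multiple of `step` and count the values there.
    counts = Counter(round(value / step) for value in values)
    rows = []
    for index in range(min(counts), max(counts) + 1):
        rows.append(f"{index * step:.2f} kg | " + "o" * counts[index])
    return rows

# Hypothetical trout weights in kg, invented for illustration only.
sample_kg = [1.1, 1.5, 1.6, 1.9, 2.0, 2.0, 2.4]

rows = dot_plot_rows(sample_kg)
print("\n".join(rows))
```

Each `o` is one fish, so bumps, gaps and the overall spread show up as soon as the rows are printed.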

The spread of the females is from about 1.2kg to 3.1kg and the spread of the males is from 1.1kg to about 2.5kg. Most of the females are between 1.25 and 2.1kg and most of the males are between 1.4 and 2.5kg. So, looking at where most of the fish are, there may be more variation in the size of the males, 2.5 – 1.4 = 1.1kg, compared to 2.1 – 1.25 = 0.85kg for the females, even though the full range is wider for the females, 3.1 – 1.2 = 1.9kg, than for the males, 2.5 – 1.1 = 1.4kg.
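The arithmetic on spread can be checked with a few lines of Python. This is only a sketch: the numbers are the approximate values quoted above, and the names `full_range`, `most_fish` and `width` are made up for the example. It works out the width of both the full range and the range holding most of the fish, for each group, so that like is compared with like.

```python
# Approximate values read from the dot plot, as quoted in the text (kg).
full_range = {"female": (1.2, 3.1), "male": (1.1, 2.5)}
most_fish = {"female": (1.25, 2.1), "male": (1.4, 2.5)}

def width(low_high):
    """Width of a (low, high) range, rounded to hide floating-point noise."""
    low, high = low_high
    return round(high - low, 2)

for group in ("female", "male"):
    print(group,
          "full range:", width(full_range[group]), "kg;",
          "most fish:", width(most_fish[group]), "kg")
```

Which of the two widths is the better measure of variation depends on how much weight is given to the few fish at the extremes.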

*Task 6*

*Look at the trout dot plot sample and list 5 things that could be used to compare the two groups of dots.*

**Analysis**

When comparing two groups using a numerical
measure it is
usual to look at

• Shape

• Middle 50%

• Spread

• Oddities

When saying something about each of these a
concise literacy
strategy is “WWW”.

• W1: What am I talking about?

• W2: Where is it?

• W3: What does it mean?

For example, the shape (W1) of both female and
male trout
samples is one bump (W1). The bump (W1) is between about 1.25 and
2.25 kg (W2)
for both. This means most trout are within this range in our
sample (W3).

The middle 50% (15 fish) of the female trout sample (30 fish) is between 1.5 and 2.5kg. The middle 50% of the male trout sample is between 1.7 and 2.4kg. This means a typical female trout and a typical male trout would each be within the range for their group.

The spread is explained above and may suggest
there is
slightly more variation in the male sample than the female sample.

There are no oddities in this data. There are two big female trout but their weight is usual, not odd.

All statistics should be undertaken as part of an investigation. Modern statistical study uses the term “statistical investigation cycle”. This is summed up as PPDAC: Problem, Plan, Data, Analysis, Conclusion. You are a data detective when you use this cycle.

What is missing so far is a question that needs to be answered. Comparative questions “compare” a numerical variable across the two groups of a categorical variable. Here the categorical variable is gender, male and female, and the numerical variable is weight. The population that the trout were selected from was all of the trout caught and weighed in during the Lake Taupo Fishing Competition held in April 1993.

A suitable question for this data is “I wonder if the weight of a female trout is typically more than the weight of a male trout in the 1993 Lake Taupo Fishing Competition data?”
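A comparative question like this boils down to splitting the numerical variable by the categorical one and comparing the middles. The sketch below shows that idea in Python. The records are invented for illustration and are not the competition data, and the names `records` and `weights_by_gender` are made up for the example.

```python
import statistics

# Hypothetical (gender, weight in kg) records, invented for illustration only.
records = [
    ("female", 1.3), ("female", 1.6), ("female", 1.9), ("female", 2.2), ("female", 3.0),
    ("male", 1.1), ("male", 1.7), ("male", 2.0), ("male", 2.3), ("male", 2.5),
]

# Split the numerical variable (weight) by the categorical variable (gender).
weights_by_gender = {}
for gender, weight_kg in records:
    weights_by_gender.setdefault(gender, []).append(weight_kg)

medians = {gender: statistics.median(weights)
           for gender, weights in weights_by_gender.items()}
difference = round(medians["female"] - medians["male"], 2)
print(medians, "female minus male:", difference, "kg")
```

A small difference in medians compared with the spread of each group is a hint that neither group is typically bigger.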

**Sample**

Over 1200 fish were caught and weighed that year, so taking a sample seems a good way to save a bit of time in answering the question. So how big should a sample be?

As a general rule of thumb a sample of about 20
to 50 in
size works for measures on animal populations. A bigger sample
does help to
give a clearer and more reliable picture but it costs more. It is
a good idea
to explore sample size to see what happens to the variation.
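Exploring sample size is easy to do by simulation. The Python below is a rough sketch under stated assumptions: the population of 1200 weights is generated at random rather than taken from the competition, and the function `spread_of_sample_medians`, the sample sizes and the number of repeats are choices made only for this example.

```python
import random
import statistics

rng = random.Random(1993)  # fixed seed so the run is repeatable

# Hypothetical population of 1200 trout weights in kg, invented for illustration only.
population = [rng.gauss(1.9, 0.4) for _ in range(1200)]

def spread_of_sample_medians(sample_size, repeats=200):
    """Take many samples of one size and measure how much their medians vary."""
    medians = [statistics.median(rng.sample(population, sample_size))
               for _ in range(repeats)]
    return statistics.stdev(medians)

spreads = {size: spread_of_sample_medians(size) for size in (10, 30, 100)}
for size, spread in spreads.items():
    print(f"sample size {size:3d}: sample medians vary by about {spread:.2f} kg")
```

On a run like this the medians from the larger samples should bounce around less, which is the clearer and more reliable picture a bigger sample buys.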

The only remaining step is to answer the question. This data suggests that the female trout are not bigger than the male trout, so back in the population of the 1993 Lake Taupo Fishing Competition it is expected that the same result would be evident.

*Task 7*

*In the text and examples above identify each of the PPDAC steps for the trout example.*
