It’s been a while since we’ve written about statistics, so I want to start a short series about that. Here, we’ll look into stem-and-leaf plots (also called stemplots).
Creating and using a stem-and-leaf plot
We’ll start with a question from 1997:
Stem-and-leaf Graph or Stemplot
Hi! I was doing a math-a-thon and I got a problem about a stem leaf graph. I am in the advanced math class. My math teacher said it would take two days to teach his advanced class how to do it. Can you help?
Doctor Chita answered:
Hi Scott: Sure, I can try to help. A stem-and-leaf graph, also called a stemplot, is a way to represent the distribution of numeric data. It was invented by John Tukey, a mathematician, and is a quick way to picture data for numbers that are greater than 0. I'll explain using an example.
Tukey’s “exploratory data analysis” is used to visualize data by hand, when there are not too many numbers; the plot looks much like a histogram, showing the “shape” of the data at a glance, but includes the actual data values. It can also be used as a trick for sorting data, as we’ll see. (It can actually be used with negative data, but we rarely see that.)
Making a stemplot
Suppose you have the following set of numbers (they might represent the number of home runs hit by a major league baseball player during his career).
32, 33, 21, 45, 58, 20, 33, 44, 28, 15, 18, 25
The stem of a stemplot can have as many digits as needed, but the leaves should contain only one digit. To create a stemplot to display the above data, you must first create the stem. Since all of the numbers have just two digits, start by arranging the tens digits from smallest to largest.
1
2
3
4
5
Usually we will be dealing with two-digit numbers; sometimes we need to round in order to have only two digits, and often we need to work around a decimal point, as we’ll see. We think of each number as consisting of a “leaf”, the last digit, that identifies the individual number, and the rest of the number as a “stem”, by which the numbers are grouped.
To create the leaves, draw a vertical bar after each of the tens digits and arrange the ones digits from each number in the data set in order from smallest to largest. If there are duplicate numbers, like 33, list each one.
1|58
2|0158
3|233
4|45
5|8
The shape of the resulting display looks something like a bar graph oriented vertically. By examining the stemplot, you can determine certain properties of the data.
For example, to plot the first number, 32, we put its leaf, 2, to the right of its stem, 3. Commonly we will initially place the leaves in the order they arrive, which for our example of 32, 33, 21, 45, 58, 20, 33, 44, 28, 15, 18, 25 will produce this:
1|58
2|1085
3|233
4|54
5|8
For some purposes, it can be left unsorted like this; but for the uses to which we will put it, we need to sort the leaves on each stem, as he did above:
1|58
2|0158
3|233
4|45
5|8
In doing this, we have sorted all the numbers, which we can read back out as
15, 18,
20, 21, 25, 28,
32, 33, 33,
44, 45,
58
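The steps so far (split each number into stem and leaf, place the leaves as they arrive, then sort each row) can be sketched in a few lines of Python. This is only an illustration, assuming non-negative whole numbers; the function names `make_stemplot` and `read_back` are my own, not part of any library.

```python
def make_stemplot(data):
    """Group non-negative integers by stem and sort the leaves on each stem."""
    rows = {}
    for value in data:
        stem, leaf = divmod(value, 10)  # 32 -> stem 3, leaf 2
        rows.setdefault(stem, []).append(leaf)  # leaves in arrival order
    # Sort the stems, and the leaves on each stem.
    return {stem: sorted(rows[stem]) for stem in sorted(rows)}


def read_back(plot):
    """Read the sorted data back out of the plot, row by row."""
    return [10 * stem + leaf for stem, leaves in plot.items() for leaf in leaves]


home_runs = [32, 33, 21, 45, 58, 20, 33, 44, 28, 15, 18, 25]
plot = make_stemplot(home_runs)
for stem, leaves in plot.items():
    print(f"{stem}|{''.join(map(str, leaves))}")
print(read_back(plot))
```

Note that this sketch lists only the stems that actually occur; that is enough for this data set, where every stem from 1 to 5 has at least one leaf.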
Now we can put the plot to use.
Finding the median and mode
You can find the median by counting from either end of the stemplot until you find its center. Here, since there are 12 numbers, the center lies between 28 and 32. The median is the average of the two data points: (28+32)/2 = 30.
Here I have colored the leaves in spectrum order as I crossed them out, working from each end, and ending with the two middle numbers in bold:
1|58
2|0158
3|233
4|45
5|8
Here we can see the middle numbers, 28 and 32. The median is their average. (If there had been an odd number of values, we would have found one middle number, which would be the median.)
We can also just count the total number of data values, \(2+4+3+2+1=12\), and count 6 from one end (left to right, top to bottom) and 6 from the other end (right to left, bottom to top) to find the middle:
---->
1|58
2|0158|
3|233
4|45
5|8
<----
Both approaches use the fact that the leaves represent all the data values listed in order, making this a shorthand for the complete sorted list.
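As a rough sketch of the counting approach, here is how one might find the median directly from the rows, without writing out the full list. I am assuming the plot is stored as a dictionary from each tens-digit stem to its sorted leaves; `kth_value` and `median_from_stemplot` are hypothetical helper names.

```python
def kth_value(plot, k):
    """Return the k-th smallest value (counting from 1) by counting leaves row by row."""
    for stem in sorted(plot):
        leaves = plot[stem]
        if k <= len(leaves):
            return 10 * stem + leaves[k - 1]
        k -= len(leaves)  # skip this whole row and keep counting
    raise IndexError("k exceeds the number of leaves")


def median_from_stemplot(plot):
    """Median of a non-empty stemplot whose leaves are sorted on each stem."""
    n = sum(len(leaves) for leaves in plot.values())  # 2+4+3+2+1 = 12 here
    if n % 2 == 1:
        return kth_value(plot, (n + 1) // 2)
    # Even count: average the two middle values.
    return (kth_value(plot, n // 2) + kth_value(plot, n // 2 + 1)) / 2


plot = {1: [5, 8], 2: [0, 1, 5, 8], 3: [2, 3, 3], 4: [4, 5], 5: [8]}
print(median_from_stemplot(plot))  # 30.0
```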
You can also determine if there is a mode in the data set by looking at the plot. Here, the number 33 is the mode since it is the only value that occurs more than once.
We can determine this simply by looking in each row for duplicate digits:
1|58
2|0158
3|233
4|45
5|8
In general, there could be no mode, or several. See Three Kinds of “Average”.
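Looking for repeated digits within a row can also be sketched in code. The example below assumes the same dictionary layout (stem to list of leaves) and returns a list, so that "no mode" and "several modes" are both representable; the name `modes_from_stemplot` is mine.

```python
from collections import Counter


def modes_from_stemplot(plot):
    """Return every value that occurs most often, or [] if no value repeats."""
    # Count (stem, leaf) pairs, so equal digits on different stems stay separate.
    counts = Counter((stem, leaf) for stem, leaves in plot.items() for leaf in leaves)
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once: no mode
    return sorted(10 * stem + leaf for (stem, leaf), c in counts.items() if c == top)


plot = {1: [5, 8], 2: [0, 1, 5, 8], 3: [2, 3, 3], 4: [4, 5], 5: [8]}
print(modes_from_stemplot(plot))  # [33]
```

The 8's on stems 1, 2, and 5 are not counted together, because they stand for 18, 28, and 58.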
Handling larger numbers
If your data contain three-digit numbers (like batting averages, for example), you can use the same technique. For example, let's assume the data are
298, 303, 285, 311, 225, 315, 250, 305
Ignore the ones digits in each number (these will be the leaves) and look at the remaining two digits in each number (the hundreds and tens digits). The stem will begin at 22 because the smallest number in the data set is 225. The stem will end at 31 because the largest number is 315. Include the two-digit numbers between 22 and 31 in the body of the stem.
It’s important to note that even stems with no leaves are to be included (see below), in order to accurately reflect the shape of the entire distribution. This is why we first find the smallest and largest numbers and list all stems between them, rather than just writing them as we find them.
Once you have the stem, then list the ones digits in each number after the corresponding two-digit number before it. The stemplot will look like this, with no leaves after the numbers without a corresponding value.
22|5
23|
24|
25|0
26|
27|
28|5
29|8
30|35
31|15
If these data represent the batting averages for a particular player, this display indicates that he has had a very successful career - most of his averages are clustered between 280 and 320.
If the numbers were more widely scattered (e.g. from 225 to 791, with 58 stems from 22 to 79, rather than just ten), this method would not work well, and we would probably round to the nearest ten, so that the stems would have only one digit.
One thing not mentioned here is that we often find a decimal point in the data, which we ignore; the plot above could just as well have represented the data 0.298, 0.303, 0.285, 0.311, 0.225, 0.315, 0.250, 0.305, or the data 2.98, 3.03, 2.85, 3.11, 2.25, 3.15, 2.50, 3.05. For this reason, it is common to include a “key” to explain the interpretation. For the original set of data, this might look like
Key: 29|8 = 298
For the others, it might be
Key: 29|8 = 0.298
Key: 29|8 = 2.98
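To illustrate both points (empty stems are kept, and a key records where the decimal point belongs), here is a small sketch. It assumes the data have already been written as whole numbers with the decimal point ignored; the function `stemplot_with_key` and its `decimals` parameter are hypothetical names of my own.

```python
def stemplot_with_key(data, decimals=0):
    """Return the plot as a list of text lines, ending with a key line.

    `decimals` is the number of digits that follow the decimal point in the
    real data (0 for 298, 3 for 0.298, 2 for 2.98).
    """
    rows = {}
    for value in data:
        stem, leaf = divmod(value, 10)
        rows.setdefault(stem, []).append(leaf)
    lines = []
    # List every stem from the smallest to the largest, even those with no leaves.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in sorted(rows.get(stem, [])))
        lines.append(f"{stem}|{leaves}")
    # Use the first data value as the example in the key.
    stem, leaf = divmod(data[0], 10)
    real_value = f"{data[0] / 10 ** decimals:.{decimals}f}"
    lines.append(f"Key: {stem}|{leaf} = {real_value}")
    return lines


batting = [298, 303, 285, 311, 225, 315, 250, 305]
print("\n".join(stemplot_with_key(batting, decimals=3)))
```

With `decimals=3` the last line printed is `Key: 29|8 = 0.298`; with `decimals=0` it would be `Key: 29|8 = 298`.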
Finding the mean
A 1996 question fills in a little gap:
Stem and Leaf Plots
Dear Dr. Math, I am in the Math Counts math competition, and when doing practice problems we came across this problem: Use the stem-and-leaf plot of the recent art project scores to find the mean score. Express as a decimal.
5 | 0 0
4 | 9 7 3 3 1
3 | 8 7
2 | 9
What in the world is a stem-and-leaf plot? Thank you very much, Molly
Here, rather than starting with data and making a stemplot, we are given one and asked to interpret it. (Note that the stems here are given in reverse order.) Doctor Robert answered, not giving a full explanation, but focusing on how to find the mean:
Stem and leaf plots are a way that statisticians can look at the distribution of numbers given to them to analyze. For example, in the stem-and-leaf plot you show, there were two scores in the 50's (they were both 50), 5 scores in the forties (49, 47, 43, 43, 41), two scores in the thirties (38, 37) and one score in the twenties (29). So all of the art scores were 50, 50, 49, 47, 43, 43, 41, 38, 37, and 29. You can find the average score by adding them and dividing by 10.
So the mean is just $$\frac{50+50+49+47+43+43+41+38+37+29}{10}=\frac{427}{10}=42.7$$
The mean doesn’t fit as well into this format as the median and mode; here we are just extracting the original data and finding their mean, rather than using the numbers as displayed. I’ll suggest a possible alternative below.
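Extracting the data from a given plot is mechanical enough to sketch in code. The example below assumes the plot is typed as text with a vertical bar after each stem and spaces between leaves, as in Molly's problem; `read_stemplot` is a hypothetical helper name.

```python
plot_text = """\
5 | 0 0
4 | 9 7 3 3 1
3 | 8 7
2 | 9"""


def read_stemplot(text):
    """Turn lines like '4 | 9 7 3' into the values 49, 47, 43."""
    values = []
    for line in text.splitlines():
        stem, _, leaves = line.partition("|")
        values.extend(10 * int(stem) + int(leaf) for leaf in leaves.split())
    return values


scores = read_stemplot(plot_text)
mean = sum(scores) / len(scores)
print(scores)  # [50, 50, 49, 47, 43, 43, 41, 38, 37, 29]
print(mean)    # 42.7
```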
Finding the mode, mean, and median
One last question, from 2002, will provide a useful review.
Mode, Mean, and Median in Stemplots
I'm trying to help my 6th grader do homework. How do I find a "mode," "mean," and "median" using a stem/leaf plot? Problem:
stem leaf
1 889
2 035579
3 138
4 235
Doctor TWE answered:
Hi Linda - thanks for writing to Dr. Math. Each stem-and-leaf combination represents a data point in our set. So to find the mode, mean, and median of the set, we have to figure out how to interpret their definitions for this type of representation.
Presumably the student this time knows how to make and read a stemplot, which in this example represents the data $$18,18,19,20,23,25,25,27,29,31,33,38,42,43,45$$
Mode
The mode is defined as the data value that occurs most often. So we are looking for the leaf (number) that occurs the most often on one stem of the diagram. In your example, there are two 8 leafs on the 1 stem (i.e. two data points of value 18), and two 5 leafs on the 2 stem (i.e. two data points of value 25). So the data set is "bi-modal" with modes of 18 and 25. Note that I did not count the 5 leaf on the 4 stem because it represents a different value (45) - it just happens to have the same last digit as my mode of 25. I similarly did not count the 8 leaf on the 3 stem, nor the three different 3 leaves.
This is important: Digits on different stems represent different numbers, so we are not counting identical digits, but identical digits on the same stem. The two 9’s do not represent the same number, so we ignore them. Here, the two modes are in red and in green:
1 889
2 035579
3 138
4 235
$$\mathbf{{\color{Red}{18,18}}},19,20,23,\mathbf{{\color{DarkGreen}{25,25}}},27,29,31,33,38,42,43,45$$
Mean
The mean is the conventional "average," and perhaps the best way to find this is to do it the conventional way - add the values and divide by the number of numbers. With the stem-and-leaf plot, that means that we'll have to "read" each stem-and-leaf as a conventional number. For your example we'll get:
(18+18+19+20+23+25+25+27+29+31+33+38+42+43+45) / 15 = 436/15 ≈ 29.1
(Do you see how I got the numbers I added?)
We could, instead, add all the leaves, then add the sum of each stem digit multiplied by its number of leaves, in order to more directly use the stemplot format: $$(8+8+9+0+3+5+5+7+9+1+3+8+2+3+5)+3(10)+6(20)+3(30)+3(40)=\\76+[30+120+90+120]=76+360=436$$ I haven’t seen this done, though!
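Here is that alternative as a short sketch, assuming the plot is stored as a dictionary from each tens-digit stem to its list of leaves. The function name `mean_from_stemplot` is my own.

```python
def mean_from_stemplot(plot):
    """Mean computed from the rows of a non-empty plot, without listing the data."""
    leaf_total = sum(sum(leaves) for leaves in plot.values())  # 76 in the example
    # Each stem contributes 10 * stem once for every leaf on its row.
    stem_total = sum(10 * stem * len(leaves) for stem, leaves in plot.items())  # 360
    count = sum(len(leaves) for leaves in plot.values())  # 15
    return (leaf_total + stem_total) / count


plot = {1: [8, 8, 9], 2: [0, 3, 5, 5, 7, 9], 3: [1, 3, 8], 4: [2, 3, 5]}
print(round(mean_from_stemplot(plot), 1))  # 29.1
```

The unrounded result is 436/15, the same total and count as in the conventional calculation.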
We can also observe that the mean is located in the middle of the data, as indicated by the asterisk:
1 889
2 035579*
3 138
4 235
Median
The median is the middle value in the set. This is relatively simple. Start crossing off pairs of high and low leaves. Start with the leftmost leaf on the lowest stem and the rightmost leaf on the highest stem. When you only have one (or two) leaves left that have not been crossed out, that value (or the average of the two values) is the median. In your example (I'm using matching symbols to show which two were crossed out as a pair):
stem leaf
1 X*#
2 -+=@7@
3 =+-
4 #*X
The one I'm left with is the 7 leaf on the 2 stem, so the median is 27.
That is, using the coloring scheme I used above,
1 889
2 035579
3 138
4 235
In real life we would just mark digits in the order I did here, crossing them off or underlining. And the process is just what we do when the data are all written out: $$18,18,19,20,23,25,25,\mathbf{27},29,31,33,38,42,43,45$$
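The crossing-off procedure itself can be mimicked in code: remove one value from each end until one or two remain. This is only a sketch of that idea, assuming a non-empty plot stored as a dictionary from tens-digit stems to leaves; `median_by_crossing_off` is a hypothetical name.

```python
from collections import deque


def median_by_crossing_off(plot):
    """Cross off the lowest and highest remaining values in pairs."""
    remaining = deque(
        10 * stem + leaf for stem in sorted(plot) for leaf in sorted(plot[stem])
    )
    while len(remaining) > 2:
        remaining.popleft()  # cross off the lowest remaining value
        remaining.pop()      # cross off the highest remaining value
    # One value left: that is the median. Two left: average them.
    return sum(remaining) / len(remaining)


plot = {1: [8, 8, 9], 2: [0, 3, 5, 5, 7, 9], 3: [1, 3, 8], 4: [2, 3, 5]}
print(median_by_crossing_off(plot))  # 27.0
```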