From time to time we get questions about finding the median in statistics. Some are entirely routine; but the three I want to discuss today take us gradually deeper into a morass of ambiguity. A recurring theme of my experience in Ask Dr. Math has been that definitions are not always what they seem – they may be purely arbitrary, they may vary from country to country or application to application, they may be taught incorrectly; yet they are the foundation of all of math. This is a perfect example.
What two parts are equal?
The first of these is, on the surface, very ordinary. Jerry asked,
A Closer Look at the Definition of Median

The definition of median is the middle number and it divides the group into two equal parts. What if you have an odd quantity? If you have an odd quantity, let's say $100, $200, $300, $400, $500. The middle number is 300. How does this divide the group of numbers into two equal groups?
Another Math Doctor, probably thinking the question was (as it usually is) about what to do with an even number of items, where there is no actual “middle number”, just referred Jerry to an archived answer, Mean, Median, Mode, Range, which states the usual definition.
I saw that this was not that usual question, but a very thoughtful one going in the opposite direction:
The definition you quote is an informal one often used in elementary presentations. It's reasonably understandable when stated clearly, but glosses over some difficulties. Let's start with that basic definition, which is generally something like this, including both a motivating concept and specific calculations:

The median of a set of numbers is the number in the middle when the data is sorted, dividing the values into two equal parts. If there are an odd number 2n+1 of values, the median is the (n+1)st value, so that there are n values below it and n above. If there are an even number 2n of values, the median is the average (arithmetic mean) of the two middle numbers, namely the nth and the (n+1)st, so that there are n values below it and n above.

I have included here a little more explanation than is usually given initially, showing in what sense we say that the median divides the data into two equal parts.
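The rule quoted above can be sketched in a few lines of Python. This is only an illustration of the textbook procedure; the function name `median_usual` is my own, and the even-count example is made up for the occasion.

```python
def median_usual(values):
    """Median by the usual rule: sort, then take the middle value (odd count)
    or the mean of the two middle values (even count)."""
    data = sorted(values)
    count = len(data)
    if count == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = count // 2
    if count % 2 == 1:
        # 2n+1 values: the (n+1)st value sits at zero-based index n.
        return data[mid]
    # 2n values: average the nth and (n+1)st values (indices n-1 and n).
    return (data[mid - 1] + data[mid]) / 2


print(median_usual([100, 200, 300, 400, 500]))  # 300
print(median_usual([100, 200, 300, 400]))       # 250.0
```

In Jerry's example the 300 is left out of both halves, which leaves two values below it and two above.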
I went on to illustrate this: with an odd number of values, the median divides the rest of them into two equal groups; it is not included in either part. With an even number of values, any number between the two middle ones would divide all of them into two equal groups. This, I think, answers Jerry’s question.
But what about duplicate numbers?
I saw, though, that if he thought just as carefully about what the definition means in specific examples, he would run into more trouble, so I dug deeper:
Another difficulty that you didn't mention arises when some of the values are equal. What about this:

    1 1 2 2 2 3 3 3 3
            ^

The median is 2; but there are two values less than 2 and four greater than 2. In what way does this divide the set equally?? We have to pretend that the three 2's are different in order to make sense of the statement, so that it is the third 2 that divides the set evenly. But I've never heard anyone say this.

If we want a really precise definition that covers all cases and does not depend on your accepting different meanings for terms like "two equal parts" in different cases, it will be something like this:

The median of a set of data is a number such that NO MORE THAN half are LESS than the median, and no more than half are greater than the median.
This (or something like it) is the actual definition used by mathematicians who want to define exactly what we really mean, not just wave our hands at the idea and pretend it’s clear.
Applying this to the two cases:

With an even number 2n of values, any number less than the (n+1)st will have no more than half (n) less than it, and any number greater than the nth will have no more than half greater. Note that if several values are equal to either of these two middle values, then there will definitely be fewer than half on either side, but it still fits the definition. Again, by convention we choose the mean of the two middle values, but this is not really required by this definition.

With an odd number 2n+1 of values, there are no more than n values less than the (n+1)st value, and n is less than (2n+1)/2.

In my bad example,

    1 1 2 2 2 3 3 3 3
            ^

we see that the number of smaller values (2) and the number of larger values (4) are both less than half of the total (4.5), so it fits the definition. Any number larger than 2, say 2.1, would not, because then there would be 5 values less than the median. Similarly, nothing less than 2 would work.
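The precise definition is easy to test mechanically. Here is a minimal sketch that counts the values on each side of a candidate and compares both counts with half the total; the name `is_median` is hypothetical, chosen only for this illustration.

```python
def is_median(candidate, values):
    """True if no more than half of the values are less than the candidate
    and no more than half are greater than it."""
    less = sum(1 for v in values if v < candidate)
    greater = sum(1 for v in values if v > candidate)
    # Compare doubled counts with the total so that no fractions are needed.
    return 2 * less <= len(values) and 2 * greater <= len(values)


data = [1, 1, 2, 2, 2, 3, 3, 3, 3]
print(is_median(2, data))    # True:  2 below and 4 above, both under 4.5
print(is_median(2.1, data))  # False: 5 values are below 2.1
print(is_median(1.9, data))  # False: 7 values are above 1.9
```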
Note that part of our procedure is actually a mere convention: we could take any number in between, but choose to use the average for consistency. This goes beyond the actual definition, yet becomes part of our practical “definition”.
As I go on to explain, this precise definition ends up yielding exactly the usual rule for finding the median, which is much easier than thinking through the definition. We seldom need to use the precise definition in actual practice; and it would be confusing if we started there in teaching the concept; but if we start with the basic idea (“the number in the middle”) and refine it into the careful definition, then we know that the practical rule we end up with really stands on a firm foundation.
A year later, Henrik asked us the question Jerry hadn’t asked:
Finding the Median with Ties

Find the median for the set {2,3,5,5,5,10}. I understand how to calculate the median when there are odd or even number of elements in a set. However, I am confused about situations when there are ties. For the set given, if I use the traditional method, it would be 5. But 5 would not be a correct median since only one value (10) is above 5, and two values are below 5 (2 and 3). It is therefore not a true central tendency. Is there an alternative way to calculate the correct median in these instances? Thanks.
In answer, I gave a reference to the previous question, and stated the definition I gave there, then showed how it answered Henrik’s question:
Note two things: First, it is _A_ value fitting the condition; we commonly take it as the average of two middle values, but really any number between them would work! Second, we don't say that EXACTLY half are on each side, but only that AT MOST half are on each side. This deals with your issue. In your example, 2, 3, 5, 5, 5, 10, there are 2 values less than 5 and 1 value greater than 5, which fits the definition: no more than 3 are in either part. If we chose anything greater than 5, more than half the data would be less than our "median", and if we chose anything less than 5, more than half would be greater than that. So the only possible choice is 5. The traditional method is in fact an efficient way to find a median that fits the definition.
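To see how much freedom the definition leaves, we can compute the whole range of numbers that satisfy it. In sorted data, every number from the lower middle value to the upper middle value works, and nothing outside that range does. The sketch below returns those two endpoints; the name `median_interval` and the second data set are my own, for illustration only.

```python
def median_interval(values):
    """Return (low, high): the smallest and largest numbers that satisfy the
    'no more than half on either side' definition of the median."""
    data = sorted(values)
    count = len(data)
    if count == 0:
        raise ValueError("the median of an empty data set is undefined")
    low = data[(count - 1) // 2]   # nth value when count = 2n; middle value when odd
    high = data[count // 2]        # (n+1)st value when count = 2n; middle value when odd
    return low, high


print(median_interval([2, 3, 5, 5, 5, 10]))  # (5, 5): only 5 qualifies
print(median_interval([1, 2, 3, 4]))         # (2, 3): anything from 2 to 3 qualifies
```

When the two endpoints differ, the usual rule simply picks their average.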
I also made a comment on his use of the term “measure of central tendency”, which you can read if you are wondering.
How do we extend this to quartiles?
Another 6 years later, Luke asked a long, carefully thought-out question about quartiles; the same basic issue arises here (after all, the median is the second quartile, and also the 50th percentile, so their definitions are closely related). In this case, things are even worse: not only is the definition often not stated clearly, but different textbooks often give entirely different methods of finding “the quartiles”, which yield different answers for any particular data set.
Quartile Conflict
Luke applies two different rules for finding quartiles, first to a typical “nice” example and then to numbers that don’t work out so well, and asks which method is right.
I responded by pointing out the variation in sources (not only textbooks but software), and explaining my preferred method; then I expanded my answer by pointing out that his question couldn’t really be answered without a definition of quartile – far too often textbooks “define” the term only by the method they teach, and never really define what it means!
The problem here is: what should your quartiles "more accurately" reflect? What is the actual DEFINITION of a quartile that the result of the METHOD must agree with? In what I wrote before, I confused these two different concepts, because texts often present the method as the definition. This question is addressed somewhere in the links I gave, but can more easily be seen in this answer relating to the same problem in the definition of the median:

A Closer Look at the Definition of Median
http://mathforum.org/library/drmath/view/72726.html

Adapting the definition I gave there to the first quartile:

The first quartile of a set of data is a number such that NO MORE THAN 1/4 are LESS than the first quartile, and NO MORE THAN 3/4 are GREATER than the first quartile.
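The same kind of check carries over to quartiles by replacing "half" with a general fraction. The sketch below is only an illustration: the name `fits_quantile` is hypothetical, and the five-value data set is made up here (it is not Luke's data).

```python
from fractions import Fraction


def fits_quantile(candidate, values, fraction):
    """True if no more than `fraction` of the values are less than the
    candidate and no more than (1 - fraction) of them are greater."""
    count = len(values)
    less = sum(1 for v in values if v < candidate)
    greater = sum(1 for v in values if v > candidate)
    return less <= fraction * count and greater <= (1 - fraction) * count


FIRST_QUARTILE = Fraction(1, 4)  # exact 1/4, so no rounding issues
data = [1, 2, 3, 4, 5]           # made-up example data

print(fits_quantile(2, data, FIRST_QUARTILE))    # True:  1 <= 1.25 and 3 <= 3.75
print(fits_quantile(1.5, data, FIRST_QUARTILE))  # False: 4 values are greater
print(fits_quantile(2.5, data, FIRST_QUARTILE))  # False: 2 values are less
```

Passing `Fraction(1, 2)` gives the median definition, and `Fraction(3, 4)` gives the third quartile.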
Armed with this definition, I showed that the linear interpolation method, which he had assumed was the most correct, actually didn’t satisfy the definition.
This is followed by further discussion of pedagogical issues: what do you do if your textbook and testing authority force you to use a method that is not the best? It’s an interesting issue, but I won’t go into it here. But you should at least read this whole page, which includes useful links.