(A new question of the week)
A recent question raised a different issue about grouped frequency distributions than we have discussed previously: What do you do when the last class is labelled something like “30 or more”? As we’ll see, there is no one right answer!
An open question
Here is the initial question, which came in last month:
I understand that to use the formula to find mean for grouped data I have to find the class midpoint first. How do I find the midpoint when the information I have only states “above 100”?
We looked at the mean of a grouped distribution last time. There we had distributions like
Class   Frequency
-----   ---------
37-46      19
47-56      23
57-66      27
67-76      28
This might represent the number of people in an audience of various ages. We found the mean by multiplying the midpoint \(x_i\) of each class by its frequency \(f_i\), adding these up, and dividing by the total number of people:$$\frac{\sum_{i=1}^{n}x_i\cdot f_i}{\sum_{i=1}^{n}f_i}.$$ The midpoint of each class is the average of its upper and lower limits, such as \((37+46)\div 2 = 41.5\) for the first class. (We saw that we can do the same thing with class boundaries, for a continuous distribution.)
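As a quick check of that formula, here is a minimal Python sketch applied to the table above. The function name `grouped_mean` and the way the classes are stored are choices made for illustration only; they are not part of the original exchange.

```python
# Grouped-data mean: sum(midpoint * frequency) / sum(frequency).
# Class limits and frequencies are taken from the table above.
classes = [((37, 46), 19), ((47, 56), 23), ((57, 66), 27), ((67, 76), 28)]

def grouped_mean(classes):
    """Estimate the mean, treating every value in a class as its midpoint."""
    total = sum(f for _, f in classes)
    weighted = sum((lo + hi) / 2 * f for (lo, hi), f in classes)
    return weighted / total

mean_age = grouped_mean(classes)
print(round(mean_age, 2))  # 58.1
```

The step that needs both limits is `(lo + hi) / 2`, and that is exactly where an open-ended class causes trouble.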
What if it looked like this instead?
Class        Frequency
-----        ---------
37-46           19
47-56           23
57-66           27
67 or more      28
The last class has no upper limit; how can we find its midpoint?
Looking back at old unarchived answers (because there is nothing about this published in the Ask Dr. Math archive), I found a couple of questions like this that were never answered, probably because we were too busy even to answer the questions we could answer confidently! Here is one from 2008:
I'm trying to find the mean and median for this frequency distribution:
Minutes of delay   Shop A   Shop B
----------------   ------   ------
Less than 10         20       15
10-15                25       20
15-20                30       30
20-25                25       15
25-30                20       10
30 or more           10        5
How do I calculate the mean if the last interval is open? How do I calculate the midpoint of the last class? Is there another way to do it?
I know that I have to find the midpoint for each interval, multiply it by the frequency of each class, sum it up for all classes and divide it by the total number of observations.
Mean = (Sum mp x f)/n
But I need the midpoint of the last interval!!!
(Note that this distribution is continuous, and has to be interpreted so that “10-15” means “at least 10, and less than 15”, so that 15 is not included in two classes.) The last interval goes, in principle, from 30 to infinity, so its midpoint would be infinite.
A similar question from 2006 included a hint to the intended answer:
The Department of Commerce, Bureau of the Census, reported the following information on the number of wage earners in more than 56 million American homes.
Number of Earners   Number (in thousands)
0                          7,083
1                         18,621
2                         22,414
3                          5,533
4 or more                  2,797
a. What is the median number of wage earners per home?
b. What is the modal number of wage earners per home?
c. Explain why you cannot compute the mean number of wage earners per home.
Hint - the data above is an example of grouped data.
This is not really grouped, as each row pertains to a single value – except for the last, which is a group representing all higher numbers! Apparently the author of this problem says that we can’t find the mean, because of the open-ended class. (Note that we can find the median and the mode, because neither is affected by the outliers in that last class!)
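To see why the median and mode survive the open-ended class, here is a small Python sketch using the counts from the table above. The names `median_value` and `modal_value` are made up for this illustration, and the key `4` simply stands in for "4 or more".

```python
# Counts (in thousands) from the Census table above; key 4 stands for "4 or more".
counts = {0: 7083, 1: 18621, 2: 22414, 3: 5533, 4: 2797}

def median_value(counts):
    """Return the value at which the cumulative count first reaches half the total.

    Simplification: if the running count hit exactly half, the true median would
    be the average of this value and the next; that does not happen here.
    """
    half = sum(counts.values()) / 2
    running = 0
    for value in sorted(counts):
        running += counts[value]
        if running >= half:
            return value

def modal_value(counts):
    """Return the value with the largest count."""
    return max(counts, key=counts.get)

print(median_value(counts), modal_value(counts))  # 2 2
```

Neither calculation ever asks how large the values hidden in "4 or more" actually are; only their count matters, and the halfway point is reached well before that class.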
First answer: Give up
I answered, as I often do when there is no definitive answer in my experience, by searching for suitable sources to get a sense of what sort of answers knowledgeable people give. The sources I found are not necessarily authoritative, but are meant to provide a survey of what might be said about the subject. I started my reply with the technically correct answer:
The quick answer is, you really can’t find the mean of an open-ended distribution. Without having both limits for every class, you just don’t have the information you need. That is explicitly stated here:
https://people.richland.edu/james/lecture/m170/ch03-ave.html
The Mean is used in computing other statistics (such as the variance) and does not exist for open ended grouped frequency distributions.
With no upper limit for a class, we can’t find a midpoint, and therefore can’t use the formula I gave above. It simply doesn’t apply. (I myself would not say that the mean doesn’t exist, but just that we don’t have enough information to find it. If we knew the original data, we could. The author here is referring to the mean of the distribution as presented.)
I didn’t stop there, though:
But there are several ways to deal with this situation, depending on how much you care about accuracy. Since any statistics based on grouped data are only approximations anyway, some guesses can make sense.
As I mentioned last time, such statistics are just estimates based on inadequate data and hopeful assumptions (such as that the data within a class is distributed in such a way that its midpoint is its mean). The formula itself is therefore not perfect, and there is no reason not to try adjusting the data in order to try for the best estimate we can get with even less adequate data!
Second answer: Pretend that each class has the same width
One possibility is just to make the simplest possible assumption:
Looking around for examples, I found several sources that recommend just assuming an upper limit such that the class width of the open-ended class is the same as its nearest neighbor. This is perhaps the easiest solution; I suspect these rules may be given primarily for students, so that they can always find some answer, even if it is not ideal. An example of this is:
3. Open End Intervals:
These are those intervals or classes, which either the lower limit of first interval or the upper limit of last interval or both of these, are not given. Here only an assumption about the length of these intervals is made according to the length of the interval nearest to these intervals.
Let us suppose the given class intervals are: Less than 10, 10-20, 20-30, 30-40, 40-50, more than 50; Then the desired class intervals i.e. 1st and last are 0-10 and 50-60 respectively; as the length of intervals nearest to these two is also 10 i.e. in intervals 10-20 and 40-50. But if class intervals are not equal, then first interval should be taken equal to second and last equal to penultimate one.
But that is not really a very good guess in many real situations, because the presence of the open-ended class is likely due to the fact that there are extreme values — and as you probably know, extreme values have a significant effect on the mean.
As I hinted, I felt that the sources I found that made this recommendation typically were meant not for serious statisticians, but for classes whose students just expect a simple rule for every case. I don’t know the credentials of this source, but it probably represents what is taught in some curriculum at that level. The fact that no justification is given makes the whole idea suspect.
In the example given here, it makes good sense to interpret the first class, “less than 10”, as “0 to 10” (not really because we take the same width, but merely because the numbers presumably can’t be negative). But assuming the last class ends at 60 ignores the fact that if no data values were greater than 60, they would have used 60! On the other hand, probably there are not many greater than 60, so maybe the assumption is good enough.
But, as I said in my response, this reasoning ignores the fact that the mean is strongly affected by outliers – or we might do this in order to deliberately ignore outliers. If that “more than 50” included a value of 100, the mean would be far larger than if we just use a midpoint of 55.
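To get a feel for how much the choice matters, here is a hedged sketch using the Shop A column from the 2008 question. Several things here are assumptions made for illustration: "Less than 10" is read as 0 to 10, the function name `mean_with_open_midpoint` is invented, and the trial values 40 and 50 are arbitrary guesses, not recommendations.

```python
# Shop A delays (minutes). Closed classes as (lower, upper, frequency).
# Assumption: "Less than 10" is read as 0-10, since delays can't be negative.
closed = [(0, 10, 20), (10, 15, 25), (15, 20, 30), (20, 25, 25), (25, 30, 20)]
open_freq = 10  # frequency of the open class "30 or more"

def mean_with_open_midpoint(open_mid):
    """Grouped mean when the open class is represented by the value open_mid."""
    total = sum(f for _, _, f in closed) + open_freq
    weighted = sum((lo + hi) / 2 * f for lo, hi, f in closed)
    return (weighted + open_mid * open_freq) / total

# 32.5 corresponds to pretending the class is 30-35 (same width as 25-30);
# 40 and 50 are arbitrary larger guesses for comparison.
for mid in (32.5, 40, 50):
    print(mid, round(mean_with_open_midpoint(mid), 2))
```

With only 10 of 130 observations in the open class, the estimate moves from about 18.3 to about 19.6 minutes across these guesses. The shift would be larger if the open class held a bigger share of the data or contained more extreme values.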
Third answer: Try a subjective, but informed, guess
What if you really want an answer, and you want it to be the best you can get with the limited information at hand?
I found a nice, long discussion of better guesses (but still guesses) here:
http://uregina.ca/~gingrich/ch51.pdf (pp 37-42)
Open Ended Intervals. As noted in Chapter 4, data is often presented so that it has open ended intervals. If the mean is to be determined for such a distribution, some value has to be entered for X for the open ended interval when using the formula for the mean. Exactly what this value should be is not readily apparent from the table of the distribution. …
About all that can be done in the case of open ended intervals is to pick a value of X which seems reasonable based on what is known about the distribution of the data. Do not pick a value too high, or too low, but pick a value which you think approximately represents the mean value of the variable for the set of cases in the open ended interval.
Note that this is a serious attempt at accuracy, with each choice justified, and with explicit mention of the fact that the conclusions are approximate. For example, in the first example the author deliberately chooses a “midpoint” for the last class that is larger than it would be if that class were the same width as the others, and explains why; the example concludes, “The mean is thus 12.195 thousand dollars, or $12,195. Given the approximations that have been made in this calculation, it might be best to round the mean to the nearest $100 and report it as $12,200, or perhaps round it to the nearest thousand dollars and report it as $12,000.”
Note that if this is done in a classroom exercise, each student might make a different choice – that is what I mean by “subjective”. Each such choice might be equally reasonable; some might be based on better background knowledge than others. Teachers (or students) who are not comfortable with this, or who think every math problem has to have a single correct answer, may choose to stick with the second method, but they are not being honest about the validity of their method.
Addendum:
After writing this, I ran across an article that gives a perspective similar to mine:
What if I have open-ended classes?
For open classes (i.e. classes that don’t have an upper limit or a lower limit), in most cases you can assume those have the same width as the other classes when doing your calculations. In an elementary statistics class, it’s highly unlikely your instructor will throw you a curve ball by creating an unusually wide open-ended class.
However, if you’re working with real-world data—perhaps from a graduate study or work-related study— you may need to use your best judgment when it comes to the midpoint for open classes. If the open class is extremely large, or extremely small, your best guess might be better than a calculated midpoint.
Fourth answer: Make a careful model of the data
If you are doing a really serious study, and you can’t get more detailed data, then you need to approach the matter scientifically, using a model based on what you know about the subject you are studying:
But when you really want a good number, you would want to model the overall data set in such a way that you can estimate the distribution of values in the tail. One place I found this discussed (just as an example) is
https://arxiv.org/ftp/arxiv/papers/1210/1210.0200.pdf
To answer these questions, we have to estimate the mean and variance from the bins. How can we do that? A simple approach is to assume that each family’s income is at the midpoint of its bin. For example, we might assume that all households with incomes in the bin [$0,$10,000) have an income of exactly $5,000. This assumption is unrealistic, but it can be serviceable if the bins are narrow. If the bins are wide, then the midpoint approximation may be less accurate, since within some bins the distribution of households may be highly variable and may not be centered around the bin midpoint. The midpoint approximation also runs into practical difficulties if the data are “top-coded” so that the highest bin is unbounded or censored on the right—as in the Rancho Santa Fe school district, where nearly half the households are in the top bin [$200,000, +∞). Analysts commonly handle top-coding by assuming that the incomes within the top bin fit some distribution (e.g., Pareto). But such assumptions can be inaccurate and are hard to test (Hout 2004).
A more sophisticated approach is to fit a flexible distribution not just to the top bin, but to the entire distribution.
Note that if you want this level of accuracy, you shouldn’t use the midpoint even for closed classes. That is far beyond the context of your question! But this is what you would do if it really mattered.
A simpler example of modeling might be to recognize that within any one class (bin), the data is likely more dense on the side toward the mode, and use the overall shape of the distribution to estimate the slope of the underlying curve, and use that to choose a better number than the midpoint to represent that class. (This is just the musing of a non-statistician; I am not aware of a technique that actually does this.) For an open-ended class, however, this would be extrapolation rather than interpolation, and therefore more risky. The suggestion here is to use knowledge of the nature of the data to choose an appropriate distribution (the choice being justified by the characteristics of the histogram), and then use that to estimate the parameters. This is well beyond my knowledge.
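For readers curious what the Pareto assumption mentioned in the quoted paper might look like in its simplest form, here is a rough sketch of one shortcut. It is not taken from the original answer and is not necessarily what that paper does. It estimates the Pareto shape from the counts in the last two classes, then uses the Pareto mean to represent the open class. The function name `pareto_open_class_mean` is hypothetical, and applying it to Shop A is only to show the arithmetic; whether delay times really have a Pareto-like tail is an unchecked assumption.

```python
import math

def pareto_open_class_mean(lower_prev, lower_open, n_prev, n_open):
    """Estimate the mean of an open top class under a Pareto-tail assumption.

    lower_prev: lower limit of the last closed class
    lower_open: lower limit of the open class
    n_prev, n_open: frequencies of those two classes
    """
    # Under a Pareto tail, P(X >= b) / P(X >= a) = (a / b) ** alpha, so the share
    # of the last two classes that lies in the open one determines alpha.
    alpha = math.log((n_prev + n_open) / n_open) / math.log(lower_open / lower_prev)
    if alpha <= 1:
        raise ValueError("Pareto mean is infinite for shape <= 1")
    # Mean of a Pareto distribution with scale lower_open and shape alpha.
    return alpha, lower_open * alpha / (alpha - 1)

# Shop A: class 25-30 has 20 observations, "30 or more" has 10.
alpha, open_mean = pareto_open_class_mean(25, 30, 20, 10)
print(round(alpha, 2), round(open_mean, 1))  # 6.03 36.0
```

Under these assumptions the open class would be represented by about 36 minutes rather than the 32.5 of the equal-width rule. A different choice of tail distribution would give a different number, which is the whole difficulty.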
Final answer: Ask your instructor
That’s probably a longer answer than you want. Ultimately, if you are in a class, you need to ask your instructor what to do in these cases; if you were doing serious statistical analysis, you might need professional advice.
There are more places in math than you might think where there is no general consensus, and you just have to find out what conventions are used in your class (or in your field).
Our reader found the answer helpful, even though it was open-ended, as we might say …
nice try though….
I like this article; it is helpful to me at the BS level.
I am a teacher and my previous head of department gave another answer. We were teaching IGCSE and IB Maths, and her suggestion was to make the open group twice as big as the previous one. I guess that assumes that the data tails off on the upper end.
Having myself also worked as an examiner (for A-level Maths – Stats module), it is unlikely that in an exam you would be penalised for any sensible decision you make to get to an answer. If you write down your assumption somewhere in your solution, so that the examiner can read it, you should get credit for your calculations.
As for defining what a sensible decision might be, that is harder to be certain about, and depends on the distribution of the data.
This is certainly an open question, isn’t it? As you suggest, there may be many ways to guess at a reasonable approach, depending on what you see in a particular distribution.
I heartily agree with stating assumptions in your work, not only in order to get full credit for answering the problem as you understand it, but simply as a way to distinguish absolute truth from speculation, which is essential to mathematics. This is also one reason why I strongly dislike exams on which you can do nothing but state your answer.
Thanks for your contribution!
Hi all,
Could you use a normal distribution on the data in the other classes and model its skewness and kurtosis, then from this give an estimate for the upper limit?
I have run across a similar problem calculating the average wind speed in an area over the year, where the data is presented as the number of days that had wind speeds in certain classes. To further complicate this problem, there were varying class widths!
Not sure how to adjust and model a distribution curve from partial data though. Sorry I can't help much further; hopefully someone else can?!
Yes, this would fall under my fourth answer, and depends very much on the nature of the data you are given (which is not necessarily normal). Unfortunately, such modeling is beyond my level of expertise; so I would revert to my earlier suggestions unless precision was worth hiring a professional statistician!