We all know that the “average” person or research participant is just a concept – that they don’t actually exist – yet in market research we still regularly base our feedback and forecasts on means, medians, and modes.
Occasionally, quoting the average is the best choice, but in my experience averages are mostly unrepresentative and misleading. Yet business decision-makers are seemingly addicted to them.
Why do we use averages?
- An average is simple to understand, and to communicate
- An average, especially the mean, is part of the status quo – you won’t get blamed for using it
- The mean has a mathematical basis that gives it a top-coat of credibility
- An average lets us gloss over irksome outliers instead of properly accounting for them
- Working with better alternatives is more complicated (i.e. more work)
The irony is that the thing we lose the most when using averages is the thing we most want our research to deliver: realism.
Missing the big picture: Marketers in healthcare talk a lot about "Real World Evidence" like this, adapted from patient diary findings. Unfortunately, in this case, reporting only the mean dose (i.e. the mathematical average) delivers a completely unreal result and totally misses the point. Nor would the other two commonly used averages, the median (the value in the middle when all are arranged in order) or the mode (the most common value or values), help us here. To get the message – the polarisation in dosing – we must show the distribution.
Overlooking outliers: I hear senior business people summarily dismiss outliers. Outliers are also sometimes edited out of data-sets in advance, so as to avoid that kind of incredulous reaction. But they do exist. For example, doctors working at specialist centres often see very large numbers of a particular type of patient. In some therapy areas there may only be one or two such people, like the n=2 shown alongside (1001+ patients each), so a marketing team working the middle ground might miss them. Outliers also distort the mean horribly, dragging it towards themselves (higher, in this case). Once verified, outliers should be the centre of our analytical attention, not ignored. They are, by nature, hugely important.
Misleading: The #1 KPI (key performance indicator) is % market share. Whilst easy to consume, it embodies an inherently reductionist approach and inevitably over-simplifies the market situation. In this example, we also need to know about breadth of use, and care about the 18% of the sample who don't use Brand B at all, as well as the 13% who use it exclusively. Feeding only the mean 33% market share into a forecast model, or a boardroom, would be a mistake.
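The first two points can be checked with a few lines of Python. This is only a minimal sketch: the figures are invented for illustration (they are not the patient diary or caseload data referred to above), and all variable names are hypothetical.

```python
from statistics import mean, median, multimode

# Hypothetical daily doses (mg) for ten patients: a polarised group.
# Invented numbers, not the diary data discussed in the text.
polarised_doses = [10, 10, 10, 10, 20, 80, 90, 90, 90, 90]

dose_mean = mean(polarised_doses)        # 50, a dose nobody actually takes
dose_median = median(polarised_doses)    # also lands in the empty middle
dose_modes = multimode(polarised_doses)  # [10, 90]: the two camps

# Hypothetical patients seen per doctor, including two specialist centres.
caseloads = [12, 15, 18, 20, 22, 25, 30, 35, 1200, 1500]
typical_only = [c for c in caseloads if c < 1000]

caseload_mean = mean(caseloads)      # dragged far upwards by two doctors
caseload_median = median(caseloads)  # barely notices them

print("dose mean/median/modes:", dose_mean, dose_median, dose_modes)
print("caseload mean with outliers:", caseload_mean)
print("caseload mean without outliers:", mean(typical_only))
print("caseload median:", caseload_median)
```

In the dose example the mean and the median agree with each other and are both wrong about every single patient. In the caseload example the mean describes neither the typical doctor nor the two specialists, and the median hides the specialists entirely.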
Is the average ever useful?
Well, the mean is most representative when our data is normally distributed (i.e. in the classic ‘bell’ curve shape). It can also be useful for sorting out multiple scale scores. For example, in a brand performance exercise, average scale scores allow us to quickly sort the strong from the weak. But otherwise, not really, in my view.
What should I be using, if not the average?
The most obvious thing is to show the distribution, as in the visuals above. At least we can then see the whole picture. We process visual information so quickly that doing so won’t add to our mental burden.
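Showing a distribution need not involve charting software. As a rough sketch, the hypothetical helper below (`text_histogram`, not part of any library) counts values into equal-width bins and draws one `#` per respondent. The Brand B shares are invented for illustration and do not reproduce the survey figures quoted earlier.

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Return one text line per bin, with one '#' per observation."""
    # Map each value to the lower edge of its bin, then count per bin.
    bins = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    # Walk every bin from lowest to highest so empty bins are shown too.
    for start in range(min(bins), max(bins) + bin_width, bin_width):
        lines.append(f"{start:>3} | {'#' * bins[start]}")
    return lines

# Hypothetical % of prescriptions going to Brand B, one value per respondent.
brand_b_share = [0, 0, 0, 5, 10, 20, 25, 30, 30, 40, 45, 100, 100]

histogram_lines = text_histogram(brand_b_share, 20)
for line in histogram_lines:
    print(line)
```

Each row is labelled with the lower edge of its bin. The empty rows in the middle are as informative as the full ones, which is exactly what a single average would hide.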
Segmentation analysis is under-used. Cluster and factor analysis can be run on virtually any data-set to see if respondents can be sorted into discrete groups, based on their similarities and differences. This takes us closer to what really matters, the different behaviours of respondents.
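To give a flavour of what cluster analysis does, here is a deliberately tiny sketch: a two-group k-means on a single variable, written in plain Python. This is a simple stand-in chosen for illustration, not a recommended method. Real segmentation would normally use many variables and a proper statistics package. The function name `two_group_split` and the dose figures are hypothetical.

```python
from statistics import mean

def two_group_split(values, iterations=20):
    """Tiny 1-D k-means with two groups; returns (low_group, high_group)."""
    # Start the two group centres at the extremes of the data.
    low_centre, high_centre = min(values), max(values)
    low, high = [], []
    for _ in range(iterations):
        # Assign each value to whichever centre is nearer (ties go low).
        low = [v for v in values if abs(v - low_centre) <= abs(v - high_centre)]
        high = [v for v in values if abs(v - low_centre) > abs(v - high_centre)]
        if not high:  # every value is identical, so there is nothing to split
            break
        # Move each centre to the mean of the values assigned to it.
        low_centre, high_centre = mean(low), mean(high)
    return low, high

# Hypothetical daily doses (mg), invented for illustration.
doses = [10, 10, 10, 10, 20, 80, 90, 90, 90, 90]

low_group, high_group = two_group_split(doses)
print("low-dose segment:", low_group, "mean", mean(low_group))
print("high-dose segment:", high_group, "mean", mean(high_group))
```

Reporting a mean for each segment is far more realistic than reporting one mean for everybody, because each segment's mean sits among real respondents.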
What if I have to show the average?
Know which is most appropriate for your data – the mean, the median, or the mode. If using the mean, accompany it with the standard deviation, which summarises how widely the data is spread, or the standard error, which indicates how precisely the mean has been estimated. The standard deviation in particular tells us how representative the mean is likely to be: when the data is widely dispersed, the mean is essentially meaningless. The median and mode are often better guides to ‘typical’ behaviour than the mean.
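As a minimal sketch of why the spread matters, the two invented data-sets below share exactly the same mean but have very different standard deviations. The helper name `summarise` is hypothetical. The standard error is taken here as the sample standard deviation divided by the square root of the sample size, which assumes a simple random sample.

```python
from math import sqrt
from statistics import mean, stdev

def summarise(values):
    """Return the mean with its sample standard deviation and standard error."""
    sd = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return {"mean": mean(values), "sd": sd, "se": sd / sqrt(len(values))}

# Two hypothetical samples, invented for illustration.
tight = [48, 49, 50, 50, 50, 50, 51, 52]             # everyone near the mean
spread = [10, 10, 10, 10, 20, 80, 90, 90, 90, 90]    # nobody near the mean

tight_summary = summarise(tight)
spread_summary = summarise(spread)

for label, summary in (("tight", tight_summary), ("spread", spread_summary)):
    print(f"{label}: mean={summary['mean']} "
          f"sd={summary['sd']:.1f} se={summary['se']:.1f}")
```

Quoted alone, both means look identical. Quoted with the standard deviation, it is obvious that only the first one describes anybody.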
John Aitchison, Managing Director, First Line Research
john@firstlineresearch.com // +44(0)1904 799550 // @first_line_res