Summer is rapidly approaching and so too are the advertisements for gyms, weight loss products, and home delivery meal systems. Search engine trends for the past few spring seasons show a marked spike in searches for health and fitness products, which is to be expected given the demand created by all of those pesky swimsuit advertisements. But what about people who search those terms earlier in the year? Are they the same people who search in spring or do they represent a different group of fitness enthusiasts? The availability of large datasets gives researchers an incredible advantage in identifying sub-populations that may have once been overlooked.
With the increasing availability of massive datasets like search engine data, finding sub-populations that don’t fit within the normal data distribution can be simultaneously vexing and exhilarating. The researcher needs to decide whether the outlying group of data represents an actual sub-group of the sample that rightfully deserves exploration or whether it instead represents some sort of systematic error that pulls those data points away from the “normal” group.
Normal is defined as “conforming to a type, standard, or regular pattern.” But who or what defines the type, standard, or pattern? Although seemingly objective, normality is a rather subjective construct.
In a “normal distribution,” roughly 95% of data points fall within two standard deviations of the mean and are considered “typical responses.” What happens to the data outside that 95% cut-off? Those data points get slapped with an outlier or “abnormal” label and marked for further inspection. The very name “outlier” highlights that these data points lie outside of the expected or normal range of values.
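As a rough illustration of the two-standard-deviation rule of thumb, here is a minimal Python sketch. The function name `flag_outliers`, the cut-off parameter `k`, and the sample values are all hypothetical and chosen only for illustration; real projects may use a different cut-off or a more robust method.

```python
from statistics import mean, stdev


def flag_outliers(values, k=2.0):
    """Return the values lying more than k sample standard deviations from the mean.

    Assumption: k=2.0 mirrors the "two standard deviations" rule of thumb;
    it is an illustrative default, not a fixed standard.
    """
    center = mean(values)
    spread = stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - center) > k * spread]


# Hypothetical responses: nine similar values and one that stands apart.
responses = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
print(flag_outliers(responses))
```

Keep in mind that extreme values inflate the standard deviation itself, so a simple rule like this can miss outliers in small samples. It is a starting point for inspection, not a verdict on which responses to remove.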
Outliers in smaller market research datasets can exist for a multitude of reasons, including distracted respondents, lenient quality control checks, or incorrect transcription. In these cases, removing outliers from the dataset is critical to reduce any undue influence on later analyses and models. However, what should a researcher do when an outlier isn’t an error and represents an individual’s true response? If it’s just one data point, then current practice often dictates removing or norming the outlier to fall in place with the rest of the sample. But what happens when it isn’t just one data point? What may have been excluded in a smaller dataset can now be explored more fully in these larger datasets.
Prior to the availability of large datasets, it was more difficult to locate and identify outlying sub-groups within a population as the likelihood of collecting multiple responses from that sub-group was low. Now with datasets that capture responses from tens of thousands of people (and often more), these unique sub-groups that would have initially been written off as aberrant or removed from larger analyses can be assessed uniquely and given their own marker of “normal.” Identifying these unique sub-groups can be particularly critical for new markets, products, or brands that are trying to gain market share and expand into new product or customer territory.
Although normality is still highly subjective, with the availability of more, larger, and more inclusive datasets, understanding where multiple pockets of “normal” responses lie within a population can allow researchers to better understand customer segments and tailor messages and products to meet unique needs. It’s up to researchers and analysts to be open to listening to the story told by the outlying data points and bring that story to light.