Ask ten people how they define outliers (aka anomalies) and you’ll get ten different answers. It’s not that they are all wrong it’s just that the term outlier can mean different things to different people in different contexts.
“A data point on a graph or in a set of results that is very much bigger or smaller than the next nearest data point.” from the Oxford English Dictionary.
Sometimes we want to detect outliers so we can remove them from our models and graphics. This does not mean we completely disregard the outliers. It means we set them aside for further investigation.
On other occasions we want to detect the outliers and nothing else, e.g. in fraud detection. Regardless of what analytics project we are engaged in, outliers are very important so we have to come up with some techniques for handling them. The surest way to identify an outlier is with subject matter expertise, e.g. if I am studying children under the age of five and one of them is 6 foot tall, I don’t need statistics to tell me that is an outlier!
So what? Data practitioners don’t always have the luxury of subject matter expertise so we use heuristics instead. I will outline three simple univariate outlier detection methods and why I think the boxplot method outlined by NIST is the most robust method even though it involves a little more work.
The three methods are:
- Percentiles, e.g. flag values greater than 99th percentile.
- Standard deviations (SD), e.g. flag values more than 2*sd from the mean.
- Boxplot outer fence, e.g. flag values greater than the third quartile plus 3 times the interquartile range.
I generated dummy graduate salary data with some select tweaks to see how well each of these methods perform under different data distribution scenarios.
Scenario 1: Normally distributed data
n = 1,000, mean = $50,000, sd = $10,000. Below is a summary of the data including the distribution of the data points and a density curve.
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 16040 43270 49600 49730 56160 81960
Notice that the SD and percentile methods are too sensitive, i.e. they are flagging values that may be high but are nonetheless clearly part of the main distribution. This is an example of false positive outlier detection. The boxplot outer fence detects no outliers and this is accurate – we know there are no outliers because we generated this data as a normal distribution.
Scenario 2: Skewed data
Now let’s stick an outlier in there. Let’s imagine one graduate in the group struck it lucky and landed a big pay packet of $100k (maybe it’s his uncle’s company or maybe he’s really talented, who knows)!
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 16040 43300 49630 49800 56180 100000
Once again we see the SD and percentile methods are too sensitive. The boxplot method works just right, it catches the one outlier we included but not the rest of the normally distributed data.
Scenario 3: Even more skew
A handful of graduates came up with some awesome machine learning algorithm in their dissertation and they have been snapped up by Silicon Valley for close to $500k each!
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 16040 43330 49650 51960 56310 554200
Now with a handful more outliers, the SD threshold has moved to the right so much that it surpassed our $100k friend from scenario 2. He is now a false negative for the SD method. The percentile method is still too sensitive but the boxplot method is coming up goldilocks again.
Scenario 4: Percentile threshold on the move
In the first three scenarios the 99th percentile threshold hardly budged. Because simply put: n = 1,000 and so the 99th percentile represents the 10th highest value. Since we have only added 6 outliers, the 10th highest value is still in the main distribution. So let’s add 10 more outliers (for a total of 16) and see what happens to the percentile threshold.
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 16040 43430 49830 54200 56630 554200
Boom. The 99th percentile threshold has jumped from being too sensitive (too many false positives) to a point where it is not sensitive enough and it is missing some outliers (false negatives). Notice once again how robust the boxplot method is to skewed data.
- Boxplot method is no silver bullet. There are scenarios where it can miss, e.g. bimodal data can be troublesome no matter which method you choose. But in my experience boxplot outer fence is a more robust method of univariate outlier detection than the other two conventional methods.
- These methods are only good for catching univariate outliers. Scroll back up to the very first chart in this piece and note that the “outlier” is not really an outlier if we only look at x or y values univariately. Detecting outliers in multidimensional space is trickier and will probably require more advanced analytical techniques.
- The methods discussed here are useful heuristics for data practitioners but we must remind ourselves that the most powerful outlier detection method is often plain old human subject matter expertise and experience.
The R markdown script used to produce these examples and graphics is available for download from Google Drive here.