Extreme analytics: anomalies and outliers

Ask ten people how they define outliers (aka anomalies) and you’ll get ten different answers. It’s not that they are all wrong; it’s just that the term “outlier” can mean different things to different people in different contexts.

“A data point on a graph or in a set of results that is very much bigger or smaller than the next nearest data point.” from the Oxford English Dictionary.

Sometimes we want to detect outliers so we can remove them from our models and graphics. This does not mean we completely disregard the outliers. It means we set them aside for further investigation.

[Figure: outlier regression example]
This simple regression model fits tighter when the outlying data point is removed. Outliers should be investigated and not just removed because they don’t fit the trend.

On other occasions we want to detect the outliers and nothing else, e.g. in fraud detection. Whatever analytics project we are engaged in, outliers matter, so we need techniques for handling them. The surest way to identify an outlier is with subject matter expertise, e.g. if I am studying children under the age of five and one of them is six feet tall, I don’t need statistics to tell me that is an outlier!

So what? Data practitioners don’t always have the luxury of subject matter expertise so we use heuristics instead. I will outline three simple univariate outlier detection methods and why I think the boxplot method outlined by NIST is the most robust method even though it involves a little more work.

The three methods are:

  1. Percentiles, e.g. flag values greater than the 99th percentile.
  2. Standard deviations (SD), e.g. flag values more than 2 SD from the mean.
  3. Boxplot outer fence, e.g. flag values greater than the third quartile plus 3 times the interquartile range.
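As a rough illustration, the three thresholds listed above could be computed along these lines. This is a Python sketch using only the standard library, not the R markdown script behind this post; the function names and the tiny sample are made up for the example, and only the upper threshold of each method is computed.

```python
import statistics

def percentile_threshold(values, pct=99):
    """Value at the given percentile (linear interpolation between ranks)."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[pct - 1]

def sd_threshold(values, k=2):
    """Mean plus k sample standard deviations."""
    return statistics.mean(values) + k * statistics.stdev(values)

def outer_fence(values, k=3):
    """Third quartile plus k times the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)

# Tiny made-up sample, only to show the calls.
sample = [10, 20, 30, 40, 50, 60, 70, 80, 90]
for name, fn in [("percentile", percentile_threshold),
                 ("sd", sd_threshold),
                 ("outer fence", outer_fence)]:
    print(f"{name:12s} {fn(sample):8.1f}")
```

Any value above a threshold would be flagged by that method. Different quantile conventions give slightly different cut points, so exact thresholds may differ a little from other tools.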

I generated dummy graduate salary data with a few deliberate tweaks to see how well each of these methods performs under different data distribution scenarios.

Scenario 1: Normally distributed data

n = 1,000, mean = $50,000, sd = $10,000. Below is a summary of the data including the distribution of the data points and a density curve.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16040   43270   49600   49730   56160   81960
[Figure: Scenario 1: Normally distributed data]

Notice that the SD and percentile methods are too sensitive, i.e. they are flagging values that may be high but are nonetheless clearly part of the main distribution. This is an example of false positive outlier detection. The boxplot outer fence detects no outliers and this is accurate – we know there are no outliers because we generated this data as a normal distribution.
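The behaviour described above can be reproduced with a small Python sketch. It is not the original R code: to keep the result repeatable it uses 1,000 evenly spaced quantiles of a normal distribution (mean $50,000, SD $10,000) as a stand-in for random draws, and it only counts flags in the upper tail.

```python
import statistics
from statistics import NormalDist

# Stand-in for Scenario 1: evenly spaced quantiles of N(50,000, 10,000)
# instead of random draws, so the counts are reproducible.
dist = NormalDist(mu=50_000, sigma=10_000)
salaries = [dist.inv_cdf((i + 0.5) / 1000) for i in range(1000)]

q1, _, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
thresholds = {
    "percentile": statistics.quantiles(salaries, n=100, method="inclusive")[98],
    "sd": statistics.mean(salaries) + 2 * statistics.stdev(salaries),
    "fence": q3 + 3 * (q3 - q1),
}

# Upper-tail flags only, since the examples here look at high salaries.
flags = {name: sum(s > t for s in salaries) for name, t in thresholds.items()}
for name, t in thresholds.items():
    print(f"{name:10s} threshold={t:10,.0f} flagged={flags[name]}")
```

On clean normal data the percentile and SD rules still flag points, while the outer fence sits far beyond the largest value.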

Scenario 2: Skewed data

Now let’s stick an outlier in there. Let’s imagine one graduate in the group struck it lucky and landed a big pay packet of $100k (maybe it’s his uncle’s company or maybe he’s really talented, who knows)!

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16040   43300   49630   49800   56180  100000
[Figure: Scenario 2: Skewed data]

Once again we see the SD and percentile methods are too sensitive. The boxplot method works just right: it catches the one outlier we included but not the rest of the normally distributed data.
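The outer fence for this scenario can be checked by hand from the quartiles printed in the summary above. The short Python sketch below does the arithmetic; the printed quartiles look rounded, so the result is approximate.

```python
# Quartiles as printed in the Scenario 2 summary (apparently rounded).
q1, q3 = 43_300, 56_180

iqr = q3 - q1
outer_fence = q3 + 3 * iqr  # third quartile plus 3 times the IQR

print(f"IQR = {iqr:,}, outer fence = {outer_fence:,}")
print("Is $100,000 beyond the fence?", 100_000 > outer_fence)
```

The $100k salary falls above the fence, while the Scenario 1 maximum of 81,960 falls below it.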

Scenario 3: Even more skew

A handful of graduates came up with some awesome machine learning algorithm in their dissertation and they have been snapped up by Silicon Valley for close to $500k each!

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16040   43330   49650   51960   56310  554200
[Figure: Scenario 3: Even more skew]

Now with a handful more outliers, the SD threshold has moved to the right so much that it surpassed our $100k friend from scenario 2. He is now a false negative for the SD method. The percentile method is still too sensitive but the boxplot method is coming up goldilocks again.
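The shift in the SD threshold can be sketched in Python as follows. The base data are evenly spaced normal quantiles standing in for the generated salaries, and the five large salaries are hypothetical values chosen for illustration; only the 554,200 maximum comes from the summary above.

```python
import statistics
from statistics import NormalDist

def sd_threshold(values, k=2):
    """Mean plus k sample standard deviations."""
    return statistics.mean(values) + k * statistics.stdev(values)

def outer_fence(values, k=3):
    """Third quartile plus k times the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)

dist = NormalDist(mu=50_000, sigma=10_000)
base = [dist.inv_cdf((i + 0.5) / 1000) for i in range(1000)]

scenario_2 = base + [100_000]
# Hypothetical Silicon Valley salaries (assumed values, not the original data).
scenario_3 = scenario_2 + [450_000, 480_000, 500_000, 520_000, 554_200]

for label, data in [("Scenario 2", scenario_2), ("Scenario 3", scenario_3)]:
    sd_t, fence_t = sd_threshold(data), outer_fence(data)
    print(f"{label}: mean + 2*sd = {sd_t:,.0f} (flags $100k: {100_000 > sd_t}), "
          f"outer fence = {fence_t:,.0f} (flags $100k: {100_000 > fence_t})")
```

The mean and SD are both pulled upward by the extreme values, which is why the SD threshold drifts, whereas the quartiles barely move.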

Scenario 4: Percentile threshold on the move

In the first three scenarios the 99th percentile threshold hardly budged. The reason is simple: with n = 1,000, the 99th percentile sits at roughly the 10th highest value. Since we have only added 6 outliers, the 10th highest value is still in the main distribution. So let’s add 10 more outliers (for a total of 16) and see what happens to the percentile threshold.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16040   43430   49830   54200   56630  554200
[Figure: Scenario 4: Percentile threshold on the move]

Boom. The 99th percentile threshold has jumped from being too sensitive (too many false positives) to a point where it is not sensitive enough and it is missing some outliers (false negatives). Notice once again how robust the boxplot method is to skewed data.
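A Python sketch of this effect, under stated assumptions: the base data are evenly spaced normal quantiles rather than random draws, and the 15 large salaries are hypothetical, evenly spaced values between $400k and $540k.

```python
import statistics
from statistics import NormalDist

dist = NormalDist(mu=50_000, sigma=10_000)
base = [dist.inv_cdf((i + 0.5) / 1000) for i in range(1000)]

# 16 planted outliers: the $100k graduate plus 15 hypothetical high earners.
outliers = [100_000] + [400_000 + 10_000 * k for k in range(15)]
salaries = base + outliers

p99 = statistics.quantiles(salaries, n=100, method="inclusive")[98]
q1, _, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
fence = q3 + 3 * (q3 - q1)

# Planted outliers that each rule fails to flag (false negatives).
missed_by_p99 = [s for s in outliers if s <= p99]
missed_by_fence = [s for s in outliers if s <= fence]

print(f"99th percentile = {p99:,.0f}, missed outliers: {len(missed_by_p99)}")
print(f"outer fence     = {fence:,.0f}, missed outliers: {len(missed_by_fence)}")
```

Once more than about 1% of the points are outliers, the 99th percentile lands inside the outlier group itself, so some of them can no longer be flagged.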

Closing comments

  • The boxplot method is no silver bullet. There are scenarios where it can miss, e.g. bimodal data can be troublesome no matter which method you choose. But in my experience the boxplot outer fence is a more robust method of univariate outlier detection than the other two conventional methods.
  • These methods are only good for catching univariate outliers. Scroll back up to the very first chart in this piece and note that the “outlier” is not really an outlier if we only look at x or y values univariately. Detecting outliers in multidimensional space is trickier and will probably require more advanced analytical techniques.
  • The methods discussed here are useful heuristics for data practitioners but we must remind ourselves that the most powerful outlier detection method is often plain old human subject matter expertise and experience.
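To illustrate the second point, here is one simple possibility, sketched in Python with made-up data: a point that looks unremarkable on x or y alone can still stand out once the outer fences are applied to the residuals of a straight-line fit. This is only an illustration of the idea, not a recommendation of a particular multivariate technique.

```python
import statistics

def outer_fences(values, k=3):
    """Lower and upper outer fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flagged(values):
    """Indices of values that fall outside the outer fences."""
    lo, hi = outer_fences(values)
    return [i for i, v in enumerate(values) if v < lo or v > hi]

# Made-up data: y is roughly 2x, except one point that sits well off the trend.
xs = list(range(1, 21))
ys = [2 * x + (0.5 if x % 2 == 0 else -0.5) for x in xs]
ys[9] = 35.0  # x = 10 would be expected near y = 20

# Ordinary least squares fit of y on x, written out by hand.
mx, my = statistics.mean(xs), statistics.mean(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print("flagged on x alone:  ", flagged(xs))
print("flagged on y alone:  ", flagged(ys))
print("flagged on residuals:", flagged(residuals))
```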

The R markdown script used to produce these examples and graphics is available for download from Google Drive here.

A simple solution is often better

“Everything should be as simple as it can be, but not simpler” apocryphal quote attributed to Einstein.

“Among competing hypotheses, the one with the fewest assumptions should be selected.” Occam’s Razor.

We have all heard variations of the above quotes. I always strive to make my solutions as simple as possible so a non-technical audience can digest the findings. The following example is based (loosely!) on a real life analysis.

Background

Choc Bars Inc (CBI) produces high-end chocolate bars for a small but loyal customer base. CBI prides itself on producing top quality chocolate but, like any company, it wishes to keep production costs low. The new CFO Mr McMoney believes that there are substantial economies of scale if each machine produces as many bars as possible. The engineers are skeptical: they worry that trying to get too much out of the machines could lead to higher maintenance costs. The company founder, Madame Chocolat, does not wholly disagree but she is more concerned that choco bar quality remains high.

I collected data from 100 CBI machines and surveyed customers too. I plotted cost against production to see if McMoney’s economies of scale theory played out in practice. I also layered on the customer survey data so Mme Chocolat could see if customers were satisfied with the choco bar quality.

[Figure: Loess model fitted to the cost, production and satisfaction data]

There are clear opportunities for savings if production can be increased in some machines, but diminishing returns are apparent too once production increases beyond about 70 bars per day. In this case all parties agreed that production could be increased in some machines without affecting product quality too much. Mme Chocolat also noted that the machines producing the most bars per day turned out lower quality products.

There was a board meeting coming up at the end of the quarter. McMoney wanted to present projected savings but there was some concern that board members would be unfamiliar with loess models and that this might distract from the findings. A loess model is basically a locally weighted regression model and this gif gives a wonderful visual explanation. To allay concerns, I built a tool with the same data which allowed the user to adjust where they felt the optimum point was; simple linear models were then fit either side of the selected optimum.
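For readers who want to see the “locally weighted regression” idea in code, here is a minimal Python sketch. It is a simplified version (a weighted straight line around each point, with tricube weights) and not the R `loess()` call used for the chart, which has further options this ignores. The function name, the `span` default and the data are assumptions made for the example.

```python
def local_linear_fit(xs, ys, x0, span=0.5):
    """Estimate y at x0 from a weighted straight-line fit to nearby points."""
    k = max(3, round(span * len(xs)))            # number of neighbours to use
    h = sorted(abs(x - x0) for x in xs)[k - 1]   # distance to the k-th nearest point
    # Tricube weights: close to 1 near x0, falling to 0 at distance h.
    ws = [(1 - (abs(x - x0) / h) ** 3) ** 3 if abs(x - x0) < h else 0.0
          for x in xs]
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    return my + (sxy / sxx) * (x0 - mx)

# Made-up curve with diminishing returns, just to exercise the function.
xs = list(range(40, 101, 5))
ys = [1000 / x for x in xs]
smoothed = [local_linear_fit(xs, ys, x) for x in xs]
for x, y, s in zip(xs, ys, smoothed):
    print(f"x={x:3d}  y={y:6.2f}  smoothed={s:6.2f}")
```

Each smoothed value depends only on the points near it, which is what lets the curve bend with the data instead of forcing a single straight line through everything.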

[Figure: Interactive analysis tool uses simpler linear regression models either side of an inflection point that is selected by the user. Color indicates choco bar quality.]
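The core of such a tool, fitting one straight line on each side of a user-chosen point, might look like the Python sketch below. The real tool is a shiny R app; the function names, the `split_at` parameter and the cost figures here are all made up for illustration.

```python
def fit_line(points):
    """Ordinary least squares for (x, y) pairs; returns (slope, intercept)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx

def fit_either_side(points, split_at):
    """Fit one line to points at or below split_at and another above it."""
    left = [p for p in points if p[0] <= split_at]
    right = [p for p in points if p[0] > split_at]
    return fit_line(left), fit_line(right)

# Made-up cost data: falling steeply up to 70 bars/day, then levelling off.
machines = [(x, 10 - 0.1 * x if x <= 70 else 3 - 0.01 * (x - 70))
            for x in range(40, 101, 5)]

(left_slope, _), (right_slope, _) = fit_either_side(machines, split_at=70)
print(f"slope up to 70 bars/day: {left_slope:.3f}")
print(f"slope beyond 70 bars/day: {right_slope:.3f}")
```

Moving `split_at` and refitting is all the interaction needs, so each line stays easy to explain to a non-technical audience.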

So what?

The interactive tool gave the decision makers greater control and allowed them to interact with the data. McMoney had a rough rule-of-thumb for savings projections. Mme Chocolat pushed for a lower production target of 68 bars per day for each machine – beyond that point, quality is reduced. Of course this is a dummy example but it is inspired by a real life scenario. For example the data could be teaching costs, student/teacher ratio and student satisfaction – the numbers would change but the principle would be the same.

The R script for the loess model is here and the link for the interactive simple regression shiny R script is here.