Extreme analytics: anomalies and outliers

Ask ten people how they define outliers (aka anomalies) and you’ll get ten different answers. It’s not that they are all wrong; it’s just that the term outlier can mean different things to different people in different contexts.

“A data point on a graph or in a set of results that is very much bigger or smaller than the next nearest data point.” – Oxford English Dictionary

Sometimes we want to detect outliers so we can remove them from our models and graphics. This does not mean we completely disregard the outliers. It means we set them aside for further investigation.

[Figure: outlier regression example] This simple regression model fits tighter when the outlying data point is removed. Outliers should be investigated, not just removed because they don’t fit the trend.

On other occasions we want to detect the outliers and nothing else, e.g. in fraud detection. Regardless of the analytics project we are engaged in, outliers are important, so we need some techniques for handling them. The surest way to identify an outlier is with subject matter expertise, e.g. if I am studying children under the age of five and one of them is six feet tall, I don’t need statistics to tell me that is an outlier!

Data practitioners don’t always have the luxury of subject matter expertise, so we use heuristics instead. I will outline three simple univariate outlier detection methods and explain why I think the boxplot method outlined by NIST is the most robust of them, even though it involves a little more work.

The three methods are:

  1. Percentiles, e.g. flag values greater than 99th percentile.
  2. Standard deviations (SD), e.g. flag values more than 2*sd from the mean.
  3. Boxplot outer fence, e.g. flag values greater than the third quartile plus 3 times the interquartile range.

I generated dummy graduate salary data, with some select tweaks, to see how well each of these methods performs under different data distribution scenarios.

Scenario 1: Normally distributed data

n = 1,000, mean = $50,000, sd = $10,000. Below is a summary of the data including the distribution of the data points and a density curve.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16040   43270   49600   49730   56160   81960
[Figure: Scenario 1: Normally distributed data]

Notice that the SD and percentile methods are too sensitive, i.e. they are flagging values that may be high but are nonetheless clearly part of the main distribution. This is an example of false positive outlier detection. The boxplot outer fence detects no outliers and this is accurate – we know there are no outliers because we generated this data as a normal distribution.

Scenario 2: Skewed data

Now let’s stick an outlier in there. Let’s imagine one graduate in the group struck it lucky and landed a big pay packet of $100k (maybe it’s his uncle’s company or maybe he’s really talented, who knows!).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16040   43300   49630   49800   56180  100000
[Figure: Scenario 2: Skewed data]

Once again we see the SD and percentile methods are too sensitive. The boxplot method works just right: it catches the one outlier we included but none of the rest of the normally distributed data.

Scenario 3: Even more skew

A handful of graduates came up with some awesome machine learning algorithm in their dissertation and they have been snapped up by Silicon Valley for close to $500k each!

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16040   43330   49650   51960   56310  554200
[Figure: Scenario 3: Even more skew]

Now, with a handful more outliers, the SD threshold has moved so far to the right that it has surpassed our $100k friend from scenario 2. He is now a false negative for the SD method. The percentile method is still too sensitive, but the boxplot method comes up goldilocks again.

Scenario 4: Percentile threshold on the move

In the first three scenarios the 99th percentile threshold hardly budged. The reason is simple: with n = 1,000 the 99th percentile sits at roughly the 10th highest value, and since we had added only 6 outliers, the 10th highest value was still inside the main distribution. So let’s add 10 more outliers (for a total of 16) and see what happens to the percentile threshold.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16040   43430   49830   54200   56630  554200
[Figure: Scenario 4: Percentile threshold on the move]

Boom. The 99th percentile threshold has jumped from being too sensitive (too many false positives) to a point where it is not sensitive enough and it is missing some outliers (false negatives). Notice once again how robust the boxplot method is to skewed data.
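The jump can be reproduced in a few lines. This is a Python sketch with synthetic data standing in for the original R simulation, using a nearest-rank percentile; the exact numbers differ from the scenarios above but the mechanism is the same:

```python
import random

def p99(values):
    """99th percentile by the nearest-rank convention."""
    xs = sorted(values)
    return xs[-(-99 * len(xs) // 100) - 1]

random.seed(1)
base = [random.gauss(50000, 10000) for _ in range(1000)]

# Six outliers among ~1,000 points: the 10th-highest value is still an
# ordinary data point, so the 99th percentile threshold barely moves.
six = base + [100000.0] + [500000.0] * 5

# Sixteen outliers: the 10th-highest value is now itself an outlier, so
# the threshold leaps into the outlier cluster and misses everything
# below it, including our $100k friend.
sixteen = six + [500000.0] * 10
```

Printing `p99(base)`, `p99(six)` and `p99(sixteen)` shows the first two close together inside the main distribution and the third sitting at $500k.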

Closing comments

  • The boxplot method is no silver bullet. There are scenarios where it can miss; bimodal data, for example, can be troublesome no matter which method you choose. But in my experience the boxplot outer fence is a more robust method of univariate outlier detection than the other two conventional methods.
  • These methods are only good for catching univariate outliers. Scroll back up to the very first chart in this piece and note that the “outlier” is not really an outlier if we look at the x or y values in isolation. Detecting outliers in multidimensional space is trickier and will probably require more advanced analytical techniques.
  • The methods discussed here are useful heuristics for data practitioners but we must remind ourselves that the most powerful outlier detection method is often plain old human subject matter expertise and experience.

The R markdown script used to produce these examples and graphics is available for download from Google Drive here.


Geographic Information Systems

Geographic information system is a clunky term for what the layman simply calls maps. Ok, ok, there is more to it than that with shapefiles and polygons and metadata, etc, etc. But the general gist is a visualization of geographical data and this challenge has been tackled for aeons, or at least for a long time before data-viz became trendy. In 2012, Tableau put together what they consider to be The 5 Most Influential Data Visualizations of All Time and I was not surprised to see John Snow’s Cholera Map of London in the mix as well as Napoleon’s March on Moscow (which is kinda sorta GIS mapsy).

Personally, I cut my teeth in GIS as a young civil engineer when I worked in the Irish sewerage and rainwater drainage industry – this Wad River Catchment Flood Study (pdf) includes some elegant geographic visualizations that I helped develop. Being an engineer during the Irish property bubble, I witnessed a lot of housing construction in areas where there was subsequently little or no demand. Depending on who you talk to, this was either due to greedy bankers, over-exuberance in the market or a myriad of other explanations that spew from experts’ mouths. Given my proximity to the construction industry I was keenly aware of various tax incentives that were on offer for building houses in certain geographic areas, and I suspected these government interventions, although well-intentioned, may have had a negative impact.

In 2012 I analyzed the National Unfinished Housing Developments Database to see if there was a link between government incentives and ghost estates. The results of my geographic analysis indicate that, yes, there is some evidence to suggest that the government exacerbated the housing bubble/bust for the very areas they were trying to help. My analysis was crude but compelling (if I do say so myself!)

Quality, Scope, Time: Pick 2

It’s an old project management adage that appears in many similar forms and the general gist is: in many projects, though we strive to deliver the full scope, to a high quality, and in good time, we often have to choose two and sacrifice one.

For example, if I asked you to cook a chicken breast in one minute, your options are:

  1. Sacrifice quality: I’m sure you could cook it for one minute but the quality would be suspect and I would not recommend eating it.
  2. Sacrifice scope: If I insisted that I needed something in one minute, perhaps you could reduce the scope by cutting off a tiny morsel of chicken and cooking only that.
  3. Sacrifice time: Use as many minutes as you need to cook the full chicken breast to a delicious high quality.

In the above example, I presented the three options in the best case, whereby we take control up front and choose which element to sacrifice. Problems are exacerbated when we kid ourselves into thinking we can get everything done; when this happens, we often end up in one of the following undesirable scenarios:

  1. Delivering sloppy low quality work on time.
  2. Dropping entire components of the work last minute because we did not have time to deliver everything.
  3. Realizing too late that we are not going to meet a deadline and having to make that awkward call to whoever is awaiting delivery of our work.

There is no silver bullet solution to this conundrum but, in my experience, the best approach is to get ahead of it early: if you sense a project is facing this problem, flag it and figure out which of the three elements can be adjusted. Usually we will not want to sacrifice quality, so the discussion centers on either reducing the scope to meet a deadline or pushing out the deadline to a more reasonable future date. That said, don’t rule out sacrificing some quality; there may be options to pare back to a minimally sufficient solution – sometimes the customer just wants a simple tire swing!

A final note: I think the Agile methodology really addresses the nub of this issue because quality is assured by including testing as an inherent part of a sprint. Time is fixed – typically two-week sprints – and this leaves scope to be adjusted and agreed upon at the beginning of each sprint.

Tidy vs EAVil data structures

“Don’t take a fence down until you know why it was put up.” – G.K. Chesterton

I once had the joy [sarcasm] of working with an entity-attribute-value (EAV) data model. When I first laid eyes on it I knew something was different but I didn’t know what exactly – I didn’t even know what EAV was at the time. I typically work downstream of database (DB) design: ETL/SQL experts and DB administrators look after the design of DBs, while I simply query them to get the data I want for analysis. So I’ve seen a lot of DBs but I haven’t designed any – however, you don’t have to be an architect to recognize when a structure looks a bit dodgy!

From my basic knowledge of relational DB structure:

  1. Each table should be related to a particular subject or entity, e.g. a product table, a customer table, etc.
  2. Each row represents a record, so a row in the product table is a product, a row in the customer table is a customer, etc.
  3. Each column is an attribute holding some type of information about the records, e.g. in the customer table there could be a first_name column, an age column, etc. Columns should also have reasonably meaningful names.

This comes from Codd’s third normal form and it is the basis for Hadley Wickham’s approach to tidy data (pdf). The EAV model I came across was not structured like this – see the sample snippet in this Google spreadsheet to compare how the same data is structured in the EAV versus the tidy format. As mentioned already, I am not a DB architect, so maybe this data structure was something awesome that I just had not heard of yet. I’m always willing to learn new things, so I started to fish around for explanations, but what I discovered only confirmed my initial fears:

  • This Microsoft Database Design Basics article specifically mentions that “When you see columns numbered this way [repeating groups like numerator 1, numerator 2, numerator 3, etc], you should revisit your design.”
  • My exact concern came up once on StackExchange and the most popular answer gives a very strong argument against EAV.
  • We needed to create scatter plots on dashboards for this client and to create scatter plots with tidy data I could simply plot one column against another but with EAV data we’d need to write quite a few lines of code to reshape the data before doing a simple scatter plot.
  • Tidy data is more amenable to the analytical base table structure that is required for model building.
  • EAV places different data types in the same column. Our data are often strictly counts or decimals, and it is better practice to store them as integer or float data types as appropriate.
  • EAV is not easily human readable and generally needs to be pivoted or reshaped to do anything useful. A conventional relational model is more human readable because it is easy to look across a row and see attributes for a particular entity.
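To make the reshaping complaint concrete, here is a minimal Python sketch of pivoting EAV triples into tidy, one-record-per-entity rows. The `eav_to_tidy` helper and the column and entity names are invented for illustration; they are not the client’s actual schema:

```python
def eav_to_tidy(rows):
    """Pivot (entity, attribute, value) triples into one record per entity.

    `rows` is an iterable of (entity_id, attribute, value) tuples, mirroring
    the shape of an EAV table; the names here are illustrative only.
    """
    tidy = {}
    for entity, attribute, value in rows:
        tidy.setdefault(entity, {"id": entity})[attribute] = value
    return list(tidy.values())

# A tiny EAV snippet: one row per fact instead of one row per customer.
eav_rows = [
    ("cust_1", "first_name", "Ada"),
    ("cust_1", "age", 36),
    ("cust_2", "first_name", "Bob"),
    ("cust_2", "age", 41),
]
# tidy_rows has one record per customer with meaningful columns, so
# plotting one attribute against another is a simple column lookup.
tidy_rows = eav_to_tidy(eav_rows)
```

This is the extra reshaping step the scatter-plot bullet above is complaining about – with a tidy table it simply would not be needed.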

Unfortunately for me, a great deal of reporting and tools had been built on the existing database so even when I eventually persuaded the client to revert to a more conventional structure there was a real risk of breaking other things.

My takeaway from this experience is that a good default first guess in any arena is the conventional solution, but we should always be open to hearing about new and better ways of doing things. Conversely, if you are proposing a new approach, be sure that you thoroughly understand the conventional solution first. Which is my long-winded way of paraphrasing that beautiful Chesterton quote at the top of this post – see here for more on that quote.

P.S. If you ever find yourself in my predicament and you need more ammo to talk someone out of EAV here are some more resources to check out:

  • Number 3 in this list of Five Simple Database Design Errors You should Avoid gives an example of how queries can become very complicated in EAV models.
  • This post on Ten Common Database Design Mistakes again mentions the issue of repeating column names appended with numbers, i.e. num1, num2, num3, etc, and the problems it can lead to.
  • This article outlines some of the attractions of EAV, namely its simple data model and the ease with which new attributes can be added, but ultimately the author considers it dangerous and recommends extreme caution before implementing EAV.
  • Here is a rather long but humorous story of how an EAV structure brought down an entire IT department.
  • I followed Karl Popper’s scientific method and kept looking for something that would prove me wrong, and I eventually found one example of a large e-commerce platform, Magento, using EAV – it seems the attraction for them is the scalability. But it will take more than this one example for me to embrace EAV!

Scaled Agile Framework

In December 2015 I earned my Scaled Agile Framework Practitioner certificate. Suffice to say I am a big fan of the Agile Manifesto. The SAFe homepage is a rich source of information and if you google “scrum” or “agile” you will find a host of awesome resources.

But right here I just want to share my takeaways from the training and specifically how it can relate to delivering data analytics projects. Think of it as Anto’s unofficial guide to Agile in no particular order:

  1. Always remember what the fundamental deliverable is: value. We can lose sight of that when we’re down in the weeds of a project.
  2. Write your tests before you code, not after.
  3. Agile is robust and designed to handle the world as it is, not as we would like it to be. However, C-level people often want certainty, i.e. what will be completed when, and this is difficult to deliver in an uncertain world. Do not underestimate the challenge of selling Agile methodology to clients.
  4. Having a constant predictable work velocity is better than having alternate periods of high and low workload.
  5. User stories are not pushed on people; they are pulled down by members of the team. This is a bottom-up approach to management: people are empowered to take on work items.
  6. Build good people and they will build good products.
  7. Five Whys: Asking why five times consecutively will often get to the root of a problem.
  8. Clearly define performance metrics upfront. Do not assume that everyone agrees on a definition, make sure stakeholders sign off on it.
  9. Data, like food, has a shelf life. It is better to get interim “good enough” information to stakeholders in a timely manner rather than spend a long time working on the “best” product.
  10. Multitasking is a drag, see here, here and here. Work one story at a time, one sprint at a time.
  11. Quality, scope, time: pick 2 out of the 3. We should always want to maintain high quality. Time can often be rigid due to external stakeholder needs. Scope is the one we sometimes need to adjust.
  12. Timeboxing at every level is important whether it’s planning meetings, scrum meetings or sprints themselves.
  13. There should be one product owner. I know from experience, too many chiefs can really suck.
  14. All activities should have business value and you should be able to explain that value to stakeholders. Even experimental efforts that fail can add value.
  15. Structure user stories like so: As a <…> I want to <…> so that <…>.
  16. A user story is not complete without acceptance criteria.
  17. No changing priorities during the two-week sprint.
  18. Leave a little slack in your plan to allow for inevitable unknowns.

Anomaly detection with Benford’s Law

Benford’s Law: the principle that in any large, randomly produced set of natural numbers, such as tables of logarithms or corporate sales statistics, around 30 percent will begin with the digit 1, 18 percent with 2, and so on, with the smallest percentage beginning with 9. The law is applied in analyzing the validity of statistics and financial records. (from Google)

Benford’s Law can be used to detect anomalous results. Detecting anomalies is important in tax, insurance and scientific fraud detection.

In this example we are analyzing daily sales data from 11 Widgets Inc. stores. The board suspects some stores might be fabricating sales numbers. The only way to prove it is via a full audit of a store’s accounts. Auditing is expensive and so the board asked me to identify one store to audit.

We have 365 days of sales data for each of 11 stores – the data can be viewed here. In R I used the substring function to get the first digit from every day of sales for every store – full R script available here. Not every first digit distribution follows Benford’s Law but many do. It was clear from the plots that Benford’s Law was applicable on this occasion.

[Figure: Benford G] The distribution of first digits from daily sales in Store G (orange bars) follows Benford’s Law (blue dots) quite well.

By plotting the observed distribution of first digits for daily sales from each store against the expected distribution we can identify which store veers furthest from the expected distribution. Visually, we can see that the sales numbers from Store K might be worth auditing.

[Figure: Benford K] The distribution of first digits from daily sales in Store K (orange bars) does not follow Benford’s Law quite so well.

To augment this analysis and provide a more objective output, a chi-squared test was carried out for each store. The chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies; a lower p-value indicates a more anomalous distribution of first digits. Store K was audited, and eventually they admitted to randomly generating sales numbers for many days during the year.
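The approach can be sketched as follows. The original analysis was in R; this is a Python stand-in with synthetic data in place of the store sales. The expected Benford proportion for leading digit d is log10(1 + 1/d), and the chi-squared statistic is the usual sum of (observed − expected)² / expected over the nine digits; a p-value would come from comparing that statistic against a chi-squared distribution with 8 degrees of freedom:

```python
import math

def first_digit(x):
    """Leading digit of a positive number."""
    return int(x / 10 ** math.floor(math.log10(x)))

def benford_chi_squared(values):
    """Chi-squared statistic of observed first digits vs Benford's Law."""
    observed = [0] * 9
    for v in values:
        observed[first_digit(v) - 1] += 1
    n = len(values)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # Benford proportion for digit d
        stat += (observed[d - 1] - expected) ** 2 / expected
    return stat

# An "honest" store: sales spread over several orders of magnitude follow
# Benford's Law closely (here a geometric series over three decades).
honest = [10 ** (3 * i / 365) for i in range(365)]
# A "fabricated" store: uniformly spread numbers give a flat first-digit
# distribution, which produces a much larger chi-squared statistic.
fabricated = [100 + 900 * i / 365 for i in range(365)]
```

Ranking the stores by this statistic (or by p-value) is what singled out Store K for audit.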

Note that disagreement with Benford’s Law does not prove anything. It merely offers a hint. A full audit of Store K was still necessary to prove any wrongdoing.

In this example we saw how data analytics identified which store was most likely to be the culprit, thus avoiding audits of multiple stores. That’s what good analytics is about – the art and science of better.

Big data, big deal

An old blogpost of mine from 2014 when I worked at an energy management company. Below is my original (shorter and better) version. Big data guff that the marketing team added in has been crossed out!

When I first told friends and family that I was going back to school to study analytics, everyone kept asking, “Analytics? What’s that?” Fast forward a few years and data analytics is everywhere – often prefixed with the word BIG! So what drew me to the sexiest job of the 21st century? The exponential rise in digital data means there is a corresponding demand for people, organizations and tools that can harness that data and deliver insight. GridPoint is such an organization and GridPoint Energy Manager is that tool.
That word, insight, is a critical and sometimes neglected piece of the big data puzzle. Collecting data is relatively straightforward, but the results are complex. In fact, WIRED reports that researchers at the University of California at Berkeley discovered that five quintillion bytes of data are produced every two days! Here at GridPoint, we have been collecting energy data since 2003, and to date we have accumulated 75 billion data points which are being added to at a rate of 100 million data points per day. The data collection and storage problem has been cracked – so now what? If you are simply collecting data and storing it then your business plan is no better than the South Park Gnomes’ Three Phase Business Plan. Just like drilling for and refining commodities like crude oil, we only derive value from energy data through sophisticated data mining techniques using a range of analytical tools and statistical savvy. Our end product is not just mountains of data – our end product is insight, insight into how your buildings consume energy across your enterprise.
With data-driven insight, the decision makers in your organization can make better informed and quicker decisions. It is this insight, and not the data per se, that delivers reduced energy consumption, increased building comfort and improved operational efficiency. By using real data you can have confidence that you are optimizing the limited resources at your disposal. At GridPoint, we combine cutting-edge advanced analytics with decades of subject matter expertise to deliver real lasting value to our customers.