Disadvantages of pie charts:

– the ranking of slices by size is difficult to see, even when the chart attempts to show it

– they rely on color, which adds an extra layer of mental effort for the viewer to process

– colors are useless if printed in black and white

– difficult to visually assess differences between slices

– 3-D versions are even worse because perspective makes the nearest slice look bigger than it should

Advantages of bar charts:

– easier to read

– no need for a legend

– we can rank the bars easily and clearly

– easy to visually assess differences

If you google “pie charts” you’ll find plenty of people ranting far more vehemently than I do. Here is a good collation of some of the best arguments.

All that being said, data visualization is a matter of taste and personal preference does come into it. At the end of the day it’s about how best we can communicate our message. I wouldn’t dare say we should never use pie charts but personally I tend to avoid them.

---

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

If you haven’t seen the problem before, have a guess now before reading on – what would you do, stick or switch? My first intuition was that it does not matter whether I stick or switch. Two doors unopened, one car – that’s a 50:50 chance right there. Was I right?

Let’s tease it out using Bayes’ Theorem:

P(A|B) = P(B|A) * P(A) / P(B)

That’s the generic form of Bayes’ Theorem. For our specific Monty Hall problem let’s define the discrete events that are in play:

P(A) = P(B) = P(C) = 1/3 = the unconditional probability that the car is behind a particular door.

Note that I use upper-case letters for the event that the car is behind a particular door and, as you will see below, lower-case letters for the door that Monty chooses to open.

P(b) = P(c) = 1/2 = the unconditional probability that Monty opens a particular one of the two remaining doors. Monty only has a choice of two doors because he is obviously not going to open the door you have selected (so, having picked door A, he will never open door a).

So let’s say we choose door A initially. Remember we do not know what is behind any of the doors – but Monty knows. Monty will now open door b or c. Let’s say he opens door b. We now have to decide if we want to stick with door A or switch our choice to door C. Let’s use Bayes’ Theorem to work out the probability that the car is behind door A.

- P(A|b) is the probability that the car is behind door A *given* Monty opens door b – this is what we want to compute, i.e. the probability of winning if we stick with door A.
- P(b|A) is the probability that Monty opens door b *given* the car is behind door A. This probability is 1/2. Think about it: if Monty knows the car is behind door A, and we have selected door A, then he can choose to open door b or door c with equal probability of 1/2.
- P(A), the unconditional probability that the car is behind door A, is equal to 1/3.
- P(b), the unconditional probability that Monty opens door b, is equal to 1/2.

Now we can write out the full equation:

P(A|b) = P(b|A) * P(A) / P(b) = (1/2) * (1/3) / (1/2) = 1/3

Hmmm, my intuition said 50:50 but the math says I only have a 1/3 chance of winning if I stick with door A. But that means I have a 2/3 chance of winning if I switch to door C. Let’s work it out and see.

- P(C|b) is the probability that the car is behind door C *given* Monty opens door b – this is what we want to compute, i.e. the probability of winning if we switch to door C.
- P(b|C) is the probability that Monty opens door b *given* the car is behind door C. This probability is 1. Think about it: if Monty knows the car is behind door C, and we have selected door A, then he has no choice but to open door b.
- P(C), the unconditional probability that the car is behind door C, is equal to 1/3.
- P(b), the unconditional probability that Monty opens door b, is equal to 1/2.

Now we can write out the full equation:

P(C|b) = P(b|C) * P(C) / P(b) = 1 * (1/3) / (1/2) = 2/3

There it is, we have a 2/3 chance of winning if we switch to door C and only a 1/3 chance if we stick with door A.
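The two Bayes calculations above can be checked with exact arithmetic. As an illustration, here is a short Python sketch using exact fractions (the variable names are my own, mirroring the door labels in the text):

```python
from fractions import Fraction

# Priors: the car is equally likely behind each door.
p_A = p_C = Fraction(1, 3)

# Likelihoods of Monty opening door b, given where the car is
# (we have already selected door A):
p_b_given_A = Fraction(1, 2)  # car behind A: Monty picks b or c at random
p_b_given_C = Fraction(1, 1)  # car behind C: Monty is forced to open b

# Unconditional probability that Monty opens door b.
p_b = Fraction(1, 2)

p_stick = p_b_given_A * p_A / p_b   # P(A|b)
p_switch = p_b_given_C * p_C / p_b  # P(C|b)

print(p_stick)   # 1/3
print(p_switch)  # 2/3
```

Using `Fraction` rather than floats keeps the result exact, which makes it obvious that sticking and switching probabilities sum to 1.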

Bayes’ Rule is itself not the most intuitive formula, so maybe we are still not satisfied with the answer. We can simulate the problem in R – grab my R code here to reproduce this graphic. By simulate I mean replay the game randomly many times and compare the sticking strategy with the switching strategy. Look at the results in the animation below and notice how, as the number of iterations increases, the probability of success converges on 1/3 if we stick with our first choice every time and on 2/3 if we switch every time.
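My simulation code is in R (linked above); purely for illustration, here is a minimal version of the same experiment in Python – the function name and trial count are my own:

```python
import random

def play(switch, n_trials=100_000, seed=42):
    """Replay the game n_trials times; return the observed win rate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        car = rng.randrange(3)   # door hiding the car
        pick = rng.randrange(3)  # our first choice
        # Monty opens a goat door that is neither our pick nor the car.
        monty = rng.choice([d for d in range(3) if d != pick and d != car])
        if switch:
            # Switch to the one remaining unopened door.
            pick = next(d for d in range(3) if d != pick and d != monty)
        wins += (pick == car)
    return wins / n_trials

print(play(switch=False))  # converges on ~1/3
print(play(switch=True))   # converges on ~2/3
```

Note that with the same seed, switching wins exactly when sticking loses, so the two estimates sum to 1 – a nice sanity check on the simulation.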

Simulating a problem like this is a great way of verifying your math. Or sometimes, if you’re stuck in a rut and struggling with the math, you can simulate the problem first and then work backwards towards an understanding of the math. It’s important to have both tools, math/statistics and the ability to code, in your data science arsenal.

Ok, let’s say we’re still not happy. We’re shaking our head, it does not fit with our System 1 thinking and we need a little extra juice to help our System 2 thinking over the line. Forget the math, forget the code, think of it like this:

You have selected one of three doors. You know that Monty is about to open one of the two remaining doors to show you a goat. Before Monty does this, ask yourself: which would you rather have? The one door you have selected, or *both* of the two remaining doors? Yes, *both*, because effectively that is your choice: stick with your first pick or take *both* of the other doors.

Two doors or one, I know what I’d pick!

Coming at a problem from different angles – math, code, visualizations, etc. – can help us out of a mental rut and reassure us by verifying our solutions. On the flip side, even when we ourselves fully understand a solution, we often have to explain it to a client, a manager, a decision maker or a young colleague whom we are trying to teach. Therefore it is always a valuable exercise to tackle a problem in various ways and to be comfortable explaining it from different angles. Don’t stop here: google Monty Hall and you will find many other varied and interesting explanations of the Monty Hall problem.

---

Thankfully some heroes put together this comprehensive confusion matrix on Wikipedia for us all to use. I have simply mimicked their layout in a spreadsheet with all the formulas which you can grab and use as your own.

Hover over the cells to see the cell description in the comment box and, if you’re like me, reference this every time you need to compute these statistics just to be sure!
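All of the spreadsheet’s formulas ultimately reduce to ratios of the four cells (TP, FP, FN, TN). As a quick illustration – the function name and the example counts below are my own, not from the Wikipedia article – a handful of the most commonly needed statistics can be computed like this:

```python
def confusion_stats(tp, fp, fn, tn):
    """Common statistics derived from the four confusion-matrix cells."""
    total = tp + fp + fn + tn
    return {
        "accuracy":    (tp + tn) / total,
        "precision":   tp / (tp + fp),        # positive predictive value
        "recall":      tp / (tp + fn),        # sensitivity, true positive rate
        "specificity": tn / (tn + fp),        # true negative rate
        "f1":          2 * tp / (2 * tp + fp + fn),
    }

# Hypothetical example: 200 cases, 90 true positives-and-negatives each side.
print(confusion_stats(tp=80, fp=20, fn=10, tn=90))
# accuracy 0.85, precision 0.8, recall ~0.889, specificity ~0.818, f1 ~0.842
```

Even with a helper like this, I’d still cross-check against the Wikipedia layout – the definitions are easy to mix up.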

---

“There are three things that matter in property: Location, location, location!” – British property tycoon Lord Harold Samuel

For example, imagine two federal agencies, each with 100 buildings spread across the US. Due to their respective missions, agency A has many offices in rural areas, while agency B has many downtown office locations in major US cities.

However, we cannot conclude from this picture that agency B is overspending on rent. We can only claim agency B is overspending if we can somehow control for the explanatory variable that is location.

Naïve solution: Filter to a particular location, e.g. county, city, zipcode, etc, and compare costs between federal agencies in that location only. For example we could compare rents between office buildings in downtown Raleigh, NC. This gives us a good comparison at a micro level but we lose the macro nationwide picture. Filtering through every region one by one to view the results is not a serious option when there are thousands of different locations.

I once worked with a client that had exactly this problem. Whenever an effort was made to compare costs between agencies, it was always possible (inevitable even) for agencies to claim geography as a legitimate excuse for apparent high costs. I came up with a novel approach for comparing costs at an overall national level while controlling for geographic variation in costs. Here is a snippet of some dummy data to demonstrate this example (full dummy data set available here):

| Agency | Zip | Sqft_per_zip | Annual_Rent_per_zip ($/yr) |
| --- | --- | --- | --- |
| G | 79101 | 8,192 | 33,401 |
| D | 94101 | 24,351 | 99,909 |
| A | 70801 | 17,076 | 70,436 |
| A | 87701 | 25,294 | 106,205 |
| D | 87701 | 16,505 | 70,275 |
| A | 24000 | 3,465 | 14,986 |

As usual I make the full dummy data set available here and you can access my R code here. The algorithm is described below in plain English:

- For agency X, compute the summary statistic at the local level, i.e. cost per sqft in each zip code.
- Omit agency X from the data and compute the summary statistic again, i.e. cost per sqft for all other agencies except X in each zip code.
- Using the results from steps 1 and 2, compute the difference in cost in each zip code. This tells us agency X’s net spend vs other agencies in each zip code.
- Repeat steps 1 to 3 for all other agencies.
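The four steps above can be sketched in a few lines. My implementation is in R (linked above); this is an illustrative Python version using the dummy rows from the table, with function names of my own invention:

```python
from collections import defaultdict

# Dummy rows mirroring the table above: (agency, zip, sqft, annual_rent).
rows = [
    ("G", "79101",  8192,  33401),
    ("D", "94101", 24351,  99909),
    ("A", "70801", 17076,  70436),
    ("A", "87701", 25294, 106205),
    ("D", "87701", 16505,  70275),
    ("A", "24000",  3465,  14986),
]

def cost_per_sqft(subset):
    """Aggregate $/sqft per zip code over a subset of rows."""
    sqft, rent = defaultdict(float), defaultdict(float)
    for agency, z, s, r in subset:
        sqft[z] += s
        rent[z] += r
    return {z: rent[z] / sqft[z] for z in sqft}

def net_spend(agency):
    """Steps 1-3: agency's $/sqft minus everyone else's, per shared zip."""
    mine = cost_per_sqft(r for r in rows if r[0] == agency)
    others = cost_per_sqft(r for r in rows if r[0] != agency)
    return {z: mine[z] - others[z] for z in mine if z in others}

# Step 4: repeat for every agency.
for agency in sorted({r[0] for r in rows}):
    print(agency, net_spend(agency))
```

Zip codes where an agency has no neighbors simply drop out of the comparison, which is exactly what controlling for location requires.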

The visualization is key to the power of this method of cost comparison.

This plot could have been generated in R but my client liked the interactive dashboards available in Tableau so that is what we used. You can download Tableau Reader for free from here and then you can download my Tableau workbook from here. There is a lot of useful information in this graphic and here is a brief summary of what you are looking at:

The height of each bar represents the cost difference between what the agency pays and what neighboring agencies pay in the same zip code. If a bar height is greater than zero, the agency pays more than neighboring agencies for rent. If a bar height is less than zero, the agency pays less than neighboring agencies. If a bar has zero height, the agency is paying the same average price as its neighbors in that zip code.

There is useful summary information in the chart title. The first line indicates the total net cost difference paid by the agency across all zip codes. In the second title line, the net spend is put into context as a percentage of total agency rent costs. The third title line indicates the percentage of zip codes in which the agency is paying more than its neighbors – this reflects the crossover point on the chart, where the bars go from positive to negative.

There is a filter to select the agency of your choice and a cost threshold filter can be applied to highlight (in orange) zip codes where agency net spend is especially high, e.g. a $1/sqft net spend in a zip code where the agency has 1 million sqft is costing more than a $5/sqft net spend in a zip code where the agency has only 20,000 sqft.

The tool tip gives you additional detailed information on each zip code as you hover over each bar. In this screenshot zip code 16611 is highlighted for agency B.

At a glance we get a macro and micro picture of how an agency’s costs compare to its peers while controlling for location! This approach to localized cost comparison provided stakeholders with a powerful tool to identify which agencies are overspending and, moreover, in precisely which zip codes they are overspending the most.

Once again, the R code is available here, the data (note this is only simulated data) is here and the Tableau workbook is here. To view the Tableau workbook you’ll need Tableau Reader which is available for free download here.

---

How many people would you need in a group before you could be confident that at least one pair in the group share the same birthday?

One day, back in Smurfit Business School, our statistics lecturer challenged us to a bet. He predicted, confidently (smugly even), that at least two of us shared a birthday. He bet us each the princely sum of €1. I glanced around me and I counted close to 40 students in the room. Being the savant that I am, I also know there are approximately 365 days in a year, and so I thought, you’re on! I mean, even allowing for some probability magic: 40 people, 365 days, this is free money!

I soon learned this was the famous birthday problem and, although I was beginning to feel cocky as we got halfway through my classmates’ birthdays, our teacher ultimately prevailed. It turns out that in a group of just 23 people the probability of a matching pair of birthdays is over 50%!

I hope this spreadsheet and the explanation below will help you understand why this is so.

- We need at least 2 people to have any chance of a matching pair. This is trivial: Person A has a birthday on *any* day, and the probability of Person B matching is 1/365.
- With 3 people, there are three possible matches: A matches B, A matches C or B matches C.
- With 4 people there are 6 possible combinations (count the edges in the little diagram shown here). You might spot a pattern by now. In mathematics these are known as combinations. After a while counting manually becomes tedious but, thankfully, for any given number of people we can use the combination formula to see how many possible combinations exist – jump to column B in the spreadsheet for a closer look.
- The probability of any one of these combinations being a matching pair is 1/365. Think of that like a bet: each individual combination is a bet with a 1/365 chance of winning. How many of these bets would we have to place to get *at least* one win?
- Here’s a neat little probability trick for answering an “*at least*” type question: compute the probability of not winning at all, i.e. precisely zero wins, and subtract that value from 1.*
- Column C in the spreadsheet uses the binomial distribution formula to compute the probability of a specific number of wins from a given number of bets, where each bet is independent and has an equal probability of success.
- In our case we want to compute the probability of precisely zero wins and subtract this value from 1. This gives us the probability of *at least one win*.

In the results, we can see that 23 is the magic number where the probability of at least one match exceeds 0.5. Remember there were close to 40 in my class so my teacher knew at a glance that his probability of finding at least one pair was close to 0.9 … and there were enough suckers in the room to cover his lunch!
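The spreadsheet’s calculation can be reproduced in a few lines. This Python sketch follows the binomial approach described above; note that treating the pairs as independent bets is an approximation (the pairs overlap), though a very close one for group sizes like these:

```python
from math import comb

def p_at_least_one_match(n, days=365):
    """Probability of at least one shared birthday among n people,
    via the binomial approximation: C(n, 2) 'bets', each 1/days,
    then 1 minus the probability of zero wins."""
    bets = comb(n, 2)              # number of possible pairs
    p_no_win = (1 - 1 / days) ** bets
    return 1 - p_no_win

print(p_at_least_one_match(23))  # ~0.5005 - just over the 50% mark
print(p_at_least_one_match(40))  # ~0.882 - close to 0.9, as in the story
```

The exact answer (1 − 365!/(365ⁿ·(365 − n)!)) agrees with this approximation to within a fraction of a percent at these group sizes, which is why the spreadsheet’s 23-person threshold holds up.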

* This little problem inversion trick can be generalized further to any occasion when we are faced with a difficult question. If you’re struggling, try inverting the question. Having difficulty predicting fraud? Maybe try predicting “not fraud”! It sounds trivial, silly even, but inverting a problem can get you out of a mental rut. For a famous example, see how statistician Abraham Wald used this technique to help the Allies win WW2.

---

- Focus on other people in the group, in particular key stakeholders like your client or boss. Ask yourself what they need to get out of this meeting rather than what you need. Listen to them, and if you communicate everything clearly the meeting might be cut short and you could save yourself the dreaded “follow-up meeting”.
- Agendas are important but we often don’t have time to create one. As a minimum state the purpose and outcomes for the meeting. This could be as short as one sentence each and if nothing else it will help you to focus. If someone else set the meeting without an agenda and you have no idea what the purpose and desired outcome are – be ballsy and politely ask them.
- Don’t assume people read attachments you send them before meetings. Do you read every attachment sent to you? Of course not! Be respectful and highlight the top three issues – interested parties can read further if they like.
- If attendees are new to the location give clear and concise instructions regarding parking, traffic, building layout, etc. This saves time for everyone. You don’t want people turning up late and flustered, disrupting proceedings and requiring a repeat of issues already discussed.
- A quick roll call for key attendees can be helpful and, if the group is new, a rapid icebreaker can help people to connect. But if it’s a recurring meeting, introductions can quickly become banal.
- Stay focused on the meeting outcome. It might even help to start from there and work backwards.
- A short meeting is a good meeting. Everyone is happy when a meeting finishes earlier than the stated end time; the reverse is also true. Give yourself a little buffer time – like airlines do!
- Try, really try hard, to not call a meeting unless necessary.

---

Ask ten people how they define outliers (aka anomalies) and you’ll get ten different answers. It’s not that they are all wrong; it’s just that the term outlier can mean different things to different people in different contexts.

“A data point on a graph or in a set of results that is very much bigger or smaller than the next nearest data point.” – from the Oxford English Dictionary.

Sometimes we want to detect outliers so we can remove them from our models and graphics. This does **not** mean we completely disregard the outliers. It means we set them aside for further investigation.

On other occasions we want to detect the outliers and nothing else, e.g. in fraud detection. Regardless of what analytics project we are engaged in, outliers are very important so we have to come up with some techniques for handling them. The surest way to identify an outlier is with subject matter expertise, e.g. if I am studying children under the age of five and one of them is 6 foot tall, I don’t need statistics to tell me that is an outlier!

So what? Data practitioners don’t always have the luxury of subject matter expertise so we use heuristics instead. I will outline three simple univariate outlier detection methods and why I think the boxplot method outlined by NIST is the most robust method even though it involves a little more work.

The three methods are:

- Percentiles, e.g. flag values greater than 99th percentile.
- Standard deviations (SD), e.g. flag values more than 2*sd from the mean.
- Boxplot outer fence, e.g. flag values greater than the third quartile plus 3 times the interquartile range.
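To make the three heuristics concrete, here is a small sketch. My worked examples below use R; this Python version is illustrative only, and the quartile/percentile positions are rough order-statistic approximations rather than interpolated values:

```python
import random
import statistics

def outlier_thresholds(values):
    """Upper thresholds for the three univariate methods discussed."""
    xs = sorted(values)
    n = len(xs)
    q1, q3 = xs[n // 4], xs[3 * n // 4]   # rough first and third quartiles
    iqr = q3 - q1                          # interquartile range
    mean, sd = statistics.mean(xs), statistics.stdev(xs)
    return {
        "percentile_99": xs[int(0.99 * n)],   # flag values above this
        "mean_plus_2sd": mean + 2 * sd,
        "boxplot_outer_fence": q3 + 3 * iqr,  # NIST outer fence
    }

# Dummy graduate salaries: normal(mean=50k, sd=10k), as in scenario 1 below.
rng = random.Random(1)
salaries = [rng.gauss(50_000, 10_000) for _ in range(1_000)]
print(outlier_thresholds(salaries))
```

On clean normal data the outer fence sits far above the other two thresholds (roughly mean + 4.7 sd), which is precisely why it does not flag false positives in the first scenario.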

I generated dummy graduate salary data with some select tweaks to see how well each of these methods performs under different data distribution scenarios.

n = 1,000, mean = $50,000, sd = $10,000. Below is a summary of the data including the distribution of the data points and a density curve.

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16040 43270 49600 49730 56160 81960
```

Notice that the SD and percentile methods are too sensitive, i.e. they are flagging values that may be high but are nonetheless clearly part of the main distribution. This is an example of false positive outlier detection. The boxplot outer fence detects no outliers and this is accurate – we know there are no outliers because we generated this data as a normal distribution.

Now let’s stick an outlier in there. Let’s imagine one graduate in the group struck it lucky and landed a big pay packet of $100k (*maybe it’s his uncle’s company or maybe he’s really talented, who knows*)!

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16040 43300 49630 49800 56180 100000
```

Once again we see the SD and percentile methods are too sensitive. The boxplot method works just right, it catches the one outlier we included but not the rest of the normally distributed data.

A handful of graduates came up with some awesome machine learning algorithm in their dissertation and they have been snapped up by Silicon Valley for close to $500k each!

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16040 43330 49650 51960 56310 554200
```

Now with a handful more outliers, the SD threshold has moved so far to the right that it has surpassed our $100k friend from scenario 2. He is now a false negative for the SD method. The percentile method is still too sensitive, but the boxplot method comes up Goldilocks again.

In the first three scenarios the 99th percentile threshold hardly budged. Simply put: with n = 1,000, the 99th percentile sits at roughly the 10th highest value, and since we have only added 6 outliers, the 10th highest value is still in the main distribution. So let’s add 10 more outliers (for a total of 16) and see what happens to the percentile threshold.

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16040 43430 49830 54200 56630 554200
```

Boom. The 99th percentile threshold has jumped from being too sensitive (too many false positives) to a point where it is not sensitive enough and it is missing some outliers (false negatives). Notice once again how robust the boxplot method is to skewed data.

- The boxplot method is no silver bullet. There are scenarios where it can miss, e.g. bimodal data can be troublesome no matter which method you choose. But in my experience the boxplot outer fence is a more robust method of univariate outlier detection than the other two conventional methods.
- These methods are only good for catching univariate outliers. Scroll back up to the very first chart in this piece and note that the “outlier” is not really an outlier if we only look at x or y values univariately. Detecting outliers in multidimensional space is trickier and will probably require more advanced analytical techniques.
- The methods discussed here are useful heuristics for data practitioners but we must remind ourselves that the most powerful outlier detection method is often plain old human subject matter expertise and experience.

The R markdown script used to produce these examples and graphics is available for download from Google Drive here.

---

Personally, I cut my teeth in GIS as a young civil engineer working in the Irish sewerage and rain water drainage industry – this Wad River Catchment Flood Study (pdf) includes some elegant geographic visualizations that I helped develop. Being an engineer during the Irish property bubble, I witnessed a lot of housing construction in areas where there was subsequently little or no demand. Depending on who you talk to, this was either due to greedy bankers, over-exuberance in the market or a myriad of other explanations that spew from experts’ mouths. Given my proximity to the construction industry I was keenly aware of various tax incentives on offer for building houses in certain geographic areas, and I suspected these government interventions, although well-intentioned, may have had a negative impact.

In 2012 I analyzed the National Unfinished Housing Developments Database to see if there was a link between government incentives and ghost estates. The results of my geographic analysis indicate that, yes, there is some evidence to suggest that the government exacerbated the housing bubble/bust for the very areas they were trying to help. My analysis was crude but compelling (if I do say so myself!)

---

For example, if I asked you to cook a chicken breast in one minute, your options are:

- Sacrifice quality: I’m sure you could cook it for one minute but the quality would be suspect and I would not recommend eating it.
- Sacrifice scope: If I insisted that I needed something in one minute, perhaps you could reduce the scope by cutting off a tiny morsel of chicken and cooking only that.
- Sacrifice time: Use as many minutes as you need to cook the full chicken breast to a delicious high quality.

In the above example, I presented the three options in the best case whereby we are taking control up front and choosing which element to sacrifice. Problems are exacerbated when we kid ourselves into thinking we can get everything done and when this happens we often end up in one of the following undesirable scenarios:

- Delivering sloppy low quality work on time.
- Dropping entire components of the work last minute because we did not have time to deliver everything.
- Realizing too late that we are not going to meet a deadline and having to make that awkward call to whoever is awaiting delivery of our work.

There is no silver bullet solution to this conundrum but, in my experience, the best approach is to get ahead of it early: if you sense a project is facing this problem, flag it and figure out which of the three elements can be adjusted. Usually we will not want to sacrifice quality, so the discussion centers around either reducing the scope to meet a deadline or pushing out the deadline to a more reasonable future date. Don’t rule out sacrificing some quality, though: there may be options to pare back to a minimally sufficient solution – sometimes the customer just wants a simple tire swing!

A final note: I think the Agile methodology really addresses the nub of this issue because quality is assured by including testing as an inherent part of a sprint. Time is fixed – typically two-week sprints – and this leaves scope to be adjusted and agreed upon at the beginning of each sprint.

---

*“Don’t take a fence down until you know why it was put up.” G.K. Chesterton*

I once had the joy [sarcasm] of working with an entity-attribute-value (EAV) data model. When I first laid eyes on it I knew something was different but I didn’t know what exactly – I didn’t even know what EAV was at the time. I typically work downstream of database (DB) design, so ETL/SQL experts and DB administrators look after DB design, and I simply query DBs to get the data I want for analysis. So I’ve seen a lot of DBs but I haven’t designed any – however, you don’t have to be an architect to recognize when a structure looks a bit dodgy!

From my basic knowledge of relational DB structure:

- Each table should be related to a particular subject or entity, e.g. a product table, a customer table, etc.
- Each row represents a record. So a row in the product table is a product, a row in the customer table is a customer, etc.
- Each column is an attribute holding some type of information about the records, e.g. in the customer table there could be a first_name column, an age column, etc. Also, columns should have reasonably meaningful names.

This comes from Codd’s third normal form and it is the basis for Hadley Wickham’s approach to tidy data (pdf). The EAV model I came across was not structured like this – see the sample snippet in this Google spreadsheet to compare how the same data is structured in the EAV versus the tidy format. As mentioned already, I am not a DB architect, so maybe this data structure was something awesome that I just had not heard of yet. I’m always willing to learn new things, so I started to fish around for explanations, but what I discovered only confirmed my initial fears:

- This Microsoft Database Design Basics article specifically mentions that “When you see columns numbered this way [repeating groups like numerator 1, numerator 2, numerator 3, etc], you should revisit your design.”
- My exact concern came up once on StackExchange and the most popular answer gives a very strong argument against EAV.
- We needed to create scatter plots on dashboards for this client and to create scatter plots with tidy data I could simply plot one column against another but with EAV data we’d need to write quite a few lines of code to reshape the data before doing a simple scatter plot.
- Tidy data is more amenable to the analytical base table structure that is required for model building.
- EAV places different data types in the same column. Often our data are strictly counts or decimals, and it is better practice to store them as integer or float data types as appropriate.
- EAV is not easily human readable and generally needs to be pivoted or reshaped to do anything useful. A conventional relational model is more human readable because it is easy to look across a row and see attributes for a particular entity.
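To illustrate the reshaping burden mentioned in the scatter-plot point above, here is a minimal Python sketch pivoting EAV triples into tidy form. The rows and attribute names are hypothetical examples of my own, not the client’s actual schema:

```python
from collections import defaultdict

# Hypothetical EAV rows: (entity_id, attribute, value).
eav = [
    (1, "height_cm", 180), (1, "weight_kg", 80),
    (2, "height_cm", 165), (2, "weight_kg", 60),
]

# Pivot to tidy form: one row per entity, one column per attribute.
tidy = defaultdict(dict)
for entity, attribute, value in eav:
    tidy[entity][attribute] = value

# Only now is a scatter plot simply one column against another.
xs = [row["height_cm"] for row in tidy.values()]
ys = [row["weight_kg"] for row in tidy.values()]
print(xs, ys)  # [180, 165] [80, 60]
```

With a tidy table the pivot step disappears entirely – the two columns already exist – which is exactly the extra friction EAV imposes on every downstream analysis.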

Unfortunately for me, a great deal of reporting and tools had been built on the existing database so even when I eventually persuaded the client to revert to a more conventional structure there was a real risk of breaking other things.

My takeaway from this experience is that a good default first guess in any arena is the conventional solution, but we should always be open to hearing about new and better ways of doing things. Conversely, if you are proposing a new approach, be sure you thoroughly know the conventional solution first. Which is my long-winded way of paraphrasing that beautiful Chesterton quote at the top of this post – see here for more on that quote.

P.S. If you ever find yourself in my predicament and you need more ammo to talk someone out of EAV here are some more resources to check out:

- Number 3 in this list of Five Simple Database Design Errors You should Avoid gives an example of how queries can become very complicated in EAV models.
- This post on Ten Common Database Design Mistakes again mentions the issue of repeating column names appended with numbers, i.e. num1, num2, num3, etc, and the problems it can lead to.
- This article outlines some of the attractions of EAV, namely its simple data model and the ease with which new attributes can be added, but ultimately the author considers it dangerous and recommends extreme caution before implementing EAV.
- Here is a rather long but humorous story of how an EAV structure brought down an entire IT department.
- I followed Karl Popper’s scientific method and kept looking for something that would prove me wrong and I eventually found one example of a large e-commerce company called Magento using EAV and it seems the attraction for them is the scalability. But it will take more than this one example for me to embrace EAV!
