The Monty Hall problem and 3 ways to solve it

The Monty Hall problem is a classic probability conundrum that seems trivially simple on the surface but, alas, can lead our intuition to the wrong answer. Full disclosure: I got it wrong when I first saw it! Here is the short Wikipedia description of the problem:

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

On the surface the Monty Hall problem seems trivially simple: 3 doors, 1 car, 2 goats, pick 1, host opens 1, then choose to stick or switch

If you haven’t seen the problem before, have a guess now before reading on – what would you do, stick or switch? My first instinct was that it does not matter whether I stick or switch. Two doors unopened, one car – that’s a 50:50 chance right there. Was I right?

Method 1: Bayes’ Theorem

Let’s tease it out using Bayes’ Theorem:

P(A|B) = P(B|A) * P(A) / P(B)

That’s the generic form of Bayes’ Theorem. For our specific Monty Hall problem let’s define the discrete events that are in play:

P(A) = P(B) = P(C) = 1/3: the unconditional probability that the car is behind a particular door.

Note I am using upper case notation for our choice of door and as you see below I will use lower case to denote the door that Monty chooses to open.

P(a) = P(b) = P(c) = 1/2: the unconditional probability that Monty will open a particular door. Monty only ever has a choice of 2 doors because he is obviously not going to open the door you have selected.

So let’s say we choose door A initially. Remember we do not know what is behind any of the doors – but Monty knows. Monty will now open door b or c. Let’s say he opens door b. We now have to decide if we want to stick with door A or switch our choice to door C. Let’s use Bayes’ Theorem to work out the probability that the car is behind door A.

P(A|b) is the probability that the car is behind door A given Monty opens door b – this is what we want to compute, i.e. the probability of winning if we stick with door A

P(b|A) is the probability Monty will open door b given the car is behind door A. This probability is 1/2. Think about it: if Monty knows the car is behind door A, and we have selected door A, then he can choose to open door b or door c with equal probability of 1/2

P(A), the unconditional probability that the car is behind door A, is equal to 1/3

P(b), the unconditional probability that Monty opens door b, is equal to 1/2

Now we can write out the full equation:

P(A|b) = P(b|A) * P(A) / P(b) = (1/2) * (1/3) / (1/2) = 1/3

Hmmm, my intuition said 50:50 but the math says I only have a 1/3 chance of winning if I stick with door A. But that means I have a 2/3 chance of winning if I switch to door C. Let’s work it out and see.

P(C|b) is the probability that the car is behind door C given Monty opens door b – this is what we want to compute, i.e. the probability of winning if we switch to door C

P(b|C) is the probability Monty will open door b given the car is behind door C. This probability is 1. Think about it: if Monty knows the car is behind door C, and we have selected door A, then he has no choice but to open door b – he will never open our door and never reveal the car

P(C), the unconditional probability that the car is behind door C, is equal to 1/3

P(b), the unconditional probability that Monty opens door b, is equal to 1/2

Now we can write out the full equation:

P(C|b) = P(b|C) * P(C) / P(b) = 1 * (1/3) / (1/2) = 2/3

There it is, we have a 2/3 chance of winning if we switch to door C and only a 1/3 chance if we stick with door A.
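The arithmetic above is easy to double-check. The post’s code elsewhere is in R, but here is a minimal sketch using Python’s exact fractions; the variable names simply mirror the events defined above:

```python
from fractions import Fraction

# Priors: the car is equally likely behind A, B or C; we picked door A.
P_A = P_B = P_C = Fraction(1, 3)

# Likelihood of Monty opening door b given each possible car location:
P_b_given_A = Fraction(1, 2)  # Monty picks b or c at random
P_b_given_B = Fraction(0)     # Monty never reveals the car
P_b_given_C = Fraction(1)     # c hides the car, so b is forced

# Total probability that Monty opens door b
P_b = P_b_given_A * P_A + P_b_given_B * P_B + P_b_given_C * P_C

P_A_given_b = P_b_given_A * P_A / P_b  # probability of winning if we stick
P_C_given_b = P_b_given_C * P_C / P_b  # probability of winning if we switch

print(P_b, P_A_given_b, P_C_given_b)  # 1/2 1/3 2/3
```

Because P(b|B) = 0 drops out, the denominator P(b) works out to exactly 1/2, just as used in the derivation above.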

Method 2: Write code to randomly simulate the problem many times

Bayes’ Rule is itself not the most intuitive formula, so maybe we are still not satisfied with the answer. We can simulate the problem in R (grab my R code here to reproduce this graphic). By simulate, I mean replay the game randomly many times and compare the sticking strategy with the switching strategy. Look at the results in the animation below and notice how, as the number of iterations increases, the probability of success converges on 1/3 if we stick with our first choice every time and on 2/3 if we switch every time.


When we simulate the problem many times we see the two strategies (always stick vs always switch) converge on 1/3 and 2/3 respectively just as we had calculated using Bayes’ Theorem

Simulating a problem like this is a great way of verifying your math. Or sometimes, if you’re stuck in a rut and struggling with the math, you can simulate the problem first and then work backwards towards an understanding of the math. It’s important to have both tools, math/statistics and the ability to code, in your data science arsenal.
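By way of illustration, the stick-vs-switch simulation can be replayed in a few lines. The post’s simulation is in R; this is a minimal Python sketch under the same rules (Monty always opens a goat door you didn’t pick):

```python
import random

def play(switch, rng):
    """Play one round of Monty Hall; return True if we win the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Monty opens a door that is neither our pick nor the car
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the one remaining unopened door
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(42)
n = 100_000
stick_wins = sum(play(False, rng) for _ in range(n))
switch_wins = sum(play(True, rng) for _ in range(n))
print(stick_wins / n, switch_wins / n)  # ~1/3 sticking, ~2/3 switching
```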

Method 3: Stop and think before Monty distracts you

Ok, let’s say we’re still not happy. We’re shaking our head, it does not fit with our System 1 thinking and we need a little extra juice to help our System 2 thinking over the line. Forget the math, forget the code, think of it like this:

You have selected one of three doors. You know that Monty is about to open one of the two remaining doors to show you a goat. Before Monty does this, ask yourself: which would you rather have – the one door you selected, or both of the two remaining doors? Yes, both, because effectively that is your choice: stick with your first choice or have both of the other doors.

The Monty Hall problem can be reduced to this if we pause and think about the situation immediately before Monty opens a door to reveal a goat

Two doors or one, I know what I’d pick!

Parting thoughts

Coming at a problem from different angles – math, code, visualizations, etc – can help us out of a mental rut and/or reassure us by verifying our solutions. On the flip side, even when we ourselves fully understand a solution, we often have to explain it to a client, a manager, a decision maker or a young colleague who we are trying to teach. Therefore it is always a valuable exercise to tackle a problem in various ways and to be comfortable explaining it from different angles. Don’t stop here – google Monty Hall and you will find many other varied and interesting explanations.


Localized real estate cost comparison

Comparison of real estate costs across different regions presents a challenge because location has such a large impact on rent and operation & maintenance (O&M) costs. This large variance in costs makes it difficult for organizations to compare costs across regions.

“There are three things that matter in property: Location, location, location!” – British property tycoon Lord Harold Samuel

For example, imagine two federal agencies, each with 100 buildings spread across the US. Due to their respective missions, agency A has many offices in rural areas, while agency B has many downtown office locations in major US cities.

Agency B has higher rent costs than agency A. This cost difference is largely explained by location – agency B offices are typically in downtown locations whereas agency A offices are often in rural areas. To truly compare costs we need to control for location.

However, we cannot conclude from this comparison alone that agency B is overspending on rent. We can only claim agency B is overspending if we can somehow control for the explanatory variable that is location.

Naïve solution: Filter to a particular location, e.g. county, city, zipcode, etc, and compare costs between federal agencies in that location only. For example we could compare rents between office buildings in downtown Raleigh, NC. This gives us a good comparison at a micro level but we lose the macro nationwide picture. Filtering through every region one by one to view the results is not a serious option when there are thousands of different locations.

I once worked with a client that had exactly this problem. Whenever an effort was made to compare costs between agencies, it was always possible (inevitable even) for agencies to claim geography as a legitimate excuse for apparent high costs. I came up with a novel approach for comparing costs at an overall national level while controlling for geographic variation in costs. Here is a snippet of some dummy data to demonstrate this example (full dummy data set available here):

Agency   Zip     Sqft_per_zip   Annual_Rent_per_zip ($/yr)
G        79101    8,192          33,401
D        94101   24,351          99,909
A        70801   17,076          70,436
A        87701   25,294         106,205
D        87701   16,505          70,275
A        24000    3,465          14,986

As usual I make the full dummy data set available here and you can access my R code here. The algorithm is described below in plain English:

  1. For agency X, compute the summary statistic at the local level, i.e. cost per sqft in each zip code.
  2. Omit agency X from the data and compute the summary statistic again, i.e. cost per sqft for all other agencies except X in each zip code.
  3. Using the results from steps 1 and 2, compute the difference in cost in each zip code. This tells us agency X’s net spend vs other agencies in each zip code.
  4. Repeat steps 1 to 3 for all other agencies.
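The four steps above can be sketched briefly. The original analysis was done in R; this is a minimal Python illustration using rows shaped like the dummy snippet (the values and the helper name net_spend are just for illustration):

```python
from collections import defaultdict

# Dummy rows mirroring the snippet above: (agency, zip, sqft, annual_rent)
rows = [
    ("G", "79101", 8192, 33401),
    ("D", "94101", 24351, 99909),
    ("A", "70801", 17076, 70436),
    ("A", "87701", 25294, 106205),
    ("D", "87701", 16505, 70275),
    ("A", "24000", 3465, 14986),
]

def net_spend(rows, agency):
    """Agency's $/sqft minus all other agencies' $/sqft, per zip (steps 1-3)."""
    mine = defaultdict(lambda: [0, 0])    # zip -> [sqft, rent] for agency
    others = defaultdict(lambda: [0, 0])  # zip -> [sqft, rent] for everyone else
    for ag, z, sqft, rent in rows:
        bucket = mine if ag == agency else others
        bucket[z][0] += sqft
        bucket[z][1] += rent
    diffs = {}
    for z in mine:
        if z in others:  # only zips where a comparison actually exists
            diffs[z] = mine[z][1] / mine[z][0] - others[z][1] / others[z][0]
    return diffs

# Step 4: repeat for every agency
for ag in sorted({r[0] for r in rows}):
    print(ag, net_spend(rows, ag))
```

A positive difference means the agency pays more per sqft than its neighbors in that zip code; a negative difference means it pays less.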

The visualization is key to the power of this method of cost comparison.

Screenshot from Tableau workbook. At a glance we can see Agency B is generally paying more than its neighbors in rent. And we can see which zip codes could be targeted for cost savings.

This plot could have been generated in R but my client liked the interactive dashboards available in Tableau so that is what we used. You can download Tableau Reader for free from here and then you can download my Tableau workbook from here. There is a lot of useful information in this graphic and here is a brief summary of what you are looking at:

The height of each bar represents the cost difference between what the agency pays and what neighboring agencies pay in the same zip code. If a bar height is greater than zero, the agency pays more than neighboring agencies for rent. If a bar height is less than zero, the agency pays less than neighboring agencies. If a bar has zero height, the agency is paying the same average price as its neighbors in that zip code.

There is useful summary information in the chart title. The first line indicates the total net cost difference paid by the agency across all zip codes. In the second title line, the net spend is put into context as a percentage of total agency rent costs. The third title line indicates the percentage of zip codes in which the agency is paying more than its neighbors – this reflects the crossover point on the chart, where the bars go from positive to negative.

There is a filter to select the agency of your choice, and a cost-threshold filter can be applied to highlight (in orange) zip codes where agency net spend is especially high – e.g. a $1/sqft net spend in a zip code where the agency has 1 million sqft costs more than a $5/sqft net spend in a zip code where the agency has only 20,000 sqft.

The tooltip gives you additional detailed information on each zip code as you hover over each bar. In this screenshot zip code 16611 is highlighted for agency B.

At a glance we get a macro and micro picture of how an agency’s costs compare to its peers while controlling for location! This approach to localized cost comparison provided stakeholders with a powerful tool to identify which agencies are overspending and, moreover, in precisely which zip codes they are overspending the most.

Once again, the R code is available here, the data (note this is only simulated data) is here and the Tableau workbook is here. To view the Tableau workbook you’ll need Tableau Reader which is available for free download here.


Anomaly detection with Benford’s Law

Benford’s Law: the principle that in any large, randomly produced set of natural numbers, such as tables of logarithms or corporate sales statistics, around 30 percent will begin with the digit 1, 18 percent with 2, and so on, with the smallest percentage beginning with 9. The law is applied in analyzing the validity of statistics and financial records. (from Google)

Benford’s Law can be used to detect anomalous results. Detecting anomalies is important in tax, insurance and scientific fraud detection.

In this example we are analyzing daily sales data from 11 Widgets Inc. stores. The board suspects some stores might be fabricating sales numbers. The only way to prove it is via a full audit of a store’s accounts. Auditing is expensive and so the board asked me to identify one store to audit.

We have 365 days of sales data for each of 11 stores – the data can be viewed here. In R I used the substring function to get the first digit from every day of sales for every store – full R script available here. Not every first digit distribution follows Benford’s Law but many do. It was clear from the plots that Benford’s Law was applicable on this occasion.
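For illustration, the first-digit extraction and the Benford expectation look like this. The post does it in R with substring; here is a minimal Python sketch with made-up sales figures:

```python
import math
from collections import Counter

# Benford's expected probability for leading digit d: log10(1 + 1/d)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit_freqs(sales):
    """Observed share of each leading digit (mirrors substring(x, 1, 1) in R)."""
    digits = [int(str(s)[0]) for s in sales if s > 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

# Hypothetical daily sales figures, purely for illustration
sales = [1843, 1120, 987, 2310, 1460, 3051, 1207, 892, 1675, 2940]
obs = first_digit_freqs(sales)
print(obs[1], benford[1])  # observed vs expected share of leading 1s
```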

Benford G
The distribution of first digits from daily sales in Store G (orange bars) follows Benford’s Law (blue dots) quite well

By plotting the observed distribution of first digits for daily sales from each store against the expected distribution we can identify which store veers furthest from the expected distribution. Visually, we can see that the sales numbers from Store K might be worth auditing.

Benford K
The distribution of first digits from daily sales in Store K (orange bars) does not follow Benford’s Law quite so well

To augment this analysis and provide a more objective output a chi-squared test was carried out for each store. The chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies. A lower chi-squared test p-value indicates a more anomalous distribution of first digits. Store K was audited and eventually they admitted to randomly generating sales numbers for many days during the year.
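As a sketch of that test, the chi-squared statistic against the Benford expectation can be computed by hand (in R, chisq.test would do this directly; the digit lists below are invented purely for illustration):

```python
import math
from collections import Counter

def benford_chisq(sales):
    """Chi-squared statistic of observed first digits vs Benford expectation."""
    digits = [int(str(s)[0]) for s in sales if s > 0]
    n = len(digits)
    counts = Counter(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # expected count for digit d
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat

# A roughly Benford-like digit stream vs a suspiciously uniform one
benford_like = [1, 1, 1, 2, 2, 3, 1, 4, 1, 2, 5, 1, 6, 1, 2, 7, 8, 9] * 20
uniform_like = list(range(1, 10)) * 40
print(benford_chisq(benford_like), benford_chisq(uniform_like))
```

A larger statistic (and hence a lower p-value) flags a more anomalous store, which is how Store K rose to the top of the audit list.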

Note that disagreement with Benford’s Law does not prove anything. It merely offers a hint. A full audit of Store K was still necessary to prove any wrongdoing.

In this example we saw how data analytics identified which store was most likely to be the culprit thus avoiding audits of multiple stores. That’s what good analytics is about – the art and science of better.

A simple solution is often better

“Everything should be as simple as it can be, but not simpler” apocryphal quote attributed to Einstein.

“Among competing hypotheses, the one with the fewest assumptions should be selected.” Occam’s Razor.

We have all heard variations of the above quotes. I always strive to make my solutions as simple as possible so a non-technical audience can digest the findings. The following example is based (loosely!) on a real life analysis.


Choc Bars Inc (CBI) produces high end chocolate bars for a small but loyal customer base. CBI prides itself on producing top quality chocolate but like any company, it wishes to keep production costs low. The new CFO Mr McMoney believes that there are substantial economies of scale if each machine produces as many bars as possible. The engineers are skeptical that trying to get too much out of the machines could lead to higher maintenance costs. The company founder, Madame Chocolat, does not wholly disagree but she is more concerned that choco bar quality remains high.

I collected data from 100 CBI machines and surveyed customers too. I plotted cost against production to see if McMoney’s economies of scale theory played out in practice. I also layered on the customer survey data so Mme Chocolat could see if customers were satisfied with the choco bar quality.

loess choco
Loess model fitted to the cost, production and satisfaction data

There are clear opportunities for savings if production can be increased in some machines, but diminishing returns are also apparent once production increases beyond about 70ish bars per day. In this case all parties agreed that production could be increased in some machines without affecting product quality too much. Mme Chocolat also noted that the machines producing the most bars per day yielded lower-quality products.

There was a board meeting coming up at the end of the quarter. McMoney wanted to present projected savings, but there was some concern that board members would be unfamiliar with loess models and that this might distract from the findings. A loess model is basically a locally weighted regression model and this gif gives a wonderful visual explanation. To allay concerns, I built a tool with the same data which allowed the user to adjust where they felt the optimum point was, and simple linear models were fit either side of the selected optimum.

CBI tool
Interactive analysis tool uses simpler linear regression models either side of an inflection point that is selected by the user. Color indicates choco bar quality.
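The idea behind the tool – an ordinary least-squares line on each side of a user-chosen optimum – can be sketched like this. The tool itself is an R shiny app; this is a minimal Python illustration on synthetic numbers:

```python
def linfit(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def piecewise_fit(production, cost, optimum):
    """Fit one simple linear model on each side of the chosen optimum."""
    left = [(x, y) for x, y in zip(production, cost) if x <= optimum]
    right = [(x, y) for x, y in zip(production, cost) if x > optimum]
    return linfit(*zip(*left)), linfit(*zip(*right))

# Synthetic machines: unit cost falls up to ~70 bars/day, then creeps back up
production = [20, 30, 40, 50, 60, 70, 80, 90, 100]
cost = [10.0, 8.5, 7.2, 6.1, 5.3, 4.9, 5.0, 5.3, 5.8]
(left_m, left_b), (right_m, right_b) = piecewise_fit(production, cost, 68)
print(left_m, right_m)  # slope below vs above the chosen optimum
```

In the interactive version the user simply drags the optimum and the two lines refit – an easier sell to a board than a loess curve.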

So what?

The interactive tool gave the decision makers greater control and allowed them to interact with the data. McMoney had a rough rule-of-thumb for savings projections. Mme Chocolat pushed for a lower production target of 68 bars per day for each machine – beyond that point, quality is reduced. Of course this is a dummy example but it is inspired by a real life scenario. For example the data could be teaching costs, student/teacher ratio and student satisfaction – the numbers would change but the principle would be the same.

The R script for the loess model is here and the link for the interactive simple regression shiny R script is here.

Commercial Energy Rates – R Shiny App

Here is a link to a US commercial energy rates shiny application I posted in 2014. The app pulls in data from the US Energy Information Administration which posts US residential, commercial, industrial and transportation energy prices at the state level to this webpage on a monthly basis. At the time I created this app my employer was specifically interested in commercial energy rates.

commercial energy rates
Screengrab from EIA website where energy rates are posted every month

The app also pulls in 2013 US census population data from the US Census Bureau. [Note: reading in data from external websites like this is a little risky since it creates an external dependency but at the time I was keen to learn new things about R! For a more robust application, just get the population data one time and store it on your own system where you can control it.]

So what?

My company sold energy management hardware and software solutions to commercial entities in retail, grocery and fast food industries. Like any company we had limited sales and marketing resources. This tool helped efficiently allocate those resources to where returns were more likely, i.e. states with high energy costs and/or states with rising energy costs. Yes, you could just look up the data from tables on the website but this visualization makes it instantly apparent where the most interesting data opportunities are.

energy rates app
Bubble plot to demonstrate costs, change in costs, state population

Key features of the app:

  • By labelling the points and color coding them according to geographic region we can see at a glance that California and a bunch of northeastern states have substantially higher energy costs than the rest of the country.
  • We can also see that West Virginia has seen a 10+% year-on-year increase in costs. Their rates are still low relative to other states but if you are running a business in West Virginia, you are going to feel a 10+% increase in running costs.
  • The size of the points is proportional to population. A useful visual reminder that the opportunities in West Virginia may be thin on the ground (although the scenery is stunning and the locals are friendly!)

The app also features links to the raw data sources and 2 map views – notice again how California and West Virginia stand out for highest rates and highest increase in rates respectively.

energy rate maps
Map views for all you GIS nerds out there!

As usual, here is a link to the code that I have saved in Google Drive and that you are free to download and run in R. Notes on running this shiny application:

  1. Download the zip file and extract the 2 R files to a suitable location on your device. Let’s assume you saved them to “C:/My Documents/energyRatesApp”.
  2. Set your working directory to this address: setwd("C:/My Documents/")
  3. Ensure the shiny library is installed: install.packages("shiny")
  4. Run the app: shiny::runApp("energyRatesApp")
  5. You may get error messages if all the required libraries are not installed. Simply install the necessary libraries and try again.

Random Forest Model of Building Energy Consumption

I built an R Shiny app to model a building’s energy consumption. The app allows the user to select a baseline period and observe the building’s performance post-baseline. The energy rate can be adjusted to provide a quick estimate of savings (or losses) in the post-baseline period.

So what: My clients wanted to know the effect of energy saving measures they were taking, e.g. new AC units, new freezers, new temperature settings, new lighting, etc. Conversely they might want to assess the negative impact of an event (e.g. equipment malfunction). In the screenshot below energy use has increased post-baseline which would be a concern to any business. This tool gives decision makers and sales engineers quick and “good enough” information they need to take action.

Screenshot of the app

It is no surprise that ambient temperature has the biggest impact on electricity consumed. Below we see a typical annual profile – higher temperatures in summer lead to higher AC use. Note that if you use electric heating instead of gas your winter electricity consumption could be just as high if not higher than your summer consumption.

Typical annual profile of local temperature and energy consumption

The conventional way to build a model of a building’s energy consumption is to use temperature records to compute the number of degree days in each billing period (typically monthly), get the energy consumed from the monthly utility bill and build a regression model (with only 12 data points). For more on degree days and a good account of the pros and cons of this modelling approach in general see here.
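As a concrete sketch of the degree-day step, with base 65 °F (a common convention; the post’s own analysis is in R):

```python
# Degree days from daily mean temperatures, base 65 °F (a common convention).
BASE_F = 65.0

def degree_days(daily_mean_temps):
    """Return (heating, cooling) degree days for one billing period."""
    hdd = sum(max(BASE_F - t, 0) for t in daily_mean_temps)
    cdd = sum(max(t - BASE_F, 0) for t in daily_mean_temps)
    return hdd, cdd

# A mostly cold week: every degree below base accumulates as heating demand
hdd, cdd = degree_days([30, 35, 40, 50, 64, 66, 80])
print(hdd, cdd)  # 106.0 16.0
```

The conventional model then regresses each month’s metered kWh on that month’s heating and cooling degree days – hence the mere 12 data points.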

When I worked at an energy management company I gained valuable insight into how office, retail, pharmacy and grocery buildings consume electricity. A quick glance at a daily profile of energy consumption (aside: my old statistics lecturer always stressed the importance of drawing pictures early and often) shows there is more than temperature influencing energy consumed: namely, business operating hours. I was fortunate to have temperature and energy consumption data at the hourly level and therefore had the opportunity to develop a much richer model of energy consumption.

The Rmd script is available here on Google Drive. It pulls in data from a publicly available Google Sheets page so you should be able to download it and run it in RStudio without any fuss. Hopefully I have commented my code sufficiently well but please contact me if you have any questions. For more on random forests check out this video from about 41:20.