Naive Bayes Classifiers

This is not a post about young lovers lacking in worldly wisdom – that would be naive baes. This is about an elegant machine learning classification algorithm – Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between the features. I have previously applied Bayes’ Theorem to solve the Monty Hall problem.

Reminder of Bayes’ Rule: P(A | B) = P(B | A) * P(A) / P(B)

P(A) and P(B) are the probabilities of events A and B respectively

P(A | B) is the probability of event A given event B is true

P(B | A) is the probability of event B given event A is true

Still not clear? Ok, the formula can be rewritten in English as: “The posterior is equal to the likelihood times the prior divided by the evidence.” Clear as mud, eh? Ok, I’ll try again! The posterior probability is easy enough: that’s what we’re looking for, it’s what we want to learn. After we analyze the data, our posterior probability should be our best guess of the classification. Conversely, the prior probability is our best guess before we collect any data, the conventional prevailing wisdom if you will. The likelihood is how probable the observed data would be under a given outcome, and the evidence is just a normalizing term that makes the final probabilities sum to one. The nice thing about a classification problem is that we have a fixed set of outcomes, so the computation of probabilities becomes a little easier, as we’ll see.

Let’s use the famous weather/golf data set to demonstrate an example. It’s only 14 rows so I can list them here as is:

Outlook   Temperature  Humidity  Windy  Play
overcast  hot          high      FALSE  yes
overcast  cool         normal    TRUE   yes
overcast  mild         high      TRUE   yes
overcast  hot          normal    FALSE  yes
rainy     mild         high      FALSE  yes
rainy     cool         normal    FALSE  yes
rainy     cool         normal    TRUE   no
rainy     mild         normal    FALSE  yes
rainy     mild         high      TRUE   no
sunny     hot          high      FALSE  no
sunny     hot          high      TRUE   no
sunny     mild         high      FALSE  no
sunny     cool         normal    FALSE  yes
sunny     mild         normal    TRUE   yes

It’s pretty obvious what’s going on in this data. The first four variables (outlook, temperature, humidity, windy) describe the weather. The last variable (play) is our target variable: it tells me whether or not my mother played golf given those weather conditions. We can use this training data to develop a Naive Bayes classification model to predict if my mother will play golf given certain weather conditions.

Step 1: Build contingency tables for each variable (tip: this is just an Excel pivot table or a cross tab in SAS or R)

Go through each variable and produce a summary table like so:

          Played Golf
Outlook   no    yes
overcast  0%    44%
rainy     40%   33%
sunny     60%   22%
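
If you prefer R to a pivot table, here is a minimal sketch, assuming the 14 rows above are loaded into a data frame called golf with columns outlook, temperature, humidity, windy and play:

# Cross tab of outlook vs play, with each column scaled to sum to 1
# (margin = 2 gives the proportions within "no" and within "yes")
prop.table(table(golf$outlook, golf$play), margin = 2)

# Repeat for temperature, humidity, windy and for the target variable itself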

Do the same for the other variables and for the target variable itself, which works out as no (36%) and yes (64%). Think of these two numbers as your prior probability: your best guess of the outcome without any other supporting evidence. So if someone asks me whether my mother will play golf on a given day but tells me nothing about the weather, my best guess is simply that yes, she will play golf, with a probability of 64%. But let’s say I peek at the weather forecast for this Saturday and I see that the day will be SUNNY (outlook), MILD (temperature), HIGH (humidity) and TRUE (windy). Will my mother play golf on this day? Let’s use Naive Bayes classification to predict the outcome.

Step 2: Compute the likelihood L(yes) for test data

Test data day: SUNNY (outlook), MILD (temperature), HIGH (humidity) and TRUE (windy). Let’s first compute the likelihood that yes mother will play golf. Go grab the numbers from the tables we created in Step 1:

0.22 (SUNNY) * 0.44 (MILD) * 0.33 (HIGH) * 0.33 (TRUE) = 0.010542

And don’t forget to multiply by the prior for yes 0.64.

So, likelihood for yes = 0.010542 * 0.64 = 0.006747. Remember this is not a probability … yet. Let’s calculate the likelihood for no.

Step 3: Compute the likelihood L(no) for test data

Same test data as already described. Once again go back to Step 1 and grab the numbers for no:

0.60 (SUNNY) * 0.40 (MILD) * 0.80 (HIGH) * 0.60 (TRUE) = 0.1152

And don’t forget to multiply by the prior for no 0.36.

So, likelihood for no = 0.1152 * 0.36 = 0.041472. And now it is very easy to compute the probabilities for yes and no.

Step 4: Compute posterior probabilities

P(yes) = 0.006747 / (0.006747 + 0.041472) = 0.139924 = 14%. It’s fairly obvious that P(no) should equal 86% but let’s check.

P(no) = 0.041472 / (0.006747 + 0.041472) = 0.860076 = 86%

Therefore I can predict, with a probability of 86%, that my mother will not play golf this Saturday based on her previous playing history and on the weather forecast – SUNNY (outlook), MILD (temperature), HIGH (humidity) and TRUE (windy).
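
If you want to let R do the arithmetic of Steps 2 to 4, here is the whole calculation in a few lines (the numbers are just the rounded values read off the tables from Step 1):

# Steps 2 and 3: conditional probabilities for the test day multiplied by the priors
like_yes <- 0.22 * 0.44 * 0.33 * 0.33 * 0.64   # sunny, mild, high, windy, prior for yes
like_no  <- 0.60 * 0.40 * 0.80 * 0.60 * 0.36   # sunny, mild, high, windy, prior for no

# Step 4: normalise the likelihoods to get posterior probabilities
like_yes / (like_yes + like_no)   # ~0.14
like_no  / (like_yes + like_no)   # ~0.86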

Closing thoughts

Naive Bayes is so-called because of the assumption that the features (the input variables) are independent of one another given the class. Even though this is often not the case, the model tends to perform pretty well anyway. If you come across new data later on, great, you can simply chain it onto what you already have, i.e. each time you compute a posterior probability you can go on to use it as the prior probability when new variables come along.

You might have noticed that all the input variables are categorical – that is by design. But do not fret if you have numerical variables in your data: it is easy to convert them to categorical variables using expert knowledge, e.g. temperature less than 70 is cool, higher than 80 is hot and in between is mild. Or you could bin your variables automatically using percentiles or deciles or something like that. There are even ways to handle numerical data directly in Naive Bayes classification but it’s more convoluted and does require more assumptions, e.g. normally distributed variables. Personally, I prefer to stick to categorical variables with Naive Bayes because fewer assumptions are always desirable!
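
For example, the expert-knowledge binning described above is a one-liner in R with cut(); the temperatures and the 70/80 thresholds here are just the illustrative values from the paragraph:

# Hypothetical numeric temperatures binned into cool / mild / hot
temp_f <- c(64, 68, 72, 75, 81, 85)
cut(temp_f, breaks = c(-Inf, 70, 80, Inf), labels = c("cool", "mild", "hot"))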

Check out the 5 minute video here that largely inspired this blog post.

I’m not usually a fan of infographics (not to be mistaken with data visualization which can be awesome!) but this baseball infographic is a lovely intro to predicting with Bayes.

And here is a very simple short example if you want to come at this again with different data – movie data in this case!

This blog post is of course about understanding how Naive Bayes classification works but if you are doing Naive Bayes classification with real life larger data sets then you’ll probably want to use R and the e1071 package or similar.
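
For what it’s worth, here is a minimal sketch with e1071, assuming the golf data frame from earlier with every column stored as a factor. It should give roughly the same 86/14 split as the hand calculation, give or take rounding:

library(e1071)

# Fit a Naive Bayes model on the golf data frame used above
model <- naiveBayes(play ~ ., data = golf)

# The test day: sunny, mild, high humidity, windy.
# The factor levels must match the training data, hence the levels() calls.
test_day <- data.frame(
  outlook     = factor("sunny", levels = levels(golf$outlook)),
  temperature = factor("mild",  levels = levels(golf$temperature)),
  humidity    = factor("high",  levels = levels(golf$humidity)),
  windy       = factor("TRUE",  levels = levels(golf$windy))
)

# type = "raw" returns the posterior probabilities for no and yes
predict(model, test_day, type = "raw")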

 


The Monty Hall problem and 3 ways to solve it

The Monty Hall problem is a classic probability conundrum which on the surface seems trivially simple but, alas, our intuition can lead us to the wrong answer. Full disclosure: I got it wrong when I first saw it! Here is the short Wikipedia description of the problem:

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

[Image: monty-hall]
On the surface the Monty Hall problem seems trivially simple: 3 doors, 1 car, 2 goats, pick 1, host opens 1, then choose to stick or switch

If you haven’t seen the problem before, have a guess now before reading on – what would you do, stick or switch? My instinctive first intuition was that it does not matter if I stick or switch. Two doors unopened, one car, that’s a 50:50 chance right there. Was I right?

Method 1: Bayes’ Theorem

Let’s tease it out using Bayes’ Theorem:

P(A|B) = P(B|A) * P(A) / P(B)

That’s the generic form of Bayes’ Theorem. For our specific Monty Hall problem let’s define the discrete events that are in play:

P(A) = P(B) = P(C) = 1/3 = the unconditional probability that the car is behind a particular door.

Note I am using upper case notation for our choice of door and as you see below I will use lower case to denote the door that Monty chooses to open.

P(b) = P(c) = 1/2 = the probability that Monty will open a particular one of the two remaining doors, assuming we have selected door A. Monty will only have a choice of 2 doors because he is obviously not going to open the door you have selected.

So let’s say we choose door A initially. Remember we do not know what is behind any of the doors – but Monty knows. Monty will now open door b or c. Let’s say he opens door b. We now have to decide if we want to stick with door A or switch our choice to door C. Let’s use Bayes’ Theorem to work out the probability that the car is behind door A.

P(A|b) is the probability that the car is behind door A given Monty opens door b – this is what we want to compute, i.e. the probability of winning if we stick with door A

P(b|A) is the probability Monty will open door b given the car is behind door A. This probability is 1/2. Think about it, if Monty knows the car is behind door A, and we have selected door A, then he can choose to open door b or door c with equal probability of 1/2

P(A), the unconditional probability that the car is behind door A, is equal to 1/3

P(b), the unconditional probability that Monty opens door b, is equal to 1/2

Now we can write out the full equation:

P(A|b) = P(b|A) * P(A) / P(b) = (1/2) * (1/3) / (1/2) = 1/3

Hmmm, my intuition said 50:50 but the math says I only have a 1/3 chance of winning if I stick with door A. But that means I have a 2/3 chance of winning if I switch to door C. Let’s work it out and see.

P(C|b) is the probability that the car is behind door C given Monty opens door b – this is what we want to compute, i.e. the probability of winning if we switch to door C

P(b|C) is the probability Monty will open door b given the car is behind door C. This probability is 1. Think about it, if Monty knows the car is behind door C, and we have selected door A, then he has no choice but to open door b

P(C), the unconditional probability that the car is behind door C, is equal to 1/3

P(b), the unconditional probability that Monty opens door b, is equal to 1/2

Now we can write out the full equation:

P(C|b) = P(b|C) * P(C) / P(b) = 1 * (1/3) / (1/2) = 2/3

There it is, we have a 2/3 chance of winning if we switch to door C and only a 1/3 chance if we stick with door A.
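
If you like, you can sanity-check those two numbers in R before moving on to the simulation:

# Sticking with door A: P(A|b) = P(b|A) * P(A) / P(b)
(1/2) * (1/3) / (1/2)   # 1/3

# Switching to door C: P(C|b) = P(b|C) * P(C) / P(b)
1 * (1/3) / (1/2)       # 2/3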

Method 2: Write code to randomly simulate the problem many times

Bayes’ Rule is itself not the most intuitive formula, so maybe we are still not satisfied with the answer. We can simulate the problem in R – grab my R code here to reproduce this graphic – where by simulate I mean replay the game randomly many times and compare the sticking strategy with the switching strategy. Look at the results in the animation below and notice how, as the number of iterations increases, the probability of success converges on 1/3 if we stick with our first choice every time and on 2/3 if we switch every time.

 

[Animation: simulation results]
When we simulate the problem many times we see the two strategies (always stick vs always switch) converge on 1/3 and 2/3 respectively just as we had calculated using Bayes’ Theorem
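
My full R code is linked above; if you just want the gist, a bare-bones sketch of the simulation (the function name and door numbering here are mine, not necessarily those of the linked code) looks something like this:

set.seed(42)

# Play one game; switch_doors controls the strategy. Returns TRUE on a win.
play_game <- function(switch_doors) {
  doors  <- 1:3
  car    <- sample(doors, 1)   # door hiding the car
  choice <- sample(doors, 1)   # contestant's first pick
  # Monty opens a door that is neither the contestant's pick nor the car
  closed <- setdiff(doors, c(choice, car))
  opened <- if (length(closed) == 1) closed else sample(closed, 1)
  if (switch_doors)
    choice <- setdiff(doors, c(choice, opened))  # switch to the one remaining door
  choice == car
}

mean(replicate(10000, play_game(FALSE)))  # always stick:  converges on ~1/3
mean(replicate(10000, play_game(TRUE)))   # always switch: converges on ~2/3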

Simulating a problem like this is a great way of verifying your math. Or sometimes, if you’re stuck in a rut and struggling with the math, you can simulate the problem first and then work backwards towards an understanding of the math. It’s important to have both tools, math/statistics and the ability to code, in your data science arsenal.

Method 3: Stop and think before Monty distracts you

Ok, let’s say we’re still not happy. We’re shaking our head, it does not fit with our System 1 thinking and we need a little extra juice to help our System 2 thinking over the line. Forget the math, forget the code, think of it like this:

You have selected one of three doors. You know that Monty is about to open one of the two remaining doors to show you a goat. Before Monty does this, ask yourself, which would you rather? Stick with the one door you have selected or have both of the two remaining doors. Yes, both, because effectively that is your choice: stick with your first choice or have both of the other doors.

[Image: monty-hall-made-easy]
The Monty Hall problem can be reduced to this if we pause and think about the situation immediately before Monty opens a door to reveal a goat

Two doors or one, I know what I’d pick!

Parting thoughts

Coming at a problem from different angles: math, code, visualizations, etc, can help us out of a mental rut and/or reassure us by verifying our solutions. On the flip side, even when we ourselves fully understand a solution, we often have to explain it to a client, a manager, a decision maker or a young colleague who we are trying to teach. Therefore it is always a valuable exercise to tackle a problem in various ways and to be comfortable explaining it from different angles. Don’t stop here, google Monty Hall and you will find many other varied and interesting explanations of the Monty Hall problem.

The birthday problem

How many people would you need in a group before you could be confident that at least one pair in the group share the same birthday?

One day, back in Smurfit Business School, our statistics lecturer challenged us to a bet. He predicted, confidently (smugly even), that at least two of us shared a birthday. He bet us each the princely sum of €1. I glanced around me and I counted close to 40 students in the room. Being the savant that I am, I also know there are approximately 365 days in a year, and so I thought, you’re on! I mean, even allowing for some probability magic: 40 people, 365 days, this is free money!

I soon learned this was the famous birthday problem and although I was beginning to feel cocky as we got halfway through my classmates’ birthdays, our teacher ultimately prevailed. It turns out that in a group of just 23 people the probability of a matching pair of birthdays is over 50%!

I hope this spreadsheet and the explanation below will help you understand why this is so.

  • We need at least 2 people to have any chance of having a matching pair. This is trivial. Person A has a birthday on any day. The probability of Person B matching is 1/365.
  • With 3 people, there are three possible matches: A matches B, A matches C or B matches C.
  • With 4 people there are 6 possible combinations (count the edges in the little diagram of four people and six combos shown here). You might spot a pattern by now. In mathematics these are known as combinations. After a while counting manually becomes tedious but, thankfully, for any given number of people we can use the combination formula to see how many possible combinations exist – jump to column B in the spreadsheet for a closer look.
  • The probability for any one of these combinations being a matching pair is 1/365. Think of that like a bet: each individual combination is a bet with a 1/365 chance of winning. How many of these bets would we have to place to get at least one win?
    • Here’s a neat little probability trick for answering an “at least” type question. Compute the probability of not winning at all, i.e. precisely zero wins, and subtract that value from 1.*
  • Column C in the spreadsheet uses the binomial distribution formula to compute the probability of a specific number of wins from a given number of bets where each bet is independent and has an equal probability of success.
  • In our case we want to compute the probability of precisely zero wins and subtract this value from 1. This gives us the probability of at least one win – see the short R sketch after this list for the same calculation.
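
As promised, here is the spreadsheet calculation as a short R sketch: choose() reproduces column B and the dbinom() term reproduces column C.

# Probability of at least one shared birthday in a group of n people,
# treating each possible pair as an independent 1/365 "bet"
p_match <- function(n) {
  pairs <- choose(n, 2)                       # column B: number of possible pairs
  1 - dbinom(0, size = pairs, prob = 1/365)   # column C: 1 minus P(zero wins)
}

p_match(23)   # just over 0.5
p_match(40)   # roughly 0.88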

[Chart: probability of at least one birthday match for different group sizes]

In the results, we can see that 23 is the magic number where the probability of at least one match exceeds 0.5. Remember there were close to 40 in my class so my teacher knew at a glance that his probability of finding at least one pair was close to 0.9 … and there were enough suckers in the room to cover his lunch!

 

* This little problem inversion trick can be generalized further to any occasion when we are faced with a difficult question. If you’re struggling, try inverting the question. Having difficulty predicting fraud? Maybe try predicting “not fraud”! It sounds trivial, silly even, but inverting a problem can get you out of a mental rut. For a famous example, see how statistician Abraham Wald used this technique to help the Allies win WW2.