This is not a post about young lovers lacking in worldly wisdom – that would be naive baes. This is about an elegant machine learning classification algorithm – Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between the features. I have previously applied Bayes’ Theorem to solve the Monty Hall problem.

Reminder of Bayes Rule: P(A | B) = P(B | A) * P(A) / P(B)

P(A) and P(B) are the probabilities of events A and B respectively

P(A | B) is the probability of event A given event B is true

P(B | A) is the probability of event B given event A is true

Still not clear? Ok, the formula can be rewritten in English as: “The posterior is equal to the likelihood times the prior divided by the evidence.” Clear as mud, eh? Ok, I’ll try again! The posterior probability is easy enough: it’s what we’re looking for, what we want to learn. After we analyze the data, our posterior probability should be our best guess of the classification. Conversely, the prior probability is our best guess before we collect any data, the conventional prevailing wisdom if you will. The nice thing about a classification problem is that we have a fixed set of outcomes, so the computation of probabilities becomes a little easier, as we’ll see.
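For a classification problem the same rule can be written with a class and a set of observed features. The symbols $C$ (the class) and $x_1, \dots, x_n$ (the feature values) are introduced here purely for illustration; this is a sketch of the standard form, assuming the features are independent given the class:

```latex
P(C \mid x_1, \dots, x_n)
  = \frac{P(x_1, \dots, x_n \mid C)\, P(C)}{P(x_1, \dots, x_n)},
\qquad
P(x_1, \dots, x_n \mid C) = \prod_{i=1}^{n} P(x_i \mid C)
```

The evidence in the denominator is the same for every class, so with a fixed set of outcomes it is enough to compute the numerator for each class and rescale the results so they sum to 1.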

Let’s use the famous weather/golf data set to demonstrate an example. It’s only 14 rows so I can list them here as is:

| Outlook | Temperature | Humidity | Windy | Play |
|---|---|---|---|---|
| overcast | hot | high | FALSE | yes |
| overcast | cool | normal | TRUE | yes |
| overcast | mild | high | TRUE | yes |
| overcast | hot | normal | FALSE | yes |
| rainy | mild | high | FALSE | yes |
| rainy | cool | normal | FALSE | yes |
| rainy | cool | normal | TRUE | no |
| rainy | mild | normal | FALSE | yes |
| rainy | mild | high | TRUE | no |
| sunny | hot | high | FALSE | no |
| sunny | hot | high | TRUE | no |
| sunny | mild | high | FALSE | no |
| sunny | cool | normal | FALSE | yes |
| sunny | mild | normal | TRUE | yes |

It’s pretty obvious what’s going on in this data. The first four variables (outlook, temperature, humidity, windy) describe the weather. The last variable (play) is our target variable: it tells me whether or not my mother played golf *given* those weather conditions. We can use this training data to develop a Naive Bayes classification model to predict if my mother will play golf *given* certain weather conditions.

### Step 1: Build contingency tables for each variable (tip: this is just an Excel pivot table or a cross tab in SAS or R)

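Go through each variable and produce a summary table like so (each percentage is the share of *no* days, or of *yes* days, that had that outlook, so each column adds up to 100%, give or take rounding):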

| Outlook | Played Golf: no | Played Golf: yes |
|---|---|---|
| overcast | 0% | 44% |
| rainy | 40% | 33% |
| sunny | 60% | 22% |

Do the same for the target variable, which comes out as *no* (36%) and *yes* (64%). Think of these two numbers as your prior probabilities: your best guess of the outcome without any other supporting evidence. So if someone asks me whether my mother will play golf on a given day but tells me nothing about the weather, my best guess is simply *yes*, she will play golf, with a probability of 64%. But let’s say I peek at the weather forecast for this Saturday and I see that the day will be SUNNY (outlook), MILD (temperature), HIGH (humidity) and TRUE (windy). Will my mother play golf on this day? Let’s use Naive Bayes classification to predict the outcome.
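If you would rather not build the pivot tables by hand, Step 1 can be sketched in plain Python. The 14 rows are the ones listed earlier; the names `ROWS`, `prior` and `conditional_table` are made up for this illustration and are not from any library.

```python
from collections import Counter

# The 14 training rows: (outlook, temperature, humidity, windy, play)
ROWS = [
    ("overcast", "hot", "high", False, "yes"),
    ("overcast", "cool", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("rainy", "mild", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
]


def prior(rows):
    """Share of each class in the target column (the last field of each row)."""
    counts = Counter(row[-1] for row in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def conditional_table(rows, column):
    """P(value | class) for one input column; each class's shares sum to 1."""
    class_counts = Counter(row[-1] for row in rows)
    pair_counts = Counter((row[column], row[-1]) for row in rows)
    values = sorted({row[column] for row in rows}, key=str)
    return {
        value: {
            label: pair_counts[(value, label)] / class_counts[label]
            for label in class_counts
        }
        for value in values
    }


priors = prior(ROWS)
outlook = conditional_table(ROWS, 0)  # column 0 is outlook

for value, probs in outlook.items():
    print(f"{value:<9} no={probs['no']:.0%}  yes={probs['yes']:.0%}")
print({label: f"{p:.0%}" for label, p in priors.items()})
```

Calling `conditional_table(ROWS, 1)`, `conditional_table(ROWS, 2)` and `conditional_table(ROWS, 3)` gives the equivalent tables for temperature, humidity and windy.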

### Step 2: Compute the likelihood L(*yes*) for test data

Test data day: SUNNY (outlook), MILD (temperature), HIGH (humidity) and TRUE (windy). Let’s first compute the likelihood that *yes* mother will play golf. Go grab the numbers from the tables we created in Step 1:

0.22 (SUNNY) * 0.44 (MILD) * 0.33 (HIGH) * 0.33 (TRUE) = 0.010542

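And don’t forget to multiply by the prior for *yes*, 0.64.

So, likelihood for *yes* = 0.010542 * 0.64 = 0.006747. (Strictly, this is the likelihood multiplied by the prior, but I’ll keep calling it the likelihood for short.) Remember this is not a probability … yet. Let’s calculate the likelihood for *no*.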

### Step 3: Compute the likelihood L(*no*) for test data

Same test data as already described. Once again go back to Step 1 and grab the numbers for *no*:

0.60 (SUNNY) * 0.40 (MILD) * 0.80 (HIGH) * 0.60 (TRUE) = 0.1152

And don’t forget to multiply by the prior for *no*, 0.36.

So, likelihood for *no* = 0.1152 * 0.36 = 0.041472. And now it is very easy to compute the probabilities for *yes* and *no*.
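The arithmetic in Steps 2 and 3 is the same recipe applied to each class, which a few lines of Python make explicit. This is a minimal sketch that hard-codes the rounded figures quoted above for the SUNNY, MILD, HIGH, TRUE test day; the names `LIKELIHOODS`, `PRIORS` and `score` are just illustrative.

```python
from math import prod

# Rounded conditional probabilities from the Step 1 tables for the test day,
# in the order: outlook=SUNNY, temperature=MILD, humidity=HIGH, windy=TRUE.
LIKELIHOODS = {
    "yes": [0.22, 0.44, 0.33, 0.33],
    "no": [0.60, 0.40, 0.80, 0.60],
}
PRIORS = {"yes": 0.64, "no": 0.36}


def score(label):
    """Product of the per-feature conditionals times the prior (not normalized)."""
    return prod(LIKELIHOODS[label]) * PRIORS[label]


scores = {label: score(label) for label in PRIORS}
for label, value in scores.items():
    print(f"{label}: {value:.6f}")
```

The two printed values match the hand calculation (0.006747 for *yes* and 0.041472 for *no*) and, like those, are not yet probabilities.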

### Step 4: Compute posterior probabilities

P(*yes*) = 0.006747 / (0.006747 + 0.041472) = 0.139924 ≈ 14%. It’s fairly obvious that P(*no*) should equal 86% but let’s check.

P(*no*) = 0.041472 / (0.006747 + 0.041472) = 0.860076 ≈ 86%

Therefore I can predict, with a probability of 86%, that my mother will not play golf this Saturday based on her previous playing history and on the weather forecast – SUNNY (outlook), MILD (temperature), HIGH (humidity) and TRUE (windy).
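Step 4 is just a rescaling so that the two numbers sum to 1, followed by picking the larger one. A small sketch, assuming the two values from Steps 2 and 3 as input (the function name `normalize` is made up for this example):

```python
def normalize(scores):
    """Divide each score by the total so the results sum to 1."""
    total = sum(scores.values())
    return {label: value / total for label, value in scores.items()}


# Likelihood-times-prior values from Steps 2 and 3
scores = {"yes": 0.006747, "no": 0.041472}

posterior = normalize(scores)
prediction = max(posterior, key=posterior.get)  # class with the highest posterior

print(prediction, {label: f"{p:.0%}" for label, p in posterior.items()})
```

The division by the total plays the role of the evidence term in Bayes’ rule, which is why the intermediate numbers did not need to be probabilities.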

### Closing thoughts

Naive Bayes is so-called because there is an assumption that the features (the input variables) are independent of one another given the class. Even though this is often not the case, the model tends to perform pretty well anyway. If you come across new data later on, great, you can simply chain it onto what you already have, i.e. each time you compute a posterior probability you can go on to use it as the prior probability with new variables.
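To see the chaining idea in action, here is a rough sketch that feeds in the four weather variables one at a time, using each posterior as the prior for the next update. The numbers are the rounded values used in Steps 2 and 3; `update` is a hypothetical helper written for this example.

```python
def update(prior, likelihood):
    """One Bayes update: multiply by the new evidence, then renormalize."""
    unnormalized = {label: prior[label] * likelihood[label] for label in prior}
    total = sum(unnormalized.values())
    return {label: value / total for label, value in unnormalized.items()}


belief = {"yes": 0.64, "no": 0.36}  # prior, before seeing any weather
evidence = [
    {"yes": 0.22, "no": 0.60},  # outlook = SUNNY
    {"yes": 0.44, "no": 0.40},  # temperature = MILD
    {"yes": 0.33, "no": 0.80},  # humidity = HIGH
    {"yes": 0.33, "no": 0.60},  # windy = TRUE
]

for likelihood in evidence:
    # The posterior from the previous step becomes the prior for this one
    belief = update(belief, likelihood)

print({label: f"{p:.0%}" for label, p in belief.items()})
```

Updating one variable at a time ends up at the same 14% / 86% split as doing the whole multiplication in one go, because normalizing along the way only rescales both classes by the same factor.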

You might have noticed that all the input variables are categorical – that is by design. But do not fret if you have numerical variables in your data, it is easy to convert them to categorical variables using expert knowledge, e.g. temperature less than 70 is cool, higher than 80 is hot and in between is mild. Or you could bin your variables automatically using percentiles or deciles or something like that. There are even ways to handle numerical data directly in Naive Bayes classification but it’s more convoluted and does require more assumptions, e.g. normally distributed variables. Personally, I prefer to stick to categorical variables with Naive Bayes because fewer assumptions is always desirable!
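As a quick illustration of the expert-knowledge approach, the temperature rule above could be coded like this. The function name and the sample readings are invented for the example; only the 70 and 80 cut-offs come from the text.

```python
def bin_temperature(degrees):
    """Map a numeric temperature to cool / mild / hot using the 70 and 80 cut-offs."""
    if degrees < 70:
        return "cool"
    if degrees > 80:
        return "hot"
    return "mild"  # 70 to 80 inclusive


readings = [64, 70, 75, 80, 85]  # made-up readings for illustration
print([bin_temperature(t) for t in readings])
# → ['cool', 'mild', 'mild', 'mild', 'hot']
```

Once every numeric column has been binned like this, the contingency tables in Step 1 can be built exactly as before.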

Check out the 5 minute video here that largely inspired this blog post.

I’m not usually a fan of infographics (not to be mistaken with data visualization which can be awesome!) but this baseball infographic is a lovely intro to predicting with Bayes.

And here is a very simple short example if you want to come at this again with different data – movie data in this case!

This blog post is of course about understanding how Naive Bayes classification works but if you are doing Naive Bayes classification with real life larger data sets then you’ll probably want to use R and the e1071 package or similar.