The Naive Bayes classifier is one of the best-known classifiers in supervised learning. In this post, I show how it is calculated from training data.

## How does Naive Bayes Classifier work?

In a previous post I covered Bayes' theorem and how to calculate the probability mathematically. How is it applied in machine learning and supervised learning?
It is no different from how Bayes' theorem works in statistics.
Given a sample $x$ with $n$ features $x_1, x_2, \dots, x_n$, the goal of Naive Bayes is to determine the probability that $x$ belongs to each of $k$ possible classes $y_1, y_2, \dots, y_k$.
Let us see how Naive Bayes is represented mathematically, and explain the Bayes terminology used in supervised learning.

## Naive Bayes Calculation and Terminologies

As shown in my previous article, the probability that $x$ belongs to class $y_k$ can be calculated as follows:
$P(y_k|x) = \frac{P(x|y_k)P(y_k)}{P(x)}$

$P(y_k)$ represents how the classes are distributed; in Bayes' theorem terms it is called the prior.
$P(x|y_k)$ represents how likely it is that the $n$ features $x_1, x_2, \dots, x_n$ co-occur with their given values in a sample $x$ that belongs to class $y_k$; it is called the likelihood.
$P(x)$ is the overall distribution of the features, not specific to any class; it is called the evidence.
$P(y_k|x)$, the prediction result, is called the posterior.

### Calculate Prior: $P(y_k)$

The prior can be defined in one of two ways:

• Predetermined (for example each class has an equal chance of occurrence).
• Calculated from the training samples.
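
Both options can be sketched in a few lines of Python. This is only an illustration: the function names and the `labels` list are made up for this example, not taken from the later posts in the series.

```python
from collections import Counter


def class_priors(labels):
    """Estimate P(y_k) as the fraction of training samples in each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def uniform_priors(labels):
    """Predetermined alternative: every class gets the same prior."""
    classes = set(labels)
    return {label: 1 / len(classes) for label in classes}


# Hypothetical training labels, invented for this example
labels = ["spam", "ham", "ham", "spam", "ham"]

priors = class_priors(labels)      # spam: 2 of 5 samples, ham: 3 of 5 samples
uniform = uniform_priors(labels)   # each of the two classes gets 1/2
print(priors)
print(uniform)
```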

### Calculate Likelihood: $P(x|y_k)$

It is very difficult to calculate the joint distribution when the number of features $n$ is large. This is why we assume that the features are not related to each other and are completely independent given the class, and this assumption is why the method is called naive. The joint conditional distribution of the $n$ features can then be calculated as the product of the individual feature conditional distributions:

$P(x|y_k) = P(x_1|y_k) \times P(x_2|y_k) \times \dots \times P(x_n|y_k)$

Each of these terms can be calculated from the training data.
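
As a rough sketch of that calculation, assuming discrete feature values and plain counting, the per-feature conditional probabilities and their product could be computed as below. The helper names and the tiny data set are hypothetical.

```python
from collections import defaultdict
from math import prod


def feature_likelihoods(samples, labels):
    """Estimate P(x_i = value | y_k) by counting within each class."""
    class_counts = defaultdict(int)
    value_counts = defaultdict(int)  # key: (label, feature index, value)
    for features, label in zip(samples, labels):
        class_counts[label] += 1
        for i, value in enumerate(features):
            value_counts[(label, i, value)] += 1
    return {
        key: count / class_counts[key[0]]
        for key, count in value_counts.items()
    }


def likelihood(features, label, cond_probs):
    """P(x | y_k) under the naive assumption: product of per-feature terms."""
    return prod(
        cond_probs.get((label, i, value), 0.0)  # unseen value counts as 0
        for i, value in enumerate(features)
    )


# Hypothetical training data: two binary features, two classes
samples = [(1, 0), (1, 1), (0, 1), (0, 0)]
labels = ["a", "a", "b", "b"]

cond = feature_likelihoods(samples, labels)
print(likelihood((1, 0), "a", cond))  # 1.0 * 0.5 = 0.5
print(likelihood((1, 0), "b", cond))  # 0.0, feature 0 is never 1 in class "b"
```

Note that with plain counting, a feature value never seen in a class makes the whole product zero for that class.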

### Calculate Evidence: $P(x)$

Because the evidence does not depend on the class, it can be treated as a constant.
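
One way to see this, using the law of total probability (the summation index $j$ is introduced here only for illustration), is to write the evidence as a sum over all classes:

$$P(x) = \sum_{j} P(x|y_j)\,P(y_j)$$

The sum runs over every class, so its value is the same whichever class $y_k$ we are scoring. It only rescales the posteriors so that they add up to 1.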

### The final formula

Because the evidence $P(x)$ can be treated as a constant, the posterior is proportional to the product of the likelihood and the prior:

$P(y_k|x) \propto P(x|y_k)P(y_k) = P(y_k) \times P(x_1|y_k) \times \dots \times P(x_n|y_k)$
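
To make the proportionality concrete, here is a minimal sketch that multiplies the prior by the per-feature likelihoods for each class and then divides by their sum to obtain the posteriors. All probabilities below are invented numbers, and the `posterior` function is a hypothetical helper, not the implementation from the later posts.

```python
from math import prod


def posterior(features, priors, cond_probs):
    """Return the normalized P(y_k | x) for every class."""
    # Unnormalized score: prior times the product of per-feature likelihoods
    scores = {
        label: prior * prod(
            cond_probs[label][i][value] for i, value in enumerate(features)
        )
        for label, prior in priors.items()
    }
    # The evidence P(x) is the same for every class; it only rescales.
    # This sketch assumes at least one class has a non-zero score.
    evidence = sum(scores.values())
    return {label: score / evidence for label, score in scores.items()}


# Made-up priors and per-feature conditional probabilities
priors = {"spam": 0.4, "ham": 0.6}
cond_probs = {
    "spam": [{0: 0.2, 1: 0.8}, {0: 0.5, 1: 0.5}],
    "ham": [{0: 0.7, 1: 0.3}, {0: 0.6, 1: 0.4}],
}

result = posterior((1, 0), priors, cond_probs)
predicted = max(result, key=result.get)
print(result)
print(predicted)  # spam
```

Because the evidence is shared by all classes, comparing the unnormalized scores would already select the same class; the division is only needed when actual probabilities are wanted.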

## List of posts

This post is part of a series of posts:

1. Preparation and introduction.
2. Naive Bayes by example
3. Scrubbing natural language text.
4. Naive Bayes Classifier (this post).
5. Writing Naive Bayes from scratch
6. Using Scikit-learn library