Bayes' theorem and Naive Bayes are important techniques in machine learning. In this series I am going to dig into Naive Bayes and put it into practice, using Python to detect email spam. I am splitting the material into several posts because it covers both theory and practice.


List of posts

We are going to cover these points:

  1. Preparation and introduction (this post)
  2. Naive Bayes by example
  3. Scrubbing natural language text
  4. The Naive Bayes classifier
  5. Writing Naive Bayes from scratch
  6. Using the Scikit-learn library

Preparation

1. Download the training data

We are going to work with a sample dataset as our training data. You can download it from here.

After you download the package file, uncompress it. It contains two folders: one called spam and the other called ham.
The following posts assume that you uncompressed the file into a folder called data.
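
To make the folder layout concrete, here is a minimal sketch of how the emails could be read once they sit under data/spam and data/ham. The function name load_emails, the 1/0 labels, and the UTF-8 encoding are my own assumptions for illustration; the dataset itself does not dictate any of them.

```python
from pathlib import Path


def load_emails(data_dir="data"):
    """Read every file under <data_dir>/spam and <data_dir>/ham.

    Returns (texts, labels), where the label is 1 for spam and 0 for ham.
    The label values are an arbitrary choice made for this sketch.
    """
    texts, labels = [], []
    for folder, label in (("spam", 1), ("ham", 0)):
        # sorted() keeps the order stable between runs
        for path in sorted(Path(data_dir, folder).glob("*")):
            if path.is_file():
                # Assumed encoding; undecodable bytes are replaced, not fatal
                texts.append(path.read_text(encoding="utf-8", errors="replace"))
                labels.append(label)
    return texts, labels
```

Calling load_emails("data") should then return two lists of equal length, one with the raw email texts and one with the matching labels.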

2. Download the required packages

We are going to use Python, and more specifically a natural language library, to clean the data.
Make sure you have the nltk library installed, then download the accessories that will help with our work.
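
If you are not sure whether nltk is already available in your environment, a quick check along these lines may help. This is only a small sketch using the standard library; the helper name is_installed is hypothetical and not part of nltk.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the package can be found in the current environment."""
    # find_spec returns None when a top-level package cannot be located
    return importlib.util.find_spec(package_name) is not None
```

If is_installed("nltk") returns False, the package can usually be installed with pip install nltk.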

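import nltk

# Fetch the NLTK data packages we will need when cleaning the text
nltk.download('names')    # corpus of male and female first names
nltk.download('wordnet')  # WordNet lexical database (needed by NLTK's WordNet lemmatizer)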