In the previous article we wrote the Naive Bayes algorithm from scratch. In this one we will use the pre-built Naive Bayes classifier that ships with the scikit-learn library.

First, we copy the code that reads the emails and cleans them; it stays the same as before.

import glob
import os
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer

emails, labels = [], []

def read_file(filename):
    with open(filename, 'r', encoding='ISO-8859-1') as infile:
        return infile.read()

# spam emails are labelled 1
for filename in glob.glob(os.path.join('data/spam', '*.txt')):
    emails.append(read_file(filename))
    labels.append(1)

# ham (legitimate) emails are labelled 0
for filename in glob.glob(os.path.join('data/ham', '*.txt')):
    emails.append(read_file(filename))
    labels.append(0)

lemmatizer = WordNetLemmatizer()
all_names = set(names.words())

def clean_text(docs):
    cleaned_docs = []
    for doc in docs:
        cleaned_docs.append(' '.join([
            lemmatizer.lemmatize(word.lower())
            for word in doc.split()
            if word.isalpha() and word not in all_names
        ]))
    return cleaned_docs

cleaned_emails = clean_text(emails)
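The cleaned emails are still plain strings, but the classifier expects numeric features, so the text has to be turned into a sparse term-document matrix first. A minimal sketch of that step using scikit-learn's CountVectorizer (the `docs` toy data and the `max_features` value are illustrative assumptions, not part of the original code):

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy stand-ins for cleaned_emails; real code would pass the output of clean_text
docs = ["free money click now",
        "meeting agenda attached",
        "free prize click here"]

# max_features caps the vocabulary size; both values here are illustrative
vectorizer = CountVectorizer(max_features=500, stop_words='english')
term_docs = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

print(term_docs.shape)  # (number of documents, number of distinct terms kept)
```

Note that `fit_transform` both learns the vocabulary and encodes the documents; any later test documents must be encoded with `transform` on the same fitted vectorizer.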

Now for the fun part, where we create the classifier:

from sklearn.naive_bayes import MultinomialNB
# alpha is the smoothing factor (1.0 = Laplace, i.e. add-one, smoothing)
# fit_prior=True means the class priors are estimated from the training data
clf = MultinomialNB(alpha=1.0, fit_prior=True)
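The `alpha=1.0` setting adds one to every word count, so a word that never appears in a class still gets a small non-zero probability instead of zeroing out the whole product. A small sketch showing the smoothed per-class word probabilities the model stores (the count matrix here is made-up toy data):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# toy data: 3 documents, 3 words; rows are per-document word counts
X = np.array([[2, 1, 0],
              [3, 0, 0],
              [0, 2, 2]])
y = [1, 1, 0]  # first two documents belong to class 1, the last to class 0

clf = MultinomialNB(alpha=1.0, fit_prior=True)
clf.fit(X, y)

# class 1 contains 6 words total over a 3-word vocabulary, and word 2 never
# appears in it, so with add-one smoothing P(word 2 | class 1) = (0+1)/(6+3)
print(np.exp(clf.feature_log_prob_))  # rows follow clf.classes_, i.e. [0, 1]
```

Without smoothing (alpha=0), any test document containing word 2 would get probability zero for class 1 no matter what the other words say.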

# to train the model, we call fit(sparse_matrix, labels):
# the sparse matrix is the term-document representation of the input words,
# and the labels are the classes of the training data
clf.fit(term_docs_train, labels_train)

# after training the model, we use the predict method
prediction = clf.predict(term_docs_test)
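Putting the pieces together, here is a self-contained end-to-end sketch on a tiny hand-made corpus standing in for the email data set (the documents and labels are invented for illustration; 1 = spam, 0 = ham):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy training corpus; real code would use the cleaned Enron-style emails
train_docs = ["win free prize now", "free money offer",
              "project meeting notes", "see attached report"]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

vectorizer = CountVectorizer()
term_docs_train = vectorizer.fit_transform(train_docs)

clf = MultinomialNB(alpha=1.0, fit_prior=True)
clf.fit(term_docs_train, train_labels)

# test documents must be encoded with the SAME fitted vectorizer
test_docs = ["free prize offer", "meeting report attached"]
term_docs_test = vectorizer.transform(test_docs)

print(clf.predict(term_docs_test))  # → [1 0]: spam, then ham
```

`predict_proba(term_docs_test)` would give the per-class probabilities instead of hard labels, which is useful when you want to tune the spam threshold.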


List of posts

This post is part of a series of posts

  1. Preparation and introduction
  2. Naive Bayes by example
  3. Scrubbing natural language text
  4. Naive Bayes classifier
  5. Writing Naive Bayes from scratch
  6. Using the scikit-learn library (this post)