In the previous article we wrote the Naive Bayes algorithm from scratch. Now we want to use the pre-built Naive Bayes classifier that comes with the scikit-learn library.


First we copy the code that reads the emails and cleans them. This code stays the same.

import glob
import os
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer

emails, labels = [], []

def read_file(filename):
    with open(filename, 'r', encoding='ISO-8859-1') as infile:
        return infile.read()

for filename in glob.glob(os.path.join('data/spam', '*.txt')):
    emails.append(read_file(filename))
    labels.append(1)

for filename in glob.glob(os.path.join('data/ham', '*.txt')):
    emails.append(read_file(filename))
    labels.append(0)

lemmatizer = WordNetLemmatizer()
all_names = set(names.words())
def clean_text(docs):
    cleaned_docs = []
    for doc in docs:
        cleaned_docs.append(
          ' '.join([
              lemmatizer.lemmatize(word.lower())
              for word in doc.split()
              if word.isalpha() and word not in all_names
          ])
        )
    return cleaned_docs

cleaned_emails = clean_text(emails)
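The classifier further down expects a sparse term-document matrix and matching label arrays (`term_docs_train`, `labels_train`, and so on). These were built in the previous posts; as a reminder, here is a minimal sketch using scikit-learn's CountVectorizer and train_test_split. The toy documents and the split ratio are stand-ins, not the real data set:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-ins for the cleaned_emails / labels built above,
# so this snippet runs on its own.
cleaned_emails = [
    'free prize win money now', 'meeting agenda attached',
    'win cash prize today', 'lunch at noon tomorrow',
    'claim your free money', 'project status report',
]
labels = [1, 0, 1, 0, 1, 0]

# Hold out part of the data for testing.
emails_train, emails_test, labels_train, labels_test = train_test_split(
    cleaned_emails, labels, test_size=0.33, random_state=42)

# Fit the vectorizer on the training set only, then transform both
# sets into sparse term-document matrices.
cv = CountVectorizer(stop_words='english')
term_docs_train = cv.fit_transform(emails_train)
term_docs_test = cv.transform(emails_test)
```

Fitting the vectorizer on the training set only matters: the test matrix must use the same vocabulary, which is why the test set goes through `transform` rather than `fit_transform`.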


Now the fun part, where we import the MultinomialNB classifier from scikit-learn and train it.

from sklearn.naive_bayes import MultinomialNB
# alpha is the additive (Laplace) smoothing factor
# fit_prior=True means the class priors are learned from the training data
clf = MultinomialNB(alpha=1.0, fit_prior=True)

# To train the model, we call clf.fit(sparse_matrix, labels):
# sparse_matrix is a sparse term-document matrix of the input words
# labels holds the classes of the training samples
clf.fit(term_docs_train, labels_train)

# after training the model, we use the predict method on the test set
prediction = clf.predict(term_docs_test)
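Putting the pieces together, here is a self-contained sketch of the whole flow on a tiny toy corpus (the documents and labels are made up for illustration). Besides `predict`, the classifier also offers `predict_proba`, which returns the probability of each class per document:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny toy corpus standing in for the real training and test data.
train_docs = ['free money win prize', 'meeting notes attached',
              'win free cash now', 'quarterly report draft']
train_labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham
test_docs = ['claim free prize', 'report for the meeting']

# Build the sparse term-document matrices.
cv = CountVectorizer()
term_docs_train = cv.fit_transform(train_docs)
term_docs_test = cv.transform(test_docs)

# Train the multinomial Naive Bayes classifier with Laplace smoothing.
clf = MultinomialNB(alpha=1.0, fit_prior=True)
clf.fit(term_docs_train, train_labels)

# predict returns the most likely class for each document;
# predict_proba returns the per-class probabilities.
predictions = clf.predict(term_docs_test)
probabilities = clf.predict_proba(term_docs_test)
```

On this toy data the first test document shares words with the spam examples and the second with the ham examples, so the classifier labels them 1 and 0 respectively.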

List of posts

This post is part of a series of posts:

  1. Preparation and introduction
  2. Naive Bayes by example
  3. Scrubbing natural language text
  4. Naive Bayes’ Classifier
  5. Writing Naive Bayes from scratch
  6. Using the scikit-learn library (this post)