In the previous article we wrote the Naive Bayes algorithm from scratch. Now we want to use the pre-built Naive Bayes classifier that comes with the scikit-learn library.

First we copy the code from the previous post that reads the emails and cleans them; this part stays the same.

import glob
import os
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer

emails, labels = [], []

# Read the spam emails and label them 1
for filename in glob.glob(os.path.join('data/spam', '*.txt')):
    with open(filename, 'r', encoding='ISO-8859-1') as infile:
        emails.append(infile.read())
        labels.append(1)

# Read the ham (non-spam) emails and label them 0
for filename in glob.glob(os.path.join('data/ham', '*.txt')):
    with open(filename, 'r', encoding='ISO-8859-1') as infile:
        emails.append(infile.read())
        labels.append(0)

lemmatizer = WordNetLemmatizer()
all_names = set(names.words())

def clean_text(docs):
    cleaned_docs = []
    for doc in docs:
        cleaned_docs.append(
            ' '.join([
                lemmatizer.lemmatize(word.lower())
                for word in doc.split()
                if word.isalpha() and word not in all_names
            ])
        )
    return cleaned_docs

cleaned_emails = clean_text(emails)
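Before the classifier can be trained, the cleaned emails have to be converted into a sparse term-document matrix and split into training and test sets. One common way to do this (a sketch with a hypothetical toy corpus standing in for `cleaned_emails` and `labels`; the previous post may have built the matrix differently) is scikit-learn's `CountVectorizer` together with `train_test_split`:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in for cleaned_emails / labels from above
cleaned_emails = ['free prize click now', 'meeting agenda attached',
                  'win money free offer', 'lunch tomorrow at noon']
labels = [1, 0, 1, 0]

# Split the raw documents first so test vocabulary never leaks into training
emails_train, emails_test, labels_train, labels_test = train_test_split(
    cleaned_emails, labels, test_size=0.25, random_state=42)

# Fit the vocabulary on the training set only, then reuse it for the test set
vectorizer = CountVectorizer(stop_words='english', max_features=500)
term_docs_train = vectorizer.fit_transform(emails_train)
term_docs_test = vectorizer.transform(emails_test)
```

Note that the vectorizer is fitted on the training documents only and merely applied to the test documents, so both matrices share the same columns.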


Now the fun part: we import and instantiate the classifier.

from sklearn.naive_bayes import MultinomialNB
# alpha is the additive (Laplace) smoothing factor
# fit_prior=True means the class priors are learned from the training data
clf = MultinomialNB(alpha=1.0, fit_prior=True)

# To train the model, we call clf.fit(sparse_matrix, labels), where
# sparse_matrix is a sparse term-document matrix of the input words
# and labels holds the classes of the training data
clf.fit(term_docs_train, labels_train)

# After training, the predict method returns the predicted class
# for each document in the test matrix
clf.predict(term_docs_test)
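Putting the pieces together, here is a minimal end-to-end sketch. The toy documents and the `CountVectorizer` step are hypothetical stand-ins for the real `term_docs_*` matrices; `predict_proba` (a standard scikit-learn method) additionally gives the per-class probabilities behind each prediction:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data standing in for the real corpus
train_docs = ['free money offer', 'project meeting notes',
              'free free prize', 'see you at lunch']
labels_train = [1, 0, 1, 0]  # 1 = spam, 0 = ham

vectorizer = CountVectorizer()
term_docs_train = vectorizer.fit_transform(train_docs)

clf = MultinomialNB(alpha=1.0, fit_prior=True)
clf.fit(term_docs_train, labels_train)

# Vectorize an unseen document with the same vocabulary, then classify it
term_docs_test = vectorizer.transform(['free prize money'])
prediction = clf.predict(term_docs_test)            # array of class labels
probabilities = clf.predict_proba(term_docs_test)   # columns follow clf.classes_
print(prediction, probabilities)
```

Since "free", "prize", and "money" all appear in the spam documents, the classifier should label this test document as spam.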


## List of posts

This post is part of a series of posts:

1. Preparation and introduction
2. Naive Bayes by example
3. Scrubbing natural language text
4. Naive Bayes’ classifier
5. Writing Naive Bayes from scratch
6. Using the scikit-learn library (this post)