In the previous article we wrote the Naive Bayes algorithm from scratch. Now we want to use the pre-built Naive Bayes classifier that comes with the scikit-learn library.
First we copy the same code that reads the emails and cleans them. This code stays the same.
```python
import glob
import os

from nltk.corpus import names
from nltk.stem import WordNetLemmatizer

emails, labels = [], []

def read_file(filename):
    with open(filename, 'r', encoding='ISO-8859-1') as infile:
        return infile.read()

# spam emails are labelled 1, ham (legitimate) emails 0
for filename in glob.glob(os.path.join('data/spam', '*.txt')):
    emails.append(read_file(filename))
    labels.append(1)

for filename in glob.glob(os.path.join('data/ham', '*.txt')):
    emails.append(read_file(filename))
    labels.append(0)

lemmatizer = WordNetLemmatizer()
all_names = set(names.words())

def clean_text(docs):
    cleaned_docs = []
    for doc in docs:
        cleaned_docs.append(
            ' '.join([
                lemmatizer.lemmatize(word.lower())
                for word in doc.split()
                if word.isalpha() and word not in all_names
            ])
        )
    return cleaned_docs

cleaned_emails = clean_text(emails)
```
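The classifier in the next step expects sparse term-count matrices (`term_docs_train`, `term_docs_test`) rather than raw strings, so the cleaned emails still have to be vectorized and split. That step isn't shown above; a minimal sketch of how it could look, assuming scikit-learn's `CountVectorizer` and `train_test_split` (the tiny toy corpus is only there so the snippet runs on its own; in the article it would be `cleaned_emails` and `labels` from the cleaning step):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-ins for cleaned_emails and labels from the cleaning step above.
cleaned_emails = [
    "free money win prize now",
    "meeting schedule project report",
    "win free lottery cash prize",
    "lunch with the project team",
]
labels = [1, 0, 1, 0]

# split the raw text first, so the vectorizer is fitted on training data only
X_train, X_test, labels_train, labels_test = train_test_split(
    cleaned_emails, labels, test_size=0.25, random_state=42
)

# CountVectorizer turns each email into a sparse vector of term counts
vectorizer = CountVectorizer(stop_words='english', max_features=500)
term_docs_train = vectorizer.fit_transform(X_train)
term_docs_test = vectorizer.transform(X_test)
```

Fitting the vectorizer only on the training split, then reusing the learned vocabulary for the test split, avoids leaking information from the test set into the model.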
Now for the fun part, where we call the pre-built classifier.
```python
from sklearn.naive_bayes import MultinomialNB

# alpha is the additive (Laplace) smoothing factor
# fit_prior=True means the class priors are learned from the training data
clf = MultinomialNB(alpha=1.0, fit_prior=True)

# to train the model, we call clf.fit(sparse_matrix, labels)
# sparse_matrix is the sparse term-document matrix of the input words
# labels holds the classes of the training data
clf.fit(term_docs_train, labels_train)

# after training the model, we use the predict method on the test matrix
prediction = clf.predict(term_docs_test)
```
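After `predict`, we usually want to know how good the predictions are. A hedged end-to-end sketch, comparing the predictions against held-out labels with scikit-learn's `accuracy_score` (the toy corpus and the `labels_test` variable are assumptions standing in for the real data from the previous steps):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the article's cleaned_emails and labels (spam = 1, ham = 0),
# so the snippet runs on its own.
cleaned_emails = [
    "win free prize money now",
    "meeting moved to friday afternoon",
    "claim your free lottery cash",
    "project report attached for review",
    "free cash prize winner claim now",
    "lunch with the team on monday",
]
labels = [1, 0, 1, 0, 1, 0]

# hold out a third of the data, keeping the spam/ham ratio with stratify
X_train, X_test, labels_train, labels_test = train_test_split(
    cleaned_emails, labels, test_size=1/3, random_state=42, stratify=labels
)

vectorizer = CountVectorizer()
term_docs_train = vectorizer.fit_transform(X_train)
term_docs_test = vectorizer.transform(X_test)

clf = MultinomialNB(alpha=1.0, fit_prior=True)
clf.fit(term_docs_train, labels_train)

prediction = clf.predict(term_docs_test)
accuracy = accuracy_score(labels_test, prediction)
print(f"accuracy: {accuracy:.2f}")
```

Equivalently, `clf.score(term_docs_test, labels_test)` computes the same mean accuracy in one call.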
This post is part of a series of posts.