Naive Bayes
Introduction
Consider our usual problem: given a feature vector and an output vector, a naive Bayes classifier helps us assign a new input instance to one of the classes in the output vector. Naive Bayes classification rests on Bayes’ theorem plus a couple of simple assumptions. Here is Bayes’ theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
In plain English, the posterior probability of A given B equals the likelihood of B given A, multiplied by the prior probability of A, and divided by the evidence P(B).
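As a quick worked instance with made-up numbers: suppose 20% of incoming emails are spam, half of all spam messages contain the word "free", and 15% of all emails contain the word "free". Then:

$$P(\text{spam} \mid \text{"free"}) = \frac{P(\text{"free"} \mid \text{spam})\,P(\text{spam})}{P(\text{"free"})} = \frac{0.5 \times 0.2}{0.15} \approx 0.67$$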
Let’s come back to the model and talk about the other component of naive Bayes: the two naive assumptions. First, we assume that the input features are conditionally independent of each other given the class, and second, that each input feature contributes equally to the class prediction y. Despite being simple, these assumptions work quite well in practice. How do they translate into math notation? We first formulate the problem using symbols: let $X = (x_1, \dots, x_n)$ be the input vector and let $y$ be the predicted class, one of $K$ possible classes. To get the probability of $y$ given $X$, we use Bayes’ theorem:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\,P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$$
Utilizing the naive assumptions, the joint likelihood factorizes into a product of per-feature terms:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\,\prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}$$
Since the denominator is constant for a given input, we drop it and use proportionality instead:

$$P(y \mid x_1, \dots, x_n) \propto P(y)\,\prod_{i=1}^{n} P(x_i \mid y)$$
Maximum likelihood
Applying the usual maximum likelihood principle, we pick the class y with the highest probability given the observed features (which is also the most probable class to predict):

$$\hat{y} = \arg\max_{y} P(y \mid x_1, \dots, x_n)$$
Applying the transformation from the previous section:

$$\hat{y} = \arg\max_{y}\; P(y)\,\prod_{i=1}^{n} P(x_i \mid y)$$
From here, the calculation of $\hat{y}$ depends on what we assume about the distribution of the conditional probabilities $P(x_i \mid y)$; each assumption gives one of the naive Bayes variants below.
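To make the decision rule concrete, here is a minimal sketch in plain Python; the priors and the per-feature conditional probabilities (already evaluated at the observed feature values) are made-up numbers for illustration:

import math

# made-up priors and per-feature likelihoods P(x_i | y), evaluated at the observed x
priors = {"jam": 0.6, "ok": 0.4}
likelihoods = {
    "jam": [0.7, 0.2, 0.5],
    "ok":  [0.3, 0.6, 0.4],
}

def predict(classes=("jam", "ok")):
    # argmax_y  log P(y) + sum_i log P(x_i | y)   (log space avoids underflow)
    scores = {c: math.log(priors[c]) + sum(math.log(p) for p in likelihoods[c])
              for c in classes}
    return max(scores, key=scores.get)

print(predict())  # the class with the highest score wins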
Gaussian naive Bayes
When we assume the conditional probabilities $P(x_i \mid y)$ to be normally distributed, we make use of the following characteristics of that distribution, estimated per class from the training data:

- Mean: $\mu_y = \frac{1}{n_y}\sum_{j:\, y_j = y} x_j$
- Standard deviation: $\sigma_y = \sqrt{\frac{1}{n_y - 1}\sum_{j:\, y_j = y} (x_j - \mu_y)^2}$
- Density function: $P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}}\,\exp\!\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$
Example 1: Consider a small dataset with two features, precipitation (rain level) and holiday, and one output: traffic jam or not. The precipitation levels are 86, 96, 80, 65, 70, 80, 90, 75 on days with traffic jams and 85, 90, 70, 95, 91 on days when traffic is ok. Using the two different sets of rain levels (one per traffic class), we compute a mean and standard deviation for each class, and then plug those class-specific parameters into the density function. The probability of a traffic jam given a rain level of 74 would be:

$$P(\text{rain}=74 \mid \text{jam}) = \frac{1}{\sqrt{2\pi\sigma_{\text{jam}}^2}}\,\exp\!\left(-\frac{(74 - \mu_{\text{jam}})^2}{2\sigma_{\text{jam}}^2}\right)$$
The probability of traffic being ok with a rain level of 74 would be (this time with the other class’s mean and standard deviation):

$$P(\text{rain}=74 \mid \text{ok}) = \frac{1}{\sqrt{2\pi\sigma_{\text{ok}}^2}}\,\exp\!\left(-\frac{(74 - \mu_{\text{ok}})^2}{2\sigma_{\text{ok}}^2}\right)$$
Note that we need to normalize these scores before using them: compute the score for every class, sum them, and divide each by the total to get a proper probability.
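A small sketch of that computation in Python; the numbers come straight from the example, and using the sample (rather than population) standard deviation is a modeling choice here:

import math

jam = [86, 96, 80, 65, 70, 80, 90, 75]  # rain levels on traffic-jam days
ok = [85, 90, 70, 95, 91]               # rain levels on traffic-ok days

def mean_std(xs):
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)  # sample variance
    return mu, math.sqrt(var)

def gaussian(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu_jam, sd_jam = mean_std(jam)
mu_ok, sd_ok = mean_std(ok)
p_jam = gaussian(74, mu_jam, sd_jam)
p_ok = gaussian(74, mu_ok, sd_ok)

# normalize the two scores so they sum to 1, as described above
total = p_jam + p_ok
print("P(jam | rain=74) ~", p_jam / total)
print("P(ok  | rain=74) ~", p_ok / total)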
Example 2: A credit institution decides whether to extend a credit line to a person based on several factors: age, income, student status, and credit rate. Here we scale income to the range 0 to 10, encode the categorical attribute “student-or-not” as 0 and 1 (0 meaning not a student), and do the same for the “lend-or-not” target. The credit rate also ranges from 0 to 10.
| ID | Age | Income | Student | Credit rate | Lend or not? |
|----|-----|--------|---------|-------------|--------------|
| 1  | 23  | 8      | 0       | 7           | 0            |
| 2  | 45  | 9      | 0       | 5           | 1            |
| 3  | 60  | 5      | 0       | 6           | 1            |
| …  |     |        |         |             |              |
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
# each row is [age, income, student, credit rate]
X = [[23, 8, 0, 7], [45, 9, 0, 5], [60, 5, 0, 6], [34, 2, 1, 9], [14, 4, 0, 4],
     [22, 8, 0, 7], [40, 4, 1, 5], [65, 5, 0, 6], [35, 2, 1, 9], [4, 4, 0, 4],
     [25, 4, 1, 5], [45, 9, 0, 5], [60, 2, 0, 4], [34, 1, 1, 9], [14, 2, 0, 4],
     [19, 8, 1, 8], [42, 6, 0, 7], [61, 5, 0, 6], [34, 2, 1, 10], [14, 4, 1, 4]]
# lend-or-not target: 1 = lend, 0 = do not lend
y = [0, 1, 1, 1, 1,
     0, 1, 1, 1, 1,
     1, 1, 0, 0, 0,
     1, 1, 1, 0, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
# train the model on the training set
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred)*100)
# Output: Accuracy: 62.5
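Once fitted, the same model can score a new applicant. The feature values below are made up purely for illustration:

# a hypothetical new applicant: age 30, income 7, student, credit rate 8
print(gnb.predict([[30, 7, 1, 8]]))        # predicted class (0 or 1)
print(gnb.predict_proba([[30, 7, 1, 8]]))  # class probabilities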
Multinomial naive Bayes
In this classifier, each feature is a count of how many times an event occurs (for text, how many times a word appears), and the conditional probability $P(x_i \mid y)$ is estimated as the relative frequency of feature i among all samples of class y.
Consider the typical use case for multinomial naive Bayes: document classification. Then each document is represented by its word counts, the prior $P(y)$ is the fraction of documents belonging to class y, and $P(\text{word}_i \mid y)$ is the number of occurrences of word i in documents of class y divided by the total number of words in that class.
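A minimal scikit-learn sketch of that workflow; the tiny corpus and labels below are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["why are they so stupid", "how do planes fly",
        "why is this group so dumb", "how do I learn python"]
labels = [1, 0, 1, 0]  # 1 = insincere, 0 = sincere (made-up toy labels)

vec = CountVectorizer()       # bag-of-words counts
X_counts = vec.fit_transform(docs)
clf = MultinomialNB()         # applies Laplace smoothing (alpha=1.0) by default
clf.fit(X_counts, labels)

print(clf.predict(vec.transform(["why are cats so dumb"])))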
Laplace smoothing
Despite its fancy name, both the rationale and the tweak are simple: to prevent zeros in the frequency table from wiping out the whole product and obstructing computation, we add 1 to every count in the estimate. The name reflects that the trick has a proper mathematical interpretation and can be applied elsewhere; without it, a single unseen word would force a class probability of exactly zero and break inference.
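Written out for the word-count case with add-one smoothing, where $N_{yi}$ is the count of word $i$ in class $y$, $N_y$ is the total word count of class $y$, and $n$ is the vocabulary size:

$$P(x_i \mid y) = \frac{N_{yi} + 1}{N_y + n}$$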
Example 3: Classify a question on Quora as sincere or insincere, where insincere means hate speech or a question not asked in good faith.
- Preprocess the data:
  - Remove numbers and punctuation
  - Remove stopwords
  - Apply stemming and lemmatization
- Train the model:
  - Find the probability of each word; eliminate words with probability smaller than 0.0001
  - Find the conditional probability of each word = count of that word / total number of (in)sincere words
- Predict (with Laplace smoothing): if insincere_term / (insincere_term + sincere_term) > 0.5, label the question insincere
- Calculate accuracy
import numpy as np
import pandas as pd
train = pd.read_csv('train.csv')  # labeled Quora questions: question_text, target (1 = insincere)
# hold out 20% of the labeled data for evaluation
from sklearn.model_selection import train_test_split
train, test = train_test_split(train, test_size=0.2)
word_count = {}            # word -> count over all questions
word_count_sincere = {}    # word -> count over sincere questions
word_count_insincere = {}  # word -> count over insincere questions
sincere = 0                # number of sincere questions seen
insincere = 0              # number of insincere questions seen
import re
import string
import nltk
from nltk.stem import PorterStemmer

stop_words = set(nltk.corpus.stopwords.words('english'))  # needs nltk.download('stopwords') on first run
stemmer = PorterStemmer()
row_count = train.shape[0]
for row in range(0, row_count):
    insincere += train.iloc[row]['target']
    sincere += (1 - train.iloc[row]['target'])
    sentence = train.iloc[row]['question_text']
    sentence = re.sub(r'\d+', '', sentence)  # remove numbers
    sentence = sentence.translate(sentence.maketrans("", "", string.punctuation))  # remove punctuation
    words_in_sentence = list(set(sentence.split(' ')) - stop_words)  # remove stopwords
    for index, word in enumerate(words_in_sentence):
        words_in_sentence[index] = stemmer.stem(word)  # stem each remaining word
    for word in words_in_sentence:
        if train.iloc[row]['target'] == 0:  # sincere words
            if word in word_count_sincere.keys():
                word_count_sincere[word] += 1
            else:
                word_count_sincere[word] = 1
        elif train.iloc[row]['target'] == 1:  # insincere words
            if word in word_count_insincere.keys():
                word_count_insincere[word] += 1
            else:
                word_count_insincere[word] = 1
        if word in word_count.keys():  # all words; used to compute word probabilities
            word_count[word] += 1
        else:
            word_count[word] = 1
# find the probability of each word; eliminate words with probability < 0.0001
word_probability = {}
total_words = sum(word_count.values())
for i in word_count:
    word_probability[i] = word_count[i] / total_words
print('Total words ', len(word_probability))
print('Minimum probability ', min(word_probability.values()))
threshold_p = 0.0001
for i in list(word_probability):  # list() so we can delete keys while iterating
    if word_probability[i] < threshold_p:
        del word_probability[i]
        if i in word_count_sincere:
            del word_count_sincere[i]
        if i in word_count_insincere:
            del word_count_insincere[i]
print('Total words ', len(word_probability))
# find the conditional probability of each remaining word within its class
total_sincere_words = sum(word_count_sincere.values())
cp_sincere = {}  # conditional probability P(word | sincere)
for i in word_count_sincere:
    cp_sincere[i] = word_count_sincere[i] / total_sincere_words
total_insincere_words = sum(word_count_insincere.values())
cp_insincere = {}  # conditional probability P(word | insincere)
for i in word_count_insincere:
    cp_insincere[i] = word_count_insincere[i] / total_insincere_words
# predict on the held-out set
row_count = test.shape[0]
p_insincere = insincere / (sincere + insincere)  # class priors
p_sincere = sincere / (sincere + insincere)
accuracy = 0
for row in range(0, row_count):
    sentence = test.iloc[row]['question_text']
    target = test.iloc[row]['target']
    # same preprocessing as at training time
    sentence = re.sub(r'\d+', '', sentence)
    sentence = sentence.translate(sentence.maketrans("", "", string.punctuation))
    words_in_sentence = list(set(sentence.split(' ')) - stop_words)
    for index, word in enumerate(words_in_sentence):
        words_in_sentence[index] = stemmer.stem(word)
    insincere_term = p_insincere
    sincere_term = p_sincere
    # Laplace-style smoothing: grow the denominator by one for every unseen word
    sincere_M = len(cp_sincere.keys())
    insincere_M = len(cp_insincere.keys())
    for word in words_in_sentence:
        if word not in cp_insincere.keys():
            insincere_M += 1
        if word not in cp_sincere.keys():
            sincere_M += 1
    for word in words_in_sentence:
        if word in cp_insincere.keys():
            insincere_term *= (cp_insincere[word] + (1 / insincere_M))
        else:
            insincere_term *= (1 / insincere_M)
        if word in cp_sincere.keys():
            sincere_term *= (cp_sincere[word] + (1 / sincere_M))
        else:
            sincere_term *= (1 / sincere_M)
    if insincere_term / (insincere_term + sincere_term) > 0.5:
        response = 1
    else:
        response = 0
    if target == response:
        accuracy += 1
print('Accuracy is ', accuracy / row_count * 100)
# Output: Accuracy is 94.13
Bernoulli naive Bayes
In this case, we only care about whether a word appears in the document, not how many times it appears. Each feature $x_i$ is binary, and the per-feature likelihood is:

$$P(x_i \mid y) = P(i \mid y)\,x_i + \bigl(1 - P(i \mid y)\bigr)\,(1 - x_i)$$

In the above equation, if $x_i = 1$ (the word appears) the term reduces to $P(i \mid y)$, and if $x_i = 0$ it reduces to $1 - P(i \mid y)$, so the model explicitly penalizes the absence of words that are common in class y.
The likelihood of the whole document would then be:

$$P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} \Bigl[ P(i \mid y)\,x_i + \bigl(1 - P(i \mid y)\bigr)\,(1 - x_i) \Bigr]$$
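For completeness, a minimal scikit-learn sketch of this variant, reusing the made-up toy corpus from the multinomial example; CountVectorizer(binary=True) produces the 0/1 presence features that BernoulliNB expects:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

docs = ["why are they so stupid", "how do planes fly",
        "why is this group so dumb", "how do I learn python"]
labels = [1, 0, 1, 0]  # 1 = insincere, 0 = sincere (made-up toy labels)

vec = CountVectorizer(binary=True)  # 1 if the word appears in the question, 0 otherwise
X_bin = vec.fit_transform(docs)
clf = BernoulliNB()                 # also applies add-one smoothing by default
clf.fit(X_bin, labels)

print(clf.predict(vec.transform(["why are cats so dumb"])))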