Analytics and Visualization of Big Data: Naive Bayes: a simple classifier

Friday, May 3, 2013

Naive Bayes: a simple classifier

The purpose of this blog post is to introduce a probabilistic classifier that is often implemented through computer software called “Naive Bayes” which is essentially used for pattern recognition within some data set. I will draw the majority of my understanding in order to write this post from the this video.

He begins the video by explaining the structure of the dataset necessary for the application of the Naïve Bayes classifier. From the video, the set up for the data should follow this form.

D = ((x⁽¹⁾, y₁), … , (x⁽ⁿ⁾,y_(n))) D is an algebraic expression for the data set.In an attempt to make that more clear, the variable x⁽¹⁾represents a coordinate pair ( x₁⁽¹⁾,x_d⁽¹⁾) where the superscript shows what point that coordinate belongs to and the subscript indexes that coordinate. X⁽ⁱ⁾ is a point in the space of R^d . Y_ibelongs to some finite set. In the video, he states that y represents a finite set that will be the integers from 1 to n.

He mentions in the video that there are several assumptions made when taking on a Naïve Bayes approach to classification. Those assumptions are listed as the following.

1) We assume we have a family of some set of distributions parametrized by theta and these distributions will have the following properties. Each of these is a joint distribution on x and y. So here x is going to be in Rd and Y is a class.

2) P_Θ(x,y) = P_Θ(x|y) P_Θ(y) = P_Θ(x₁|y) … P_Θ(x_d|y) P_Θ(y)

He mentions in the video that this second assumption is very key to the Naïve Bayes classification approach and that what this essential means is that the first expression of assumption 2 (P_Θ(x,y) = P_Θ(x|y) P_Θ(y)) will factor to equal the second expression. What I essential draw from this is that we assume that P_Θ(x|y) factors out to be the probability of the first coordinate x given y up to the last coordinate of x given y.

3) Assume that the points are independently, identically distributed based on parameter Θ. He mentions that in this context, the coordinates x₁ … x_n are independent given y if (X,Y) ~ P_Θ_.

He finally makes things a bit clearer at 7:35 in the video. He mentions that the main assumption is a conditional independence assumption.

At this point, he explains the “goal” of Naïve Bayes. Essentially he says that when some new x enters the data set, we want to “predict” its y. He mentions that the algorithm initiates by attempting to estimate the parameter Θ for which it is believed the distribution of the (x,y)s follow. Theta is estimated from the data and then we compute the prediction of y that maximizes over all possible classes the probability of that class given the new x. Because we assume that P_Θ(x|y) P_Θ(y) factors out to be = P_Θ(x₁|y) … P_Θ(x_d|y) P_Θ(y), we will attempt to maximize the prediction y across all x’s, and that value should give the new prediction for the value of y given to the new x. For a better understanding, please watch the video.