[MLE] Linear Classification

Linear Classification

  • Discriminant functions
  • Least squares for classification
  • Fisher’s linear discriminant
  • K-Nearest Neighbour (KNN)

Classification

What are decision boundaries, and what does linearly separable mean?

The goal in classification is to take an input vector x and to assign it to one of K discrete classes \(C_k\) where k = 1,…,K. In the most common scenario, the classes are taken to be disjoint, so that each input is assigned to one and only one class. The input space is thereby divided into decision regions whose boundaries are called decision boundaries or decision surfaces.
In this blog, we consider linear models for classification, by which we mean that the decision surfaces are linear functions of the input vector x and hence are defined by (D-1) -dimensional hyperplanes within the D-dimensional input space.
For example, in a 3D input space the decision boundary will be a 2D plane.

What is Linear Separability

Linearly separable data:

  • Datasets whose classes can be separated by linear decision surfaces
  • Implies no class-overlap
  • Classes can be divided by, e.g., lines for 2D data or planes for 3D data
  • Notation: \(h(x) \rightarrow \widehat{t}\)

Binary notation (2 classes)

For two classes, the target is typically coded as a single scalar, e.g. \(t \in \{0, 1\}\), with t = 1 representing class \(C_1\) and t = 0 representing class \(C_2\).

Multi-class

For K > 2 classes, it is convenient to use a 1-of-K (one-hot) coding scheme, in which t is a vector of length K with \(t_k = 1\) for the true class and all other elements equal to 0.
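A tiny illustration of the 1-of-K coding in NumPy (with made-up labels):

```python
import numpy as np

labels = np.array([0, 2, 1, 2])   # made-up class indices for 4 examples, K = 3
K = 3

# Binary (2-class) coding is simply t in {0, 1} per example.
# 1-of-K (one-hot) coding turns each label into a length-K indicator vector:
T = np.eye(K)[labels]
print(T)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```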

So how does classification actually work?

The hypothesis function is :
\(h(x) = f(w^Tx+w_0)\)

  • In this formula, \(w^Tx+w_0\) is the same as in the linear regression model, so the decision surfaces are linear functions of x
  • However, there is an \(f(\cdot)\), a non-linear activation function, used to obtain binary decisions
  • For example, a common probabilistic choice is the sigmoid function, giving \(h(x) = \frac{1}{1+e^{-(w^Tx+w_0)}}\)

Question: do you know how to calculate the decision boundary function when the activation function is the sigmoid?
Setting h(x) = 0.5 gives \(w^Tx+w_0 = 0\). If x is a 2D vector \([x, y]\), this becomes \(w_1x+w_2y+w_0 = 0\), which is our decision boundary function.
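As a minimal sketch (with hypothetical values for w and \(w_0\)), the sigmoid hypothesis and points on the corresponding boundary line can be computed in NumPy as follows:

```python
import numpy as np

def sigmoid_hypothesis(x, w, w0):
    """h(x) = 1 / (1 + exp(-(w^T x + w0)))."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + w0)))

# Hypothetical 2D weights: the decision boundary is w1*x + w2*y + w0 = 0
w = np.array([2.0, -1.0])
w0 = 0.5

print(sigmoid_hypothesis(np.array([1.0, 3.0]), w, w0))   # score in (0, 1); > 0.5 means class C1

# Points on the boundary satisfy h(x) = 0.5, i.e. y = -(w1*x + w0) / w2
xs = np.linspace(-5, 5, 5)
ys = -(w[0] * xs + w0) / w[1]
print(list(zip(xs, ys)))                                  # sample points on the boundary line
```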

Discriminant functions

A discriminant is a function that takes an input vector x and assigns it to one of K classes, denoted \(C_k\). Now, we shall restrict attention to linear discriminants, and we consider first the case of two classes and then K>2 classes.

Two classes

The simplest representation of a linear discriminant function is obtained by taking a linear function of the input vector, \(y(x) = w^Tx+w_0\), where w is called the weight vector and \(w_0\) is the bias.
An input vector x is assigned to class \(C_1\) if y(x) ≥ 0 and to class \(C_2\) otherwise.
The corresponding decision boundary is therefore defined by the relation y(x) = 0, which corresponds to a (D-1)-dimensional hyperplane within the D-dimensional input space.
Consider two points \(x_A\) and \(x_B\), both of which lie on the decision surface. Because \(y(x_A) = y(x_B) = 0\), we have \(w^T(x_A - x_B) = 0\), and hence the vector w is orthogonal to every vector lying within the decision surface; w therefore determines the orientation of the decision surface. Similarly, if x is a point on the decision surface, then y(x) = 0, and so the normal distance from the origin to the decision surface is given by \(\frac{w^Tx}{\|w\|} = -\frac{w_0}{\|w\|}\).
From this, we should also be able to calculate the perpendicular distance from an arbitrary point x to the decision surface.


Procedure of the derivation:
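A brief sketch of the standard geometric argument: decompose x as \(x = x_\perp + r\frac{w}{\|w\|}\), where \(x_\perp\) is the projection of x onto the decision surface and r is the signed distance from x to the surface along the normal direction. Applying \(y(\cdot) = w^T(\cdot) + w_0\) to both sides and using \(y(x_\perp) = 0\) gives \(y(x) = r\|w\|\), and hence
\(r = \frac{y(x)}{\|w\|}\).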

Multiple classes

Now consider the extension of linear discriminants to K > 2 classes. We might be tempted to build a K-class discriminant by combining a number of two-class discriminant functions. However, this leads to some serious difficulties.
Consider the use of K-1 classifiers each of which solves a two-class problem of separating points in a particular class \(C_k\) from points not in that class. This is known as a one-versus-the-rest classifier.
An alternative is to introduce K(K-1)/2 binary discriminant functions, one for every possible pair of classes. This is known as a one-versus-one classifier.
Both approaches, however, leave regions of input space that are ambiguously classified (assigned to more than one class, or to none).

We can avoid these difficulties by considering a single K-class discriminant comprising K linear functions of the form
\(y_k(x) = w_k^Tx + w_{k0}\)
and then assigning a point x to class \(C_k\) if \(y_k(x) > y_j(x)\) for all j ≠ k (intuitively, each \(y_k\) acts like a score, and we assign x to the class with the largest score).
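As a minimal sketch (with hypothetical weights W and biases \(w_0\)), the argmax rule looks like this in NumPy:

```python
import numpy as np

def predict_class(x, W, w0):
    """Assign x to the class with the largest linear score y_k(x) = w_k^T x + w_k0.

    W  : array of shape (K, D), one weight vector per class (hypothetical values)
    w0 : array of shape (K,), one bias per class
    """
    scores = W @ x + w0              # y_k(x) for k = 1..K
    return int(np.argmax(scores))

# Hypothetical 3-class example in 2D
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])
print(predict_class(np.array([2.0, -0.5]), W, w0))   # index of the winning class
```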

Training LDA

We now explore three approaches to learning the parameters of a linear discriminant function, based on least squares, Fisher's linear discriminant, and the perceptron algorithm.
Aim: Find w that minimises some error function on the training set.
Alright, arguably the best one, logistic regression, has been covered before, but in this module we haven't talked about it. I also wondered why…

Least squares for classification

Each class \(C_k\) has its own linear model \(y_k(x) = w_k^Tx + w_{k0}\); stacking the weight vectors into a matrix W (with the biases absorbed via a constant input feature), the predictions for the whole training set can be written as XW. The sum-of-squares error in matrix form is
\(E_D(W) = \frac{1}{2}\mathrm{Tr}\{(XW - T)^T(XW - T)\}\)
and minimising it gives the closed-form solution \(W = (X^TX)^{-1}X^TT\).

Let's check the dimensions of this error (cost) function:

  • X is n × m, where n is the number of training examples and m is the number of features
  • W is m × k, because each of the k classes has its own weight vector
  • XW is n × k, and T is also n × k, which makes sense

So \((XW - T)^T(XW - T)\) is a k × k matrix of squared errors, and Tr (the trace, i.e. the sum of the diagonal elements) turns it into a single number.
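A minimal NumPy sketch of this closed-form least-squares classifier (using made-up toy data; the names X, T, W follow the notation above):

```python
import numpy as np

# Toy data: n = 6 examples, m = 2 features, k = 3 classes (made-up values)
X = np.array([[0.0, 0.1], [0.2, -0.1], [1.0, 1.1],
              [0.9, 1.0], [2.1, -2.0], [2.0, -1.9]])
labels = np.array([0, 0, 1, 1, 2, 2])
n, m = X.shape
k = 3

# One-hot target matrix T of shape (n, k)
T = np.zeros((n, k))
T[np.arange(n), labels] = 1.0

# Add a constant column so the bias w_0 is absorbed into W
X_aug = np.hstack([np.ones((n, 1)), X])      # shape (n, m + 1)

# Closed-form least-squares solution W = (X^T X)^{-1} X^T T, via the pseudo-inverse
W = np.linalg.pinv(X_aug) @ T                # shape (m + 1, k)

# Predict by taking the argmax of the k linear scores
pred = (X_aug @ W).argmax(axis=1)
print(pred)                                  # should match `labels` on this toy data
```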

Well, there are some pros and cons of the least-squares approach:

Benefits

  • Exact closed form solution

Disadvantages

  • Not probabilistic
  • Sensitive to outliers
  • Problems with multiple classes and/or unbalanced data

For example:
[Figures: least-squares decision boundaries pulled away by a few outlying points, and a three-class case where one class is poorly separated]

Fisher LDA

The idea behind Fisher's linear discriminant is to view classification as dimensionality reduction: project x down to one dimension via \(y = w^Tx\) and choose w so that the projected classes are separated as well as possible.
For two classes with mean vectors \(m_1\) and \(m_2\), maximising the separation of the projected means alone is not enough, because the classes may still overlap due to their within-class spread. Fisher's criterion therefore maximises the ratio of the between-class variance to the within-class variance of the projected data:
\(J(w) = \frac{(\tilde{m}_2 - \tilde{m}_1)^2}{s_1^2 + s_2^2} = \frac{w^TS_Bw}{w^TS_Ww}\)
where \(\tilde{m}_k = w^Tm_k\) are the projected class means, \(s_k^2\) are the within-class variances of the projected data, \(S_B = (m_2 - m_1)(m_2 - m_1)^T\) is the between-class covariance, and \(S_W\) is the total within-class covariance. Maximising J(w) gives the Fisher direction
\(w \propto S_W^{-1}(m_2 - m_1)\).
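A minimal NumPy sketch of the two-class Fisher direction (made-up toy data; S_W is the within-class scatter defined above):

```python
import numpy as np

# Toy two-class data (made-up values); rows are examples
X1 = np.array([[1.0, 2.0], [1.5, 2.2], [0.8, 1.9]])   # class C1
X2 = np.array([[3.0, 0.5], [3.2, 0.8], [2.8, 0.4]])   # class C2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter: sum of the scatter matrices of the two classes
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher direction: w ∝ S_W^{-1} (m2 - m1)
w = np.linalg.solve(S_W, m2 - m1)
w /= np.linalg.norm(w)

# Project the data onto w; a threshold on these 1D values separates the classes
print(X1 @ w, X2 @ w)
```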

KNN

  • In 1-Nearest Neighbour classification, each new example \(\widehat{x}\) is assigned the same label as the training example x(i) that minimises some distance function \(D(x(i), \widehat{x})\)
  • For K-Nearest Neighbours, a majority vote is taken over the labels of the K training instances with the smallest distance to \(\widehat{x}\) (see the sketch after this list)
  • Performance depends on the choice of K
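A minimal KNN sketch in NumPy (Euclidean distance and made-up toy data; not an optimised implementation):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training examples."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to each training point
    nearest = np.argsort(dists)[:k]                    # indices of the k closest examples
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (made-up values)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.9, 1.05]), k=3))   # -> 1
```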
