Artificial Intelligence & Algorithms

Algorithm

The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.
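
As a minimal sketch of this (the class and method names are illustrative, assuming NumPy rather than any particular library), the training step amounts to nothing more than memorizing the data:

import numpy as np

class KNNClassifier:
    """Illustrative k-NN 'model': training only stores the data."""

    def __init__(self, k=3):
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        # No optimization happens here; the feature vectors and their
        # class labels are simply kept for use at query time.
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y)
        return self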

 

In the classification phase, [latex]k[/latex] is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the [latex]k[/latex] training samples nearest to that query point.
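
Continuing the sketch as a standalone function (names are hypothetical, and Euclidean distance is assumed as the metric; alternatives are discussed below), the classification phase finds the [latex]k[/latex] stored points closest to the query and takes a majority vote over their labels:

from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Euclidean distance from the query to every stored training point.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most frequent label among those k neighbors.
    votes = Counter(np.asarray(y_train)[nearest])
    return votes.most_common(1)[0][0]

# Example: two small clusters in 2-D
X = [[0, 0], [0, 1], [5, 5], [5, 6]]
y = ["a", "a", "b", "b"]
print(knn_predict(X, y, [0.2, 0.4], k=3))  # -> "a"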

 

A commonly used distance metric for continuous variables is Euclidean distance. For discrete variables, such as for text classification, another metric can be used, such as the overlap metric (or Hamming distance). In the context of gene expression microarray data, for example, [latex]k[/latex]-NN has also been employed with correlation coefficients such as Pearson and Spearman. Often, the classification accuracy of [latex]k[/latex]-NN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighbourhood Components Analysis.
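
These metrics are simple functions of two feature vectors. The implementations below are illustrative sketches; in particular, using [latex]1 - r[/latex] as a Pearson-based dissimilarity is one common convention, not the only one:

import numpy as np

def euclidean(a, b):
    """Euclidean distance for continuous features."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

def hamming(a, b):
    """Overlap (Hamming) distance for discrete features: number of mismatched positions."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))

def pearson_distance(a, b):
    """Correlation-based dissimilarity (1 - Pearson r), as used e.g. for expression profiles."""
    r = np.corrcoef(np.asarray(a, dtype=float), np.asarray(b, dtype=float))[0, 1]
    return 1.0 - r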

 

A drawback of the basic “majority voting” classification occurs when the class distribution is skewed. That is, examples of a more frequent class tend to dominate the prediction of the new example, because they tend to be common among the [latex]k[/latex] nearest neighbors due to their large number. One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its [latex]k[/latex] nearest neighbors. The class (or value, in regression problems) of each of the [latex]k[/latex] nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the test point. Another way to overcome skew is by abstraction in data representation. For example, in a self-organizing map (SOM), each node is a representative (a center) of a cluster of similar points, regardless of their density in the original training data. [latex]k[/latex]-NN can then be applied to the SOM.
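
A short sketch of the distance-weighted variant (the function name is hypothetical, and a small constant guards against division by zero when the query coincides with a training point):

from collections import defaultdict
import numpy as np

def weighted_knn_predict(X_train, y_train, x_query, k=3, eps=1e-12):
    """Classify a query point with votes weighted by the inverse of the distance."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    distances = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    nearest = np.argsort(distances)[:k]
    # Each neighbor contributes a vote proportional to 1 / distance.
    scores = defaultdict(float)
    for i in nearest:
        scores[y_train[i]] += 1.0 / (distances[i] + eps)
    return max(scores, key=scores.get)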

Parameter Selection

The best choice of [latex]k[/latex] depends upon the data; generally, larger values of [latex]k[/latex] reduce the effect of noise on the classification, but make boundaries between classes less distinct. A good [latex]k[/latex] can be selected by various heuristic techniques (see hyperparameter optimization). The special case where the class is predicted to be the class of the closest training sample (i.e. when [latex]k = 1[/latex]) is called the nearest neighbor algorithm.
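
One simple heuristic, sketched below with illustrative helper names, is to score a few candidate values of [latex]k[/latex] by leave-one-out error on the training set and keep the best one (the candidate [latex]k = 1[/latex] corresponds to the nearest neighbor algorithm):

from collections import Counter
import numpy as np

def loo_error(X, y, k):
    """Leave-one-out misclassification rate of k-NN on (X, y)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    errors = 0
    for i in range(len(X)):
        distances = np.linalg.norm(X - X[i], axis=1)
        distances[i] = np.inf          # exclude the held-out point itself
        nearest = np.argsort(distances)[:k]
        predicted = Counter(y[nearest]).most_common(1)[0][0]
        errors += predicted != y[i]
    return errors / len(X)

def select_k(X, y, candidates=(1, 3, 5, 7, 9)):
    """Pick the candidate k with the lowest leave-one-out error."""
    return min(candidates, key=lambda k: loo_error(X, y, k))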

 

The accuracy of the [latex]k[/latex]-NN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance. Much research effort has been put into selecting or scaling features to improve classification. A particularly popular approach is the use of evolutionary algorithms to optimize feature scaling. Another popular approach is to scale features by the mutual information of the training data with the training classes.
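
One possible sketch of the mutual-information idea, assuming scikit-learn's mutual_info_classif estimator is available (the helper name is illustrative):

import numpy as np
from sklearn.feature_selection import mutual_info_classif  # assumes scikit-learn is installed

def scale_by_mutual_information(X_train, y_train, X_test):
    """Rescale each feature by its estimated mutual information with the class labels,
    so that uninformative features contribute little to the distance computation."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    weights = mutual_info_classif(X_train, y_train)
    return X_train * weights, X_test * weights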

 

In binary (two-class) classification problems, it is helpful to choose [latex]k[/latex] to be an odd number, as this avoids tied votes. One popular way of choosing the empirically optimal [latex]k[/latex] in this setting is via the bootstrap method.
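
A rough sketch of this bootstrap approach (helper names are illustrative): resample the training set with replacement, score each odd candidate [latex]k[/latex] on the out-of-bag points, and keep the value with the best average accuracy.

from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

def bootstrap_select_k(X, y, candidates=(1, 3, 5, 7, 9), n_rounds=50, seed=0):
    """Choose an odd k by averaging out-of-bag accuracy over bootstrap resamples."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    rng = np.random.default_rng(seed)
    scores = {k: [] for k in candidates}
    for _ in range(n_rounds):
        idx = rng.integers(0, len(X), size=len(X))      # sample with replacement
        oob = np.setdiff1d(np.arange(len(X)), idx)      # out-of-bag indices
        if len(oob) == 0:
            continue
        for k in candidates:
            preds = [knn_predict(X[idx], y[idx], x, k) for x in X[oob]]
            scores[k].append(np.mean(np.asarray(preds) == y[oob]))
    return max(candidates, key=lambda k: np.mean(scores[k]))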

  • Polynomial (homogeneous): [latex]k(\mathbf{x_i},\mathbf{x_j})=(\mathbf{x_i} \cdot \mathbf{x_j})^d[/latex]
  • Polynomial (inhomogeneous): [latex]k(\mathbf{x_i},\mathbf{x_j})=(\mathbf{x_i} \cdot \mathbf{x_j} + 1)^d[/latex]
  • Gaussian radial basis function: [latex]k(\mathbf{x_i},\mathbf{x_j})=\exp(-\gamma \|\mathbf{x_i}-\mathbf{x_j}\|^2)[/latex], for [latex]\gamma > 0[/latex]. Sometimes parametrized using [latex]\gamma=1/{2 \sigma^2}[/latex]
  • Hyperbolic tangent: [latex]k(\mathbf{x_i},\mathbf{x_j})=\tanh(\kappa \mathbf{x_i} \cdot \mathbf{x_j}+c)[/latex], for some (not every) [latex]\kappa > 0[/latex] and [latex]c < 0[/latex]
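
For reference, these kernels translate directly into code; the function names and default parameter values below are illustrative:

import numpy as np

def poly_homogeneous(x_i, x_j, d=2):
    """Polynomial (homogeneous) kernel: (x_i . x_j)^d."""
    return np.dot(x_i, x_j) ** d

def poly_inhomogeneous(x_i, x_j, d=2):
    """Polynomial (inhomogeneous) kernel: (x_i . x_j + 1)^d."""
    return (np.dot(x_i, x_j) + 1) ** d

def gaussian_rbf(x_i, x_j, gamma=0.5):
    """Gaussian RBF kernel: exp(-gamma * ||x_i - x_j||^2), gamma > 0 (gamma = 1 / (2 sigma^2))."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

def hyperbolic_tangent(x_i, x_j, kappa=1.0, c=-1.0):
    """Hyperbolic tangent kernel: tanh(kappa * x_i . x_j + c), for some kappa > 0 and c < 0."""
    return np.tanh(kappa * np.dot(x_i, x_j) + c)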
