# Workshop #1 Supervised Learning: KNN

Posted:2021-04-15 07:24:43　Click:318

## Introduction

K-Nearest Neighbors (KNN) is a basic classifier for machine learning. A classifier takes an already labeled data set, and then it trys to label new data points into one of the catagories. So, we are trying to identify what class an object is in. To do this we look at the closest points (neighbors) to the object and the class with the majority of neighbors will be the class that we identify the object to be in. The k is the number of nearest neighbors to the object. So, if k = 1 then the class the object would be in is the class of the closest neighbor. Let’s look at an example.

In this example we are trying to classify the red star to be either a green square or a blue octagon. First, if we look at the inner circle where k = 3, we can see that there are 2 blue octagons and 1 green square. So there is a majority of blue octagons, so the red star would be classified as a blue octagon. Now we look at k = 5, the outer circle. In this one there is 2 blue octagons and 3 green squares. Then, the red star would be classified as a green square.

## How does it work?

We will look at two different ways to go about this. The two ways are the brute force method and the K-D tree method.

## Brute Force Method

This is the simplest method. Basically, it’s just calculating the Euclidean distance from the object being classified to each point in the set. The Euclidean distance is simply the length of a line segment that connects two points. The Brute Force method is useful when the dimensions of the points are small or the number of points is small. As the number of points increases the number of times the method will have to calculate the Euclidean distance also increases, so the performance of the method drops. Luckily, the K-D tree method is better equipped for larger sets of data.

## K-D Tree Method

This method tries to improve the running time by reducing the amount of times we calculate the Euclidean distance. The idea behind this method is that if we know that two data points are close to each other and we calculate the Euclidean distance to one of them and then we know that distance is roughly close to the other point. Here is an example of how the K-D tree looks like.

How a K-D tree works is that a node in the tree represents and holds data from an n-dimensional graph. Each node represents a box in the graph. First we can build a K-D tree out of a set of data, then when it’s time to classify a point we would just look at where the point will fall in the tree then calculate the Euclidean distance between only the points it is close to until we reach k neighbors.

If you have a larger data set it is recommended to use this method. This is because the cost of creating the K-D tree is relatively low if the data set is larger, and the cost of classifiying a point is constant as the data gets larger.

## Choosing k

Choosing k typically depends on the dataset you are looking at. You never want to choose k = 2 because it has a very high chance that there won’t be a majority class, so in the example above there would be one of each so we wouldn’t be able to classify the red star. Typically, you want the value of k to be small. As k goes to infinity all unidentified data points will always be classified to one class or the other depending on which class has more data points. You don’t want this to happen, so it is wise to choose a k that is relatively small.

KNN Assignment

1) Implement a Java-based KNN() function with specific interface and parameters;
2) For iMADE implementation, implement a multiagent based KNN application.
In order to ensure a well defined multiagent system (on top of JADE platform), the application must at least consists of two (or more than 2 iMADE agents). E,g. Data-miner agent to get the data from the iMADE server, KNN agent to perform the KNN operations.

KNN Parameters

KNN(n_neighbors=5, weights=’uniform’, algorithm=’auto’, leaf_size=30, p=2, metric=’minkowski’, metric_params=None, n_jobs=None, **kwargs)

1)  n_neighbors : int, optional (default = 5) Number of neighbors to use by default for :meth:`kneighbors` queries.

2) Weights: used to identify the weight of the nearest samples of each sample. You can select "uniform", "distance" or user-defined weight. The default is "uniform", and the weights of all nearest neighbor samples are the same.

3) algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
Algorithm used to compute the nearest neighbors:

- 'ball_tree' will use :class:`BallTree`
- 'kd_tree' will use :class:`KDTree`
- 'brute' will use a brute-force search.
- 'auto' will attempt to decide the most appropriate algorithm
based on the values passed to :meth:`fit` method.

4) leaf_size : int, optional (default = 30)
Leaf size passed to BallTree or KDTree.  This can affect the
speed of the construction and query, as well as the memory
required to store the tree.  The optimal value depends on the
nature of the problem.

5) metric : string or callable, default 'minkowski' metric to use for distance computation. If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them.

6) p : integer, optional (default = 2)
Parameter for the Minkowski metric from
sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is
equivalent to using manhattan_distance (l1), and euclidean_distance
(l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

KNN Application