FeatureVectorKMeans (OpenIMAJ master project 1.3.10 API)

java.lang.Object
- org.openimaj.ml.clustering.kmeans.FeatureVectorKMeans<T>

Type Parameters:

T - Type of object being clustered

All Implemented Interfaces:

Clusterer<T[]>, SpatialClusterer<FeatureVectorCentroidsResult<T>,T>
```
public class FeatureVectorKMeans<T extends FeatureVector>
extends Object
implements SpatialClusterer<FeatureVectorCentroidsResult<T>,T>
```
Fast, parallel implementation of the K-Means algorithm with support for larger-than-memory data. Various flavors of K-Means are supported through the selection of different subclasses of ObjectNearestNeighbours; for example, exact K-Means can be achieved using an ObjectNearestNeighboursExact. The specific choice of nearest-neighbour object is controlled through the NearestNeighboursFactory provided to the KMeansConfiguration used to construct instances of this class. The choice of ObjectNearestNeighbours affects the speed of clustering: using approximate nearest-neighbour algorithms for the K-Means can produce results comparable to the exact K-Means algorithm in much less time. The choice and configuration of ObjectNearestNeighbours can also control the type of distance function used in the clustering.
The algorithm is implemented as follows: Clustering is initiated using a FeatureVectorKMeansInit and is iterative. In each round, batches of samples are assigned to centroids in parallel. The centroid assignment is performed using the pre-configured ObjectNearestNeighbours instances created from the KMeansConfiguration. Once all samples are assigned, new centroids are calculated and the next round is started. Data point pushing is performed using the same techniques as center point assignment.
This implementation is able to deal with larger-than-memory datasets by streaming the samples from disk using an appropriate DataSource. The only requirement is that there is enough memory to hold all the centroids plus working memory for the batches of samples being assigned.
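The round structure described above can be sketched with plain JDK types. This is a minimal illustration, not the OpenIMAJ implementation: the class name `StreamingKMeansSketch`, the use of `double[]` samples, the squared Euclidean distance, and the `Iterator<double[][]>` standing in for a disk-backed DataSource are all assumptions made for the example. It shows that one round needs only the centroids, the per-centroid accumulators, and the current batch in memory.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Illustrative only: a plain-JDK sketch of one batched K-Means round. */
final class StreamingKMeansSketch {

    /** Index of the centroid closest to the sample (squared Euclidean distance). */
    static int nearest(double[][] centroids, double[] sample) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < sample.length; i++) {
                double diff = centroids[c][i] - sample[i];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    /**
     * One round: stream the batches, accumulate per-centroid sums and counts,
     * then recompute each centroid as the mean of its assigned samples.
     */
    static double[][] round(Iterator<double[][]> batches, double[][] centroids) {
        int k = centroids.length;
        int dims = centroids[0].length;
        double[][] sums = new double[k][dims];
        int[] counts = new int[k];
        while (batches.hasNext()) {
            for (double[] sample : batches.next()) {
                int c = nearest(centroids, sample);
                counts[c]++;
                for (int i = 0; i < dims; i++) {
                    sums[c][i] += sample[i];
                }
            }
        }
        double[][] updated = new double[k][];
        for (int c = 0; c < k; c++) {
            updated[c] = centroids[c].clone(); // an empty cluster keeps its old centroid
            if (counts[c] > 0) {
                for (int i = 0; i < dims; i++) {
                    updated[c][i] = sums[c][i] / counts[c];
                }
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        Iterator<double[][]> batches = Arrays.asList(
                new double[][] {{0, 0}, {0, 2}},
                new double[][] {{10, 10}, {10, 12}}).iterator();
        double[][] centroids = {{1, 1}, {9, 9}};
        // prints [[0.0, 1.0], [10.0, 11.0]]
        System.out.println(Arrays.deepToString(round(batches, centroids)));
    }
}
```

Repeating `round` with a fresh iterator over the same batches gives the iterative behaviour; how empty clusters are handled here is a choice of the sketch only.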

Author:

Jonathon Hare (jsh2@ecs.soton.ac.uk), Sina Samangooei (ss@ecs.soton.ac.uk)

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`FeatureVectorKMeans.Result<T extends FeatureVector>` Result object for FeatureVectorKMeans, extending FeatureVectorCentroidsResult and ObjectNearestNeighboursProvider, as well as giving access to state information from the operation of the K-Means algorithm (i.e.

Constructor Summary

Constructors
Modifier	Constructor and Description
`protected`	`FeatureVectorKMeans()` A completely default `FeatureVectorKMeans` used primarily as a convenience function for reading.
	`FeatureVectorKMeans(KMeansConfiguration<ObjectNearestNeighbours<T>,T> conf)` Construct the clusterer with the given configuration.

Method Summary

Modifier and Type	Method and Description
`FeatureVectorCentroidsResult<T>`	`cluster(DataSource<T> ds)` Perform clustering with data from a data source.
`protected void`	`cluster(DataSource<T> data, FeatureVectorKMeans.Result<T> result)` Main clustering algorithm.
`protected FeatureVectorKMeans.Result<T>`	`cluster(DataSource<T> data, int K)` Initiate clustering with the given data and number of clusters.
`FeatureVectorKMeans.Result<T>`	`cluster(List<T> data)` Perform clustering on the given data.
`FeatureVectorKMeans.Result<T>`	`cluster(T[] data)` Perform clustering on the given data.
`void`	`cluster(T[] data, FeatureVectorKMeans.Result<T> result)` Main clustering algorithm.
`static <T extends FeatureVector> FeatureVectorKMeans<T>`	`createExact(int K, DistanceComparator<? super T> distance)` Convenience method to quickly create an exact `FeatureVectorKMeans`.
`static <T extends FeatureVector> FeatureVectorKMeans<T>`	`createExact(int K, DistanceComparator<? super T> distance, int niters)` Convenience method to quickly create an exact `FeatureVectorKMeans`.
`KMeansConfiguration<ObjectNearestNeighbours<T>,T>`	`getConfiguration()` Get the configuration
`FeatureVectorKMeansInit<T>`	`getInit()` Get the current initialisation algorithm
`int[][]`	`performClustering(List<T> data)` Perform clustering on the given data.
`int[][]`	`performClustering(T[] data)`
`void`	`seed(long seed)` Set the seed for the internal random number generator.
`void`	`setConfiguration(KMeansConfiguration<ObjectNearestNeighbours<T>,T> conf)` Set the configuration
`void`	`setInit(FeatureVectorKMeansInit<T> init)` Set the current initialisation algorithm
`String`	`toString()`

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

- Constructor Detail
  - FeatureVectorKMeans
```
public FeatureVectorKMeans(KMeansConfiguration<ObjectNearestNeighbours<T>,T> conf)
```
    Construct the clusterer with the given configuration.
    
    Parameters:
    
    conf - The configuration.
  - FeatureVectorKMeans
```
protected FeatureVectorKMeans()
```
    A completely default FeatureVectorKMeans used primarily as a convenience function for reading.
- Method Detail
  - getInit
```
public FeatureVectorKMeansInit<T> getInit()
```
    Get the current initialisation algorithm
    
    Returns:
    
    the init algorithm being used
  - setInit
```
public void setInit(FeatureVectorKMeansInit<T> init)
```
    Set the current initialisation algorithm
    
    Parameters:
    
    init - the init algorithm to be used
  - seed
```
public void seed(long seed)
```
    Set the seed for the internal random number generator.
    
    Parameters:
    
    seed - the random seed for init random sample selection, no seed if seed < -1
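    To illustrate why a fixed seed matters for initialisation, the sketch below picks K distinct sample indices with a seeded `java.util.Random`. It is an assumption-laden illustration, not the library's initialiser: the class `SeededInitSketch` and the method `pickInitialIndices` are hypothetical, and the sketch always seeds the generator rather than modelling the "no seed" case described above.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative only: seeded selection of initial sample indices. */
final class SeededInitSketch {

    /** Picks k distinct indices in [0, nSamples); the same seed gives the same selection. */
    static int[] pickInitialIndices(int nSamples, int k, long seed) {
        if (k > nSamples) {
            throw new IllegalArgumentException("k must not exceed nSamples");
        }
        Random rng = new Random(seed);
        int[] indices = new int[nSamples];
        for (int i = 0; i < nSamples; i++) {
            indices[i] = i;
        }
        // Partial Fisher-Yates shuffle: the first k slots end up holding
        // k distinct indices chosen uniformly at random.
        for (int i = 0; i < k; i++) {
            int j = i + rng.nextInt(nSamples - i);
            int tmp = indices[i];
            indices[i] = indices[j];
            indices[j] = tmp;
        }
        return Arrays.copyOf(indices, k);
    }

    public static void main(String[] args) {
        int[] first = pickInitialIndices(100, 5, 42L);
        int[] second = pickInitialIndices(100, 5, 42L);
        // Both runs use seed 42, so the two selections are identical.
        System.out.println(Arrays.equals(first, second));
    }
}
```

    Fixing the seed in this way makes a clustering run repeatable, which is useful when comparing configurations.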
  - cluster
```
public FeatureVectorKMeans.Result<T> cluster(List<T> data)
```
    Perform clustering on the given data.
    
    Parameters:
    
    data - the data.
    
    Returns:
    
    the generated clusters.
  - cluster
```
public FeatureVectorKMeans.Result<T> cluster(T[] data)
```
    Description copied from interface: SpatialClusterer
    
    Perform clustering on the given data.
    
    Specified by:
    
    cluster in interface SpatialClusterer<FeatureVectorCentroidsResult<T extends FeatureVector>,T extends FeatureVector>
    
    Parameters:
    
    data - the data.
    
    Returns:
    
    the generated clusters.
  - performClustering
```
public int[][] performClustering(T[] data)
```
    Specified by:
    
    performClustering in interface Clusterer<T extends FeatureVector[]>
    
    Returns:
    
    the data items clustered by index
  - performClustering
```
public int[][] performClustering(List<T> data)
```
    Perform clustering on the given data.
    
    Parameters:
    
    data - the data.
    
    Returns:
    
    the generated clusters.
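    The `int[][]` return type can be read as one array of data-item indices per cluster. The sketch below shows how per-item assignments could be grouped into that shape; it assumes this interpretation of the return value, and the class `IndexClustersSketch` and method `groupByCluster` are hypothetical names used only for illustration.

```java
import java.util.Arrays;

/** Illustrative only: grouping per-item cluster assignments into index arrays. */
final class IndexClustersSketch {

    /** assignments[i] is the cluster of item i; returns one array of item indices per cluster. */
    static int[][] groupByCluster(int[] assignments, int k) {
        int[] counts = new int[k];
        for (int a : assignments) {
            counts[a]++;
        }
        int[][] clusters = new int[k][];
        for (int c = 0; c < k; c++) {
            clusters[c] = new int[counts[c]];
        }
        int[] next = new int[k]; // next free slot in each cluster's array
        for (int i = 0; i < assignments.length; i++) {
            int c = assignments[i];
            clusters[c][next[c]++] = i;
        }
        return clusters;
    }

    public static void main(String[] args) {
        int[] assignments = {0, 1, 0, 2, 1};
        // prints [[0, 2], [1, 4], [3]]
        System.out.println(Arrays.deepToString(groupByCluster(assignments, 3)));
    }
}
```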
  - cluster
```
protected FeatureVectorKMeans.Result<T> cluster(DataSource<T> data,
                                                int K)
                                         throws Exception
```
    Initiate clustering with the given data and number of clusters. Internally this method constructs the array to hold the centroids and calls cluster(DataSource, FeatureVectorKMeans.Result).
    
    Parameters:
    
    data - data source to cluster with
    
    K - number of clusters to find
    
    Returns:
    
    cluster centroids
    
    Throws:
    
    Exception
  - cluster
```
public void cluster(T[] data,
                    FeatureVectorKMeans.Result<T> result)
             throws InterruptedException
```
    Main clustering algorithm. The specified number of threads is started, each containing an assignment job and a reference to the same ObjectNearestNeighbours object (e.g. Exact or KDTree). Each thread is added to a job pool and started in parallel. A single accumulator is shared between all threads and locked on update.
    This method expects that the initial centroids have already been set in the result object, and as such ignores the init object. In normal operation you should call one of the other cluster methods instead of this one. However, if you wish to resume clustering iterations from a result that you have already generated, this is the method to use.
    
    Parameters:
    
    data - the data to be clustered
    
    result - the results object to be populated
    
    Throws:
    
    InterruptedException - if interrupted while waiting, in which case unfinished tasks are cancelled.
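    The threading scheme described for this method (parallel assignment jobs sharing one accumulator that is locked on update) can be sketched with the JDK's `ExecutorService`. This is a minimal illustration under stated assumptions, not the library's code: the class `ParallelAssignSketch`, its `Accumulator`, the `double[]` samples, and the squared Euclidean distance are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative only: parallel assignment jobs with one shared, locked accumulator. */
final class ParallelAssignSketch {

    /** Per-centroid sums and counts; every update takes the accumulator's lock. */
    static final class Accumulator {
        final double[][] sums;
        final int[] counts;

        Accumulator(int k, int dims) {
            sums = new double[k][dims];
            counts = new int[k];
        }

        synchronized void add(int centroid, double[] sample) {
            counts[centroid]++;
            for (int i = 0; i < sample.length; i++) {
                sums[centroid][i] += sample[i];
            }
        }
    }

    static int nearest(double[][] centroids, double[] sample) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < sample.length; i++) {
                double diff = centroids[c][i] - sample[i];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    /** Splits the data into batches, runs one assignment job per batch, and waits for all jobs. */
    static Accumulator assign(double[][] data, double[][] centroids, int nThreads, int batchSize)
            throws InterruptedException, ExecutionException {
        Accumulator acc = new Accumulator(centroids.length, centroids[0].length);
        List<Callable<Void>> jobs = new ArrayList<>();
        for (int start = 0; start < data.length; start += batchSize) {
            final int from = start;
            final int to = Math.min(start + batchSize, data.length);
            jobs.add(() -> {
                for (int i = from; i < to; i++) {
                    acc.add(nearest(centroids, data[i]), data[i]);
                }
                return null;
            });
        }
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            // invokeAll blocks until every job is done; get() surfaces job failures.
            for (Future<Void> f : pool.invokeAll(jobs)) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
        return acc;
    }

    public static void main(String[] args) throws Exception {
        double[][] data = {{0, 0}, {0, 2}, {10, 10}, {10, 12}};
        double[][] centroids = {{1, 1}, {9, 9}};
        Accumulator acc = assign(data, centroids, 2, 1);
        System.out.println(acc.counts[0] + " " + acc.counts[1]); // prints 2 2
    }
}
```

    Because the centroids are passed in rather than initialised inside `assign`, calling it again with centroids from an earlier run mirrors the "resume from an existing result" use described above.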
  - cluster
```
protected void cluster(DataSource<T> data,
                       FeatureVectorKMeans.Result<T> result)
                throws InterruptedException
```
    Main clustering algorithm. The specified number of threads is started, each containing an assignment job and a reference to the same ObjectNearestNeighbours object (e.g. Exact or KDTree). Each thread is added to a job pool and started in parallel. A single accumulator is shared between all threads and locked on update.
    This method expects that the initial centroids have already been set in the result object, and as such ignores the init object. In normal operation you should call one of the other cluster methods instead of this one. However, if you wish to resume clustering iterations from a result that you have already generated, this is the method to use.
    
    Parameters:
    
    data - the data to be clustered
    
    result - the results object to be populated
    
    Throws:
    
    InterruptedException - if interrupted while waiting, in which case unfinished tasks are cancelled.
  - cluster
```
public FeatureVectorCentroidsResult<T> cluster(DataSource<T> ds)
```
    Description copied from interface: SpatialClusterer
    
    Perform clustering with data from a data source. The DataSource could potentially be backed by disk rather than held in memory.
    
    Specified by:
    
    cluster in interface SpatialClusterer<FeatureVectorCentroidsResult<T extends FeatureVector>,T extends FeatureVector>
    
    Parameters:
    
    ds - the data.
    
    Returns:
    
    the generated clusters.
  - getConfiguration
```
public KMeansConfiguration<ObjectNearestNeighbours<T>,T> getConfiguration()
```
    Get the configuration
    
    Returns:
    
    the configuration
  - setConfiguration
```
public void setConfiguration(KMeansConfiguration<ObjectNearestNeighbours<T>,T> conf)
```
    Set the configuration
    
    Parameters:
    
    conf - the configuration to set
  - createExact
```
public static <T extends FeatureVector> FeatureVectorKMeans<T> createExact(int K,
                                                                           DistanceComparator<? super T> distance)
```
    Convenience method to quickly create an exact FeatureVectorKMeans. All parameters other than the number of clusters and the distance measure are set at their defaults, but can be manipulated through the configuration returned by getConfiguration().
    The supplied distance comparator is used to measure the distance between points.
    
    Parameters:
    
    K - the number of clusters
    
    distance - the distance measure
    
    Returns:
    
    a FeatureVectorKMeans instance configured for exact k-means
  - createExact
```
public static <T extends FeatureVector> FeatureVectorKMeans<T> createExact(int K,
                                                                           DistanceComparator<? super T> distance,
                                                                           int niters)
```
    Convenience method to quickly create an exact FeatureVectorKMeans. All parameters other than the number of clusters, the distance measure and the number of iterations are set at their defaults, but can be manipulated through the configuration returned by getConfiguration().
    The supplied distance comparator is used to measure the distance between points.
    
    Parameters:
    
    K - the number of clusters
    
    distance - the distance measure
    
    niters - maximum number of iterations
    
    Returns:
    
    a FeatureVectorKMeans instance configured for exact k-means
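    The effect of an iteration cap such as niters can be shown with a small one-dimensional example in plain Java. This is only a sketch: the class `MaxIterationsSketch` is hypothetical, it uses absolute difference as the distance, and the early stop when no assignment changes is an assumption of the sketch, not a documented behaviour of this class.

```java
import java.util.Arrays;

/** Illustrative only: 1-D K-Means with a cap on the number of rounds. */
final class MaxIterationsSketch {

    /** Updates centroids in place; returns the number of rounds actually run (at most niters). */
    static int run(double[] data, double[] centroids, int niters) {
        int[] assignments = new int[data.length];
        Arrays.fill(assignments, -1);
        int rounds = 0;
        while (rounds < niters) {
            rounds++;
            boolean changed = false;
            double[] sums = new double[centroids.length];
            int[] counts = new int[centroids.length];
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(data[i] - centroids[c]) < Math.abs(data[i] - centroids[best])) {
                        best = c;
                    }
                }
                if (assignments[i] != best) {
                    assignments[i] = best;
                    changed = true;
                }
                sums[best] += data[i];
                counts[best]++;
            }
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] > 0) {
                    centroids[c] = sums[c] / counts[c];
                }
            }
            if (!changed) {
                break; // assumption of this sketch: stop once assignments are stable
            }
        }
        return rounds;
    }

    public static void main(String[] args) {
        double[] data = {0, 1, 10, 11};
        double[] centroids = {0, 1};
        int rounds = run(data, centroids, 30);
        // prints 3 [0.5, 10.5]
        System.out.println(rounds + " " + Arrays.toString(centroids));
    }
}
```

    With niters set to 1 the same data stops after a single round with the second centroid still far from 10.5, which is the trade-off a low iteration limit makes.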
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class Object
