T - Type of object being clustered

public class FeatureVectorKMeans<T extends FeatureVector> extends Object implements SpatialClusterer<FeatureVectorCentroidsResult<T>,T>
ObjectNearestNeighbours
; for
example, exact K-Means can be achieved using an
ObjectNearestNeighboursExact
. The specific choice of
nearest-neighbour object is controlled through the
NearestNeighboursFactory
provided to the KMeansConfiguration
used to construct instances of this class. The choice of
ObjectNearestNeighbours
affects the speed of clustering; using
approximate nearest-neighbours algorithms for the K-Means can produce
comparable results to the exact K-Means algorithm in much shorter time. The
choice and configuration of ObjectNearestNeighbours
can also control
the type of distance function being used in the clustering.
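To illustrate how the nearest-neighbour object determines the distance function used for centroid assignment, here is a minimal sketch in plain Java. It does not use the library: `ExactNearestSketch`, `nearest`, `squaredEuclidean` and `cityBlock` are hypothetical names introduced only for this example, and the brute-force search stands in for what an exact nearest-neighbour implementation might do.

```java
import java.util.function.ToDoubleBiFunction;

// Hypothetical stand-in for an exact nearest-neighbour search; not the library's API.
final class ExactNearestSketch {

    /** Index of the centroid closest to the query under the supplied distance. */
    static int nearest(double[][] centroids, double[] query,
                       ToDoubleBiFunction<double[], double[]> distance) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.length; i++) {
            double d = distance.applyAsDouble(centroids[i], query);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    /** Squared Euclidean distance (same ordering as Euclidean distance). */
    static double squaredEuclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** City-block (L1) distance. */
    static double cityBlock(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] centroids = { {0, 0}, {5, 2} };
        double[] query = {2, 2};
        // Squared Euclidean: 8 to the first centroid, 9 to the second.
        System.out.println(nearest(centroids, query, ExactNearestSketch::squaredEuclidean)); // prints 0
        // City-block: 4 to the first centroid, 3 to the second.
        System.out.println(nearest(centroids, query, ExactNearestSketch::cityBlock)); // prints 1
    }
}
```

The same query point is assigned to different centroids depending on the distance supplied, which is why the choice of nearest-neighbour configuration can change the clustering and not only its speed.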
The algorithm is implemented as follows: Clustering is initiated using a
FeatureVectorKMeansInit
and is iterative. In each round, batches of samples
are assigned to centroids in parallel. The centroid assignment is performed
using the pre-configured ObjectNearestNeighbours
instances created
from the KMeansConfiguration
. Once all samples are assigned, new
centroids are calculated and the next round starts. Data point pushing is
performed using the same techniques as center point assignment.
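The round structure described above (assign samples to centroids, then recompute the centroids) can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the class's actual code: `KMeansRoundSketch` and its methods are hypothetical, squared Euclidean distance is assumed, and the assignment is parallelised per sample with a parallel stream, not per batch.

```java
import java.util.stream.IntStream;

// Hypothetical illustration of a single K-Means round; not the library's implementation.
final class KMeansRoundSketch {

    /** Index of the nearest centroid under squared Euclidean distance (an assumption). */
    static int nearest(double[][] centroids, double[] x) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double sum = 0;
            for (int d = 0; d < x.length; d++) {
                double diff = centroids[c][d] - x[d];
                sum += diff * diff;
            }
            if (sum < bestDistance) {
                bestDistance = sum;
                best = c;
            }
        }
        return best;
    }

    /** Assigns every sample in parallel, then updates the centroids in place. */
    static int[] round(double[][] samples, double[][] centroids) {
        // Assignment step: centroids are only read here, so this is safe to parallelise.
        int[] assignment = IntStream.range(0, samples.length)
                .parallel()
                .map(i -> nearest(centroids, samples[i]))
                .toArray();

        // Update step: accumulate per-cluster sums and counts, then take the means.
        int k = centroids.length;
        int dim = centroids[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (int i = 0; i < samples.length; i++) {
            int c = assignment[i];
            counts[c]++;
            for (int d = 0; d < dim; d++) {
                sums[c][d] += samples[i][d];
            }
        }
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) {
                continue; // an empty cluster keeps its previous centroid in this sketch
            }
            for (int d = 0; d < dim; d++) {
                centroids[c][d] = sums[c][d] / counts[c];
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] samples = { {0, 0}, {0, 2}, {10, 0}, {10, 2} };
        double[][] centroids = { {1, 1}, {9, 1} };
        int[] assignment = round(samples, centroids);
        System.out.println(java.util.Arrays.toString(assignment));   // prints [0, 0, 1, 1]
        System.out.println(java.util.Arrays.deepToString(centroids)); // prints [[0.0, 1.0], [10.0, 1.0]]
    }
}
```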
This implementation is able to deal with larger-than-memory datasets by
streaming the samples from disk using an appropriate DataSource
. The
only requirement is that there is enough memory to hold all the centroids
plus working memory for the batches of samples being assigned.
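The memory requirement stated above can be made concrete with a small sketch: only the centroids, their running sums and counts, and one batch of samples are held at any time. The code is plain Java and purely illustrative. `StreamingRoundSketch` is a hypothetical name, an `Iterator<double[]>` stands in for a disk-backed data source, and squared Euclidean distance is assumed.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration of one round over streamed batches; not the library's implementation.
final class StreamingRoundSketch {

    static int nearest(double[][] centroids, double[] x) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double sum = 0;
            for (int d = 0; d < x.length; d++) {
                double diff = centroids[c][d] - x[d];
                sum += diff * diff;
            }
            if (sum < bestDistance) {
                bestDistance = sum;
                best = c;
            }
        }
        return best;
    }

    /** Reads the source one batch at a time and returns the updated centroids. */
    static double[][] round(Iterator<double[]> source, double[][] centroids, int batchSize) {
        int k = centroids.length;
        int dim = centroids[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];

        List<double[]> batch = new ArrayList<>(batchSize);
        while (source.hasNext()) {
            batch.add(source.next());
            // Process the batch when it is full or when the source is exhausted.
            if (batch.size() == batchSize || !source.hasNext()) {
                for (double[] x : batch) {
                    int c = nearest(centroids, x);
                    counts[c]++;
                    for (int d = 0; d < dim; d++) {
                        sums[c][d] += x[d];
                    }
                }
                batch.clear(); // the processed samples no longer need to be held
            }
        }

        double[][] updated = new double[k][dim];
        for (int c = 0; c < k; c++) {
            for (int d = 0; d < dim; d++) {
                // An empty cluster keeps its previous centroid in this sketch.
                updated[c][d] = counts[c] > 0 ? sums[c][d] / counts[c] : centroids[c][d];
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        List<double[]> data = new ArrayList<>();
        data.add(new double[] {0, 0});
        data.add(new double[] {0, 2});
        data.add(new double[] {10, 0});
        data.add(new double[] {10, 2});
        double[][] updated = round(data.iterator(), new double[][] { {1, 1}, {9, 1} }, 3);
        System.out.println(java.util.Arrays.deepToString(updated)); // prints [[0.0, 1.0], [10.0, 1.0]]
    }
}
```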
Modifier and Type | Class and Description |
---|---|
static class |
FeatureVectorKMeans.Result<T extends FeatureVector>
Result object for FeatureVectorKMeans, extending
FeatureVectorCentroidsResult and ObjectNearestNeighboursProvider, as well
as giving access to state information from the operation of the K-Means
algorithm (i.e.
|
Modifier | Constructor and Description |
---|---|
protected |
FeatureVectorKMeans()
A completely default
FeatureVectorKMeans used primarily as a convenience
function for reading. |
|
FeatureVectorKMeans(KMeansConfiguration<ObjectNearestNeighbours<T>,T> conf)
Construct the clusterer with the given configuration.
|
Modifier and Type | Method and Description |
---|---|
FeatureVectorCentroidsResult<T> |
cluster(DataSource<T> ds)
Perform clustering with data from a data source.
|
protected void |
cluster(DataSource<T> data,
FeatureVectorKMeans.Result<T> result)
Main clustering algorithm.
|
protected FeatureVectorKMeans.Result<T> |
cluster(DataSource<T> data,
int K)
Initiate clustering with the given data and number of clusters.
|
FeatureVectorKMeans.Result<T> |
cluster(List<T> data)
Perform clustering on the given data.
|
FeatureVectorKMeans.Result<T> |
cluster(T[] data)
Perform clustering on the given data.
|
void |
cluster(T[] data,
FeatureVectorKMeans.Result<T> result)
Main clustering algorithm.
|
static <T extends FeatureVector> FeatureVectorKMeans<T> |
createExact(int K,
DistanceComparator<? super T> distance)
Convenience method to quickly create an exact
FeatureVectorKMeans. |
static <T extends FeatureVector> FeatureVectorKMeans<T> |
createExact(int K,
DistanceComparator<? super T> distance,
int niters)
Convenience method to quickly create an exact
FeatureVectorKMeans. |
KMeansConfiguration<ObjectNearestNeighbours<T>,T> |
getConfiguration()
Get the configuration
|
FeatureVectorKMeansInit<T> |
getInit()
Get the current initialisation algorithm
|
int[][] |
performClustering(List<T> data)
Perform clustering on the given data.
|
int[][] |
performClustering(T[] data) |
void |
seed(long seed)
Set the seed for the internal random number generator.
|
void |
setConfiguration(KMeansConfiguration<ObjectNearestNeighbours<T>,T> conf)
Set the configuration
|
void |
setInit(FeatureVectorKMeansInit<T> init)
Set the current initialisation algorithm
|
String |
toString() |
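
public FeatureVectorKMeans(KMeansConfiguration<ObjectNearestNeighbours<T>,T> conf)
Construct the clusterer with the given configuration.
conf - The configuration.

protected FeatureVectorKMeans()
A completely default FeatureVectorKMeans used primarily as a convenience function for reading.

public FeatureVectorKMeansInit<T> getInit()
Get the current initialisation algorithm
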
public void setInit(FeatureVectorKMeansInit<T> init)
Set the current initialisation algorithm
init - the init algorithm to be used

public void seed(long seed)
Set the seed for the internal random number generator.
seed - the random seed for init random sample selection, no seed if seed < -1

public FeatureVectorKMeans.Result<T> cluster(List<T> data)
Perform clustering on the given data.
data - the data.

public FeatureVectorKMeans.Result<T> cluster(T[] data)
Perform clustering on the given data.
Specified by: cluster in interface SpatialClusterer<FeatureVectorCentroidsResult<T extends FeatureVector>,T extends FeatureVector>
data - the data.

public int[][] performClustering(T[] data)
Specified by: performClustering in interface Clusterer<T extends FeatureVector[]>

public int[][] performClustering(List<T> data)
Perform clustering on the given data.
data - the data.

protected FeatureVectorKMeans.Result<T> cluster(DataSource<T> data, int K) throws Exception
Initiate clustering with the given data and number of clusters; see #cluster(DataSource, Object).
data - data source to cluster with
K - number of clusters to find
Throws: Exception

public void cluster(T[] data, FeatureVectorKMeans.Result<T> result) throws InterruptedException
Main clustering algorithm. This method works from the centroids already held in the result object and as such ignores the init object. In normal operation you should call one of the other cluster methods instead of this one. However, if you wish to resume clustering iterations from a result that you've already generated, this is the method to use.
data - the data to be clustered
result - the results object to be populated
Throws: InterruptedException - if interrupted while waiting, in which case unfinished tasks are cancelled.

protected void cluster(DataSource<T> data, FeatureVectorKMeans.Result<T> result) throws InterruptedException
Main clustering algorithm. This method works from the centroids already held in the result object and as such ignores the init object. In normal operation you should call one of the other cluster methods instead of this one. However, if you wish to resume clustering iterations from a result that you've already generated, this is the method to use.
data - the data to be clustered
result - the results object to be populated
Throws: InterruptedException - if interrupted while waiting, in which case unfinished tasks are cancelled.

public FeatureVectorCentroidsResult<T> cluster(DataSource<T> ds)
Perform clustering with data from a data source. The DataSource could potentially be backed by disk rather than held in memory.
Specified by: cluster in interface SpatialClusterer<FeatureVectorCentroidsResult<T extends FeatureVector>,T extends FeatureVector>
ds - the data.

public KMeansConfiguration<ObjectNearestNeighbours<T>,T> getConfiguration()
Get the configuration

public void setConfiguration(KMeansConfiguration<ObjectNearestNeighbours<T>,T> conf)
Set the configuration
conf - the configuration to set

public static <T extends FeatureVector> FeatureVectorKMeans<T> createExact(int K, DistanceComparator<? super T> distance)
Convenience method to quickly create an exact FeatureVectorKMeans. All parameters other than the number of clusters are set at their defaults, but can be manipulated through the configuration returned by getConfiguration().
The supplied distance comparator is used to measure the distance between points.
K - the number of clusters
distance - the distance measure
Returns: a FeatureVectorKMeans instance configured for exact k-means

public static <T extends FeatureVector> FeatureVectorKMeans<T> createExact(int K, DistanceComparator<? super T> distance, int niters)
Convenience method to quickly create an exact FeatureVectorKMeans. All parameters other than the number of clusters and number of iterations are set at their defaults, but can be manipulated through the configuration returned by getConfiguration().
The supplied distance comparator is used to measure the distance between points.
K - the number of clusters
distance - the distance measure
niters - maximum number of iterations
Returns: a FeatureVectorKMeans instance configured for exact k-means
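
The interplay between a maximum number of iterations and resuming from an existing result, as described for the `cluster(data, result)` overloads, can be illustrated with a plain-Java sketch. Everything here is hypothetical (`ResumeSketch`, `iterate`) and squared Euclidean distance is assumed. The point is only that iterating from centroids that already exist skips initialisation, so two runs of one iteration each arrive at the same centroids as a single run of two iterations.

```java
import java.util.Arrays;

// Hypothetical illustration of resuming K-Means iterations; not the library's implementation.
final class ResumeSketch {

    /** Runs at most niters rounds, updating the given centroids in place. */
    static void iterate(double[][] samples, double[][] centroids, int niters) {
        int k = centroids.length;
        int dim = centroids[0].length;
        for (int it = 0; it < niters; it++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] x : samples) {
                // Find the nearest centroid under squared Euclidean distance (an assumption).
                int best = 0;
                double bestDistance = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double sum = 0;
                    for (int d = 0; d < dim; d++) {
                        double diff = centroids[c][d] - x[d];
                        sum += diff * diff;
                    }
                    if (sum < bestDistance) {
                        bestDistance = sum;
                        best = c;
                    }
                }
                counts[best]++;
                for (int d = 0; d < dim; d++) {
                    sums[best][d] += x[d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid in this sketch
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
    }

    public static void main(String[] args) {
        double[][] samples = { {0}, {2}, {6}, {10} };

        // One run of two iterations.
        double[][] oneShot = { {0}, {3} };
        iterate(samples, oneShot, 2);

        // One iteration, then a resumed second iteration from the same centroids.
        double[][] resumed = { {0}, {3} };
        iterate(samples, resumed, 1);
        iterate(samples, resumed, 1);

        System.out.println(Arrays.deepToString(oneShot)); // prints [[1.0], [8.0]]
        System.out.println(Arrays.deepEquals(oneShot, resumed)); // prints true
    }
}
```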