public class ByteKMeans extends Object implements SpatialClusterer<ByteCentroidsResult,byte[]>
Nearest-neighbour lookup is delegated to a ByteNearestNeighbours
implementation; for example, approximate K-Means can be achieved using a
ByteNearestNeighboursKDTree, whilst exact K-Means can be achieved using a
ByteNearestNeighboursExact. The specific choice of nearest-neighbour object
is controlled through the NearestNeighboursFactory provided to the
KMeansConfiguration used to construct instances of this class. The choice of
ByteNearestNeighbours affects the speed of clustering; using approximate
nearest-neighbour algorithms for the K-Means can produce results comparable
to the exact K-Means algorithm in much less time. The choice and
configuration of ByteNearestNeighbours can also control the type of distance
function being used in the clustering.
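To make the role of the nearest-neighbour object concrete, here is a minimal, self-contained sketch of an exact (brute-force) nearest-centroid search over byte vectors with a swappable distance function. The class and method names (`ExactNearestCentroidSketch`, `nearest`, `squaredEuclidean`, `cityBlock`) are hypothetical and are not part of the library; the sketch only illustrates why the choice of search object determines both cost and metric.

```java
import java.util.function.ToDoubleBiFunction;

/** Hypothetical stand-in for an exact nearest-neighbour search; not library code. */
class ExactNearestCentroidSketch {

    /** Squared Euclidean distance between two equal-length byte vectors (signed bytes). */
    static double squaredEuclidean(byte[] a, byte[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            int d = a[i] - b[i]; // bytes are promoted to int before subtracting
            sum += d * d;
        }
        return sum;
    }

    /** City-block (L1) distance, included only to show that the metric is swappable. */
    static double cityBlock(byte[] a, byte[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    /** Brute-force search: returns the index of the centroid closest to the query. */
    static int nearest(byte[][] centroids, byte[] query,
                       ToDoubleBiFunction<byte[], byte[]> distance) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = distance.applyAsDouble(centroids[c], query);
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        byte[][] centroids = { {5, 0}, {3, 3} };
        byte[] query = {0, 0};
        System.out.println(nearest(centroids, query, ExactNearestCentroidSketch::squaredEuclidean));
        System.out.println(nearest(centroids, query, ExactNearestCentroidSketch::cityBlock));
    }
}
```

In this toy example the two metrics select different centroids for the same query, which is why the distance function is effectively part of the clustering configuration. An approximate search would replace the linear scan in `nearest` with a tree lookup that inspects only some of the centroids.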
The algorithm is implemented as follows: clustering is initiated using a
ByteKMeansInit and is iterative. In each round, batches of samples are
assigned to centroids in parallel. The centroid assignment is performed
using the pre-configured ByteNearestNeighbours instances created from the
KMeansConfiguration. Once all samples are assigned, new centroids are
calculated and the next round is started. Data point pushing is performed
using the same techniques as center point assignment.
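The round structure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation: it parallelises per sample rather than per batch, uses a brute-force squared-Euclidean search, treats bytes as signed values, and leaves empty clusters unchanged. The class name `KMeansRoundSketch` and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** Hypothetical sketch of a single K-Means round over in-memory byte vectors. */
class KMeansRoundSketch {

    /** Index of the centroid with the smallest squared Euclidean distance to the sample. */
    static int nearest(byte[][] centroids, byte[] sample) {
        int best = 0;
        long bestDist = Long.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            long dist = 0;
            for (int d = 0; d < sample.length; d++) {
                int diff = centroids[c][d] - sample[d];
                dist += (long) diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Runs one round in place on the centroids and returns the sample assignments. */
    static int[] round(byte[][] data, byte[][] centroids) {
        int[] assignments = new int[data.length];
        // Assignment step: samples are independent, so they can be processed in parallel.
        IntStream.range(0, data.length).parallel()
                 .forEach(i -> assignments[i] = nearest(centroids, data[i]));

        // Update step: each centroid becomes the mean of the samples assigned to it.
        int dims = centroids[0].length;
        long[][] sums = new long[centroids.length][dims];
        int[] counts = new int[centroids.length];
        for (int i = 0; i < data.length; i++) {
            counts[assignments[i]]++;
            for (int d = 0; d < dims; d++) {
                sums[assignments[i]][d] += data[i][d];
            }
        }
        for (int c = 0; c < centroids.length; c++) {
            if (counts[c] == 0) continue; // assumption: empty clusters keep their old centroid
            for (int d = 0; d < dims; d++) {
                centroids[c][d] = (byte) Math.round((double) sums[c][d] / counts[c]);
            }
        }
        return assignments;
    }

    public static void main(String[] args) {
        byte[][] data = { {0, 0}, {2, 2}, {10, 10}, {12, 12} };
        byte[][] centroids = { {1, 0}, {9, 9} };
        System.out.println(Arrays.toString(round(data, centroids)));
        System.out.println(Arrays.deepToString(centroids));
    }
}
```

Calling `round` repeatedly until the assignments stop changing, or until an iteration limit is reached, gives the usual iterative scheme.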
This implementation is able to deal with larger-than-memory datasets by
streaming the samples from disk using an appropriate DataSource. The only
requirement is that there is enough memory to hold all the centroids plus
working memory for the batches of samples being assigned.
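The memory requirement stated above can be illustrated with a short sketch in which samples arrive batch by batch and only the centroids, the per-centroid running sums, and the current batch are held in memory. A plain `Iterator<byte[][]>` stands in for a disk-backed source here; it is an assumption made for illustration and is not the library's DataSource interface. The class name `StreamingRoundSketch` is likewise hypothetical.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical sketch of one K-Means round over batches streamed from a source. */
class StreamingRoundSketch {

    /** Index of the centroid with the smallest squared Euclidean distance to the sample. */
    static int nearest(byte[][] centroids, byte[] sample) {
        int best = 0;
        long bestDist = Long.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            long dist = 0;
            for (int d = 0; d < sample.length; d++) {
                int diff = centroids[c][d] - sample[d];
                dist += (long) diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Consumes all batches once, updates the centroids in place, returns the sample count. */
    static long round(Iterator<byte[][]> batches, byte[][] centroids) {
        int dims = centroids[0].length;
        long[][] sums = new long[centroids.length][dims];
        long[] counts = new long[centroids.length];
        long seen = 0;
        while (batches.hasNext()) {
            byte[][] batch = batches.next(); // only this batch is resident at a time
            for (byte[] sample : batch) {
                int c = nearest(centroids, sample);
                counts[c]++;
                for (int d = 0; d < dims; d++) {
                    sums[c][d] += sample[d];
                }
                seen++;
            }
        }
        // Centroids change only after every batch has been assigned against the old ones.
        for (int c = 0; c < centroids.length; c++) {
            if (counts[c] == 0) continue; // assumption: empty clusters keep their old centroid
            for (int d = 0; d < dims; d++) {
                centroids[c][d] = (byte) Math.round((double) sums[c][d] / counts[c]);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Iterator<byte[][]> batches = Arrays.asList(
                new byte[][] { {0, 0}, {2, 2} },
                new byte[][] { {10, 10}, {12, 12} }).iterator();
        byte[][] centroids = { {1, 0}, {9, 9} };
        System.out.println(round(batches, centroids));
        System.out.println(Arrays.deepToString(centroids));
    }
}
```

Because the update needs only sums and counts, memory use is independent of the total number of samples; each further round requires a fresh pass over the source.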
Modifier and Type | Class and Description |
---|---|
static class | ByteKMeans.Result: Result object for ByteKMeans, extending ByteCentroidsResult and ByteNearestNeighboursProvider, as well as giving access to state information from the operation of the K-Means algorithm. |
Modifier | Constructor and Description |
---|---|
protected | ByteKMeans(): A completely default ByteKMeans used primarily as a convenience function for reading. |
public | ByteKMeans(KMeansConfiguration<ByteNearestNeighbours,byte[]> conf): Construct the clusterer with the given configuration. |
Modifier and Type | Method and Description |
---|---|
ByteKMeans.Result | cluster(byte[][] data): Perform clustering on the given data. |
void | cluster(byte[][] data, ByteKMeans.Result result): Main clustering algorithm. |
ByteKMeans.Result | cluster(DataSource<byte[]> ds): Perform clustering with data from a data source. |
void | cluster(DataSource<byte[]> data, ByteKMeans.Result result): Main clustering algorithm. |
protected ByteKMeans.Result | cluster(DataSource<byte[]> data, int K): Initiate clustering with the given data and number of clusters. |
static ByteKMeans | createExact(int K): Convenience method to quickly create an exact ByteKMeans. |
static ByteKMeans | createExact(int K, int niters): Convenience method to quickly create an exact ByteKMeans. |
static ByteKMeans | createKDTreeEnsemble(int K): Convenience method to quickly create an approximate ByteKMeans using an ensemble of KD-Trees to perform nearest-neighbour lookup. |
KMeansConfiguration<ByteNearestNeighbours,byte[]> | getConfiguration(): Get the configuration. |
ByteKMeansInit | getInit(): Get the current initialisation algorithm. |
int[][] | performClustering(byte[][] data) |
protected double | roundDouble(double value) |
protected float | roundFloat(double value) |
protected int | roundInt(double value) |
protected long | roundLong(double value) |
void | seed(long seed): Set the seed for the internal random number generator. |
void | setConfiguration(KMeansConfiguration<ByteNearestNeighbours,byte[]> conf): Set the configuration. |
void | setInit(ByteKMeansInit init): Set the current initialisation algorithm. |
String | toString() |
public ByteKMeans(KMeansConfiguration<ByteNearestNeighbours,byte[]> conf)
Construct the clusterer with the given configuration.
conf - The configuration.

protected ByteKMeans()
A completely default ByteKMeans used primarily as a convenience function for reading.

public ByteKMeansInit getInit()
Get the current initialisation algorithm.

public void setInit(ByteKMeansInit init)
Set the current initialisation algorithm.
init - the init algorithm to be used

public void seed(long seed)
Set the seed for the internal random number generator.
seed - the random seed for init random sample selection, no seed if seed < -1

public ByteKMeans.Result cluster(byte[][] data)
Perform clustering on the given data.
Specified by: cluster in interface SpatialClusterer<ByteCentroidsResult,byte[]>
data - the data.

public int[][] performClustering(byte[][] data)
Specified by: performClustering in interface Clusterer<byte[][]>
protected ByteKMeans.Result cluster(DataSource<byte[]> data, int K) throws Exception
Initiate clustering with the given data and number of clusters.
See also #cluster(DataSource, byte [][]).
data - data source to cluster with
K - number of clusters to find
Throws: Exception
public void cluster(byte[][] data, ByteKMeans.Result result) throws InterruptedException
Main clustering algorithm. This method works from the state already held in
the supplied result object and as such ignores the init object. In normal
operation you should call one of the other cluster methods instead of this
one. However, if you wish to resume clustering iterations from a result that
you've already generated, this is the method to use.
data - the data to be clustered
result - the results object to be populated
Throws: InterruptedException - if interrupted while waiting, in which case unfinished tasks are cancelled.

public void cluster(DataSource<byte[]> data, ByteKMeans.Result result) throws InterruptedException
Main clustering algorithm. This method works from the state already held in
the supplied result object and as such ignores the init object. In normal
operation you should call one of the other cluster methods instead of this
one. However, if you wish to resume clustering iterations from a result that
you've already generated, this is the method to use.
data - the data to be clustered
result - the results object to be populated
Throws: InterruptedException - if interrupted while waiting, in which case unfinished tasks are cancelled.

protected float roundFloat(double value)
protected double roundDouble(double value)
protected long roundLong(double value)
protected int roundInt(double value)
public ByteKMeans.Result cluster(DataSource<byte[]> ds)
Perform clustering with data from a data source. The DataSource could
potentially be backed by disk rather than held in memory.
Specified by: cluster in interface SpatialClusterer<ByteCentroidsResult,byte[]>
ds - the data.

public KMeansConfiguration<ByteNearestNeighbours,byte[]> getConfiguration()
Get the configuration.
public void setConfiguration(KMeansConfiguration<ByteNearestNeighbours,byte[]> conf)
Set the configuration.
conf - the configuration to set

public static ByteKMeans createExact(int K)
Convenience method to quickly create an exact ByteKMeans. All parameters
other than the number of clusters are set at their defaults, but can be
manipulated through the configuration returned by getConfiguration().
Euclidean distance is used to measure the distance between points.
K - the number of clusters
Returns: a ByteKMeans instance configured for exact k-means

public static ByteKMeans createExact(int K, int niters)
Convenience method to quickly create an exact ByteKMeans. All parameters
other than the number of clusters and number of iterations are set at their
defaults, but can be manipulated through the configuration returned by
getConfiguration().
Euclidean distance is used to measure the distance between points.
K - the number of clusters
niters - maximum number of iterations
Returns: a ByteKMeans instance configured for exact k-means

public static ByteKMeans createKDTreeEnsemble(int K)
Convenience method to quickly create an approximate ByteKMeans using an
ensemble of KD-Trees to perform nearest-neighbour lookup. All parameters
other than the number of clusters are set at their defaults, but can be
manipulated through the configuration returned by getConfiguration().
Euclidean distance is used to measure the distance between points.
K - the number of clusters
Returns: a ByteKMeans instance configured for approximate k-means using an ensemble of KD-Trees