public class LongKMeans extends Object implements SpatialClusterer<LongCentroidsResult,long[]>
The flavour of KMeans is determined by the choice of LongNearestNeighbours
implementation; for example, approximate KMeans can be achieved using a
LongNearestNeighboursKDTree, whilst exact KMeans can be achieved using a
LongNearestNeighboursExact. The specific choice of nearest-neighbour object
is controlled through the NearestNeighboursFactory provided to the
KMeansConfiguration used to construct instances of this class. The choice of
LongNearestNeighbours affects the speed of clustering; using approximate
nearest-neighbour algorithms for the KMeans can produce comparable results
to the exact KMeans algorithm in much shorter time. The choice and
configuration of LongNearestNeighbours can also control the type of distance
function being used in the clustering.
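To illustrate how the nearest-neighbour object decides both exactness and the distance function, here is a minimal sketch of a brute-force ("exact") nearest-centroid lookup with a pluggable distance. It is not the LongNearestNeighboursExact source; the class ExactNearestSketch and its methods are hypothetical names introduced for this example.

```java
import java.util.function.ToDoubleBiFunction;

/** Illustrative only: exact nearest-centroid lookup with a pluggable distance. */
final class ExactNearestSketch {

    /** Squared Euclidean distance between two equal-length vectors. */
    static double squaredEuclidean(long[] a, long[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = (double) a[i] - (double) b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Manhattan (L1) distance, included to show that the distance is swappable. */
    static double manhattan(long[] a, long[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs((double) a[i] - (double) b[i]);
        }
        return sum;
    }

    /** Index of the centroid closest to the query under the given distance. */
    static int nearest(long[][] centroids, long[] query,
                       ToDoubleBiFunction<long[], long[]> distance) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = distance.applyAsDouble(centroids[c], query);
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        long[][] centroids = { {3, 3}, {0, 5} };
        long[] query = {0, 0};
        // Squared Euclidean: 18 vs 25, so centroid 0 wins.
        System.out.println(nearest(centroids, query, ExactNearestSketch::squaredEuclidean));
        // Manhattan: 6 vs 5, so centroid 1 wins.
        System.out.println(nearest(centroids, query, ExactNearestSketch::manhattan));
    }
}
```

The same query lands in a different cluster depending on the distance, which is why the choice of nearest-neighbour object changes the clustering as well as its speed. An approximate lookup would replace the linear scan with an index structure and accept occasional misses.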
The algorithm is implemented as follows: Clustering is initiated using a
LongKMeansInit and is iterative. In each round, batches of samples are
assigned to centroids in parallel. The centroid assignment is performed
using the preconfigured LongNearestNeighbours instances created from the
KMeansConfiguration. Once all samples are assigned, new centroids are
calculated and the next round is started. Data point pushing is performed
using the same techniques as center point assignment.
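The round structure described above can be sketched as follows. This is an illustration under stated assumptions, not the LongKMeans implementation: the class KMeansRoundSketch and its methods are hypothetical, a parallel IntStream stands in for whatever thread pool the library uses, random sample selection stands in for a LongKMeansInit, and Math.round is assumed for converting a mean back to a long.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.stream.IntStream;

/** Illustrative only: one KMeans round over long[] samples. */
final class KMeansRoundSketch {

    /** Pick k samples at random (with replacement); a seed below 1 means "no seed". */
    static long[][] randomInit(long[][] data, int k, long seed) {
        Random rng = seed < 1 ? new Random() : new Random(seed);
        long[][] centroids = new long[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = data[rng.nextInt(data.length)].clone();
        }
        return centroids;
    }

    /** Index of the closest centroid by squared Euclidean distance. */
    static int nearest(long[][] centroids, long[] p) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int d = 0; d < p.length; d++) {
                double diff = (double) centroids[c][d] - (double) p[d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }

    /** Assign every sample to a centroid, processing fixed-size batches in parallel. */
    static int[] assign(long[][] data, long[][] centroids, int batchSize) {
        int[] assignment = new int[data.length];
        int batches = (data.length + batchSize - 1) / batchSize;
        IntStream.range(0, batches).parallel().forEach(b -> {
            int end = Math.min(data.length, (b + 1) * batchSize);
            for (int i = b * batchSize; i < end; i++) {
                assignment[i] = nearest(centroids, data[i]); // each index written once
            }
        });
        return assignment;
    }

    /** Recompute each centroid as the rounded mean of its assigned samples. */
    static long[][] update(long[][] data, int[] assignment, long[][] old) {
        int k = old.length, dim = old[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (int i = 0; i < data.length; i++) {
            counts[assignment[i]]++;
            for (int d = 0; d < dim; d++) sums[assignment[i]][d] += data[i][d];
        }
        long[][] next = new long[k][dim];
        for (int c = 0; c < k; c++) {
            for (int d = 0; d < dim; d++) {
                // An empty cluster keeps its previous centroid (a choice made for this sketch).
                next[c][d] = counts[c] == 0 ? old[c][d] : Math.round(sums[c][d] / counts[c]);
            }
        }
        return next;
    }

    public static void main(String[] args) {
        long[][] data = { {0, 0}, {2, 0}, {10, 10}, {12, 10} };
        long[][] centroids = { {0, 0}, {10, 10} };
        int[] assignment = assign(data, centroids, 3);
        System.out.println(Arrays.toString(assignment));
        System.out.println(Arrays.deepToString(update(data, assignment, centroids)));
    }
}
```

Batches write to disjoint slots of the assignment array, so the parallel step needs no locking here; a real implementation that accumulates sums during assignment would need a shared, synchronised accumulator or per-thread partial sums.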
This implementation is able to deal with larger-than-memory datasets by
streaming the samples from disk using an appropriate DataSource. The only
requirement is that there is enough memory to hold all the centroids plus
working memory for the batches of samples being assigned.
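The memory requirement stated above can be made concrete with a small sketch. It assumes a plain Iterator of batches as a stand-in for a DataSource (the real DataSource interface is not used here), and the class StreamingRoundSketch is a hypothetical name. Only the centroids, the per-centroid sums and counts, and the current batch are held at any time.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Illustrative only: one KMeans round over samples streamed in batches. */
final class StreamingRoundSketch {

    /** Index of the closest centroid by squared Euclidean distance. */
    static int nearest(long[][] centroids, long[] p) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int d = 0; d < p.length; d++) {
                double diff = (double) centroids[c][d] - (double) p[d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }

    /** Consume all batches once and return the updated centroids. */
    static long[][] round(Iterator<long[][]> batches, long[][] centroids) {
        int k = centroids.length, dim = centroids[0].length;
        double[][] sums = new double[k][dim]; // O(k * dim), independent of dataset size
        long[] counts = new long[k];
        while (batches.hasNext()) {
            long[][] batch = batches.next();  // only this batch is resident
            for (long[] p : batch) {
                int c = nearest(centroids, p);
                counts[c]++;
                for (int d = 0; d < dim; d++) sums[c][d] += p[d];
            }
        }
        long[][] next = new long[k][dim];
        for (int c = 0; c < k; c++) {
            for (int d = 0; d < dim; d++) {
                // Empty clusters keep their old centroid (a choice made for this sketch).
                next[c][d] = counts[c] == 0 ? centroids[c][d] : Math.round(sums[c][d] / counts[c]);
            }
        }
        return next;
    }

    public static void main(String[] args) {
        Iterator<long[][]> batches = Arrays.asList(
                new long[][] { {0, 0}, {2, 0} },
                new long[][] { {10, 10}, {12, 10} }).iterator();
        long[][] centroids = { {0, 0}, {10, 10} };
        System.out.println(Arrays.deepToString(round(batches, centroids)));
    }
}
```

Because each round makes a single sequential pass, the batches could equally be read from a file on each iteration, at the cost of re-reading the data once per round.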
Modifier and Type  Class and Description 

static class  LongKMeans.Result
Result object for LongKMeans, extending LongCentroidsResult and LongNearestNeighboursProvider,
as well as giving access to state information from the operation of the KMeans algorithm
(i.e.

Modifier  Constructor and Description 

protected  LongKMeans()
A completely default LongKMeans used primarily as a convenience function for reading.

LongKMeans(KMeansConfiguration<LongNearestNeighbours,long[]> conf)
Construct the clusterer with the given configuration.

Modifier and Type  Method and Description 

LongKMeans.Result  cluster(DataSource<long[]> ds)
Perform clustering with data from a data source.

protected LongKMeans.Result  cluster(DataSource<long[]> data, int K)
Initiate clustering with the given data and number of clusters.

void  cluster(DataSource<long[]> data, LongKMeans.Result result)
Main clustering algorithm.

LongKMeans.Result  cluster(long[][] data)
Perform clustering on the given data.

void  cluster(long[][] data, LongKMeans.Result result)
Main clustering algorithm.

static LongKMeans  createExact(int K)
Convenience method to quickly create an exact LongKMeans.

static LongKMeans  createExact(int K, int niters)
Convenience method to quickly create an exact LongKMeans.

static LongKMeans  createKDTreeEnsemble(int K)
Convenience method to quickly create an approximate LongKMeans
using an ensemble of KDTrees to perform nearest-neighbour lookup.

KMeansConfiguration<LongNearestNeighbours,long[]>  getConfiguration()
Get the configuration.

LongKMeansInit  getInit()
Get the current initialisation algorithm.

int[][]  performClustering(long[][] data)

protected double  roundDouble(double value)

protected float  roundFloat(double value)

protected int  roundInt(double value)

protected long  roundLong(double value)

void  seed(long seed)
Set the seed for the internal random number generator.

void  setConfiguration(KMeansConfiguration<LongNearestNeighbours,long[]> conf)
Set the configuration.

void  setInit(LongKMeansInit init)
Set the current initialisation algorithm.

String  toString()
public LongKMeans(KMeansConfiguration<LongNearestNeighbours,long[]> conf)
Construct the clusterer with the given configuration.
conf
 The configuration.

protected LongKMeans()
A completely default LongKMeans used primarily as a convenience function for reading.

public LongKMeansInit getInit()
Get the current initialisation algorithm.

public void setInit(LongKMeansInit init)
Set the current initialisation algorithm.
init
 the init algorithm to be used

public void seed(long seed)
Set the seed for the internal random number generator.
seed
 the random seed for init random sample selection, no seed if seed < 1

public LongKMeans.Result cluster(long[][] data)
Perform clustering on the given data.
cluster in interface SpatialClusterer<LongCentroidsResult,long[]>
data
 the data.

public int[][] performClustering(long[][] data)
performClustering in interface Clusterer<long[][]>

protected LongKMeans.Result cluster(DataSource<long[]> data, int K) throws Exception
Initiate clustering with the given data and number of clusters. See #cluster(DataSource, long [][]).
data
 data source to cluster with
K
 number of clusters to find
Exception

public void cluster(long[][] data, LongKMeans.Result result) throws InterruptedException
Main clustering algorithm. This method continues from the state already held
in the result object and as such ignores the init object. In normal operation
you should call one of the other cluster methods instead of this one. However,
if you wish to resume clustering iterations from a result that you've already
generated, this is the method to use.
data
 the data to be clustered
result
 the results object to be populated
InterruptedException
 if interrupted while waiting, in which case unfinished tasks are cancelled.

public void cluster(DataSource<long[]> data, LongKMeans.Result result) throws InterruptedException
Main clustering algorithm. This method continues from the state already held
in the result object and as such ignores the init object. In normal operation
you should call one of the other cluster methods instead of this one. However,
if you wish to resume clustering iterations from a result that you've already
generated, this is the method to use.
data
 the data to be clustered
result
 the results object to be populated
InterruptedException
 if interrupted while waiting, in which case unfinished tasks are cancelled.

protected float roundFloat(double value)

protected double roundDouble(double value)

protected long roundLong(double value)

protected int roundInt(double value)
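The resume behaviour described for cluster(data, result), together with the need to round a double-valued mean back to a long, can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions: ResumeSketch and its State class are hypothetical stand-ins for LongKMeans and LongKMeans.Result, and Math.round is assumed as the rounding rule rather than taken from the roundLong source.

```java
import java.util.Arrays;

/** Illustrative only: continue KMeans iterations from previously computed centroids. */
final class ResumeSketch {

    /** Carries centroids between calls, playing the role of a result object. */
    static final class State {
        long[][] centroids;
        int iterationsRun;
        State(long[][] centroids) { this.centroids = centroids; }
    }

    /** One assign-and-update pass using squared Euclidean distance. */
    static long[][] step(long[][] data, long[][] centroids) {
        int k = centroids.length, dim = centroids[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (long[] p : data) {
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                double dist = 0;
                for (int d = 0; d < dim; d++) {
                    double diff = (double) centroids[c][d] - (double) p[d];
                    dist += diff * diff;
                }
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            counts[best]++;
            for (int d = 0; d < dim; d++) sums[best][d] += p[d];
        }
        long[][] next = new long[k][dim];
        for (int c = 0; c < k; c++) {
            for (int d = 0; d < dim; d++) {
                // Means are doubles; round to the nearest long (assumed rounding rule).
                next[c][d] = counts[c] == 0 ? centroids[c][d] : Math.round(sums[c][d] / counts[c]);
            }
        }
        return next;
    }

    /** Run up to maxIters further rounds from the centroids already in state; no init is used. */
    static void iterate(long[][] data, State state, int maxIters) {
        for (int it = 0; it < maxIters; it++) {
            long[][] next = step(data, state.centroids);
            state.iterationsRun++;
            boolean unchanged = Arrays.deepEquals(next, state.centroids);
            state.centroids = next;
            if (unchanged) break; // converged: centroids did not move
        }
    }

    public static void main(String[] args) {
        long[][] data = { {0}, {2}, {10}, {12} };
        State state = new State(new long[][] { {0}, {2} });
        iterate(data, state, 1);   // first run, stopped early
        iterate(data, state, 10);  // resumed from the same state
        System.out.println(Arrays.deepToString(state.centroids));
    }
}
```

The second call picks up exactly where the first stopped because all progress lives in the state object, which mirrors why the method ignores the init object.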

public LongKMeans.Result cluster(DataSource<long[]> ds)
Perform clustering with data from a data source. The DataSource could
potentially be backed by disk rather than in memory.
cluster in interface SpatialClusterer<LongCentroidsResult,long[]>
ds
 the data.

public KMeansConfiguration<LongNearestNeighbours,long[]> getConfiguration()
Get the configuration.

public void setConfiguration(KMeansConfiguration<LongNearestNeighbours,long[]> conf)
Set the configuration.
conf
 the configuration to set

public static LongKMeans createExact(int K)
Convenience method to quickly create an exact LongKMeans. All parameters
other than the number of clusters are set at their defaults, but can be
manipulated through the configuration returned by getConfiguration().
Euclidean distance is used to measure the distance between points.
K
 the number of clusters
Returns a LongKMeans instance configured for exact k-means.

public static LongKMeans createExact(int K, int niters)
Convenience method to quickly create an exact LongKMeans. All parameters
other than the number of clusters and number of iterations are set at their
defaults, but can be manipulated through the configuration returned by
getConfiguration().
Euclidean distance is used to measure the distance between points.
K
 the number of clusters
niters
 maximum number of iterations
Returns a LongKMeans instance configured for exact k-means.

public static LongKMeans createKDTreeEnsemble(int K)
Convenience method to quickly create an approximate LongKMeans using an
ensemble of KDTrees to perform nearest-neighbour lookup. All parameters
other than the number of clusters are set at their defaults, but can be
manipulated through the configuration returned by getConfiguration().
Euclidean distance is used to measure the distance between points.
K
 the number of clusters
Returns a LongKMeans instance configured for approximate k-means using an
ensemble of KDTrees.