2D K-Means Clustering (kmeans)

The kmeans function performs 2D k-means clustering. 2D k-means clustering can be used to visualize patterns within 2D scatter plots. The kmeans function takes three parameters:

  1. The numeric field for the first dimension.

  2. The numeric field for the second dimension.

  3. K or number of clusters.

Sample syntax

select kmeans(petal_length_d, petal_width_d, 5) as cluster,
       petal_length_d,
       petal_width_d
from iris
       limit 150

Result set

The result set contains a random sample of records that match the WHERE clause. If no WHERE clause is included the random sample will be taken from the entire result set. The size of the random sample can be controlled by the LIMIT clause. The default sample size, if no limit is applied, is 25,000.

The kmeans function returns the cluster name of each row in the result set. The two fields used for clustering are also available in the result set.

Sample result set in Apache Zeppelin

Sample result

Visualization

Sample visualization of kmeans cluster with Apache Zeppelin scatter plot.

Sample visualization