Geo Clustering (geo_cluster)

The geo_cluster function performs geospatial clustering and noise reduction. The underlying algorithm is DBSCAN, with haversine distance in meters as the distance measure. The geo_cluster function takes four parameters:

  1. the latitude field

  2. the longitude field

  3. the distance in meters to be considered a neighbor

  4. the smallest number of points to be considered a cluster
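To make the third parameter concrete, here is a minimal Python sketch of a haversine distance in meters and the neighbor test it implies. This is an illustration only, not the geo_cluster implementation: the function names haversine_m and is_neighbor are hypothetical, and the Earth radius constant is an assumed mean value.

```python
import math

# Assumed mean Earth radius in meters; the value geo_cluster uses is not stated.
EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_neighbor(p, q, eps_m):
    """True if points p and q, as (lat, lon) tuples, lie within eps_m meters."""
    return haversine_m(p[0], p[1], q[0], q[1]) <= eps_m
```

With a distance of 100 meters, as in the sample syntax, two points that differ by 0.0005 degrees of latitude (roughly 56 meters) would count as neighbors under this sketch.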

Sample syntax

select geo_cluster(lat_d, lon_d, 100, 5) as cluster
from nyc311
where lat_d is not null
  and desc_t = 'Rat Sighting'
limit 5000

Result set

The geo_cluster result set contains a random sample of records that match the WHERE clause. If no WHERE clause is included, the random sample is taken from the entire result set. The size of the random sample is controlled by the LIMIT clause; if no limit is applied, the default sample size is 25,000. Points that match the WHERE clause but are not assigned to a cluster are not included in the result set. This noise-reduced result set makes it easy to find hot spots or clusters in geospatial data quickly. Because of the noise reduction, the final result will likely be smaller than the limit.
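The noise reduction described above can be sketched in Python as a small DBSCAN that discards every point left without a cluster. This is a simplified illustration under stated assumptions, not the geo_cluster implementation: the names haversine_m, dbscan and cluster_and_drop_noise are hypothetical, a point is assumed to count toward its own neighborhood, the Earth radius is an assumed mean value, and the random sampling step is omitted.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius in meters
NOISE = -1                    # label for points that join no cluster


def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dphi = phi2 - phi1
    dlambda = math.radians(q[1] - p[1])
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def dbscan(points, eps_m, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def region(i):
        # Indices of all points within eps_m of point i (includes i itself).
        return [j for j in range(len(points))
                if haversine_m(points[i], points[j]) <= eps_m]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
        cluster += 1
    return labels


def cluster_and_drop_noise(points, eps_m, min_pts):
    """Return (lat, lon, cluster_id) rows, leaving out points labeled NOISE."""
    labels = dbscan(points, eps_m, min_pts)
    return [(lat, lon, label)
            for (lat, lon), label in zip(points, labels) if label != NOISE]


# Five points about 11 meters apart, plus two isolated points far away.
sample = [(40.7000 + k * 0.0001, -74.0) for k in range(5)]
sample += [(40.8, -74.0), (40.6, -73.9)]
rows = cluster_and_drop_noise(sample, eps_m=100, min_pts=5)
```

In this example seven points go in and five come out, which mirrors why a geo_cluster result is usually smaller than the requested limit.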

The geo_cluster function returns the cluster name for each latitude/longitude point. The latitude/longitude point fields can also be selected for plotting.

Sample result set in Apache Zeppelin

Sample result set


The geo_cluster output can be visualized on a map or scatter plot. The example below shows the geo_cluster output visualized with an Apache Zeppelin map visualization.

Sample visualization