Geo Clustering (geo_cluster)
The geo_cluster function performs geo-spatial clustering and noise reduction. The underlying algorithm is DBSCAN clustering, with haversine distance in meters as the distance measure. The geo_cluster function takes four parameters:
The latitude field
The longitude field
The distance in meters to be considered a neighbor
The smallest number of points to be considered a cluster
select geo_cluster(lat_d, lon_d, 100, 5) as cluster, lat_d, lon_d from nyc311 where lat_d is not null and desc_t = 'Rat Sighting' limit 5000
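To make the third and fourth parameters concrete, here is a minimal Python sketch of DBSCAN over haversine distances. It is an illustration only, not the implementation behind geo_cluster. The names haversine_m, dbscan, eps_m, min_pts and NOISE are hypothetical, the Earth radius is an assumed mean value, and a point is assumed to count as its own neighbor, which the actual function may handle differently.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in meters
NOISE = -1                  # hypothetical label for points outside every cluster


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def dbscan(points, eps_m, min_pts):
    """Return one label per (lat, lon) point: a cluster id or NOISE.

    Uses a brute-force O(n^2) neighbor search, which is fine for a sketch.
    """
    def neighbors(i):
        # Assumption: a point counts as its own neighbor.
        return [j for j, q in enumerate(points)
                if haversine_m(*points[i], *q) <= eps_m]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels


# Five points within roughly 50 m of each other, plus one far away.
points = [
    (40.7128, -74.0060),
    (40.7129, -74.0061),
    (40.7130, -74.0059),
    (40.7127, -74.0062),
    (40.7131, -74.0060),
    (40.8000, -73.9000),
]
labels = dbscan(points, eps_m=100, min_pts=5)
print(labels)  # [0, 0, 0, 0, 0, -1]
```

With a 100 meter radius and a minimum of 5 points, mirroring the query above, the five nearby points form one cluster and the distant point is labeled as noise.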
The geo_cluster result set contains a random sample of records that match the WHERE clause. If no WHERE clause is included, the random sample is taken from the entire result set. The size of the random sample can be controlled by the LIMIT clause. The default sample size, if no limit is applied, is 25,000.
Points that match the WHERE clause but are not assigned to a cluster are not included in the result set.
The noise-reduced result set makes it easy to quickly find hot spots or clusters in geo-spatial data.
Because noise points are removed, the final result set will likely contain fewer rows than the limit.
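The sampling and noise-reduction behavior described above can be sketched as a small pipeline. This is a hedged illustration that assumes sampling happens before clustering, as the text implies. The names clustered_sample, label_fn, toy_labels and NOISE are hypothetical, and toy_labels is a stand-in for a real clustering step rather than anything geo_cluster does.

```python
import random

DEFAULT_SAMPLE_SIZE = 25_000  # default sample size stated in the text
NOISE = -1                    # hypothetical marker for unclustered points


def clustered_sample(rows, label_fn, limit=None, seed=None):
    """Sample rows that passed the WHERE filter, label them, drop noise.

    rows: list of records that already matched the WHERE clause
    label_fn: hypothetical callable mapping a list of rows to cluster labels
    limit: sample size (the LIMIT clause); falls back to the default
    """
    wanted = limit if limit is not None else DEFAULT_SAMPLE_SIZE
    size = min(wanted, len(rows))
    sample = random.Random(seed).sample(rows, size)
    labels = label_fn(sample)
    # Noise points are excluded, so the result can be shorter than the limit.
    return [(lbl, row) for lbl, row in zip(labels, sample) if lbl != NOISE]


def toy_labels(sample):
    # Stand-in for real clustering: even ids form cluster 0, odd ids are noise.
    return [0 if row["id"] % 2 == 0 else NOISE for row in sample]


rows = [{"id": i} for i in range(100)]
result = clustered_sample(rows, toy_labels, limit=10, seed=42)
print(len(result) <= 10)  # True
```

Only rows that received a cluster label survive, which is why the returned row count is bounded by the limit but usually falls below it.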
The geo_cluster function returns the cluster name for each latitude/longitude point. The latitude and longitude fields can also be selected for plotting.
The geo_cluster output can be visualized on a map or scatter plot. The example below shows the geo_cluster output visualized with an Apache Zeppelin map visualization.