
      Geo Clustering (geo_cluster)

      The geo_cluster function performs geo-spatial clustering and noise reduction. The underlying algorithm is DBSCAN, with haversine distance in meters as the distance measure. The geo_cluster function takes four parameters:

      1. The latitude field

      2. The longitude field

      3. The maximum distance, in meters, within which two points are considered neighbors

      4. The minimum number of points required to form a cluster
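
      To make the third parameter concrete, the following is a minimal Python sketch of a haversine distance in meters and the neighbor test it implies. It is an illustration only, not the geo_cluster implementation: the names haversine_m and is_neighbor and the Earth radius constant are assumptions introduced here.

```python
import math

# Mean Earth radius in meters. This is an assumed value for illustration;
# the radius used internally by geo_cluster is not stated in this document.
EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_neighbor(p, q, eps_m):
    """True when (lat, lon) points p and q are within eps_m meters of each other."""
    return haversine_m(p[0], p[1], q[0], q[1]) <= eps_m
```

      With a distance of 100, as in the sample syntax, two points about 55 meters apart would count as neighbors, while two points about 220 meters apart would not.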

      Sample syntax

      select geo_cluster(lat_d, lon_d, 100, 5) as cluster
      from nyc311
      where lat_d is not null
        and desc_t = 'Rat Sighting'
      limit 5000

      Result set

      The geo_cluster result set is built from a random sample of the records that match the WHERE clause. If no WHERE clause is included, the sample is drawn from the entire table. The LIMIT clause controls the sample size; if no LIMIT is applied, the default sample size is 25,000. Points that match the WHERE clause but are not assigned to a cluster are treated as noise and excluded from the result set, so the final result will likely contain fewer rows than the limit. This noise-reduced result set makes it easy to find hot spots, or clusters, in geo-spatial data.
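
      The sample, cluster, and drop-noise behavior described above can be sketched in Python as follows. This is a textbook DBSCAN written for illustration under stated assumptions, not the geo_cluster implementation: the function names, the "cluster0"-style cluster names, the Earth radius, and the choice to count a point as its own neighbor are all hypothetical.

```python
import math
import random

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi, dlam = phi2 - phi1, math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def dbscan(points, eps_m, min_pts):
    """Label each (lat, lon) point with a cluster id (0, 1, ...) or -1 for noise."""
    noise = -1
    labels = [None] * len(points)  # None means "not visited yet"

    def region(i):
        # Brute-force neighbor search; the point itself is included (assumption).
        return [j for j in range(len(points))
                if haversine_m(*points[i], *points[j]) <= eps_m]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = noise  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point: reachable but not a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


def geo_cluster_sketch(points, eps_m, min_pts, limit=25_000, seed=None):
    """Sample up to `limit` points, cluster them, and drop the noise points."""
    rng = random.Random(seed)
    sample = points if len(points) <= limit else rng.sample(points, limit)
    labels = dbscan(sample, eps_m, min_pts)
    return [(f"cluster{c}", lat, lon)
            for (lat, lon), c in zip(sample, labels) if c != -1]


if __name__ == "__main__":
    pts = [(40.7000, -74.0000), (40.7001, -74.0000), (40.7000, -74.0001),
           (40.7001, -74.0001), (40.70005, -74.00005), (41.0, -73.0)]
    for row in geo_cluster_sketch(pts, eps_m=100, min_pts=5):
        print(row)
```

      In this sketch the isolated point at (41.0, -73.0) has no neighbors within 100 meters, so it is labeled as noise and omitted, which is why the output can be smaller than the sample size. The brute-force neighbor search compares every pair of points and is only suitable for small samples.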

      The geo_cluster function returns the cluster name for each latitude/longitude point. The latitude/longitude point fields can also be selected for plotting.

      Sample result set in Apache Zeppelin


      The geo_cluster output can be visualized on a map or scatter plot. The example below shows the geo_cluster output visualized with an Apache Zeppelin map visualization.

      Sample visualization