System Metrics

By default, collection of system metrics is disabled. When it is enabled, Fusion continuously indexes system and Solr metrics to the system collection system_metrics. To enable collection, set the com.lucidworks.apollo.metrics.enabled parameter to true using the Configurations API, like this:

curl -u user:pass -H 'Content-type:application/json' -X PUT -d 'true' "http://localhost:8764/api/apollo/configurations/com.lucidworks.apollo.metrics.enabled"

There are around 600 different metrics available. In this topic we’ve highlighted a few that are likely to be the most useful or interesting to you.

The /system/metrics endpoint of the System API lists all the metrics that the system is currently collecting. Metrics are returned for the current instance only; Fusion instances do not aggregate metrics between nodes.
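As a rough illustration, a request to this endpoint could be prepared in Python as sketched below. The full URL (http://localhost:8764/api/apollo/system/metrics) is an assumption made by analogy with the Configurations API URL shown earlier, the credentials are placeholders, and the function names are hypothetical.

```python
import base64
import json
import urllib.request

# Assumed base URL, by analogy with the Configurations API example; adjust for your install.
BASE_URL = "http://localhost:8764/api/apollo"

def build_metrics_request(user, password, base_url=BASE_URL):
    """Build (but do not send) a GET request for the /system/metrics endpoint."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(base_url + "/system/metrics", method="GET")
    req.add_header("Authorization", "Basic " + token)  # same scheme as curl -u
    req.add_header("Accept", "application/json")
    return req

def fetch_metrics(user, password, base_url=BASE_URL):
    """Send the request and decode the JSON body; needs a running instance."""
    with urllib.request.urlopen(build_metrics_request(user, password, base_url)) as resp:
        return json.load(resp)
```

Because metrics are not aggregated between nodes, a multi-node deployment would need one such call per instance.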

Types of Metrics Collected

There are several types of metrics:

  • Gauges: These are single values, valid for the point in time at which the metrics are collected.

  • Counters: These are values that are incremented or decremented over time.

  • Meters: These measure the rate of events over time. They include a mean rate, as well as 1-, 5- and 15-minute moving averages. Most of these moving averages are exponentially weighted, so that more recent values contribute more heavily than older values; exceptions to this rule have the word "unweighted" in their name.

  • Histograms: These measure the distribution of values. They will report the minimum, maximum, mean, and the values at the 50th, 75th, 95th, 98th, 99th, and 99.9th percentiles.

  • Timers: A timer is a meter combined with a histogram; it measures the length of time that a particular operation takes (both mean duration and moving averages) as well as the distribution of those durations.
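To make the idea of an exponentially weighted moving average concrete, here is a minimal Python sketch of how a meter's 1-, 5- or 15-minute rate could be maintained. It is an illustration of the general technique only: the class name and the 5-second tick interval are assumptions, not documented Fusion internals.

```python
import math

class Ewma:
    """Exponentially weighted moving average of an event rate (events/second)."""

    def __init__(self, window_minutes, tick_seconds=5.0):
        # tick_seconds is an illustrative assumption, not a documented setting.
        self.tick_seconds = tick_seconds
        self.alpha = 1.0 - math.exp(-tick_seconds / (window_minutes * 60.0))
        self.rate = None
        self.uncounted = 0

    def mark(self, n=1):
        """Record n events since the last tick."""
        self.uncounted += n

    def tick(self):
        """Fold the events seen during the last interval into the average."""
        instant_rate = self.uncounted / self.tick_seconds
        self.uncounted = 0
        if self.rate is None:
            self.rate = instant_rate  # the first interval seeds the average
        else:
            self.rate += self.alpha * (instant_rate - self.rate)
        return self.rate

m1, m15 = Ewma(1), Ewma(15)
for meter in (m1, m15):
    meter.mark(50)   # 50 events in the first 5-second interval
    meter.tick()
# After an idle interval, the 1-minute average decays faster than the 15-minute one.
print(m1.tick(), m15.tick())
```

A shorter window gives a larger weight (alpha) to the most recent interval, which is why the 1-minute rate reacts to bursts and lulls sooner than the 15-minute rate.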

Many of the metrics are for internal use by the system. However, Lucidworks may ask for a dump of the metrics data (using the System API endpoint) to help diagnose performance issues. Some metrics are also subject to change pending performance tuning and additional testing.

Metrics of Particular Interest

Slow Web Service Calls

For each web service endpoint in the system, the system keeps a list of the last several requests whose request time has been in the 99th percentile – that is, examples of the top 1% of slow requests for that endpoint. These are recorded as com.lucidworks.apollo.resources.serviceName.methodName.weighted.slow.examples, where serviceName is the name of the service and methodName is the name of a valid method for that service.

This information might be helpful when diagnosing performance issues. Here is an example showing five recorded slow calls to the getCollectionMetrics method of the CollectionResource service:

"com.lucidworks.apollo.resources.CollectionResource.getCollectionMetrics.weighted.slow.examples" : {
      "value" : [ {
        "requestUri" : "http://localhost:8765/api/v1/collections/lws5_metrics/stats",
        "queryParams" : { },
        "userPrincipal" : null,
        "method" : "GET",
        "cookies" : { }
      }, {
        "requestUri" : "http://localhost:8765/api/v1/collections/logs/stats",
        "queryParams" : { },
        "userPrincipal" : null,
        "method" : "GET",
        "cookies" : { }
      }, {
        "requestUri" : "http://localhost:8765/api/v1/collections/logs/stats",
        "queryParams" : { },
        "userPrincipal" : null,
        "method" : "GET",
        "cookies" : { }
      }, {
        "requestUri" : "http://localhost:8765/api/v1/collections/lws5_metrics/stats",
        "queryParams" : { },
        "userPrincipal" : null,
        "method" : "GET",
        "cookies" : { }
      }, {
        "requestUri" : "http://localhost:8765/api/v1/collections/lws5_metrics/stats",
        "queryParams" : { },
        "userPrincipal" : null,
        "method" : "GET",
        "cookies" : { }
      } ]
    }
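When many examples are recorded, it can help to group them by URI to see which resources are slow most often. The following Python sketch does this for the example above; it assumes the metric has already been parsed from JSON into a dictionary, and the function name is hypothetical.

```python
from collections import Counter

def count_slow_uris(metric):
    """Count how often each requestUri appears in a *.slow.examples metric."""
    return Counter(example["requestUri"] for example in metric["value"])

# The five examples shown above, reduced to the field used here.
slow_examples = {
    "value": [
        {"requestUri": "http://localhost:8765/api/v1/collections/lws5_metrics/stats"},
        {"requestUri": "http://localhost:8765/api/v1/collections/logs/stats"},
        {"requestUri": "http://localhost:8765/api/v1/collections/logs/stats"},
        {"requestUri": "http://localhost:8765/api/v1/collections/lws5_metrics/stats"},
        {"requestUri": "http://localhost:8765/api/v1/collections/lws5_metrics/stats"},
    ]
}

for uri, hits in count_slow_uris(slow_examples).most_common():
    print(hits, uri)
```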

System Memory

There are several memory-related metrics reported:

  • mem.heap.used: the current amount of heap memory, in bytes, used by the system.

  • mem.heap.max: the maximum amount of heap memory, in bytes, that the system could use.

  • mem.heap.usage: the fraction (0 to 1.0) of available heap memory that the system is currently using (this is equal to mem.heap.used / mem.heap.max).

  • mem.non-heap.used: the current amount of non-heap memory (also called "off-heap memory"), in bytes, used by the system.

  • mem.non-heap.max: the maximum amount of non-heap memory, in bytes, that the system could use.

  • mem.non-heap.usage: the fraction (0 to 1.0) of available non-heap memory that the system is currently using (this is equal to mem.non-heap.used / mem.non-heap.max).

  • mem.total.used: the current total amount of memory (heap plus non-heap), in bytes, used by the system.

  • mem.total.max: the maximum amount of total memory (heap plus non-heap), in bytes, that the system could use.

Here is an example of mem.heap.used:

{
  "version" : "3.0.0",
  "gauges" : {
    "mem.heap.used" : {
      "value" : 94783360
    }
  },
  "counters" : { },
  "histograms" : { },
  "meters" : { },
  "timers" : { }
}
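The ratio described for mem.heap.usage can be recomputed from the two underlying gauges. The Python sketch below assumes that mem.heap.max is reported in the same "gauges" object and in the same shape as mem.heap.used; the maximum of 1 GiB is a hypothetical value chosen only for illustration.

```python
def heap_usage(gauges):
    """Return mem.heap.used / mem.heap.max, or None if the maximum is unusable."""
    used = gauges["mem.heap.used"]["value"]
    maximum = gauges["mem.heap.max"]["value"]
    if maximum <= 0:
        return None  # avoid dividing by a zero or undefined maximum
    return used / maximum

gauges = {
    "mem.heap.used": {"value": 94783360},   # value from the example above
    "mem.heap.max": {"value": 1073741824},  # hypothetical 1 GiB maximum
}
print(f"heap usage: {heap_usage(gauges):.1%}")
```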

Query and Index Pipeline Stage Metrics

For each query pipeline and index pipeline stage, Fusion collects aggregate performance metrics for successful executions and for errors. All executions for each stage are stored in a metric named stages.stageType.stageName.process, where stageType is the type of stage, and stageName is the name of a specific stage.

Here is an example of the performance metrics returned for an index pipeline stage named 'solr-default' (stages.solr-index.solr-default.process), which is included with Fusion:

{
  "version" : "3.0.0",
  "gauges" : { },
  "counters" : { },
  "histograms" : { },
  "meters" : { },
  "timers" : {
    "stages.solr-index.solr-default.process" : {
      "count" : 109195,
      "max" : 0.128585,
      "mean" : 0.004011065175097276,
      "min" : 0.0022500000000000003,
      "p50" : 0.0030645000000000004,
      "p75" : 0.0033495,
      "p95" : 0.005410449999999992,
      "p98" : 0.014195759999999965,
      "p99" : 0.02462230000000001,
      "p999" : 0.12850243700000002,
      "stddev" : 0.007408363728123277,
      "m15_rate" : 11.957732876922531,
      "m1_rate" : 8.784289947811962,
      "m5_rate" : 9.037172472578138,
      "mean_rate" : 9.214233776748047,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    }
  }
}

This shows the number of times the stage has been executed ("count"); the minimum, maximum, mean, and standard deviation of the execution time; the 50th, 75th, 95th, 98th, 99th, and 99.9th percentiles of the execution time (p50, p75, etc.); and the mean rate together with the 1-, 5- and 15-minute moving-average rates ('m1_rate', etc.). In this case, the stage has been executed 109,195 times, at a mean rate of 9.214 calls per second, and half of the executions completed in about 0.003 seconds or less (p50).
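Timer values are easier to scan once the durations are converted to milliseconds and the tail is compared with the median. The following Python sketch does this for the timer shown above; the function name and the "tail_ratio" field are hypothetical, and the sketch only handles durations reported in seconds.

```python
def summarize_timer(timer):
    """Convert key timer durations to milliseconds and compare p99 with p50."""
    if timer["duration_units"] != "seconds":
        raise ValueError("this sketch only handles durations reported in seconds")
    p50_ms = timer["p50"] * 1000.0
    p99_ms = timer["p99"] * 1000.0
    return {
        "calls": timer["count"],
        "median_ms": p50_ms,
        "p99_ms": p99_ms,
        # How many times slower the slowest 1% of calls are than a typical call.
        "tail_ratio": p99_ms / p50_ms if p50_ms > 0 else None,
    }

# Fields copied from the stages.solr-index.solr-default.process example above.
stage_timer = {
    "count": 109195,
    "p50": 0.0030645000000000004,
    "p99": 0.02462230000000001,
    "duration_units": "seconds",
}
print(summarize_timer(stage_timer))
```

Here the median execution takes roughly 3 ms while the 99th percentile is roughly 25 ms, so the slowest calls are about eight times slower than a typical one.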

Metrics for successful completions of stages are stored in metrics named stages.index.stageType.stage.stageName.ok or stages.query.stageType.stage.stageName.ok, depending on whether the stage is part of an index pipeline or a query pipeline. Here is an example of the rates for successful runs of the 'solr-default' index pipeline stage (stages.index.solr-index.stage.solr-default.ok):

{
  "version" : "3.0.0",
  "gauges" : { },
  "counters" : { },
  "histograms" : { },
  "meters" : {
    "stages.index.solr-index.stage.solr-default.ok" : {
      "count" : 110855,
      "m15_rate" : 5.270163206842968,
      "m1_rate" : 8.485969925086419,
      "m5_rate" : 8.06785229981572,
      "mean_rate" : 9.18230056255745,
      "units" : "events/second"
    }
  },
  "timers" : { }
}

This shows the number of successful executions of the stage ("count"), the mean rate, and the 1-, 5- and 15-minute moving-average rates ('m1_rate', etc.). From the above, we can see that the solr-default stage has completed successfully 110,855 times, with a mean rate of 9.18 events per second.

If you prefer to see the metrics for an entire stage type, you can omit the stage name. The metric name then takes the form stages.index.stageType.ok (for an index pipeline) or stages.query.stageType.ok (for a query pipeline). Here is an example, using the solr-index stage type:

{
  "version" : "3.0.0",
  "gauges" : { },
  "counters" : { },
  "histograms" : { },
  "meters" : {
    "stages.index.solr-index.ok" : {
      "count" : 116425,
      "m15_rate" : 6.178851947720613,
      "m1_rate" : 8.814380052133192,
      "m5_rate" : 8.585203640734829,
      "mean_rate" : 9.19499774409566,
      "units" : "events/second"
    }
  },
  "timers" : { }
}

In this example, we see that stages of type solr-index have run successfully 116,425 times, with a mean rate of 9.19 events per second.
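The naming patterns in this section can be captured in two small helpers, which may be useful when looking up metrics from a script. This is a Python sketch based only on the example names shown above; the function names are hypothetical.

```python
def stage_process_metric(stage_type, stage_name):
    """Name of the timer covering all executions of one stage."""
    return f"stages.{stage_type}.{stage_name}.process"

def stage_ok_metric(pipeline_kind, stage_type, stage_name=None):
    """Name of the meter for successful runs of a stage, or of a whole stage type."""
    if pipeline_kind not in ("index", "query"):
        raise ValueError("pipeline_kind must be 'index' or 'query'")
    if stage_name is None:
        # No stage name: metrics for every stage of this type.
        return f"stages.{pipeline_kind}.{stage_type}.ok"
    return f"stages.{pipeline_kind}.{stage_type}.stage.{stage_name}.ok"

print(stage_process_metric("solr-index", "solr-default"))
print(stage_ok_metric("index", "solr-index", "solr-default"))
print(stage_ok_metric("index", "solr-index"))
```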

Web Service Endpoint Metrics

For each web service endpoint, Fusion keeps a timer recording the duration and rate of requests. The duration is calculated using an exponentially-weighted moving average with a heavy bias toward measurements from the last 5 minutes.

These metrics have names of the form com.lucidworks.apollo.resources.serviceName.methodName.weighted.timer. For example, the timer for the getCollectionMetrics method of the CollectionResource service is com.lucidworks.apollo.resources.CollectionResource.getCollectionMetrics.weighted.timer:

"com.lucidworks.apollo.resources.CollectionResource.getCollectionMetrics.weighted.timer" : {
      "count" : 2624,
      "max" : 0.134712,
      "mean" : 0.031589107976653694,
      "min" : 0.022424000000000003,
      "p50" : 0.028440000000000003,
      "p75" : 0.036908,
      "p95" : 0.044644449999999995,
      "p98" : 0.05026944,
      "p99" : 0.05444051000000004,
      "p999" : 0.134693411,
      "stddev" : 0.00936497282768644,
      "m15_rate" : 0.07113433590025664,
      "m1_rate" : 0.06387037028343223,
      "m5_rate" : 0.06218407166715861,
      "mean_rate" : 0.0663172057583814,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    }
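One way to use these timers is to scan them for endpoints whose 99th-percentile duration exceeds a chosen limit. The Python sketch below assumes the "timers" object has been parsed from JSON into a dictionary; the function name and the limit values are hypothetical.

```python
PREFIX = "com.lucidworks.apollo.resources."
SUFFIX = ".weighted.timer"

def slow_endpoints(timers, p99_limit_seconds):
    """Return (service, method, p99) for endpoint timers whose p99 exceeds the limit."""
    slow = []
    for name, timer in timers.items():
        if not (name.startswith(PREFIX) and name.endswith(SUFFIX)):
            continue  # not a web service endpoint timer
        # What remains between prefix and suffix is serviceName.methodName.
        service, _, method = name[len(PREFIX):-len(SUFFIX)].partition(".")
        if timer["p99"] > p99_limit_seconds:
            slow.append((service, method, timer["p99"]))
    return slow

timers = {
    "com.lucidworks.apollo.resources.CollectionResource.getCollectionMetrics.weighted.timer": {
        "p99": 0.05444051000000004,  # value from the example above
        "duration_units": "seconds",
    }
}
print(slow_endpoints(timers, 0.05))
```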

Solr Request Metrics

The system keeps track of the performance of requests to each Solr server that it communicates with.

The metrics have names in the form solr.solrIdentifier.requestType. The solrIdentifier is the address of the Solr instance, and the requestType can be 'get-requests', 'post-requests' or 'put-requests'.

This example shows get-requests to a Solr instance at address 10.0.1.8, port 8983:

{
  "version" : "3.0.0",
  "gauges" : { },
  "counters" : { },
  "histograms" : { },
  "meters" : { },
  "timers" : {
    "solr.10.0.1.8-8983.get-requests" : {
      "count" : 3170,
      "max" : 0.873981,
      "mean" : 0.2451200904669261,
      "min" : 0.001678,
      "p50" : 0.318176,
      "p75" : 0.48169550000000005,
      "p95" : 0.53017705,
      "p98" : 0.5617982399999999,
      "p99" : 0.6281221800000003,
      "p999" : 0.8710894970000004,
      "stddev" : 0.2448979377578966,
      "m15_rate" : 0.02059326561557774,
      "m1_rate" : 0.03249432457272969,
      "m5_rate" : 0.030788223074952624,
      "mean_rate" : 0.033875616252208286,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    }
  }
}

From this we can see that there have been 3,170 GET requests to that Solr instance, at a mean rate of about 0.03 requests per second, with a mean request duration of about 0.245 seconds.
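A metric name of this kind can be split back into its parts when processing metrics in a script. The Python sketch below assumes that the Solr identifier always has the address-port form seen in the example (solr.10.0.1.8-8983.get-requests); the function name is hypothetical.

```python
def parse_solr_metric(name):
    """Split solr.<address>-<port>.<requestType> into (address, port, request_type)."""
    prefix = "solr."
    if not name.startswith(prefix):
        raise ValueError(f"not a Solr request metric: {name}")
    # The request type follows the last dot; the address itself contains dots.
    identifier, request_type = name[len(prefix):].rsplit(".", 1)
    # The port follows the last hyphen of the identifier.
    address, port = identifier.rsplit("-", 1)
    return address, int(port), request_type

print(parse_solr_metric("solr.10.0.1.8-8983.get-requests"))
```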

Changing Metric Collection Frequency

By default, metrics are collected every 60 seconds. Since the metrics are stored in a system collection (and therefore in a Solr instance), the data can grow to be quite large over time. If you do not need metrics to be collected this frequently (perhaps during initial implementation), you can change the interval by modifying the com.lucidworks.apollo.metrics.poll.seconds configuration parameter with the Configurations API.

For example:

curl -u user:pass -X PUT -H 'Content-type: application/json' -d '600' http://localhost:8764/api/apollo/configurations/com.lucidworks.apollo.metrics.poll.seconds

To disable metrics collection, set the com.lucidworks.apollo.metrics.poll.seconds parameter to '-1':

curl -u user:pass -X PUT -H 'Content-type: application/json' -d '-1' http://localhost:8764/api/apollo/configurations/com.lucidworks.apollo.metrics.poll.seconds
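To get a feel for how the polling interval affects data volume, the number of collection runs per day can be estimated with simple arithmetic. This Python sketch is only an illustration: the function name is hypothetical, and treating every non-positive value as "disabled" is an assumption, since only -1 is documented above.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def polls_per_day(poll_seconds):
    """Estimate how many times per day metrics are collected."""
    if poll_seconds <= 0:
        return 0  # -1 disables collection; other non-positive values are assumed to as well
    return SECONDS_PER_DAY // poll_seconds

for interval in (60, 600, -1):
    print(interval, polls_per_day(interval))
```

Moving from the default 60 seconds to 600 seconds reduces the number of collection runs per day from 1,440 to 144.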