System Metrics

In version 4.1 and earlier, Fusion automatically creates the system collection system_metrics. It is empty until you manually enable system_metrics indexing; see below for instructions.

Types of Metrics Collected

There are several types of metrics:

  • Gauges: These are single values, valid for the point in time at which the metrics are collected.

  • Counters: These are values that are incremented or decremented over time.

  • Meters: These measure the rate of events over time. They include a mean rate, as well as 1-, 5- and 15-minute moving averages. Most of these moving averages are exponentially weighted, so that more recent values contribute more heavily than older values; exceptions to this rule have the word "unweighted" in their name.

  • Histograms: These measure the distribution of values. They report the minimum, maximum, mean, and the values at the 50th, 75th, 95th, 98th, 99th, and 99.9th percentiles.

  • Timers: A timer is a meter combined with a histogram; it measures the length of time that a particular operation takes (both mean duration and moving averages) as well as the distribution of those durations.
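
To make the idea of an exponentially weighted moving average concrete, here is a minimal sketch of how such a rate could be updated. The 5-second tick interval and the function names are assumptions chosen for illustration; the source does not describe how Fusion computes these values internally.

```python
import math

def ewma_alpha(window_minutes, tick_seconds=5.0):
    """Smoothing factor for a moving average over the given window.

    The 5-second tick is an assumption for this sketch.
    """
    return 1.0 - math.exp(-tick_seconds / (window_minutes * 60.0))

def update_rate(current_rate, events_in_tick, alpha, tick_seconds=5.0):
    """Blend the rate observed during the latest tick into the running average."""
    instant_rate = events_in_tick / tick_seconds
    return current_rate + alpha * (instant_rate - current_rate)

# Starting from an idle system, feed one tick containing 50 events.
one_minute = update_rate(0.0, 50, ewma_alpha(1))
fifteen_minute = update_rate(0.0, 50, ewma_alpha(15))
print(one_minute, fifteen_minute)
```

A shorter window gives a larger smoothing factor, so the 1-minute average reacts to a burst faster than the 15-minute average.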

Many of the metrics are for internal use by the system. However, you may be asked for a dump of the metrics data (available from the System API endpoint) to help diagnose performance issues. Some metrics are also subject to change pending performance tuning and additional testing.

Metrics of Particular Interest

Slow Web Service Calls

For each web service endpoint, the system keeps a list of the last several requests whose request time fell in the 99th percentile, that is, examples from the slowest 1% of requests for that endpoint. These are recorded as com.lucidworks.apollo.resources.serviceName.methodName.weighted.slow.examples, where serviceName is the name of the service and methodName is the name of a valid method for that service.

This information can be helpful when diagnosing performance issues. Here is an example showing five recorded slow calls to the getCollectionMetrics method of the CollectionResource service:

"com.lucidworks.apollo.resources.CollectionResource.getCollectionMetrics.weighted.slow.examples" : {
      "value" : [ {
        "requestUri" : "http://localhost:8764/api/collections/lws5_metrics/stats",
        "queryParams" : { },
        "userPrincipal" : null,
        "method" : "GET",
        "cookies" : { }
      }, {
        "requestUri" : "http://localhost:8764/api/collections/logs/stats",
        "queryParams" : { },
        "userPrincipal" : null,
        "method" : "GET",
        "cookies" : { }
      }, {
        "requestUri" : "http://localhost:8764/api/collections/logs/stats",
        "queryParams" : { },
        "userPrincipal" : null,
        "method" : "GET",
        "cookies" : { }
      }, {
        "requestUri" : "http://localhost:8764/api/collections/lws5_metrics/stats",
        "queryParams" : { },
        "userPrincipal" : null,
        "method" : "GET",
        "cookies" : { }
      }, {
        "requestUri" : "http://localhost:8764/api/collections/lws5_metrics/stats",
        "queryParams" : { },
        "userPrincipal" : null,
        "method" : "GET",
        "cookies" : { }
      } ]
    }
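
When a list like this is long, it can help to see which URIs recur. The following is a minimal sketch that counts the request URIs in a slow-examples entry. The helper name count_slow_uris is hypothetical, and the sample is an abbreviated copy of the output above.

```python
import json
from collections import Counter

def count_slow_uris(slow_examples):
    """Count how often each request URI appears among the recorded slow calls."""
    return Counter(entry["requestUri"] for entry in slow_examples["value"])

# Abbreviated sample in the shape shown above (three entries, fewer fields).
sample = json.loads("""
{
  "value": [
    {"requestUri": "http://localhost:8764/api/collections/lws5_metrics/stats", "method": "GET"},
    {"requestUri": "http://localhost:8764/api/collections/logs/stats", "method": "GET"},
    {"requestUri": "http://localhost:8764/api/collections/lws5_metrics/stats", "method": "GET"}
  ]
}
""")

for uri, hits in count_slow_uris(sample).most_common():
    print(hits, uri)
```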

System Memory

There are several memory-related metrics reported:

  • mem.heap.used: the current amount of heap memory, in bytes, used by the system.

  • mem.heap.max: the maximum amount of heap memory, in bytes, that the system could use.

  • mem.heap.usage: the fraction (0 to 1.0) of available heap memory that the system is currently using (this is equal to mem.heap.used / mem.heap.max).

  • mem.non-heap.used: the current amount of non-heap memory (also called "off-heap memory"), in bytes, used by the system.

  • mem.non-heap.max: the maximum amount of non-heap memory, in bytes, that the system could use.

  • mem.non-heap.usage: the fraction (0 to 1.0) of available non-heap memory that the system is currently using (this is equal to mem.non-heap.used / mem.non-heap.max).

  • mem.total.used: the current total amount of memory (heap plus non-heap), in bytes, used by the system.

  • mem.total.max: the maximum amount of total memory (heap plus non-heap), in bytes, that the system could use.

Here is an example of mem.heap.used:

{
  "version" : "3.0.0",
  "gauges" : {
    "mem.heap.used" : {
      "value" : 94783360
    }
  },
  "counters" : { },
  "histograms" : { },
  "meters" : { },
  "timers" : { }
}
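
The relationship between the used, max, and usage gauges can be checked with a few lines of code. This is a minimal sketch: the mem.heap.max value of 1 GiB is invented for illustration, and it is an assumption that mem.heap.max is reported in the same "value" shape as mem.heap.used.

```python
def heap_usage(gauges):
    """Return mem.heap.used / mem.heap.max as a fraction between 0 and 1.0."""
    used = gauges["mem.heap.used"]["value"]
    maximum = gauges["mem.heap.max"]["value"]
    return used / maximum

gauges = {
    "mem.heap.used": {"value": 94783360},    # value from the example above
    "mem.heap.max": {"value": 1073741824},   # hypothetical 1 GiB limit
}

print(f"heap usage: {heap_usage(gauges):.1%}")
```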

Query and Index Pipeline Stage Metrics

For each query pipeline and index pipeline stage, Fusion collects aggregate performance metrics for successful executions and for errors. All executions for each stage are stored in a metric named stages.stageType.stageName.process, where stageType is the type of stage, and stageName is the name of a specific stage.

Here is an example of the performance metrics returned for an index pipeline stage named 'solr-default' (stages.solr-index.solr-default.process), which is included with Fusion:

{
  "version" : "3.0.0",
  "gauges" : { },
  "counters" : { },
  "histograms" : { },
  "meters" : { },
  "timers" : {
    "stages.solr-index.solr-default.process" : {
      "count" : 109195,
      "max" : 0.128585,
      "mean" : 0.004011065175097276,
      "min" : 0.0022500000000000003,
      "p50" : 0.0030645000000000004,
      "p75" : 0.0033495,
      "p95" : 0.005410449999999992,
      "p98" : 0.014195759999999965,
      "p99" : 0.02462230000000001,
      "p999" : 0.12850243700000002,
      "stddev" : 0.007408363728123277,
      "m15_rate" : 11.957732876922531,
      "m1_rate" : 8.784289947811962,
      "m5_rate" : 9.037172472578138,
      "mean_rate" : 9.214233776748047,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    }
  }
}

This shows the number of times the stage has run ("count"); the minimum, maximum, and mean durations; the durations at the 50th, 75th, 95th, 98th, 99th, and 99.9th percentiles (p50, p75, and so on); and the mean rate together with the 1-, 5- and 15-minute moving-average rates ('m1_rate' and so on). In this case, the stage has been used 109,195 times, with a mean rate of 9.214 calls per second and a median (p50) duration of about 0.003 seconds.
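
Durations this small are often easier to read in milliseconds. The following minimal sketch pulls a few fields out of a timer entry and converts them. The function name summarize_timer is hypothetical, and the sample copies only some of the fields from the output above.

```python
def summarize_timer(timer):
    """Return the count, rate, and selected durations (in milliseconds) of a timer."""
    if timer["duration_units"] != "seconds":
        raise ValueError("this sketch only handles durations reported in seconds")
    return {
        "count": timer["count"],
        "mean_rate": timer["mean_rate"],
        "median_ms": timer["p50"] * 1000.0,
        "p99_ms": timer["p99"] * 1000.0,
        "max_ms": timer["max"] * 1000.0,
    }

# A subset of the fields from the stages.solr-index.solr-default.process timer.
sample_timer = {
    "count": 109195,
    "max": 0.128585,
    "p50": 0.0030645000000000004,
    "p99": 0.02462230000000001,
    "mean_rate": 9.214233776748047,
    "duration_units": "seconds",
}

for key, value in summarize_timer(sample_timer).items():
    print(key, value)
```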

Metrics for successful completions of stages are stored in metrics named stages.index.stageType.stage.stageName.ok or stages.query.stageType.stage.stageName.ok, depending on whether the stage is part of an index pipeline or a query pipeline. Here is an example of the rates for successful runs of the 'solr-default' index pipeline stage (stages.index.solr-index.stage.solr-default.ok):

{
  "version" : "3.0.0",
  "gauges" : { },
  "counters" : { },
  "histograms" : { },
  "meters" : {
    "stages.index.solr-index.stage.solr-default.ok" : {
      "count" : 110855,
      "m15_rate" : 5.270163206842968,
      "m1_rate" : 8.485969925086419,
      "m5_rate" : 8.06785229981572,
      "mean_rate" : 9.18230056255745,
      "units" : "events/second"
    }
  },
  "timers" : { }
}

This shows the number of successful runs of the stage ("count"), the mean rate, and the 1-, 5- and 15-minute moving-average rates ('m1_rate' and so on). From the above, we can see that the solr-default stage has completed successfully 110,855 times, with a mean rate of 9.18 events per second.

If you prefer to see the metrics for an entire stage type, omit the stage name. The metric name then takes the form stages.index.stageType.ok (for an index pipeline) or stages.query.stageType.ok (for a query pipeline). Here is an example, using the solr-index stage type:

{
  "version" : "3.0.0",
  "gauges" : { },
  "counters" : { },
  "histograms" : { },
  "meters" : {
    "stages.index.solr-index.ok" : {
      "count" : 116425,
      "m15_rate" : 6.178851947720613,
      "m1_rate" : 8.814380052133192,
      "m5_rate" : 8.585203640734829,
      "mean_rate" : 9.19499774409566,
      "units" : "events/second"
    }
  },
  "timers" : { }
}

In this example, we see that stages of the solr-index type have run successfully 116,425 times, with a mean rate of 9.19 events per second.
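
The naming patterns for the success metrics can be captured in a small helper. This is a minimal sketch based only on the patterns described in this section; the function name stage_ok_metric is hypothetical.

```python
def stage_ok_metric(pipeline_kind, stage_type, stage_name=None):
    """Build the name of the success meter for a stage or for a whole stage type.

    pipeline_kind is "index" or "query". Omit stage_name to get the
    metric that covers every stage of the given type.
    """
    if pipeline_kind not in ("index", "query"):
        raise ValueError("pipeline_kind must be 'index' or 'query'")
    if stage_name is None:
        return f"stages.{pipeline_kind}.{stage_type}.ok"
    return f"stages.{pipeline_kind}.{stage_type}.stage.{stage_name}.ok"

print(stage_ok_metric("index", "solr-index", "solr-default"))
print(stage_ok_metric("index", "solr-index"))
```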

Web Service Endpoint Metrics

For each web service endpoint, Fusion keeps a timer recording the duration and rate of requests. The duration statistics are calculated using an exponentially weighted moving average with a heavy bias toward measurements from the last 5 minutes.

These metrics have names in the form com.lucidworks.apollo.resources.serviceName.methodName.weighted.timer. For example, the timer for the getCollectionMetrics method of the CollectionResource service is com.lucidworks.apollo.resources.CollectionResource.getCollectionMetrics.weighted.timer:

"com.lucidworks.apollo.resources.CollectionResource.getCollectionMetrics.weighted.timer" : {
      "count" : 2624,
      "max" : 0.134712,
      "mean" : 0.031589107976653694,
      "min" : 0.022424000000000003,
      "p50" : 0.028440000000000003,
      "p75" : 0.036908,
      "p95" : 0.044644449999999995,
      "p98" : 0.05026944,
      "p99" : 0.05444051000000004,
      "p999" : 0.134693411,
      "stddev" : 0.00936497282768644,
      "m15_rate" : 0.07113433590025664,
      "m1_rate" : 0.06387037028343223,
      "m5_rate" : 0.06218407166715861,
      "mean_rate" : 0.0663172057583814,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    }

Solr Request Metrics

The system keeps track of the performance of requests to each Solr server that it communicates with.

The metrics have names in the form solr.solrIdentifier.requestType. The solrIdentifier is the address of the Solr instance, and the requestType can be 'get-requests', 'post-requests' or 'put-requests'.

This example shows get-requests to a Solr instance at address 10.0.1.8, port 8983:

{
  "version" : "3.0.0",
  "gauges" : { },
  "counters" : { },
  "histograms" : { },
  "meters" : { },
  "timers" : {
    "solr.10.0.1.8-8983.get-requests" : {
      "count" : 3170,
      "max" : 0.873981,
      "mean" : 0.2451200904669261,
      "min" : 0.001678,
      "p50" : 0.318176,
      "p75" : 0.48169550000000005,
      "p95" : 0.53017705,
      "p98" : 0.5617982399999999,
      "p99" : 0.6281221800000003,
      "p999" : 0.8710894970000004,
      "stddev" : 0.2448979377578966,
      "m15_rate" : 0.02059326561557774,
      "m1_rate" : 0.03249432457272969,
      "m5_rate" : 0.030788223074952624,
      "mean_rate" : 0.033875616252208286,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    }
  }
}

From this we can see that there have been 3,170 GET requests to that Solr instance, at a mean rate of about 0.03 requests per second and with a mean response time of about 0.245 seconds.
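
When processing many of these timers, it can be useful to split the metric name back into its parts. This minimal sketch assumes the Solr identifier always has the host-port shape seen in the example above; the function name parse_solr_metric is hypothetical.

```python
def parse_solr_metric(name):
    """Split a name such as solr.10.0.1.8-8983.get-requests into its parts.

    Assumes the identifier is written as host-port, as in the example above.
    """
    prefix = "solr."
    if not name.startswith(prefix):
        raise ValueError(f"not a Solr request metric: {name}")
    # The request type follows the last dot; the host itself may contain dots.
    identifier, request_type = name[len(prefix):].rsplit(".", 1)
    host, port = identifier.rsplit("-", 1)
    return host, int(port), request_type

print(parse_solr_metric("solr.10.0.1.8-8983.get-requests"))
```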