System Metrics

By default, Fusion captures system metrics in the var/log/metrics/metrics.log file, then indexes them asynchronously in the system_monitor system collection.

You can view these metrics in the DevOps Center.

Configuration

Most aspects of metrics collection can be configured in the fusion.properties file:

  • Metrics collection can be disabled with default.collectMetrics = false

  • The frequency of metrics collection can be adjusted with default.collectMetricsIntervalSecs = 30

  • Metrics can be shipped to a different Solr cluster or collection by adjusting the log-shipper.solrZk.connect and log-shipper.metricsSolrCollection properties.
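As a rough illustration of how these properties fit together, the following Python sketch parses a fusion.properties-style snippet and reports the effective metrics settings. The two helper functions, the sample ZooKeeper address, and the sample collection name are hypothetical, and the fallback values used when a key is absent are assumptions rather than documented Fusion defaults.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def effective_metrics_settings(props):
    """Summarize the metrics-related properties named in this section."""
    return {
        # Fallbacks below are assumptions used only when a key is absent.
        "enabled": props.get("default.collectMetrics", "true").lower() == "true",
        "interval_secs": int(props.get("default.collectMetricsIntervalSecs", "30")),
        "solr_zk": props.get("log-shipper.solrZk.connect"),
        "collection": props.get("log-shipper.metricsSolrCollection"),
    }


# Hypothetical sample; the ZooKeeper address and collection name are made up.
sample = """
default.collectMetrics = true
default.collectMetricsIntervalSecs = 60
log-shipper.solrZk.connect = zk-host:2181
log-shipper.metricsSolrCollection = my_metrics
"""
settings = effective_metrics_settings(parse_properties(sample))
print(settings)
```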

The retention period for system metrics is 30 days by default and can be configured in the delete-old-system-metrics Fusion task job, available in all apps.

How to adjust the system metrics retention period
  1. Navigate to Collections > Jobs.

  2. Select the delete-old-system-metrics job.

  3. In the job configuration pane, scroll down to REQUEST ENTITY (AS STRING).

  4. Change 30DAYS to the desired period of time to retain system metrics.
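If you script this change instead of editing it by hand, the substitution in step 4 could look something like the Python sketch below. The `with_retention` helper is hypothetical, and so is the sample request entity: it only stands in for whatever text your job actually contains, so copy the real value from the job configuration pane.

```python
def with_retention(request_entity, days):
    """Return the request entity with the 30DAYS token replaced by a new period."""
    if isinstance(days, bool) or not isinstance(days, int) or days <= 0:
        raise ValueError("days must be a positive integer")
    if "30DAYS" not in request_entity:
        raise ValueError("expected the default 30DAYS token in the request entity")
    return request_entity.replace("30DAYS", f"{days}DAYS")


# Hypothetical request entity, for illustration only; use the text from your own job.
example_entity = "<delete><query>timestamp_tdt:[* TO NOW-30DAYS]</query></delete>"
updated_entity = with_retention(example_entity, 14)
print(updated_entity)
```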

Metrics document fields

Each host metric and each service metric is stored as a single Solr document with a timestamp and the fields described below.

All metrics

  • id

    Unique autogenerated document identifier.

  • node_s

    Unique identifier of a Fusion node / server (autogenerated).

  • timestamp_tdt

    Timestamp of a metric.

  • view_s

    Type of a metric, either "host" or "service_instance".

  • type_s

    Store type of a metric, either "latest" or "history".

    • "history" is a snapshot of a metric at a particular time.

    • "latest" is a single document per host or service with the latest state that is constantly updated over time. It allows easy retrieval and aggregation for just the "latest" / "recent" state of the system.
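To show how the view_s and type_s fields might be used together, here is a small Python sketch that builds Solr query parameters for the "latest" documents of one view. The `latest_metrics_params` helper and the row limit are hypothetical; the parameter names (q, fq, sort, rows) are standard Solr query parameters, and where you send the query depends on your own Solr setup.

```python
from urllib.parse import urlencode


def latest_metrics_params(view, rows=100):
    """Build Solr query parameters for the 'latest' documents of one view."""
    if view not in ("host", "service_instance"):
        raise ValueError("view must be 'host' or 'service_instance'")
    return [
        ("q", "*:*"),
        ("fq", f"view_s:{view}"),        # host or service metrics
        ("fq", "type_s:latest"),         # one continuously updated doc each
        ("sort", "timestamp_tdt desc"),  # most recently updated first
        ("rows", str(rows)),
    ]


query_string = urlencode(latest_metrics_params("host"))
print(query_string)
```

Swapping "type_s:latest" for "type_s:history" would instead select the point-in-time snapshots.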

Host metrics

  • CPU

    • cpu_load_d

      Normalized CPU load, a floating-point value in the range [0.0, 1.0].

    • cpu_sys_d, cpu_user_d, cpu_wait_d, cpu_combined_d and cpu_idle_d

      Breakdown of CPU load by type. These are also floating-point values in the range [0.0, 1.0].

    • load_average_d

      System load average for the last minute, not normalized.

    • processors_l

      Number of CPU cores according to JVM.

  • Memory

    • memory_total_l and memory_free_l

      Total and free amounts of physical memory in bytes.

    • swap_total_l and swap_free_l

      Total and free swap in bytes.

  • Disk space

    • disk_total_l and disk_free_l

      Disk sizes of a partition where Fusion is installed (where var and data folders reside).

  • Uptime

    • host_uptime_l

      Total uptime of a host operating system (in milliseconds).

    • agent_uptime_l

      Uptime of Fusion agent service (in milliseconds).

  • Various Info

    • os_name_s, os_arch_s and os_version_s

      OS details according to JVM.

    • addresses_ss

      List of IP addresses according to network configuration.

    • hostname_s

      Main hostname or IP address of a server.
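The total and free fields are easier to read as utilization figures. The following Python sketch derives a few from a host metrics document; the `host_utilization` helper and all sample values are made up for illustration, and only the field names come from the list above.

```python
def host_utilization(doc):
    """Derive used fractions (0.0 to 1.0) from a host metrics document."""
    def used_fraction(total, free):
        # Guard against a zero total, e.g. a host with no swap configured.
        return (total - free) / total if total else 0.0

    return {
        "memory_used": used_fraction(doc["memory_total_l"], doc["memory_free_l"]),
        "swap_used": used_fraction(doc["swap_total_l"], doc["swap_free_l"]),
        "disk_used": used_fraction(doc["disk_total_l"], doc["disk_free_l"]),
        # load_average_d is not normalized, so divide by the core count.
        "load_per_core": doc["load_average_d"] / doc["processors_l"],
    }


# Made-up sample values for illustration only.
sample_doc = {
    "memory_total_l": 16_000_000_000, "memory_free_l": 4_000_000_000,
    "swap_total_l": 0, "swap_free_l": 0,
    "disk_total_l": 500_000_000_000, "disk_free_l": 125_000_000_000,
    "load_average_d": 2.0, "processors_l": 8,
}
utilization = host_utilization(sample_doc)
print(utilization)
```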

Service metrics

  • service_s

    The name of the service (for example, api or solr) to which this metric pertains.

  • status_s

    Status of the service according to the Agent (for example, RUNNING or STARTING).

  • pid_i

    Process ID.

  • address_s

    IP address or hostname that is configured for this service to run on (or the default).

  • Generic Java Metrics

    • java_process_cpu_load_d

      Normalized CPU load used by this service.

    • java_heap_max_l, java_heap_used_l and java_non_heap_used_l

      JVM memory metrics.

    • java_open_file_descriptors_i

      Number of open files according to JVM.

    • java_loaded_classes_i and java_unloaded_classes_i

      JVM class loading metrics, useful for spotting problems with dynamic redeployment of Web applications.

    • java_threads_i

      Total JVM threads.

    • gc_collection_count_l and gc_collection_time_l

      GC metrics like number of invocations and total time spent.

  • Jetty Metrics

    All Jetty based services provide low-level Jetty metrics such as the following:

    • jetty_request_time_mean_f

      Mean request time according to Jetty.

    • jetty_threads_i

      Number of Jetty threads.

    • jetty_responses_5xx_l, jetty_responses_4xx_l, and so on

      Number of responses per status.

  • Solr Metrics

    • solr_index_size_l

      Total Solr index size in bytes hosted on a Solr node.

    • solr_docs_l

      Total number of Solr documents hosted on a Solr node.

    • solr_requests_l

      Total number of Solr requests to all cores on a Solr node.

  • ZooKeeper Metrics

    • zk_connections_i

      Number of connections to the ZooKeeper node.

    • zk_znodes_l

      Number of ZooKeeper nodes.

    • zk_watches_i

      Number of ZooKeeper watches.

    • zk_ephemerals_i

      Number of ephemeral ZooKeeper nodes.

    • zk_size_l

      ZooKeeper size in bytes.

  • API metrics

    • api_query_pipelines_http_one_minute_rate_f, api_query_pipelines_http_mean_f, and so on

      Query pipeline metrics like rate of query requests to the HTTP endpoint or to Solr and mean response times.

    • api_index_pipelines_http_one_minute_rate_f, api_index_pipelines_http_mean_f, and so on

      Index pipeline metrics like rate of index requests to the HTTP endpoint or to Solr and mean response times.

  • Proxy metrics

    • proxy_active_sessions_l

      Number of active auth sessions.

    • proxy_sessions_one_minute_rate_f

      Rate of new auth sessions per minute (per node). This metric is captured once per second, then presented as a moving average over the last minute.
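As one possible use of the service fields, the Python sketch below turns a service metrics document into a list of warnings. The `service_warnings` helper and the 90% heap threshold are assumptions chosen for illustration, not Fusion behavior; the sample documents are made up.

```python
def service_warnings(doc, heap_threshold=0.9):
    """Return human-readable warnings for one service metrics document."""
    warnings = []
    if doc.get("status_s") != "RUNNING":
        warnings.append(f"{doc['service_s']}: status is {doc.get('status_s')}")
    heap_max = doc.get("java_heap_max_l")
    heap_used = doc.get("java_heap_used_l")
    # Only compute the ratio when both values are present and max is non-zero.
    if heap_max and heap_used is not None:
        ratio = heap_used / heap_max
        if ratio >= heap_threshold:
            warnings.append(f"{doc['service_s']}: heap at {ratio:.0%}")
    return warnings


# Made-up documents for illustration only.
api_doc = {"service_s": "api", "status_s": "RUNNING",
           "java_heap_max_l": 1000, "java_heap_used_l": 950}
solr_doc = {"service_s": "solr", "status_s": "STARTING"}
print(service_warnings(api_doc))
print(service_warnings(solr_doc))
```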

Legacy metrics collection

The metrics collection features described below are deprecated in Fusion 4.2 and will be removed in a future release.

In version 4.1 and earlier, Fusion automatically creates the system collection system_metrics. It is empty until you manually enable system_metrics indexing; see below for instructions. In version 4.2 and above, Fusion creates the collection when you enable system_metrics indexing.

Note
The jobs that produce legacy metrics are not automatically linked to your app when you enable legacy metrics, so they will not automatically appear in the Jobs panel. You can find them in the Object Explorer and link them to your app.

How to enable metrics indexing in the Fusion UI
  1. Navigate to System > System > Metrics.

  2. Enable Record System Metrics Over Time.

How to enable metrics indexing using the REST API
curl -u admin:password123 -H 'Content-type:application/json' -X PUT -d 'true' "http://localhost:8764/api/configurations/com.lucidworks.apollo.metrics.enabled"
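
The same request can be prepared from Python with the standard library. This is a minimal sketch that mirrors the curl command above; the `build_config_put` helper is hypothetical, and the request is only built here, not sent.

```python
import base64
import json
import urllib.request


def build_config_put(base_url, key, value, user, password):
    """Build (but do not send) a PUT request that sets one configuration key."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/api/configurations/{key}",
        data=json.dumps(value).encode("utf-8"),  # True becomes the JSON literal true
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="PUT",
    )


request = build_config_put(
    "http://localhost:8764",
    "com.lucidworks.apollo.metrics.enabled",
    True,
    "admin",
    "password123",
)
print(request.get_method(), request.full_url, request.data)
# To actually send it, call urllib.request.urlopen(request).
```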

There are around 600 different metrics available. In this topic we’ve highlighted a few that are likely to be the most useful or interesting to you.

The /system/metrics endpoint of the System API lists all the metrics that the system is currently collecting. Metrics are returned for the current instance only; Fusion instances do not aggregate metrics between nodes.
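
To get a feel for how many metrics an instance reports, you could count the entries in a saved response. The sketch below assumes the response uses the JSON envelope seen in the examples in this topic (gauges, counters, histograms, meters, timers); the `count_metrics_by_type` helper is hypothetical.

```python
import json

METRIC_TYPES = ("gauges", "counters", "histograms", "meters", "timers")


def count_metrics_by_type(dump_text):
    """Count metric names under each type section of a metrics dump."""
    dump = json.loads(dump_text)
    return {kind: len(dump.get(kind, {})) for kind in METRIC_TYPES}


# A tiny dump in the same shape as the examples in this topic.
sample_dump = """
{
  "version" : "3.0.0",
  "gauges" : { "mem.heap.used" : { "value" : 94783360 } },
  "counters" : { },
  "histograms" : { },
  "meters" : { },
  "timers" : { }
}
"""
counts = count_metrics_by_type(sample_dump)
print(counts, "total:", sum(counts.values()))
```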

Types of Metrics Collected

There are several types of metrics:

  • Gauges: These are single values, valid for the point in time at which the metrics are collected.

  • Counters: These are values that are incremented or decremented over time.

  • Meters: These measure the rate of events over time. They include a mean rate, as well as a 1-, 5- and 15-minute moving average. Most of these moving averages are exponentially weighted, so that more recent values contribute more heavily than older values; exceptions to this rule have the word "unweighted" in their name.

  • Histograms: These measure the distribution of values. They will report the minimum, maximum, mean, and the values at the 50th, 75th, 95th, 98th, 99th, and 99.9th percentiles.

  • Timers: A timer is a meter combined with a histogram; it measures the length of time that a particular operation takes (both mean duration and moving averages) as well as the distribution of those durations.
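
To make the idea of an exponentially weighted moving average concrete, here is a minimal Python sketch of a one-minute rate. It follows the general approach used by common metrics libraries; the class name, the 5-second tick interval, and the smoothing formula are assumptions for illustration and are not taken from Fusion's implementation.

```python
import math


class MovingAverageRate:
    """Exponentially weighted moving average of an event rate."""

    def __init__(self, window_minutes, tick_seconds=5.0):
        self.tick_seconds = tick_seconds
        # Smaller alpha for longer windows, so older ticks keep more weight.
        self.alpha = 1.0 - math.exp(-tick_seconds / (60.0 * window_minutes))
        self.rate = None      # events per second; None until the first tick
        self.uncounted = 0    # events seen since the last tick

    def mark(self, n=1):
        """Record n events."""
        self.uncounted += n

    def tick(self):
        """Fold the events seen during the last interval into the average."""
        instant_rate = self.uncounted / self.tick_seconds
        self.uncounted = 0
        if self.rate is None:
            self.rate = instant_rate
        else:
            self.rate += self.alpha * (instant_rate - self.rate)
        return self.rate


m1 = MovingAverageRate(window_minutes=1)
for _ in range(12):        # one minute of steady traffic: 50 events per tick
    m1.mark(50)
    m1.tick()
steady = m1.rate           # steady traffic keeps the average at the true rate
after_idle = m1.tick()     # a single idle tick pulls the average down a little
print(steady, after_idle)
```

Because recent ticks carry the most weight, a short pause lowers the one-minute rate only slightly, while a 15-minute rate built the same way would move even less.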

Many of the metrics are for internal use by the system. However, Lucidworks Support may ask for a dump of the metrics data (obtained from the System API endpoint) to help diagnose performance issues. Some metrics are also subject to change pending performance tuning and additional testing.

Metrics of Particular Interest

Slow Web Service Calls

For each web service endpoint, the system keeps a list of the last several requests whose request time fell in the 99th percentile; that is, examples from the slowest 1% of requests for that endpoint. These are recorded as com.lucidworks.apollo.resources.serviceName.methodName.weighted.slow.examples, where serviceName is the name of the service and methodName is the name of a valid method for that service.

This information might be helpful when diagnosing performance issues. Here is an example of the 5 slowest calls to the getCollectionMetrics method of the CollectionResource service:

"com.lucidworks.apollo.resources.CollectionResource.getCollectionMetrics.weighted.slow.examples" : {
      "value" : [ {
        "requestUri" : "http://localhost:8764/api/collections/lws5_metrics/stats",
        "queryParams" : { },
        "userPrincipal" : null,
        "method" : "GET",
        "cookies" : { }
      }, {
        "requestUri" : "http://localhost:8764/api/collections/logs/stats",
        "queryParams" : { },
        "userPrincipal" : null,
        "method" : "GET",
        "cookies" : { }
      }, {
        "requestUri" : "http://localhost:8764/api/collections/logs/stats",
        "queryParams" : { },
        "userPrincipal" : null,
        "method" : "GET",
        "cookies" : { }
      }, {
        "requestUri" : "http://localhost:8764/api/collections/lws5_metrics/stats",
        "queryParams" : { },
        "userPrincipal" : null,
        "method" : "GET",
        "cookies" : { }
      }, {
        "requestUri" : "http://localhost:8764/api/collections/lws5_metrics/stats",
        "queryParams" : { },
        "userPrincipal" : null,
        "method" : "GET",
        "cookies" : { }
      } ]
    }
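
When the list is long, it can help to group the slow examples by request URI. The Python sketch below does this for a trimmed copy of the example above; the `slow_uri_counts` helper is hypothetical.

```python
from collections import Counter


def slow_uri_counts(metric):
    """Count how often each request URI appears among the slow examples."""
    return Counter(example["requestUri"] for example in metric["value"])


# Trimmed-down version of the example above: only requestUri is kept.
base = "http://localhost:8764/api/collections"
metric = {"value": [
    {"requestUri": f"{base}/lws5_metrics/stats"},
    {"requestUri": f"{base}/logs/stats"},
    {"requestUri": f"{base}/logs/stats"},
    {"requestUri": f"{base}/lws5_metrics/stats"},
    {"requestUri": f"{base}/lws5_metrics/stats"},
]}
uri_counts = slow_uri_counts(metric)
for uri, count in uri_counts.most_common():
    print(count, uri)
```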

System Memory

There are several memory-related metrics reported:

  • mem.heap.used: the current amount of heap memory, in bytes, used by the system.

  • mem.heap.max: the maximum amount of heap memory, in bytes, that the system could use.

  • mem.heap.usage: the fraction (0.0 to 1.0) of available heap memory that the system is currently using (this is equal to mem.heap.used / mem.heap.max).

  • mem.non-heap.used: the current amount of non-heap memory (also called "off-heap memory"), in bytes, used by the system.

  • mem.non-heap.max: the maximum amount of non-heap memory, in bytes, that the system could use.

  • mem.non-heap.usage: the fraction (0.0 to 1.0) of available non-heap memory that the system is currently using (this is equal to mem.non-heap.used / mem.non-heap.max).

  • mem.total.used: the current total amount of memory (heap plus non-heap), in bytes, used by the system.

  • mem.total.max: the maximum amount of total memory (heap plus non-heap), in bytes, that the system could use.

Here is an example of mem.heap.used:

{
  "version" : "3.0.0",
  "gauges" : {
    "mem.heap.used" : {
      "value" : 94783360
    }
  },
  "counters" : { },
  "histograms" : { },
  "meters" : { },
  "timers" : { }
}
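
The usage metrics are simply used divided by max, which the following Python sketch reproduces from the gauges section of a response. The `usage_ratio` helper is hypothetical, and the mem.heap.max value of 1 GiB is an assumed figure, since the example above shows only mem.heap.used.

```python
def usage_ratio(gauges, prefix):
    """Compute used / max for a memory area such as 'mem.heap'."""
    used = gauges[f"{prefix}.used"]["value"]
    maximum = gauges[f"{prefix}.max"]["value"]
    # A JVM can report -1 when no maximum is defined, so no ratio exists.
    if maximum <= 0:
        return None
    return used / maximum


gauges = {
    "mem.heap.used": {"value": 94783360},   # from the example above
    "mem.heap.max": {"value": 1073741824},  # assumed 1 GiB, not from the example
}
ratio = usage_ratio(gauges, "mem.heap")
print(f"heap usage: {ratio:.1%}")
```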

Query and Index Pipeline Stage Metrics

For each query pipeline and index pipeline stage, Fusion collects aggregate performance metrics for successful executions and for errors. All executions for each stage are stored in a metric named stages.stageType.stageName.process, where stageType is the type of stage, and stageName is the name of a specific stage.

Here is an example of the response to a request for the performance metrics of an index pipeline stage named 'solr-default' (stages.solr-index.solr-default.process), which is included with Fusion:

{
  "version" : "3.0.0",
  "gauges" : { },
  "counters" : { },
  "histograms" : { },
  "meters" : { },
  "timers" : {
    "stages.solr-index.solr-default.process" : {
      "count" : 109195,
      "max" : 0.128585,
      "mean" : 0.004011065175097276,
      "min" : 0.0022500000000000003,
      "p50" : 0.0030645000000000004,
      "p75" : 0.0033495,
      "p95" : 0.005410449999999992,
      "p98" : 0.014195759999999965,
      "p99" : 0.02462230000000001,
      "p999" : 0.12850243700000002,
      "stddev" : 0.007408363728123277,
      "m15_rate" : 11.957732876922531,
      "m1_rate" : 8.784289947811962,
      "m5_rate" : 9.037172472578138,
      "mean_rate" : 9.214233776748047,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    }
  }
}

This shows the number of times the stage has run ("count"), the maximum and minimum durations, the mean, the 50th, 75th, 95th, 98th, 99th, and 99.9th percentiles (p50, p75, and so on), and the mean rates over 1-, 5-, and 15-minute intervals ('m1_rate' and so on). In this case, the stage has run 109,195 times, at a mean rate of 9.214 calls per second, with a median (50th percentile) duration of about 0.003 seconds.
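
Durations this small are often easier to read in milliseconds. The Python sketch below condenses a timer into a few headline numbers; the `summarize_timer` helper and its choice of fields are hypothetical, and the input is a subset of the example above.

```python
def summarize_timer(timer):
    """Condense a timer into a few headline numbers, with durations in ms."""
    if timer["duration_units"] != "seconds":
        raise ValueError("this sketch only handles durations in seconds")

    def to_ms(seconds):
        return round(seconds * 1000.0, 3)

    return {
        "count": timer["count"],
        "calls_per_second": round(timer["mean_rate"], 3),
        "median_ms": to_ms(timer["p50"]),
        "p99_ms": to_ms(timer["p99"]),
        "max_ms": to_ms(timer["max"]),
    }


timer = {  # subset of the fields shown in the example above
    "count": 109195,
    "max": 0.128585,
    "p50": 0.0030645000000000004,
    "p99": 0.02462230000000001,
    "mean_rate": 9.214233776748047,
    "duration_units": "seconds",
}
summary = summarize_timer(timer)
print(summary)
```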

Metrics for successful completions of stages are stored in metrics named stages.index.stageType.stage.stageName.ok or stages.query.stageType.stage.stageName.ok, depending on whether the stage is part of an index pipeline or a query pipeline. Here is an example of the mean rates for successful runs of the 'solr-default' index pipeline stage (stages.index.solr-index.stage.solr-default.ok):

{
  "version" : "3.0.0",
  "gauges" : { },
  "counters" : { },
  "histograms" : { },
  "meters" : {
    "stages.index.solr-index.stage.solr-default.ok" : {
      "count" : 110855,
      "m15_rate" : 5.270163206842968,
      "m1_rate" : 8.485969925086419,
      "m5_rate" : 8.06785229981572,
      "mean_rate" : 9.18230056255745,
      "units" : "events/second"
    }
  },
  "timers" : { }
}

This shows the number of successful runs of the stage ("count") and the mean rates over 1-, 5-, and 15-minute intervals ('m1_rate' and so on). From the above, we can see that the solr-default stage has completed successfully 110,855 times, with a mean rate of 9.18 events per second.
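
Comparing the 1-minute and 15-minute rates gives a quick sense of whether traffic is picking up or tailing off. The Python sketch below is one way to do that; the `rate_trend` helper and its 10% tolerance are assumptions for illustration, and the input rates are taken from the example above.

```python
def rate_trend(meter, tolerance=0.1):
    """Compare the 1-minute rate to the 15-minute rate of a meter."""
    recent, longer = meter["m1_rate"], meter["m15_rate"]
    if longer == 0:
        return "idle" if recent == 0 else "rising"
    change = (recent - longer) / longer
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "falling"
    return "steady"


# Rates from the example above.
meter = {"m1_rate": 8.485969925086419, "m15_rate": 5.270163206842968}
trend = rate_trend(meter)
print(trend)
```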

If you prefer to see metrics for an entire stage type, omit the stage name. The metric name then takes the form stages.index.stageType.ok (for an index pipeline) or stages.query.stageType.ok (for a query pipeline). Here is an example, using the solr-index stage type:

{
  "version" : "3.0.0",
  "gauges" : { },
  "counters" : { },
  "histograms" : { },
  "meters" : {
    "stages.index.solr-index.ok" : {
      "count" : 116425,
      "m15_rate" : 6.178851947720613,
      "m1_rate" : 8.814380052133192,
      "m5_rate" : 8.585203640734829,
      "mean_rate" : 9.19499774409566,
      "units" : "events/second"
    }
  },
  "timers" : { }
}

In this example, we see that the solr-index stage has been successfully run 116,425 times, with a mean rate of 9.19 events per second.
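
The naming patterns in this section can be captured in two small helpers, sketched here in Python. The function names are hypothetical; the name patterns are the ones described above.

```python
def stage_process_metric(stage_type, stage_name):
    """Name of the timer covering all executions of one stage."""
    return f"stages.{stage_type}.{stage_name}.process"


def stage_ok_metric(pipeline_kind, stage_type, stage_name=None):
    """Name of the meter for successful runs of a stage or a whole stage type."""
    if pipeline_kind not in ("index", "query"):
        raise ValueError("pipeline_kind must be 'index' or 'query'")
    if stage_name is None:
        # No stage name: metrics for the entire stage type.
        return f"stages.{pipeline_kind}.{stage_type}.ok"
    return f"stages.{pipeline_kind}.{stage_type}.stage.{stage_name}.ok"


print(stage_process_metric("solr-index", "solr-default"))
print(stage_ok_metric("index", "solr-index", "solr-default"))
print(stage_ok_metric("index", "solr-index"))
```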

Web Service Endpoint Metrics

For each web service endpoint, Fusion keeps a timer recording the duration and rate of requests. The duration is calculated using an exponentially weighted moving average with a heavy bias toward measurements from the last 5 minutes.

These metrics have names of the form com.lucidworks.apollo.resources.serviceName.methodName.weighted.timer. For example, here is com.lucidworks.apollo.resources.CollectionResource.getCollectionMetrics.weighted.timer:

"com.lucidworks.apollo.resources.CollectionResource.getCollectionMetrics.weighted.timer" : {
      "count" : 2624,
      "max" : 0.134712,
      "mean" : 0.031589107976653694,
      "min" : 0.022424000000000003,
      "p50" : 0.028440000000000003,
      "p75" : 0.036908,
      "p95" : 0.044644449999999995,
      "p98" : 0.05026944,
      "p99" : 0.05444051000000004,
      "p999" : 0.134693411,
      "stddev" : 0.00936497282768644,
      "m15_rate" : 0.07113433590025664,
      "m1_rate" : 0.06387037028343223,
      "m5_rate" : 0.06218407166715861,
      "mean_rate" : 0.0663172057583814,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    }
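
Given a set of timers, the naming pattern above makes it possible to pick out endpoints whose 99th-percentile duration exceeds some limit. The Python sketch below shows one way; both helpers and the 0.05-second limit are hypothetical, and the sample timer keeps only the p99 value from the example above.

```python
PREFIX = "com.lucidworks.apollo.resources."
SUFFIX = ".weighted.timer"


def parse_endpoint_timer(name):
    """Split a web service timer name into (serviceName, methodName)."""
    if not (name.startswith(PREFIX) and name.endswith(SUFFIX)):
        return None
    middle = name[len(PREFIX):-len(SUFFIX)]
    service, _, method = middle.rpartition(".")
    return (service, method) if service and method else None


def slow_endpoints(timers, p99_limit_seconds):
    """List (service, method, p99) for endpoint timers above the limit."""
    hits = []
    for name, timer in timers.items():
        parsed = parse_endpoint_timer(name)
        if parsed and timer["p99"] > p99_limit_seconds:
            hits.append((parsed[0], parsed[1], timer["p99"]))
    # Slowest endpoints first.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


timers = {
    PREFIX + "CollectionResource.getCollectionMetrics" + SUFFIX: {
        "p99": 0.05444051000000004,  # from the example above
    },
}
print(slow_endpoints(timers, p99_limit_seconds=0.05))
```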

Solr Request Metrics

The system keeps track of the performance of requests to each Solr server that it communicates with.

The metrics have names in the form solr.solrIdentifier.requestType. The solrIdentifier is the address of the Solr instance, and the requestType can be 'get-requests', 'post-requests' or 'put-requests'.

This example shows get-requests to a Solr instance at address 10.0.1.8 and port 8983:

{
  "version" : "3.0.0",
  "gauges" : { },
  "counters" : { },
  "histograms" : { },
  "meters" : { },
  "timers" : {
    "solr.10.0.1.8-8983.get-requests" : {
      "count" : 3170,
      "max" : 0.873981,
      "mean" : 0.2451200904669261,
      "min" : 0.001678,
      "p50" : 0.318176,
      "p75" : 0.48169550000000005,
      "p95" : 0.53017705,
      "p98" : 0.5617982399999999,
      "p99" : 0.6281221800000003,
      "p999" : 0.8710894970000004,
      "stddev" : 0.2448979377578966,
      "m15_rate" : 0.02059326561557774,
      "m1_rate" : 0.03249432457272969,
      "m5_rate" : 0.030788223074952624,
      "mean_rate" : 0.033875616252208286,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    }
  }
}

From this we can see that there have been 3,170 GET requests to that Solr instance, and the mean request rate is about 0.03 requests per second.
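
If you need to group these timers by Solr instance, the metric name can be split back into its parts. The Python sketch below assumes the solrIdentifier always has the address-port shape seen in the example above, which this topic does not confirm; the `parse_solr_metric` helper is hypothetical.

```python
def parse_solr_metric(name):
    """Split 'solr.<address>-<port>.<requestType>' into its parts."""
    if not name.startswith("solr."):
        raise ValueError(f"not a Solr request metric: {name}")
    # The request type follows the last dot; the port follows the last hyphen
    # of what remains, so dotted IP addresses are left intact.
    identifier, _, request_type = name[len("solr."):].rpartition(".")
    address, _, port = identifier.rpartition("-")
    if not address or not port.isdigit():
        raise ValueError(f"unexpected Solr identifier: {identifier}")
    return address, int(port), request_type


parsed_metric = parse_solr_metric("solr.10.0.1.8-8983.get-requests")
print(parsed_metric)
```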

Changing Metric Collection Frequency

By default, metrics are collected every 60 seconds. Because the metrics are stored in a system collection (and therefore in a Solr instance), the data can grow quite large over time. If you do not need metrics to be collected this frequently (perhaps during initial implementation), you can change the interval by modifying the com.lucidworks.apollo.metrics.poll.seconds configuration parameter with the Configurations API.

For example:

curl -u user:pass -X PUT -H 'Content-type: application/json' -d '600' http://localhost:8764/api/configurations/com.lucidworks.apollo.metrics.poll.seconds
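
To put that change in perspective, the short Python sketch below compares how many collection cycles run per day at different intervals. The `polls_per_day` helper is hypothetical, and how much data each cycle stores is not covered here.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def polls_per_day(poll_seconds):
    """Number of metrics collection cycles in one day for a given interval."""
    if poll_seconds <= 0:
        # Non-positive values, such as -1, are treated here as "disabled".
        return 0
    return SECONDS_PER_DAY // poll_seconds


default_cycles = polls_per_day(60)    # the default interval
reduced_cycles = polls_per_day(600)   # the interval set by the command above
print(default_cycles, reduced_cycles)
```

Going from 60 to 600 seconds therefore cuts the number of collection cycles by a factor of ten.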

To disable metrics collection, set the com.lucidworks.apollo.metrics.poll.seconds parameter to -1:

curl -u user:pass -X PUT -H 'Content-type: application/json' -d '-1' http://localhost:8764/api/configurations/com.lucidworks.apollo.metrics.poll.seconds