System Metrics

By default, Fusion captures the following system metrics in the var/log/metrics/metrics.log file, then indexes them asynchronously in the system_monitor system collection:

Host/server metrics (CPU, memory, disk space usage, and so on)
Service metrics (process CPU, Java heap memory usage, and so on)

You can view these metrics in the DevOps Center.

Configuration

Most aspects of metrics collection can be configured in the fusion.properties file:

Metrics collection can be disabled with default.collectMetrics = false
The frequency of metrics collection can be adjusted with default.collectMetricsIntervalSecs = 30
Metrics can be shipped to a different Solr cluster or collection by adjusting the log-shipper.solrZk.connect and log-shipper.metricsSolrCollection properties.
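
As a rough illustration of how these settings fit together, the sketch below parses a fusion.properties-style fragment and reads the two collection settings. The sample values, the helper names parse_properties and metrics_settings, and the fallback defaults (enabled, 30 seconds) are assumptions made for this example, not documented Fusion behavior.

```python
# Hypothetical excerpt of fusion.properties; the values are illustrative only.
SAMPLE = """\
# Metrics collection
default.collectMetrics = true
default.collectMetricsIntervalSecs = 60
log-shipper.metricsSolrCollection = system_monitor
"""


def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def metrics_settings(props):
    """Return (enabled, interval_secs); the fallbacks are assumed defaults."""
    enabled = props.get("default.collectMetrics", "true").lower() == "true"
    interval = int(props.get("default.collectMetricsIntervalSecs", "30"))
    return enabled, interval


print(metrics_settings(parse_properties(SAMPLE)))
```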

The retention period for system metrics is 30 days by default and can be configured in the delete-old-system-metrics Fusion task job, available in all apps.

Metrics document fields

Host metrics and service metrics are each stored as a single Solr document with a timestamp and the fields described below.

All metrics

id

Unique autogenerated document identifier.

node_s

Unique identifier of a Fusion node or server (autogenerated).

timestamp_tdt

Timestamp of the metric.

view_s

Type of metric, either "host" or "service_instance".

type_s

Store type of the metric, either "latest" or "history".
- "history" is a snapshot of a metric at a particular time.
- "latest" is a single document per host or service that holds the latest state and is updated continuously. This allows easy retrieval and aggregation of just the most recent state of the system.
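
One way to use the view_s and type_s fields is to filter a Solr query down to the current state of each host or service. The following is a minimal sketch that only builds the query string. The function name latest_metrics_query and the choice of a standard Solr select request are assumptions for illustration.

```python
from urllib.parse import urlencode


def latest_metrics_query(view, rows=100):
    """Build Solr select parameters for the "latest" documents of one view."""
    if view not in ("host", "service_instance"):
        raise ValueError("view must be 'host' or 'service_instance'")
    params = {
        "q": "*:*",
        # Two filter queries: one for the metric type, one for the store type.
        "fq": ["view_s:" + view, "type_s:latest"],
        "rows": rows,
        "wt": "json",
    }
    # doseq=True emits one fq parameter per list element.
    return urlencode(params, doseq=True)


print(latest_metrics_query("host"))
```

The resulting string could be appended to the select URL of the collection that holds the metrics; the host, port, and path depend on your deployment.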

Host metrics

CPU
- cpu_load_d
  
  Normalized CPU load, a floating-point value in the range [0.0, 1.0].
- cpu_sys_d, cpu_user_d, cpu_wait_d, cpu_combined_d and cpu_idle_d
  
  Breakdown of CPU load by type. These are also floating-point values in the range [0.0, 1.0].
- load_average_d
  
  System load average for the last minute, not normalized.
- processors_l
  
  Number of CPU cores according to the JVM.
Memory
- memory_total_l and memory_free_l
  
  Total and free amounts of physical memory in bytes.
- swap_total_l and swap_free_l
  
  Total and free swap in bytes.
Disk space
- disk_total_l and disk_free_l
  
  Total and free disk space on the partition where Fusion is installed (where the var and data folders reside).
Uptime
- host_uptime_l
  
  Total uptime of the host operating system (in milliseconds).
- agent_uptime_l
  
  Uptime of the Fusion agent service (in milliseconds).
Various Info
- os_name_s, os_arch_s and os_version_s
  
  OS details according to the JVM.
- addresses_ss
  
  List of IP addresses according to network configuration.
- hostname_s
  
  Main hostname or IP address of a server.
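
The raw host fields are often easier to read as derived ratios. The sketch below computes memory and disk usage fractions and the load average per core from one host metrics document. The function name host_summary and the sample values are invented for illustration and are not real Fusion output.

```python
def host_summary(doc):
    """Derive usage fractions from one host metrics document."""
    return {
        # Fraction of physical memory in use.
        "memory_used": 1.0 - doc["memory_free_l"] / doc["memory_total_l"],
        # Fraction of the Fusion partition in use.
        "disk_used": 1.0 - doc["disk_free_l"] / doc["disk_total_l"],
        # load_average_d is not normalized, so divide by the core count.
        "load_per_core": doc["load_average_d"] / doc["processors_l"],
    }


# Illustrative values only.
sample_doc = {
    "memory_total_l": 16 * 2**30,
    "memory_free_l": 4 * 2**30,
    "disk_total_l": 500 * 10**9,
    "disk_free_l": 125 * 10**9,
    "load_average_d": 2.0,
    "processors_l": 4,
}

print(host_summary(sample_doc))
```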

Service metrics

service_s

The name of the service (for example, api or solr) to which this metric pertains.

status_s

Status of the service according to the Fusion agent (for example, RUNNING or STARTING).

pid_i

Process ID.

address_s

IP address or hostname that this service is configured to run on (or the default).

Generic Java Metrics
- java_process_cpu_load_d
  
  Normalized CPU load used by this service.
- java_heap_max_l, java_heap_used_l and java_non_heap_used_l
  
  JVM memory metrics.
- java_open_file_descriptors_i
  
  Number of open file descriptors according to the JVM.
- java_loaded_classes_i and java_unloaded_classes_i
  
  JVM class loading metrics, useful for spotting problems with dynamic redeployment of Web applications.
- java_threads_i
  
  Total JVM threads.
- gc_collection_count_l and gc_collection_time_l
  
  Garbage collection (GC) metrics: the number of collections and the total time spent in them.
Jetty Metrics

All Jetty-based services provide low-level Jetty metrics such as the following:
- jetty_request_time_mean_f
  
  Mean request time according to Jetty.
- jetty_threads_i
  
  Number of Jetty threads.
- jetty_responses_5xx_l, jetty_responses_4xx_l, and so on
  
  Number of responses per status.
Solr Metrics
- solr_index_size_l
  
  Total Solr index size in bytes hosted on a Solr node.
- solr_docs_l
  
  Total number of Solr documents hosted on a Solr node.
- solr_requests_l
  
  Total number of Solr requests to all cores on a Solr node.
ZooKeeper Metrics
- zk_connections_i
  
  Number of client connections to the ZooKeeper node.
- zk_znodes_l
  
  Number of znodes (ZooKeeper data nodes).
- zk_watches_i
  
  Number of ZooKeeper watches.
- zk_ephemerals_i
  
  Number of ephemeral ZooKeeper nodes.
- zk_size_l
  
  ZooKeeper size in bytes.
API metrics
- api_query_pipelines_http_one_minute_rate_f, api_query_pipelines_http_mean_f, and so on
  
  Query pipeline metrics, such as the rate of query requests to the HTTP endpoint or to Solr, and mean response times.
- api_index_pipelines_http_one_minute_rate_f, api_index_pipelines_http_mean_f, and so on
  
  Index pipeline metrics, such as the rate of index requests to the HTTP endpoint or to Solr, and mean response times.
Proxy metrics
- proxy_active_sessions_l
  
  Number of active auth sessions.
- proxy_sessions_one_minute_rate_f
  
  Rate of new auth sessions per minute (per node). This metric is captured once per second, then presented as a moving average over the last minute.

Legacy metrics collection

The metrics collection features described below are deprecated in Fusion 4.2 and will be removed in a future release.

In version 4.1 and earlier, Fusion automatically creates the system collection system_metrics. It is empty until you manually enable system_metrics indexing; see below for instructions. In version 4.2 and later, Fusion creates the collection when you enable system_metrics indexing.

The jobs that produce legacy metrics are not automatically linked to your app when you enable legacy metrics, so they will not automatically appear in the Jobs panel. You can find them in the Object Explorer and link them to your app.

Types of Metrics Collected

There are several types of metrics:

Gauges: These are single values, valid for the point in time at which the metrics are collected.
Counters: These are values that are incremented or decremented over time.
Meters: These measure the rate of events over time. They include a mean rate, as well as a 1-, 5- and 15-minute moving average. Most of these moving averages are exponentially weighted, so that more recent values contribute more heavily than older values; exceptions to this rule have the word "unweighted" in their name.
Histograms: These measure the distribution of values. They will report the minimum, maximum, mean, and the values at the 50th, 75th, 95th, 98th, 99th, and 99.9th percentiles.
Timers: A timer is a meter combined with a histogram; it measures the length of time that a particular operation takes (both mean duration and moving averages) as well as the distribution of those durations.
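
To make the exponentially weighted moving average concrete, here is a small sketch of a one-minute meter rate. It follows the approach used by the Dropwizard Metrics library (a fixed tick interval and a smoothing factor derived from the window length). The class name Ewma, the 5-second tick, and the idea that Fusion uses exactly these constants are assumptions.

```python
import math


class Ewma:
    """Exponentially weighted moving average of an event rate."""

    def __init__(self, window_secs, tick_secs=5.0):
        # Smoothing factor: how strongly each tick pulls the average.
        self.alpha = 1.0 - math.exp(-tick_secs / window_secs)
        self.tick_secs = tick_secs
        self.rate = None   # events per second; None until the first tick
        self.pending = 0   # events seen since the last tick

    def mark(self, n=1):
        """Record n events."""
        self.pending += n

    def tick(self):
        """Fold the events since the last tick into the moving average."""
        instant_rate = self.pending / self.tick_secs
        self.pending = 0
        if self.rate is None:
            self.rate = instant_rate
        else:
            self.rate += self.alpha * (instant_rate - self.rate)
        return self.rate


m1 = Ewma(window_secs=60.0)
m1.mark(50)
print(m1.tick())  # first tick: 50 events over 5 seconds
print(m1.tick())  # no new events, so the rate decays toward zero
```

Because recent ticks carry more weight, a burst of events raises the rate quickly, and the rate then decays gradually once the events stop.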

Many of the metrics are for internal use by the system. However, you may be asked for a dump of the metrics data (retrieved using the System API endpoint) to help diagnose performance issues. Some metrics are also subject to change pending performance tuning and additional testing.

Metrics of Particular Interest

Slow Web Service Calls

For each web service endpoint in the system, the system keeps a list of the last several requests whose request time has been in the 99th percentile – that is, examples of the top 1% of slow requests for that endpoint. These are recorded as com.lucidworks.apollo.resources.serviceName.methodName.weighted.slow.examples, where serviceName is the name of the service and methodName is the name of a valid method for that service.

This information might be helpful when diagnosing performance issues. Here is an example of the 5 slowest calls to the getCollectionMetrics method of the CollectionResource service:

"com.lucidworks.apollo.resources.CollectionResource.getCollectionMetrics.weighted.slow.examples" : {
      "value" : [ {
        "requestUri" : "http://localhost:8764/api/collections/lws5_metrics/stats",
        "queryParams" : { },
        "userPrincipal" : null,
        "method" : "GET",
        "cookies" : { }
      }, {
        "requestUri" : "http://localhost:8764/api/collections/logs/stats",
        "queryParams" : { },
        "userPrincipal" : null,
        "method" : "GET",
        "cookies" : { }
      }, {
        "requestUri" : "http://localhost:8764/api/collections/logs/stats",
        "queryParams" : { },
        "userPrincipal" : null,
        "method" : "GET",
        "cookies" : { }
      }, {
        "requestUri" : "http://localhost:8764/api/collections/lws5_metrics/stats",
        "queryParams" : { },
        "userPrincipal" : null,
        "method" : "GET",
        "cookies" : { }
      }, {
        "requestUri" : "http://localhost:8764/api/collections/lws5_metrics/stats",
        "queryParams" : { },
        "userPrincipal" : null,
        "method" : "GET",
        "cookies" : { }
      } ]
    }
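
When reading a list like this, it can help to group the slow examples by request URI. The sketch below does that with a trimmed copy of the example above, keeping only the field it uses. The function name slow_request_counts is hypothetical.

```python
from collections import Counter


def slow_request_counts(slow_examples):
    """Count slow-request examples per request URI."""
    return Counter(entry["requestUri"] for entry in slow_examples["value"])


# Trimmed copy of the example above (only the requestUri field is kept).
sample = {
    "value": [
        {"requestUri": "http://localhost:8764/api/collections/lws5_metrics/stats"},
        {"requestUri": "http://localhost:8764/api/collections/logs/stats"},
        {"requestUri": "http://localhost:8764/api/collections/logs/stats"},
        {"requestUri": "http://localhost:8764/api/collections/lws5_metrics/stats"},
        {"requestUri": "http://localhost:8764/api/collections/lws5_metrics/stats"},
    ]
}

for uri, count in slow_request_counts(sample).most_common():
    print(count, uri)
```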

System Memory

There are several memory-related metrics reported:

mem.heap.used: the current amount of heap memory, in bytes, used by the system.
mem.heap.max: the maximum amount of heap memory, in bytes, that the system could use.
mem.heap.usage: the fraction (0 to 1.0) of available heap memory that the system is currently using (this is equal to mem.heap.used / mem.heap.max).
mem.non-heap.used: the current amount of non-heap memory (also called "off-heap memory"), in bytes, used by the system.
mem.non-heap.max: the maximum amount of non-heap memory, in bytes, that the system could use.
mem.non-heap.usage: the fraction (0 to 1.0) of available non-heap memory that the system is currently using (this is equal to mem.non-heap.used / mem.non-heap.max).
mem.total.used: the current total amount of memory (heap plus non-heap), in bytes, used by the system.
mem.total.max: the maximum amount of total memory (heap plus non-heap), in bytes, that the system could use.

Here is an example of mem.heap.used:

{
  "version" : "3.0.0",
  "gauges" : {
    "mem.heap.used" : {
      "value" : 94783360
    }
  },
  "counters" : { },
  "histograms" : { },
  "meters" : { },
  "timers" : { }
}

Query and Index Pipeline Stage Metrics

For each query pipeline and index pipeline stage, Fusion collects aggregate performance metrics for successful executions and for errors. All executions for each stage are stored in a metric named stages.stageType.stageName.process, where stageType is the type of stage, and stageName is the name of a specific stage.

Here is an example of the performance metrics returned for an index pipeline stage named 'solr-default' (stages.solr-index.solr-default.process), which is included with Fusion:

{
  "version" : "3.0.0",
  "gauges" : { },
  "counters" : { },
  "histograms" : { },
  "meters" : { },
  "timers" : {
    "stages.solr-index.solr-default.process" : {
      "count" : 109195,
      "max" : 0.128585,
      "mean" : 0.004011065175097276,
      "min" : 0.0022500000000000003,
      "p50" : 0.0030645000000000004,
      "p75" : 0.0033495,
      "p95" : 0.005410449999999992,
      "p98" : 0.014195759999999965,
      "p99" : 0.02462230000000001,
      "p999" : 0.12850243700000002,
      "stddev" : 0.007408363728123277,
      "m15_rate" : 11.957732876922531,
      "m1_rate" : 8.784289947811962,
      "m5_rate" : 9.037172472578138,
      "mean_rate" : 9.214233776748047,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    }
  }
}

This shows the number of executions of the stage ("count"); the maximum, minimum, and mean durations; the 50th, 75th, 95th, 98th, 99th, and 99.9th percentile durations (p50, p75, and so on); and the moving-average rates over 1-, 5-, and 15-minute intervals (m1_rate and so on). In this case, the stage has been executed 109,195 times, at a mean rate of 9.214 calls per second, with a median (p50) duration of about 0.003 seconds.
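
A timer entry like the one above can be reduced to a few headline numbers. The following sketch parses a trimmed copy of that response and converts the durations to milliseconds. The function name summarize_timer and the choice of fields are assumptions for illustration.

```python
import json

# Trimmed copy of the response above; the values are taken from that example.
DUMP = json.loads("""
{
  "timers" : {
    "stages.solr-index.solr-default.process" : {
      "count" : 109195,
      "mean" : 0.004011065175097276,
      "p50" : 0.0030645000000000004,
      "p99" : 0.02462230000000001,
      "mean_rate" : 9.214233776748047,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    }
  }
}
""")


def summarize_timer(dump, name):
    """Return the count, key durations in milliseconds, and the mean rate."""
    timer = dump["timers"][name]
    if timer["duration_units"] != "seconds":
        raise ValueError("this sketch only handles durations in seconds")
    return {
        "count": timer["count"],
        "mean_ms": timer["mean"] * 1000.0,
        "p50_ms": timer["p50"] * 1000.0,
        "p99_ms": timer["p99"] * 1000.0,
        "calls_per_sec": timer["mean_rate"],
    }


print(summarize_timer(DUMP, "stages.solr-index.solr-default.process"))
```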

Metrics for successful completions of stages are stored in metrics named stages.index.stageType.stage.stageName.ok or stages.query.stageType.stage.stageName.ok, depending on whether the stage is part of an index pipeline or a query pipeline. Here is an example of the rates for successful runs of the 'solr-default' index pipeline stage (stages.index.solr-index.stage.solr-default.ok):

{
  "version" : "3.0.0",
  "gauges" : { },
  "counters" : { },
  "histograms" : { },
  "meters" : {
    "stages.index.solr-index.stage.solr-default.ok" : {
      "count" : 110855,
      "m15_rate" : 5.270163206842968,
      "m1_rate" : 8.485969925086419,
      "m5_rate" : 8.06785229981572,
      "mean_rate" : 9.18230056255745,
      "units" : "events/second"
    }
  },
  "timers" : { }
}

This shows the number of successful executions of the stage ("count"), the overall mean rate (mean_rate), and the moving-average rates over 1-, 5-, and 15-minute intervals (m1_rate and so on). From the above, we can see that the solr-default stage has completed successfully 110,855 times, with a mean rate of 9.18 events per second.

If you prefer to see metrics for an entire stage type, omit the stage name. The metric name then takes the form stages.index.stageType.ok (for an index pipeline) or stages.query.stageType.ok (for a query pipeline). Here is an example, using the solr-index stage type:

{
  "version" : "3.0.0",
  "gauges" : { },
  "counters" : { },
  "histograms" : { },
  "meters" : {
    "stages.index.solr-index.ok" : {
      "count" : 116425,
      "m15_rate" : 6.178851947720613,
      "m1_rate" : 8.814380052133192,
      "m5_rate" : 8.585203640734829,
      "mean_rate" : 9.19499774409566,
      "units" : "events/second"
    }
  },
  "timers" : { }
}

In this example, we see that the solr-index stage has been successfully run 116,425 times, with a mean rate of 9.19 events per second.
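
The naming patterns for the ".ok" metrics can be captured in a small helper. This is only a sketch of the patterns described in this section; the function name stage_ok_metric is hypothetical.

```python
def stage_ok_metric(pipeline_kind, stage_type, stage_name=None):
    """Build the name of the success meter for a stage or a whole stage type."""
    if pipeline_kind not in ("index", "query"):
        raise ValueError("pipeline_kind must be 'index' or 'query'")
    if stage_name is None:
        # Aggregate over every stage of this type.
        return "stages.{}.{}.ok".format(pipeline_kind, stage_type)
    return "stages.{}.{}.stage.{}.ok".format(pipeline_kind, stage_type, stage_name)


print(stage_ok_metric("index", "solr-index", "solr-default"))
print(stage_ok_metric("index", "solr-index"))
```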

Web Service Endpoint Metrics

For each web service endpoint, Fusion keeps a timer recording the duration and rate of requests. The duration is calculated using an exponentially-weighted moving average with a heavy bias toward measurements from the last 5 minutes.

These metrics have names in the form: com.lucidworks.apollo.resources.serviceName.methodName.weighted.timer, or for a specific example, com.lucidworks.apollo.resources.CollectionResource.getCollectionMetrics.weighted.timer:

"com.lucidworks.apollo.resources.CollectionResource.getCollectionMetrics.weighted.timer" : {
      "count" : 2624,
      "max" : 0.134712,
      "mean" : 0.031589107976653694,
      "min" : 0.022424000000000003,
      "p50" : 0.028440000000000003,
      "p75" : 0.036908,
      "p95" : 0.044644449999999995,
      "p98" : 0.05026944,
      "p99" : 0.05444051000000004,
      "p999" : 0.134693411,
      "stddev" : 0.00936497282768644,
      "m15_rate" : 0.07113433590025664,
      "m1_rate" : 0.06387037028343223,
      "m5_rate" : 0.06218407166715861,
      "mean_rate" : 0.0663172057583814,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    }

Solr Request Metrics

The system keeps track of the performance of requests to each Solr server that it communicates with.

The metrics have names in the form solr.solrIdentifier.requestType. The solrIdentifier is the address of the Solr instance, and the requestType can be 'get-requests', 'post-requests' or 'put-requests'.

This example shows get-requests to a Solr instance that is found on '10.0.1.8' and port 8983:

{
  "version" : "3.0.0",
  "gauges" : { },
  "counters" : { },
  "histograms" : { },
  "meters" : { },
  "timers" : {
    "solr.10.0.1.8-8983.get-requests" : {
      "count" : 3170,
      "max" : 0.873981,
      "mean" : 0.2451200904669261,
      "min" : 0.001678,
      "p50" : 0.318176,
      "p75" : 0.48169550000000005,
      "p95" : 0.53017705,
      "p98" : 0.5617982399999999,
      "p99" : 0.6281221800000003,
      "p999" : 0.8710894970000004,
      "stddev" : 0.2448979377578966,
      "m15_rate" : 0.02059326561557774,
      "m1_rate" : 0.03249432457272969,
      "m5_rate" : 0.030788223074952624,
      "mean_rate" : 0.033875616252208286,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    }
  }
}

From this we can see that there have been 3,170 GET requests to that Solr instance, at a mean rate of about 0.03 requests per second, with a mean response time of about 0.245 seconds.
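
When processing many of these timers, the metric name can be split back into its parts. The sketch below assumes the solrIdentifier always has the host-port shape seen in the example (address, hyphen, port). The function name parse_solr_metric is hypothetical.

```python
def parse_solr_metric(name):
    """Split 'solr.<host>-<port>.<requestType>' into (host, port, request_type)."""
    prefix = "solr."
    if not name.startswith(prefix):
        raise ValueError("not a Solr request metric: " + name)
    # The request type follows the last dot; the host itself may contain dots.
    identifier, request_type = name[len(prefix):].rsplit(".", 1)
    # Assumed identifier shape: address, then '-', then port.
    host, port = identifier.rsplit("-", 1)
    return host, int(port), request_type


print(parse_solr_metric("solr.10.0.1.8-8983.get-requests"))
```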