System Metrics
By default, Fusion captures system metrics in the var/log/metrics/metrics.log
file, then indexes them asynchronously in the system_monitor
system collection:
-
Host/server metrics (CPU, memory, disk space usage, and so on)
-
Service metrics (process CPU, Java heap memory usage, and so on)
You can view these metrics in the DevOps Center.
Configuration
Most aspects of metrics collection can be configured in the fusion.properties
file:
-
Metrics collection can be disabled with
default.collectMetrics = false
-
The frequency of metrics collection can be adjusted with
default.collectMetricsIntervalSecs = 30
-
Metrics can be shipped to a different Solr cluster or collection by adjusting the log-shipper.solrZk.connect and log-shipper.metricsSolrCollection properties.
The retention period for system metrics is 30 days by default and can be configured in the delete-old-system-metrics
Fusion task job, available in all apps.
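To check which of the settings above are present in a given fusion.properties file, a small parser can help. The following is a minimal sketch, not part of Fusion: the function name read_metrics_settings and the sample text are hypothetical, and only simple "key = value" lines are handled.

```python
def read_metrics_settings(properties_text):
    """Extract the metrics-related keys from fusion.properties content."""
    wanted = {
        "default.collectMetrics",
        "default.collectMetricsIntervalSecs",
        "log-shipper.solrZk.connect",
        "log-shipper.metricsSolrCollection",
    }
    settings = {}
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() in wanted:
            settings[key.strip()] = value.strip()
    return settings


# Hypothetical excerpt, not a complete fusion.properties file.
sample = """
# metrics settings
default.collectMetrics = false
default.collectMetricsIntervalSecs = 30
"""
print(read_metrics_settings(sample))
```

Values are returned as strings, exactly as written in the file; keys that are absent are simply omitted from the result.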
Metrics document fields
Both host and service metrics are stored as a single Solr document with a timestamp and the fields described below.
All metrics
-
id
Unique autogenerated document identifier.
-
node_s
Unique identifier of a Fusion node / server (autogenerated).
-
timestamp_tdt
Timestamp of a metric.
-
view_s
Type of a metric, either "host" or "service_instance".
-
type_s
Store type of a metric, either "latest" or "history".
-
"history" is a snapshot of a metric at a particular time.
-
"latest" is a single document per host or service with the latest state that is constantly updated over time. It allows easy retrieval and aggregation for just the "latest" / "recent" state of the system.
-
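Because "latest" documents hold one current record per host or service, they can be selected with two filter queries on view_s and type_s. The sketch below only builds a Solr query string; the function name latest_metrics_query is hypothetical, and how you reach the Solr select handler for the system_monitor collection depends on your deployment.

```python
from urllib.parse import urlencode


def latest_metrics_query(view, rows=100):
    """Build Solr query parameters for the latest metrics of one view."""
    if view not in ("host", "service_instance"):
        raise ValueError("view must be 'host' or 'service_instance'")
    params = [
        ("q", "*:*"),
        ("fq", f"view_s:{view}"),       # host or service metrics
        ("fq", "type_s:latest"),        # only the constantly updated documents
        ("sort", "timestamp_tdt desc"),
        ("rows", str(rows)),
    ]
    return urlencode(params)


print(latest_metrics_query("host"))
```

Replacing type_s:latest with type_s:history and adding a range filter on timestamp_tdt would select snapshots over time instead.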
Host metrics
-
CPU
-
cpu_load_d
Normalized CPU load, a floating-point value in the range [0.0,1.0].
-
cpu_sys_d, cpu_user_d, cpu_wait_d, cpu_combined_d, and cpu_idle_d
Breakdown of CPU load per type. These are also floating-point values in the range [0.0,1.0].
-
load_average_d
System load average for the last minute, not normalized.
-
processors_l
Number of CPU cores according to JVM.
-
-
Memory
-
memory_total_l and memory_free_l
Total and free amounts of physical memory in bytes.
-
swap_total_l and swap_free_l
Total and free swap in bytes.
-
-
Disk space
-
disk_total_l and disk_free_l
Total and free space on the partition where Fusion is installed (where the var and data folders reside).
-
-
Uptime
-
host_uptime_l
Total uptime of a host operating system (in milliseconds).
-
agent_uptime_l
Uptime of Fusion agent service (in milliseconds).
-
-
Various Info
-
os_name_s, os_arch_s, and os_version_s
OS details according to JVM.
-
addresses_ss
List of IP addresses according to network configuration.
-
hostname_s
Main hostname or IP address of a server.
-
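The total and free fields for memory, swap, and disk are often easier to read as used fractions. The following is a minimal sketch that derives them from a host metrics document represented as a Python dict; the function name host_usage and the sample values are hypothetical.

```python
def host_usage(doc):
    """Return the used fraction of memory, swap and disk for a host document."""

    def used_fraction(total, free):
        # A missing or zero total (for example, no swap configured) gives None.
        if not total or free is None:
            return None
        return (total - free) / total

    return {
        "memory": used_fraction(doc.get("memory_total_l"), doc.get("memory_free_l")),
        "swap": used_fraction(doc.get("swap_total_l"), doc.get("swap_free_l")),
        "disk": used_fraction(doc.get("disk_total_l"), doc.get("disk_free_l")),
    }


# Hypothetical values, for illustration only.
doc = {
    "memory_total_l": 8, "memory_free_l": 2,
    "swap_total_l": 0, "swap_free_l": 0,
    "disk_total_l": 100, "disk_free_l": 25,
}
print(host_usage(doc))
```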
Service metrics
-
service_s
The name of the service (for example, api, solr, and so on) to which this metric pertains.
-
status_s
Status of the service according to the agent (for example, RUNNING, STARTING, and so on).
-
pid_i
Process ID.
-
address_s
IP address or hostname that is configured for this service to run on (or the default).
-
Generic Java Metrics
-
java_process_cpu_load_d
Normalized CPU load used by this service.
-
java_heap_max_l, java_heap_used_l, and java_non_heap_used_l
JVM memory metrics.
-
java_open_file_descriptors_i
Number of open files according to JVM.
-
java_loaded_classes_i and java_unloaded_classes_i
JVM class loading metrics, useful for spotting problems with dynamic redeployment of Web applications.
-
java_threads_i
Total JVM threads.
-
gc_collection_count_l and gc_collection_time_l
GC metrics, such as the number of invocations and total time spent.
-
-
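If the GC fields are cumulative totals (an assumption; the field descriptions above do not say so explicitly), garbage collection activity over an interval can be estimated by subtracting two "history" snapshots of the same service. This is a rough sketch with a hypothetical function name, and the result is in whatever time unit gc_collection_time_l uses.

```python
def gc_activity(earlier, later):
    """Estimate GC activity between two snapshots of the same service.

    Assumes gc_collection_count_l and gc_collection_time_l are cumulative.
    """
    collections = later["gc_collection_count_l"] - earlier["gc_collection_count_l"]
    time_spent = later["gc_collection_time_l"] - earlier["gc_collection_time_l"]
    # Avoid dividing by zero when no collection happened in the interval.
    average = time_spent / collections if collections > 0 else None
    return {
        "collections": collections,
        "time_spent": time_spent,
        "avg_time_per_collection": average,
    }


# Hypothetical snapshots, for illustration only.
before = {"gc_collection_count_l": 10, "gc_collection_time_l": 100}
after = {"gc_collection_count_l": 15, "gc_collection_time_l": 200}
print(gc_activity(before, after))
```

A negative difference would suggest that the service restarted between the two snapshots, in which case the comparison is not meaningful.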
Jetty Metrics
All Jetty based services provide low-level Jetty metrics such as the following:
-
jetty_request_time_mean_f
Mean request time according to Jetty.
-
jetty_threads_i
Number of Jetty threads.
-
jetty_responses_5xx_l, jetty_responses_4xx_l, and so on
Number of responses per status.
-
-
Solr Metrics
-
solr_index_size_l
Total Solr index size in bytes hosted on a Solr node.
-
solr_docs_l
Total number of Solr documents hosted on a Solr node.
-
solr_requests_l
Total number of Solr requests to all cores on a Solr node.
-
-
ZooKeeper Metrics
-
zk_connections_i
Number of connections to the ZooKeeper node.
-
zk_znodes_l
Number of ZooKeeper nodes.
-
zk_watches_i
Number of ZooKeeper watches.
-
zk_ephemerals_i
Number of ephemeral ZooKeeper nodes.
-
zk_size_l
ZooKeeper size in bytes.
-
-
API metrics
-
api_query_pipelines_http_one_minute_rate_f, api_query_pipelines_http_mean_f, and so on
Query pipeline metrics, such as the rate of query requests to the HTTP endpoint or to Solr, and mean response times.
-
api_index_pipelines_http_one_minute_rate_f, api_index_pipelines_http_mean_f, and so on
Index pipeline metrics, such as the rate of index requests to the HTTP endpoint or to Solr, and mean response times.
-
-
Proxy metrics
-
proxy_active_sessions_l
Number of active auth sessions.
-
proxy_sessions_one_minute_rate_f
Rate of new auth sessions per minute (per node). This metric is captured once per second, then presented as a moving average over the last minute.
-
Legacy metrics collection
The metrics collection features described below are deprecated in Fusion 4.2 and will be removed in a future release.
In version 4.1 and earlier, Fusion automatically creates the system collection system_metrics. It is empty until you manually enable system_metrics indexing; see below for instructions.
In version 4.2 and later, Fusion creates the collection when you enable system_metrics indexing.
The jobs that produce legacy metrics are not automatically linked to your app when you enable legacy metrics, so they will not automatically appear in the Jobs panel. You can find them in the Object Explorer and link them to your app.
Types of Metrics Collected
There are several types of metrics:
-
Gauges: These are single values, valid for the point in time at which the metrics are collected.
-
Counters: These are values that are incremented or decremented over time.
-
Meters: These measure the rate of events over time. They include a mean rate, as well as a 1-, 5- and 15-minute moving average. Most of these moving averages are exponentially weighted, so that more recent values contribute more heavily than older values; exceptions to this rule have the word "unweighted" in their name.
-
Histograms: These measure the distribution of values. They will report the minimum, maximum, mean, and the values at the 50th, 75th, 95th, 98th, 99th, and 99.9th percentiles.
-
Timers: A timer is a meter combined with a histogram; it measures the length of time that a particular operation takes (both mean duration and moving averages) as well as the distribution of those durations.
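To illustrate what "exponentially weighted" means for a meter's moving averages, here is a generic sketch of an exponentially weighted moving average of an event rate. It is not Fusion's implementation: the class name, the 5-second tick interval, and the smoothing formula are assumptions chosen for illustration.

```python
import math


class Ewma:
    """Exponentially weighted moving average of an event rate."""

    def __init__(self, window_secs, tick_secs=5.0):
        # Smoothing factor: larger windows react more slowly to new data.
        self.alpha = 1.0 - math.exp(-tick_secs / window_secs)
        self.tick_secs = tick_secs
        self.rate = None  # events per second

    def tick(self, events):
        """Fold in the number of events seen during the last tick interval."""
        instant_rate = events / self.tick_secs
        if self.rate is None:
            self.rate = instant_rate
        else:
            self.rate += self.alpha * (instant_rate - self.rate)
        return self.rate


one_minute = Ewma(window_secs=60)
one_minute.tick(50)          # 50 events in 5 seconds
print(one_minute.tick(0))    # the rate decays once events stop
```

Each tick moves the average a fixed fraction of the way toward the most recent rate, so older observations fade gradually instead of dropping out of a fixed window all at once.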
Many of the metrics are for internal use by the system. However, Fusion may ask for a dump of the metrics data (using the System API endpoint) to help diagnose performance issues. Some metrics are also subject to change pending performance tuning and additional testing.
Metrics of Particular Interest
Slow Web Service Calls
For each web service endpoint in the system, the system keeps a list of the last several requests whose request time has been in the 99th percentile, that is, examples from the slowest 1% of requests for that endpoint. These are recorded as com.lucidworks.apollo.resources.serviceName.methodName.weighted.slow.examples, where serviceName is the name of the service and methodName is the name of a valid method for that service.
This information might be helpful when diagnosing performance issues. Here is an example showing five slow calls to the getCollectionMetrics method of the CollectionResource service:
"com.lucidworks.apollo.resources.CollectionResource.getCollectionMetrics.weighted.slow.examples" : {
"value" : [ {
"requestUri" : "http://localhost:8764/api/collections/lws5_metrics/stats",
"queryParams" : { },
"userPrincipal" : null,
"method" : "GET",
"cookies" : { }
}, {
"requestUri" : "http://localhost:8764/api/collections/logs/stats",
"queryParams" : { },
"userPrincipal" : null,
"method" : "GET",
"cookies" : { }
}, {
"requestUri" : "http://localhost:8764/api/collections/logs/stats",
"queryParams" : { },
"userPrincipal" : null,
"method" : "GET",
"cookies" : { }
}, {
"requestUri" : "http://localhost:8764/api/collections/lws5_metrics/stats",
"queryParams" : { },
"userPrincipal" : null,
"method" : "GET",
"cookies" : { }
}, {
"requestUri" : "http://localhost:8764/api/collections/lws5_metrics/stats",
"queryParams" : { },
"userPrincipal" : null,
"method" : "GET",
"cookies" : { }
} ]
}
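When a slow-examples list is long, grouping the entries by request URI shows which resources are slow most often. The sketch below assumes the metric has already been parsed into a Python dict with a "value" list, as in the example above; the function name count_slow_uris is hypothetical.

```python
from collections import Counter


def count_slow_uris(slow_examples):
    """Count how often each request URI appears among the slow examples."""
    return Counter(entry["requestUri"] for entry in slow_examples["value"])


# The request URIs from the example above, other fields omitted.
base = "http://localhost:8764/api/collections"
sample = {
    "value": [
        {"requestUri": f"{base}/lws5_metrics/stats", "method": "GET"},
        {"requestUri": f"{base}/logs/stats", "method": "GET"},
        {"requestUri": f"{base}/logs/stats", "method": "GET"},
        {"requestUri": f"{base}/lws5_metrics/stats", "method": "GET"},
        {"requestUri": f"{base}/lws5_metrics/stats", "method": "GET"},
    ]
}

for uri, count in count_slow_uris(sample).most_common():
    print(count, uri)
```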
System Memory
There are several memory-related metrics reported:
-
mem.heap.used: the current amount of heap memory, in bytes, used by the system.
-
mem.heap.max: the maximum amount of heap memory, in bytes, that the system could use.
-
mem.heap.usage: the fraction (0 to 1.0) of available heap memory that the system is currently using (equal to mem.heap.used / mem.heap.max).
-
mem.non-heap.used: the current amount of non-heap memory (also called "off-heap memory"), in bytes, used by the system.
-
mem.non-heap.max: the maximum amount of non-heap memory, in bytes, that the system could use.
-
mem.non-heap.usage: the fraction (0 to 1.0) of available non-heap memory that the system is currently using (equal to mem.non-heap.used / mem.non-heap.max).
-
mem.total.used: the current total amount of memory (heap plus non-heap), in bytes, used by the system.
-
mem.total.max: the maximum amount of total memory (heap plus non-heap), in bytes, that the system could use.
Here is an example of mem.heap.used:
{
"version" : "3.0.0",
"gauges" : {
"mem.heap.used" : {
"value" : 94783360
}
},
"counters" : { },
"histograms" : { },
"meters" : { },
"timers" : { }
}
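The relationship between the used, max, and usage gauges can be checked directly from a metrics dump. The following sketch assumes the dump has been parsed into a dict with the "gauges" layout shown above and that it contains both mem.heap.used and mem.heap.max; the function name and the mem.heap.max value are hypothetical.

```python
def heap_usage(dump):
    """Compute mem.heap.used / mem.heap.max from a parsed metrics dump."""
    gauges = dump.get("gauges", {})
    used = gauges.get("mem.heap.used", {}).get("value")
    maximum = gauges.get("mem.heap.max", {}).get("value")
    # Return None when either gauge is missing or the maximum is zero.
    if used is None or not maximum:
        return None
    return used / maximum


dump = {
    "version": "3.0.0",
    "gauges": {
        "mem.heap.used": {"value": 94783360},
        "mem.heap.max": {"value": 1073741824},  # hypothetical 1 GiB heap
    },
}
print(heap_usage(dump))
```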
Query and Index Pipeline Stage Metrics
For each query pipeline and index pipeline stage, Fusion collects aggregate performance metrics for successful executions and for errors. All executions for each stage are stored in a metric named stages.stageType.stageName.process, where stageType is the type of stage and stageName is the name of a specific stage.
Here is an example of the performance metrics for an index pipeline stage named 'solr-default' (stages.solr-index.solr-default.process), which is included with Fusion:
{
"version" : "3.0.0",
"gauges" : { },
"counters" : { },
"histograms" : { },
"meters" : { },
"timers" : {
"stages.solr-index.solr-default.process" : {
"count" : 109195,
"max" : 0.128585,
"mean" : 0.004011065175097276,
"min" : 0.0022500000000000003,
"p50" : 0.0030645000000000004,
"p75" : 0.0033495,
"p95" : 0.005410449999999992,
"p98" : 0.014195759999999965,
"p99" : 0.02462230000000001,
"p999" : 0.12850243700000002,
"stddev" : 0.007408363728123277,
"m15_rate" : 11.957732876922531,
"m1_rate" : 8.784289947811962,
"m5_rate" : 9.037172472578138,
"mean_rate" : 9.214233776748047,
"duration_units" : "seconds",
"rate_units" : "calls/second"
}
}
}
This shows the number of executions of the stage ("count"); the maximum, minimum, and mean durations; the durations at the 50th, 75th, 95th, 98th, 99th, and 99.9th percentiles (p50, p75, and so on); and the mean rate along with the 1-, 5-, and 15-minute moving-average rates ('m1_rate', and so on). In this case, the stage has been executed 109,195 times, at a mean rate of 9.214 calls per second, with a median (50th percentile) duration of about 0.003 seconds.
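Timer durations are reported in seconds, which makes sub-second values hard to scan. As a small sketch, the function below converts selected fields of a parsed timer to milliseconds; the function name and the choice of fields are hypothetical.

```python
def timer_summary_ms(timer):
    """Convert selected duration fields of a timer from seconds to milliseconds."""
    if timer.get("duration_units") != "seconds":
        raise ValueError("expected durations in seconds")
    fields = ("min", "mean", "p50", "p95", "p99", "max")
    return {name: round(timer[name] * 1000, 3) for name in fields}


# A subset of the fields from the example above.
timer = {
    "count": 109195,
    "max": 0.128585,
    "mean": 0.004011065175097276,
    "min": 0.0022500000000000003,
    "p50": 0.0030645000000000004,
    "p95": 0.005410449999999992,
    "p99": 0.02462230000000001,
    "duration_units": "seconds",
}
print(timer_summary_ms(timer))
```

Comparing p99 or max against the mean in the same unit makes outliers easy to spot: here the slowest execution took roughly 129 ms against a mean of about 4 ms.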
Metrics for successful completions of stages are stored in metrics named stages.index.stageType.stage.stageName.ok or stages.query.stageType.stage.stageName.ok, depending on whether the stage is part of an index pipeline or a query pipeline. Here is an example of the rates for successful runs of the 'solr-default' index pipeline stage (stages.index.solr-index.stage.solr-default.ok):
{
"version" : "3.0.0",
"gauges" : { },
"counters" : { },
"histograms" : { },
"meters" : {
"stages.index.solr-index.stage.solr-default.ok" : {
"count" : 110855,
"m15_rate" : 5.270163206842968,
"m1_rate" : 8.485969925086419,
"m5_rate" : 8.06785229981572,
"mean_rate" : 9.18230056255745,
"units" : "events/second"
}
},
"timers" : { }
}
This shows the number of successful executions of the stage ("count"), the mean rate, and the 1-, 5-, and 15-minute moving-average rates ('m1_rate', and so on). From the above, we can see that the solr-default stage has completed successfully 110,855 times, at a mean rate of 9.18 events per second.
If you prefer to see the metrics for an entire stage type, you can omit the stage name and get metrics for the stage type instead. These take the form stages.index.stageType.ok (for an index pipeline) or stages.query.stageType.ok (for a query pipeline). Here is an example, using the solr-index stage type:
{
"version" : "3.0.0",
"gauges" : { },
"counters" : { },
"histograms" : { },
"meters" : {
"stages.index.solr-index.ok" : {
"count" : 116425,
"m15_rate" : 6.178851947720613,
"m1_rate" : 8.814380052133192,
"m5_rate" : 8.585203640734829,
"mean_rate" : 9.19499774409566,
"units" : "events/second"
}
},
"timers" : { }
}
In this example, we see that the solr-index stage has been successfully run 116,425 times, with a mean rate of 9.19 events per second.
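The naming patterns for the ".ok" metrics can be captured in a small helper, which is convenient when looking up several stages in a metrics dump. This is a sketch with a hypothetical function name; it assumes query pipelines follow the same pattern as the index pipeline examples shown here.

```python
def stage_ok_metric(pipeline_kind, stage_type, stage_name=None):
    """Build the name of the success meter for a stage or a whole stage type."""
    if pipeline_kind not in ("index", "query"):
        raise ValueError("pipeline_kind must be 'index' or 'query'")
    if stage_name is None:
        # Aggregate over every stage of this type.
        return f"stages.{pipeline_kind}.{stage_type}.ok"
    return f"stages.{pipeline_kind}.{stage_type}.stage.{stage_name}.ok"


print(stage_ok_metric("index", "solr-index", "solr-default"))
print(stage_ok_metric("index", "solr-index"))
```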
Web Service Endpoint Metrics
For each web service endpoint, Fusion keeps a timer recording the duration and rate of requests. The duration is calculated using an exponentially-weighted moving average with a heavy bias toward measurements from the last 5 minutes.
These metrics have names of the form com.lucidworks.apollo.resources.serviceName.methodName.weighted.timer. For example, com.lucidworks.apollo.resources.CollectionResource.getCollectionMetrics.weighted.timer:
"com.lucidworks.apollo.resources.CollectionResource.getCollectionMetrics.weighted.timer" : {
"count" : 2624,
"max" : 0.134712,
"mean" : 0.031589107976653694,
"min" : 0.022424000000000003,
"p50" : 0.028440000000000003,
"p75" : 0.036908,
"p95" : 0.044644449999999995,
"p98" : 0.05026944,
"p99" : 0.05444051000000004,
"p999" : 0.134693411,
"stddev" : 0.00936497282768644,
"m15_rate" : 0.07113433590025664,
"m1_rate" : 0.06387037028343223,
"m5_rate" : 0.06218407166715861,
"mean_rate" : 0.0663172057583814,
"duration_units" : "seconds",
"rate_units" : "calls/second"
}
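With many endpoints in a dump, it helps to rank the endpoint timers by their 99th percentile. The sketch below assumes a parsed dump whose "timers" section uses the naming form described above; the function name, the sample service names other than CollectionResource, and the sample values are hypothetical.

```python
PREFIX = "com.lucidworks.apollo.resources."
SUFFIX = ".weighted.timer"


def slowest_endpoints(dump, top=5):
    """Return (serviceName.methodName, p99) pairs, slowest first."""
    rows = []
    for name, timer in dump.get("timers", {}).items():
        # Keep only the web service endpoint timers.
        if name.startswith(PREFIX) and name.endswith(SUFFIX):
            endpoint = name[len(PREFIX):-len(SUFFIX)]
            rows.append((endpoint, timer["p99"]))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top]


# Hypothetical dump, for illustration only.
dump = {
    "timers": {
        PREFIX + "CollectionResource.getCollectionMetrics" + SUFFIX: {"p99": 0.054},
        PREFIX + "ExampleResource.exampleMethod" + SUFFIX: {"p99": 0.210},
        "stages.solr-index.solr-default.process": {"p99": 0.024},
    }
}
print(slowest_endpoints(dump))
```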
Solr Request Metrics
The system keeps track of the performance of requests to each Solr server that it communicates with.
The metrics have names of the form solr.solrIdentifier.requestType. The solrIdentifier is the address of the Solr instance, and the requestType can be 'get-requests', 'post-requests', or 'put-requests'.
This example shows get-requests to a Solr instance that is found on '10.0.1.8' and port 8983:
{
"version" : "3.0.0",
"gauges" : { },
"counters" : { },
"histograms" : { },
"meters" : { },
"timers" : {
"solr.10.0.1.8-8983.get-requests" : {
"count" : 3170,
"max" : 0.873981,
"mean" : 0.2451200904669261,
"min" : 0.001678,
"p50" : 0.318176,
"p75" : 0.48169550000000005,
"p95" : 0.53017705,
"p98" : 0.5617982399999999,
"p99" : 0.6281221800000003,
"p999" : 0.8710894970000004,
"stddev" : 0.2448979377578966,
"m15_rate" : 0.02059326561557774,
"m1_rate" : 0.03249432457272969,
"m5_rate" : 0.030788223074952624,
"mean_rate" : 0.033875616252208286,
"duration_units" : "seconds",
"rate_units" : "calls/second"
}
}
}
From this we can see that there have been 3,170 GET requests to that Solr instance, at a mean rate of about 0.03 requests per second and with a mean response time of about 0.245 seconds.
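To group these timers by Solr instance, the metric name can be split back into its parts. The following sketch assumes the solrIdentifier always has the host-port form seen in the example (an assumption; other identifier formats are not handled), and the function name is hypothetical.

```python
import re

# host-port identifier followed by one of the three documented request types.
_SOLR_METRIC = re.compile(
    r"^solr\.(?P<host>.+)-(?P<port>\d+)\."
    r"(?P<request_type>get-requests|post-requests|put-requests)$"
)


def parse_solr_metric(name):
    """Split a Solr request metric name into (host, port, request type)."""
    match = _SOLR_METRIC.match(name)
    if match is None:
        return None
    return match.group("host"), int(match.group("port")), match.group("request_type")


print(parse_solr_metric("solr.10.0.1.8-8983.get-requests"))
```

Names that do not match the pattern, such as stage or endpoint metrics, return None, so the function can also be used as a filter over all timer names in a dump.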