top
, du -h
, and df
; REST APIs; and JMX MBean metrics.
ENABLE_REMOTE_JMX_OPTS=true
. Refer to Apache’s Solr refernce guide for configuring JMX for more information.
/admin/{parameter}?action=CLUSTERSTATUS
.
Parameter | Description |
---|---|
collection | The collection or alias name for which information is requested. If omitted, information on all collections in the cluster will be returned. If an alias is supplied, information on the collections in the alias will be returned. |
shard | The shard(s) for which information is requested. Multiple shard names can be specified as a comma-separated list. |
route | This can be used if you need the details of the shard where a particular document belongs to and you don’t know which shard it falls under. |
http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS
Output
Check the System State
status
should be OK
if Solr is able to return results for queries.ok
.system/status
endpoint, which returns the status.ok
if the service started successfully.system/status
endpoint, which returns the status.Use System Metrics
/system/metrics
endpoint of the System API lists all the metrics that the system is currently collecting. Metrics are returned for the current instance only; Fusion instances do not aggregate metrics between nodes.delete-old-system-metrics
job.
30DAYS
to the desired period of time to retain system metrics.
com.lucidworks.apollo.metrics.poll.seconds
configuration parameter with the Configurations API.For example:com.lucidworks.apollo.metrics.poll.seconds
parameter to ‘-1’.fusion.cors
(fusion.properties
in Fusion 4.x) file located in /fusion/latest.x/conf/
. If log files are not being created and stored in the logs collection, check the configurations in this file.CPU Idle | This value helps determine whether a system is running out of CPU resources. Establishing average values, such as avg(1 min) , helps determine if the CPU Idle value is too high. Establishing these averages for each individual CPU core helps you avoid confusion if, for example, everything is paused on a single CPU core. |
Load Average | This value indicates whether processes are being loaded into a queue to be processed. If this value is higher than the number of CPU cores on a system, the machine is failing to load processes. The Load Average value might not indicate a specific problem, but might rather indicate that a problem exists somewhere. |
Swap | Most operating systems can extend the capabilities of their RAM by paging portions out to disk before moving it back into RAM as needed. This process is called “swap”. Swap may potentially have a Severe impact on the performance of Solr. Avoid swap by reconfiguring services across different nodes or adding memory. |
Page In/Out | This value indicates the use of swap space. There might be a small amount of swap space used and little page in/out activity. However, the use of swap space generally indicates the need for more memory. |
Free Memory | This value indicates the amount of free memory available to the system. A system with low amounts of free memory can no longer increase disk caching as queries are run. Because of this, index accesses must wait for permanent disk storage, which is much slower. Take note of the amount of free memory and the disk cache in comparison to the overall index size. Ideally, the disk cache should be able to hold all of the index, although this is not as important when using newer SSD technologies. |
I/O Wait | A high I/O Wait value indicates system processes are waiting on disk access, resulting in a slow system. |
Free Disk Space | It is critical to monitor the amount of free disk space, because running out of free disk space can cause corruption in indexes. Avoid this by creating a mitigation plan to reallocate indexes, purge less important data, or add disk space. Create warning and alarm alerting for disk partitions containing application or log data. |
solr/<corename>
. Some of these metrics can also be obtained via the fields below.
Average Response Times | Monitoring tools can complete very useful calculations by measuring deltas between incremental values. For example, a user might want to calculate the average response time over the last five minutes. Use the Runtime JMX MBean as the divisor to get performance over time periods. The JMX MBean for the standard request handler is found at: solr/<corename>:type=standard,id=org.apache.solr.handler.StandardRequestHandler, totalTime, requests] This can be obtained for every Solr requestHandler being used. |
nth Percentile Response Times | This value allows a user to see the response time for 75thPercentile , 95thPercentile , 99thPercentile , and 999thPercentiles . This value is a poor performance benchmark during short spikes in query times, but in other applications, it can show slow degradation problems and baseline behavior. JMX MBean example: "solr/collection1:type=standard,id=org.apache.solr.handler.component.SearchHandler", 75thPcRequestTime The same format is used for all percentiles. |
Average QPS | Tracking peak QPS over several-minute intervals can show how well Fusion and Solr are handling queries and can be vital in benchmarking plausible load. JMX MBean is found at: "solr/collection1:type=standard,id=org.apache.solr.handler.StandardRequestHandler" , requests divided by JMX Runtime java.lang:Runtime . |
Average Cache/Hit Ratios | This value can indicate potential problems in how Solr is warming searcher caches. A ratio below 1:2 should be considered problematic, as this may affect resource management when warming caches. Again, this value can be used with the Runtime JMX MBean to find changes over a time period. JMX MBeans are found at: "solr/<corename>:type=<cachetype>id=org.apache.solr.search.<cache impl>", hitratio/cumulative hitratio |
External Query “HTTP Ping” | An external query HTTP ping lets you monitor the performance of the system by sending a request for a basic status response from the system. Because this request does not perform a query of data, the response is fast and consistent. The average response time can be used as a metric for measuring system load. Under light to moderate loads, the response to the HTTP ping request is returned quickly. Under heavy loads, the response lags. If the response lags considerably, this might be a sign that the service is close to breaking. External query HTTP pings can be configured to target query pipelines, which in turn query Solr. Using this method, you can monitor the uptime of the system by verifying Solr, API nodes, and the proxy service are taking requests. Every important collection should be monitored, as the performance of one collection may differ from another. There are many vendors which provide these services. This includes Datadog, New Relic, StatusCake, and Pingdom. |
Last Full Garbage Collection | Monitoring the total time taken for full garbage collection cycles is important, because full GC activities typically pause processing for a whole application. Acceptable pause times vary, as they relate to how responsive your queries must be in a worst-case scenario. Generally, anything over a few seconds might indicate a problem. JMX MBean: "java.lang:type=GarbageCollector,name=<GC Name>", LastGcInfo.duration Most fusion log directories also have detailed GC logs. |
Full GC as % of Runtime | This value indicates the amount of time spent in full garbage collections. Large amounts of time spent in full GC cycles can indicate the amount of space given to the heap is too little. |
Total GC Time as % of Runtime | An indexing server typically spends considerably more time in GC. This is due to all of the new data coming in. When tuning heap and generation sizes, it is important to know how much time is being spent in GC. If too little time is spent, GC triggers more frequently. If too much time is spent, pause frequency increases. |
Total Threads | This is an important value to track because each thread takes up memory space, and too many threads can overload a system. In some cases, out of memory (OOM) errors are not indicative of the need for more memory, but rather that a system is overloaded with threads. If a system has too many threads opening, it might indicate a performance bottleneck or a lack of hardware resources. JMX MBean: "java.lang:type=Threading,ThreadCount . |
Autowarming Times | This value indicates how long new searchers or caches take to initialize and load. The value applies to both searchers and caches. However, the warmupTime for the searcher is separate from that of caches. There is always a tradeoff between autowarming times and having prewarmed caches ready for search. If autowarming times are set to be longer than the interval for initializing new searchers, problems may arise. Searcher JMX MBean: "solr/collection1:type=searcher,id=org.apache.solr.search.SolrIndexSearcher", warmupTime. "solr/collection1:type=documentCache,id=org.apache.solr.search.LRUCache" "solr/collection1:type=fieldValueCache,id=org.apache.solr.search.FastLRUCache" "solr/collection1:type=filterCache,id=org.apache.solr.search.FastLRUCache" "solr/collection1:type=queryResultCache,id=org.apache.solr.search.LRUCache" |