Cluster Monitoring and Troubleshooting

Fusion structure and components

Fusion consists of a collection of services running on one or more nodes in a network, as summarized below.

Note
The default ports listed under Runs on below can be altered in the fusion.properties file located in /fusion/4.2.x/conf/.
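For reference, a hypothetical excerpt showing how such port settings might look; the property names below are illustrative, so check your own fusion.properties for the exact keys before editing:

# Hypothetical excerpt from /fusion/4.2.x/conf/fusion.properties
# (property names are illustrative; verify against your file)
proxy.port = 8764
api.port = 8765
connectors-classic.port = 8984
solr.port = 8983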

Fusion services

Service Runs on Description

Fusion UI

All nodes in Fusion cluster. Port 8763.

Fusion Admin Web App.

API

All nodes in Fusion cluster. Port 8765.

The API service fulfills REST API requests, processes pipeline stages, and starts, stops, and (in some cases) monitors scheduled jobs.

Connectors

Fusion nodes. Port 8984.

The connectors service runs connector jobs and processes pipeline stages.

Proxy

All nodes in Fusion cluster. Port 8764.

The proxy service authenticates API requests sent to port 8764 and dispatches to an API or connectors service over port 8765 or 8984.

Apache Solr

All Signals nodes. Port 8983.

Solr hosts the various Fusion collections: signals_aggr, logs, and the system* collections such as models, blobs, history, messages, and metrics. On larger installations, Solr is often installed and run separately from Fusion.

Apache Spark

Signals nodes. Various ports.

The Apache Spark service provides large-scale data processing jobs for Spark SQL, aggregation, machine learning, etc.

Web Apps

All nodes in Fusion cluster. Port 8780.

The web apps service runs installed App Studio and/or Rules Editor web applications (war).

Apache Zookeeper

Port 9983.

Apache ZooKeeper configures and manages all Fusion components in a single Fusion deployment. ZooKeeper is often installed and run separately from Fusion. In this case, it typically runs on port 2181.

Typical directory structure

├── bin   # scripts for start|stop|status of Fusion services
├── conf  # contains config for Fusion processes and logging
├── data  # data directories (often linked or mounted to larger disk)
│   ├── connectors
│   ├── nlp
│   ├── solr
│   └── zookeeper
└── var  # default location for log files
    └── log
        ├── admin-ui
        ├── agent  # agent used to start/stop services
        ├── api    # API service calls, jobs, and pipeline stage output
        ├── connectors
        │   ├── connectors-classic  # connector and pipeline output
        │   └── connectors-rpc
        ├── log-shipper
        ├── proxy
        ├── solr
        ├── spark-master
        ├── spark-worker
        ├── sql
        ├── webapps
        └── zookeeper

In general, each log directory contains a log file named after the directory, with a .log suffix. Garbage collection (GC) logs are also located in these directories; they show internal timings for Java garbage collection cycles. Each directory also contains logs showing Jetty requests and errors for the given service. Each Fusion service has a log4j2.xml settings file located in /fusion/4.2.x/conf, which controls the particulars of log retention and contents.

By default, file rotation is enabled in the log4j2.xml settings file. File rotation ensures a new log file is created under the following conditions:

  • A 24-hour period has elapsed

  • A log file has exceeded 100MB
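As an illustration only, this rotation behavior corresponds to a log4j2 RollingFile appender configured roughly as follows; the appender name, file paths, and layout here are examples, not the shipped configuration:

<!-- Illustrative log4j2 appender; names and paths are examples only -->
<RollingFile name="MainLogFile"
             fileName="/fusion/4.2.x/var/log/api/api.log"
             filePattern="/fusion/4.2.x/var/log/api/api-%d{yyyy-MM-dd}-%i.log.gz">
  <PatternLayout pattern="%d{ISO8601} %-5p %c{1.} - %m%n"/>
  <Policies>
    <TimeBasedTriggeringPolicy/>               <!-- roll when the 24-hour period elapses -->
    <SizeBasedTriggeringPolicy size="100 MB"/> <!-- roll once the file exceeds 100MB -->
  </Policies>
</RollingFile>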

Starting, stopping, and checking the status of Fusion services

By running the scripts in the bin directory, a user can start, stop, restart, or check the status of all Fusion services or of individual services. On systems where Fusion is installed as a service, you can also use systemctl, for example systemctl start|stop|status fusion. When running the fusion command, the services operated on are those listed in the group.default property in fusion.properties.

See the instructions below for basic examples on starting and stopping Fusion services. See Start and Stop Fusion for in-depth examples on starting and stopping Fusion services on Unix and Windows.

Note
Run these commands as the Fusion user to avoid permission conflicts when creating, reading, or writing log file entries.

You can control all Fusion services at once under the management of the Fusion agent, or you can control services individually.

How to control all services using the Fusion agent:
  • Unix: {fusion-parent-unix}/fusion/4.2{sub-version}/bin/fusion <command>

  • Windows: {fusion-parent-windows}\fusion\4.2{sub-version}\bin\fusion.cmd <command>

How to control individual services:
  • Unix: {fusion-parent-unix}/fusion/4.2{sub-version}/bin/<servicename> <command>

    For example: {fusion-parent-unix}/fusion/4.2{sub-version}/bin/proxy restart

  • Windows: {fusion-parent-windows}\fusion\4.2{sub-version}\bin\<servicename>.cmd <command>

    For example: {fusion-parent-windows}\fusion\4.2{sub-version}\bin\proxy.cmd restart

Tip
When starting services individually, start Zookeeper first.
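For example, a typical manual startup sequence on a Unix node might look like this (install path abbreviated; adjust to your environment):

cd /fusion/4.2.x
bin/zookeeper start   # start ZooKeeper before the other services
bin/fusion start      # start the remaining services defined in group.default
bin/fusion status     # confirm each service reports as running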

The commands below can be issued to the fusion/fusion.cmd script to issue the command to all services in the correct sequence, or they can be issued to an individual service.

start

Start one or all Fusion services.

status

Display the status of one or all Fusion services.

restart

Restart one or all Fusion services.

stop

Stop one or all Fusion services.

run

Start one or all Fusion services in the foreground.

run-in-shell (Unix only)

Start an individual service using Bash’s exec builtin, which allows the service to assume the shell process’s PID. See Run Fusion in shell mode below.

Important
Starting and stopping individual Fusion services is not possible on Windows.

Monitoring

There are many ways to monitor Java Virtual Machines, Unix file systems, networks, and applications like Solr and Fusion, including UIs, command-line utilities, REST APIs, and JMX Managed Beans (MBeans). A full-fledged monitoring tool such as Datadog, Nagios, Zabbix, New Relic, SolarWinds, or another comparable tool might be helpful. These tools assist in analyzing raw JMX values over time, which tends to be more informative than the numbers alone.

This topic focuses on generally available options, including: command-line utilities such as top, du -h, and df; REST APIs; and JMX MBean metrics.

Note
Lucidworks does not explicitly recommend any particular monitoring product, nor do we imply suitability or compatibility with Lucidworks Fusion. The aforementioned products are not detailed or reviewed in this topic.

Monitoring via Solr

The Solr Admin console lists information for one Solr node. The same information can be obtained via HTTP as needed.

GET JVM metrics for a Solr node

http://${Solr_Server}:8983/solr/admin/info/system?wt=json

You can access detailed metrics in the Solr Admin UI by selecting an individual Solr core and selecting the Plugins/Stats option. Among the published metrics are averages for Query and Index Request Time, Requests Per Second, Commit rates, Transaction Log size, and more.

Solr Admin UI Plugins/Stats

Obtain these same metrics via HTTP on a per-core basis:

http://${Solr_server}:8983/solr/${Collection}_${shard}_${replica}/admin/mbeans?stats=true&wt=json

Additionally, you can turn JMX monitoring on by starting Solr with ENABLE_REMOTE_JMX_OPTS=true. For more information, see Apache’s guide for configuring JMX.
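For a standalone Solr installation this is typically set in solr.in.sh; for Fusion-managed Solr, the equivalent JVM options may need to be set via fusion.properties instead. Example values:

# solr.in.sh (standalone Solr) -- example values
ENABLE_REMOTE_JMX_OPTS=true
RMI_PORT=18983   # port for the JMX/RMI connector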

Finding shard leaders in Solr

In Solr admin console

The Solr admin console’s Cloud > Graph view shows the shards and replicas for all Solr collections in a cluster. The nodes on the right-hand side with a filled-in dot are the current shard leaders. Query operations are distributed across both leaders and followers, but index operations require extra resources from leaders.

Solr Admin UI Cloud Graph

With the API

The Solr API allows you to fetch the cluster status, including collections, shards, replicas, configuration name, collection aliases, and cluster properties.

Complete information can be found in the Apache Solr Reference Guide.

Requests are made to the following URL endpoint: /admin/collections?action=CLUSTERSTATUS, with the optional parameters described below.

Parameter Description

collection

The collection or alias name for which information is requested. If omitted, information on all collections in the cluster will be returned. If an alias is supplied, information on the collections in the alias will be returned.

shard

The shard(s) for which information is requested. Multiple shard names can be specified as a comma-separated list.

route

Use this parameter if you need the details of the shard to which a particular document belongs but you don’t know which shard that is.

Input

http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS

Output
{
  "responseHeader":{
    "status":0,
    "QTime":333},
  "cluster":{
    "collections":{
      "collection1":{
        "shards":{
          "shard1":{
            "range":"80000000-ffffffff",
            "state":"active",
            "replicas":{
              "core_node1":{
                "state":"active",
                "core":"collection1",
                "node_name":"127.0.1.1:8983_solr",
                "base_url":"http://127.0.1.1:8983/solr",
                "leader":"true"},
              "core_node3":{
                "state":"active",
                "core":"collection1",
                "node_name":"127.0.1.1:8900_solr",
                "base_url":"http://127.0.1.1:8900/solr"}}},
          "shard2":{
            "range":"0-7fffffff",
            "state":"active",
            "replicas":{
              "core_node2":{
                "state":"active",
                "core":"collection1",
                "node_name":"127.0.1.1:7574_solr",
                "base_url":"http://127.0.1.1:7574/solr",
                "leader":"true"},
              "core_node4":{
                "state":"active",
                "core":"collection1",
                "node_name":"127.0.1.1:7500_solr",
                "base_url":"http://127.0.1.1:7500/solr"}}}},
        "maxShardsPerNode":"1",
        "router":{"name":"compositeId"},
        "replicationFactor":"1",
        "znodeVersion": 11,
        "autoCreated":"true",
        "configName" : "my_config",
        "aliases":["both_collections"]
      },
      "collection2":{
        "..."
      }
    },
    "aliases":{ "both_collections":"collection1,collection2" },
    "roles":{
      "overseer":[
        "127.0.1.1:8983_solr",
        "127.0.1.1:7574_solr"]
    },
    "live_nodes":[
      "127.0.1.1:7574_solr",
      "127.0.1.1:7500_solr",
      "127.0.1.1:8983_solr",
      "127.0.1.1:8900_solr"]
  }
}
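To narrow the response, append the parameters from the table above. For example, to request the status of only shard1 of collection1:

curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=collection1&shard=shard1"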

Monitoring Fusion

Fusion UI includes monitoring, troubleshooting, and incident investigation tools in the DevOps Center. The DevOps Center consists of a set of dashboards and an interactive log viewer, providing views into this Fusion cluster’s hosts and services using metrics and events.

You can monitor general system state and status via HTTP in several ways. For details, see Checking System State.

Fusion publishes a wide variety of metrics, including stage-by-stage performance for pipelines via the Fusion Metrics API. The Fusion metrics can give valuable insight into stage development, hot-spot identification, and overall throughput numbers. More information can be found at System Metrics.

Check which Fusion services are running

curl http://${fusion_server}:8764/api

Check the cluster-wide status for a service

curl -u admin:${PASSWORD} -X POST http://${fusion_server}:8764/api/system/status

Monitor the logs collection

In addition to the on-disk log files, most of the key Fusion logging events are written to the Logs collection. You can monitor these logging events with the Logging Dashboard.

Note
The log creation process can be altered in the fusion.properties file located in /fusion/4.2.x/conf/. If logging events are not being written to the logs collection, check the configuration in this file.

Logging Dashboard

See what scheduled jobs are running

The Fusion UI’s Datasources and Jobs pages show the most recent status for data ingestion jobs and other jobs that can be scheduled.

Fusion UI Datasources and Jobs

This can also be accomplished using the Jobs API.

What to monitor

Several levels of information should be monitored, including hardware resources, performance benchmarks, the Java virtual machine, Lucene/Solr, and others. The point at which a monitored metric should trigger a warning or alarm condition is highly dependent on user-specific factors, such as the target audience, time tolerance thresholds, and real-time needs.

Hardware and resources

Monitoring at this level means knowing what is going on with overall system resources.

CPU Idle

This value helps determine whether a system is running out of CPU resources. Establishing average values, such as avg(1 min), helps determine whether CPU Idle is dropping too low. Establishing these averages for each individual CPU core helps you avoid confusion if, for example, everything is bottlenecked on a single CPU core.

Load Average

This value indicates how many processes are queued waiting for CPU time. If it is consistently higher than the number of CPU cores on a system, processes are waiting to be scheduled. The Load Average value might not point to a specific problem, but it does indicate that a problem exists somewhere.

Swap

Most operating systems can extend their RAM by paging portions of memory out to disk and moving them back into RAM as needed. This process is called swapping.

Warning
Swapping can severely impact the performance of Solr. Avoid swapping by reallocating services across different nodes or by adding memory.

Page In/Out

This value indicates active use of swap space. A small amount of used swap with little page in/out activity is usually harmless; however, ongoing use of swap space generally indicates the need for more memory.

Free Memory

This value indicates the amount of free memory available to the system. A system with low amounts of free memory can no longer increase disk caching as queries are run. Because of this, index accesses must wait for permanent disk storage, which is much slower.

Take note of the amount of free memory and the disk cache in comparison to the overall index size. Ideally, the disk cache should be able to hold all of the index, although this is not as important when using newer SSD technologies.

I/O Wait

A high I/O Wait value indicates system processes are waiting on disk access, resulting in a slow system.

Free Disk Space

It is critical to monitor the amount of free disk space, because running out of free disk space can cause corruption in indexes. Avoid this by creating a mitigation plan to reallocate indexes, purge less important data, or add disk space. Create warning and alarm alerting for disk partitions containing application or log data.
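Most of these checks can be made with standard Unix utilities; a quick reference (flags vary slightly between distributions):

uptime        # load average over the last 1, 5, and 15 minutes
top           # per-process CPU usage; press 1 to show individual cores
free -h       # free memory and swap usage
vmstat 5      # paging and swap activity (si/so columns), sampled every 5 seconds
iostat -x 5   # I/O wait and per-device utilization (requires the sysstat package)
df -h         # free disk space per partition
du -h --max-depth=1 /fusion/4.2.x/var/log   # disk space used by each log directory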

Performance benchmarks

These benchmarks are available via JMX MBeans, for example when using JConsole. They are found under the namespace solr/<corename>. Some of these metrics can also be obtained via the MBean attributes listed below.

Average Response Times

Monitoring tools can perform very useful calculations by measuring deltas between incremental values. For example, you might want to calculate the average response time over the last five minutes. Use the Runtime JMX MBean as the divisor to get performance over a time period.

The JMX MBean for the standard request handler is found at: "solr/<corename>:type=standard,id=org.apache.solr.handler.StandardRequestHandler", with the totalTime and requests attributes.

This can be obtained for every Solr requestHandler being used.
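If JMX is not convenient, the same counters are exposed over HTTP through the mbeans endpoint shown earlier. A rough sketch (the exact JSON field names can vary between Solr versions):

# Fetch request handler stats and inspect totalTime and requests for the handler of interest
curl -s "http://${Solr_server}:8983/solr/${Collection}_${shard}_${replica}/admin/mbeans?stats=true&wt=json" \
  | jq '.["solr-mbeans"]'
# lifetime average response time (ms) ~= totalTime / requests;
# sample twice and divide the deltas for a recent-window average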

nth Percentile Response Times

This value allows a user to see the response time at the 75th, 95th, 99th, and 99.9th percentiles. It is a poor performance benchmark during short spikes in query times, but otherwise it can reveal slow degradation problems and baseline behavior.

JMX MBean example: "solr/collection1:type=standard,id=org.apache.solr.handler.component.SearchHandler", 75thPcRequestTime

The same format is used for all percentiles.

Average QPS

Tracking peak QPS over several-minute intervals can show how well Fusion and Solr are handling queries and can be vital in benchmarking plausible load.

The JMX MBean is found at: "solr/collection1:type=standard,id=org.apache.solr.handler.StandardRequestHandler", requests, divided by the JMX Runtime MBean java.lang:Runtime.
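As a back-of-the-envelope example, sample the requests counter twice and divide the delta by the elapsed seconds (the numbers below are made up):

requests_t0=120000          # requests counter at the first sample
requests_t1=129000          # requests counter 300 seconds later
echo "scale=2; ($requests_t1 - $requests_t0) / 300" | bc   # => 30.00 queries per second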

Average Cache/Hit Ratios

This value can indicate potential problems in how Solr is warming searcher caches. A ratio below 1:2 should be considered problematic, as this may affect resource management when warming caches. Again, this value can be used with the Runtime JMX MBean to find changes over a time period.

JMX MBeans are found at: "solr/<corename>:type=<cachetype>,id=org.apache.solr.search.<cache impl>", hitratio / cumulative hitratio

External Query "HTTP Ping"

An external query HTTP ping lets you monitor the performance of the system by sending a request for a basic status response from the system. Because this request does not perform a query of data, the response is fast and consistent. The average response time can be used as a metric for measuring system load. Under light to moderate loads, the response to the HTTP ping request is returned quickly. Under heavy loads, the response lags. If the response lags considerably, this might be a sign that the service is close to breaking.

External query HTTP pings can be configured to target query pipelines, which in turn query Solr. Using this method, you can monitor the uptime of the system by verifying Solr, API nodes, and the proxy service are taking requests.

Every important collection should be monitored, as the performance of one collection may differ from another.

Many vendors provide these services, including Datadog, New Relic, StatusCake, and Pingdom.
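As a simple manual check, an uptime probe might hit Solr's ping handler directly and, separately, go through the Fusion proxy and a query pipeline. The pipeline and collection names below are placeholders, and the exact pipeline endpoint should be confirmed against the Query Pipelines API documentation for your release:

# Solr-level ping (cheap, does not exercise Fusion)
curl -s "http://${Solr_server}:8983/solr/${Collection}/admin/ping?wt=json"

# Fusion-level check through the proxy, API, and a query pipeline (placeholder names)
curl -s -u admin:${PASSWORD} \
  "http://${fusion_server}:8764/api/query-pipelines/${PIPELINE}/collections/${Collection}/select?q=*:*&rows=0"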

Java Virtual Machine

Java Virtual Machine metrics are available via JMX MBeans. Many are also available via REST API calls. See System API for details.

JMX runtime

The JMX runtime value is an important piece of data that helps determine time frames and gather statistics, such as performance over the last five minutes. Use the delta of the runtime value over the last time period as a divisor in other statistics. The metric itself also shows how long a server has been running. This JMX MBean is found at: "java.lang:Runtime", Uptime.

Last Full Garbage Collection

Monitoring the total time taken for full garbage collection cycles is important, because full GC activities typically pause processing for a whole application. Acceptable pause times vary, as they relate to how responsive your queries must be in a worst-case scenario. Generally, anything over a few seconds might indicate a problem.

JMX MBean: "java.lang:type=GarbageCollector,name=<GC Name>", LastGcInfo.duration

Most Fusion log directories also contain detailed GC logs.
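Outside of the logs and JMX, the JDK's jstat utility gives a quick live view of GC activity for a running service (replace <pid> with the Java process ID of that service):

jstat -gcutil <pid> 5000   # sample every 5 seconds; FGC = full GC count, FGCT = total full GC time (s)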

Full GC as % of Runtime

This value indicates the proportion of time spent in full garbage collections. Spending a large amount of time in full GC cycles can indicate that the heap is too small.

Total GC Time as % of Runtime

An indexing server typically spends considerably more time in GC because of all the new data coming in. When tuning heap and generation sizes, it is important to know how much time is being spent in GC: if the heap is too small, GC triggers more frequently; if it is too large, individual pauses become longer.

Total Threads

This is an important value to track because each thread takes up memory space, and too many threads can overload a system. In some cases, out of memory (OOM) errors are not indicative of the need for more memory, but rather that a system is overloaded with threads. If a system has too many threads open, it might indicate a performance bottleneck or a lack of hardware resources.

JMX MBean: "java.lang:type=Threading", ThreadCount.

Lucene/Solr

Lucene/Solr metrics can be found via JMX as well.

Autowarming Times

This value indicates how long new searchers or caches take to initialize and load. The value applies to both searchers and caches. However, the warmupTime for the searcher is separate from that of caches. There is always a tradeoff between autowarming times and having prewarmed caches ready for search. 

Note
If autowarming times are set to be longer than the interval for initializing new searchers, problems may arise.

Searcher and cache JMX MBeans:

"solr/collection1:type=searcher,id=org.apache.solr.search.SolrIndexSearcher", warmupTime.
"solr/collection1:type=documentCache,id=org.apache.solr.search.LRUCache"
"solr/collection1:type=fieldValueCache,id=org.apache.solr.search.FastLRUCache"
"solr/collection1:type=filterCache,id=org.apache.solr.search.FastLRUCache"
"solr/collection1:type=queryResultCache,id=org.apache.solr.search.LRUCache"

Troubleshooting

High CPU load

Heavy CPU load across all nodes in the cluster might indicate the need for more nodes. More typically, however, a single node or service is being overloaded.

  • Use top or another monitoring utility to find out which processes are using high CPU.

  • All Fusion services, including Solr and Zookeeper, run as Java processes. Most services indicate the name of the service via the -DserviceName command-line argument.
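For example, a quick way to map a busy Java PID back to its Fusion service (the grep pattern is only an example):

top -o %CPU                                                # find the PID(s) using the most CPU
ps -p <pid> -o args= | grep -o -- '-DserviceName=[^ ]*'    # read the -DserviceName argument for that PID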

Common reasons for busy services:

  • Connectors-classic: document parsing and ingestion

  • Connectors-classic: document-heavy index and/or query pipeline use

  • API: document-heavy index and/or query pipeline use

  • Solr: use across multiple nodes holding the same collection

  • Solr: use across multiple nodes holding shard leaders

  • Heavy query load involving facets, stats, and/or sorting

  • Heavy document writing to a given collection

What to do:

  • Find out which services or nodes are being overloaded.

  • Find out if there is known activity to explain the excessive CPU load.

  • If no explanation is found, contact Lucidworks Support.

Low memory

Low memory problems are typically the result of two issues: low system memory and a lack of heap space for an individual service.

Lack of free memory

You can detect a lack of free memory on a server using the free -h command. On servers running the Solr service, a large amount of free memory is ideal because this memory is used for filesystem caching. Take note of the amount of free memory and the disk cache in comparison to the overall index size. Ideally, the disk cache should be able to hold all of the index, although this is not as important when using newer SSD technologies.

There are several ways to increase the available memory on a server:

  • Change the command-line options configured in conf/fusion.properties to reduce the heap sizes of individual services. However, this might result in a lack of heap space, as described below.

  • Run fewer services on a node and/or reallocate services that need more memory to nodes with extra capacity.

  • Add nodes to the cluster.

  • Add memory to the nodes.

Lack of heap space

Finding processes that have exited with OutOfMemoryError in the logs, or finding a dump file called java_pidXXX.hprof in the log directory, indicates that a service failed due to lack of heap space.

Heap space is configured on a per-service basis in the conf/fusion.properties file via the -Xmx and -Xms command-line parameters. Avoid allocating heap sizes that are known to be larger than needed for a service, because these can lead to long GC pauses.
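For example, a per-service heap setting typically appears on that service's jvmOptions line; the excerpt below is illustrative, so match the property names and values to the entries already present in your fusion.properties:

# Illustrative excerpt from conf/fusion.properties -- confirm property names against your file
api.jvmOptions = -Xms2g -Xmx2g
connectors-classic.jvmOptions = -Xms2g -Xmx2g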

Two common ways to detect long GC pauses include:

  • Examine the gc_*.log files in the log directories. For deep analysis, upload files to http://gceasy.io/.

  • Using top, look for periods when all cores are busy, followed by a spike in one core while most or all other cores drop to near zero. This pattern is typically found when one service is busy and encountering long GC pauses.

Note
The need for GC analysis varies from application to application.