Troubleshooting Performance Issues

Troubleshooting

High CPU load

Heavy CPU load across all nodes in the cluster might indicate the need for more nodes. More typically, however, a single node or service is being overloaded.

  • Use top or another monitoring utility to find out which processes are using the most CPU.

  • All Fusion services, including Solr and ZooKeeper, run as Java processes. Most services identify themselves with the -DserviceName command-line argument, which lets you match a busy Java process to a Fusion service.
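To match busy processes to services, a minimal Python sketch like the one below can read each running process's command line and extract the -DserviceName value. It assumes a Linux host (it reads /proc/<pid>/cmdline), that the JVM executable is named java, and that the property is passed in the standard -DserviceName=<name> form. The function names are hypothetical.

```python
import os
from typing import Optional


def service_name(args: list[str]) -> Optional[str]:
    """Return the value of -DserviceName=... from a JVM argument list, if present."""
    prefix = "-DserviceName="
    for arg in args:
        if arg.startswith(prefix):
            return arg[len(prefix):]
    return None


def java_services(proc_root: str = "/proc") -> dict[int, Optional[str]]:
    """Map the PID of every running Java process to its -DserviceName value (Linux only)."""
    services: dict[int, Optional[str]] = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/self
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited or is not readable by this user
        # /proc/<pid>/cmdline separates arguments with NUL bytes
        args = [a.decode(errors="replace") for a in raw.split(b"\0") if a]
        if args and os.path.basename(args[0]) == "java":
            services[int(entry)] = service_name(args)
    return services
```

On a Fusion node, calling java_services() returns a mapping from PID to service name. You can compare it with the PIDs that top reports as busy. Processes without the property map to None.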

Common reasons for busy services:

  • Connectors-classic: document parsing and ingestion

  • Connectors-classic: document-heavy index and/or query pipeline use

  • API: document-heavy index and/or query pipeline use

  • Solr: use across multiple nodes holding the same collection

  • Solr: use across multiple nodes holding shard leaders

  • Heavy query load involving facets, stats, and/or sorting

  • Heavy document writing to a given collection

What to do:

  • Find out which services or nodes are being overloaded.

  • Find out if there is known activity to explain the excessive CPU load.

  • If you find no explanation, contact Lucidworks Support.

Low memory

Low memory problems typically have one of two causes: a lack of free system memory, or a lack of heap space for an individual service.

Lack of free memory

You can detect a lack of free memory on a server using the free -h command. On servers running the Solr service, a large amount of free memory is ideal, because the operating system uses it for filesystem caching. Compare the amount of free memory and the size of the disk cache with the overall index size. Ideally, the disk cache should be able to hold the entire index, although this is less important with newer SSD storage.
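The comparison above can be scripted. The following is a minimal Python sketch, assuming a Linux host where /proc/meminfo is available and that you know the directory holding the Solr index files. The index path and the function names are hypothetical. It treats MemFree plus Cached as the memory available for caching the index, which is an approximation, because Cached also includes pages from unrelated files.

```python
import os


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo content into a dict; kB entries are converted to bytes."""
    values: dict[str, int] = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        amount = int(fields[0])
        # Most entries are in kB; a few (e.g. HugePages_Total) are plain counts.
        if len(fields) > 1 and fields[1] == "kB":
            amount *= 1024
        values[key.strip()] = amount
    return values


def index_size(index_dir: str) -> int:
    """Total size in bytes of all files under index_dir."""
    total = 0
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file removed during the walk (e.g. by a segment merge)
    return total


def cache_coverage(meminfo: dict[str, int], index_bytes: int) -> float:
    """Fraction of the index that free memory plus page cache could hold (capped at 1.0)."""
    if index_bytes == 0:
        return 1.0
    available = meminfo.get("MemFree", 0) + meminfo.get("Cached", 0)
    return min(1.0, available / index_bytes)
```

For example, pass the contents of /proc/meminfo to parse_meminfo() and the result of index_size() on your index directory to cache_coverage(). A value well below 1.0 suggests that much of the index cannot stay in the filesystem cache.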

There are several ways to increase the available memory on a server:

  • Change the command-line options configured in conf/fusion.cors to reduce the heap sizes of individual services. However, this might result in a lack of heap space, as described below.

  • Run fewer services on a node and/or reallocate services that need more memory to nodes with extra capacity.

  • Add nodes to the cluster.

  • Add memory to the nodes.

Lack of heap space

If the logs show that a process exited with an OutOfMemoryError, or if the log directory contains a heap dump file named java_pidXXX.hprof, the service failed due to a lack of heap space.
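A minimal Python sketch of this check follows. It scans a log directory recursively for java_pid*.hprof heap dumps and for *.log files that mention OutOfMemoryError. The function name is hypothetical, and you supply the log directory path yourself. Rotated logs whose names do not end in .log are not checked.

```python
import glob
import os


def heap_failure_evidence(log_dir: str) -> dict[str, list[str]]:
    """Collect heap dumps and log files in log_dir that indicate heap exhaustion."""
    # "**" with recursive=True also matches files directly inside log_dir
    dumps = sorted(glob.glob(os.path.join(log_dir, "**", "java_pid*.hprof"), recursive=True))
    oom_logs: list[str] = []
    for path in sorted(glob.glob(os.path.join(log_dir, "**", "*.log"), recursive=True)):
        try:
            with open(path, encoding="utf-8", errors="replace") as f:
                if any("OutOfMemoryError" in line for line in f):
                    oom_logs.append(path)
        except OSError:
            continue  # unreadable or rotated away during the scan
    return {"heap_dumps": dumps, "oom_logs": oom_logs}
```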

Heap space is configured per service in the conf/fusion.cors file via the -Xms (initial heap size) and -Xmx (maximum heap size) command-line parameters. Avoid allocating heaps that are known to be larger than a service needs, because large heaps can lead to long GC pauses.
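To check which heap sizes a running service actually received, you can read its JVM arguments, for example from /proc/<pid>/cmdline or ps output. The following minimal Python sketch extracts -Xms and -Xmx values in bytes from such an argument list. It assumes sizes use no suffix or one of the k, m, or g suffixes, and that the last occurrence of a flag wins. The function name is hypothetical.

```python
import re
from typing import Optional

_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}
_HEAP_FLAG = re.compile(r"^-Xm([xs])(\d+)([kKmMgG]?)$")


def heap_settings(args: list[str]) -> dict[str, Optional[int]]:
    """Return the -Xms/-Xmx values (in bytes) found in a JVM argument list.

    If a flag appears more than once, the last occurrence is kept.
    """
    found: dict[str, Optional[int]] = {"Xms": None, "Xmx": None}
    for arg in args:
        m = _HEAP_FLAG.match(arg)
        if m:
            kind, number, unit = m.groups()
            found["Xm" + kind] = int(number) * _UNITS[unit.lower()]
    return found
```

Summing the Xmx values across all services on a node and comparing the total with the node's physical memory is one way to see whether heaps are oversized.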

Two common ways to detect long GC pauses:

  • Examine the gc_*.log files in the log directories. For deeper analysis, upload the files to http://gceasy.io/.

  • Use top to look for periods when all cores are busy, followed by a spike in one core while most or all other cores drop to near zero. This pattern typically appears when one service is busy and is encountering long GC pauses.
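As a quick first pass before a full analysis, a minimal Python sketch like the one below can scan a GC log for long stop-the-world pauses. It assumes the log was written by a Java 8-style JVM with -XX:+PrintGCApplicationStoppedTime, which emits lines beginning "Total time for which application threads were stopped". Other log formats, such as Java 9+ unified logging or locales that use a decimal comma, need a different pattern. These lines cover all safepoint pauses, not only GC. The 1-second threshold and the function name are hypothetical.

```python
import re

# Java 8-style safepoint line, e.g.
# "Total time for which application threads were stopped: 1.2345678 seconds, ..."
_STOPPED = re.compile(r"Total time for which application threads were stopped: ([0-9.]+) seconds")


def long_pauses(lines, threshold_seconds: float = 1.0) -> list[float]:
    """Return every stop-the-world pause (in seconds) at or above the threshold."""
    pauses = []
    for line in lines:
        m = _STOPPED.search(line)
        if m:
            seconds = float(m.group(1))
            if seconds >= threshold_seconds:
                pauses.append(seconds)
    return pauses
```

Because an open file object iterates line by line, you can pass it directly, for example long_pauses(f) inside a with open(...) as f: block for each gc_*.log file.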

Note
The need for GC analysis varies from application to application.