Troubleshooting Performance Issues
High CPU load
Heavy CPU load across all nodes in the cluster might indicate the need for more nodes. More typically, however, a single node or service is being overloaded.
- Use top or another monitoring utility to find out which processes are using high CPU (see the example after this list).
- All Fusion services, including Solr and Zookeeper, run as Java processes. Most services indicate the name of the service via the -DserviceName command-line argument.
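For example, on a Linux host you can list the Fusion Java processes sorted by CPU usage, along with the -DserviceName argument that identifies each service. This is a minimal sketch; the options shown assume GNU ps:

ps -eo pid,pcpu,args --sort=-pcpu | grep -- '-DserviceName' | grep -v grep

Within top itself, pressing c typically toggles display of the full command line, which includes the -DserviceName argument.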
Common reasons for busy services:
- Connectors-classic: document parsing and ingestion
- Connectors-classic: document-heavy index and/or query pipeline use
- API: document-heavy index and/or query pipeline use
- Solr: use across multiple nodes holding the same collection
- Solr: use across multiple nodes holding shard leaders
- Heavy query load involving facets, stats, and/or sorting
- Heavy document writing to a given collection
What to do:
- Find out which services or nodes are being overloaded.
- Find out if there is known activity to explain the excessive CPU load.
- If no explanation is found, contact Lucidworks Support.
Low memory
Low memory problems typically stem from one of two issues: a lack of free system memory, or a lack of heap space for an individual service.
Lack of free memory
You can detect a lack of free memory on a server using the free -h command. On servers running the Solr service, a large amount of free memory is ideal, because this memory is used for filesystem caching. Take note of the amount of free memory and the disk cache in comparison to the overall index size. Ideally, the disk cache should be able to hold the entire index, although this is less important when using newer SSD technologies.
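For example, on a Linux host you might compare the free and cached memory reported by free -h against the size of the Solr index on disk. The index path below is only an illustration; substitute your actual Solr data directory:

free -h
du -sh /opt/fusion/data/solr    # hypothetical path to the Solr index data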
There are several ways to increase the available memory on a server:
- Change the command-line options configured in conf/fusion.properties to reduce the heap sizes of individual services. However, this might result in a lack of heap space, as described below.
- Run fewer services on a node and/or reallocate services that need more memory to nodes with extra capacity (see the example after this list).
- Add nodes to the cluster.
- Add memory to the nodes.
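To decide which services to shrink or move, it helps to see how much resident memory each Fusion process is actually using. A minimal sketch, assuming a Linux host with GNU ps:

ps -eo pid,rss,args --sort=-rss | grep -- '-DserviceName' | grep -v grep

The rss column reports resident memory in kilobytes, and the -DserviceName argument in each command line identifies the Fusion service.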
Lack of heap space
Finding processes that have exited with OutOfMemoryError in the logs, or finding a dump file called java_pidXXX.hprof in the log directory, indicates that a service failed due to lack of heap space.
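A quick way to check for either symptom from the command line (the log directory shown here is an assumption; adjust it to your installation):

grep -R "OutOfMemoryError" /opt/fusion/var/log/          # scan service logs for OOM errors
find /opt/fusion/var/log/ -name 'java_pid*.hprof'        # look for heap dump files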
Heap space is configured on a per-service basis in the conf/fusion.properties file via the -Xmx and -Xms command-line parameters. Avoid allocating heap sizes that are known to be larger than needed for a service, because these can lead to long GC pauses.
Two common ways to detect long GC pauses:
- Examine the gc_*.log files in the log directories (see the example after this list). For deep analysis, upload the files to http://gceasy.io.
- Using top, look for periods when all cores are busy, followed by a spike in one core while most or all other cores drop to near zero. This pattern typically appears when one service is busy and is encountering long GC pauses.
The need for GC analysis varies from application to application.
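As a first pass before uploading logs to gceasy.io, you can scan the GC logs for pause records from the shell. A rough sketch, assuming JDK 9+ unified GC logging and a hypothetical log path (older JDK 8 GC logs use a different format, where grepping for real= times is more useful):

grep -E "Pause (Young|Full)" /opt/fusion/var/log/api/gc_*.log | tail -n 20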
High response time at the proxy and API layers
Under regular load with occasional spikes, response times at the proxy and API layers can become very high. One common cause is that the proxy's connection pool is too small. The default value of both properties that control the number of connections available in the pool is 50:
- com.lucidworks.apollo.admin.proxy.max.conn.per.route: controls the number of connections per route
- com.lucidworks.apollo.admin.proxy.max.conn.total: controls the maximum number of connections
Check the proxy logs:
2019-08-26T18:20:26,377 - INFO [main:clojure.tools.logging$eval11$fn__16@0] - {} - max-per-route: 50 , max-conn-total: 50
If the number of connections is a bottleneck, increase these values by adding the properties to the proxy JVM options in the conf/fusion.properties file. For example, to configure the pool to allow 150 connections:
proxy.jvmOptions = -Xmx${PROXY_MEM:-512m} -Dcom.lucidworks.apollo.admin.proxy.max.conn.per.route=150 -Dcom.lucidworks.apollo.admin.proxy.max.conn.total=150
The proxy logs should now show the max connections set to 150:
2019-08-26T18:25:26,377 - INFO [main:clojure.tools.logging$eval11$fn__16@0] - {} - max-per-route: 150 , max-conn-total: 150
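JVM options are only read at startup, so restart the proxy service for the change to take effect. You can then confirm the new pool size from the shell (the log path and file name here are assumptions; adjust them to your installation):

grep "max-per-route" /opt/fusion/var/log/proxy/proxy.log | tail -n 1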