Spark Troubleshooting

Table of Contents

Begin troubleshooting process
Known issues and solutions
Other troubleshooting steps
- Log API endpoints for Spark jobs

This article contains tips and techniques for troubleshooting Spark.

Begin troubleshooting process

First determine if the job is a Spark job. Spark jobs display in the Fusion UI Jobs panel and start with spark:. Additionally, a Spark job has a job ID attributed as SPARK JOB ID .

View a comprehensive list of spark jobs:
- Fusion 4.x:
  - Spark Jobs
  - Spark Jobs API
Next, check whether Spark services are enabled or if it is a local Spark instance instantiated to run Spark-related jobs.
1. Go to fusion.properties and look for group.default. The line should have spark-master and spark-worker in the list, for example group.default = zookeeper, solr, api, connectors-classic, connectors-rpc, proxy, webapps, admin-ui, spark-master, spark-worker, log-shipper.
2. Connect to spark-client via shell script by navigating to fusion_home/bin/ directory and attempt to start via ./spark-shell. It should successfully connect with a message, for example "Launching Spark Shell with Fusion Spark Master: local". If spark-shell fails to connect at all, copy the error message and pass it to Lucidworks support.
Try connecting to Apache Spark’s admin URL. It should be accessible via host:8764. Check if scheduled jobs are complete or running.
Re-confirm the status of Spark services by querying the API endpoint at http://localhost:8764/api/spark/info. The API should return mode and masterUrl. If mode and masterUrl are local, then Spark services are not enabled explicitly or they are in a failure state. If Spark services are enabled then you will see mode as STANDALONE.
If Spark services were enabled but the API endpoint returned mode as LOCAL then there is an issue with starting Spark services.
1. Restart Fusion with its Spark services as a first option.
2. Check driver default logs via API endpoint and increase the numbers of rows param as required, for example`http://<FUSION_HOST>/api/spark/log/driver/default?rows=100` OR http://localhost:8764/api/spark/log/driver/default?rows=100.
3. If you do not see an error stack trace in detail via the API endpoint, check the server tail Spark driver default logs by navigating to fusion_home/var/log/api/ and do tail -F spark-driver-default.log, or copy the complete log files under /fusion_home/var/log/api/ (for example, spark-driver-default.log, spark-driver-scripted.log, api.log, spark-driver-script-stdout.log) and share with Lucidworks to troubleshoot the actual issue.
  1. Logs required to troubleshoot Spark jobs failure are responses to below endpoints
    
    http://localhost:8764/api/spark/master/config
    
    http://localhost:8764/api/spark/worker/config
    
    http://localhost:8764/api/spark/master/status
    
    https://host:8764/api/spark/info
    
    https://host:8764/api/spark/configurations
    
    http://192.168.29.185:8764/api/apollo/configurations

Standard Fusion setup without Spark services enabled expects the Spark jobs to work in local mode, however they are failing. When Spark services are not configured to start, a local Spark instance is instantiated to run Spark-related jobs. This Spark instance could possibly have issues.

Steps

Run the curl command:

 curl -u USERNAME:PASSWORD -X POST -H "Content-type:application/json" http://localhost:8764/api/apollo/configurations/fusion.spark.driver.jar.exclusions -d

".*io.grpc.*,.*org.apache.spark.*,.*org.spark-project.*,.*org.apache.hadoop.*,.*org.apache.derby.*,.*spark-assembly.*,.*spark-network.*,.*spark-examples.*,.*\\/hadoop-.*,.*\\/tachyon.*,.*\\/datanucleus.*,.*\\/scala-library.*,.*\\/solr-commons-csv.*,.*\\/spark-csv.*,.*\\/hive-jdbc-shaded.*,.*\\/sis-metadata.*,.*\\/bcprov.*,.*spire.*,.*com.chuusai.*,.*shapeless.*"

curl -u USERNAME:PASSWORD -X POST -H "Content-type: application/json" -d "true" http://host:8764/api/apollo/configurations/spark.sql.caseSensitive

Remove any shaded jars on file system.

find . -name "spark-shaded*jar" -exec rm {} \;

Restart Fusion, specifically API and spark-worker services (if running).

Spark job (script or aggregation) is not getting all the resources available on the workers. By default, each application is only configured to get 0.5 of available memory on the cluster and 0.8 of available cores. Refer below links on how to make changes.
1. Number of instances and cores allocated
2. Memory allocation

Other troubleshooting steps

Log API endpoints for Spark jobs

Log endpoints are useful for debugging Spark jobs on multiple nodes. In a distributed environment, the log endpoints parse the last N log lines from different Spark log files on multiple nodes and output the responses from all nodes as text/plain (which renders nicely in browsers) sorted by the timestamp.

The REST API Reference documents log endpoints for Spark jobs. The URIs for the endpoints contain /api/spark/log.

The most useful log API endpoint is the spark/log/job/ endpoint, which goes through all Fusion REST API and Spark logs, filters the logs by the jobId (using MDC, the mapped diagnostic context), and merges the output from different files.

For example, to obtain log content for the job jobId:

curl -u USERNAME:PASSWORD "https://FUSION_HOST:6764/api/spark/log/job/jobId"

Log endpoints will only output data from log files on nodes on which the API service is running.

Specific issues

These are some specific issues you might encounter.

Job hung in waiting status

Check the logs for a message that looks like:

2016-10-07T11:51:44,800 - WARN  [Timer-0:Logging$class@70] - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

If you see this, then it means your job has requested more CPU or memory than is available. For instance, if you ask for 4g but there is only 2g available, then the job will just hang in WAITING status.

Lost executor due to heartbeat timeout

If you see errors like the following:

2016-10-09T19:56:51,174 - WARN  [dispatcher-event-loop-5:Logging$class@70] - Removing executor 1 with no recent heartbeats: 160532 ms exceeds timeout 120000 ms

2016-10-09T19:56:51,175 - ERROR [dispatcher-event-loop-5:Logging$class@74] - Lost executor 1 on ip-10-44-188-82.ec2.internal: Executor heartbeat timed out after 160532 ms

2016-10-09T19:56:51,178 - WARN  [dispatcher-event-loop-5:Logging$class@70] - Lost task 22.0 in stage 1.0 (TID 166, ip-10-44-188-82.ec2.internal): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 160532 ms

This is most likely due to an OOM in the executor JVM (preventing it from maintaining the heartbeat with the application driver). However, we have seen cases where tasks fail, but the job still completes, so you will need to wait it out to see if the job recovers.

Another situation when this can occur is when a shuffle size (incoming data for a particular task) exceeds 2GB. This is hard to predict in advance because it depends on job parallelism and the number of records produced by earlier stages. The solution is to re-submit the job with increased job parallelism.

Spark Master will not start on EC2

See aws-instances-and-java-net-unknownhostexception for a solution.

For custom Spark troubleshooting, see Troubleshoot Custom Spark Jobs.