This article contains tips and techniques for troubleshooting Spark.
Log endpoints are useful for debugging Spark jobs on multiple nodes. In a distributed environment, the log endpoints parse the last N log lines from different Spark log files on multiple nodes and output the responses from all nodes as text/plain (which renders nicely in browsers), sorted by timestamp.
The REST API Reference documents log endpoints for Spark jobs. The URIs for the endpoints contain
The most useful log API endpoint is the spark/log/job/ endpoint, which goes through all Fusion REST API and Spark logs, filters the logs by the jobId (using MDC, the mapped diagnostic context), and merges the output from different files.
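The filter-and-merge behavior described above can be sketched in a few lines of Python. This is only an illustration, not Fusion's implementation: the function name `merge_job_logs` is hypothetical, the job ID is assumed to appear literally in each relevant line, and every line is assumed to start with a timestamp in the format used by the log samples in this article (for example `2016-10-07T11:51:44,800`). Taking the last N lines per file before filtering is also an assumption.

```python
from datetime import datetime

# Timestamp layout assumed from the log samples in this article.
TS_FORMAT = "%Y-%m-%dT%H:%M:%S,%f"


def merge_job_logs(sources, job_id, last_n=100):
    """Merge log lines for one job from several sources, ordered by timestamp.

    sources: a list of lists of log lines, one inner list per log file or node.
    job_id:  the job identifier to filter on (assumed to appear in the line).
    last_n:  how many trailing lines of each source to consider.
    """
    merged = []
    for lines in sources:
        for line in lines[-last_n:]:
            if job_id not in line:
                continue
            # The timestamp is the first space-delimited token of the line.
            stamp = datetime.strptime(line.split(" ", 1)[0], TS_FORMAT)
            merged.append((stamp, line))
    merged.sort(key=lambda pair: pair[0])
    return [line for _, line in merged]
```

Calling `merge_job_logs([node_a_lines, node_b_lines], "my-job")` returns a single list in which lines from both nodes are interleaved in time order, which is roughly what the merged text/plain response looks like.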
For example, to obtain log content for the job
curl -u user:password "$FUSION_API/spark/log/job/jobId"
Log endpoints output data only from log files on nodes where the API service is running.
These are some specific issues you might encounter.
Check the logs for a message that looks like:
2016-10-07T11:51:44,800 - WARN [Timer-0:Logging$class@70] - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
This message means your job has requested more CPU or memory than is available. For instance, if you request 4g of memory but only 2g is available, the job hangs in WAITING status.
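The 4g versus 2g comparison above can be expressed as a small pre-flight check. The sketch below is a deliberate simplification: the helper names `to_mib` and `request_fits` are hypothetical, the available figures would have to be read from your cluster UI, and a real scheduler allocates per worker rather than against a single total.

```python
# Multipliers to convert a size suffix to MiB (suffix style as in "4g", "512m").
UNITS = {"k": 1 / 1024, "m": 1, "g": 1024, "t": 1024 * 1024}


def to_mib(size):
    """Convert a size string such as '4g' or '512m' to MiB."""
    size = size.strip().lower()
    return float(size[:-1]) * UNITS[size[-1]]


def request_fits(requested_mem, available_mem, requested_cores, available_cores):
    """Return True if both the memory and the core request can be satisfied."""
    return (
        to_mib(requested_mem) <= to_mib(available_mem)
        and requested_cores <= available_cores
    )
```

With these helpers, `request_fits("4g", "2g", 2, 8)` is false, which corresponds to the case in which the job waits for resources that never become available.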
If you see errors like the following:
2016-10-09T19:56:51,174 - WARN [dispatcher-event-loop-5:Logging$class@70] - Removing executor 1 with no recent heartbeats: 160532 ms exceeds timeout 120000 ms
2016-10-09T19:56:51,175 - ERROR [dispatcher-event-loop-5:Logging$class@74] - Lost executor 1 on ip-10-44-188-82.ec2.internal: Executor heartbeat timed out after 160532 ms
2016-10-09T19:56:51,178 - WARN [dispatcher-event-loop-5:Logging$class@70] - Lost task 22.0 in stage 1.0 (TID 166, ip-10-44-188-82.ec2.internal): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 160532 ms
These errors are most likely due to an out-of-memory (OOM) condition in the executor JVM, which prevents it from maintaining the heartbeat with the application driver. However, tasks sometimes fail while the job still completes, so wait to see whether the job recovers.
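To see which executors and hosts are affected, the heartbeat-timeout errors can be pulled out of a log with a short script. The sketch below is an assumption-laden helper, not part of Fusion: the name `find_lost_executors` is hypothetical, and the pattern is based only on the sample ERROR message shown above, so other Spark versions may word the message differently.

```python
import re

# Pattern derived from the sample "Lost executor" message in this article.
LOST_EXECUTOR = re.compile(
    r"Lost executor (\d+) on (\S+): Executor heartbeat timed out after (\d+) ms"
)


def find_lost_executors(log_lines):
    """Return (executor_id, host, elapsed_ms) for each heartbeat-timeout loss."""
    found = []
    for line in log_lines:
        for match in LOST_EXECUTOR.finditer(line):
            executor_id, host, elapsed = match.groups()
            found.append((executor_id, host, int(elapsed)))
    return found
```

If the same host shows up repeatedly in the result, that is a hint to look at memory pressure on that node before resubmitting the job.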
These errors can also occur when a shuffle size (the incoming data for a particular task) exceeds 2 GB. This is hard to predict in advance because it depends on job parallelism and the number of records produced by earlier stages. The solution is to resubmit the job with increased job parallelism.
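As a rough guide for choosing a higher parallelism, you can divide an estimate of the total shuffle data by a per-task target that stays well below the 2 GB limit. The sketch below assumes the data is spread evenly across tasks, which skewed keys will violate; the function name `min_parallelism` and the 50% headroom default are illustrative assumptions, and the total shuffle size has to be estimated by you, for example from a previous run.

```python
import math

# The 2 GB per-task shuffle limit mentioned above, in bytes.
SHUFFLE_LIMIT_BYTES = 2 * 1024**3


def min_parallelism(total_shuffle_bytes, target_fraction=0.5):
    """Estimate how many tasks keep each task's shuffle input under the limit.

    target_fraction is the share of the 2 GB limit each task should stay
    under (an assumed safety margin, not a Spark setting).
    """
    target_bytes = SHUFFLE_LIMIT_BYTES * target_fraction
    return max(1, math.ceil(total_shuffle_bytes / target_bytes))
```

For example, with an estimated 100 GiB of shuffle data and the default margin, each task is targeted at 1 GiB, so the estimate works out to 100 tasks. Treat the result as a lower bound and raise it further if the data is skewed.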