Scaling Spark Aggregations

In this section, we’ll walk through running a simple aggregation over 130M signals (created from synthetic data).

One of the most common issues when running an aggregation job over a large signals data set is task timeouts in Stage 2 (foreachPartition). These are typically caused by slow indexing of aggregated results back into Solr or by slow JavaScript functions. The solution is to increase the number of partitions of the aggregated RDD (the input to Stage 2) by raising the default parallelism with the following configuration property:

> curl -u ... -H 'Content-type:application/json' -X PUT -d '72' "$FUSION_API/configurations/spark.default.parallelism"

After making the change, the foreachPartition stage of the job will use 72 partitions:

[Spark UI screenshot: the foreachPartition stage running with 72 partitions]
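To see why the partition count matters here, the sketch below shows the general shape of a Stage 2 like this in Spark's Scala API: each partition of the aggregated RDD is indexed by one task, so more partitions mean smaller, shorter-running tasks that are less likely to time out. This is a minimal illustration, not Fusion's actual job code; the names aggregatedRdd and indexBatch are hypothetical.

```scala
import org.apache.spark.rdd.RDD

// Illustrative sketch only -- aggregatedRdd and indexBatch are hypothetical
// stand-ins, not Fusion's actual implementation.
def indexToSolr(aggregatedRdd: RDD[Map[String, Any]], numPartitions: Int): Unit = {
  aggregatedRdd
    .repartition(numPartitions)     // e.g. 72, matching spark.default.parallelism
    .foreachPartition { docs =>
      // Each task indexes one (now smaller) partition, so no single task
      // runs long enough to hit the timeout.
      indexBatch(docs.toSeq)
    }
}

// Hypothetical helper standing in for the code that writes a batch to Solr.
def indexBatch(docs: Seq[Map[String, Any]]): Unit = { /* send docs to Solr */ }
```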

You can also increase the number of rows read per page (the default is 10000) by passing the rows parameter when starting your aggregation job, for example:

> curl -u ... -XPOST "$FUSION_API/aggregator/jobs/perf_signals/perfJob?rows=20000&sync=false"

For example, we were able to read 130M signals from Solr in 18 minutes (~120K rows/sec) using rows=20000, vs. 21 minutes using the default of 10000.
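The rows parameter controls the page size of these reads, so larger pages mean fewer round trips for the same data: 130M documents take 6,500 requests at rows=20000 vs. 13,000 at the default 10000. As a rough illustration of this kind of paged read, here is a hedged SolrJ sketch using Solr's cursorMark deep paging; the ZooKeeper address, collection name, and query are assumptions, and this is not Fusion's actual reader code.

```scala
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.CloudSolrClient
import org.apache.solr.common.params.CursorMarkParams

// Illustrative only: page through all matching docs, `rows` docs at a time.
val client = new CloudSolrClient.Builder(
  java.util.List.of("localhost:9983"), java.util.Optional.empty[String]()).build()
client.setDefaultCollection("perf_signals")   // assumed collection name

val query = new SolrQuery("*:*")              // assumed query
query.setRows(20000)                          // page size, i.e. the rows parameter
query.setSort("id", SolrQuery.ORDER.asc)      // cursorMark requires a stable sort

var cursorMark = CursorMarkParams.CURSOR_MARK_START
var done = false
while (!done) {
  query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
  val rsp = client.query(query)
  // process rsp.getResults here ...
  val nextCursorMark = rsp.getNextCursorMark
  done = cursorMark == nextCursorMark         // unchanged cursor => all pages read
  cursorMark = nextCursorMark
}
client.close()
```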