Scaling Spark Aggregations
In this section, we’ll walk through running a simple aggregation over 130M signals (created from synthetic data).
One of the first knobs to adjust is spark.default.parallelism, which controls how many partitions (and therefore concurrent tasks) Spark uses for operations that don’t specify a partition count. You can set it through the Fusion configuration API:
> curl -u ... -H 'Content-type:application/json' -X PUT -d '72' "$FUSION_API/configurations/spark.default.parallelism"
After making this change, the foreachPartition stage of the aggregation job uses 72 partitions.
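To see why this setting matters, here is a minimal standalone Spark sketch in Scala (not the Fusion aggregator code itself; the job name and data are made up for illustration). RDDs created without an explicit partition count fall back to spark.default.parallelism, and each partition becomes one task in a foreachPartition stage:

import org.apache.spark.{SparkConf, SparkContext}

object ParallelismSketch {
  def main(args: Array[String]): Unit = {
    // Same value we pushed to Fusion via the configuration API above.
    // Local master is used here only to keep the sketch self-contained.
    val conf = new SparkConf()
      .setAppName("parallelism-sketch")
      .setMaster("local[*]")
      .set("spark.default.parallelism", "72")
    val sc = new SparkContext(conf)

    // No explicit partition count, so Spark uses spark.default.parallelism.
    val signals = sc.parallelize(1 to 1000000)
    println(s"partitions = ${signals.getNumPartitions}") // prints 72

    // One task per partition, so up to 72 tasks can run concurrently
    // (given enough executor cores).
    signals.foreachPartition { iter =>
      var count = 0
      iter.foreach(_ => count += 1)
      // per-partition processing would happen here
    }

    sc.stop()
  }
}

In general, more partitions means more tasks that can run in parallel across your executor cores, at the cost of some per-task overhead.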
You can also increase the number of rows read per page from Solr (the default is 10,000) by passing the rows parameter when starting your aggregation job, for example:
> curl -u ... -XPOST "$FUSION_API/aggregator/jobs/perf_signals/perfJob?rows=20000&sync=false"
With rows=20000, we were able to read 130M signals from Solr in 18 minutes (~120K rows/sec), versus 21 minutes using the default page size of 10,000.
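The same page-size idea applies when reading from Solr directly with the Lucidworks spark-solr connector. The sketch below is illustrative only: the ZooKeeper host and collection name are placeholders, and it is not the aggregator job itself, just a plain DataFrame read that asks for 20,000 documents per request:

import org.apache.spark.sql.SparkSession

object SolrPagingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("solr-paging-sketch")
      .getOrCreate()

    // "zkhost" and "collection" are placeholder values; "rows" controls how
    // many documents are pulled per request while streaming results out of
    // Solr, analogous to the rows parameter passed to the aggregator job.
    val signals = spark.read.format("solr")
      .option("zkhost", "localhost:9983")
      .option("collection", "perf_signals")
      .option("rows", "20000")
      .load()

    println(s"read ${signals.count()} signals")
    spark.stop()
  }
}

Larger pages mean fewer round trips to Solr per partition, which is where the throughput gain comes from; pushing the page size too high mainly increases memory pressure on the Solr side, so it is worth benchmarking a few values against your own collection.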