Apache Spark is an open-source cluster-computing framework that serves as a fast, general-purpose execution engine for large-scale data processing. Jobs are decomposed into stepwise tasks, which are distributed across a cluster of networked computers.
Spark improves on previous MapReduce implementations by using resilient distributed datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
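The fault tolerance of RDDs comes from lineage: an RDD records the transformations used to build it, so a lost partition can be recomputed from its source data rather than restored from a replica. The following is a minimal sketch of that idea in plain Python. It is a toy model, not the Spark API; `ToyRDD`, `compute_partition`, and `lose_partition` are hypothetical names introduced only for illustration.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's implementation):
    partitioned source data plus a recorded lineage of transformations."""

    def __init__(self, source_partitions, lineage=()):
        self._source = [list(p) for p in source_partitions]
        self._lineage = tuple(lineage)  # ordered per-partition functions
        self._cache = {}                # partition index -> in-memory result

    def map(self, fn):
        # Transformations are lazy: only the lineage grows, nothing is computed.
        step = lambda part: [fn(x) for x in part]
        return ToyRDD(self._source, self._lineage + (step,))

    def filter(self, pred):
        step = lambda part: [x for x in part if pred(x)]
        return ToyRDD(self._source, self._lineage + (step,))

    def compute_partition(self, index):
        # Use the in-memory copy if present; otherwise replay the lineage.
        if index not in self._cache:
            part = self._source[index]
            for step in self._lineage:
                part = step(part)
            self._cache[index] = part
        return self._cache[index]

    def collect(self):
        # Action: materialize every partition and concatenate the results.
        out = []
        for i in range(len(self._source)):
            out.extend(self.compute_partition(i))
        return out

    def lose_partition(self, index):
        # Simulate a failed node by dropping one cached partition.
        self._cache.pop(index, None)


squares = ToyRDD([[1, 2, 3], [4, 5, 6]]).map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

before = evens.collect()   # computes and caches both partitions
evens.lose_partition(1)    # partition 1 is "lost"
after = evens.collect()    # partition 1 is rebuilt from lineage
```

In this sketch `before` and `after` hold the same values, because the lost partition is rebuilt by replaying the recorded `map` and `filter` steps over its source data. Real Spark applies the same principle across executors on a cluster.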
These topics explain Spark administration concepts in Fusion 5:
Spark operations include:
Audit all Spark jobs for natural key support.
Audit SQL Aggregation jobs for natural key usage.
For the ALS Recommender job, audit the generated aggregation SQL to ensure that it uses a natural key projection.
Support partitioning in all Spark jobs, in accordance with the configuration options.
Support external data sources for all jobs (Spark, NLP, Clustering, Recommender), including external Spark source support for NLP.