Spark Operations
Apache Spark is an open-source cluster-computing framework that serves as a fast, general execution engine for large-scale data processing. Jobs are decomposed into stepwise tasks, which are distributed across a cluster of networked computers.
Spark improves on previous MapReduce implementations by using resilient distributed datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
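To make the RDD abstraction concrete, here is a minimal Scala sketch (assuming a local session purely for illustration) that caches a dataset in memory and runs two computations over it. If an executor is lost, Spark rebuilds the missing partitions from the RDD's lineage rather than restarting the whole job.

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in Fusion, Spark jobs run on
    // the cluster that Fusion manages.
    val spark = SparkSession.builder()
      .appName("rdd-sketch")
      .master("local[*]")
      .getOrCreate()

    // Build an RDD and cache it in memory so that both computations below
    // reuse the same in-memory partitions instead of recomputing them.
    val numbers = spark.sparkContext.parallelize(1 to 1000000).cache()
    val evens = numbers.filter(_ % 2 == 0).count()
    val total = numbers.map(_.toLong).reduce(_ + _)

    println(s"evens=$evens total=$total")
    spark.stop()
  }
}
```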
These topics explain Spark administration concepts in Fusion 5. Spark operations include:
- Audit all Spark jobs for natural key support.
- Audit SQL Aggregation jobs for natural key usage.
- When reviewing the SQL for the BPR Recommender job, audit the generated aggregation SQL to ensure that it uses a natural key projection (see the aggregation sketch after this list).
- Support partitioning in all Spark jobs in accordance with their configuration options (see the partitioning sketch below).
- Support external data sources for all jobs (Spark, NLP, Clustering, Recommender), including external Spark source support for NLP (see the external source sketch below).
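As a concrete illustration of natural key usage, the following Scala sketch aggregates a hypothetical signals view. The column names (`query`, `doc_id`, `count_i`) and the natural key (`query`, `doc_id`) are assumptions for illustration, not the actual SQL that the BPR Recommender job generates. Grouping and projecting on the natural key means a re-run of the aggregation replaces the same logical rows instead of minting new synthetic ids.

```scala
import org.apache.spark.sql.SparkSession

object NaturalKeyAggregation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("natural-key-agg")
      .master("local[*]") // local master for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical signals data; in practice this would come from the
    // signals collection or an external data source.
    val signals = Seq(
      ("ipad", "doc1", 3L),
      ("ipad", "doc1", 2L),
      ("ipad", "doc2", 1L)
    ).toDF("query", "doc_id", "count_i")
    signals.createOrReplaceTempView("signals")

    // The natural key (query, doc_id) is projected and grouped on directly,
    // so repeated runs produce the same logical rows.
    val aggregated = spark.sql(
      """SELECT query, doc_id, SUM(count_i) AS aggr_count_i
        |FROM signals
        |GROUP BY query, doc_id""".stripMargin)

    aggregated.show()
    spark.stop()
  }
}
```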
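A partitioning sketch follows, assuming hypothetical config keys (`spark.myjob.numPartitions`, `spark.myjob.partitionCol`) rather than Fusion's actual job schema: the job reads its partitioning options from configuration and repartitions the data accordingly before writing.

```scala
import org.apache.spark.sql.SparkSession

object PartitioningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitioning-sketch")
      .master("local[*]") // local master for illustration only
      .getOrCreate()

    // Hypothetical config options with defaults; the real option names
    // would come from the job's configuration schema.
    val numPartitions = spark.conf.get("spark.myjob.numPartitions", "8").toInt
    val partitionCol  = spark.conf.get("spark.myjob.partitionCol", "doc_id")

    val df = spark.read.parquet("/path/to/input") // placeholder input path

    // Repartition by the configured column so downstream work is spread
    // across the configured number of tasks.
    val partitioned = df.repartition(numPartitions, df(partitionCol))
    partitioned.write.mode("overwrite").parquet("/path/to/output")

    spark.stop()
  }
}
```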
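Finally, an external source sketch: the JDBC URL, table, and credentials below are placeholders, but the pattern shows how any Spark-supported external source can feed the same NLP, clustering, or recommender pipeline that a Fusion-sourced DataFrame would.

```scala
import org.apache.spark.sql.SparkSession

object ExternalSourceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("external-source-sketch")
      .master("local[*]") // local master for illustration only
      .getOrCreate()

    // Read training data from an external JDBC source instead of a Fusion
    // collection; URL, table, and credentials are placeholders.
    val external = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/analytics")
      .option("dbtable", "signals")
      .option("user", "reader")
      .option("password", "secret")
      .load()

    // The resulting DataFrame plugs into downstream jobs unchanged.
    external.printSchema()
    spark.stop()
  }
}
```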