Spark Jobs

Apache Spark can power a wide variety of data analysis jobs. In Fusion, Spark jobs are especially useful for generating recommendations.

Spark job subtypes

For the Spark job type, the available subtypes are listed below.

Aggregation

Define an aggregation job to be executed by Fusion Spark.

ALS Recommender

Train a collaborative filtering matrix decomposition recommender using SparkML’s Alternating Least Squares (ALS) to batch-compute user recommendations and item similarities. (A sample configuration sketch follows this list.)

Bisecting KMeans Clustering Job

Train a bisecting KMeans clustering model.

Cluster Labeling

Attach keyword labels to documents that have already been assigned to groups. See Doc Clustering below.

Collection Analysis

Produce statistics about the types of documents in a collection and their lengths.

Co-occurrence Similarity

Compute a mutual-information item similarity model.

Doc Clustering

Preprocess documents, separate out extreme-length documents and other outliers, automatically select the number of clusters, and extract keyword labels for clusters. You can choose between Bisecting KMeans and KMeans clustering methods, and between TFIDF and word2vec vectorization methods.

Item Similarity Recommender

Compute user recommendations based on a pre-computed item similarity model.

Levenshtein

Compare the items in a collection and produce possible spelling mistakes based on Levenshtein edit distance.

Logistic Regression Classifier Training Job

Train a regularized logistic regression model for text classification.

Matrix Decomposition-Based Query-Query Similarity Job

Train a collaborative filtering matrix decomposition recommender using SparkML’s Alternating Least Squares (ALS) to batch-compute query-query similarities.

Outlier Detection

Find groups of outliers across the entire set of documents in the collection.

Random Forest Classifier Training

Train a random forest classifier for text classification.

Script

Run a custom Scala script as a Fusion Job.

Statistically Interesting Phrases (SIP)

Output statistically interesting phrases in a collection, that is, phrases that occur more or less frequently than expected.
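
For concreteness, here is a minimal sketch of what an ALS Recommender job configuration might look like when expressed as JSON. The field names and values shown are illustrative assumptions, not the authoritative parameter list; the real parameter names for each subtype come from the /spark/schema endpoint described below.

# Hypothetical example -- field names below are assumptions; consult the
# /spark/schema output for the actual parameters of the ALS Recommender subtype.
cat > my-als-job.json <<'EOF'
{
  "id": "my-als-recommendations",
  "type": "als_recommender",
  "trainingCollection": "my-signals",
  "userIdField": "user_id",
  "itemIdField": "doc_id",
  "weightField": "count",
  "outputUserRecsCollection": "my-recommendations",
  "outputItemSimCollection": "my-item-similarities"
}
EOF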

Spark job configuration

Spark jobs can be created and modified using the Fusion UI or the Spark Jobs API. They can be scheduled using the Fusion UI or the Jobs API.
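
As a sketch, creating and starting a job from the command line might look like the following. The endpoint paths and the job id are assumptions based on typical Fusion API conventions; verify them against the API reference for your Fusion version.

# Endpoint paths are assumptions -- confirm against your Fusion API reference.
# Create the job configuration from the JSON file sketched above:
curl -u user:pass -X POST -H 'Content-Type: application/json' \
  http://localhost:8764/api/apollo/spark/configurations -d @my-als-job.json

# Start the job by its id:
curl -u user:pass -X POST http://localhost:8764/api/apollo/spark/jobs/my-als-recommendations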

To see the complete list of configuration parameters for all Spark job subtypes, use the /spark/schema endpoint:

curl -u user:pass http://localhost:8764/api/apollo/spark/schema
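
The schema output is a single large JSON document. If jq is available, you can pretty-print it to make it easier to scan:

curl -u user:pass http://localhost:8764/api/apollo/spark/schema | jq .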