Spark Jobs

Apache Spark can power a wide variety of data analysis jobs. In Fusion, Spark jobs are especially useful for generating recommendations.

Spark job subtypes

For the Spark job type, the available subtypes are listed below.

ALS Recommender

Train a collaborative filtering matrix decomposition recommender using SparkML’s Alternating Least Squares (ALS) to batch-compute user recommendations and item similarities.
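
This job is built on SparkML's ALS estimator. As a rough, self-contained sketch of that underlying API (not the Fusion job configuration itself; the signal data and column names below are hypothetical), aggregated interaction signals can be factored and used to batch-compute top-N recommendations:

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("als-sketch").getOrCreate()
import spark.implicits._

// Hypothetical aggregated click signals: (userId, itemId, weight); ALS requires numeric IDs.
val signals = Seq((1, 101, 3.0), (1, 102, 1.0), (2, 101, 5.0)).toDF("userId", "itemId", "rating")

val als = new ALS()
  .setUserCol("userId").setItemCol("itemId").setRatingCol("rating")
  .setImplicitPrefs(true)  // treat click counts as implicit feedback
  .setRank(16).setMaxIter(10)

val model = als.fit(signals)
val userRecs = model.recommendForAllUsers(10)  // top-10 items per user (Spark 2.2+)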

Aggregation

Define an aggregation job to be executed by Fusion Spark.

Co-occurrence Similarity

Compute a mutual-information item similarity model.

Random Forest Classifier Training

Train a random forest classifier for text classification.

Script

Run a custom Scala script as a Fusion Job.
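
A script job body is ordinary Spark Scala code. Below is a minimal sketch, assuming the script has a SparkContext already bound to a variable named sc (an assumption about Fusion's script bindings, which may differ by version):

// Minimal custom script: word counts over an in-memory sequence, logged to stdout.
val counts = sc.parallelize(Seq("spark", "fusion", "spark"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect()

counts.foreach { case (word, n) => println(s"$word -> $n") }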

Matrix Decomposition-Based Query-Query Similarity Job

Train a collaborative filtering matrix decomposition recommender using SparkML’s Alternating Least Squares (ALS) to batch-compute query-query similarities.

Bisecting KMeans Clustering Job

Train a bisecting KMeans clustering model.

Logistic Regression Classifier Training Job

Train a regularized logistic regression model for text classification.
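
As an illustration of the kind of SparkML pipeline such a job trains (a sketch only; the sample data and column names are hypothetical, and Fusion's actual feature extraction may differ):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("logreg-sketch").getOrCreate()
import spark.implicits._

// Hypothetical labeled training data: (text, label)
val training = Seq(
  ("spark jobs are fast", 1.0),
  ("this laptop will not boot", 0.0)
).toDF("text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setRegParam(0.01)  // regularized (L2 by default)

val model = new Pipeline().setStages(Array(tokenizer, tf, lr)).fit(training)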

Item Similarity Recommender

Compute user recommendations based on a pre-computed item similarity model.

Levenshtein

Compare the items in a collection and produce possible spelling mistakes based on the Levenshtein edit distance.

Collection Analysis

Produce statistics about the types of documents in a collection and their lengths.

Statistically Interesting Phrases (SIP)

Output statistically interesting phrases in a collection, that is, phrases that occur more frequently or less frequently than expected.

Doc Clustering

An end-to-end document clustering job that preprocesses documents, separates out extreme-length documents and other outliers, automatically selects the number of clusters, and extracts keyword labels for clusters. You can choose between Bisecting KMeans and KMeans clustering methods, and between TFIDF and word2vec vectorization methods.
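
As a rough sketch of the vectorization and clustering core of such a job, here is TF-IDF feature extraction followed by Bisecting KMeans (Fusion's preprocessing, outlier handling, automatic selection of the cluster count, and keyword labeling are not shown; the data and column names are hypothetical):

import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("clustering-sketch").getOrCreate()
import spark.implicits._

val docs = Seq("spark cluster tuning", "recommendations with als", "als factor models").toDF("text")

val words = new Tokenizer().setInputCol("text").setOutputCol("words").transform(docs)
val tf = new HashingTF().setInputCol("words").setOutputCol("tf").transform(words)
val tfidf = new IDF().setInputCol("tf").setOutputCol("features").fit(tf).transform(tf)

// Bisecting KMeans with a fixed k; the Fusion job selects the number of clusters automatically.
val model = new BisectingKMeans().setK(2).setFeaturesCol("features").fit(tfidf)
val clustered = model.transform(tfidf)  // adds a "prediction" column with cluster ids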

Outlier Detection

Find groups of outliers for the entire set of documents in the collection.

Cluster Labeling

Attach keyword labels to documents that have already been assigned to groups.

Spark job configuration

Spark jobs can be created and modified using the Fusion UI or the Spark Jobs API. They can be scheduled using the Fusion UI or the Jobs API.

To see the complete list of configuration parameters for all Spark job subtypes, use the /spark/schema endpoint:

curl -u user:pass http://localhost:8764/api/apollo/spark/schema