Spark Jobs
Apache Spark can power a wide variety of data analysis jobs. In Fusion, Spark jobs are especially useful for generating recommendations.
Spark job subtypes
For the Spark job type, the available subtypes are listed below.
Job subtype | Description
---|---
Aggregation | Define an aggregation job to be executed by Fusion Spark.
ALS Recommender | Train a collaborative filtering matrix decomposition recommender using SparkML's Alternating Least Squares (ALS) to batch-compute user recommendations and item similarities.
Bisecting KMeans Clustering Job | Train a bisecting KMeans clustering model.
Cluster Labeling | Attach keyword labels to documents that have already been assigned to groups. See Doc Clustering below.
Collection Analysis | Produce statistics about the types of documents in a collection and their lengths.
Co-occurrence Similarity | Compute a mutual-information item similarity model.
Doc Clustering | Preprocess documents, separate out extreme-length documents and other outliers, automatically select the number of clusters, and extract keyword labels for clusters. You can choose between Bisecting KMeans and KMeans clustering methods, and between TFIDF and word2vec vectorization methods.
Item Similarity Recommender | Compute user recommendations based on a pre-computed item similarity model.
Levenshtein | Compare the items in a collection and produce possible spelling mistakes based on the Levenshtein edit distance.
Logistic Regression Classifier Training Job | Train a regularized logistic regression model for text classification.
Matrix Decomposition-Based Query-Query Similarity Job | Train a collaborative filtering matrix decomposition recommender using SparkML's Alternating Least Squares (ALS) to batch-compute query-query similarities.
Outlier Detection | Find groups of outliers in the entire set of documents in the collection.
Random Forest Classifier Training | Train a random forest classifier for text classification.
Script | Run a custom Scala script as a Fusion job.
Statistically Interesting Phrases (SIP) | Output statistically interesting phrases in a collection, that is, phrases that occur more or less frequently than expected.
Spark job configuration
Spark jobs can be created and modified using the Fusion UI or the Spark Jobs API. They can be scheduled using the Fusion UI or the Jobs API.
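As an illustration of creating a job programmatically, the sketch below builds a job definition as JSON. The field names (`id`, `type`, `inputCollection`) and the subtype identifier are hypothetical: the exact request shape varies by Fusion version, so consult the schema endpoint described below for the authoritative parameter list.

```python
import json

def make_aggregation_job(job_id: str, input_collection: str) -> str:
    """Build a JSON job definition for the Spark Jobs API.

    The field names used here are illustrative assumptions, not the
    documented schema; check /spark/schema for the real parameters.
    """
    job = {
        "id": job_id,                        # unique job identifier
        "type": "aggregation",               # hypothetical subtype id
        "inputCollection": input_collection, # collection to aggregate
    }
    return json.dumps(job, indent=2)

print(make_aggregation_job("daily-agg", "signals"))
```

The resulting JSON string would be sent as the request body of a POST to the Spark Jobs API endpoint for your Fusion version.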
To see the complete list of configuration parameters for all Spark job subtypes, use the `/spark/schema` endpoint:

```
curl -u user:pass http://localhost:8764/api/apollo/spark/schema
```
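The schema response can be inspected programmatically. The sketch below assumes, hypothetically, that the response is a JSON object whose top-level keys are the job subtype identifiers; the actual structure may differ by Fusion version, so adapt the key lookup accordingly.

```python
import json

def list_job_subtypes(schema_json: dict) -> list:
    """Return the subtype identifiers in a /spark/schema response.

    Assumes the response is a JSON object keyed by subtype id --
    an illustrative assumption, not the documented format.
    """
    return sorted(schema_json.keys())

# Hypothetical excerpt of a schema response, for illustration only:
sample = json.loads('{"aggregation": {}, "als_recommender": {}, "script": {}}')
print(list_job_subtypes(sample))  # → ['aggregation', 'als_recommender', 'script']
```

In practice you would feed this function the parsed body of the curl request above rather than an inline sample.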