Fusion 5.11

    Spark Jobs

    Apache Spark can power a wide variety of data analysis jobs. In Fusion, Spark jobs are especially useful for generating recommendations.

    Spark job subtypes

    For the Spark job type, the available subtypes are listed below.

    • SQL Aggregation job

      A Spark SQL aggregation job where user-defined parameters are injected into a built-in SQL template at runtime.

    • Custom Python job

      The Custom Python job lets users run Python code via Fusion. This job supports Python 3.6+ code.

    • Script

      Run a custom Scala script as a Fusion job.

    See Additional Spark jobs for more information.

    Spark job configuration

    Spark jobs can be created, modified, and scheduled using the Fusion UI or the Spark Jobs API.

    For the complete list of configuration parameters for all Spark job subtypes, see the Jobs Configuration Reference.

    Machine learning jobs

    Fusion provides these job types to perform machine learning tasks.

    Signals analysis

    These jobs analyze a collection of signals in order to perform query rewriting, signals aggregation, or experiment analysis.

    • Ground Truth

      Estimate ground truth queries using click signals and query signals, with document relevance per query determined using a click/skip formula.

    Query rewriting

    These jobs produce data that can be used for query rewriting or to inform updates to the synonyms.txt file.

    • Head/Tail Analysis

      Perform head/tail analysis of queries from collections of raw or aggregated signals to identify underperforming queries and the reasons they underperform. This information is valuable for improving Solr configurations, auto-suggest, product catalogs, and SEO/SEM strategies, and ultimately conversion rates.

    • Phrase Extraction

      Identify multi-word phrases in signals.

    • Synonym Detection Jobs

      Use this job to generate pairs of synonyms and pairs of similar queries. Two words are considered potential synonyms when they are used in a similar context in similar queries.

    • Token and Phrase Spell Correction

      Detect misspellings in queries or documents using the numbers of occurrences of words and phrases.
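
    The count-based idea behind spell correction can be sketched in plain Python. This is a minimal illustration, not Fusion's implementation: the function names, the one-edit distance limit, and the `min_ratio` threshold are all assumptions made for the example. A rare token is flagged as a likely misspelling when a much more frequent token lies within one edit of it.

```python
from collections import Counter


def within_one_edit(a, b):
    """True if a and b differ by exactly one insert, delete, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = j = edits = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
        else:
            edits += 1
            if edits > 1:
                return False
            if len(a) == len(b):
                i += 1  # substitution: advance both strings
            j += 1      # insertion: advance only the longer string
    return True


def suggest_corrections(tokens, min_ratio=10):
    """Map rare tokens to a far more frequent token one edit away.

    min_ratio is a hypothetical threshold: the candidate correction must
    occur at least min_ratio times as often as the suspected misspelling.
    """
    counts = Counter(tokens)
    suggestions = {}
    for word, count in counts.items():
        candidates = [
            (other_count, other)
            for other, other_count in counts.items()
            if other_count >= min_ratio * count and within_one_edit(word, other)
        ]
        if candidates:
            suggestions[word] = max(candidates)[1]  # most frequent candidate
    return suggestions


# Made-up query tokens for illustration only.
tokens = ["iphone"] * 50 + ["iphne"] * 2 + ["case"] * 30
corrections = suggest_corrections(tokens)
print(corrections)
```

    In this made-up sample, only "iphne" is both rare and one edit away from a much more frequent token, so it is the only token that receives a suggestion.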

    Signals aggregation

    • SQL Aggregation

      A Spark SQL aggregation job where user-defined parameters are injected into a built-in SQL template at runtime.
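
    To illustrate what injecting user-defined parameters into a SQL template means, here is a minimal Python sketch. The template text, the parameter names (`source`, `signal_type`, `group_by`), and the `render_aggregation_sql` helper are hypothetical; Fusion's built-in template and its parameter names are not reproduced here.

```python
from string import Template

# Hypothetical template, for illustration only. It is NOT Fusion's built-in SQL.
SQL_TEMPLATE = Template(
    "SELECT $group_by, COUNT(*) AS aggr_count "
    "FROM $source "
    "WHERE type = '$signal_type' "
    "GROUP BY $group_by"
)


def render_aggregation_sql(params):
    """Substitute user-defined parameters into the template at run time."""
    # substitute() raises KeyError if any placeholder has no value,
    # so an incomplete configuration fails early instead of producing bad SQL.
    return SQL_TEMPLATE.substitute(params)


sql = render_aggregation_sql({
    "source": "signals",
    "signal_type": "click",
    "group_by": "query, doc_id",
})
print(sql)
```

    Plain string substitution like this is only suitable for trusted configuration values; it does not protect against SQL injection.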

    Experiment analysis

    • Ranking Metrics

      Calculate relevance metrics (such as nDCG) by replaying ground truth queries against catalog data using variants from an experiment.
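
    As a reference point for the nDCG metric mentioned above, the following sketch computes it for a single ranked result list. It assumes the linear-gain form of DCG (some definitions use 2^rel - 1 as the gain instead) and is not taken from the Ranking Metrics job itself; the function names are illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal_dcg = dcg(sorted(relevances, reverse=True))
    if ideal_dcg == 0:
        return 0.0  # no relevant documents at all
    return dcg(relevances) / ideal_dcg


# Relevance grades of the returned documents, in ranked order.
print(ndcg([3, 2, 1]))  # already in ideal order
print(ndcg([0, 1]))     # the only relevant document is ranked second
```

    A ranking that already lists documents in descending relevance scores 1.0; pushing relevant documents further down the list lowers the score.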

    Collaborative recommenders

    These jobs analyze signals and generate matrices used to provide collaborative recommendations.

    • BPR Recommender

      Use this job when you want to compute user recommendations or item similarities using a Bayesian Personalized Ranking (BPR) recommender algorithm.

    • Query-to-Query Session-Based Similarity jobs

      This recommender is based on the co-occurrence of queries in the context of clicked documents and sessions. It is useful when your data shows that users tend to search for similar items in a single search session. This method of generating query-to-query recommendations is faster and more reliable than the Query-to-Query Similarity recommender job. Unlike the similar queries previously generated as part of the Synonym Detection job, it is session-based.
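
    The session-based co-occurrence idea can be sketched as follows. This simplified example counts only how often two queries appear in the same session and ignores clicked documents; the `(session_id, query)` input shape and the function name are assumptions for illustration, not the job's actual schema or algorithm.

```python
from collections import Counter, defaultdict
from itertools import combinations


def query_cooccurrence(signals):
    """Count how many sessions each unordered pair of queries shares.

    signals: iterable of (session_id, query) pairs (a hypothetical shape).
    """
    sessions = defaultdict(set)
    for session_id, query in signals:
        sessions[session_id].add(query)

    counts = Counter()
    for queries in sessions.values():
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(queries), 2):
            counts[(a, b)] += 1
    return counts


# Made-up signals for illustration only.
signals = [
    ("s1", "laptop"), ("s1", "notebook"), ("s1", "laptop bag"),
    ("s2", "laptop"), ("s2", "notebook"),
    ("s3", "phone"),
]
pair_counts = query_cooccurrence(signals)
for pair, count in pair_counts.most_common():
    print(pair, count)
```

    Pairs with higher counts are stronger candidates for query-to-query recommendations; a production job would typically also normalize by query frequency.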

    Content-based recommenders

    Content-based recommenders create matrices of similar items based on their content.

    • Content-Based Recommender

      Use this job when you want to compute item similarities based on their content, such as product descriptions.
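
    One common way to compute item similarities from content is cosine similarity over term-count vectors. The sketch below shows that technique in plain Python; it is an assumption chosen for illustration and does not describe the algorithm the Content-Based Recommender job actually uses. The item IDs and descriptions are made up.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts from whitespace-separated, lowercased text."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(items, item_id):
    """Rank all other items by content similarity to item_id."""
    vectors = {i: term_vector(desc) for i, desc in items.items()}
    target = vectors[item_id]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != item_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical product descriptions.
items = {
    "p1": "red cotton t-shirt",
    "p2": "blue cotton t-shirt",
    "p3": "stainless steel kettle",
}
ranked = most_similar(items, "p1")
print(ranked)
```

    Here "p2" ranks above "p3" for "p1" because the two shirt descriptions share most of their terms, while the kettle shares none.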

    Content analysis

    • Cluster Labeling

      Use this job when you already have clusters or well-defined document categories, and you want to discover and attach keywords to see representative words within those existing clusters. (If you want to create new clusters, use the Document Clustering job.)

    • Document Clustering

      The Document Clustering job uses an unsupervised machine learning algorithm to group documents into clusters based on similarities in their content. You can enable more efficient document exploration by using these clusters as facets, high-level summaries or themes, or to recommend other documents from the same cluster. The job can automatically group similar documents in all kinds of content, such as clinical trials, legal documents, book reviews, blogs, scientific papers, and products.

    • Classification job

      This job analyzes how your existing documents are categorized and produces a classification model that can be used to predict the categories of new documents at index time.

    • Outlier Detection

      Use this job when you want to find outliers in a set of documents and attach labels to each outlier group.

    Data ingest

    • Parallel Bulk Loader

      The Parallel Bulk Loader (PBL) job enables bulk ingestion of structured and semi-structured data from big data systems, NoSQL databases, and common file formats like Parquet and Avro.