Machine learning with Spark
Apache Spark is an open-source cluster-computing framework that serves as a fast, general execution engine for large-scale data processing. Jobs are decomposed into stepwise tasks, which are distributed across a cluster of networked computers.
Spark improves on previous MapReduce implementations by using resilient distributed datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
Fusion manages a Spark cluster that is used for all signal aggregation processes.
See Machine Learning Jobs for details about each pre-defined machine learning job in Fusion.
Spark in Fusion on Kubernetes
Spark in Fusion On-Prem
These topics provide information about Spark administration in Fusion Server on premises:
Spark Components – Spark integration in Fusion, including a diagram
Spark Getting Started – Starting Spark processes and working with the shell and the Spark UI
Spark Driver Processes – Fusion jobs that run on Spark use a driver process started by the API service
Spark Configuration – How to configure Spark for maximum performance. The article also provides information about ports, directories, and configuring connections for an SSL-enabled Solr cluster.
Scaling Spark Aggregations – How to configure Spark so that aggregations scale
Spark Troubleshooting – How to troubleshoot Spark
The Data Science Toolkit Integration (DSTI)
Beginning with Fusion 5.0, data scientists and machine learning engineers can deploy their own trained Python machine learning models to Fusion using the Data Science Toolkit Integration (DSTI). This enables real-time prediction and seamless integration with query and index pipelines.
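To make the workflow concrete, here is a hypothetical sketch of the kind of artifact such an integration deploys: a Python model trained offline, serialized, and exposed through a prediction entry point that a pipeline stage can call at query time. The `predict` function name and its dict-in/dict-out shape are illustrative assumptions, not the actual DSTI contract; consult the DSTI documentation for the required packaging and signatures.

```python
import pickle

class KeywordScorer:
    """Toy 'model': scores text by the fraction of known keywords it contains."""
    def __init__(self, keywords):
        self.keywords = set(keywords)

    def score(self, text):
        tokens = text.lower().split()
        if not tokens:
            return 0.0
        hits = sum(1 for t in tokens if t in self.keywords)
        return hits / len(tokens)

# "Train" and serialize the model, as a data scientist would before deployment.
model = KeywordScorer(["spark", "fusion", "search"])
blob = pickle.dumps(model)

# Entry point a serving layer could invoke per request (assumed signature).
def predict(payload):
    m = pickle.loads(blob)
    return {"score": m.score(payload.get("text", ""))}

result = predict({"text": "Fusion search with Spark"})
print(result["score"])  # 0.75
```

In a real deployment the serialized model would be loaded once at startup rather than per request; the point here is only the train-offline, predict-at-query-time split that the integration supports.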