Data Science Toolkit Integration

Beginning with Fusion 5.0, data scientists and machine learning engineers can deploy end-user-trained Python machine learning models to Fusion using the Data Science Toolkit Integration (DSTI). The integration offers real-time prediction and plugs directly into query and index pipelines.

The DSTI provides:

  • Extension points for data scientists to plug in customized Python modeling code

  • Client libraries to ease the development and testing of Python plugins

  • API-driven, dynamic loading and updating of plugins at runtime

Example use cases:

  • Using spaCy to extract named entities and index the results into a Solr collection

  • Using a Keras model to perform query intent classification at query time

  • Using pre-trained word embeddings to generate synonyms for a query
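
As a rough illustration of the last use case, the sketch below looks up synonym candidates for a query term by cosine similarity between word vectors. It is not Fusion or DSTI code: the `EMBEDDINGS` table, its toy three-dimensional vectors, and the `synonyms` helper are all hypothetical stand-ins for a real pre-trained embedding model.

```python
import math

# Toy vectors standing in for real pre-trained embeddings (hypothetical values).
EMBEDDINGS = {
    "laptop": [0.9, 0.1, 0.0],
    "notebook": [0.85, 0.15, 0.05],
    "computer": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.95],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def synonyms(term, k=2, min_score=0.8):
    """Return up to k vocabulary words whose similarity to term is at least min_score."""
    if term not in EMBEDDINGS:
        return []
    query = EMBEDDINGS[term]
    scored = [
        (other, cosine(query, vector))
        for other, vector in EMBEDDINGS.items()
        if other != term
    ]
    # Most similar words first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, score in scored[:k] if score >= min_score]


print(synonyms("laptop"))
```

A similarity threshold such as `min_score` keeps unrelated words out of the expanded query; the value used here is arbitrary and would need tuning against real embeddings.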

DSTI components

  • Jupyter Notebook service: A fully integrated Jupyter notebook in Fusion that lets data scientists explore data, test SQL aggregations, run Fusion SQL statements, and import or export data to and from other storage mechanisms using Spark in their choice of language, Scala or Python. Jupyter notebooks are still supported in Fusion 5.0.x.

  • Machine Learning service: Supports model serving in index pipelines and query pipelines.

  • Develop and Deploy a Machine Learning Model

DSTI Deprecations

In Fusion 5.3 and later, the ability to deploy Python-based models using the DSTI ml-python image has been removed. All Python-based models should be migrated to Seldon Core (see here for a tutorial on wrapping a Python-based model to work with Seldon Core).
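
For orientation, a Python model wrapped for Seldon Core is typically a plain class that exposes a `predict(X, features_names)` method. The sketch below assumes that convention and is not taken from the Fusion tutorial: the `IntentClassifier` name, its keyword table, and the intent labels are hypothetical, and the keyword rule merely stands in for a real trained model.

```python
class IntentClassifier:
    """Hypothetical model class; the keyword rule stands in for a trained model."""

    def __init__(self):
        # A real deployment would load model weights or other artifacts here.
        self.keywords = {
            "buy": "purchase",
            "price": "purchase",
            "return": "support",
        }

    def predict(self, X, features_names=None):
        # X is assumed to be an iterable of rows, each holding one query string.
        results = []
        for row in X:
            tokens = str(row[0]).lower().split()
            label = next(
                (intent for word, intent in self.keywords.items() if word in tokens),
                "other",  # fallback when no keyword matches
            )
            results.append([label])
        return results


model = IntentClassifier()
print(model.predict([["Buy red shoes"], ["hello there"]]))
```

Keeping the class free of serving-framework imports means the same `predict` logic can be unit-tested locally before it is packaged into a container image.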

SparkML and MLeap models are still supported via the DSTI, but the integration is deprecated as of Fusion 5.1, and these models should also be migrated to Seldon Core. DSTI support for these models will be removed in an upcoming version of Fusion.

Users who relied on the spaCy model supplied with Fusion 5.0-5.2 must instead use the 'Create Seldon Core Model Deployment' job within Fusion to deploy a Seldon Core-enabled version of the model. That version can be used as a drop-in replacement for the old model in the NLP Annotator stage. A sample configuration for deploying the replacement model looks like this:

   "columnNames":"[token_offsets, pos_labels, lemma_labels, ner_offsets, ner_labels, sentence_offsets]",