Usage
Python code can be entered directly in the job configuration editor, or you can reference a script that has been uploaded to the blob store. Additional Python libraries or files can be supplied via the Python files configuration.

Examples
Below is an example Python script that indexes data from parquet to Solr via a Managed Fusion index pipeline. The script is entered in the `script` variable of the job config, and several arguments are passed via the `submitArgs` configuration key.
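The following is a minimal sketch of such a script, not the product's reference example: it reads a parquet file with PySpark and posts rows to a Fusion index-pipeline REST endpoint. The endpoint path, the availability of the `requests` library, and the argument order are assumptions here.

```python
import sys

import requests  # assumption: requests is available in the image
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Positional arguments arrive via the submitArgs configuration key.
    parquet_path, fusion_host, pipeline, collection = sys.argv[1:5]

    spark = SparkSession.builder.appName("parquet_to_solr").getOrCreate()
    df = spark.read.parquet(parquet_path)

    # Hypothetical index-pipeline endpoint; adjust host and auth for
    # your cluster.
    endpoint = (f"{fusion_host}/api/index-pipelines/{pipeline}"
                f"/collections/{collection}/index")

    def index_partition(rows):
        docs = [row.asDict() for row in rows]
        if docs:
            requests.post(endpoint, json=docs)

    df.foreachPartition(index_partition)
    spark.stop()
```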
Configuration
Apache Arrow is installed in the image and the two settings below are enabled by default. If you want to disable Arrow optimization, set these properties to false in the job config or in the job-launcher config map.
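The sketch below uses the standard Spark property names for Arrow optimization; treat the exact names as an assumption to verify against the Spark version in your image.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed names: standard Spark 2.x Arrow settings. Spark 3.x uses
# spark.sql.execution.arrow.pyspark.enabled and
# spark.sql.execution.arrow.pyspark.fallback.enabled instead.
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "false")
```

Available libraries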
These libraries are available in the Managed Fusion Spark image:

- numpy
- scipy
- matplotlib
- pandas
- scikit-learn
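As a quick illustration, a job script can move data between Spark and these libraries directly; `toPandas()` is where the Arrow optimization above pays off. All names here are illustrative:

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("libs_demo").getOrCreate()
df = spark.createDataFrame(
    [(i, float(i) ** 0.5) for i in range(100)], ["id", "value"]
)

# Arrow (when enabled) accelerates this Spark-to-pandas conversion.
pdf = df.toPandas()
print(np.percentile(pdf["value"], 95))
spark.stop()
```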
Adding libraries
If you need to add extra libraries to run your code, you can upload the Python egg files to the blob store and reference their blob IDs in the job configuration. However, machine learning libraries (like `tensorflow`, `keras`, and `pytorch`) are not easy to install with that approach. To install those libraries, follow these steps instead:
- Use the example Dockerfile sketched after this list to extend from the base image.
- Build the Docker image and publish it to your own Docker registry.
- Once the image is built, specify the custom image in the Spark settings via `spark.kubernetes.driver.container.image` and `spark.kubernetes.executor.container.image`.
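The referenced Dockerfile is sketched below under stated assumptions: the base image name and tag are placeholders for the Spark image that ships with your Managed Fusion release.

```dockerfile
# Placeholder base image; substitute the Spark image name and tag from
# your Managed Fusion release.
FROM lucidworks/fusion-spark:<base-tag>

# Install the ML libraries that are impractical to supply as blob-store eggs.
RUN pip install tensorflow keras torch
```

After the image is published, the corresponding Spark settings might look like this (registry, name, and tag are placeholders):

```
spark.kubernetes.driver.container.image=registry.example.com/fusion-spark-ml:1.0
spark.kubernetes.executor.container.image=registry.example.com/fusion-spark-ml:1.0
```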
Important: If you upload `.zip` files to add libraries, use the `Other` blob type for binary files instead of the `File` blob type. If the `File` blob type is used, the custom Python job fails.