spark-solr repository.
The following diagram depicts this process:
The spark-mllib.json file contains metadata about the model implementation, in particular how the model derives feature vectors from a document or query.
The JSON object has the following attributes:
id. A string label that is used as a unique ID for the Fusion blobstore, for example, tweets_sentiment_svm.
modelClassName. The name of the spark-mllib class or the custom Java class that implements the com.lucidworks.spark.ml.MLModel interface.
featureFields. A list of one or more field names.
vectorizer. Specifies the processing required to derive a vector of features from the contents of the document fields listed in the featureFields entry.
Example spark-mllib.json file for the model with id tweets_sentiment_svm:
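As a rough illustration of how the four attributes fit together, the Python sketch below builds a metadata object of this shape, checks that each attribute is present with a plausible type, and prints it as JSON. Only the attribute names and the id value come from the text above. The model class name, the field name tweet_txt, and the options inside each vectorizer step (including numFeatures) are assumptions made for illustration, and check_metadata is a hypothetical helper, not part of Fusion or spark-solr.

```python
import json

# Hypothetical metadata. Only the four top-level attribute names and the id
# value come from the documentation; every other value is an assumption.
metadata = {
    "id": "tweets_sentiment_svm",
    # A real MLlib class, assumed here because the id mentions an SVM.
    "modelClassName": "org.apache.spark.mllib.classification.SVMModel",
    "featureFields": ["tweet_txt"],  # hypothetical field name
    "vectorizer": [
        {"lucene-analyzer": {}},  # analyzer options left empty; schema not shown here
        {"hashingTF": {"numFeatures": 1000}},  # option name and value are assumptions
    ],
}

# Expected Python type for each attribute described in the text.
REQUIRED = {
    "id": str,
    "modelClassName": str,
    "featureFields": list,
    "vectorizer": list,
}


def check_metadata(meta):
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    for name, expected_type in REQUIRED.items():
        if name not in meta:
            problems.append(f"missing attribute: {name}")
        elif not isinstance(meta[name], expected_type):
            problems.append(f"{name} should be a {expected_type.__name__}")
    # The text says featureFields holds one or more field names.
    if isinstance(meta.get("featureFields"), list) and not meta["featureFields"]:
        problems.append("featureFields must list at least one field name")
    return problems


if __name__ == "__main__":
    print(json.dumps(metadata, indent=2))
    print(check_metadata(metadata))
```

A check like this only confirms the overall shape. Whether Fusion accepts a given file depends on the actual schema of each vectorizer step, which this sketch does not model.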
In this model, the vectorizer consists of two steps: a lucene-analyzer step followed by a hashingTF step. The lucene-analyzer step can use any Lucene analyzer to perform text analysis.
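To make the two-step pipeline concrete, here is a minimal Python sketch of the idea: analyze the text of each feature field into tokens, then hash the tokens into a fixed-length term-frequency vector. It is a stand-in, not the actual implementation. The analyze function is a simple lowercase-and-split tokenizer rather than a Lucene analyzer, the hash is CRC32 rather than whatever MLlib uses internally, and the function names, the tweet_txt field, and the vector size of 16 are all assumptions.

```python
import re
import zlib
from collections import Counter


def analyze(text):
    """Stand-in for the lucene-analyzer step: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hashing_tf(tokens, num_features=16):
    """Stand-in for the hashingTF step: hash each token to a bucket and count it."""
    vector = [0.0] * num_features
    for token, count in Counter(tokens).items():
        # CRC32 is deterministic across runs, unlike Python's built-in str hash.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += count
    return vector


def vectorize(doc, feature_fields, num_features=16):
    """Derive one feature vector from the listed fields of a document."""
    tokens = []
    for field in feature_fields:
        tokens.extend(analyze(doc.get(field, "")))
    return hashing_tf(tokens, num_features)


if __name__ == "__main__":
    doc = {"tweet_txt": "Great great day!"}  # hypothetical document
    print(vectorize(doc, ["tweet_txt"]))
```

Because distinct tokens can land in the same bucket, a small vector size trades accuracy for memory; a real configuration would typically use a much larger number of features.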
Other available vectorizer operations include the MLlib normalizer, the standard scaler, and the ChiSq selector. For standard scaler usage, see the examples in the spark-solr repository.
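For readers unfamiliar with these operations, the following Python sketch shows the arithmetic behind two of them: L2 normalization of a single vector, and standard scaling of each feature by statistics fitted over a set of vectors. It illustrates the general techniques only and does not show how to configure them in spark-mllib.json. The function names, the use of the sample standard deviation, the with_mean flag, and the handling of zero-variance features are choices made for this sketch.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def fit_standard_scaler(rows):
    """Compute the per-feature mean and sample standard deviation (n - 1 divisor)."""
    n = len(rows)
    if n < 2:
        raise ValueError("need at least two rows to estimate a standard deviation")
    dims = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in rows) / (n - 1))
        for j in range(dims)
    ]
    return means, stds


def scale(vector, means, stds, with_mean=False):
    """Divide each feature by its standard deviation, optionally centering first."""
    scaled = []
    for x, mean, std in zip(vector, means, stds):
        centered = x - mean if with_mean else x
        # A zero-variance feature carries no information; this sketch maps it to 0.0.
        scaled.append(centered / std if std > 0 else 0.0)
    return scaled


if __name__ == "__main__":
    rows = [[1.0, 10.0], [3.0, 10.0]]  # hypothetical feature vectors
    means, stds = fit_standard_scaler(rows)
    print(scale(rows[1], means, stds, with_mean=True))
    print(l2_normalize([3.0, 4.0]))
```

Centering is left off by default here because subtracting the mean turns a sparse term-frequency vector, such as the output of a hashing step, into a dense one.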