Component | Version |
---|---|
Solr | fusion-solr 5.8.1 (based on Solr 9.1.1) |
ZooKeeper | 3.7.1 |
Spark | 3.2.2 |
Kubernetes | GKE, AKS, EKS 1.24. Rancher (RKE) and OpenShift 4 compatible with Kubernetes 1.24. OpenStack and customized Kubernetes installs are not supported. See Kubernetes support for end-of-support dates. |
Ingress Controllers | Nginx, Ambassador (Envoy), GKE Ingress Controller. Istio is not supported. |
Looking to upgrade? See Fusion 5 Upgrades for detailed instructions.
Rosette Entity Extractor (REX) and Rosette Base Linguistics (RBL), used by the Use Advanced Linguistics with Babel Street feature, are not compatible with the Solr 9 version included in this release of Fusion. If you rely on the Babel Street language module, do not upgrade until this compatibility issue is resolved.
Use Advanced Linguistics with Babel Street
The Fusion Advanced Linguistics Package embeds Babel Street’s (formerly Basistech) Rosette natural language processing tools for multilingual text analysis. To improve search recall, Rosette Base Linguistics (RBL) handles the unique linguistic phenomena of more than 30 Asian and European languages. Rosette Entity Extractor (REX) identifies named entities such as people, locations, and organizations, allowing you to quickly refine your search, remove noise, and increase search relevance.

Using Named Entities (REX)
REX extracts named entities in multiple languages, including English, Chinese (traditional and simplified), and German. In English, it extracts multiple entity types and subtypes, including the following entity types (along with their associated subtypes):

- PERSON
- LOCATION
- ORGANIZATION
- PRODUCT
- TITLE
- NATIONALITY
- RELIGION
Create Application
To begin, create a new application called “entities”.

Configuration
Edit Solr Configuration
We will begin by adding the Basis library elements to the `solrconfig.xml` file. We will also add a new update processor to perform the entity extraction.

- Navigate to System > Solr Config to edit the `solrconfig.xml` file.
- Fusion 5.8 and earlier: In the `<lib/>` directive section, add the library directives shown in the first sketch after this list. Fusion 5.9 and later already contain these lines. For Fusion 4.x users, the `dir` paths are the local REX installation path.
- In the `<updateRequestProcessorChain/>` section, add a new chain after the existing processor chains, as in the second sketch after this list. Note the reference to a field called `text_eng`. We will create this field through the Fusion UI in the next step.
- Save your changes to `solrconfig.xml`.
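Because the exact lines depend on your installation, the sketches below are illustrative only. The jar location and the processor factory class name are assumptions (check the documentation for your REX release); the `rootDirectory` and `fields` options and the chain name `rex` are the ones the rest of this tutorial relies on.

```xml
<!-- <lib/> directives: point dir at the jar directory of your REX installation (hypothetical path) -->
<lib dir="/path/to/bt_root/rex-je/lib" regex=".*\.jar" />
```

```xml
<!-- The chain name "rex" matches the update.chain parameter set later in the indexing pipeline -->
<updateRequestProcessorChain name="rex">
  <!-- The factory class name below is an assumption; use the class documented for your REX release -->
  <processor class="com.basistech.rosette.solr.EntityExtractionUpdateProcessorFactory">
    <!-- Root of the REX installation -->
    <str name="rootDirectory">/path/to/bt_root/rex-je</str>
    <!-- Field(s) to scan; extracted entities land in the text_eng_REX_* dynamic fields -->
    <str name="fields">text_eng</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```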
Define Fields
The data file we will use, eng_docs.csv, contains two fields:

- `title`: an article headline
- `article_text`: the text content of the article

To create new fields, navigate to Collections > Fields and click Add a Field. Create the following fields:
Field name | Field type | Other options |
---|---|---|
title | string | Use default options. |
text_eng | text_en | Use default options. |
text_eng_REX_* | string | Create this field as a dynamic field by clicking the Dynamic checkbox. Click the Multivalued checkbox. Leave other options as defaults. |

Be sure to save each field after creating it.
Indexing Data
Create Indexing Pipeline
- Navigate to Indexing > Indexing Pipelines.
- Click Add and create a new pipeline called `test-entities`.
- Select the Field Mapping stage.
- In the Field Translations section, add a new row with source `article_text` and target `text_eng`. Set the Operation to `move`.
- Select the Solr Indexer stage.
- In the Additional Update Request Parameters section, add a new row with parameter name `update.chain` and value `rex`.
- Save the new pipeline.
Create Datasource
In this step, we will upload and index our documents from the data file.

- Navigate to Indexing > Datasources.
- Click Add and select File Upload V2 from the dropdown menu.
- Enter `eng_docs` for the Datasource ID, or use a name you prefer.
- Select `test-entities` for the Pipeline ID.
- In the File Upload field, choose the sample file `eng_docs.csv` and click Upload File. The File ID field will be populated automatically. Leave all other values as their defaults.
- Save the new datasource. The form will refresh, adding a set of buttons at the top.
- Click Run and then Start. When the job is finished, you will see “Success” in the popup form.
Querying Data
- Navigate to Querying > Query Workbench. The default query is `*:*`, which should bring up three documents.
- For the document with title “SpaceX Successfully Launches its First Crewed Spaceflight”, select Show fields. You will see a number of entities listed under the `text_eng_REX_*` field names.
- Search on these multivalued fields. For example, set your query to `text_eng_REX_LOCATION:"New York"` to return the article that contains a mention of New York.
Customization (Advanced)
When setting up the Solr configuration, you specified the `rootDirectory` and `fields` options in your processor chain. REX provides a number of other configuration options you can set to control how entities are extracted. For example, if you are finding false positives, you can set parameters instructing REX to return only entities above a confidence threshold. The confidence threshold is a value between `0` and `1` and applies to entities extracted by the statistical model. We recommend starting with a low value, around `0.2`. In your `solrconfig.xml` file, add the options `calculateConfidence` and `confidenceThreshold` to your processor chain definition:
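A minimal sketch of the two options inside the processor definition; `true` as the value for `calculateConfidence` is an assumption, and `0.2` is the starting threshold suggested above:

```xml
<!-- Enable confidence scoring and drop statistical-model entities below the threshold -->
<str name="calculateConfidence">true</str>
<str name="confidenceThreshold">0.2</str>
```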
Save the changes, re-index your data, and perform the same query on `*:*`. Note that for the SpaceX article, “Falcon” is now correctly omitted from the list of LOCATIONs.
Gazetteers
A gazetteer is a UTF-8 text file in which the first line is the entity type, followed by the names of entities you wish to extract, one per line, in the language of your documents. Comments can be prefixed with the # symbol. Create a file spacecraft_gaz.txt.
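Its contents might look like the following sketch; the SPACECRAFT entity type matches the `text_eng_REX_SPACECRAFT` field used later in this tutorial, while the specific entries are illustrative:

```
SPACECRAFT
# entity names to extract, one per line
Falcon 9
Crew Dragon
Soyuz
```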
Regular expressions

REX uses the Tcl regex format. Create a file, zulu_time_regex.xml, containing a regular expression that will extract, as entity type ZULU_TIME, all spans that consist of a 4-digit military time unit followed by the time zone designator UTC or GMT. A sketch follows.
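A sketch of what the file might contain. The XML element names (a `<regexps>` root with typed `<regexp>` entries) and the exact pattern are assumptions; consult the REX documentation for the precise schema:

```xml
<!-- Hypothetical schema: check the REX docs for the exact element names -->
<regexps>
  <!-- 4-digit military time followed by UTC or GMT, e.g. "0930 UTC" -->
  <regexp type="ZULU_TIME">[0-2][0-9][0-5][0-9] ?(UTC|GMT)</regexp>
</regexps>
```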
Example

To instruct REX to use the gazetteer and regex files, edit your `solrconfig.xml` file. The `addGazetteers` option takes four parameters:

- language
- file
- accept (`True`) or reject (`False`)
- case-sensitive (`True` or `False`)
<str name="addGazetteers">eng,/path/to/spacecraft_gaz.txt,True,True</str>
:language | file | accept | case-sensitive |
---|---|---|---|
eng | /path/to/spacecraft_gaz.txt | True | True |
The `addRegularExpressions` option takes two parameters:

- file
- accept (`True`) or reject (`False`)

`<str name="addRegularExpressions">/path/to/zulu_time_regex.xml,True</str>`

file | accept |
---|---|
/path/to/zulu_time_regex.xml | True |
Add both options to your processor chain in the `solrconfig.xml` file:
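Under the same assumptions as the earlier chain sketch (the factory class name is hypothetical), the chain with both options added might look like this:

```xml
<updateRequestProcessorChain name="rex">
  <processor class="com.basistech.rosette.solr.EntityExtractionUpdateProcessorFactory">
    <str name="rootDirectory">/path/to/bt_root/rex-je</str>
    <str name="fields">text_eng</str>
    <!-- language, file, accept, case-sensitive -->
    <str name="addGazetteers">eng,/path/to/spacecraft_gaz.txt,True,True</str>
    <!-- file, accept -->
    <str name="addRegularExpressions">/path/to/zulu_time_regex.xml,True</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```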
Now, when you re-index your data and search `*:*`, the SpaceX document will have new entities listed in the `text_eng_REX_SPACECRAFT` and `text_eng_REX_ZULU_TIME` dynamic fields.

Additional Fusion deployment configurations are needed to use the REX gazetteer and regex options.
Using Multilingual Search (RBL)
RBL provides a set of linguistic tools to prepare your data for analysis. Language-specific models provide base forms (lemmas) of words, parts-of-speech tagging, compound components, normalized tokens, stems, and roots.

In this tutorial, we will index and query headlines in English, Chinese, and German to demonstrate the linguistics capabilities of RBL: lemmatization, tokenization, and decompounding.

Create Application

To begin, create a new application called “multilingual”.

Configuration
Edit Solr Configuration
We will begin by adding the Basis library elements to the `solrconfig.xml` file.

- Navigate to System > Solr Config to edit the `solrconfig.xml` file.
- In the `<lib/>` directive section, add the library directive shown in the sketch after this list. For Fusion 4.x users, the `dir` path is the local RBL installation path.
- Save your changes to `solrconfig.xml`.
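A sketch of the `<lib/>` directive, with a hypothetical jar location; point `dir` at your RBL installation:

```xml
<!-- Load the RBL jars from your installation (hypothetical path) -->
<lib dir="/path/to/bt_root/rbl-je/lib" regex=".*\.jar" />
```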
Edit Schema
Add a `fieldType` element for each language to be processed by the application. The `fieldType` element includes two analyzers: one for indexing documents and one for querying documents. Each analyzer contains a tokenizer and a token filter. The `language` attribute is set to the language code, equal to the ISO 639-3 code in most cases. The `rootDirectory` points to the RBL directory.

- Navigate to System > Solr Config to edit the `managed-schema.xml` file.
- In the fieldType section, add the following new field types: `basis_english`, `basis_chinese`, and `basis_german`. A sketch of one such type appears after this list.
You can incorporate any additional Solr filters you need, such as the Solr lowercase filter. However, filters should be added into the chain after the Base Linguistics token filter. If you modify the token stream too significantly before RBL, you degrade its ability to analyze the text.
- Save your changes to `managed-schema.xml`.
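A sketch of the `basis_english` field type using the `language` and `rootDirectory` options described above. The tokenizer and filter factory class names and the `query="true"` attribute are assumptions, so use the classes documented for your RBL release; `basis_chinese` and `basis_german` follow the same pattern with `language="zho"` and `language="deu"`:

```xml
<fieldType name="basis_english" class="solr.TextField" positionIncrementGap="100">
  <!-- Index-time analyzer: RBL tokenizer plus RBL token filter -->
  <analyzer type="index">
    <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
               rootDirectory="/path/to/bt_root/rbl-je" language="eng"/>
    <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
            rootDirectory="/path/to/bt_root/rbl-je" language="eng"/>
    <!-- Any extra Solr filters (e.g. lowercase) go here, after the RBL filter -->
  </analyzer>
  <!-- Query-time analyzer: same chain, configured for queries -->
  <analyzer type="query">
    <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
               rootDirectory="/path/to/bt_root/rbl-je" language="eng" query="true"/>
    <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
            rootDirectory="/path/to/bt_root/rbl-je" language="eng" query="true"/>
  </analyzer>
</fieldType>
```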
Define Fields
The data file we will use, multilingual_headlines.csv, contains fields for headlines in three languages: `eng_headline`, `zho_headline`, and `deu_headline`. The analysis chain requires a field definition with a `type` attribute that maps to the `fieldType` you defined in the schema.

To create new fields, navigate to Collections > Fields and click Add a Field. Create the following fields:

Field name | Field type | Other options |
---|---|---|
text_eng | basis_english | Use default options. |
text_zho | basis_chinese | Use default options. |
text_deu | basis_german | Use default options. |
Be sure to save each field after creating it.
Indexing Data
Create Indexing Pipeline
- Navigate to Indexing > Indexing Pipelines.
- Click Add and create a new pipeline called `test-multilingual`.
- Select the Field Mapping stage.
- In the Field Translations section, add three new rows:

Source field | Target field | Operation |
---|---|---|
eng_headline | text_eng | move |
zho_headline | text_zho | move |
deu_headline | text_deu | move |

- Save the new pipeline.
Create Datasource
In this step, we will upload and index our documents from the data file.

- Navigate to Indexing > Datasources.
- Click Add and select File Upload from the dropdown menu.
- Enter `multilingual_headlines` for the Datasource ID, or use a name you prefer.
- Select `test-multilingual` for the Pipeline ID.
- In the File Upload field, choose the sample file `multilingual_headlines.csv` and click Upload File. The File ID field will be populated automatically. Leave all other values as their defaults.
- Save the new datasource. The form will refresh, adding a set of buttons at the top.
- Click Run and then Start. When the job is finished, you will see “Success” in the popup form.
Querying Data
- Navigate to Querying > Query Workbench. The default query is `*:*`, which should bring up ten documents.
- Follow the examples in the subsections below to see how Fusion’s Advanced Linguistics capabilities can improve your search results.
Lemmatization
A “lemma” is the canonical form of a word, or the version of a word that you find in the dictionary. For example, the lemma of “mice” is “mouse”. The words “speaks”, “speaking”, “spoke”, and “spoken” all share the same lemma: “speak”. With RBL, you can perform searches by lemma, thus increasing your search results. This example demonstrates this practice with the words “knife” and “knives” below.

- For ease of viewing results, select the Display Fields dropdown and enter `text_eng` in the Description field.
- Enter the query `text_eng:knife` in the search box.

One headline contains the exact word `knife`. With a standard Solr text field type, this would be the only result returned. However, the special type `basis_english` we configured allows the search engine to recognize “knives” as a form of “knife”. Therefore, the article “The Best Ways to Sharpen Kitchen Knives at Home” is also returned.

RBL can significantly reduce your dependence on creating, maintaining, and using large synonym lists.

Tokenization
Tokenization is the process of separating a piece of text into smaller units called “tokens”. Tokens can be words, characters, or subwords, depending on how they are defined and analyzed. The RBL tokenizer first determines sentence boundaries, then segments each sentence into individual tokens. The most useful tokens are often words, though they may also be numbers or other characters.

In some languages like Chinese and Japanese, word tokens are not separated by whitespace, and words can consist of one, two, or more characters. For example, the tokens in 我喜歡貓 (I like cats) are 我 (I), 喜歡 (like), and 貓 (cats). RBL uses statistical models to identify token boundaries, allowing for more accurate search results.

- For ease of viewing results, select the Display Fields dropdown and enter `text_zho` in the Description field.
- Enter the query `text_zho:美國` (United States) in the search box.

Without this analysis, the first character of the query, 美 (beautiful), would trigger a false positive match, even though it is not a word in this context. However, with the advanced analytics we have configured here, the query `text_zho:美` will correctly return zero results.

Compounds
RBL can decompose Chinese, Danish, Dutch, German, Hungarian, Japanese, Korean, Norwegian, and Swedish compounds, returning the lemmas of each of the components. The lemmas may differ from their surface form in the compound, such that the concatenation of the components is not the same as the original compound (or its lemma). Components are often connected by elements that are present only in the compound form. RBL allows Solr to index and query on these components, increasing recall of search results.

- For ease of viewing results, select the Display Fields dropdown and enter `text_deu` in the Description field.
- Enter the query `text_deu:Land` in the search box.

One headline contains a German compound built from Heimat (home) and Land (country). Searching on `Land` with a standard Solr text field type would not trigger a match. However, because RBL performs decompounding with lemmatization, searching on `Heimat` or `Land` will return a result.

Customization (Advanced)
When setting up the Solr configuration, you specified the `language` and `rootDirectory` options in your field type definition. This is sufficient for most use cases. However, RBL does provide more options to control the behavior of the tokenizer and analyzer. For example, the default tokenization does not consider URLs. As a result, `https://lucidworks.com` is tokenized as `https`, `lucidworks`, and `com`. If you wish to recognize URLs, you can add the option `urls="true"` to the tokenizer in your field type definition:
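A sketch of the tokenizer line with the option added, under the same class-name assumption as the field type sketch above:

```xml
<!-- urls="true" keeps each URL as a single token -->
<tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
           rootDirectory="/path/to/bt_root/rbl-je" language="eng"
           urls="true"/>
```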
This will instruct RBL to consider URLs as a single token, for example, `https://lucidworks.com`. To see a list of all options, consult the full RBL documentation.

Bug Fixes
Fusion 5.8.1 fixes a bug in the Fusion 5.8.0 Helm chart that prevented horizontal pod autoscaling from working. If you are not using horizontal pod autoscaling, you do not need to upgrade to Fusion 5.8.1. This release does not make any other changes to your deployment.

To use horizontal pod autoscaling in Fusion 5.8.1, follow these steps:

- Add the metrics server to your Fusion 5.8.1 deployment.
- Ensure the following changes are made to your custom values YAML file for horizontal pod autoscaling:
  - Add service limits to the `resources` object of a service, as in the sketch below.
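A sketch, assuming a service entry (here a hypothetical `query-pipeline` block) in the custom values YAML; the CPU and memory numbers are placeholders:

```yaml
# Hypothetical service block in the custom values YAML
query-pipeline:
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:          # service limits required for autoscaling
      cpu: "2"
      memory: 4Gi
```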
Resource limits vary depending on your deployment. Use values specific to your needs.
Do not use horizontal scaling with Argo. Horizontal scaling is not available with the Argo workflow controller. Do not apply horizontal autoscaling, or any other type of horizontal scaling, to the Argo service. Applying horizontal autoscaling to Argo can cause unexpected behavior, such as pods being unnecessarily terminated, jobs failing to launch, or models failing to deploy.
- Use the new keys and values to support autoscaling. For example, the key `targetAverageUtilization` is now `target`, which requires the keys `type` and `averageUtilization`, as in the sketch below.
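A before-and-after sketch, assuming the values file mirrors the Kubernetes autoscaling/v2 metric schema; the utilization number is a placeholder:

```yaml
# Before (Fusion 5.8.0 and earlier):
# targetAverageUtilization: 70

# After (Fusion 5.8.1):
target:
  type: Utilization
  averageUtilization: 70
```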
To verify that horizontal pod autoscaling is running, use the `k get HorizontalPodAutoscaler` command (where `k` is your kubectl alias). The output should resemble the following:
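An illustrative output sketch; the resource name and the numbers are hypothetical:

```
NAME             REFERENCE                   TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
query-pipeline   Deployment/query-pipeline   45%/70%   1         4         2          5m
```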
Known issues

- New Kerberos security realms cannot be configured successfully in this version of Fusion.