Component | Version |
---|---|
Solr | fusion-solr 5.8.0 (based on Solr 9.1.1) |
ZooKeeper | 3.7.1 |
Spark | 3.2.2 |
Kubernetes | GKE, AKS, and EKS 1.24; Rancher (RKE) and OpenShift 4 compatible with Kubernetes 1.24. OpenStack and customized Kubernetes installs are not supported. See Kubernetes support for end-of-support dates. |
Ingress Controllers | Nginx, Ambassador (Envoy), and the GKE Ingress Controller. Istio is not supported. |
Looking to upgrade? See Fusion 5 Upgrades for detailed instructions.
Rosette Entity Extractor (REX) and Rosette Base Linguistics (RBL), used by the Advanced Linguistics with Babel Street feature, are not compatible with the version of Solr 9 included in this version of Fusion. If you rely on the Babel Street language module, do not upgrade until this compatibility issue is resolved.
Use Advanced Linguistics with Babel Street
The Fusion Advanced Linguistics Package embeds Babel Street’s (formerly Basistech) Rosette natural language processing tools for multilingual text analysis. To improve search recall, Rosette Base Linguistics (RBL) handles the unique linguistic phenomena of more than 30 Asian and European languages. Rosette Entity Extractor (REX) identifies named entities such as people, locations, and organizations, allowing you to quickly refine your search, remove noise, and increase search relevance.

If there is a particular entity you want to make sure is extracted or rejected, or if you wish to create a custom entity type, REX also supports gazetteers and regular expressions.
Using Named Entities (REX)
REX extracts named entities in multiple languages, including English, Chinese (traditional and simplified), and German. In English, it extracts multiple entity types and subtypes, including the following entity types (along with their associated subtypes):

- PERSON
- LOCATION
- ORGANIZATION
- PRODUCT
- TITLE
- NATIONALITY
- RELIGION
Create Application
To begin, create a new application called “entities”.

Configuration
Edit Solr Configuration
We will begin by adding the Basis library elements to the `solrconfig.xml` file. We will also add a new update processor to perform the entity extraction.

- Navigate to System > Solr Config to edit the `solrconfig.xml` file.
- Fusion 5.8 and earlier: In the `<lib/>` directive section, add the Basis library directives. Fusion 5.9 and later already contain these lines. For Fusion 4.x users, the `dir` paths are the local REX installation path.
- In the `<updateRequestProcessorChain/>` section, add a new entity-extraction processor chain after the existing processor chains. Note that the chain references a field called `text_eng`; we will create this field through the Fusion UI in the next step.
- Save your changes to `solrconfig.xml`.
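As a sketch, the processor chain added above might look like the following. The processor factory class name is a placeholder for the REX update processor class shipped with the Advanced Linguistics package, and the `rootDirectory` path is an assumption; only the chain name (`rex`), the target field (`text_eng`), and the `rootDirectory`/`fields` option names come from this tutorial.

```xml
<!-- Entity-extraction chain; referenced later via update.chain=rex -->
<updateRequestProcessorChain name="rex">
  <!-- Placeholder class: use the REX processor factory from your package -->
  <processor class="com.basistech.rosette.solr.RexUpdateProcessorFactory">
    <str name="rootDirectory">/path/to/rex</str>
    <!-- Extract entities from the text_eng field -->
    <str name="fields">text_eng</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```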
Define Fields
The data file we will use, eng_docs.csv, contains two fields:

- `title`: an article headline
- `article_text`: the text content of the article

To create new fields, navigate to Collections > Fields and click Add a Field. Create the following fields:
Field name | Field type | Other options |
---|---|---|
title | string | Use default options. |
text_eng | text_en | Use default options. |
text_eng_REX_* | string | Create this field as a dynamic field by clicking the Dynamic checkbox. Click the Multivalued checkbox. Leave other options as defaults. |

Be sure to save each field after creating it.
Indexing Data
Create Indexing Pipeline
- Navigate to Indexing > Indexing Pipelines.
- Click Add and create a new pipeline called `test-entities`.
- Select the Field Mapping stage.
- In the Field Translations section, add a new row with source `article_text` and target `text_eng`. Set the Operation to `move`.
- Select the Solr Indexer stage.
- In the Additional Update Request Parameters section, add a new row with parameter name `update.chain` and value `rex`.
- Save the new pipeline.
Create Datasource
In this step, we will upload and index our documents from the data file.

- Navigate to Indexing > Datasources.
- Click Add and select File Upload V2 from the dropdown menu.
- Enter `eng_docs` for the Datasource ID, or use a name you prefer.
- Select `test-entities` for the Pipeline ID.
- In the File Upload field, choose the sample file `eng_docs.csv` and click Upload File. The File ID field is automatically populated. Leave all other values at their defaults.
- Save the new datasource. The form refreshes, adding a set of buttons at the top.
- Click Run, then Start. When the job finishes, you will see “Success” in the popup form.
Querying Data
- Navigate to Querying > Query Workbench. The default query is `*:*`, which should bring up three documents.
- For the document with title “SpaceX Successfully Launches its First Crewed Spaceflight”, select Show fields. You will see a number of entities listed under the `text_eng_REX_*` field names.
- Search on these multivalued fields. For example, set your query to `text_eng_REX_LOCATION:"New York"` to return the article that mentions New York.
Customization (Advanced)
When setting up the Solr configuration, you specified the `rootDirectory` and `fields` options in your processor chain. REX provides a number of other configuration options you can set to control how entities are extracted. For example, if you are finding false positives, you can set parameters instructing REX to return only entities above a confidence threshold. The confidence threshold is a value between 0 and 1 and applies to entities extracted by the statistical model. We recommend starting with a low value, around 0.2. In your `solrconfig.xml` file, add the options `calculateConfidence` and `confidenceThreshold` to your processor chain definition. Save the changes, re-index your data, and perform the same query on `*:*`. Note that for the SpaceX article, “Falcon” is now correctly omitted from the list of LOCATION entities.
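Following the `<str>` option format used for the other REX options in this section, the two options might be added to the processor definition like this (the `true` value for `calculateConfidence` is an assumption):

```xml
<!-- Only keep statistical-model entities above the confidence threshold -->
<str name="calculateConfidence">true</str>
<str name="confidenceThreshold">0.2</str>
```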
Gazetteers
A gazetteer is a UTF-8 text file in which the first line is the entity type, followed by the names of the entities you wish to extract, one per line, in the language of your documents. Comments can be prefixed by the # symbol. Create a file `spacecraft_gaz.txt` listing the spacecraft names you want to extract.

Regular expressions

REX uses the Tcl regex format. Create a file `zulu_time_regex.xml` defining an entity type `ZULU_TIME` that extracts all spans consisting of a 4-digit military time unit followed by the time zone designator UTC or GMT.

Example
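As a sketch, the two files might contain the following. The spacecraft names are illustrative assumptions, and the XML wrapper shown for the regex file is a placeholder; check the REX documentation for the exact regex-file schema.

```
# spacecraft_gaz.txt — the first line is the entity type
SPACECRAFT
Falcon 9
Crew Dragon
Soyuz
```

```xml
<!-- zulu_time_regex.xml — placeholder schema; see the REX docs -->
<regexps>
  <!-- ZULU_TIME: a 4-digit military time followed by UTC or GMT -->
  <regexp type="ZULU_TIME">[0-2][0-9][0-5][0-9] ?(UTC|GMT)</regexp>
</regexps>
```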
To instruct REX to use the gazetteer and regex file, edit your `solrconfig.xml` file. The `addGazetteers` option takes four parameters:

- language
- file
- accept (`True`) or reject (`False`)
- case-sensitive (`True` or `False`)
<str name="addGazetteers">eng,/path/to/spacecraft_gaz.txt,True,True</str>
The parameters in this example are:

language | file | accept | case-sensitive |
---|---|---|---|
eng | /path/to/spacecraft_gaz.txt | True | True |
The `addRegularExpressions` option takes two parameters:

- file
- accept (`True`) or reject (`False`)
<str name="addRegularExpressions">/path/to/zulu_time_regex.xml,True</str>
The parameters in this example are:

file | accept |
---|---|
/path/to/zulu_time_regex.xml | True |
Save your changes to the `solrconfig.xml` file and re-index your data. When you re-run the query `*:*`, the SpaceX document will have new entities listed in the `text_eng_REX_SPACECRAFT` and `text_eng_REX_ZULU_TIME` dynamic fields.

Additional Fusion deployment configurations are needed to use the REX gazetteer and regex options.
Using Multilingual Search (RBL)
RBL provides a set of linguistic tools to prepare your data for analysis. Language-specific models provide base forms (lemmas) of words, parts-of-speech tagging, compound components, normalized tokens, stems, and roots. In this tutorial, we will index and query headlines in English, Chinese, and German to demonstrate the linguistics capabilities of RBL: lemmatization, tokenization, and decompounding.

Create Application

To begin, create a new application called “multilingual”.

Configuration
Edit Solr Configuration
We will begin by adding the Basis library elements to the `solrconfig.xml` file.

- Navigate to System > Solr Config to edit the `solrconfig.xml` file.
- In the `<lib/>` directive section, add the Basis library directives. For Fusion 4.x users, the `dir` path is the local RBL installation path.
- Save your changes to `solrconfig.xml`.
Edit Schema
Add a `fieldType` element for each language to be processed by the application. The `fieldType` element includes two analyzers: one for indexing documents and one for querying documents. Each analyzer contains a tokenizer and a token filter. The `language` attribute is set to the language code, equal to the ISO 639-3 code in most cases. The `rootDirectory` points to the RBL directory.

- Navigate to System > Solr Config to edit the `managed-schema.xml` file.
- In the fieldType section, add three new field types: `basis_english`, `basis_chinese`, and `basis_german`.
You can incorporate any additional Solr filters you need, such as the Solr lowercase filter. However, filters should be added into the chain after the Base Linguistics token filter. If you modify the token stream too significantly before RBL, you degrade its ability to analyze the text.
- Save your changes to `managed-schema.xml`.
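As a sketch, the `basis_english` field type might look like the following. The factory class names should be verified against the classes shipped with your Advanced Linguistics package, and the `rootDirectory` path is an assumption; only the overall structure (two analyzers, each with a tokenizer and token filter, plus the `language` and `rootDirectory` attributes) comes from this tutorial.

```xml
<fieldType name="basis_english" class="solr.TextField" positionIncrementGap="100">
  <!-- language is the ISO 639-3 code; rootDirectory points to the RBL directory -->
  <analyzer type="index">
    <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
               language="eng" rootDirectory="/path/to/rbl"/>
    <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
            language="eng" rootDirectory="/path/to/rbl"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
               language="eng" rootDirectory="/path/to/rbl"/>
    <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
            language="eng" rootDirectory="/path/to/rbl"/>
  </analyzer>
</fieldType>
```

The `basis_chinese` and `basis_german` field types follow the same pattern with `language="zho"` and `language="deu"`.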
Define Fields
The data file we will use, multilingual_headlines.csv, contains fields for headlines in three languages: `eng_headline`, `zho_headline`, and `deu_headline`. The analysis chain requires a field definition with a `type` attribute that maps to the `fieldType` you defined in the schema.

To create new fields, navigate to Collections > Fields and click Add a Field. Create the following fields:

Field name | Field type | Other options |
---|---|---|
text_eng | basis_english | Use default options. |
text_zho | basis_chinese | Use default options. |
text_deu | basis_german | Use default options. |
Be sure to save each field after creating it.
Indexing Data
Create Indexing Pipeline
- Navigate to Indexing > Indexing Pipelines.
- Click Add and create a new pipeline called `test-multilingual`.
- Select the Field Mapping stage.
- In the Field Translations section, add three new rows:

Source field | Target field | Operation |
---|---|---|
eng_headline | text_eng | move |
zho_headline | text_zho | move |
deu_headline | text_deu | move |

- Save the new pipeline.
Create Datasource
In this step, we will upload and index our documents from the data file.

- Navigate to Indexing > Datasources.
- Click Add and select File Upload from the dropdown menu.
- Enter `multilingual_headlines` for the Datasource ID, or use a name you prefer.
- Select `test-multilingual` for the Pipeline ID.
- In the File Upload field, choose the sample file `multilingual_headlines.csv` and click Upload File. The File ID field is automatically populated. Leave all other values at their defaults.
- Save the new datasource. The form refreshes, adding a set of buttons at the top.
- Click Run, then Start. When the job finishes, you will see “Success” in the popup form.
Querying Data
- Navigate to Querying > Query Workbench. The default query is `*:*`, which should bring up ten documents.
- Follow the examples in the subsections below to see how Fusion’s Advanced Linguistics capabilities can improve your search results.
Lemmatization
A “lemma” is the canonical form of a word: the version you find in the dictionary. For example, the lemma of “mice” is “mouse”, and the words “speaks”, “speaking”, “spoke”, and “spoken” all share the same lemma, “speak”. With RBL, you can search by lemma, increasing your search results. This example demonstrates the practice with the words “knife” and “knives”.

- For ease of viewing results, select the Display Fields dropdown and enter `text_eng` in the Description field.
- Enter the query `text_eng:knife` in the search box.

One of the results contains the exact token `knife`. With a standard Solr text field type, this would be the only result returned. However, the `basis_english` type we configured allows the search engine to recognize “knives” as a form of “knife”, so the article “The Best Ways to Sharpen Kitchen Knives at Home” is also returned. RBL can significantly reduce your dependence on creating, maintaining, and using large synonym lists.

Tokenization
Tokenization is the process of separating a piece of text into smaller units called “tokens”. Tokens can be words, characters, or subwords, depending on how they are defined and analyzed. The RBL tokenizer first determines sentence boundaries, then segments each sentence into individual tokens. The most useful tokens are often words, though they may also be numbers or other characters.

In some languages, such as Chinese and Japanese, word tokens are not separated by whitespace, and words can consist of one, two, or more characters. For example, the tokens in 我喜歡貓 (I like cats) are 我 (I), 喜歡 (like), and 貓 (cats). RBL uses statistical models to identify token boundaries, allowing for more accurate search results.

- For ease of viewing results, select the Display Fields dropdown and enter `text_zho` in the Description field.
- Enter the query `text_zho:美國` (United States) in the search box.

With naive single-character matching, a query on 美 (beautiful) would trigger a false-positive match, even though it is not a word in this context. With the advanced analysis we have configured here, the query `text_zho:美` correctly returns zero results.

Compounds
RBL can decompose Chinese, Danish, Dutch, German, Hungarian, Japanese, Korean, Norwegian, and Swedish compounds, returning the lemmas of each of the components. The lemmas may differ from their surface forms in the compound, such that the concatenation of the components is not the same as the original compound (or its lemma). Components are often connected by elements that are present only in the compound form. RBL allows Solr to index and query on these components, increasing the recall of search results.

- For ease of viewing results, select the Display Fields dropdown and enter `text_deu` in the Description field.
- Enter the query `text_deu:Land` in the search box.

A headline containing only a compound of `Land` (country) would not trigger a match with a standard Solr text field type. However, because RBL performs decompounding with lemmatization, searching on `Heimat` or `Land` will return a result.

Customization (Advanced)
When setting up the Solr configuration, you specified the `language` and `rootDirectory` options in your field type definition. This is sufficient for most use cases, but RBL provides more options to control the behavior of the tokenizer and analyzer. For example, the default tokenization does not consider URLs, so https://lucidworks.com is tokenized as `https`, `lucidworks`, and `com`. If you wish to recognize URLs, add the option `urls="true"` to the tokenizer in your field type definition; this instructs RBL to treat a URL such as https://lucidworks.com as a single token.
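As a sketch, a tokenizer line with this option added might look like the following (the factory class name and `rootDirectory` path are placeholders for the values already in your schema):

```xml
<!-- urls="true" keeps URLs such as https://lucidworks.com as one token -->
<tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
           language="eng" rootDirectory="/path/to/rbl" urls="true"/>
```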
To see a list of all options, consult the full RBL documentation.

New Features
- Managed Fusion only: Added a new feature called Dynamic Pricing which improves scalability for custom pricing. This feature lets B2B organizations with large product and pricing inventories sort, facet, boost, and filter on custom prices and entitlements.
LucidAcademy: Lucidworks offers free training to help you get started. The course for Dynamic Pricing focuses on how Dynamic Pricing maximizes custom pricing strategies. Visit the LucidAcademy to see the full training catalog.
- Managed Fusion only: Fusion now supports Reverse Search, which lets you set up monitoring queries that automatically include new documents. Instead of running a query multiple times to see if new documents have been added, this feature matches incoming documents to existing relevant queries, improving content awareness and productivity.
- Tika Asynchronous Parsing improves document crawl speeds and prevents memory and stability issues during connector processes. You can use Tika Asynchronous Parsing to separate document crawling from document parsing, which is useful for large sets of complex documents. For more information, see Asynchronous Tika Parsing.
Use Tika Asynchronous Parsing
This document describes how to set up your application to use Tika asynchronous parsing. Unlike synchronous Tika parsing, which uses a parser stage, asynchronous Tika parsing is configured in the datasource and index pipeline. For more information, see Asynchronous Tika Parsing.

Field names change with asynchronous Tika parsing. In contrast to synchronous parsing, asynchronous Tika parsing prepends `parser_` to fields added to a document. System fields, which start with `_lw_`, are not prepended with `parser_`. If you are migrating to asynchronous Tika parsing and your search application configuration relies on specific field names, update your search application to use the new fields.

Configure the connectors datasource
- Navigate to your datasource.
- Enable the Advanced view.
- Enable the Async Parsing option.

The asynchronous parsing service performs Tika parsing using Apache Tika Server. In Fusion 5.8 through 5.9.10, other parsers, such as HTML and JSON, are not supported by the asynchronous parsing service, and enabling asynchronous parsing causes the parser configuration linked to your datasource to be ignored. In Fusion 5.9.11 and later, other parsers, such as HTML and JSON, are supported, and the parser configuration linked to your datasource is used.
- Save the datasource configuration.
Configure the parser stage
You must do this step in Fusion 5.9.11 and later.
- Navigate to Parsers.
- Select the parser, or create a new parser.
- From the Add a parser stage menu, select Apache Tika Container Parser.
- (Optional) Enter a label for this stage. This label changes the stage name from Apache Tika Container Parser to the value you enter in this field.
- If the Apache Tika Container Parser stage is not already the first stage, drag and drop the stage to the top of the stage list so it is the first stage that runs.
Configure the index pipeline
- Go to the Index Pipeline screen.
- Add the Solr Partial Update Indexer stage.
- Turn off the Reject Update if Solr Document is not Present option and turn on the Process All Pipeline Doc Fields option.
- Include an extra update field in the stage configuration using any update type and field name. In this example, an incremental field `docs_counter_i` with an increment value of `1` is added.
- Enable the Allow reserved fields option.
- Click Save.
- Turn off or remove the Solr Indexer stage, and move the Solr Partial Update Indexer stage to be the last stage in the pipeline.
The Apache Tika and Forked Tika stages are now deprecated. Follow the migration steps to begin using asynchronous parsing.
Improvements
- Improved recoverability for on-prem connectors in high network traffic environments.
- Fusion’s custom Solr image has been updated to fusion-solr 5.8.0. This upgrade includes the benefits and new features of Solr 9, along with custom plugins to support Dynamic Pricing, Reverse Search, and autoscaling.
- Developed new authentication methods for the MongoDB connector.
Bug Fixes
Fusion
- Fixed a bug where the indexing service failed to load some classes from some JDBC drivers.
- Updated the Helm charts used when deploying Prometheus, Grafana, Loki, and Promtail for monitoring.
- Fixed an error with permissions required for the Upload Model Parameters To Cloud job.
- The Graph Security Trimming stage now works when collections have multiple shards and replicas.
- Fixed a bug where having the same document updated twice in the same job could cause the job to hang.
- Fixed an issue where the Solr API was unable to pass through raw requests using the proxy.
- Updated the query pipeline and indexing container base images to use Java 11 so they are more secure.
- Removed UI link to view logs dashboard as its target screen is no longer available.
- Fixed a UI bug where zone display fields could not be manually removed.
- Fusion panel text editors can now scroll as expected in Firefox.
Predictive Merchandiser
- Fixed a bug in Predictive Merchandiser where higher-precedence templates using a specified trigger phrase and facet did not appear when that phrase was searched with that facet selected.
Deprecations
- Field Parser Index Stage is no longer used by Fusion connectors. It is officially deprecated in this release and will be removed entirely in a later release.
- Streaming documents to the `/index` and `/reindex` endpoints of the Index Pipelines API is deprecated and will eventually stop working altogether in the continuing switch to asynchronous parsing.
- Tika Server Parser is deprecated and will be replaced by Tika Asynchronous Parser.
- Apache Tika Parser stage is deprecated and will be removed in a later release.
- The Forked Apache Tika Parser stage is deprecated and will be removed completely in a later release.
Known issues
- New Kerberos security realms cannot be configured successfully in this version of Fusion.
- When using the JavaScript query stage to query Solr, you must provide parameters, including `rows`. Previously, `rows` accepted an integer, but it must now be entered as a string, as in `("rows", "1")`.