Component | Version |
---|---|
Solr | fusion-solr 5.8.0 (based on Solr 9.1.1) |
ZooKeeper | 3.7.1 |
Spark | 3.2.2 |
Kubernetes | GKE, AKS, EKS 1.24. Rancher (RKE) and OpenShift 4 compatible with Kubernetes 1.24. OpenStack and customized Kubernetes installs are not supported. See Kubernetes support for end of support dates. |
Ingress Controllers | Nginx, Ambassador (Envoy), GKE Ingress Controller. Istio is not supported. |
Use Advanced Linguistics with Babel Street
Entity extraction identifies entity types such as:

PERSON
LOCATION
ORGANIZATION
PRODUCT
TITLE
NATIONALITY
RELIGION
To enable entity extraction, we will modify the `solrconfig.xml` file. We will also add a new update processor to perform the entity extraction.

In the `<lib/>` directive section of `solrconfig.xml`, add the lines below. Fusion 5.9 and later already contain these lines.
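The exact lines depend on your Fusion and REX versions and are not reproduced here; a minimal sketch, assuming REX is installed under the hypothetical path `/opt/rex`, might look like this:

```xml
<!-- Sketch only: load the REX plugin jars from the local REX installation.
     Adjust dir to your actual REX installation path. -->
<lib dir="/opt/rex/lib" regex=".*\.jar" />
```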
The `dir` paths are the local REX installation path.

In the `<updateRequestProcessorChain/>` section, add the following lines after the existing processor chains:
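The original chain definition is not reproduced here. Purely as a sketch, assuming a hypothetical REX processor factory class (use the class name shipped with your REX Solr plugin), a chain named `rex` might look something like this:

```xml
<!-- Sketch only: an update processor chain named "rex" that runs REX entity
     extraction before the standard log/run processors. The factory class and
     the "fields" option name are placeholders, not the documented plugin API. -->
<updateRequestProcessorChain name="rex">
  <processor class="com.example.rex.EntityExtractionUpdateProcessorFactory">
    <str name="fields">text_eng</str> <!-- field to extract entities from -->
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```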
The chain extracts entities from the `text_eng` field. We will create this field through the Fusion UI in the next step. Save your changes to `solrconfig.xml`.
The sample data contains two fields: `title`, an article headline, and `article_text`, the text content of the article.

To create new fields, navigate to Collections > Fields and click Add a Field. Create the following fields:

Field name | Field type | Other options |
---|---|---|
title | string | Use default options. |
text_eng | text_en | Use default options. |
text_eng_REX_* | string | Create this field as a dynamic field by clicking the Dynamic checkbox. Click the Multivalued checkbox. Leave other options as defaults. |
Create a new index pipeline named `test-entities`.

In the Field Mapping stage, add a mapping with the source `article_text` and the target `text_eng`. Set the Operation to move. Leave the Operation for the other mappings set to keep.

In the Solr Indexer stage, add a parameter with the name `update.chain` and the value `rex`, so that incoming documents are processed by the `rex` chain defined earlier.
Create a new datasource:

Enter `eng_docs` for the Datasource ID. Alternatively, use a name you prefer.
Select `test-entities` for the Pipeline ID.
Choose the file `eng_docs.csv` and click Upload File.
The File ID field will be automatically populated. Leave all other values as their defaults.

Run the datasource, then search for `*:*`, which should bring up three documents.
The extracted entities appear in the `text_eng_REX_*` field names. For example, search for `text_eng_REX_LOCATION:"New York"` to return the article that contains a mention of New York.
The confidence threshold is a value between `0` and `1` and applies to entities extracted by the statistical model. We recommend starting with a low value, around `0.2`. In your `solrconfig.xml` file, add the options `calculateConfidence` and `confidenceThreshold` to your processor chain definition:
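The original snippet is not reproduced here; following the `<str name="...">` option format used elsewhere on this page, the additions to the REX processor definition would look roughly like this (the `true` value for `calculateConfidence` is an assumption):

```xml
<!-- Sketch only: enable confidence scoring and drop entities scoring below 0.2. -->
<str name="calculateConfidence">true</str>
<str name="confidenceThreshold">0.2</str>
```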
Re-index the documents and search for `*:*` again. Note that for the SpaceX article, “Falcon” is now correctly omitted from the list of LOCATIONs.

Entities can also be defined with regular expressions. Create a `zulu_time_regex.xml` file with the following lines:
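The original file contents are not preserved here. The exact XML schema for REX regular expression files is defined in the REX documentation; purely as an illustrative sketch, assuming a simple `regexps`/`regexp` layout, the file might look like this:

```xml
<!-- Sketch only: tag 4-digit military times followed by UTC or GMT as ZULU_TIME.
     The element and attribute names are assumptions, not the documented REX schema. -->
<regexps>
  <regexp type="ZULU_TIME">([01][0-9]|2[0-3])[0-5][0-9] ?(UTC|GMT)</regexp>
</regexps>
```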
This regular expression tags as `ZULU_TIME` all spans that consist of a 4-digit military time unit followed by the time zone designator UTC or GMT.

Gazetteers and regular expression files are enabled with options in the `solrconfig.xml` file. The `addGazetteers` option takes four parameters: the language of the gazetteer, the path to the gazetteer file, whether its entries are accepted (`True`) or rejected (`False`), and whether matching is case-sensitive (`True` or `False`). For example, `<str name="addGazetteers">eng,/path/to/spacecraft_gaz.txt,True,True</str>` breaks down as follows:
:language | file | accept | case-sensitive |
---|---|---|---|
eng | /path/to/spacecraft_gaz.txt | True | True |
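The gazetteer file itself is not shown on this page. As an illustrative sketch only, assuming a layout of the entity type on the first line followed by one entry per line (check the REX documentation for the exact format), `spacecraft_gaz.txt` might contain entries such as these (the specific spacecraft names are placeholders):

```text
SPACECRAFT
Falcon
Falcon 9
Dragon
```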
The `addRegularExpressions` option takes two parameters: the path to the regular expression file, and whether its matches are accepted (`True`) or rejected (`False`). For example, `<str name="addRegularExpressions">/path/to/zulu_time_regex.xml,True</str>` breaks down as follows:
:file | accept |
---|---|
/path/to/zulu_time_regex.xml | True |
Add both options to the REX processor definition in your `solrconfig.xml` file:
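Based on the two examples above, the additions look like this (the exact placement inside your processor definition may vary):

```xml
<!-- Enable the spacecraft gazetteer and the Zulu-time regular expression file. -->
<str name="addGazetteers">eng,/path/to/spacecraft_gaz.txt,True,True</str>
<str name="addRegularExpressions">/path/to/zulu_time_regex.xml,True</str>
```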
Re-index and search for `*:*` again; the SpaceX document will have new entities listed in the `text_eng_REX_SPACECRAFT` and `text_eng_REX_ZULU_TIME` dynamic fields.

Next, configure RBL token analysis. The RBL libraries are loaded in the `solrconfig.xml` file. Open the collection's `solrconfig.xml` file.
In the `<lib/>` directive section, add the following lines:
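Again, the original lines are not reproduced here; a minimal sketch, assuming RBL is installed under the hypothetical path `/opt/rbl`, might look like this:

```xml
<!-- Sketch only: load the RBL jars from the local RBL installation.
     Adjust dir to your actual installation path. -->
<lib dir="/opt/rbl/lib" regex=".*\.jar" />
```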
The `dir` path is the local RBL installation path. Save your changes to `solrconfig.xml`.
Add a `fieldType` element for each language to be processed by the application. The `fieldType` element includes two analyzers: one for indexing documents and one for querying documents. Each analyzer contains a tokenizer and a token filter. The `language` attribute is set to the language code, equal to the ISO 639-3 code in most cases. The `rootDirectory` points to the RBL directory. These definitions are added to the `managed-schema.xml` file.
We will define three field types: `basis_english`, `basis_chinese`, and `basis_german`.
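The actual field type definitions are not reproduced here. As a sketch only, with hypothetical factory class names standing in for the tokenizer and token filter factories shipped with RBL, the English field type might look like this:

```xml
<!-- Sketch only: an RBL-backed field type with separate index and query analyzers.
     The com.example.rbl.* class names are placeholders for the real RBL factories. -->
<fieldType name="basis_english" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="com.example.rbl.BaseLinguisticsTokenizerFactory"
               language="eng" rootDirectory="/opt/rbl"/>
    <filter class="com.example.rbl.BaseLinguisticsTokenFilterFactory"
            language="eng" rootDirectory="/opt/rbl"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="com.example.rbl.BaseLinguisticsTokenizerFactory"
               language="eng" rootDirectory="/opt/rbl"/>
    <filter class="com.example.rbl.BaseLinguisticsTokenFilterFactory"
            language="eng" rootDirectory="/opt/rbl"/>
  </analyzer>
</fieldType>
```

The `basis_chinese` and `basis_german` types follow the same pattern with `language="zho"` and `language="deu"`, respectively.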
Save the changes to `managed-schema.xml`.

The sample data contains the headline fields `eng_headline`, `zho_headline`, and `deu_headline`.
The analysis chain requires a field definition with a `type` attribute that maps to the `fieldType` you defined in the schema.

To create new fields, navigate to Collections > Fields and click Add a Field. Create the following fields:

Field name | Field type | Other options |
---|---|---|
text_eng | basis_english | Use default options. |
text_zho | basis_chinese | Use default options. |
text_deu | basis_german | Use default options. |
Create a new index pipeline named `test-multilingual`.

In the Field Mapping stage, add the following mappings:

Source field | Target field | Operation |
---|---|---|
eng_headline | text_eng | move |
zho_headline | text_zho | move |
deu_headline | text_deu | move |
Create a new datasource:

Enter `multilingual_headlines` for the Datasource ID. Alternatively, use a name you prefer.
Select `test-multilingual` for the Pipeline ID.
Choose the file `multilingual_headlines.csv` and click Upload File.
The File ID field will be automatically populated. Leave all other values as their defaults.

Run the datasource, then search for `*:*`, which should bring up ten documents.
Enter `text_eng` in the Description field. Enter `text_eng:knife` in the search box. One of the headlines contains the exact word knife. With a standard Solr text field type, this would be the only result returned. However, the special type `basis_english` we configured allows the search engine to recognize “knives” as a form of “knife”. Therefore, the article “The Best Ways to Sharpen Kitchen Knives at Home” is also returned. RBL can significantly reduce your dependence on creating, maintaining, and using large synonym lists.
Enter `text_zho` in the Description field. Enter `text_zho:美國` (United States) in the search box. With a standard Solr text field type, a search for the single character 美 (beautiful) would trigger a false positive match, even though it is not a word in this context. However, with the advanced analytics we have configured here, the query `text_zho:美` will correctly return zero results.
Enter `text_deu` in the Description field. Enter `text_deu:Land` in the search box. Searching on Land (country) with a standard Solr text field type would not trigger a match. However, because RBL performs decompounding with lemmatization, searching on `Heimat` or `Land` will return a result.
By default, the URL https://lucidworks.com is tokenized as `https`, `lucidworks`, and `com`. If you wish to recognize URLs, you can add the option `urls="true"` to the tokenizer in your field type definition:
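For example, following the sketch above (the factory class name is again a placeholder):

```xml
<!-- Sketch only: enable URL recognition on the RBL tokenizer. -->
<tokenizer class="com.example.rbl.BaseLinguisticsTokenizerFactory"
           language="eng" rootDirectory="/opt/rbl" urls="true"/>
```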
With this option, https://lucidworks.com is kept as a single token. To see a list of all options, consult the full RBL documentation.

Use Tika Asynchronous Parsing
Asynchronous Tika parsing prepends the prefix `parser_` to fields added to a document. System fields, which start with `_lw_`, are not prepended with `parser_`. If you are migrating to asynchronous Tika parsing, and your search application configuration relies on specific field names, update your search application to use the new fields.

In the following example, a field `docs_counter_i` with an increment value of `1` is added:
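The original example is not reproduced here. If this refers to a Solr-style atomic update, a minimal sketch of such a request might look like this (the document ID is a placeholder):

```xml
<!-- Sketch only: an atomic update that increments docs_counter_i by 1. -->
<add>
  <doc>
    <field name="id">example-doc-1</field>
    <field name="docs_counter_i" update="inc">1</field>
  </doc>
</add>
```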
Synchronous parsing through the `/index` and `/reindex` endpoints of the Index Pipelines API is deprecated and will eventually stop working altogether in the continuing switch to asynchronous parsing.
The data type of the query parameter `rows` has changed. Previously `rows` accepted an integer, but now it must be entered as a string, as in `("rows", "1")`.