Index Data Sources
- Data source types
- File size limits
- Map fields
- Add and Configure Data Sources
- Change the Configuration of a Data Source
- Delete a Data Source
- Data sources and change
Note
|
If you haven’t read the introduction in the Site Search Guide, we recommend that you do. |
A Site Search app indexes data sources and provides a user interface in which users can search and interact with search results. You can also embed code for Site Search modules in your websites.
One Site Search app can search one or more data sources of the same type or of different types.
Data source types
Site Search supports these types of data sources:
-
Web crawler – Index web pages on a website that is on the public Web.
-
CSV – Index a comma-separated values (CSV) file. The delimiter can be a comma or a tab.
-
JSON – Index a JSON file.
-
Push Endpoint – Index data sent to a push endpoint.
File size limits
Site Search data sources have the following file size limits. Site Search doesn’t index files larger than these limits.
Data source type | File size limit |
---|---|
Web crawler | 50 MB |
CSV | 50 MB |
JSON | 50 MB |
Push Endpoint | 5 MB |
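Because files over the limit are not indexed, it can help to check sizes before uploading. The following is a minimal sketch, not part of Site Search; the helper names are hypothetical, and treating 1 MB as 1,000,000 bytes is an assumption, since the table does not say which definition of MB applies.

```python
import os

# Assumption: 1 MB = 1,000,000 bytes. The table above does not say
# whether MB means 10^6 or 2^20 bytes.
MB = 1_000_000

# Limits copied from the table above.
SIZE_LIMITS = {
    "Web crawler": 50 * MB,
    "CSV": 50 * MB,
    "JSON": 50 * MB,
    "Push Endpoint": 5 * MB,
}


def within_limit(size_bytes: int, source_type: str) -> bool:
    """Return True if size_bytes does not exceed the limit for source_type."""
    return size_bytes <= SIZE_LIMITS[source_type]


def file_within_limit(path: str, source_type: str) -> bool:
    """Check a file on disk against the limit for its data source type."""
    return within_limit(os.path.getsize(path), source_type)
```

For example, `file_within_limit("products.csv", "CSV")` would report whether a local file (the file name here is illustrative) is small enough to be indexed under that assumption.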
Map fields
Mapping fields is one step in configuring a data source. This section provides background information; you map the fields when you configure each data source.
For each type of data source, Site Search uses fields with specific names to display parts of the data in the result list. Here are some examples:
Field | Data sources | Description |
---|---|---|
 | All | Element in the result list that serves as a title for the result |
 | All | Longer text content. To use the More Like This smart panel, your data source must have a |
 | Web crawler and Push Endpoint | A URL that the |
 | All | An identifier for a specific record |
If the fields in your data sources have different names than those that Site Search uses, you must map each field that you want to appear in results from its native field name to the name that Site Search uses.
Other fields can appear in the results too, but they won’t get special treatment in result templates, for example, different font sizes or links.
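Conceptually, mapping is a rename from native field names to the names Site Search expects. The sketch below only illustrates that idea; in Site Search you map fields in the data source configuration, not in code. The `map_fields` helper and the native field name `sku` are hypothetical; `id` is the Site Search field name used later on this page.

```python
def map_fields(record: dict, mapping: dict) -> dict:
    """Return a copy of record with native field names renamed per mapping.

    Fields that are not in the mapping keep their original names.
    """
    return {mapping.get(name, name): value for name, value in record.items()}


# Hypothetical record whose identifier lives in a field named "sku".
record = {"sku": "A-100", "color": "red"}
mapped = map_fields(record, {"sku": "id"})
print(mapped)  # {'id': 'A-100', 'color': 'red'}
```

Unmapped fields such as `color` pass through unchanged, which mirrors the point above that other fields can still appear in results.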
Add and Configure Data Sources
Note
|
We assume that you (or someone else) have already created a Site Search app, and that you have opened the app. You must have the role Admin or Owner. |
Add and configure the data sources that your app uses.
-
Open the Site Search menu – In the upper left corner of Site Search, click
.
-
Add and configure data sources from which to index data – One search app can search one or more data sources of the same type or of different types.
Here are detailed steps for each of the data source types:
Note
|
You can add and configure additional data sources while other data sources are still indexing documents. |
Change the Configuration of a Data Source
You might need to change the configuration of a data source, for example, if the source of data changes.
-
In the Site Search menu, click a data source.
-
Use the tabbed wizard to configure the data source. For details, see Add and Configure Data Sources.
-
Save your changes. In some cases, Site Search will re-index the documents. Click the button at the bottom of the page (Save or Save and Index).
Save saves your configuration changes. Save and Index saves your configuration changes and indexes (crawls) the data source. Don’t click Crawl Now in the upper-right corner; that will index the data source now, but without saving your changes.
Delete a Data Source
Delete a data source that you no longer want to supply data to your search app.
Deleting a data source removes the data source configuration in Site Search and the index that was built from the data source. It doesn’t delete the files, web pages, and so on that Site Search crawled.
-
In the Site Search menu, click the data source you want to delete.
-
In the right pane, scroll to the bottom and click
.
-
Click Yes, Delete to confirm that you want to delete the data source.
Data sources and change
Users who search using Site Search modules or the Site Search app expect search results to be current. Here we explain how to ensure this for different types of data sources.
Web crawler data sources
Site Search periodically re-indexes Web crawler data sources so that search results are current:
-
For trial apps, the re-indexing frequency is every 24 hours.
-
For licensed apps, your contract determines the re-indexing frequency.
-
You can re-index a Web crawler data source on demand, for example, if there is an urgent need to index new documents.
CSV and JSON data sources
Important
|
If you plan to upload data from a CSV or JSON file more than once, and you want Site Search to handle the changes, then the file must have an id field, or you must map a field in the file to id. Doing so lets Site Search identify records. If you don’t do this, uploading the file again just adds all of the records again (unless you delete and recreate the data source first).
|
The id field must be a string field.
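Before uploading, you may want to confirm that every record has an id, that each id is a string, and that ids are not repeated. The following is a minimal sketch that works on records already loaded into Python dictionaries (for example, with `json.load`); the `check_ids` helper is hypothetical and is not part of Site Search.

```python
from collections import Counter


def check_ids(records: list[dict]) -> dict:
    """Report positions of records without an id, ids that are not
    strings, and string ids that occur more than once."""
    missing = [pos for pos, rec in enumerate(records) if "id" not in rec]
    non_string = [
        rec["id"] for rec in records
        if "id" in rec and not isinstance(rec["id"], str)
    ]
    # Count only string ids, so unhashable values cannot break the counter.
    counts = Counter(
        rec["id"] for rec in records if isinstance(rec.get("id"), str)
    )
    duplicates = [value for value, n in counts.items() if n > 1]
    return {"missing": missing, "non_string": non_string, "duplicates": duplicates}


records = [{"id": "a"}, {"id": 7}, {"name": "x"}, {"id": "a"}]
print(check_ids(records))
# {'missing': [2], 'non_string': [7], 'duplicates': ['a']}
```

An empty report (all three lists empty) suggests the file fits the unique-ID case described next.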
Here, we describe how records are handled during the first file upload (when you create a data source) and during subsequent file uploads.
Records with unique IDs
With an id field or a field mapped to id, this is the behavior for CSV and JSON data sources that have unique IDs:
On the first upload, all records get entries in the index.
For subsequent uploads of the same file:
-
Added records – Site Search adds entries for the new records to the index.
-
Updated records – Site Search updates the existing entries in the index.
-
Deleted records – Site Search doesn’t delete entries from the index for the deleted records. The entries will still appear in search results. See Manage record deletions in CSV and JSON files for strategies to manage record deletions from CSV and JSON files.
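The behavior above can be modeled as an upsert keyed on id in which nothing is ever removed. The sketch below is a simplified model for reasoning about repeated uploads, not Site Search's implementation; the `apply_upload` helper and the sample records are hypothetical.

```python
def apply_upload(index: dict, records: list[dict]) -> dict:
    """Model one upload: add entries for new ids, overwrite entries for
    existing ids, and leave entries for ids absent from the file alone."""
    updated = dict(index)
    for record in records:
        updated[record["id"]] = record
    return updated


first = [{"id": "1", "name": "apple"}, {"id": "2", "name": "pear"}]
# Second version of the file: id 1 changed, id 2 removed, id 3 added.
second = [{"id": "1", "name": "green apple"}, {"id": "3", "name": "plum"}]

index = apply_upload({}, first)
index = apply_upload(index, second)
print(sorted(index))        # ['1', '2', '3']
print(index["1"]["name"])   # green apple
```

Record `2` is still present after the second upload, which is why deletions need separate handling.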
Records with nonunique IDs or no IDs; retain all records
Proceed as follows to index a CSV or JSON file that contains records with nonunique IDs (possibly multiple records for the same ID) or no ID field, when the goal is to retain all records:
-
With no ID field, or with an ID field that is not named id – If the file doesn’t contain a field named id, don’t map a field to id.
-
With an id field – If the file contains a field named id, map the field to some other name.
In both cases, upload the file a single time (when you create the data source). Site Search will index all records. Don’t upload the file again. If records in the file change and you want to update entries in the index, then delete and recreate the data source.
Records with nonunique IDs; retain only a single record
Important
|
Index updates for records with nonunique IDs take the first record during both the initial file upload and subsequent uploads of the same file. |
Proceed as follows to index a CSV or JSON file that contains records with nonunique IDs (possibly multiple records for the same ID), when the goal is to retain a single record:
-
With an ID field that is not named id – If the file doesn’t contain a field named id, map the ID field to id.
-
With an id field – No action is necessary.
In both cases, you can upload the file multiple times.
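To preview which records would be retained when the first record for each ID is the one kept, you can apply the same rule locally. This is a minimal sketch under the assumption that "first" means first in file order; the `first_record_per_id` helper is hypothetical.

```python
def first_record_per_id(records: list[dict]) -> list[dict]:
    """Keep only the first record seen for each id, in file order."""
    kept = {}
    for record in records:
        # setdefault stores the record only if the id has not been seen yet.
        kept.setdefault(record["id"], record)
    return list(kept.values())


records = [
    {"id": "1", "name": "apple"},
    {"id": "2", "name": "pear"},
    {"id": "1", "name": "green apple"},
]
print(first_record_per_id(records))
# [{'id': '1', 'name': 'apple'}, {'id': '2', 'name': 'pear'}]
```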
Manage record deletions in CSV and JSON files
To ensure that previously uploaded records that are no longer in files don’t appear in search results, you can:
-
Block documents – You can search for and block documents that have been deleted from the source files.
-
Delete and recreate the data source – To remove all deleted documents en masse, delete and recreate that data source. With this approach, search results for the data source are briefly unavailable. Changing the source file to a different file isn’t sufficient.
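To find the documents to block, it can help to compare the ids in the previous and current versions of a file. The following is a minimal sketch for CSV content with a column named `id`; the helper names and the sample data are hypothetical, and blocking itself is still done in Site Search.

```python
import csv
import io


def ids_in_csv(text: str) -> set[str]:
    """Collect the values of the id column from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["id"] for row in reader}


def removed_ids(old_text: str, new_text: str) -> set[str]:
    """Return ids present in the old file but absent from the new one."""
    return ids_in_csv(old_text) - ids_in_csv(new_text)


old = "id,name\n1,apple\n2,pear\n"
new = "id,name\n1,apple\n3,plum\n"
print(sorted(removed_ids(old, new)))  # ['2']
```

For a tab-delimited file, passing `delimiter="\t"` to `csv.DictReader` would be the corresponding change.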
Push Endpoint data sources
For Push Endpoint data sources, you manage change as follows:
-
Add documents (singly or in batches) – Push documents with new id fields. Site Search adds the new records to the index.
-
Update documents (singly or in batches) – Push documents with existing id fields. Site Search updates the existing records in the index.
-
Delete documents (one at a time) – Delete one document at a time. Site Search deletes the record from the index. Reference the document by appending the value of the document’s id field to the end of the Push Endpoint URL, for example:
Syntax showing the push endpoint and the document ID:
https://subdomain.lucidworks.cloud/pathname/api/v1/push/endpoint/id
Example URL showing the push endpoint and the document ID:
https://my-corp.lucidworks.cloud/fusion-search/api/v1/push/push-endpoint-prod/Rec100
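Building the per-document URL can be sketched as follows. The `document_url` helper is hypothetical, and percent-encoding the id is an assumption for ids that contain characters such as spaces or slashes; this page does not say how such ids are handled, nor which HTTP method or authentication the request requires, so the sketch stops at constructing the URL.

```python
from urllib.parse import quote


def document_url(push_endpoint_url: str, doc_id: str) -> str:
    """Append a document id to a Push Endpoint URL.

    Assumption: the id is percent-encoded so that characters such as
    spaces or slashes do not change the URL path.
    """
    return push_endpoint_url.rstrip("/") + "/" + quote(doc_id, safe="")


endpoint = "https://my-corp.lucidworks.cloud/fusion-search/api/v1/push/push-endpoint-prod"
print(document_url(endpoint, "Rec100"))
# https://my-corp.lucidworks.cloud/fusion-search/api/v1/push/push-endpoint-prod/Rec100
```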