Index Data Sources

Note
If you haven’t read the introduction in the Site Search Guide, we recommend that you do.

A Site Search app indexes data sources and provides a user interface in which users can search and interact with search results. You can also embed code for Site Search modules in your websites.

One Site Search app can search one or more data sources of the same type or of different types.

Data source types

Site Search supports these types of data sources:

  • Web crawler – Index web pages on a website that is on the public Web.

  • CSV – Index a comma-separated values (CSV) file. The delimiter can be a comma or a tab.

  • JSON – Index a JSON file.

  • Push Endpoint – Index data sent to a push endpoint.

File size limits

Site Search data sources have the following file size limits. Site Search doesn’t index files larger than these limits.

Data source type    File size limit
Web crawler         50 MB
CSV                 50 MB
JSON                50 MB
Push Endpoint       5 MB
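The limits above can be checked before you upload or push anything. The following is a minimal sketch, not part of Site Search: the function names are hypothetical, and treating 1 MB as 1,000,000 bytes is an assumption, because this page doesn’t say whether MB means 10^6 or 2^20 bytes.

```python
import os

# Assumption: 1 MB = 1,000,000 bytes. The documentation does not define the unit.
BYTES_PER_MB = 1_000_000

# Limits copied from the table above.
FILE_SIZE_LIMITS_MB = {
    "Web crawler": 50,
    "CSV": 50,
    "JSON": 50,
    "Push Endpoint": 5,
}


def within_size_limit(size_bytes: int, source_type: str) -> bool:
    """Return True if size_bytes does not exceed the limit for source_type."""
    limit_bytes = FILE_SIZE_LIMITS_MB[source_type] * BYTES_PER_MB
    return size_bytes <= limit_bytes


def file_within_limit(path: str, source_type: str) -> bool:
    """Check a file on disk against the limit for its data source type."""
    return within_size_limit(os.path.getsize(path), source_type)
```

For example, file_within_limit("products.csv", "CSV") would tell you whether a hypothetical products.csv is small enough to be indexed.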

Map fields

Mapping fields is a step in configuring a data source. This section provides background information; you map fields when you configure each data source.

For each type of data source, Site Search uses fields with specific names to display parts of the data in the result list. Here are some examples:

Field          Data sources                     Description
name           All                              Element in the result list that serves as a title for the result
description    All                              Longer text content. To use the More Like This smart panel, your data source must have a description field, or you must map a different field to description.
url            Web crawler and Push Endpoint    A URL that the name links to
id             All                              An identifier for a specific record

Note
For a Push Endpoint data source, the id field must be a string field.

If the field names in your data sources differ from the names that Site Search uses, you must map each field that you want to appear in results from its native name to the corresponding Site Search name.

Other fields can appear in the results too, but they won’t get special treatment in result templates, such as a different font size or a link.
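Conceptually, a field mapping is a rename from native field names to the names that Site Search uses. The sketch below only illustrates that idea; in Site Search you configure the mapping in the app, and the native field names (title, body, sku) and the map_fields helper are hypothetical.

```python
def map_fields(record: dict, mapping: dict) -> dict:
    """Rename keys of record using mapping (native name -> Site Search name).

    Keys that are not in the mapping are kept unchanged.
    """
    return {mapping.get(key, key): value for key, value in record.items()}


# Hypothetical native record and mapping.
native = {"title": "Return policy", "body": "How to return an item", "sku": "A-100"}
mapping = {"title": "name", "body": "description", "sku": "id"}

mapped = map_fields(native, mapping)
# mapped now has the keys name, description, and id.
```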

Add and Configure Data Sources

Note
We assume that you (or someone else) have already created a Site Search app, and that you have opened the app. You must have the role Admin or Owner.

Add and configure the data sources that your app uses.

To add and configure data sources
  1. Open the Site Search menu – In the upper left corner of Site Search, click Edit.

  2. Add and configure data sources from which to index data – One search app can search one or more data sources of the same type or of different types.

    Add first data source

    Here are detailed steps for each of the data source types:

Note
You can add and configure additional data sources while other data sources are still indexing documents.

Change the Configuration of a Data Source

You might need to change the configuration of a data source, for example, if the source of data changes.

To change the configuration of a data source
  1. In the Site Search menu, click a data source.

  2. Use the tabbed wizard to configure the data source. For details, see Add and Configure Data Sources.

  3. Save your changes – Click the button at the bottom of the page (Save or Save and Index). In some cases, Site Search re-indexes the documents.

    Save saves your configuration changes. Save and Index saves your configuration changes and indexes (crawls) the data source. Don’t click Crawl Now in the upper-right corner; that will index the data source now, but without saving your changes.

Delete a Data Source

Delete a data source that you no longer want to supply data to your search app.

Deleting a data source removes the data source configuration in Site Search and the index that was built from the data source. It doesn’t delete the files, web pages, and so on that Site Search crawled.

To delete a data source
  1. In the Site Search menu, click the data source you want to delete.

  2. In the right pane, scroll to the bottom and click Delete data source.

  3. Click Yes, Delete to confirm that you want to delete the data source.

Data sources and change

Users who search using Site Search modules or the Site Search app expect search results to be current. Here we explain how to ensure this for different types of data sources.

Web crawler data sources

Site Search periodically re-indexes Web crawler data sources so that search results are current:

  • For trial apps, the re-indexing frequency is every 24 hours.

  • For licensed apps, your contract determines the re-indexing frequency.

  • You can also re-index a Web crawler data source on demand, for example, when new documents must be indexed urgently.

CSV and JSON data sources

Important
If you plan to upload data from a CSV or JSON file more than one time, and you want Site Search to handle the changes, then you must have an id field in the file, or map a field in the file to id. Doing so lets Site Search identify records. If you don’t do this, uploading the file again just adds all of the records again (unless you delete and recreate the data source first).

The id field must be a string field.
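Before uploading, it can help to know whether a file has an id field and whether its IDs are unique, because that determines which of the cases below applies. This is a minimal sketch for CSV content using Python’s standard csv module; the inspect_ids helper and the sample data are hypothetical.

```python
import csv
import io
from collections import Counter


def inspect_ids(csv_text: str, id_field: str = "id", delimiter: str = ","):
    """Return (has_id_field, duplicate_ids) for CSV content.

    Pass delimiter="\t" for tab-delimited files. Rows with an empty or
    missing ID value are skipped when counting.
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    if id_field not in (reader.fieldnames or []):
        return False, []
    counts = Counter(row[id_field] for row in reader if row[id_field])
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return True, duplicates


sample = "id,name\nRec100,Alpha\nRec101,Beta\nRec100,Gamma\n"
print(inspect_ids(sample))  # (True, ['Rec100'])
```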

Here, we describe how records are handled during the first file upload (when you create a data source) and during subsequent file uploads.

Records with unique IDs

For CSV and JSON data sources whose records have unique IDs (in an id field or in a field mapped to id), Site Search behaves as follows:

On the first upload, all records get entries in the index.

For subsequent uploads of the same file:

  • Added records – Site Search adds entries for the new records to the index.

  • Updated records – Site Search updates the existing entries in the index.

  • Deleted records – Site Search doesn’t delete entries from the index for the deleted records. The entries will still appear in search results. See Manage record deletions in CSV and JSON files for strategies to manage record deletions from CSV and JSON files.
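The behavior described above can be modeled as an upsert keyed on id: new IDs are added, existing IDs are overwritten, and nothing is removed. This is only an illustrative model under the assumption that IDs within each upload are unique; it is not how Site Search is implemented, and apply_upload is a hypothetical name.

```python
def apply_upload(index: dict, records: list) -> dict:
    """Model a file upload: add new IDs, update existing IDs, never delete."""
    updated = dict(index)
    for record in records:
        updated[record["id"]] = record
    return updated


first_upload = [{"id": "1", "name": "Alpha"}, {"id": "2", "name": "Beta"}]
# In the second version of the file, record 1 changed, record 2 was
# deleted, and record 3 was added.
second_upload = [{"id": "1", "name": "Alpha v2"}, {"id": "3", "name": "Gamma"}]

index = apply_upload({}, first_upload)
index = apply_upload(index, second_upload)
# Record 2 is still in the index even though it is no longer in the file.
```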

Records with nonunique IDs or no IDs; retain all records

Proceed as follows to index a CSV or JSON file that contains records with nonunique IDs (possibly multiple records for the same ID) or no ID field, when the goal is to retain all records:

  • With no field named id – If the file doesn’t contain a field named id, don’t map a field to id.

  • With id field – If the file contains a field named id, map the field to some other name.

In both cases, upload the file a single time (when you create the data source). Site Search indexes all records. Don’t upload the file again. If records in the file change and you want to update entries in the index, delete and recreate the data source.

Records with nonunique IDs; retain only a single record

Important
Index updates for records with nonunique IDs take the first record during both the initial file upload and subsequent uploads of the same file.

Proceed as follows to index a CSV or JSON file that contains records with nonunique IDs (possibly multiple records for the same ID), when the goal is to retain a single record:

  • With ID field not named id – If the file doesn’t contain a field named id, map the ID field to id.

  • With id field – No action is necessary.

In both cases, you can upload the file multiple times.
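To preview which records would be retained under the first-record rule stated above, you can deduplicate the file locally. This is a minimal sketch of "first record per ID wins"; first_record_per_id is a hypothetical helper, not a Site Search function.

```python
def first_record_per_id(records: list, id_field: str = "id") -> list:
    """Keep only the first record seen for each ID, in file order."""
    kept = {}
    for record in records:
        # setdefault stores the record only if the ID has not been seen yet.
        kept.setdefault(record[id_field], record)
    return list(kept.values())


records = [
    {"id": "Rec100", "name": "Alpha"},
    {"id": "Rec101", "name": "Beta"},
    {"id": "Rec100", "name": "Gamma"},  # same ID as the first record
]
retained = first_record_per_id(records)
# retained contains the Alpha and Beta records; Gamma is dropped.
```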

Manage record deletions in CSV and JSON files

To ensure that previously uploaded records that are no longer in files don’t appear in search results, you can:

  • Block documents – You can search for and block documents that have been deleted from the source files.

  • Delete and recreate the data source – To remove all deleted documents en masse, delete and recreate that data source. With this approach, search results for the data source are briefly unavailable. Changing the source file to a different file isn’t sufficient.

Push Endpoint data sources

For Push Endpoint data sources, you manage change as follows:

  • Add documents (singly or in batches) – Push documents with new id fields. Site Search adds the new records to the index.

  • Update documents (singly or in batches) – Push documents with existing id fields. Site Search updates the existing records in the index.

  • Delete documents (one at a time) – Delete one document at a time. Site Search deletes the record from the index. Reference the document by appending the value of its id field to the end of the Push Endpoint URL, for example:

    Syntax showing the push endpoint and the document ID:

    https://subdomain.lucidworks.cloud/pathname/api/v1/push/endpoint/id

    Example URL showing the push endpoint and the document ID:

    https://my-corp.lucidworks.cloud/fusion-search/api/v1/push/push-endpoint-prod/Rec100
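The URL for a single document can be built programmatically from the Push Endpoint URL and the document ID. The sketch below rests on assumptions that this page doesn’t confirm: that IDs with special characters should be percent-encoded, and that the deletion is sent with the HTTP DELETE method. It doesn’t cover authentication, and it builds the request without sending it. The document_url helper is hypothetical.

```python
import urllib.request
from urllib.parse import quote


def document_url(push_endpoint_url: str, doc_id: str) -> str:
    """Append a document ID to a Push Endpoint URL.

    Assumption: IDs are percent-encoded so that characters such as "/"
    or spaces don't alter the URL path.
    """
    return push_endpoint_url.rstrip("/") + "/" + quote(doc_id, safe="")


endpoint = "https://my-corp.lucidworks.cloud/fusion-search/api/v1/push/push-endpoint-prod"
url = document_url(endpoint, "Rec100")

# Assumption: the delete operation uses the HTTP DELETE method.
# The request is only constructed here, not sent.
request = urllib.request.Request(url, method="DELETE")
```

With the example endpoint and the ID Rec100, url matches the example URL shown above.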