JSON guide
REST V2 recipes are JSON files sent to Fusion as the body of cURL calls; their contents then show up in the UI. From there, Fusion can use the APIs created by external software companies to crawl the content stored in those datasources. Because the connector's behavior is defined in JSON, the same REST V2 connector can work with many different products.

This article breaks down the different parts of a recipe’s JSON file and how they are used when creating a Fusion datasource. The structure and parameters vary between recipes, but you can use this guide as a reference for parameters that appear in some of them.

A parent request, also referred to as a root request, is set using the path for an endpoint. Child requests drill further down into an endpoint and can be looped through to add more information to the documents in the index. This is useful when the content being crawled has additional objects that are not picked up by the parent request alone, for example, comments on pages.

A cURL call to the REST V2 connector includes headers that set the content type and the authorization to Fusion. Calls for the REST V2 connector go through the Connector Datasources API. The headers in the call define the content type as JSON and carry your Lucidworks login details. In the call, replace:

- FUSION_HOST:FUSION_PORT with your Fusion address.
- AUTHORIZATION_CREDENTIALS with your Lucidworks login information in Base64.
- JSON_RECIPE with the preconfigured recipe obtained from GitHub, making sure to update any placeholders found in that recipe.
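
As a rough illustration of the call described above, the following Python sketch builds (but does not send) an equivalent POST request. Several details are assumptions rather than facts taken from this guide: the function name `build_datasource_request` is hypothetical, the `Basic` authorization scheme is inferred from the Base64 login information, and the API path passed in the example is only a guess that should be checked against the Connector Datasources API reference.

```python
import base64
import json
import urllib.request


def build_datasource_request(fusion_address, api_path, username, password, recipe):
    """Build (but do not send) a POST request carrying a recipe as its JSON body."""
    # Assumption: Basic authorization, i.e. "user:password" encoded as Base64.
    credentials = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"https://{fusion_address}{api_path}",
        data=json.dumps(recipe).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
        method="POST",
    )


# Placeholder values only; the path below is an assumed example, not a confirmed endpoint.
request = build_datasource_request(
    fusion_address="fusion.example.com:6764",
    api_path="/api/apps/my-app/connectors/datasources",
    username="user",
    password="pass",
    recipe={"id": "my-datasource"},
)
```

Building the request object separately from sending it makes it easy to inspect the headers and body before anything reaches Fusion.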
General parameters
The JSON recipe contains connector information with the general parameters used to handle the service. This table contains a selection of parameters that you might see.

Parameter | Description |
---|---|
parserId | The parser ID is the name of the parser as set up in the Fusion UI under Indexing > Parsers. |
id | This populates the Configuration ID for the connector in the Fusion UI. You can name this whatever you want as long as the name does not already exist as another Configuration ID in your Fusion instance. |
Properties
The properties section of the JSON file contains the bulk of the information being sent and includes the API base URL, authentication mode, service endpoints, HTTP method, query parameters, pagination settings, and any loop configurations.

API base URL
Parameter | Description |
---|---|
serviceURL | REST API base URL for the external service from where the data is extracted. Endpoints for the API call are added in the service endpoints section described below. Be sure to add additional levels of security for any content you do not want indexed, otherwise the connector will include all of the content it finds through that URL. |
Authentication mode
The authentication mode in the JSON request body defines how to authenticate with the external API. This is the login information for the external service. For example, if indexing content from Confluence, you would use this section to include your Confluence login details.

Parameter | Description |
---|---|
basicAuth | Uses password and user properties for authentication. Depending on the API used to connect to resources outside of Lucidworks, it may require an API Key to authenticate. In this case, enter the username and replace the password with the API Key. |
oAuth | Allows for fetching an authentication token used to authorize the request for the service endpoints crawl. See how to authenticate using OAuth. |
Service endpoints and list of requests configuration
The service endpoints section (serviceEndpoints), known as list of requests configuration in some recipes (requestConfigurations), specifies the API endpoint paths appended to the base URL (serviceURL) used to crawl a datasource. Query parameters will not work if added directly in the endpoints, so be sure to include any query parameters using queryKey and queryValue fields, as the query fields are mapped to Solr using a queryKey and populated with results from queryValue.
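
Because query parameters must be listed separately from the endpoint, it can help to split an existing URL into its path and its parameters before writing the recipe. The sketch below is a minimal illustration using Python's standard library; the helper name `split_endpoint_query` is hypothetical, and the output simply mirrors the queryKey and queryValue pairing described above rather than a confirmed recipe schema.

```python
from urllib.parse import parse_qsl, urlsplit


def split_endpoint_query(endpoint_with_query):
    """Separate an endpoint path from its query string.

    Returns the bare path plus a list of queryKey/queryValue pairs.
    """
    parts = urlsplit(endpoint_with_query)
    queries = [
        {"queryKey": key, "queryValue": value}
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return parts.path, queries


endpoint, queries = split_endpoint_query("/rest/api/content?limit=20&expand=body")
```

The bare path would go in the endpoint field, and each pair would become one entry in the request's query list.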
Parent (root) requests
Parent requests target an API endpoint to crawl content. These endpoints are higher in the structure than child requests (described below), which are used to crawl objects embedded at a deeper level to add more content to a document being indexed.

Parameter | Description |
---|---|
endpoint | The endpoint to append to the cURL location base URL path, for example /rest/api/content. |
httpMethod | HTTP method to use for the request, for example GET or POST. |
queryKey | The name of the field as it will appear in the Solr documents in the index. |
queryValue | The name of the field being queried in the datasource. Used in Solr documents to populate the value of the field entered in queryKey. |
Pagination
Pagination has two options: pagination by next page URL and pagination by batch size. For pagination by next page URL, the URL that starts the next page is sent by the request. For pagination by batch size, you can configure pagination in the query parameters by indicating the start number of the index and the batch size.

Parameter | Description |
---|---|
paginationKey | Key that contains the nextPageUrl in the response. If the key is nested, use dot notation, for example list.nextPageUrl. |
batchSize | Number of objects to retrieve per page, for example "batchSize": 20. This value must be indicated in the parent (root) query parameters from the data request. An example of such a query parameter is {"queryKey": "limit", "queryValue": "${LW_BATCH_SIZE}"}. All parent objects that are located are then indexed as Solr documents, and the batch size sets a limit and determines how many of those documents are displayed in the results at a time. Pagination stops automatically when a response object returns empty, indicating that the end has been reached. |
indexStart | Index from where to start pagination. The default is "indexStart": 0. This value must be indicated in the parent (root) query parameters from the data request. An example of such a query parameter is {"queryKey": "start", "queryValue": "${LW_INDEX_START}"}. |
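
The two pagination styles can be pictured with a small Python sketch. It is not the connector's implementation: the function names are hypothetical, `fake_fetch` stands in for a remote service, and advancing the start index by the batch size on every page is an assumption. The first function stops when a page comes back empty, as described for batch-size pagination, and the second resolves a dot-notation key such as list.nextPageUrl.

```python
def paginate_by_batch(fetch_page, index_start=0, batch_size=20):
    """Collect objects page by page until an empty page is returned."""
    collected = []
    start = index_start
    while True:
        page = fetch_page(start, batch_size)
        if not page:
            # An empty response marks the end of the data.
            break
        collected.extend(page)
        start += batch_size
    return collected


def resolve_dotted_key(response, dotted_key):
    """Look up a nested key such as 'list.nextPageUrl'; return None if absent."""
    value = response
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


# Stand-in for a remote service that holds 45 objects.
SOURCE = [{"id": n} for n in range(45)]


def fake_fetch(start, limit):
    return SOURCE[start:start + limit]


objects = paginate_by_batch(fake_fetch, index_start=0, batch_size=20)
next_url = resolve_dotted_key({"list": {"nextPageUrl": "/page/2"}}, "list.nextPageUrl")
```

With a batch size of 20 and 45 objects, the loop issues four requests: two full pages, one partial page, and a final empty page that ends the crawl.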
Root response mapping
Root response mapping is used to separate the parent objects being crawled into individual Solr documents by assigning each document a unique ID. The data obtained from child requests is added to the same document by association with this parent ID. This is also the area where you can choose to index content other than text by enabling binaryResponse. For attachments, ensure Send as Binary Response is enabled. If it is not, no attachments are received or indexed. When enabled, the connector looks for attachments with a MIME type other than JSON to index. For a JSON response, ensure Send as Binary Response is not enabled.
Parameter | Description |
---|---|
dataId | Name of the field in the data objects extracted with dataPath used to create the unique ID for Solr documents. If not provided, a random UUID will be used. This property also accepts JSONPath expressions. |
dataPath | The name of a specific data object from a datasource that is returned within a response. For example, in order to extract a list of elements named objects in the datasource, the dataPath would be objects, with each element indexed as a separate Solr document. If not provided or left blank as "", the entire response body will be indexed as a single Solr document. This property also accepts JSONPath expressions, for example, objects[] or $.objects[]. |
binaryResponse | Set to true for indexing content other than text, for example images and attachments. This selects the Send as Binary Response checkbox in the Fusion UI. If true, the response will be sent as binary data to Fusion, properties dataId and dataPath will be ignored, and pagination will not be performed. |
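
To make the roles of dataPath and dataId concrete, here is a minimal Python sketch of how a response might be split into documents. It is an illustration under stated assumptions, not the connector's code: the function name and the shape of the returned documents are hypothetical, and only plain key names are handled, not JSONPath expressions.

```python
import uuid


def split_into_documents(response, data_path="", data_id=None):
    """Split a parsed JSON response into one document per data object."""
    if not data_path:
        # No dataPath: the whole response body becomes a single document.
        objects = [response]
    else:
        objects = response.get(data_path, [])

    documents = []
    for obj in objects:
        doc_id = obj.get(data_id) if data_id else None
        if doc_id is None:
            # No dataId (or the field is missing): fall back to a random UUID.
            doc_id = uuid.uuid4()
        documents.append({"id": str(doc_id), "fields": obj})
    return documents


response = {"objects": [{"key": "A1", "title": "First"}, {"key": "B2", "title": "Second"}]}
documents = split_into_documents(response, data_path="objects", data_id="key")
whole_body = split_into_documents(response)
```

In the first call each element of objects becomes its own document keyed by its key field; in the second call the entire body is kept as one document with a generated ID.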
Loops using child requests
Loops, also known as child requests, contain an array of queries to extract more information from a datasource for the documents being indexed. The loop iterates over the data request for each parent ID and associates the response with the parent. This is useful when the parent endpoint has additional endpoints that can be appended to reach data further down the endpoint path. Loops perform a separate request for each data object.

The REST V2 connector supports hierarchical discovery: when content is located, it is recursively checked for additional content associated with it for the child request, and the connector continues collecting information for each request until no more content is located. For example, if the connector is crawling for comments and attachments, it checks each of those items for any comments and attachments connected to them. If any are found, it checks for comments and attachments associated with those, and continues until all relevant content is collected. This is also useful when the connector is searching through folders with multiple levels of subfolders.

Parameter | Description |
---|---|
endpoint | The API endpoint to append to the cURL location base URL path, for example /rest/api/content/${LW_PARENT_DATA_KEY}/child/comment. |
httpMethod | HTTP method to use for the request. GET and POST are supported. |
queries | Contains the array of queries to use within the request definition, each with a queryKey and queryValue pair. |
queryKey | The name of the field as it will appear in the Solr documents in the index. |
queryValue | The name of the field being queried in the datasource. Used in Solr documents to populate the value of the field entered in queryKey. |
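
The looping and hierarchical discovery described above can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, `FAKE_SERVICE` replaces real HTTP requests, and the assumption that each child object exposes its own ID under an id field is made purely for the example. The placeholder syntax ${LW_PARENT_DATA_KEY} happens to be compatible with Python's `string.Template`, which is used here for substitution.

```python
from string import Template


def child_endpoint(template, parent_id):
    """Fill the ${LW_PARENT_DATA_KEY} placeholder with a parent ID."""
    return Template(template).substitute(LW_PARENT_DATA_KEY=parent_id)


def collect_children(parent_id, fetch_children, template):
    """Recursively gather child objects until no more content is found."""
    collected = []
    for child in fetch_children(child_endpoint(template, parent_id)):
        collected.append(child)
        # Each child is checked in turn for children of its own.
        collected.extend(collect_children(child["id"], fetch_children, template))
    return collected


TEMPLATE = "/rest/api/content/${LW_PARENT_DATA_KEY}/child/comment"

# Stand-in for the external service: endpoint -> objects returned.
FAKE_SERVICE = {
    "/rest/api/content/1/child/comment": [{"id": "2"}, {"id": "3"}],
    "/rest/api/content/2/child/comment": [{"id": "4"}],
}


def fake_fetch(endpoint):
    return FAKE_SERVICE.get(endpoint, [])


comments = collect_children("1", fake_fetch, TEMPLATE)
```

Here one request is issued per object, so a parent with two comments, one of which has a reply, results in four requests in total, the last three of which confirm that nothing further exists.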
Child response mapping
The child responses are mapped to the parent through the dataId and dataPath in the root response mapping described earlier through the use of a parentIdKey. The parentIdKey should match the dataId in the root response mapping.
For example, with root response mapping:
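
The rule that parentIdKey should match dataId can be checked mechanically before a recipe is sent. The Python sketch below is a hedged illustration: the dictionary layout, the field values, and the function name `find_mismatched_children` are all hypothetical and do not reproduce the structure of any particular recipe.

```python
def find_mismatched_children(root_mapping, child_requests):
    """Return the endpoints of child requests whose parentIdKey differs from dataId."""
    expected = root_mapping.get("dataId")
    return [
        child.get("endpoint")
        for child in child_requests
        if child.get("parentIdKey") != expected
    ]


# Hypothetical mapping values, for illustration only.
root_mapping = {"dataId": "id", "dataPath": "results"}
child_requests = [
    {"endpoint": "/rest/api/content/${LW_PARENT_DATA_KEY}/child/comment", "parentIdKey": "id"},
    {"endpoint": "/rest/api/content/${LW_PARENT_DATA_KEY}/child/attachment", "parentIdKey": "key"},
]

mismatched = find_mismatched_children(root_mapping, child_requests)
```

An empty result means every child request refers to the same field that the root response mapping uses as the document ID.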
Other mappings
Additional mapping configures the data objects. Recipes do not necessarily include all parameters described here.

Parameter | Description |
---|---|
idKey | This creates the ID Key in the Data Object Mapping section of the UI as the Solr document ID. Fill this property when Destination Key is empty. If neither idKey nor destinationKey is specified, the document’s ID will be automatically assigned as a random UUID. When using some recipes with multiple endpoints, documents run the risk of being assigned the same idKey value, which can cause missing documents when indexing. To avoid this, set the idKey to self instead of id. |
objectKey (optional) | The key from the data object entry. This value is used to perform the additional requests. It is mapped to the variable ${LW_PARAM_KEY}, which should be referenced in the additional data request configuration (endpoint, query parameters, or body). Endpoint example: /api/path/${LW_PARAM_KEY}/additionalInfo. Query parameter example: queryValue=${LW_PARAM_KEY}. |
accessKey (optional) | The key to access the data objects in the response. If not set, the response is assumed to be the whole response body. |
destinationKey (optional) | The key used to store the additional data objects in the main data objects. If not set, the additional data objects will be indexed as individual Solr documents. |
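
The effect of destinationKey described in the table can be pictured with a short Python sketch. It is a simplified model rather than the connector's behavior in full: the function name and the sample objects are hypothetical, and ID assignment is left out.

```python
def merge_additional_data(main_object, additional_objects, destination_key=None):
    """Return the documents produced from one main object and its additional data."""
    if destination_key:
        # With a destinationKey, additional objects are stored inside the main object.
        merged = dict(main_object)
        merged[destination_key] = list(additional_objects)
        return [merged]
    # Without one, each additional object becomes its own document.
    return [dict(main_object)] + [dict(obj) for obj in additional_objects]


page = {"id": "10", "title": "Release notes"}
comments = [{"id": "11", "text": "Looks good"}, {"id": "12", "text": "Typo in step 2"}]

nested = merge_additional_data(page, comments, destination_key="comments")
separate = merge_additional_data(page, comments)
```

The nested form yields a single document that carries its comments, while the separate form yields three documents.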
App settings
The rest of the JSON can include settings for the collection name, pipeline name, and connector type.

Parameter | Description |
---|---|
collection | The name of the app used in the Fusion UI. This must match the name of the app, or the connector will not show up in datasources. |
pipeline | The name for the Pipeline ID used. For example, rest_connector. |
connector | The value for the connector in the Datasources API. For the REST V2 connector, this will be lucidworks.rest. |
Advanced settings in the UI
Within Fusion, opening the datasource and enabling the Advanced toggle displays optional settings to be applied. Under Core Properties > Fetch Settings you can modify the settings to help control the speed at which the connector crawls the source. For example, increasing Fetch Threads might increase the crawl speed. Setting timeout limits can be useful to end a crawl when something is causing the crawl to get hung up.

How to get a recipe into Fusion
This section shows how to get a recipe from GitHub into Fusion. Recipes are JSON files used as a quick method to create a Fusion datasource.

- Open the REST V2 connector public GitHub repository.
- Locate the recipe you want and open the file.
- Copy the JSON.
- Add the JSON as the body of a cURL call to the Connector Datasources API. In the call, replace:
  - FUSION_HOST:FUSION_PORT with the URL of your Fusion instance.
  - APP_NAME with the name of the app you are using in Fusion.
  - AUTHORIZATION_CREDENTIALS with your Lucidworks login information in Base64.
  - JSON_RECIPE with the recipe you copied from GitHub.
- Change the following in the JSON:
  - id: This populates the Configuration ID for the connector in the Fusion UI and sets the name of the datasource. You can keep the default or choose a different name to fit your needs.
  - serviceURL: API base URL from where the data is extracted. Change this to match your own datasource URL.
  - password and user: Change these values to your login information.
  - collection: This must match the name of the Fusion app that you are using.
  - Any other values marked for replacement, such as parserId or pipeline.
- Once values are changed, send the API request.
- Log into Fusion in a web browser and open the app associated with the request.
- Go to Indexing > Datasources and select the REST V2 connector in the list with the id as the name of the datasource.
- Make any additional changes within the UI.
- If indexing content other than text, for example images and attachments, select Send as Binary Response.
- Save the datasource.
- After it saves, you can click Run > Start to begin indexing.
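
The value changes listed in the steps above can also be scripted. The following Python sketch is a minimal example under loud assumptions: the function name `customize_recipe` is hypothetical, and the nesting shown here (id and collection at the top level, serviceURL under properties, user and password under properties.basicAuth) is a guess for illustration. Check the actual recipe you copied, since its layout may differ.

```python
import copy


def customize_recipe(recipe, datasource_id, service_url, user, password, collection):
    """Return a copy of a recipe with the site-specific values filled in."""
    updated = copy.deepcopy(recipe)  # leave the original recipe untouched
    updated["id"] = datasource_id
    updated["collection"] = collection

    # Assumed layout: connection details live under "properties".
    properties = updated.setdefault("properties", {})
    properties["serviceURL"] = service_url
    basic_auth = properties.setdefault("basicAuth", {})
    basic_auth["user"] = user
    basic_auth["password"] = password
    return updated


# Hypothetical recipe skeleton with placeholder values.
template = {
    "id": "rest-v2-example",
    "connector": "lucidworks.rest",
    "collection": "COLLECTION_NAME",
    "properties": {"serviceURL": "https://SERVICE_URL", "basicAuth": {"user": "", "password": ""}},
}

recipe = customize_recipe(
    template,
    datasource_id="my-confluence",
    service_url="https://wiki.example.com",
    user="crawler",
    password="api-key-or-password",
    collection="my-app",
)
```

Working on a deep copy keeps the downloaded recipe reusable for other datasources.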