Crawl REST APIs
Our REST API connector lets you connect to over 800 datasources, including:
- Microsoft Dynamics 365
- SAP CRM
- Salesforce Data.com
- Atlassian
- Asana
- Google Drive
- Filestage
- OpenText
- And hundreds more…
For a complete list of supported datasources, see Lucidworks Fusion Connectors.
This article teaches you how to crawl REST API endpoints using Fusion. Before beginning, the following prerequisites must be met:
- All endpoints are available using bulk start links or a sitemap
- The response data is in a parseable format (JSON, XML, etc.)
Options
Using bulk start links
If you have a small number of endpoints you want to crawl, enter each endpoint as a bulk start link.
To crawl the API endpoints using bulk start links:
- Add a new Web connector datasource. To learn how to configure a new datasource, see Configure a New Datasource.
- Under Start links, enter the main domain that hosts the API endpoints. For example, http://www.restapiendpoint.com.
- In the Link discovery section under Bulk Start Links, enter the URLs you want to crawl, one per line. For example:
  http://www.restapiendpoint.com/?apikey=user-token&s=dark%20knight&type=movie&page=1
  http://www.restapiendpoint.com/?apikey=user-token&s=superman&type=movie&page=1
  http://www.restapiendpoint.com/?apikey=user-token&s=superman&type=movie&page=2
- Save and run the job.
- Once the job completes, check the results in the Index Workbench.
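When the endpoints differ only in their query parameters, the list for Bulk Start Links can be generated instead of typed by hand. The following is a minimal Python sketch and is not part of Fusion. The function name `build_start_links` is hypothetical, and the base URL, API key, and `type=movie` parameter are placeholders taken from the example URLs above.

```python
from urllib.parse import quote, urlencode

# Placeholder values mirroring the example URLs in this article (assumptions).
BASE_URL = "http://www.restapiendpoint.com/"
API_KEY = "user-token"


def build_start_links(searches, base_url=BASE_URL, api_key=API_KEY):
    """Return one URL per (search term, page) combination.

    `searches` is a list of (term, page_count) tuples.
    """
    links = []
    for term, page_count in searches:
        for page in range(1, page_count + 1):
            params = {"apikey": api_key, "s": term, "type": "movie", "page": page}
            # quote_via=quote encodes spaces as %20 instead of +
            links.append(base_url + "?" + urlencode(params, quote_via=quote))
    return links


if __name__ == "__main__":
    # Print one link per line, ready to paste into Bulk Start Links.
    print("\n".join(build_start_links([("dark knight", 1), ("superman", 2)])))
```

The output is one URL per line, which matches the format the Bulk Start Links field expects.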
Using a sitemap
If you have a large number of endpoints you want to crawl, use a sitemap containing the API endpoint locations. This is also helpful if someone without access to Fusion maintains the list of endpoint URLs. Because a sitemap is an XML file, special characters in URLs must be entity-escaped; for example, write & as &amp;. An example sitemap:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.restapiendpoint.com/?apikey=user-token&amp;s=dark%20knight&amp;type=movie&amp;page=1</loc>
    <lastmod>2021-07-21</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.restapiendpoint.com/?apikey=user-token&amp;s=superman&amp;type=movie&amp;page=1</loc>
    <lastmod>2021-07-21</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.restapiendpoint.com/?apikey=user-token&amp;s=superman&amp;type=movie&amp;page=2</loc>
    <lastmod>2021-07-21</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.restapiendpoint.com/?apikey=user-token&amp;s=superman&amp;type=movie&amp;page=3</loc>
    <lastmod>2021-07-21</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.restapiendpoint.com/?apikey=user-token&amp;s=batman&amp;type=movie&amp;page=1</loc>
    <lastmod>2021-07-21</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
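A sitemap like the one above can also be generated from a list of endpoint URLs. The following is a minimal Python sketch using the standard library and is not a Fusion tool. The function name `build_sitemap` and its default values for `lastmod`, `changefreq`, and `priority` are assumptions copied from the example. Building the file with an XML library means characters such as & in query strings are escaped automatically.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls, lastmod="2021-07-21", changefreq="daily", priority="0.8"):
    """Return a sitemap XML string with one <url> entry per endpoint URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for endpoint in urls:
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes & as &amp; when it serializes the text.
        ET.SubElement(url, "loc").text = endpoint
        ET.SubElement(url, "lastmod").text = lastmod
        ET.SubElement(url, "changefreq").text = changefreq
        ET.SubElement(url, "priority").text = priority
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    endpoints = [
        "http://www.restapiendpoint.com/?apikey=user-token&s=superman&type=movie&page=1",
        "http://www.restapiendpoint.com/?apikey=user-token&s=batman&type=movie&page=1",
    ]
    # Write the result to sitemap.xml so it can be hosted next to the API.
    with open("sitemap.xml", "w", encoding="utf-8") as handle:
        handle.write(build_sitemap(endpoints))
```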
To crawl the API endpoints using the sitemap:
- Add a new Web connector datasource. To learn how to configure a new datasource, see Configure a New Datasource.
- Under Start links, enter the main domain that contains the sitemap. For example, http://www.restapiendpoint.com.
- In the Link discovery section under Sitemap URLs, click the Add button.
- Enter the URL of the sitemap. For example, http://www.restapiendpoint.com/sitemap.xml.
- Save and run the job.
- Once the job completes, check the results in the Index Workbench.
Results
Both options above achieve the same result. Fusion indexes the JSON response provided at the endpoints. If the response contains an array of JSON objects, Fusion indexes each object as an individual document.
For example, Fusion creates three documents from the JSON response below:
{
  "Search": [{
    "Title": "Batman v Superman: Dawn of Justice",
    "Year": "2016",
    "imdbID": "tt2975590",
    "Type": "movie",
    "Poster": "https://m.media-amazon.com/images/M/MV5BYThjYzcyYzItNTVjNy00NDk0LTgwMWQtYjMwNmNlNWJhMzMyXkEyXkFqcGdeQXVyMTQxNzMzNDI@._V1_SX300.jpg"
  }, {
    "Title": "Superman Returns",
    "Year": "2006",
    "imdbID": "tt0348150",
    "Type": "movie",
    "Poster": "https://m.media-amazon.com/images/M/MV5BNzY2ZDQ2MTctYzlhOC00MWJhLTgxMmItMDgzNDQwMDdhOWI2XkEyXkFqcGdeQXVyNjc1NTYyMjg@._V1_SX300.jpg"
  }, {
    "Title": "Superman",
    "Year": "1978",
    "imdbID": "tt0078346",
    "Type": "movie",
    "Poster": "https://m.media-amazon.com/images/M/MV5BMzA0YWMwMTUtMTVhNC00NjRkLWE2ZTgtOWEzNjJhYzNiMTlkXkEyXkFqcGdeQXVyNjc1NTYyMjg@._V1_SX300.jpg"
  }],
  "totalResults": "3",
  "Response": "True"
}
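To preview how many documents a response is likely to produce before running a crawl, the split can be approximated outside Fusion. The following Python sketch only illustrates the idea and does not reproduce Fusion's parser. The function name `split_into_documents`, the `array_key` parameter, and the shortened sample response are assumptions for illustration.

```python
import json

# Shortened version of the response above (Poster fields omitted for brevity).
SAMPLE_RESPONSE = """
{
  "Search": [
    {"Title": "Batman v Superman: Dawn of Justice", "Year": "2016", "imdbID": "tt2975590", "Type": "movie"},
    {"Title": "Superman Returns", "Year": "2006", "imdbID": "tt0348150", "Type": "movie"},
    {"Title": "Superman", "Year": "1978", "imdbID": "tt0078346", "Type": "movie"}
  ],
  "totalResults": "3",
  "Response": "True"
}
"""


def split_into_documents(response_text, array_key="Search"):
    """Return one dict per object in the array stored under `array_key`.

    If the key is missing or does not hold a list, the whole payload is
    returned as a single document.
    """
    payload = json.loads(response_text)
    items = payload.get(array_key) if isinstance(payload, dict) else None
    if isinstance(items, list):
        return [item for item in items if isinstance(item, dict)]
    return [payload]


if __name__ == "__main__":
    for doc in split_into_documents(SAMPLE_RESPONSE):
        print(doc["imdbID"], doc["Title"])
```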