The GitHub recipe retrieves data from a single GitHub repository via the GitHub REST API. You can view the configuration details and JSON recipe at the public REST configuration repository on GitHub in addition to this page.
This recipe uses hierarchical requests and requires version 1.1.0 or later of the REST V2 connector.
GitHub REST configuration
The JSON template file github-repo.json crawls a single specific repository via /repos/{owner}/{repository}. To crawl multiple repositories, create one datasource per repository.
The GitHub REST configuration indexes each GitHub object listed below as a separate Solr document:
- Repositories
- Issues
- Pull requests
- Branches
- Commits (per-branch via BRANCH parent)
- Commit diffs (per-file change details for each commit)
- Tags
- Milestones
- Collaborators
- Releases
- Comments (issues and PR comments)
- Commit comments
- Content (root directory listing)
- Folders (recursive directory traversal)
- Blobs (file content via Contents API binary parsing)
The configuration targets GitHub REST API version 2022-11-28. See API Versions in the GitHub documentation for details.
The configuration was tested with GitHub Cloud. For GitHub Enterprise, update the serviceURL property to point to the enterprise instance API URL, such as https://github.example.com/api/v3.
Authentication methods
The GitHub REST recipe supports basic authentication using the GitHub username and a Personal Access Token (PAT) as the password. For more information, see Managing your personal access tokens in the GitHub documentation.
Classic personal access token (PAT)
For public repositories only, most endpoints require no additional scopes. However, crawling private repositories or organization-private repositories requires the following scopes:
- repo: Full control of private repositories (grants access to all repository data).
- read:org: Read organization membership (required for organization-private repositories and the collaborators endpoint).

Scopes to select by repository visibility:
- Public repositories only: public_repo and read:org
- Private repositories: repo and read:org
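
As a rough illustration of what basic authentication with a username and PAT amounts to, the following sketch builds a standard HTTP Basic `Authorization` header. It assumes plain HTTP Basic encoding; the function name and the credentials are hypothetical placeholders, and this is not the connector's own code.

```python
import base64


def basic_auth_header(username: str, token: str) -> dict:
    """Build an HTTP Basic Authorization header from a username and a PAT."""
    # Basic auth joins user and password with a colon, then base64-encodes it.
    credentials = f"{username}:{token}".encode("utf-8")
    encoded = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {encoded}"}


# Placeholder credentials for illustration only (not real values).
headers = basic_auth_header("example-user", "example-token")
```

In the recipe itself, the same two values go into the user and password properties of the datasource.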
Fine-grained personal access token
The GitHub recipe supports authentication through fine-grained personal access tokens. Set the token’s repository access to target the desired repositories, then grant the following read-only permissions:
- Metadata (read): Required for repository listing, collaborators, tags, and commit comments.
- Contents (read): Required for commits, branches, and releases.
- Issues (read): Required for issues, milestones, and issue comments.
- Pull requests (read): Required for pull requests and PR review comments.
Permissions by endpoint
The following table shows the exact permissions required per endpoint for each token type:
| Endpoint | Fine-Grained Permission | Classic PAT (Public Repos) | Classic PAT (Private Repos) |
|---|---|---|---|
| /repos/{owner}/{repo} | Metadata: read | No scope needed | repo |
| /repos/{o}/{r}/issues | Issues: read | No scope needed | repo |
| /repos/{o}/{r}/pulls | Pull requests: read | No scope needed | repo |
| /repos/{o}/{r}/commits | Contents: read | No scope needed | repo |
| /repos/{o}/{r}/commits/{sha} | Contents: read | No scope needed | repo |
| /repos/{o}/{r}/branches | Contents: read | No scope needed | repo |
| /repos/{o}/{r}/tags | Metadata: read | No scope needed | repo |
| /repos/{o}/{r}/milestones | Issues: read | No scope needed | repo |
| /repos/{o}/{r}/collaborators | Metadata: read | repo + read:org | repo + read:org |
| /repos/{o}/{r}/releases | Contents: read | No scope needed | repo |
| /repos/{o}/{r}/issues/comments | Issues: read | No scope needed | repo |
| /repos/{o}/{r}/comments | Metadata: read | No scope needed | repo |
| /repos/{o}/{r}/contents/{path} | Contents: read | No scope needed | repo |
In addition to the scopes listed in the preceding table, the authenticated user must have push (write), maintain, or admin access to the repository to use the /repos/{o}/{r}/collaborators endpoint. Without this level of access, the endpoint returns an HTTP 403 error regardless of token scopes. If the crawl account does not have write access to all repositories, consider removing the collaborator request configuration from the JSON recipe to avoid 403 errors.
Draft releases are only visible to users with push (write) access to the repository.
Supported crawl options
For a full crawl, all the content from the source is fetched. For a re-crawl, all the content from the source is retrieved as if it were a full crawl. Orphan objects (objects deleted in the GitHub source and therefore not retrieved by the current crawl) are deleted from the index using stray content deletion, which runs after a crawl finishes.
Rate limiting
GitHub enforces a rate limit of 5,000 authenticated requests per hour. Retry properties (retryCount, maxDelayTime) can be configured under the datasource’s retry settings to provide resilience against transient errors and rate limit responses (HTTP 403 or 429), though the default values are typically sufficient.
Unauthenticated requests are limited to 60 requests per hour. Always use an authenticated token to avoid rate limiting during crawls.
For repositories with large amounts of data, consider the total request volume: each repository triggers up to 15 child requests (one per entity type), and each child request paginates at 100 items per page. Additionally, COMMIT crawls per-branch (child of BRANCH), so the total commit API calls scale with the number of branches.
For GitHub Enterprise instances, rate limits may differ. Consult your administrator.
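
To get a feel for the request volume described above, the following sketch estimates how many API calls the per-branch COMMIT crawl needs. It assumes each paginated request costs one call per page of up to 100 items plus one final call that returns the empty array; the function names and the branch sizes are hypothetical.

```python
import math

PER_PAGE = 100       # matches the per_page=100 query parameter
HOURLY_LIMIT = 5000  # authenticated requests per hour


def paginated_calls(item_count: int, per_page: int = PER_PAGE) -> int:
    """Calls for one paginated request: data pages plus the final empty page."""
    return math.ceil(item_count / per_page) + 1


def commit_calls(commits_per_branch: dict[str, int]) -> int:
    """COMMIT is a child of BRANCH, so pagination repeats for every branch."""
    return sum(paginated_calls(count) for count in commits_per_branch.values())


# Hypothetical repository: two branches with 250 and 40 reachable commits.
estimate = commit_calls({"main": 250, "dev": 40})
within_budget = estimate <= HOURLY_LIMIT
```

The estimate ignores the COMMIT_DIFF child, which adds one more call per commit on each branch.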
The CONTENT, FOLDER, and BLOB requests use the GitHub Contents API (/repos/{owner}/{repo}/contents/{path}) to crawl repository file content. Each directory level requires a separate API call (one request per directory). File downloads via /contents/{path} with the Accept: application/vnd.github.raw+json header count against the API rate limit. For large repositories with many directories and files, this can consume the 5,000 requests/hour rate limit. Consider removing the CONTENT, FOLDER, and BLOB request configurations from the JSON if file content indexing is not needed.
Pagination setup
Pagination by Batch Size is configured per child request that returns paginated arrays, using a page-number approach. The following child requests use pagination: ISSUE, PULL_REQUEST, COMMIT, BRANCH, TAG, MILESTONE, COLLABORATOR, RELEASE, COMMENT, COMMIT_COMMENT. The GitHub REST API uses page-based pagination with page and per_page query parameters. The API returns bare JSON arrays. When there are no more results, an empty array [] is returned.
Configure the pagination by batch size properties
- IndexStart: 1: GitHub pages are 1-indexed. The first page is page 1.
- BatchSize: 1: Used to increment the page number by 1 each iteration. The ${LW_INDEX_START} variable produces values 1, 2, 3, etc.
- Stop Condition Key: $: References the root response (bare JSON array).
- Stop Condition Value: []: Pagination stops when the response is an empty array.
Configure query parameters
- per_page=100: The maximum number of items GitHub returns per page.
- page=${LW_INDEX_START}: The current page number, auto-incremented by the connector.
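
The page-number loop that these settings describe can be sketched as follows. This is a simplified model, not the connector's implementation: `crawl_pages` and `fake_fetch` are hypothetical names, and the fake fetcher stands in for a GitHub endpoint that returns bare JSON arrays.

```python
from typing import Callable, Iterator


def crawl_pages(fetch_page: Callable[[int, int], list],
                index_start: int = 1,
                batch_size: int = 1,
                per_page: int = 100) -> Iterator[dict]:
    """Yield items page by page until the response is an empty array."""
    page = index_start  # plays the role of ${LW_INDEX_START}
    while True:
        items = fetch_page(page, per_page)
        if items == []:  # stop condition: root response equals []
            break
        yield from items
        page += batch_size  # batchSize=1 produces pages 1, 2, 3, ...


# Stand-in data source: 250 fake issues served 100 at a time.
FAKE_ISSUES = [{"number": n} for n in range(1, 251)]
requested_pages = []


def fake_fetch(page: int, per_page: int) -> list:
    requested_pages.append(page)
    start = (page - 1) * per_page
    return FAKE_ISSUES[start:start + per_page]


issues = list(crawl_pages(fake_fetch))
```

With 250 items, the loop requests pages 1 through 4; the fourth response is empty and ends the crawl for that request.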
The batchSize=1 setting is a technique to generate sequential page numbers (1, 2, 3…) from the ${LW_INDEX_START} variable, since GitHub uses page-number pagination rather than offset-based pagination. The actual number of items per page is controlled by the fixed per_page=100 query parameter.
Variables used
The GitHub REST configuration variables used are:
- ${LW_INDEX_START}: Used with the pagination feature. This variable sets the page query parameter, which is the page number to retrieve. GitHub pagination is 1-indexed. The connector increases this value by the batchSize (1) after each page request, producing page numbers 1, 2, 3, etc.
- ${LW_PARENT_DATA_KEY}: Used with Child Request Configuration. This variable is replaced with the value of the parentIdKey field from the parent object’s response.
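
As an illustration of how the two variables resolve, the sketch below performs a plain string substitution on an endpoint and a query string taken from this recipe. The `resolve` helper is hypothetical and only approximates what the connector does internally.

```python
from typing import Optional


def resolve(template: str,
            index_start: Optional[int] = None,
            parent_data_key: Optional[str] = None) -> str:
    """Substitute the recipe variables in an endpoint or query string."""
    resolved = template
    if index_start is not None:
        resolved = resolved.replace("${LW_INDEX_START}", str(index_start))
    if parent_data_key is not None:
        resolved = resolved.replace("${LW_PARENT_DATA_KEY}", parent_data_key)
    return resolved


# COMMIT request: the branch name comes from the parent BRANCH object.
commit_query = resolve(
    "sha=${LW_PARENT_DATA_KEY}&per_page=100&page=${LW_INDEX_START}",
    index_start=2,
    parent_data_key="main",
)

# COMMIT_DIFF request: the commit SHA comes from the parent COMMIT object.
diff_endpoint = resolve(
    "/repos/{owner}/{repo}/commits/${LW_PARENT_DATA_KEY}",
    parent_data_key="abc123",
)
```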
Endpoints configuration
The following table describes the GitHub REST endpoints needed and how they are configured with the REST connector. Each request is configured under the property List of Requests Configuration (requestConfigurations in the JSON files).
| Request type | ObjectType | Parent ObjectType | Endpoint | Query parameters | Description |
|---|---|---|---|---|---|
| Root Request | REPOSITORY | | GET /repos/{owner}/{repo-name} | (none) | Returns a single repository object. Replace {owner} and {repo-name} with the target repository. No pagination is needed since the endpoint returns a single JSON object. To crawl multiple repositories, create one datasource per repository. |
| Child Request | ISSUE | REPOSITORY | GET /repos/{owner}/{repo}/issues | per_page=100&page=${LW_INDEX_START}&state=all&filter=all | Returns all issues (open and closed) for the repository. Note: GitHub’s issues endpoint also returns pull requests since every PR is an issue; PR objects can be identified by the presence of a pull_request field. |
| Child Request | PULL_REQUEST | REPOSITORY | GET /repos/{owner}/{repo}/pulls | per_page=100&page=${LW_INDEX_START}&state=all | Returns all pull requests (open, closed, and merged) for the repository. Provides PR-specific fields such as diff_url, merge_commit_sha, draft, head, and base. |
| Child Request | BRANCH | REPOSITORY | GET /repos/{owner}/{repo}/branches | per_page=100&page=${LW_INDEX_START} | Returns all branches for the repository. Uses name as the Data ID since branches do not have an html_url in the list response. Sets parentIdKey=name so its COMMIT child receives the branch name via ${LW_PARENT_DATA_KEY} in the sha query parameter. |
| Child Request | COMMIT | BRANCH | GET /repos/{owner}/{repo}/commits | sha=${LW_PARENT_DATA_KEY}&per_page=100&page=${LW_INDEX_START} | Returns commits per branch. The sha query parameter receives the branch name from the parent BRANCH entity via ${LW_PARENT_DATA_KEY} (parentIdKey=name). Note: commits reachable from multiple branches will be indexed once per branch. |
| Child Request | COMMIT_DIFF | COMMIT | GET /repos/{owner}/{repo}/commits/${LW_PARENT_DATA_KEY} | (none) | Fetches the single-commit detail and indexes all the modified files. Uses dataPath=files to extract the files array, creating a separate Solr document for each file entry with fields such as filename, status, additions, deletions, changes, patch, blob_url, raw_url, and contents_url. Uses sha as the Data ID. The ${LW_PARENT_DATA_KEY} is replaced with the commit sha from the parent COMMIT object. |
| Child Request | TAG | REPOSITORY | GET /repos/{owner}/{repo}/tags | per_page=100&page=${LW_INDEX_START} | Returns all tags for the repository. Uses name as the Data ID since tags do not have an html_url in the list response. |
| Child Request | MILESTONE | REPOSITORY | GET /repos/{owner}/{repo}/milestones | per_page=100&page=${LW_INDEX_START}&state=all | Returns all milestones (open and closed) for the repository. |
| Child Request | COLLABORATOR | REPOSITORY | GET /repos/{owner}/{repo}/collaborators | per_page=100&page=${LW_INDEX_START} | Returns all collaborators for the repository. Requires the PAT to have push access to the repository; otherwise returns HTTP 403. |
| Child Request | RELEASE | REPOSITORY | GET /repos/{owner}/{repo}/releases | per_page=100&page=${LW_INDEX_START} | Returns all releases for the repository, including draft releases. |
| Child Request | COMMENT | REPOSITORY | GET /repos/{owner}/{repo}/issues/comments | per_page=100&page=${LW_INDEX_START} | Returns all comments on all issues (and pull requests) for the entire repository. Each comment includes an issue_url field linking back to the parent issue. Uses the repo-level endpoint to avoid nested parent key requirements. |
| Child Request | COMMIT_COMMENT | REPOSITORY | GET /repos/{owner}/{repo}/comments | per_page=100&page=${LW_INDEX_START} | Returns all comments on all commits for the entire repository. Each comment includes commit_id linking back to the parent commit. Uses the repo-level endpoint. |
| Child Request | CONTENT | REPOSITORY | GET /repos/{owner}/{repo}/contents | (none) | Lists the root directory entries of the repository’s default branch using the Contents API. Each entry includes name, path, type (file or dir), size, sha, and html_url. Uses skipIndexation=true — exists only for discovery. |
| Child Request | FOLDER | CONTENT | GET /repos/{owner}/{repo}/contents/${LW_PARENT_DATA_KEY} | (none) | Recursively walks subdirectories. Sets parentIdKey=path to extract the path field from the parent CONTENT object; ${LW_PARENT_DATA_KEY} resolves to this value in the endpoint. Uses recursiveRequest=true to traverse all directory levels. Uses skipIndexation=true — exists only for discovery. |
| Child Request | BLOB | FOLDER | GET /repos/{owner}/{repo}/contents/${LW_PARENT_DATA_KEY} | (none) | Downloads raw file content for each file discovered by FOLDER. Sets parentIdKey=path to extract the path field from the parent FOLDER object; ${LW_PARENT_DATA_KEY} resolves to this value in the endpoint. Uses the Accept: application/vnd.github.raw+json header and binaryResponse=true to download the binary content. Uses path as the Data ID. |
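
Because the ISSUE request also returns pull requests, downstream code may want to tell the two apart. The following sketch does so by checking for the pull_request field mentioned in the table; the sample objects are hypothetical and heavily trimmed.

```python
def split_issues(items: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate plain issues from pull requests returned by the issues endpoint."""
    plain_issues = [item for item in items if "pull_request" not in item]
    pull_requests = [item for item in items if "pull_request" in item]
    return plain_issues, pull_requests


# Hypothetical, trimmed response objects from /repos/{owner}/{repo}/issues.
sample = [
    {"number": 41, "title": "Crash on startup"},
    {"number": 42, "title": "Fix crash", "pull_request": {"url": "..."}},
]
plain_issues, pull_requests = split_issues(sample)
```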
Notes
- The requests are linked hierarchically using the ObjectType and ParentObjectType properties.
- When objects are indexed, the field _lw_rest_parent_object_ss keeps the list of parents related to an object.
- Comment endpoints use repository-level listing (/issues/comments, /comments) rather than per-issue, per-pull request, or per-commit endpoints. This design avoids the need for nested parent key substitution. The parent entity can be identified using fields within each comment:
  - Issue comments: issue_url field
  - Commit comments: commit_id field
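
A minimal sketch of recovering a comment's parent from those fields follows. It assumes issue_url ends with the issue number, as in GitHub's API URLs; the helper names and sample comments are hypothetical.

```python
def issue_number_from_comment(comment: dict) -> int:
    """Return the parent issue or pull request number of an issue comment."""
    # issue_url ends with the issue number, e.g. .../issues/42
    return int(comment["issue_url"].rstrip("/").rsplit("/", 1)[1])


def commit_sha_from_comment(comment: dict) -> str:
    """Return the parent commit SHA of a commit comment."""
    return comment["commit_id"]


# Hypothetical, trimmed comment objects.
issue_comment = {
    "issue_url": "https://api.github.com/repos/owner/repo/issues/42",
    "body": "Looks good to me.",
}
commit_comment = {
    "commit_id": "6dcb09b5b57875f334f61aebed695e2e4193db5e",
    "body": "Nice change.",
}

parent_issue = issue_number_from_comment(issue_comment)
parent_commit = commit_sha_from_comment(commit_comment)
```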
Response parsing configuration
Per request, configure the Response Handling property to specify how to parse the response. This field is responseConfiguration in the JSON recipe.
Plugin parsing
- This parsing happens by default. The responses are parsed as a JSON object structure using JsonPath.
- Plugin parsing applies to all the requests listed in the endpoints configuration table, except the BLOB request, which uses binary parsing. The CONTENT and FOLDER requests also use plugin parsing but with skipIndexation=true. These requests parse the JSON response to discover files and directories without creating Solr documents.
- The Response Handling -> Data ID properties are configured to extract unique identifiers from the objects parsed. For most entities, html_url provides a globally unique, human-readable URL.
  - For branches and tags, name is used since these entities lack html_url in list responses.
  - For COMMIT_DIFF, blob_url is used since dataPath=files extracts per-file entries that each have a unique blob_url.
- The Response Handling -> Parent Data Key property is required by the connector on all child requests. It specifies which field to extract from the parent response object; that value replaces ${LW_PARENT_DATA_KEY} in the child request’s endpoint or query parameters.
  - The COMMIT request uses parentIdKey=name to extract the branch name from its parent BRANCH object, passing it as ${LW_PARENT_DATA_KEY} in the sha query parameter.
  - The COMMIT_DIFF request uses parentIdKey=sha to extract the commit SHA from its parent COMMIT object, passing it as ${LW_PARENT_DATA_KEY} in the endpoint.
  - The FOLDER request uses parentIdKey=path to extract the directory path from its parent CONTENT object.
  - The BLOB request uses parentIdKey=path to extract the file path from its parent FOLDER object.
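
The interplay of Data Path and Data ID can be modeled roughly as shown below. This is a simplified sketch: it treats the data path as a single top-level key rather than a full JsonPath expression, `extract_documents` and the `_doc_id` key are hypothetical, and the sample responses are trimmed.

```python
import uuid


def extract_documents(response, data_path: str, data_id: str) -> list[dict]:
    """Pick the objects at data_path and attach an ID taken from data_id."""
    # An empty data path means the response itself is the (bare) array.
    items = response[data_path] if data_path else response
    if isinstance(items, dict):
        items = [items]
    documents = []
    for item in items:
        # Fall back to a random UUID when the Data ID field is missing.
        doc_id = item.get(data_id) or str(uuid.uuid4())
        documents.append({"_doc_id": doc_id, **item})
    return documents


# BRANCH: bare array, name as the Data ID.
branch_docs = extract_documents([{"name": "main"}, {"name": "dev"}], "", "name")

# COMMIT_DIFF: dataPath=files, one document per changed file.
commit_detail = {
    "sha": "abc123",
    "files": [
        {"filename": "a.py", "blob_url": "https://github.com/owner/repo/blob/abc123/a.py"},
        {"filename": "b.py", "blob_url": "https://github.com/owner/repo/blob/abc123/b.py"},
    ],
}
file_docs = extract_documents(commit_detail, "files", "blob_url")
```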
Binary parsing
- The BLOB request uses binaryResponse=true to enable binary parsing. The request includes the Accept: application/vnd.github.raw+json header so the GitHub Contents API endpoint (/repos/{owner}/{repo}/contents/{path}) returns raw binary file content instead of a base64-encoded JSON object. With binary response enabled, the connector downloads the raw content and sends it to Fusion’s parser stages.
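
For contrast, the sketch below shows the decoding step that would be needed if the raw Accept header were omitted and the Contents API returned its default base64-encoded JSON object. The payload is a hypothetical, trimmed example built locally, and `decode_contents_json` is an illustrative helper rather than part of the connector.

```python
import base64


def decode_contents_json(payload: dict) -> bytes:
    """Decode the file bytes from a default (non-raw) Contents API response."""
    if payload.get("encoding") != "base64":
        raise ValueError(f"unexpected encoding: {payload.get('encoding')!r}")
    # b64decode ignores the line breaks embedded in the encoded content.
    return base64.b64decode(payload["content"])


# Hypothetical payload; encodebytes wraps lines, similar to real responses.
original = b"print('hello from the repository')\n" * 5
payload = {
    "path": "src/hello.py",
    "encoding": "base64",
    "content": base64.encodebytes(original).decode("ascii"),
}

file_bytes = decode_contents_json(payload)
```

Requesting the raw media type avoids this extra step, which is why the BLOB request pairs the header with binaryResponse=true.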
Terminology
The following terms are provided as a reference.
| Term | Description |
|---|---|
| List of Requests Configuration | Configure List of Requests to extract data from the REST source. Requests are linked hierarchically using the properties Parent-Child Request Link -> ObjectType and ParentObjectType. |
| Object Type | The unique name to identify the request. |
| Parent Object Type | References an existing Object Type. Creates a parent-child hierarchy, where the current request becomes the child of the specified Parent Object Type. If blank, the current request is considered a Root Request. |
| Root Request | The type of request configuration that retrieves the initial parent objects. |
| Child Request | The type of request configuration that retrieves child objects for each parent object. A child request can be a parent of another child request. |
| Response Handling | The responseConfiguration defines the mapping between the response and data objects to be indexed. |
| Data Path | The path to access a specific data object within a response. For GitHub endpoints that return bare JSON arrays, set to an empty string. The COMMIT_DIFF request uses dataPath=files to extract the files array from the single-commit detail response, creating one document per changed file. This property accepts JsonPath expressions such as results, items[*]. |
| Data ID | The identifier key for the data objects extracted with ‘Data Path’. This value is used to build the Solr document’s ID. If not provided, a random UUID is used. This property accepts JsonPath expressions such as html_url to extract the unique URL of an object. |
| Parent Data Key | Required for all Child Requests. Map to a key from the parent object, whose value is used to replace the ${LW_PARENT_DATA_KEY} variable in the child request configuration (endpoint, query parameters, or body). In the repo template, most children set parentIdKey=full_name (required by the connector) but hardcode the {owner}/{repo} path in the endpoint. |
| _lw_rest_object_type_s | All objects index this field, which contains the ‘ObjectType’ of the request that retrieved the object, such as REPOSITORY, PULL_REQUEST, COMMIT, BRANCH. |
| _lw_rest_object_s | All objects index this field, which contains the object ID extracted with the data ID. For example, for a repository, indexes _lw_rest_object_s: "https://github.com/owner/repo". For a pull request, indexes _lw_rest_object_s: "https://github.com/owner/repo/pull/42". |
| _lw_rest_parent_object_ss | All objects index this field, which contains a list of the object IDs inherited from all their parents, and the object IDs from the object itself. For example, for a pull request, indexes _lw_rest_parent_object_ss: ["https://github.com/owner/repo", "https://github.com/owner/repo/pull/42"]. |
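
The three _lw_rest fields fit together as sketched below, using the pull request example from the table. The `rest_fields` helper is hypothetical and only models how the parent list is described here: the IDs inherited from all parents, followed by the object's own ID.

```python
def rest_fields(object_type: str, object_id: str, parent_ids: list[str]) -> dict:
    """Model the connector fields indexed with every object."""
    return {
        "_lw_rest_object_type_s": object_type,
        "_lw_rest_object_s": object_id,
        # Parent IDs first, then the object's own ID.
        "_lw_rest_parent_object_ss": [*parent_ids, object_id],
    }


repo_id = "https://github.com/owner/repo"
pull_request = rest_fields(
    "PULL_REQUEST",
    "https://github.com/owner/repo/pull/42",
    parent_ids=[repo_id],
)
```

A field like _lw_rest_parent_object_ss makes it possible to query for every object that belongs to a given repository or branch.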
Recipe
Replace the following values in the recipe:
- pipeline with your Fusion pipeline
- collection with your Fusion collection
- id with the name of a Fusion datasource if you want to use a different name than the one provided
- password with your GitHub personal access token
- user with your GitHub username
- {add owner here} with the GitHub owner name
- {add repo name here} with the GitHub repository name