  • Latest version: v1.3.0
  • Compatible with Fusion version: 5.9.4 and later
Connector flow
The AEM V2 connector supports the following:
  • Full crawling and recrawling of pages and assets in Adobe Experience Manager
  • Basic authentication
    In Fusion 5, the username and password fields have moved under Authentication Settings > Login Settings.
  • OAuth authentication
  • Security trimming to filter results based on user permissions
  • Filtering of crawled documents by include/exclude paths and by content property filters configured when setting up the connector
  • Specific wait time between fetch requests to throttle crawls, if necessary
  • Optional crawling of child paths
This document explains how to configure an AEM V2 connector to crawl data in Adobe Experience Manager. Refer to the AEM reference to learn more about how this connector works. This connector is compatible with Fusion 5.5.1 and later.

Configure AEM Datasource

  1. In Fusion, under Indexing > Datasources, click Add, then select AEM.
  2. Enter a Configuration ID.
  3. Enter the AEM URL (the URL used to access the AEM Admin UI), as well as the AEM username and password used to authenticate access to the QueryBuilder JSON Servlet.
  4. Go to AEM to log in and access CRXDE Lite. In the CRXDE Lite UI, select a path to crawl, and enter this path into Fusion. Click Add to crawl multiple paths.
  5. Optional: To exclude paths from the crawl, in Fusion, enter a Java regular expression (regex) matching the paths to exclude from the indexed content.
  6. In Fusion, enter the AEM type to crawl. In the CRXDE Lite UI, this is the jcr:primaryType. In this example, the AEM connector is configured to crawl the AEM type cq:Page, which represents web content pages.
  7. To index assets with a particular file extension, locate a file of that type in CRXDE Lite and enter the value of its jcr:primaryType into Fusion. In this example, the value for NY_FairHealth.pdf is dam:Asset.
  8. Choose which content properties to include in and exclude from the index. These parameter values are Java regexes. For example, to include only properties that start with “jcr”, enter jcr:(.*).
  9. In Fusion, click Save when you’re done configuring the AEM datasource.
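The include/exclude property regexes behave like ordinary regular expressions matched against property names. A minimal Python sketch (Python's re module is used here for illustration only; Fusion evaluates these values as Java regexes, and the property names below are hypothetical):

```python
import re

# Hypothetical property names as they might appear in CRXDE Lite
properties = ["jcr:title", "jcr:description", "sling:resourceType", "cq:template"]

include = re.compile(r"jcr:(.*)")   # include only properties starting with "jcr"
exclude = re.compile(r"sling:.*")   # exclude sling-internal properties

indexed = [p for p in properties
           if include.fullmatch(p) and not exclude.fullmatch(p)]
print(indexed)  # ['jcr:title', 'jcr:description']
```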

Configuration settings

Setting | Notes
AEM URL | Required. The URL used to access the AEM Admin UI.
AEM Username | Required. The user should have sufficient permissions to read content paths, and to access the Users/Groups APIs if security trimming is needed.
AEM Password | Required.
Page Batch Size | Number of documents to fetch per page request. A higher value can increase crawling speed but also increases memory usage.
Thread wait (ms) | Number of milliseconds to wait between fetch requests. Use this property to throttle a crawl if necessary.
Paths to search | Required.
Paths that should not be fetched | Java regex for paths that should not be fetched.
AEM Types | Required. AEM document types (jcr:primaryType) to include in the index. Examples: cq:Page, dam:Asset.
Attachment types | File extensions to index.
Content Property Include Regexes | A list of regex strings for content properties to include in indexed documents. Example: jcr:.*
Content Property Exclude Regexes | A list of regex strings for content properties to exclude from indexed documents. Example: sling:.*
Enable Security Trimming | Enable this setting to filter results based on the user ID passed in at query time.
Group Mappings | AEM user groups mapped to indexed values in the security trimming field, which are used to filter content based on the user ID passed in the query.
Cache Expire Time (m) | How long a query is cached, in minutes.
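Page Batch Size and Thread wait trade off against each other: larger batches mean fewer requests, while a longer wait adds delay per request. A back-of-envelope sketch with purely illustrative numbers (not measured values):

```python
docs = 100_000                 # hypothetical corpus size
page_batch_size = 500          # documents fetched per page request
thread_wait_ms = 200           # pause between fetch requests

requests_needed = -(-docs // page_batch_size)               # ceiling division
throttle_delay_s = requests_needed * thread_wait_ms / 1000  # total added delay
print(requests_needed, throttle_delay_s)  # 200 40.0
```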

Field data population

There are multiple sources from which AEM data is indexed. The /bin/querybuilder.json endpoint data is mandatory and must exist in order for a document to be indexed. The following fields can appear in an indexed document:
Field | Source | Comments
id | <AEM_URL>/bin/querybuilder.json | Field: path
content_txt | <AEM_URL>/bin/querybuilder.json | The whole data in text format.
<rest fields> | <AEM_URL>/bin/querybuilder.json | All top-level fields of the JSON object.
body_t | <AEM_URL>/crx/de/download.jsp | Used if the path ends with one of the Attachment types, OR the path does not end with /jcr:content.
body_t | <AEM_URL><id> | Used if there is no jcr data. If the response status code is anything other than 200, Fusion assumes there is no file to download under that path.
body_t | [content_txt] | Defaults to [content_txt] if body_t is empty.
parentPage | Id of the document that contains the attachment or link. | Populated for attachments/links.
type | File extension of the path. | Populated for attachments/links.
file_size | <AEM_URL>/bin/querybuilder.json | jcr:data; used if jcr data is not empty.
file_size | <AEM_URL>/bin/querybuilder.json | dam:size; used if jcr data is empty.
Check for duplicate data when crawling child paths. For example, if the connector indexes both cq:Page and cq:PageContent then the results could include duplicated data.
When using security trimming, the v1.3.0 version of this connector is only compatible with Fusion 5.9.4 and later. The v1.3.0 connector uses Graph Security Trimming rather than regular security trimming. Treat v1.3.0 as a new connector: configurations do not transfer from previous versions, and a full crawl is mandatory.
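The body_t fallback order in the table above can be sketched as follows (hypothetical helper function, for illustration only):

```python
def resolve_body_t(download_body, direct_body, content_txt):
    """Pick body_t per the fallback order in the field table:
    1. content from /crx/de/download.jsp (attachment types / non-jcr:content paths)
    2. content fetched directly from <AEM_URL><id> (only if the response was 200)
    3. fall back to content_txt when body_t would otherwise be empty
    """
    for candidate in (download_body, direct_body):
        if candidate:
            return candidate
    return content_txt

print(resolve_body_t(None, None, "page text"))       # page text
print(resolve_body_t("pdf text", None, "page text")) # pdf text
```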

Prerequisites

Perform these prerequisites to ensure the connector can reliably access, crawl, and index your data. Proper setup helps avoid configuration or permission errors, so use the following guidelines to keep your content available for discovery and search in Fusion.

AEM instance

  • AEM Author instance must be reachable over HTTP/HTTPS.
  • The QueryBuilder JSON API must be enabled. It’s typically located at /bin/querybuilder.json.
  • The JCR Download endpoints must be accessible, such as /crx/de/download.jsp or direct node paths.

Configure an AEM service account

Create a service account with the following:
  • Read access to content paths being crawled
  • Access to user and group APIs if using security trimming
  • Permissions to access the following:
    • Page metadata, such as cq:Page or jcr:content
    • Binary attachments such as PDFs and DOCX files
    • JCR nodes and properties
    • If using group mapping, access to /libs/cq/security/userinfo.json or equivalent endpoints

Content Paths

  • You must define one or more JCR root paths to crawl, such as /content/SITE_NAME/en.
  • Optionally provide:
    • Exclude path regexes to filter out subtrees
    • Attachment extension types
    • JCR property include/exclude filters

Authentication

Setting up the correct authentication according to your organization’s data governance policies helps keep sensitive data secure while allowing authorized indexing. The AEM V2 connector supports two modes of authenticating to your AEM instance: basic HTTP and OAuth. Fusion handles session management as needed, including cookie handling and token renewal.
Basic HTTP authentication:
  • Provide a standard AEM username and password with read access to JCR paths.
  • If using security trimming, the AEM account also requires read access to the user and groups APIs.
OAuth 2.0 authentication:
  • Paste in an Access Token and optional Refresh Token.
  • If you do not have a pre-obtained token, the connector can fetch a token using JWT authentication.
The AEM V2 connector supports OAuth 2.0 authorization with JWT tokens.

Supported authorization options

Requests are authorized by including an Access Token in the Authorization header. Example:
curl -H 'Authorization: Bearer ACCESS_TOKEN' http://HOST:PORT/content/COMPANY/us/en/community/messaging.html
There are three ways the connector can get an Access Token:
  • From the datasource configuration
  • From AEM server using Refresh Token
  • From AEM server using JWT token
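However the token is obtained, each request carries it in a standard Bearer Authorization header, equivalent to the curl example above. A minimal Python sketch (urllib; the host and token values are placeholders):

```python
import urllib.request

ACCESS_TOKEN = "placeholder-token"  # obtained via any of the three methods above
url = "http://aem.example.com:4502/content/COMPANY/us/en/community/messaging.html"

req = urllib.request.Request(url)
req.add_header("Authorization", f"Bearer {ACCESS_TOKEN}")
# urllib.request.urlopen(req) would send the authorized request
print(req.get_header("Authorization"))  # Bearer placeholder-token
```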
Request the Access Token, Refresh Token, and JWT token manually and set them in the datasource configuration. The other settings (Client Id, Client Secret, and Redirect Uri) can be found in the AEM admin page under Security > OAuth Clients.

Getting Access Token

  1. Open this URL in a browser:
    http://HOST:PORT/oauth/authorize?response_type=code&client_id=CLIENT_ID&client_secret=CLIENT_SECRET&username=admin&password=PASSWORD&scope=offline_access&redirect_uri=REDIRECT_URI
    
  2. If you are not already logged in, you are redirected to the login page.
  3. Logging in redirects you to confirm the authorization. Click Yes, I authorize this request.
  4. You are redirected to the URL provided in redirect_uri with parameter code: REDIRECT_URI?code=AUTHORIZATION_CODE.
  5. Copy the value of the code parameter. This is your Authorization Code.
  6. Execute the request to get the Access Token:
    curl --location --request POST 'http://HOST:PORT/oauth/token?code=AUTHORIZATION_CODE&client_id=CLIENT_ID&client_secret=CLIENT_SECRET&grant_type=authorization_code&redirect_uri=REDIRECT_URI' --header 'Content-Type: application/x-www-form-urlencoded' --header 'Accept: application/json'
    
    Example:
    curl --location --request POST 'http://34.71.168.50:4502/oauth/token?code=AUTHORIZATION_CODE&client_id=CLIENT_ID&client_secret=CLIENT_SECRET&grant_type=authorization_code&redirect_uri=http://localhost:8080/test' --header 'Content-Type: application/x-www-form-urlencoded' --header 'Accept: application/json'
    
    Response:
    "access_token":"{placeholderAccessToken}","refresh_token":"{placeholderRefreshToken}","expires_in":3600
    
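The response body is plain JSON, so both tokens can be pulled out with any JSON parser. A minimal Python sketch (placeholder values):

```python
import json

# Response body from the /oauth/token request (placeholder values)
response_body = '{"access_token":"ACCESS","refresh_token":"REFRESH","expires_in":3600}'
tokens = json.loads(response_body)
print(tokens["access_token"], tokens["refresh_token"], tokens["expires_in"])
```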

Getting Refresh Token

To get a Refresh Token, follow the same process as for the Access Token, but:
  1. You must include offline_access in the scope list.
  2. You must revoke all previously obtained tokens. You can do this by clicking Revoke All Tokens.

Getting JWT Bearer Token

  1. Download the Private Key from the AEM OAuth client section. This downloads a file named store.p12.
  2. Run:
    openssl pkcs12 -in store.p12 -out store.crt.pem -clcerts -nokeys
    
    When asked for a password, type notasecret. This generates a file named store.crt.pem.
  3. Run:
    openssl pkcs12 -in store.p12 -passin pass:notasecret -nocerts -nodes -out store.private.key.txt
    
    This generates a file named store.private.key.txt.
  4. Create a JWT token with the payload below and sign it with the private key using RS256:
    {
      "aud": "http://HOST:PORT/oauth/token",
      "iss": "CLIENT_ID",
      "sub": "USERNAME",
      "exp": <Current time in milliseconds + expiry>,
      "iat": <Current time in milliseconds>,
      "scope": "SCOPE",
      "cty": "code"
    }
    
    For example, install pyjwt with pip install pyjwt and sign the payload with this Python script:
    import jwt

    payload_data = {
        "aud": "http://34.71.168.50:4502/oauth/token",
        "iss": "dp0dtqd9lqpcntvb6t12hrscpa-z1hqkpdg",
        "sub": "admin",
        "exp": 1697840880541,
        "iat": 1697740880541,
        "scope": "offline_access",
        "cty": "code"
    }

    private_key = open('store.private.key.txt', 'r').read()
    token = jwt.encode(
        payload=payload_data,
        key=private_key,
        algorithm='RS256'
    )
    print(token)
    
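The hard-coded exp and iat values in the script above can be computed at runtime. Assuming the server expects millisecond timestamps, as the payload description above indicates:

```python
import time

EXPIRY_MS = 60 * 60 * 1000          # one hour, illustrative choice
iat = int(time.time() * 1000)       # current time in milliseconds
exp = iat + EXPIRY_MS
print(iat, exp)
```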

Remote connectors

V2 connectors support running remotely in Fusion versions 5.7.1 and later. Refer to Configure Remote V2 Connectors.
If you need to index data from behind a firewall, you can configure a V2 connector to run remotely on-premises using TLS-enabled gRPC.

Prerequisites

Before you can set up an on-prem V2 connector, you must configure the egress from your network to allow HTTP/2 communication into the Fusion cloud. You can use a forward proxy server to act as an intermediary between the connector and Fusion. The following is required to run V2 connectors remotely:
  • The plugin zip file and the connector-plugin-standalone JAR.
  • A configured connector backend gRPC endpoint.
  • Username and password of a user with a remote-connectors or admin role.
  • If the host where the remote connector is running is not configured to trust the server’s TLS certificate, you must configure the file path of the trust certificate collection.
If your version of Fusion doesn’t have the remote-connectors role by default, you can create one. No API or UI permissions are required for the role.

Connector compatibility

Only V2 connectors are able to run remotely on-premises. You also need the remote connector client JAR file that matches your Fusion version. You can download the latest files at V2 Connectors Downloads.
Whenever you upgrade Fusion, you must also update your remote connectors to match the new version of Fusion.
The gRPC connector backend is not supported in Fusion environments deployed on AWS.

System requirements

The following is required for the on-prem host of the remote connector:
  • (Fusion 5.9.0-5.9.10) JVM version 11
  • (Fusion 5.9.11) JVM version 17
  • Minimum of 2 CPUs
  • 4GB Memory
Note that memory requirements depend on the number and size of ingested documents.

Enable backend ingress

In your values.yaml file, configure this section as needed:
ingress:
  enabled: false
  pathtype: "Prefix"
  path: "/"
  #host: "ingress.example.com"
  ingressClassName: "nginx"   # Fusion 5.9.6 only
  tls:
    enabled: false
    certificateArn: ""
    # Enable the annotations field to override the default annotations
    #annotations: ""
  • Set enabled to true to enable the backend ingress.
  • Set pathtype to Prefix or Exact.
  • Set path to the path where the backend will be available.
  • Set host to the host where the backend will be available.
  • In Fusion 5.9.6 only, you can set ingressClassName to one of the following:
    • nginx for Nginx Ingress Controller
    • alb for AWS Application Load Balancer (ALB)
  • Configure TLS and certificates according to your CA’s procedures and policies.
    TLS must be enabled in order to use AWS ALB for ingress.

Connector configuration example

kafka-bridge:
  target: mynamespace-connectors-backend.lucidworkstest.com:443 # mandatory
  plain-text: false # optional, false by default
  proxy-server: # optional - needed when a forward proxy server is used to provide outbound access to the standalone connector
    host: host
    port: some-port
    user: user # optional
    password: password # optional
  trust: # optional - needed when the client's system doesn't trust the server's certificate
    cert-collection-filepath: path1

proxy: # mandatory fusion-proxy
  user: admin
  password: password123
  url: https://fusiontest.com/ # needed only when the connector plugin requires blob store access

plugin: # mandatory
  path: ./fs.zip
  type: #optional - the suffix is added to the connector id
    suffix: remote

Minimal example

kafka-bridge:
  target: mynamespace-connectors-backend.lucidworkstest.com:443

proxy:
  user: admin
  password: "password123"

plugin:
  path: ./testplugin.zip

Logback XML configuration file example

<configuration>
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder class="com.lucidworks.logging.logback.classic.LucidworksPatternLayoutEncoder">
            <pattern>%d - %-5p [%t:%C{3.}@%L] - %m{nolookups}%n</pattern>
            <charset>utf8</charset>
        </encoder>
    </appender>

    <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>${LOGDIR:-.}/connector.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <!-- rollover daily -->
            <fileNamePattern>${LOGDIR:-.}/connector-%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
            <maxFileSize>50MB</maxFileSize>
            <totalSizeCap>10GB</totalSizeCap>
        </rollingPolicy>
        <encoder class="com.lucidworks.logging.logback.classic.LucidworksPatternLayoutEncoder">
            <pattern>%d - %-5p [%t:%C{3.}@%L] - %m{nolookups}%n</pattern>
            <charset>utf8</charset>
        </encoder>
    </appender>

    <root level="INFO">
        <appender-ref ref="CONSOLE"/>
        <appender-ref ref="FILE"/>
    </root>
</configuration>

Run the remote connector

java [-Dlogging.config=[LOGBACK_XML_FILE]] \
  -jar connector-plugin-client-standalone.jar [YAML_CONFIG_FILE]
The logging.config property is optional. If not set, logging messages are sent to the console.

Test communication

You can run the connector in communication testing mode. This mode tests the communication with the backend without running the plugin, reports the result, and exits.
java -Dstandalone.connector.connectivity.test=true -jar connector-plugin-client-standalone.jar [YAML_CONFIG_FILE]

Encryption

In a deployment, communication to the connector’s backend server is encrypted using TLS. You should only run this configuration without TLS in a testing scenario. To disable TLS, set plain-text to true.

Egress and proxy server configuration

One of the methods you can use to allow outbound communication from behind a firewall is a proxy server. You can configure a proxy server to allow certain communication traffic while blocking unauthorized communication. If you use a proxy server at the site where the connector is running, you must configure the following properties:
  • Host. The host where the proxy server is running.
  • Port. The port on which the proxy server listens for communication requests.
  • Credentials. Optional proxy server user and password.
When you configure egress, it is important to disable any connection or activity timeouts, because the connector uses long-running gRPC calls.

Password encryption

If you use a login name and password in your configuration, run the following utility to encrypt the password:
  1. Enter a user name and password in the connector configuration YAML.
  2. Run the standalone JAR with this property:
    -Dstandalone.connector.encrypt.password=true
    
  3. Retrieve the encrypted passwords from the log that is created.
  4. Replace the clear password in the configuration YAML with the encrypted password.

Connector restart (5.7 and earlier)

The connector shuts down automatically whenever the connection to the server is disrupted, to prevent it from getting into a bad state. Communication disruption can happen, for example, when the server running in the connectors-backend pod shuts down and is replaced by a new pod. Once the connector shuts down, connector configuration and job execution are disabled, so you should restart the connector as soon as possible. You can use Linux scripts and utilities, such as Monit, to restart the connector automatically.

Recoverable bridge (5.8 and later)

If communication to the remote connector is disrupted, the connector tries to recover communication and gRPC calls. By default, six attempts are made to recover each gRPC call. The number of attempts can be configured with the max-grpc-retries bridge parameter.
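Assuming the parameter belongs in the kafka-bridge section of the standalone YAML configuration (its exact placement is an assumption), raising the retry count might look like:

```yaml
kafka-bridge:
  target: mynamespace-connectors-backend.lucidworkstest.com:443
  max-grpc-retries: 10  # default is 6; placement under kafka-bridge is an assumption
```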

Job expiration duration (5.9.5 only)

The timeout value for unresponsive backend jobs can be configured with the job-expiration-duration-seconds parameter. The default value is 120 seconds.

Use the remote connector

Once the connector is running, it is available in the Datasources dropdown. If the standalone connector terminates, it disappears from the list of available connectors. Once it is re-run, it becomes available again, and configured connector instances are not lost.

Enable asynchronous parsing (5.9 and later)

To separate document crawling from document parsing, enable Tika Asynchronous Parsing on remote V2 connectors.
Below is an example configuration showing how to specify the file system to index under the connector-plugins entry in your values.yaml file:
additionalVolumes:
  - name: fusion-data1-pvc
    persistentVolumeClaim:
      claimName: fusion-data1-pvc
  - name: fusion-data2-pvc
    persistentVolumeClaim:
      claimName: fusion-data2-pvc
additionalVolumeMounts:
  - name: fusion-data1-pvc
    mountPath: "/connector/data1"
  - name: fusion-data2-pvc
    mountPath: "/connector/data2"
You may also need to specify the user that is authorized to access the file system, as in this example:
securityContext:
    fsGroup: 1002100000
    runAsUser: 1002100000

Configuration

I