Javascript Connector and Datasource Configuration

The Javascript connector allows users to write ad-hoc document retrieval routines to fetch content from filesystems and websites.

The connector's f.script property holds a JavaScript program, which is compiled by the JDK. The program returns a content item, which is handed off to the fetcher.

The script engine works the same way as it does in the JavaScript Index and JavaScript Query Pipeline stages. The JavaScript program must be standard ECMAScript.

You can use any Java class available to the connector's JDK ClassLoader to manipulate that object within a function. As in Java, classes must be imported before you can refer to them by their simple names instead of their fully qualified names, e.g. to write String instead of java.lang.String. The java.lang package is not imported by default, because its classes would conflict with Object, Boolean, Math, and other built-in JavaScript objects. To import a Java class, use the JavaImporter object and the with statement, which limits the scope of the imported Java packages and classes.

var imports = new JavaImporter(java.lang.String);
...
with (imports) {
    var name = new String("foo"); ...
}

For global variables, you can reference these objects using the Java.type API extension. See this tutorial for details: http://winterbe.com/posts/2014/04/05/java8-nashorn-tutorial/

The JavaScript Program

The Javascript context provides the following variables:

| Variable | Type | Description |
| --- | --- | --- |
| id | java.lang.String | The ID of the object to fetch. This is almost always the URI of the datasource to connect to and fetch content from. |
| lastModified | long | The time at which the item was last touched, measured from the epoch. |
| signature | java.lang.String | An optional string meant to be used to compare versions of the ID being fetched, e.g. an ETag in a web crawl. |
| content | crawler.common.MutableObject | A Content object that can be modified and returned, for fine-grained control over the return value. |
| _fetcher | Fetcher | The current Fetcher instance (usually type JavascriptFetcher), used to interact with the Fetcher, including getting a WebFetcher instance using _fetcher.getWebFetcher(). |
| _context | java.util.Map | A map used to store data to persist across calls to fetch(), e.g. an instance of WebFetcher obtained using _fetcher.getWebFetcher(). |
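
As a rough illustration of how _context might be used to keep state across calls, the sketch below counts how many items the script has handled. The first part is only a stand-in for the connector-provided _context and id variables, so that the sketch can run outside the connector; the key "fetchCount", the function name handle, and the field names are hypothetical.

```javascript
// --- Stand-ins so the sketch runs outside the connector (assumptions). ---
// In the connector, _context is a java.util.Map and id is supplied per call.
var _context = (function () {
  var store = {};
  return {
    get: function (key) {
      return Object.prototype.hasOwnProperty.call(store, key) ? store[key] : null;
    },
    put: function (key, value) { store[key] = value; }
  };
})();
var id = "http://example.com/page"; // hypothetical ID

// --- Script body: the part that would go into f.script. ---
function handle(id) {
  // "fetchCount" is a hypothetical key; _context keeps it between calls.
  var count = _context.get("fetchCount");
  count = (count == null) ? 1 : count + 1;
  _context.put("fetchCount", count);
  // Return a map of hypothetical field names and values.
  return { "source_id": id, "fetch_count": count };
}

handle(id);
```

Only get() and put() are used on _context, matching the java.util.Map type listed above.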

The program must return one of the following kinds of objects:

| Object | Description |
| --- | --- |
| String | A string object. This is converted to UTF-8 bytes and added as the raw content on a common.crawler.Content object and returned from the fetch() method. |
| byte[] | A byte array. This array is set on a common.crawler.Content object and returned from the fetch() method. |
| common.crawler.MutableContent | If you want to have complete control over the return from fetch(), make changes to the content object provided in the Context and return it. Warning: Do not create a new object. |
| An array of Objects | The array is converted to Embedded Content. The Fetcher returns a parent Content object that has a "Container" discardMessage. The Embedded Content on that container is generated by calling toString() on the objects in the array. |
| A JavaScript Map | The map is converted to fields on the Content item returned. |

If the script is implemented as a function, the return statement must return one of the above types. If the script is not function-based, the last line in the script must evaluate to one of these object types.
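
The two script shapes can be sketched together as follows. The three var declarations at the top are stand-ins for variables the connector would supply and would be omitted in a real script; the function name describe and the field names are hypothetical. How the connector invokes a function-based script is not specified here, so the sketch also ends with an expression, which satisfies the last-line rule.

```javascript
// Stand-ins for connector-supplied variables (assumptions, so the sketch
// can run outside the connector).
var id = "http://example.com/doc1";
var lastModified = 0;
var signature = null;

// Function-based shape: the return statement yields one of the accepted
// kinds of objects, here a JavaScript map.
function describe(id, lastModified, signature) {
  return {
    "source_id": id,
    "last_modified": lastModified,
    "has_signature": signature != null
  };
}

// Non-function shape: the last line of the script is an expression that
// evaluates to one of the accepted kinds of objects.
var result = describe(id, lastModified, signature);
result;
```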

Examples

Return content as a java.lang.String

var str = new java.lang.String("Java");
str;

Return content as a byte array

var str = new java.lang.String("Java");
// getBytes() returns a Java byte[], which becomes the raw content
str.getBytes("UTF-8");

Return content as a JavaScript array

var strings = ["hi", "bye"];
strings;
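
Because embedded content is generated by calling toString() on each element of the array, plain JavaScript objects may not carry over as intended: in standard JavaScript they stringify to "[object Object]". A minimal sketch of one workaround is to serialize each element to a JSON string first. The records below are hypothetical.

```javascript
// Hypothetical records to emit as embedded content.
var records = [
  { "title": "first", "rank": 1 },
  { "title": "second", "rank": 2 }
];

// Serialize each record to a JSON string, so that calling toString()
// on an element returns the JSON text rather than "[object Object]".
var items = records.map(function (record) {
  return JSON.stringify(record);
});

items;
```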

Return content as a JavaScript map

var map = {"hi": "bye", "bye": "hi", "number":1};
map;

Leverage the Fetcher

var webFetcher = _context.get("webFetcher");
if (null == webFetcher) {
  webFetcher = _fetcher.getWebFetcher();
  // it's possible to pass config options to getWebFetcher() as a map as well, e.g.:
  // _fetcher.getWebFetcher({"f.discardLinkURLQueries" : false });
  _context.put("webFetcher", webFetcher);
}
var webContent = webFetcher.fetch(id, lastModified, signature);
var jsoupDoc = webContent.getDocument();
if (null !== jsoupDoc) {
  // modify the Jsoup document or web-content as-needed here, adding new links, removing sections etc.
  // ...
  // ...
  webContent.setRawContent(jsoupDoc.toString().getBytes("UTF-8"));
}
webContent;

Configuration

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
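
One way to read this tip, assuming the API accepts JSON request bodies: the value typed in the UI is the two-character sequence backslash plus t, and JSON encoding doubles the backslash. The sketch below shows that behavior with JSON.stringify. The property name f.exampleDelimiter is hypothetical and is not a documented setting.

```javascript
// The value as it would be typed in the UI: a backslash followed by "t".
// (In JavaScript source, "\\t" denotes that two-character string.)
var uiValue = "\\t";

// "f.exampleDelimiter" is a hypothetical property name for illustration.
var body = JSON.stringify({ "f.exampleDelimiter": uiValue });

// The JSON text now contains a doubled backslash before the "t".
body;
```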