Javascript Connector and Datasource Configuration

The Javascript connector lets users write ad hoc document retrieval routines to fetch content from filesystems and websites. Its f.script property holds a JavaScript program that is compiled by the JDK. The program returns a content item, which is handed off to the fetcher.

The script engine works exactly the same as the Javascript Index and Javascript Query Pipeline stages. The JavaScript program must be standard ECMAScript.

You can use any Java class available to the connector's JDK classloader to manipulate the content item. As in Java, to refer to classes by their simple names instead of their fully qualified names (for example, to write String instead of java.lang.String), the classes must be imported. The java.lang package is not imported by default because its classes would conflict with Object, Boolean, Math, and other built-in JavaScript objects. To import a Java class, use the JavaImporter object together with the with statement, which limits the scope of the imported Java packages and classes.

var imports = new JavaImporter(java.lang.String);
...
with (imports) {
    var name = new String("foo"); ...
}

For global variables, you can reference these objects using the Java.type API extension. See this tutorial for details: http://winterbe.com/posts/2014/04/05/java8-nashorn-tutorial/

The JavaScript Program

The Javascript context provides the following variables:

  • id, type java.lang.String - the id of the object to fetch. This is almost always the URI of the datasource to connect to and fetch content from.

  • lastModified, type long - the time the item was last touched, expressed as time since the epoch.

  • signature, type java.lang.String - an optional string meant to be used to compare versions of the ID being fetched, e.g. an ETag in a web-crawl.

  • content, type crawler.common.MutableObject - a Content object that can be modified and returned, for fine grained control over the return. See the section on return types below.

  • _fetcher, type Fetcher - the current Fetcher instance (usually of type JavascriptFetcher), used to interact with the Fetcher, including getting a WebFetcher instance using `_fetcher.getWebFetcher()`.

  • _context, type java.util.Map - a map used to store data that must persist across calls to fetch(), e.g. an instance of WebFetcher obtained using `_fetcher.getWebFetcher()`.
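
As an illustration of persisting state in _context, the following minimal sketch counts how many times an id has been fetched. The function name countFetch, the "fetchCount:" key prefix, and the standInContext object are hypothetical; the stand-in only mimics the get and put methods of java.util.Map so that the sketch can run outside the connector.

```javascript
// Hypothetical helper: counts fetches per id in a map that survives across calls.
function countFetch(context, id) {
  var key = "fetchCount:" + id;
  var previous = context.get(key);                 // null or undefined on the first call
  var next = (previous == null) ? 1 : previous + 1;
  context.put(key, next);
  return { "id": id, "fetchCount": next };         // returned as a JavaScript map
}

// Stand-in for _context, for running outside the connector only.
var standInContext = {
  data: {},
  get: function (key) { return this.data[key]; },
  put: function (key, value) { this.data[key] = value; }
};

countFetch(standInContext, "http://example.com/a");
var second = countFetch(standInContext, "http://example.com/a");
```

Inside a connector script, the stand-in would be dropped and the last line would presumably be countFetch(_context, id); so that the map becomes the result of the script.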

The program must return one of the following kinds of objects:

  • String - A string object. This will be converted to UTF-8 bytes, added as the raw content on a common.crawler.Content object, and returned from the fetch() method.

  • byte[] - A byte array. This array will be set on a common.crawler.Content object and returned from the fetch() method.

  • common.crawler.MutableContent - If you wish to have complete control over the return from fetch(), make changes to the content object provided in the context and return it. DO NOT CREATE A NEW OBJECT.

  • An array of Objects - These will be converted to Embedded Content: the Fetcher will return a parent Content object that has a "Container" discardMessage, and the Embedded Content on that container will consist of the result of calling toString() on each object in the array. Thus, it is best if the array is simply

  • A JavaScript map - The map will be converted to fields on the Content item returned.

If the JavaScript script is implemented as a function, the return statement must return one of the above types. If the script is not function-based, then the last line in the script must evaluate to one of these object types.
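
The "last line must evaluate" rule can be illustrated with standard ECMAScript completion values. In this sketch, eval() is only a stand-in for the connector's script engine, which is assumed to hand back the value of the last statement in the same way. The variable names are hypothetical.

```javascript
// Ends with an expression statement, so the script evaluates to the map.
var good = eval('var map = {"title": "hello"}; map;');

// Ends with a variable declaration, which produces no value,
// so the script as a whole evaluates to undefined.
var bad = eval('var map2 = {"title": "hello"};');
```

Under that assumption, a script that builds its result in a variable still needs a final line naming the variable, as the examples in this section do.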

Examples

Return content as a java.lang.String

var str = new java.lang.String("Java");
str;

Return content as a byte array

var str = new java.lang.String("Java");
str.getBytes('UTF-8');

Return content as a JavaScript array

var strings = ["hi", "bye"];
strings;

Return content as a JavaScript map

var map = {"hi": "bye", "bye": "hi", "number":1};
map;

Leverage the Fetcher

var webFetcher = _context.get("webFetcher");
if (null == webFetcher) {
  webFetcher = _fetcher.getWebFetcher();
  // it's possible to pass config options to getWebFetcher() as a map as well, e.g.:
  // _fetcher.getWebFetcher({"f.discardLinkURLQueries" : false });
  _context.put("webFetcher", webFetcher);
}
var webContent = webFetcher.fetch(id, lastModified, signature);
var jsoupDoc = webContent.getDocument();
if (null !== jsoupDoc) {
  // modify the Jsoup document or web-content as-needed here, adding new links, removing sections etc.
  // ...
  // ...
  webContent.setRawContent(jsoupDoc.toString().getBytes("UTF-8"));
}
webContent;

Configuration

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
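
One way to read this tip is as ordinary JSON string escaping, assuming the API takes its configuration as a JSON request body. The sketch below shows how the two characters typed into a UI field turn into \\t once they are serialized. The property name exampleProperty is hypothetical.

```javascript
// The two characters a user types into a UI field: a backslash followed by "t".
// (In JavaScript source the backslash itself has to be written as "\\".)
var typedInUi = "\\t";

// Serializing to JSON escapes the backslash, so the request body carries \\t.
var body = JSON.stringify({ "exampleProperty": typedInUi });

// Parsing the body yields the original two-character value again.
var received = JSON.parse(body).exampleProperty;
```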