      Javascript Connector

      The Javascript connector executes a JavaScript program that is compiled by the JDK. This program returns a content item which is handed off to the fetcher. The JavaScript program must be standard ECMAScript.

      You can use any Java class available to the connector's JDK ClassLoader to manipulate the content item within a function. As in Java, classes must be imported before you can refer to them by their simple names instead of their fully qualified names, e.g. to write String instead of java.lang.String. The java.lang package is not imported by default, because its classes would conflict with Object, Boolean, Math, and other built-in JavaScript objects. To import a Java class, use the JavaImporter object together with the with statement, which limits the scope of the imported Java packages and classes.

      var imports = new JavaImporter(java.lang.String);
      ...
      with (imports) {
          var name = new String("foo"); ...
      }

      To bind a Java class to a global variable instead, use the Java.type API extension. See this tutorial for details: http://winterbe.com/posts/2014/04/05/java8-nashorn-tutorial/
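
      The Java.type extension can be sketched as follows. This assumes the connector runs scripts on a Nashorn-style engine that exposes a global Java object; the names makeName and JString are illustrative and not part of the connector.

```javascript
// makeName is a hypothetical helper: it builds a java.lang.String
// without JavaImporter or a with block.
// Assumes a Nashorn-style global `Java` object is available.
function makeName(value) {
    // Java.type looks up the class by its fully qualified name.
    var JString = Java.type("java.lang.String");
    return new JString(value);
}
```

      The same var JString = Java.type("java.lang.String"); line can also sit at the top level of a script, which makes JString a global variable.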

      The JavaScript Program

      The Javascript context provides the following variables:

      Variable (Type): Description

      id (java.lang.String): The ID of the object to fetch. This is almost always the URI of the datasource to connect to and fetch content from.

      lastModified (long): The time at which the item was last touched, expressed as time since the epoch.

      signature (java.lang.String): An optional string used to compare versions of the ID being fetched, e.g. an ETag in a web crawl.

      content (crawler.common.MutableObject): A Content object that can be modified and returned, for fine-grained control over the return value.

      _fetcher (Fetcher): The current Fetcher instance (usually of type JavascriptFetcher). Use it to interact with the Fetcher, for example to obtain a WebFetcher instance with _fetcher.getWebFetcher().

      _context (java.util.Map): A map for data that should persist across calls to fetch(), e.g. an instance of WebFetcher obtained using _fetcher.getWebFetcher().
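
      As a sketch of how these variables can be used together, the hypothetical helper below counts fetches per ID in the context map and returns a String describing the item. It takes the context variables as parameters and relies only on the get() and put() methods of java.util.Map; the name describeItem and the "fetchCount:" key prefix are assumptions made for illustration.

```javascript
// describeItem is a hypothetical helper, not part of the connector API.
// `context` is expected to behave like java.util.Map (get/put).
function describeItem(id, lastModified, signature, context) {
    var key = "fetchCount:" + id;
    var previous = context.get(key);   // null on the first call for this ID
    var count = (previous == null) ? 1 : Number(previous) + 1;
    context.put(key, count);           // kept for later calls to fetch()

    var text = "id=" + id + ", lastModified=" + lastModified + ", fetches=" + count;
    if (signature != null) {
        text += ", signature=" + signature;   // signature is optional
    }
    return text;
}
```

      In the connector, a script using this helper would end with describeItem(id, lastModified, signature, _context); so that the returned String becomes the result of the script.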

      The program must return one of the following kinds of objects:

      Object: Description

      String: A string object. It is converted to UTF-8 bytes, set as the raw content on a common.crawler.Content object, and returned from the fetch() method.

      byte[]: A byte array. The array is set on a common.crawler.Content object and returned from the fetch() method.

      common.crawler.MutableContent: For complete control over the return from fetch(), make changes to the content object provided in the context and return it. Do not create a new object.

      An array of Objects: The array is converted to Embedded Content. The Fetcher returns a parent Content object that has a "Container" discardMessage. The Embedded Content on that container is generated by calling toString() on the objects in the array.

      A JavaScript Map: The map is converted to fields on the returned Content item.

      If the script is implemented as a function, its return statement must return one of the above types. If the script is not function-based, its last line must evaluate to one of these types.
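
      The two script shapes can be sketched as follows. This assumes that a function-based script calls its own function on the last line; how the connector invokes a function is not stated here, so treat that as an assumption. The name buildFields and the field names are illustrative only.

```javascript
// Function-based form: the return statement yields one of the accepted
// types, here a JavaScript map whose entries become fields.
// buildFields is a hypothetical name.
function buildFields(sourceId) {
    var fields = {};
    fields["source"] = String(sourceId);
    fields["length"] = String(sourceId).length;
    return fields;
}

// Script-style form: the last expression is the result of the script.
var result = buildFields("http://example.com/doc");
result;
```

      In the connector, the id variable would be passed in place of the literal URL.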

      Examples

      Return content as a java.lang.String

      var str = new java.lang.String("Java");
      str;

      Return content as a byte array

      var str = new java.lang.String("Java");
      // getBytes returns a Java byte[]; as the last expression, it is the script's result
      str.getBytes('UTF-8');

      Return content as a JavaScript array

      var strings = ["hi", "bye"];
      strings;

      Return content as a JavaScript map

      var map = {"hi": "bye", "bye": "hi", "number":1};
      map;

      Leverage the Fetcher

      var webFetcher = _context.get("webFetcher");
      if (null == webFetcher) {
        webFetcher = _fetcher.getWebFetcher();
        // it is possible to pass config options to getWebFetcher() as a map as well, e.g.:
        // _fetcher.getWebFetcher({"f.discardLinkURLQueries" : false });
        _context.put("webFetcher", webFetcher);
      }
      var webContent = webFetcher.fetch(id, lastModified, signature);
      var jsoupDoc = webContent.getDocument();
      if (null !== jsoupDoc) {
        // modify the Jsoup document or web-content as-needed here, adding new links, removing sections etc.
        // ...
        // ...
        webContent.setRawContent(jsoupDoc.toString().getBytes("UTF-8"));
      }
      webContent;
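
      The "modify the Jsoup document" step in the example above might look like the following sketch. It assumes getDocument() returns an org.jsoup.nodes.Document and uses Jsoup's select() and Elements.remove(); the helper name stripElements and the selector shown in the usage note are illustrative.

```javascript
// stripElements is a hypothetical helper: it removes every element that
// matches the CSS selector from a Jsoup document and returns the same
// document so that calls can be chained.
function stripElements(jsoupDoc, cssSelector) {
    // Document.select returns the matching Elements; remove() detaches
    // each of them from the document tree.
    jsoupDoc.select(cssSelector).remove();
    return jsoupDoc;
}
```

      Inside the if block, webContent.setRawContent(stripElements(jsoupDoc, "script, style").toString().getBytes("UTF-8")); would then store the page without its script and style elements.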