
Create and Run a SQL Aggregation Job

You can perform a SQL aggregation on a signals collection for a datasource (or on another collection), using either the Fusion UI or the Fusion API.

Preliminaries

Before you can create and run a SQL aggregation job, you must create an app, create a collection (or use the default collection for the app), and add a datasource. Before you run a SQL aggregation job, you need signal data. Otherwise, there is nothing to aggregate.
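For illustration, a raw click signal in a signals collection might look like the following. The field names (`type_s`, `doc_id_s`, `query_s`, `count_i`, `timestamp_tdt`) follow Solr dynamic-field conventions and match the fields referenced in the SQL example later in this article; adjust them to match your own signal schema:

```json
{
  "id": "signal-00001",
  "type_s": "click",
  "doc_id_s": "product-1234",
  "query_s": "running shoes",
  "count_i": 1,
  "timestamp_tdt": "2019-06-01T12:00:00Z"
}
```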

Use the Fusion UI

You can use the Fusion UI to perform a SQL aggregation.

Set up a SQL aggregation job
  1. With the app open, navigate to Collections > Jobs.

  2. Click Add and select Aggregation from the dropdown list.

  3. Specify an arbitrary Spark Job ID.

  4. For the Source Collection, select the collection that contains the data to aggregate.

    Tip
    This is not the base collection. For example, to aggregate the signals in experiment_signals, you would select experiment_signals, not experiment.
  5. Under Aggregation Settings, expand SQL Aggregation.

  6. Enter or paste SQL in the SQL text box. Optionally, click Open Editor to open a dialog box with a larger editor. Click Close to close the dialog box.

  7. Click Save to save the job.
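As an example of SQL you might enter in step 6, the following sketch aggregates click signals by query and document. It assumes Solr-style dynamic fields (`count_i`, `type_s`, `timestamp_tdt`) and a signals collection named `experiment_signals`; substitute your own collection and field names:

```sql
SELECT SUM(count_i) AS aggr_count_i,
       query        AS query_s,
       doc_id       AS doc_id_s
FROM   experiment_signals
WHERE  type_s = 'click'
GROUP  BY query, doc_id
```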

Run a SQL aggregation job
  1. With the app open, navigate to Collections > Jobs.

  2. In the list of jobs, click the job you want to run, and then click Run.

  3. Click Start.

  4. Click Close to close the job management part of the Fusion UI.

Use the Fusion API

You can use the Fusion API to perform a SQL aggregation. For example:

Configure a SQL aggregation job. Here, the JSON configuration below is saved in a file (for example, sql_aggregation.json) and sent as the request body:
curl -u user:pass -X POST -H "Content-Type: application/json" -d @sql_aggregation.json http://localhost:{api-port}/api/spark/configurations
{
  "type" : "spark",
  "id" : "experiment_click_signals_aggregation",
  "definition" : {
    "id" : "experiment_click_signals_aggregation",
    "sql" : "SELECT SUM(count_i) AS aggr_count_i, query AS query_s, doc_id AS doc_id_s, time_decay(count_i, timestamp_tdt) AS weight_d FROM default_signals WHERE type_s='click' GROUP BY query, doc_id",
    "selectQuery" : "*:*",
    "outputPipeline" : "_system",
    "rollupAggregator" : "SQL",
    "sourceRemove" : false,
    "sourceCatchup" : true,
    "outputRollup" : true,
    "parameters" : [ {
      "key" : "optimizeOutput",
      "value" : "4"
    } ]
  },
  "inputCollection" : "experiment_click_signals",
  "rows" : 10000
}
Run a SQL aggregation job:
curl -u user:pass -X POST -H "Content-Type: application/json" http://localhost:{api-port}/api/jobs/spark:experiment_click_signals_aggregation/actions -d '{"action": "start"}'
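After starting the job, you can check on it by requesting the job resource for the same ID. This is a sketch; the exact shape of the status response, and whether your Fusion version exposes this path, should be verified against the API reference for your release:

```shell
curl -u user:pass http://localhost:{api-port}/api/jobs/spark:experiment_click_signals_aggregation
```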