    Fusion 5.9

    SQL Aggregation Job configuration specifications

    A Spark SQL aggregation job where user-defined parameters are injected into a built-in SQL template at runtime.

    Aggregation jobs compile your raw signals into aggregated signals. Most Managed Fusion jobs that consume signals require aggregated signals.

    Default job name: COLLECTION_NAME_click_signals_aggregation

    Input: Raw signals (the COLLECTION_NAME_signals collection by default)

    Output: Aggregated signals (the COLLECTION_NAME_signals_aggr collection by default)

    Required signals fields:

        query              required
        count_i            required
        type               required
        timestamp_tdt      required
        user_id            required
        doc_id
        session_id
        fusion_query_id    required if you are using response signals
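
    For reference, a raw click signal containing these fields might look like the following (the field values are illustrative only):

        {
          "query": "red shoes",
          "count_i": 1,
          "type": "click",
          "timestamp_tdt": "2024-05-01T12:34:56Z",
          "user_id": "u-1001",
          "doc_id": "PROD-42",
          "session_id": "s-9f2c",
          "fusion_query_id": "q-7d3a"
        }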

    Aggregation properties

    The aggregation process is specified by an aggregation type consisting of the following list of properties:

    id - Aggregation ID
    groupingFields - List of signal field names
    signalTypes - List of signal types
    aggregator - Symbolic name of the aggregator implementation
    selectQuery - Query string, default *:*
    sort - Ordering of aggregated signals
    timeRange - String specifying a time range, e.g., [* TO NOW]
    outputPipeline - Pipeline ID for processing aggregated events
    outputCollection - Output collection name
    rollupPipeline - Rollup pipeline ID
    rollupAggregator - Name of the aggregator implementation used for rollups
    sourceRemove - Boolean, default is false
    sourceCatchup - Boolean, default is true
    outputRollup - Boolean, default is true
    aggregates - List of aggregation functions
    params - Arbitrary parameters to be used by specific aggregator implementations
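
    As a rough sketch, an aggregation definition combining several of these properties might look like the following (the exact schema depends on your Fusion version, and all values shown are illustrative):

        {
          "id": "click-signals-aggregation",
          "groupingFields": ["query", "doc_id"],
          "signalTypes": ["click"],
          "selectQuery": "*:*",
          "timeRange": "[* TO NOW]",
          "outputCollection": "COLLECTION_NAME_signals_aggr",
          "sourceCatchup": true,
          "sourceRemove": false,
          "outputRollup": true
        }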

    Use this job when you want to aggregate your raw signals into aggregated signals.

    id - string (required)

    The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_). Maximum length: 63 characters.

    <= 63 characters

    Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?

    sparkConfig - array[object]

    Spark configuration settings.

    Object attributes:
        key (string, required) - display name: Parameter Name
        value (string) - display name: Parameter Value
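
    For example, sparkConfig entries are key/value pairs; the settings below are standard Spark properties with illustrative values:

        {
          "sparkConfig": [
            { "key": "spark.executor.memory", "value": "4g" },
            { "key": "spark.sql.shuffle.partitions", "value": "200" }
          ]
        }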

    inputCollection - string (required)

    Collection containing signals to be aggregated.

    outputCollection - string

    The collection to write the aggregates to. This property is required if the selected output/rollup pipeline requires it (the default pipeline does). A special value of '-' disables the output.

    >= 1 characters

    rows - integer

    Number of rows to read from the source collection per request.

    Default: 10000

    sql - string (required)

    Use SQL to perform the aggregation. You do not need to include a time range filter in the WHERE clause as it gets applied automatically before executing the SQL statement.

    >= 1 characters
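
    For example, a simple click aggregation that sums click counts per query/document pair might use SQL like the following. This is a sketch, not the built-in template; it assumes the raw signals are exposed to Spark SQL under the input collection's name, so adjust the table name and output field names to match your setup:

        {
          "sql": "SELECT query, doc_id, SUM(count_i) AS aggr_count_i FROM COLLECTION_NAME_signals WHERE type = 'click' GROUP BY query, doc_id"
        }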

    rollupSql - string

    Use SQL to perform a rollup of previously aggregated docs. If left blank, the aggregation framework will supply a default SQL query to rollup aggregated metrics.

    >= 1 characters
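
    If you do supply a rollup query, it typically re-aggregates the metrics produced by the main SQL statement. A hypothetical sketch, assuming the previously aggregated docs are registered under a placeholder table name (previous_aggregates) and that the main query produced an aggr_count_i field:

        {
          "rollupSql": "SELECT query, doc_id, SUM(aggr_count_i) AS aggr_count_i FROM previous_aggregates GROUP BY query, doc_id"
        }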

    readOptions - array[object]

    Additional configuration settings to fine-tune how input records are read for this aggregation.

    Object attributes:
        key (string, required) - display name: Parameter Name
        value (string) - display name: Parameter Value

    sourceCatchup - boolean

    If checked, only aggregate new signals created since the last time the job was run successfully. If there is a record of such a previous run, it overrides the start of the time range set in the 'timeRange' property. If unchecked, all matching signals are aggregated and any previously aggregated docs are deleted to avoid double counting.

    Default: true

    sourceRemove - boolean

    If checked, remove signals from the source collection once the aggregation job has finished running.

    Default: false

    aggregationTime - string

    Timestamp to use for the aggregation results. Defaults to NOW.

    referenceTime - string

    Timestamp to use for computing decays and to determine the value of NOW.

    skipCheckEnabled - boolean

    If the catch-up flag is enabled and this field is checked, the job framework will execute a fast Solr query to determine if this run can be skipped.

    Default: true

    skipJobIfSignalsEmpty - boolean

    Skip the job run if the signals collection is empty.

    parameters - array[object]

    Other aggregation parameters (e.g., the timestamp field).

    Object attributes:
        key (string, required) - display name: Parameter Name
        value (string) - display name: Parameter Value

    signalTypes - array[string]

    The signal types. If not set, any signal type is selected.

    selectQuery - string

    The query to select the desired input documents.

    >= 1 characters

    Default: *:*
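
    For example, to run the aggregation over click signals only, you could select them with a Solr query on the signal type field (assuming the field is named type, as in the signals schema above):

        {
          "selectQuery": "type:click"
        }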

    timeRange - string

    The time range to select signals on.

    >= 1 characters

    useNaturalKey - boolean

    Use a natural key provided in the raw signals data for aggregation, rather than relying on Solr UUIDs. Aggregation jobs migrated from Fusion 4 will need this set to false.

    Default: true

    optimizeSegments - integer

    If set to a value above 0, the aggregator job will optimize the resulting Solr collection into this many segments.

    >= 0

    Default: 0

    dataFormat - string (required)

    Spark-compatible format of the input data (such as 'solr', 'parquet', or 'orc').

    >= 1 characters

    Default: solr

    sparkSQL - string

    Use this field to create a Spark SQL query for filtering your input data. The input data is registered as the table spark_input.

    Default: SELECT * from spark_input
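
    For example, to restrict the job to click signals only, you could filter the registered spark_input table (field names follow the signals schema shown earlier):

        {
          "sparkSQL": "SELECT * FROM spark_input WHERE type = 'click'"
        }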

    sparkPartitions - integer

    Spark will re-partition the input to have this number of partitions. Increase for greater parallelism.

    Default: 200

    type - string (required)

    Default: aggregation

    Allowed values: aggregation
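
    Putting the pieces together, a minimal SQL aggregation job definition might look like the sketch below. The property names come from the reference above; the collection names, signal fields, and SQL statement are illustrative and would normally be filled in by the Fusion UI defaults:

        {
          "id": "COLLECTION_NAME_click_signals_aggregation",
          "type": "aggregation",
          "inputCollection": "COLLECTION_NAME_signals",
          "outputCollection": "COLLECTION_NAME_signals_aggr",
          "dataFormat": "solr",
          "selectQuery": "*:*",
          "signalTypes": ["click"],
          "sourceCatchup": true,
          "sourceRemove": false,
          "sql": "SELECT query, doc_id, SUM(count_i) AS aggr_count_i FROM COLLECTION_NAME_signals WHERE type = 'click' GROUP BY query, doc_id"
        }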