Analytics Component

The Analytics Component allows users to calculate complex statistical aggregations over result sets.

The component enables interacting with data in a variety of ways, both through a diverse set of analytics functions as well as powerful faceting functionality. The standard facets are supported within the analytics component with additions that leverage its analytical capabilities.

Analytics Configuration

The Analytics component is in a contrib module, therefore it will need to be enabled in the solrconfig.xml for each collection where you would like to use it.

Since the Analytics framework is a search component, it must be declared as such and added to the search handler.

For distributed analytics requests over cloud collections, the component uses the AnalyticsHandler strictly for inter-shard communication. The Analytics Handler should not be used by users to submit analytics requests.

To configure Solr to use the Analytics Component, the first step is to add a lib directive so Solr loads the Analytic Component classes (for more about the lib directive, see Lib Directives in SolrConfig). In the section of solrconfig.xml where the default lib directive are, add a line:

<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-analytics-\d.*\.jar" />

Next you need to enable the request handler and search component. Add the following lines to solrconfig.xml, near the defintions for other request handlers:

solrconfig.xml
<!-- To handle user requests -->
<searchComponent name="analytics" class="org.apache.solr.handler.component.AnalyticsComponent" />

<requestHandler name="/select" class="solr.SearchHandler">
    <arr name="last_components">
        <str>analytics</str>
    </arr>
</requestHandler>

<!-- For inter-shard communication during distributed requests -->
<requestHandler name="/analytics" class="org.apache.solr.handler.AnalyticsHandler" />

For these changes to take effect, restart Solr or reload the core or collection.

Request Syntax

An Analytics request is passed to Solr with the parameter analytics in a request sent to the Search Handler. Since the analytics request is sent inside of a search handler request, it will compute results based on the result set determined by the search handler.

For example, this curl command encodes and POSTs a simple analytics request to the the search handler:

curl --data-urlencode 'analytics={
   "expressions" : {
      "revenue" : "sum(mult(price,quantity))"
      }
   }'
   http://localhost:8983/solr/sales/select?q=*:*&wt=json&rows=0

There are 3 main parts of any analytics request:

Expressions

A list of calculations to perform over the entire result set. Expressions aggregate the search results into a single value to return. This list is entirely independent of the expressions defined in each of the groupings. Find out more about them in the section Expressions.

Functions

One or more Variable Functions to be used throughout the rest of the request. These are essentially lambda functions and can be combined in a number of ways. These functions for the expressions defined in expressions as well as groupings.

Groupings

The list of Groupings to calculate in addition to the expressions. Groupings hold a set of facets and a list of expressions to compute over those facets. The expressions defined in a grouping are only calculated over the facets defined in that grouping.

Note
Optional Parameters
Either the expressions or the groupings parameter must be present in the request, or else there will be no analytics to compute. The functions parameter is always optional.
Example Analytics Request
{
    "functions": {
        "sale()": "mult(price,quantity)"
    },
    "expressions" : {
        "max_sale" : "max(sale())",
        "med_sale" : "median(sale())"
    },
    "groupings" : {
        "sales" : {
            "expressions" : {
                "stddev_sale" : "stddev(sale())",
                "min_price" : "min(price)",
                "max_quantity" : "max(quantity)"
            },
            "facets" : {
                "category" : {
                    "type" : "value",
                    "expression" : "fill_missing(category, 'No Category')",
                    "sort" : {
                        "criteria" : [
                            {
                                "type" : "expression",
                                "expression" : "min_price",
                                "direction" : "ascending"
                            },
                            {
                                "type" : "facetvalue",
                                "direction" : "descending"
                            }
                        ],
                        "limit" : 10
                    }
                },
                "temps" : {
                    "type" : "query",
                    "queries" : {
                        "hot" : "temp:[90 TO *]",
                        "cold" : "temp:[* TO 50]"
                    }
                }
            }
        }
    }
}

Expressions

Expressions are the way to request pieces of information from the analytics component. These are the statistical expressions that you want computed and returned in your response.

Constructing an Expression

Expression Components

An expression is built using fields, constants, mapping functions and reduction functions. The ways that these can be defined are described below.

Sources
Mapping Functions

Mapping functions map values for each Solr Document or Reduction. The provided mapping functions are detailed in the Analytics Mapping Function Reference.

  • Unreduced Mapping: Mapping a Field with another Field or Constant returns a value for every Solr Document. Unreduced mapping functions can take fields, constants as well as other unreduced mapping functions as input.

  • Reduced Mapping: Mapping a Reduction Function with another Reduction Function or Constant returns a single value.

Reduction Functions

Functions that reduce the values of sources and/or unreduced mapping functions for every Solr Document to a single value. The provided reduction functions are detailed in the Analytics Reduction Function Reference.

Component Ordering

The expression components must be used in the following order to create valid expressions.

  1. Reduced Mapping Function

    1. Constants

    2. Reduction Function

      1. Sources

      2. Unreduced Mapping Function

        1. Sources

        2. Unreduced Mapping Function

    3. Reduced Mapping Function

  2. Reduction Function

This ordering is based on the following rules:

  • No reduction function can be an argument of another reduction function. Since all reduction is done together in one step, one reduction function cannot rely on the result of another.

  • No fields can be left unreduced, since the analytics component cannot return a list of values for an expression (one for every document). Every expression must be reduced to a single value.

  • Mapping functions are not necessary when creating functions, however as many nested mappings as needed can be used.

  • Nested mapping functions must be the same type, so either both must be unreduced or both must be reduced. A reduced mapping function cannot take an unreduced mapping function as a parameter and vice versa.

Example Construction

With the above definitions and ordering, an example expression can be broken up into its components:

div(sum(a,fill_missing(b,0)),add(10.5,count(mult(a,c)))))

As a whole, this is a reduced mapping function. The div function is a reduced mapping function since it is a provided mapping function and has reduced arguments.

If we break down the expression further:

  • sum(a,fill_missing(b,0)): Reduction Function
    sum is a provided reduction function.

    • a: Field

    • fill_missing(b,0): Unreduced Mapping Function
      fill_missing is an unreduced mapping function since it is a provided mapping function and has a field argument.

      • b: Field

      • 0: Constant

  • add(10.5,count(mult(a,c))): Reduced Mapping Function
    add is a reduced mapping function since it is a provided mapping function and has a reduction function argument.

Expression Cardinality (Multi-Valued and Single-Valued)

The root of all multi-valued expressions are multi-valued fields. Single-valued expressions can be started with constants or single-valued fields. All single-valued expressions can be treated as multi-valued expressions that contain one value.

Single-valued expressions and multi-valued expressions can be used together in many mapping functions, as well as multi-valued expressions being used alone, and many single-valued expressions being used together. For example:

add(<single-valued double>, <single-valued double>, …​)

Returns a single-valued double expression where the value of the values of each expression are added.

add(<single-valued double>, <multi-valued double>)

Returns a multi-valued double expression where each value of the second expression is added to the single value of the first expression.

add(<multi-valued double>, <single-valued double>)

Acts the same as the above function.

add(<multi-valued double>)

Returns a single-valued double expression which is the sum of the multiple values of the parameter expression.

Types and Implicit Casting

The new analytics component currently supports the types listed in the below table. These types have one-way implicit casting enabled for the following relationships:

Type Implicitly Casts To

Boolean

String

Date

Long, String

Integer

Long, Float, Double, String

Long

Double, String

Float

Double, String

Double

String

String

none

An implicit cast means that if a function requires a certain type of value as a parameter, arguments will be automatically converted to that type if it is possible.

For example, concat() only accepts string parameters and since all types can be implicitly cast to strings, any type is accepted as an argument.

This also goes for dynamically typed functions. fill_missing() requires two arguments of the same type. However, two types that implicitly cast to the same type can also be used.

For example, fill_missing(<long>,<float>) will be cast to fill_missing(<double>,<double>) since long cannot be cast to float and float cannot be cast to long implicitly.

There is an ordering to implicit casts, where the more specialized type is ordered ahead of the more general type. Therefore even though both long and float can be implicitly cast to double and string, they will be cast to double. This is because double is a more specialized type than string, which every type can be cast to.

The ordering is the same as their order in the above table.

Cardinality can also be implicitly cast. Single-valued expressions can always be implicitly cast to multi-valued expressions, since all single-valued expressions are multi-valued expressions with one value.

Implicit casting will only occur when an expression will not "compile" without it. If an expression follows all typing rules initially, no implicit casting will occur. Certain functions such as string(), date(), round(), floor(), and ceil() act as explicit casts, declaring the type that is desired. However round(), floor() and cell() can return either int or long, depending on the argument type.

Variable Functions

Variable functions are a way to shorten your expressions and make writing analytics queries easier. They are essentially lambda functions defined in a request.

Example Basic Function
{
    "functions" : {
        "sale()" : "mult(price,quantity)"
    },
    "expressions" : {
        "max_sale" : "max(sale())",
        "med_sale" : "median(sale())"
    }
}

In the above request, instead of writing mult(price,quantity) twice, a function sale() was defined to abstract this idea. Then that function was used in the multiple expressions.

Suppose that we want to look at the sales of specific categories:

{
    "functions" : {
        "clothing_sale()" : "filter(mult(price,quantity),equal(category,'Clothing'))",
        "kitchen_sale()" : "filter(mult(price,quantity),equal(category,\"Kitchen\"))"
    },
    "expressions" : {
        "max_clothing_sale" : "max(clothing_sale())"
      , "med_clothing_sale" : "median(clothing_sale())"
      , "max_kitchen_sale" : "max(kitchen_sale())"
      , "med_kitchen_sale" : "median(kitchen_sale())"
    }
}

Arguments

Instead of making a function for each category, it would be much easier to use category as an input to the sale() function. An example of this functionality is shown below:

Example Function with Arguments
{
    "functions" : {
        "sale(cat)" : "filter(mult(price,quantity),equal(category,cat))"
    },
    "expressions" : {
        "max_clothing_sale" : "max(sale(\"Clothing\"))"
      , "med_clothing_sale" : "median(sale('Clothing'))"
      , "max_kitchen_sale" : "max(sale(\"Kitchen\"))"
      , "med_kitchen_sale" : "median(sale('Kitchen'))"
    }
}

Variable Functions can take any number of arguments and use them in the function expression as if they were a field or constant.

Variable Length Arguments

There are analytics functions that take a variable amount of parameters. Therefore there are use cases where variable functions would need to take a variable amount of parameters.

For example, maybe there are multiple, yet undetermined, number of components to the price of a product. Functions can take a variable length of parameters if the last parameter is followed by ..

Example Function with a Variable Length Argument
{
    "functions" : {
        "sale(cat, costs..)" : "filter(mult(add(costs),quantity),equal(category,cat))"
    },
    "expressions" : {
        "max_clothing_sale" : "max(sale('Clothing', material, tariff, tax))"
      , "med_clothing_sale" : "median(sale('Clothing', material, tariff, tax))"
      , "max_kitchen_sale" : "max(sale('Kitchen', material, construction))"
      , "med_kitchen_sale" : "median(sale('Kitchen', material, construction))"
    }
}

In the above example a variable length argument is used to encapsulate all of the costs to use for a product. There is no definite number of arguments requested for the variable length parameter, therefore the clothing expressions can use 3 and the kitchen expressions can use 2. When the sale() function is called, costs is expanded to the arguments given.

Therefore in the above request, inside of the sale function:

  • add(costs)

is expanded to both of the following:

  • add(material, tariff, tax)

  • add(material, construction)

For-Each Functions

Caution
Advanced Functionality

The following function details are for advanced requests.

Although the above functionality allows for an undefined number of arguments to be passed to a function, it does not allow for interacting with those arguments.

Many times we might want to wrap each argument in additional functions. For example maybe we want to be able to look at multiple categories at the same time. So we want to see if category EQUALS x OR category EQUALS y and so on.

In order to do this we need to use for-each lambda functions, which transform each value of the variable length parameter. The for-each is started with the : character after the variable length parameter.

Example Function with a For-Each
{
    "functions" : {
        "sale(cats..)" : "filter(mult(price,quantity),or(cats:equal(category,_)))"
    },
    "expressions" : {
        "max_sale_1" : "max(sale('Clothing', 'Kitchen'))"
      , "med_sale_1" : "median(sale('Clothing', 'Kitchen'))"
      , "max_sale_2" : "max(sale('Electronics', 'Entertainment', 'Travel'))"
      , "med_sale_2" : "median(sale('Electronics', 'Entertainment', 'Travel'))"
    }
}

In this example, cats: is the syntax that starts a for-each lambda function over every parameter cats, and the _ character is used to refer to the value of cats in each iteration in the for-each. When sale("Clothing", "Kitchen") is called, the lambda function equal(category,_) is applied to both Clothing and Kitchen inside of the or() function.

Using all of these rules, the expression:

`sale("Clothing","Kitchen")`

is expanded to:

`filter(mult(price,quantity),or(equal(category,"Kitchen"),equal(category,"Clothing")))`

by the expression parser.

Groupings And Facets

Facets, much like in other parts of Solr, allow analytics results to be broken up and grouped by attributes of the data that the expressions are being calculated over.

The currently available facets for use in the analytics component are Value Facets, Pivot Facets, Range Facets and Query Facets. Each facet is required to have a unique name within the grouping it is defined in, and no facet can be defined outside of a grouping.

Groupings allow users to calculate the same grouping of expressions over a set of facets. Groupings must have both expressions and facets given.

Example Base Facet Request
{
    "functions" : {
        "sale()" : "mult(price,quantity)"
    },
    "groupings" : {
        "sales_numbers" : {
            "expressions" : {
                "max_sale" : "max(sale())",
                "med_sale" : "median(sale())"
            },
            "facets" : {
                "<name>" : "< facet request >"
            }
        }
    }
}
Example Base Facet Response
{
    "analytics_response" : {
        "groupings" : {
            "sales_numbers" : {
                "<name>" : "< facet response >"
            }
        }
    }
}

Facet Sorting

Some Analytics facets allow for complex sorting of their results. The two current sortable facets are Analytic Value Facets and Analytic Pivot Facets.

Parameters

criteria

The list of criteria to sort the facet by.

It takes the following parameters:

type

The type of sort. There are two possible values:

  • expression: Sort by the value of an expression defined in the same grouping.

  • facetvalue: Sort by the string-representation of the facet value.

Direction

(Optional) The direction to sort.

  • ascending (Default)

  • descending

expression

When type = expression, the name of an expression defined in the same grouping.

limit

Limit the number of returned facet values to the top N. (Optional)

offset

When a limit is set, skip the top N facet values. (Optional)

Example Sort Request
{
    "criteria" : [
        {
            "type" : "expression",
            "expression" : "max_sale",
            "direction" : "ascending"
        },
        {
            "type" : "facetvalue",
            "direction" : "descending"
        }
    ],
    "limit" : 10,
    "offset" : 5
}

Value Facets

Value Facets are used to group documents by the value of a mapping expression applied to each document. Mapping expressions are expressions that do not include a reduction function.

For more information, refer to the Expressions section.

  • mult(quantity, sum(price, tax)): breakup documents by the revenue generated

  • fillmissing(state, "N/A"): breakup documents by state, where N/A is used when the document doesn’t contain a state

Value Facets can be sorted.

Parameters

expression

The expression to choose a facet bucket for each document.

sort

A sort for the results of the pivot.

Note
Optional Parameters
The sort parameter is optional.
Example Value Facet Request
{
    "type" : "value",
    "expression" : "fillmissing(category,'No Category')",
    "sort" : {}
}
Example Value Facet Response
[
    { "..." : "..." },
    {
        "value" : "Electronics",
        "results" : {
            "max_sale" : 103.75,
            "med_sale" : 15.5
        }
    },
    {
        "value" : "Kitchen",
        "results" : {
            "max_sale" : 88.25,
            "med_sale" : 11.37
        }
    },
    { "..." : "..." }
]
Note
Field Facets
This is a replacement for Field Facets in the original Analytics Component. Field Facet functionality is maintained in Value Facets by using the name of a field as the expression.

Analytic Pivot Facets

Pivot Facets are used to group documents by the value of multiple mapping expressions applied to each document.

Pivot Facets work much like layers of Analytic Value Facets. A list of pivots is required, and the order of the list directly impacts the results returned. The first pivot given will be treated like a normal value facet. The second pivot given will be treated like one value facet for each value of the first pivot. Each of these second-level value facets will be limited to the documents in their first-level facet bucket. This continues for however many pivots are provided.

Sorting is enabled on a per-pivot basis. This means that if your top pivot has a sort with limit:1, then only that first value of the facet will be drilled down into. Sorting in each pivot is independent of the other pivots.

Parameters

pivots

The list of pivots to calculate a drill-down facet for. The list is ordered by top-most to bottom-most level.

name

The name of the pivot.

expression

The expression to choose a facet bucket for each document.

sort

A sort for the results of the pivot.

Note
Optional Parameters
The sort parameter within the pivot object is optional, and can be given in any, none or all of the provided pivots.
Example Pivot Facet Request
{
    "type" : "pivot",
    "pivots" : [
        {
            "name" : "country",
            "expression" : "country",
            "sort" : {}
        },
        {
            "name" : "state",
            "expression" : "fillmissing(state, fillmissing(providence, territory))"
        },
        {
            "name" : "city",
            "expression" : "fillmissing(city, 'N/A')",
            "sort" : {}
        }
    ]
}
Example Pivot Facet Response
[
    { "..." : "..." },
    {
        "pivot" : "Country",
        "value" : "USA",
        "results" : {
            "max_sale" : 103.75,
            "med_sale" : 15.5
        },
        "children" : [
            { "..." : "..." },
            {
                "pivot" : "State",
                "value" : "Texas",
                "results" : {
                    "max_sale" : 99.2,
                    "med_sale" : 20.35
                },
                "children" : [
                    { "..." : "..." },
                    {
                        "pivot" : "City",
                        "value" : "Austin",
                        "results" : {
                            "max_sale" : 94.34,
                            "med_sale" : 17.60
                        }
                    },
                    { "..." : "..." }
                ]
            },
            { "..." : "..." }
        ]
    },
    { "..." : "..." }
]

Analytics Range Facets

Range Facets are used to group documents by the value of a field into a given set of ranges. The inputs for analytics range facets are identical to those used for Solr range facets. Refer to the Range Facet documentation for additional questions regarding use.

Parameters

field

Field to be faceted over

start

The bottom end of the range

end

The top end of the range

gap

A list of range gaps to generate facet buckets. If the buckets do not add up to fit the start to end range, then the last gap value will repeated as many times as needed to fill any unused range.

hardend

Whether to cutoff the last facet bucket range at the end value if it spills over. Defaults to false.

include

The boundaries to include in the facet buckets. Defaults to lower.

  • lower - All gap-based ranges include their lower bound.

  • upper - All gap-based ranges include their upper bound.

  • edge - The first and last gap ranges include their edge bounds (lower for the first one, upper for the last one) even if the corresponding upper/lower option is not specified.

  • outer - The before and after ranges will be inclusive of their bounds, even if the first or last ranges already include those boundaries.

  • all - Includes all options: lower, upper, edge, and outer

others

Additional ranges to include in the facet. Defaults to none.

  • before - All records with field values lower then lower bound of the first range.

  • after - All records with field values greater then the upper bound of the last range.

  • between - All records with field values between the lower bound of the first range and the upper bound of the last range.

  • none - Include facet buckets for none of the above.

  • all - Include facet buckets for before, after and between.

Note
Optional Parameters
The hardend, include and others parameters are all optional.
Example Range Facet Request
{
    "type" : "range",
    "field" : "price",
    "start" : "0",
    "end" : "100",
    "gap" : [
        "5",
        "10",
        "10",
        "25"
    ],
    "hardend" : true,
    "include" : [
        "lower",
        "upper"
    ],
    "others" : [
        "after",
        "between"
    ]
}
Example Range Facet Response
[
    {
        "value" : "[0 TO 5]",
        "results" : {
            "max_sale" : 4.75,
            "med_sale" : 3.45
        }
    },
    {
        "value" : "[5 TO 15]",
        "results" : {
            "max_sale" : 13.25,
            "med_sale" : 10.20
        }
    },
    {
        "value" : "[15 TO 25]",
        "results" : {
            "max_sale" : 22.75,
            "med_sale" : 18.50
        }
    },
    {
        "value" : "[25 TO 50]",
        "results" : {
            "max_sale" : 47.55,
            "med_sale" : 30.33
        }
    },
    {
        "value" : "[50 TO 75]",
        "results" : {
            "max_sale" : 70.25,
            "med_sale" : 64.54
        }
    },
    { "..." : "..." }
]

Query Facets

Query Facets are used to group documents by given set of queries.

Parameters

queries

The list of queries to facet by.

Example Query Facet Request
{
    "type" : "query",
    "queries" : {
        "high_quantity" : "quantity:[ 5 TO 14 ] AND price:[ 100 TO * ]",
        "low_quantity" : "quantity:[ 1 TO 4 ] AND price:[ 100 TO * ]"
    }
}
Example Query Facet Response
[
    {
        "value" : "high_quantity",
        "results" : {
            "max_sale" : 4.75,
            "med_sale" : 3.45
        }
    },
    {
        "value" : "low_quantity",
        "results" : {
            "max_sale" : 13.25,
            "med_sale" : 10.20
        }
    }
]