# Fusion SQL Statistics

Sampling is often used in statistical analysis to gain an understanding of the **distribution**,
**shape**, and **dispersion** of a variable or the relationship between variables.

Fusion SQL returns a random sample for all basic selects that **do not** contain an ORDER BY clause. The random
sample is designed to return a uniform distribution of samples that match a query. The sample can be used to
infer statistical information about the larger result set.

The example below returns a random sample of single field:

`select filesize_d from logs where year_i = 2019`

If no limit is specified the sample size will be 25000. To increase the sample size add a limit larger than 25000.

`select filesize_d from logs where year_i = 2019 limit 50000`

The ability to subset the data with a query and then sample from that subset is called **Stratified Random
Sampling**. Stratified Random Sampling is an important statistical technique used to better understand sub-populations of a larger data set.

## Descriptive statistics

Sub-queries can be used to return random samples for fast, often sub-second, statistical analysis. For example:

```
select count(*) as samplesize,
mean(filesize_d) as mean,
min(filesize_d) as min,
max(filesize_d) as max,
approx_percentile(filesize_d, .50) as median,
variance(filesize_d) as variance,
std(filesize_d) as standard_dev,
skewness(filesize_d) as skewness,
kurtosis(filesize_d) as kurtosis,
sum(filesize_d) as sum
from (select filesize_d from logs where year_i = 2019 limit 50000)
```

In the example above, the sub-query is returning a random sample of 50000 results which is operated on by the main statistical query. The statistical query returns aggregations which describe the distribution, shape, and dispersion of the sample set.

## Correlation and covariance

Sub-queries can be used to provide random samples for correlation and covariance:

```
select corr(filesize_d, response_d) as correlation,
covar_samp(filesize_d, response_d) as covariance
from (select filesize_d, response_d from logs limit 50000)
```

In the example above, the random sample returns two fields to the corr and covar_samp functions in the main query. Correlation and covariance are used to show the strength of the linear relationship between two variables.

## Pushed-down statistical queries

A narrower set of statistical aggregations can be pushed down to the search engine and operate over entire result sets. These functions are: count(*), count(distinct), sum, min, max, avg and approx_percentile.

Below is an example of a fully pushed down statistical query:

`select count(*) as cnt, avg(filesize_d) as avg, approx_percentile(filesize_d, .50) as median from logs where year_i = 2018`

Statistical queries that contain a mix of the queries above and non-pushdown such as skewness or kurtosis will be operate over a random sample that matches the query.

Below is an example of a statisical query that operates over a random sample:

`select count(*) as cnt, skewness(filesize_d) as skewness from logs where year_i = 2018`