> ## Documentation Index
> Fetch the complete documentation index at: https://doc.lucidworks.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Fusion SQL Statistics

export const LwTemplate = ({title = "Key questions to get you started", icon = "sparkles", cta = "Powered by Agent Studio", linkHref = "https://lucidworks.com/demo/?utm_source=docs&utm_medium=referral&utm_campaign=docs_cta_ai"}) => {
  const [isLoaded, setIsLoaded] = useState(false);
  useEffect(() => {
    const timer = setTimeout(() => {
      setIsLoaded(true);
    }, 500);
    return () => clearTimeout(timer);
  }, []);
  return <div className="lw-template-container">
      <Card title={title} icon={icon}>
        {isLoaded && <span dangerouslySetInnerHTML={{
    __html: `<lw-template id="a029c1a9-28be-427e-b0e1-5d918920246a"></lw-template
            >`
  }} />}
        <Link href={linkHref} className="agent-studio-link text-left text-gray-600 gap-2 dark:text-gray-400 text-sm font-medium flex flex-row items-center hover:text-primary dark:hover:text-primary-light group-hover:text-primary group-hover:dark:text-primary-light">Powered by Lucidworks Agent Studio</Link>
      </Card>
    </div>;
};

[localhost link]: http://localhost:3000/docs/4/fusion-server/reference/fusion-sql/statistics

[mintlify link]: https://doc.lucidworks.com/docs/4/fusion-server/reference/fusion-sql/statistics

[old doc.lw link]: https://doc.lucidworks.com/fusion-server/4.2/3280

Sampling is often used in statistical analysis to gain an understanding of the **distribution**,
**shape** and **dispersion** of a variable or the relationship between variables.

Fusion SQL returns a random sample for all basic selects that **do not** contain an ORDER BY clause. The random
sample is designed to return a uniform distribution of samples that match a query. The sample can be used to
infer statistical information about the larger result set.

The example below returns a random sample of single field:

```sql wrap  theme={"dark"}
select filesize_d from logs where year_i = 2019
```

If no limit is specified the sample size will be 25000. To increase the sample size add a limit larger then
25000\.

```sql wrap  theme={"dark"}
select filesize_d from logs where year_i = 2019 limit 50000
```

The ability to subset the data with a query and then sample from that subset is called **Stratified Random
Sampling**. Stratified Random Sampling is an important statistical technique used to better understand sub-populations
of a larger data set.

<LwTemplate />

## Descriptive statistics

Sub-queries can be used to return random samples for fast, often sub-second, statistical analysis. For example:

```sql wrap  theme={"dark"}
select count(*) as samplesize,
       mean(filesize_d) as mean,
       min(filesize_d) as min,
       max(filesize_d) as max,
       approx_percentile(filesize_d, .50) as median,
       variance(filesize_d) as variance,
       std(filesize_d) as standard_dev,
       skewness(filesize_d) as skewness,
       kurtosis(filesize_d) as kurtosis,
       sum(filesize_d) as sum
           from (select filesize_d from logs where year_i = 2019 limit 50000)
```

In the example above the sub-query is returning a random sample of 50000 results
which is operated on by the main statistical query. The statistical query returns aggregations
which describe the distribution, shape and dispersion of the sample set.

## Correlation and covariance

Sub-queries can be used to provide random samples for correlation and covariance:

```sql wrap  theme={"dark"}
select corr(filesize_d, response_d) as correlation,
       covar_samp(filesize_d, response_d) as covariance
            from (select filesize_d, response_d from logs limit 50000)
```

In the example above the random sample returns two fields to the corr and covar\_samp functions
in the main query. Correlation and covariance are used to show the strength of the linear
relationship between two variables.

## Pushed-down statistical queries

A narrower set of statistical aggregations can be pushed down to the search engine
and operate over entire result sets. These functions are: count(\*), count(distinct),
sum, min, max, avg and approx\_percentile.

Below is an example of a fully pushed down statistical query:

```sql wrap  theme={"dark"}
select count(*) as cnt, avg(filesize_d) as avg, approx_percentile(filesize_d, .50) as median from logs where year_i = 2018
```

Statistical queries that contain a mix of the queries above and non-pushdown such as skewness or kurtosis will
be operate over a random sample that matches the query.

Below is an example of a statistical query that operates over a random sample:

```sql wrap  theme={"dark"}
select count(*) as cnt,  skewness(filesize_d) as skewness from logs where year_i = 2018
```
