Fusion SQL Statistics
Sampling is often used in statistical analysis to gain an understanding of the distribution,
shape and dispersion of a variable or the relationship between variables.
Fusion SQL returns a random sample for all basic selects that do not contain an ORDER BY clause. The random
sample is designed to return a uniform distribution of samples that match a query. The sample can be used to
infer statistical information about the larger result set.
The example below returns a random sample of a single field:
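A minimal sketch of such a query, assuming a hypothetical collection named `taxi` with a numeric field `fare_d` (both names are illustrative and do not come from this page):

```sql
-- Basic select with no ORDER BY clause, so the rows returned
-- are a random sample of the documents that match.
-- `taxi` and `fare_d` are hypothetical names.
select fare_d from taxi
```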
If no limit is specified, the sample size is 25000. To increase the sample size, add a limit larger than
25000.
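For instance, a sketch that requests a larger sample, again using the hypothetical `taxi` collection and `fare_d` field:

```sql
-- The limit sets the sample size; 50000 exceeds the default of 25000.
-- `taxi` and `fare_d` are hypothetical names.
select fare_d from taxi limit 50000
```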
The ability to subset the data with a query and then sample from that subset is called Stratified Random
Sampling. Stratified Random Sampling is an important statistical technique used to better understand sub-populations
of a larger data set.
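A sketch of this pattern is shown here. It assumes a hypothetical `taxi` collection with fields `fare_d` and `passenger_count_i`, and it assumes that Spark SQL-style aggregate functions (`avg`, `stddev`, `skewness`, `kurtosis`) are available in the main query:

```sql
-- Inner query: subset the data with a WHERE clause, then draw a
-- random sample of 50000 rows from that subset (no ORDER BY).
-- Outer query: describe the sample's distribution, shape and dispersion.
-- All collection and field names are hypothetical.
select count(*)         as samplesize,
       avg(fare_d)      as mean_fare,
       min(fare_d)      as min_fare,
       max(fare_d)      as max_fare,
       stddev(fare_d)   as std_fare,
       skewness(fare_d) as skew_fare,
       kurtosis(fare_d) as kurt_fare
from (select fare_d
      from taxi
      where passenger_count_i = 1
      limit 50000) t
```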
In the example above the sub-query is returning a random sample of 50000 results
which is operated on by the main statistical query. The statistical query returns aggregations
which describe the distribution, shape and dispersion of the sample set.
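The same sub-query pattern can feed two fields to `corr` and `covar_samp`. The sketch below assumes a hypothetical `taxi` collection with numeric fields `fare_d` and `tip_d`:

```sql
-- Inner query: random sample of two fields (no ORDER BY).
-- Outer query: correlation and sample covariance over the sampled pairs.
-- `taxi`, `fare_d` and `tip_d` are hypothetical names.
select corr(fare_d, tip_d)       as correlation,
       covar_samp(fare_d, tip_d) as covariance
from (select fare_d, tip_d
      from taxi
      limit 50000) t
```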
In the example above the random sample returns two fields to the corr and covar_samp functions
in the main query. Correlation and covariance are used to show the strength of the linear
relationship between two variables.
Statistical queries that mix the aggregations above with non-pushdown functions such as skewness or kurtosis
operate over a random sample that matches the query.
Below is an example of a statistical query that operates over a random sample:
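A sketch of such a query, assuming a hypothetical `taxi` collection with fields `fare_d` and `passenger_count_i`, and assuming the Spark SQL-style `skewness` and `kurtosis` functions:

```sql
-- No explicit sub-query here. Because skewness and kurtosis are
-- non-pushdown functions, the aggregations are computed over a
-- random sample of the rows that match the WHERE clause.
-- All collection and field names are hypothetical.
select avg(fare_d)      as mean_fare,
       stddev(fare_d)   as std_fare,
       skewness(fare_d) as skew_fare,
       kurtosis(fare_d) as kurt_fare
from taxi
where passenger_count_i = 1
```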