co_matrix function returns a matrix that shows the correlation of values within a categorical field based on their co-occurrence with another categorical field. For example, in a medical database this could be used to correlate diseases by co-occurring symptoms. In the example below, the
co_matrix function is used to correlate complaint types across zip codes in the NYC 311 complaint database. This can be used to better understand how complaint types tend to go together. The
co_matrix function has 4 parameters:
The categorical String field that will be correlated.
The categorical String field that will be used for co-occurrence.
The number of top values from the first field to correlate.
The number of top values from the second field to calculate co-occurrence from.
In the example below the top 25 values in the
complaint_type_s field are correlated across the top 20 values in the
zip_s field in the NYC 311 complaint database.
select co_matrix(complaint_type_s, zip_s, 25, 20) as corr, matrix_x, matrix_y from nyc311
The result set for the
co_matrix function is a correlation matrix for the first categorical field parameter. The
co_matrix function returns the correlation for each row (aliased as corr in the query above). The
matrix_x and matrix_y fields contain the combinations of the top N categorical values.
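To make the shape of this result set concrete, the following Python sketch computes a comparable matrix in memory. It is an illustration under stated assumptions, not the implementation behind co_matrix: the text above does not say which correlation measure is used, so Pearson correlation over per-zip co-occurrence counts is assumed here. The names co_matrix_sketch and pearson and the sample records are hypothetical.

```python
from collections import Counter
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Correlation is undefined for a constant vector; this sketch reports 0.
        return 0.0
    return cov / sqrt(var_a * var_b)


def co_matrix_sketch(records, n_values, n_cooccur):
    """Correlate the top n_values values of the first field across the
    top n_cooccur values of the second field.

    records: iterable of (value, cooccur_value) pairs,
             for example (complaint type, zip code).
    Returns a list of (corr, matrix_x, matrix_y) rows.
    """
    records = list(records)
    # Most frequent values of each field.
    top_values = [v for v, _ in Counter(v for v, _ in records).most_common(n_values)]
    top_cooccur = [c for c, _ in Counter(c for _, c in records).most_common(n_cooccur)]
    # How often each (value, cooccur_value) pair appears.
    pair_counts = Counter(records)
    # One count vector per value, with one entry per co-occurrence value.
    vectors = {v: [pair_counts[(v, c)] for c in top_cooccur] for v in top_values}
    # One row per ordered pair of values (ASSUMPTION: Pearson correlation).
    return [
        (pearson(vectors[x], vectors[y]), x, y)
        for x in top_values
        for y in top_values
    ]


# Made-up sample data, not taken from the NYC 311 database.
records = [
    ("Noise", "10001"), ("Noise", "10001"), ("Noise", "10002"),
    ("Heating", "10001"), ("Heating", "10001"), ("Heating", "10002"),
    ("Parking", "10002"), ("Parking", "10002"), ("Parking", "10001"),
]
rows = co_matrix_sketch(records, 3, 2)
```

In this sketch each row pairs one value with another, so correlating the top 3 values yields 9 rows, and two complaint types correlate strongly when their counts rise and fall together across the same zip codes.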
In the example below Apache Zeppelin is used to display the correlation matrix result for the top 25 occurring values in the
complaint_type_s field. The
co_matrix function can be visualized in a heatmap by plotting
matrix_x on the x-axis,
matrix_y on the y-axis and the correlation value in the cells. The example below shows the
co_matrix function visualized in an Apache Zeppelin heatmap:
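Outside of Zeppelin, the same layout can be prepared by pivoting the result rows into a grid, with matrix_x as columns and matrix_y as rows. The Python sketch below is a minimal illustration that assumes the rows have already been fetched as (corr, matrix_x, matrix_y) tuples. The function name to_heatmap_grid and the sample values are hypothetical.

```python
def to_heatmap_grid(rows):
    """Pivot (corr, matrix_x, matrix_y) rows into a 2D grid.

    Returns (x_labels, y_labels, grid), where grid[i][j] is the correlation
    for y_labels[i] and x_labels[j], or None if that pair has no row.
    """
    rows = list(rows)
    # Keep labels in first-seen order, without duplicates.
    x_labels = list(dict.fromkeys(x for _, x, _ in rows))
    y_labels = list(dict.fromkeys(y for _, _, y in rows))
    cells = {(x, y): corr for corr, x, y in rows}
    grid = [[cells.get((x, y)) for x in x_labels] for y in y_labels]
    return x_labels, y_labels, grid


# Made-up sample rows, not real query output.
sample_rows = [
    (1.0, "Noise", "Noise"),
    (0.4, "Noise", "Heating"),
    (0.4, "Heating", "Noise"),
    (1.0, "Heating", "Heating"),
]
x_labels, y_labels, grid = to_heatmap_grid(sample_rows)
```

The resulting grid can be handed to any plotting tool that accepts a 2D array with axis labels.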