Making Correlation Coefficient Matrices to understand relationships in SQL
A scatter X-Y plot is a straightforward way to visualize the dependency between two variables. However, at times you want to understand how more than two variables are related. The correlation coefficient can be calculated between pairs of variables that scales between +1 and -1 demonstrating the degree of positive or negative correlation. The relationship between three variables can be presented as a symmetric matrix of dimensions with a value of 1 along the diagonal.
In this recipe, we'll look for correlations betwen two or more variables and visualize it as a matrix. You can use the correlation matrix to figure out what activities are correlated, to plan future activities. For example, within your customer support department, how are net promoter scores (NPS) related to support wait times.
The Correlation Coefficient math
One of the most commonly used correlation formula is Pearson’s. For a sample,
where, and are the sample means of variables and .
The SQL to correlate email subject lengths and open rates
Using Mailchimp's email campaign data – we are interested in finding if there's a relationship between the length of the subject and open/click rates.
1. Calculating correlation by hand
We are going to be using Common Table Expressions (CTEs) to organize our intermediate results:
WITH table_mean AS ( SELECT avg(char_length(subject_line)) AS mean_subject_length, avg(report_summary_open_rate) AS mean_open_rate FROM mailchimp.campaigns ), table_corrected AS ( SELECT char_length(subject_line) - mean_subject_length AS mean_subject_length_corrected, report_summary_open_rate - mean_open_rate AS mean_open_rate_corrected FROM table_mean, mailchimp.campaigns ), SELECT sum(mean_subject_length_corrected * mean_open_rate_corrected) / sqrt(sum(mean_subject_length_corrected * mean_subject_length_corrected) * sum(mean_open_rate_corrected * mean_open_rate_corrected)) AS r FROM table_corrected;
This is a direct translation of the math equation. Fortunately, we don't have to repeat this each time, we can simply use the in-built
corr function to calculate the correlation for us.
2. Calculating pairwise correlation using
The correlation can be calculated as follows:
SELECT corr(char_length(subject_line), report_summary_open_rate) AS r;
For more than two variables, we are going to repeat the correlation calculation pairwise between the variables and organize the results in the follow format. In step 3, it'll be clear why we use this format:
SELECT 'subject_length' AS row, 'subject_length' AS col, corr(subject_length, subject_length) as coeff FROM mailchimp.campaigns UNION SELECT 'subject_length' AS row, 'open_rate' AS col, corr(subject_length, open_rate) as coeff FROM mailchimp.campaigns UNION SELECT 'subject_length' AS row, 'click_rate' AS col, corr(subject_length, click_rate) as coeff FROM mailchimp.campaigns UNION SELECT 'open_rate' AS row, 'open_rate' AS col, corr(open_rate, open_rate) as coeff FROM mailchimp.campaigns UNION SELECT 'open_rate' AS row, 'click_rate' AS col, corr(open_rate, click_rate) as coeff FROM mailchimp.campaigns UNION SELECT 'click_rate' AS row, 'click_rate' AS col, corr(click_rate, click_rate) as coeff FROM mailchimp.campaigns
3. Pivoting the table to get a matrix
In this step, we'll be building a manual pivot table using the values from the
col column. After the pivot, the values of the
col column will become new columns in the resulting table.
SELECT row, sum(CASE WHEN col='subject_length' THEN coeff ELSE 0 END) AS subject_length, sum(CASE WHEN col='open_rate' THEN coeff ELSE 0 END) AS open_rate, sum(CASE WHEN col='click_rate' THEN coeff ELSE 0 END) AS click_rate FROM ( // ... query as before ) GROUP BY row ORDER BY row DESC
and our table will look like this:
which matches the values we get from Excel or Google sheets. Our calculations show that open and click rates are negatively correlated with subject lengths and there's a strong positive relation between open and click rates (which is obvious because they are not independent variables – you need to open an email in order to click it.)
You are reading a series of Advanced SQL tricks.
Join 300+ data analysts who are using our weekly SQL tips to improve their skills. No spam, promise!