Calculating Z-Score in SQL

In statistics, the z-score (or standard score) of an observation is the number of standard deviations that it is above or below the population mean.

where,

is the observation, is the mean and is the standard deviation.

You can use the z-score to answer questions like the following:

  • What percentage of values fall below a specific value?
  • What values can be expected to be outliers?
  • What is the relative score of one distribution versus another? Literally comparing apples and oranges.

It is generally accepted that z-scores lower than -1.96 or higher than 1.96 to be outliers, or at the least worth a second look. These values are statistically significant at the 95% confidence level. For a higher confidence level of 99%, you are interested in z-scores of higher than 2.576 or lower than -2.576. We are assuming that your data fits a normal distribution.


Calculating the Z-score of Several Columns

Consider an ecommerce store that sells products online. The two metrics our analytics team is tracking everyday is sales and website visitors. The table will look something like this:

date sales visitors conversion
2016-01-01 21 3373 0.62%
2016-01-02 50 3820 1.31%
2016-01-03 50 3175 1.57%
2016-01-04 33 4013 0.82%
2016-01-05 58 4022 1.44%
2016-01-06 5 4873 0.25%
2016-01-07 36 1924 1.87%
2016-01-08 44 3867 1.14%
2016-01-09 28 3621 0.77%
2016-01-10 50 1722 2.90%

We have some days with particularly good sales (eg: 2016-01-10 with 50 sales.) But it's difficult to get a feel for the data because the daily sales and visitors are not directly comparable. That's the "apples -vs- oranges" argument we put forth earlier.

Let's compute the z-scores for these two columns using SQL. We make use of Computed Table Expressions (CTEs) introduced in Postgres 9.4 in order to precalculate the means and standard deviations for the two columns:

WITH sales_stats AS
    (SELECT avg(sales) AS mean,
            stddev(sales) AS sd
    FROM zscore),
    visitor_stats AS
    (SELECT avg(visitors) AS mean,
            stddev(visitors) AS sd
    FROM zscore)
SELECT dt,
    abs(sales - sales_stats.mean) / sales_stats.sd AS z_score_sales,
    abs(visitors - visitor_stats.mean) / visitor_stats.sd AS z_score_visitors
FROM sales_stats,
    visitor_stats,
    zscore;
date z_score_sales z_score_visitors
2016-01-01 1.02 0.07
2016-01-02 0.77 0.39
2016-01-03 0.77 0.28
2016-01-04 0.28 0.59
2016-01-05 1.26 0.60
2016-01-06 2.00 1.48
2016-01-07 0.09 1.57
2016-01-08 0.40 0.44
2016-01-09 0.59 0.19
2016-01-10 0.77 1.78

Whoa, sales on 2016-01-06 has definitely take a hit and is worth investigating further!


No fuss, just SQL We are open sourcing everything from the experience working with our agency clients. They spend thousands of dollars to get this level of detailed analysis – which you can now get for free. You'll learn bucketing, finding outliers, calculating correlations, budgeting & forecasting, return on spend and investment, data smoothing and cleaning applied to freemium, transactional, advertising, two-sided marketplace and subscription based business models. We send one update every week. Join 400+ data analysts who are leveling up with our recipes. 👊

No spam, ever! Unsubscribe any time.