Making Histogram Frequency Distributions in SQL
A histogram is a special type of column statistic that sorts values into buckets – as you might sort coins into buckets. Generating a histogram is a great way to understand the distribution of data. We'll look at multiple ways of generating histograms.
Consider a table
employee_salary with data like this:
|Alice||Finance, Risk & Business Plan||Team Lead||114047|
|Bob||HR, Digital Strategy & IT||Computer Programmer||88329|
|Christine||Law Department||Solicitor III||85965|
|Dan||Community Services||Property Use Inspector||80150|
|Eve||HR, Digital Strategy & IT||Business Analyst||105694|
Let's analyze the distribution of salaries, across the entire table and across each department.
What's the Frequency Distribution?
You might be tempted to use this query:
SELECT salary, count(*) FROM employee_salary GROUP BY 1 ORDER BY 2 DESC
but there are too many unique salaries in the table, and the results won't be meaningful. The answer is to group the data into buckets of salary bands and count them.
width_bucket for histograms with equal bucket widths
PostgreSQL has a function
width_bucket that will return the bucket number for a numeric value based on a range and the number of buckets necessary. For example,
SELECT width_bucket(5.35, 0, 10, 5);
Internally, PostgreSQL keeps a data structure with the frequency counts, something like this:
|value||→ 5.35 ←|
width_bucket function to create five buckets ranging from $70,000 to $140,000, so each bucket has a width of $14,000:
SELECT width_bucket(salary, 70000, 140000, 5) as bucket, count(*) as cnt GROUP BY bucket ORDER BY bucket;
And if you want to format the bucket numbers to correspond to the salary bands,
SELECT 70000 + ((bucket-1) * (140000-70000)/5) || '-' || (70000 + (bucket) * (140000-70000)/5), cnt from ( SELECT width_bucket(salary, 70000, 140000, 5) as bucket, count(*) as cnt FROM employee_salary GROUP BY bucket ORDER BY bucket) x;
case for histograms with hand-picked bucket widths with
At times equiwidth buckets are insufficient for analysis purposes and you'll want to use custom bucket widths. The SQL
case statement comes handy for this purpose:
SELECT CASE WHEN salary BETWEEN 75000 AND 90000 THEN '75000-90000' WHEN salary BETWEEN 90000 AND 120000 THEN '90000-120000' ELSE '120000+' END as salary_band, count(*) FROM employee_salary GROUP BY 1
ntile for histograms with equal height bucket widths
If you want to optimize for bucket widths so that each bucket has the same number of salary counts, you can use the ntile window function to find the bucket widths. In the field of image processing, this is similar to histogram equalization.
SELECT ntile, min(salary), max(salary) FROM ( SELECT salary, ntile(4) OVER (ORDER BY salary) ) x GROUP BY ntile ORDER BY ntile
Using the min and max columns as bucket widths will give you an equiheight histogram.
4. SQL window functions for cumulative histograms
Using techniques described previously to calculate running totals, you can compute cumulative histgrams:
SELECT bucket, sum(cnt) OVER (ORDER BY bucket) FROM ( SELECT width_bucket(salary, 70000, 140000, 20) as bucket, count(*) as cnt FROM employee_salary GROUP BY bucket ORDER BY bucket) x;
Dividing the running totals with the total number of salaries will convert it to a percentage:
WITH total AS (SELECT count(*) AS cnt FROM employee_salary) SELECT bucket, sum(cnt) OVER (ORDER BY bucket) / total.cnt FROM ( SELECT width_bucket(salary, 70000, 140000, 20) as bucket, count(*) as cnt FROM employee_salary GROUP BY bucket ORDER BY bucket) x;
5. Detail versus Summary Histograms
At times you'll have one or more different series that you'll want to segment on, in this case, we want to segment on the department. We follow the same procedure as before, but add an additional
GROUP BY for the department:
SELECT width_bucket(salary, 70000, 140000, 5) AS bucket, department, count(*) as cnt GROUP BY department, bucket ORDER BY bucket, department;
Unfortunately, the above chart appears too noisy – there's too much going on. It's difficult to directly compare the contribution of the different departments. Trying out a proportional chart (where the bars add up to 100%) makes it a little easier on the eyes:
You are reading a series of Advanced SQL tricks.
Read this next: Calculating N-tiles
Join 200+ data analysts who are using our weekly SQL tips to improve their skills. No spam, promise!