Creating Pareto Charts to visualize the 80/20 principle in SQL
A Pareto chart highlights the most important factors amongst a large number of factors. A social scientist in Italy, Vilfredo Pareto discovered that roughly 20% of the population owned 80% of the wealth. From there, he noticed that these proportions described many other aspects of society. This led him to propose the 80-20 rule. The Pareto chart, named in his honor, shows the large contribution from a small proportion of the population.
You'll notice the 80-20 principle in action across your business:
- a small proportion of customers generate the most customer support load
- a small proportion of customers contribute the most to the over all revenue – financial statements usually list this as a risk factor
- a small proportion of web pages generate the most pageviews
- a small proportion of adwords keywords generate the most clicks and conversions, etc.
The Pareto chart has two components:
- a column chart where the height of the bar is the frequency of that particular category
- a line chart that shows the cumulative percentage contribution to the total
For this example, we'll be aggregating the top pages on our site by pageviews and a cumulative percentage contribution to the total number of page views in that period.
To calculate the total number of pageviews, categorized by page title and sorted descending by pageviews:
SELECT title, count(*) as pageviews FROM visits WHERE visits.dt > (now() - '2 weeks'::interval) -- just calculate for the last two weeks GROUP BY title ORDER BY 2 DESC
The above query will give us a table like the following:
|Calculating Percentage (%) of Total Sum||354|
|Calculating Month-over-Month Growth Rate||156|
|Calculating Relationships with Correlation Matrices||156|
|Forecasting, Identifying Trends with Linear Regression||134|
We'll next have to calculate a cumulative, running total (read more about SQL Running Totals.) For this, we'll be making use of window functions:
SELECT title, pageviews, SUM(pageview) OVER (ORDER BY pageviews DESC) as cumsum FROM ( // query as before ) tbl;
The result will now look like this:
|Calculating Percentage (%) of Total Sum||354||354|
|Calculating Month-over-Month Growth Rate||156||510|
|Calculating Relationships with Correlation Matrices||156||666|
|Forecasting, Identifying Trends with Linear Regression||134||800|
The last remaining step is to divide the cumulative sum by the total number of pageviews to calculate a percentage. We can precalculate the total number of pageviews with a common table expression:
WITH total_pageviews AS ( SELECT count(pageviews) AS cnt FROM visits WHERE visits.dt > (now() - '2 weeks'::interval) ) SELECT title, pageviews, SUM(pageview) OVER (ORDER BY pageviews DESC) * 1.0 / total_pageviews.cnt as cumsum FROM ( SELECT title, count(*) as pageviews FROM visits WHERE visits.dt > (now() - '2 weeks'::interval) -- just calculate for the last two weeks GROUP BY title ORDER BY 2 DESC ) tbl;
You are reading a series of Advanced SQL tricks.
Read this next: Calculating Top N items per Group
Join 300+ data analysts who are using our weekly SQL tips to improve their skills. No spam, promise!