Creating Pareto Charts to visualize the 80/20 principle in SQL

A Pareto chart highlights the most important factors amongst a large number of factors. A social scientist in Italy, Vilfredo Pareto discovered that roughly 20% of the population owned 80% of the wealth. From there, he noticed that these proportions described many other aspects of society. This led him to propose the 80-20 rule. The Pareto chart, named in his honor, shows the large contribution from a small proportion of the population.

You'll notice the 80-20 principle in action across your business:

  • a small proportion of customers generate the most customer support load
  • a small proportion of customers contribute the most to the over all revenue – financial statements usually list this as a risk factor
  • a small proportion of web pages generate the most pageviews
  • a small proportion of adwords keywords generate the most clicks and conversions, etc.

The Pareto chart has two components:

  • a column chart where the height of the bar is the frequency of that particular category
  • a line chart that shows the cumulative percentage contribution to the total

For this example, we'll be aggregating the top pages on our site by pageviews and a cumulative percentage contribution to the total number of page views in that period.

To calculate the total number of pageviews, categorized by page title and sorted descending by pageviews:

SELECT title, 
       count(*) as pageviews
FROM visits
WHERE visits.dt > (now() - '2 weeks'::interval)
-- just calculate for the last two weeks 
GROUP BY title 
ORDER BY 2 DESC

The above query will give us a table like the following:

title pageviews
Calculating Percentage (%) of Total Sum 354
Calculating Month-over-Month Growth Rate 156
Calculating Relationships with Correlation Matrices 156
Forecasting, Identifying Trends with Linear Regression 134
... ...

We'll next have to calculate a cumulative, running total (read more about SQL Running Totals.) For this, we'll be making use of window functions:

SELECT title, 
       pageviews, 
       SUM(pageview) OVER (ORDER BY pageviews DESC) as cumsum
FROM (
    // query as before
) tbl;

The result will now look like this:

title pageviews cumsum
Calculating Percentage (%) of Total Sum 354 354
Calculating Month-over-Month Growth Rate 156 510
Calculating Relationships with Correlation Matrices 156 666
Forecasting, Identifying Trends with Linear Regression 134 800
... ... ...

The last remaining step is to divide the cumulative sum by the total number of pageviews to calculate a percentage. We can precalculate the total number of pageviews with a common table expression:

WITH total_pageviews AS
(  
    SELECT count(pageviews) AS cnt
    FROM visits
    WHERE visits.dt > (now() - '2 weeks'::interval)
)
SELECT title, 
       pageviews, 
       SUM(pageview) OVER (ORDER BY pageviews DESC) * 1.0 / total_pageviews.cnt as cumsum
FROM (
    SELECT title, 
        count(*) as pageviews
    FROM visits
    WHERE visits.dt > (now() - '2 weeks'::interval)
    -- just calculate for the last two weeks 
    GROUP BY title 
    ORDER BY 2 DESC
) tbl;