Calculating Summary Statistics in SQL

When you first get your hands on a data set, what's it like:

  • quickly get a feel for the data?
  • are there outliers?
  • is the data shaped abnormally?

These are questions you might have about your data. The disadvantage of a spreadsheet-like interface is that it's difficult to understand your data, speically when you're seeing one for the first time.


For this purpose, we have summary statistics. Fortunately, SQL has a robust set of functions to do exactly that.

As an example, we'll use a list of the fastest growing companies in the United States according to Inc. magazine. The data table is formatted as:

Name State Industry Employees Revenue
Fuhu California Consumer Products & Services 227 195640000
Quest Nutrition California Food & Beverage 191 82640563
Reliant Asset Management Virginia Business Products & Services 145 85076502
Superfish California Software 62 35293000
Acacia Communications Massachusetts Telecommunications 92 77652360
Provider Power Maine Energy 50 137977203
Crescendo Bioscience California Health 129 27308000
Plexus Worldwide Arizona Health 130 159897088
Vacasa Oregon Travel & Hospitality 264 26263454
Go Energies North Carolina Energy 11 32851754
Minute Key Colorado Consumer Products & Services 113 15782039
Sainstore Kansas Business Products & Services 135 19243863
The HCI Group Florida Health 153 34583142
Dynamic Dental Partners Group Florida Health 313 19802935
... ... ... ... ...

Inspired by Spreadsheets

Our inspiration comes from spreadsheet applications like Microsoft Excel or Google Sheets that show you a summary of a column in the footer:

Understanding the Data

For the purposes of this example, we are only interested in the distribution of the numerical columns:

  • employees
  • revenue

We will calculate the mean and measure the level of variability (min and max) of the data. These are often a good start.

select 'total',
    sum(employees) as employees,
    sum(revenue) as revenue
from inc5000
union
select 'average',
    avg(employees),
    avg(revenue)
from inc5000
union
select 'min',
    min(employees),
    min(revenue)
from inc5000
union
select 'max',
    max(employees),
    max(revenue)
from inc5000

Will give us a result like this:

?column? employees revenue
Total 65301 15088861318
Avg 96.45642540620383 22320800.76627219
Max 5603 985737000
Min 0 1953000

This is cool, but let's append it to the main table above so that we have everything to look at in one place.

select name, employees, revenue 
from inc5000 
union
select 'total',
    sum(employees) as employees,
    sum(revenue) as revenue
from inc5000
union
select 'average',
    avg(employees),
    avg(revenue)
from inc5000
union
select 'min',
    min(employees),
    min(revenue)
from inc5000
union
select 'max',
    max(employees),
    max(revenue)
from inc5000

Unfortunately, it seems like the summary statistics are lost within the rows of the main table. We should bring up to the top so that it's easy to glance at. This is possible by artifically creating a column with the row index we want and sorting by them.

select *
from
    ( select 5,
            company,
            employees,
            revenue
    from inc5000
    union select 1,
                'total',
                sum(employees) as employees,
                                sum(revenue) as revenue
    from inc5000
    union select 2,
                'avg',
                avg(employees),
                avg(revenue)
    from inc5000
    union select 3,
                'min',
                min(employees),
                min(revenue)
    from inc5000
    union select 4,
                'max',
                max(employees),
                max(revenue)
    from inc5000) stats
order by 1

Final Result

?column? company employees revenue
1 Total 65301 15088861318
2 Avg 96.45642540620383 22320800.76627219
3 Min 0 1953000
4 Max 5603 985737000
5 Transactis 61 5485216
5 Bareburger 268 31150000
5 SmartZip Analytics 96 11396150
5 FastPay 26 5061986
5 Trend Nation 38 12149714
5 InterRail 11 17114363
5 JayBird 95 31216712
5 Softhq 188 16058111
... ... ... ...

More summary statistics

The SQL in this post are fairly simple using built-in functions. We'll cover more advanced summary statistics in other sections:


👋 No fuss, just SQL We are open sourcing everything from the experience working with our agency clients. They spend thousands of dollars to get this level of detailed analysis – which you can now get for free. We send one update every week. Join 400+ data analysts who are leveling up with our recipes. 👊

No spam, ever! Unsubscribe any time. See past emails here.

results matching ""

    No results matching ""