Calculating Summary Statistics in SQL

When you first get your hands on a data set, what's it like:

quickly get a feel for the data?
are there outliers?
is the data shaped abnormally?

These are questions you might have about your data. The disadvantage of a spreadsheet-like interface is that it's difficult to understand your data, speically when you're seeing one for the first time.

For this purpose, we have summary statistics. Fortunately, SQL has a robust set of functions to do exactly that.

As an example, we'll use a list of the fastest growing companies in the United States according to Inc. magazine. The data table is formatted as:

Name	State	Industry	Employees	Revenue
Fuhu	California	Consumer Products & Services	227	195640000
Quest Nutrition	California	Food & Beverage	191	82640563
Reliant Asset Management	Virginia	Business Products & Services	145	85076502
Superfish	California	Software	62	35293000
Acacia Communications	Massachusetts	Telecommunications	92	77652360
Provider Power	Maine	Energy	50	137977203
Crescendo Bioscience	California	Health	129	27308000
Plexus Worldwide	Arizona	Health	130	159897088
Vacasa	Oregon	Travel & Hospitality	264	26263454
Go Energies	North Carolina	Energy	11	32851754
Minute Key	Colorado	Consumer Products & Services	113	15782039
Sainstore	Kansas	Business Products & Services	135	19243863
The HCI Group	Florida	Health	153	34583142
Dynamic Dental Partners Group	Florida	Health	313	19802935
...	...	...	...	...

Inspired by Spreadsheets

Our inspiration comes from spreadsheet applications like Microsoft Excel or Google Sheets that show you a summary of a column in the footer:

Understanding the Data

For the purposes of this example, we are only interested in the distribution of the numerical columns:

employees
revenue

We will calculate the mean and measure the level of variability (min and max) of the data. These are often a good start.

select 'total',
    sum(employees) as employees,
    sum(revenue) as revenue
from inc5000
union
select 'average',
    avg(employees),
    avg(revenue)
from inc5000
union
select 'min',
    min(employees),
    min(revenue)
from inc5000
union
select 'max',
    max(employees),
    max(revenue)
from inc5000

Will give us a result like this:

?column?	employees	revenue
Total	65301	15088861318
Avg	96.45642540620383	22320800.76627219
Max	5603	985737000
Min	0	1953000

This is cool, but let's append it to the main table above so that we have everything to look at in one place.

select name, employees, revenue 
from inc5000 
union
select 'total',
    sum(employees) as employees,
    sum(revenue) as revenue
from inc5000
union
select 'average',
    avg(employees),
    avg(revenue)
from inc5000
union
select 'min',
    min(employees),
    min(revenue)
from inc5000
union
select 'max',
    max(employees),
    max(revenue)
from inc5000

Unfortunately, it seems like the summary statistics are lost within the rows of the main table. We should bring up to the top so that it's easy to glance at. This is possible by artifically creating a column with the row index we want and sorting by them.

select *
from
    ( select 5,
            company,
            employees,
            revenue
    from inc5000
    union select 1,
                'total',
                sum(employees) as employees,
                                sum(revenue) as revenue
    from inc5000
    union select 2,
                'avg',
                avg(employees),
                avg(revenue)
    from inc5000
    union select 3,
                'min',
                min(employees),
                min(revenue)
    from inc5000
    union select 4,
                'max',
                max(employees),
                max(revenue)
    from inc5000) stats
order by 1

Final Result

?column?	company	employees	revenue
1	Total	65301	15088861318
2	Avg	96.45642540620383	22320800.76627219
3	Min	0	1953000
4	Max	5603	985737000
5	Transactis	61	5485216
5	Bareburger	268	31150000
5	SmartZip Analytics	96	11396150
5	FastPay	26	5061986
5	Trend Nation	38	12149714
5	InterRail	11	17114363
5	JayBird	95	31216712
5	Softhq	188	16058111
...	...	...	...

More summary statistics

The SQL in this post are fairly simple using built-in functions. We'll cover more advanced summary statistics in other sections:

medians
variances, standard deviations and z-scores
correlations
linear regression line

👋 No fuss, just SQL We are open sourcing everything from the experience working with our agency clients. They spend thousands of dollars to get this level of detailed analysis – which you can now get for free. We send one update every week. Join 400+ data analysts who are leveling up with our recipes. 👊

No spam, ever! Unsubscribe any time. See past emails here.

Calculating Summary Statistics in SQL

Inspired by Spreadsheets

Understanding the Data

Final Result

More summary statistics

results matching ""

No results matching ""