Filling Missing Data and Gaps by Generating a Continuous Series in SQL

Often your business processes will store data in a database in a sparse format, i.e., if no data exists for a given dimension, no row will exist. But for reporting purposes, you will still want to show a zero next to a particular dimension even though no data exists in the database. It is important to have a consistent number of data points for period-to-period comparisons, and other analytical window functions like moving averages and running sums.


Product Sales from the database

We want to find out the total sum of products sold on a day-by-day basis. A simple query like this will give us that:

select dt, sum(sales) as sales from transactions
group by dt 
order by dt;
dt sales
2016-01-05 27
2016-01-06 4
2016-01-07 28
2016-01-08 25
... ...

The problem with this query is that for certain dates we have zero sales. By omitting that data from the result it appears as if we've sold products every single day. That's misleading.

Solution: Use generate_series with a LEFT JOIN

PostgreSQL has an function called generate_series that returns a continuous series as multiple rows. For the purposes of this example, we'll generate a continuous series of dates in the month of Jan 2016:

select generate_series('2016-01-01'::date, '2016-01-31'::date, '1 day'::interval)
generate_series
2016-01-01
2016-01-02
2016-01-03
2016-01-04
...

Now that we have a continuous series of rows, we'll use a LEFT JOIN to generate another table with the gaps plugged:

select series as dt,
       sum(sales) as sales
from generate_series('2016-01-01'::date, '2016-01-31'::date, '1 day'::interval) as series
left join transactions on transactions.dt = series
dt sales
2016-01-01 Ø
2016-01-02 Ø
2016-01-03 Ø
2016-01-04 Ø
2016-01-05 27
2016-01-06 4
2016-01-07 28
2016-01-08 25
... ...

As you can see, we now have data for every day of the month, even though the missing data has been filled with a NULL (Ø). We'll use the COALESCE function convert these null values to zero.

select series as dt,
       coalesce(sum(sales), 0) as sales
from generate_series('2016-01-01'::date, '2016-01-31'::date, '1 day'::interval) as series
left join transactions on transactions.dt = series

Our final result can be visualized as:

Notes

  1. You can use the generate series function and left joins to fill gaps along any dimension. The time dimension is the most common because it's the most freqeuntly used dimensions for comparisons.
  2. Generate series for MySQL
  3. Generate series for Redshift

👋 No fuss, just SQL We are open sourcing everything from the experience working with our agency clients. They spend thousands of dollars to get this level of detailed analysis – which you can now get for free. We send one update every week. Join 400+ data analysts who are leveling up with our recipes. 👊

No spam, ever! Unsubscribe any time. See past emails here.

results matching ""

    No results matching ""