Filling Missing Data and Gaps by Generating a Continuous Series in SQL
Often your business processes will store data in a database in a sparse format, i.e., if no data exists for a given dimension, no row will exist. But for reporting purposes, you will still want to show a zero next to a particular dimension even though no data exists in the database. It is important to have a consistent number of data points for period-to-period comparisons, and other analytical window functions like moving averages and running sums.
Product Sales from the database
We want to find out the total sum of products sold on a day-by-day basis. A simple query like this will give us that:
select dt, sum(sales) as sales from transactions group by dt order by dt;
The problem with this query is that for certain dates we have zero sales. By omitting that data from the result it appears as if we've sold products every single day. That's misleading.
generate_series with a
PostgreSQL has an function called
generate_series that returns a continuous series as multiple rows. For the purposes of this example, we'll generate a continuous series of dates in the month of Jan 2016:
select generate_series('2016-01-01'::date, '2016-01-31'::date, '1 day'::interval)
Now that we have a continuous series of rows, we'll use a
LEFT JOIN to generate another table with the gaps plugged:
select series as dt, sum(sales) as sales from generate_series('2016-01-01'::date, '2016-01-31'::date, '1 day'::interval) as series left join transactions on transactions.dt = series
As you can see, we now have data for every day of the month, even though the missing data has been filled with a
NULL (Ø). We'll use the
COALESCE function convert these null values to zero.
select series as dt, coalesce(sum(sales), 0) as sales from generate_series('2016-01-01'::date, '2016-01-31'::date, '1 day'::interval) as series left join transactions on transactions.dt = series
Our final result can be visualized as:
- You can use the generate series function and left joins to fill gaps along any dimension. The time dimension is the most common because it's the most freqeuntly used dimensions for comparisons.
- Generate series for MySQL
- Generate series for Redshift
No spam, ever! Unsubscribe any time. See past emails here.