Time series are everywhere – your web event log, temperature readings, startup venture capital financings and your customer acquisition data. Squint hard enough at any dataset and you'll find a time axis. Here are four tools to work with time series data in PostgreSQL & Redshift…
An example of a time series
A time series data set has a simple structure. For example, a row might look something like this:
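As a sketch of what such a row could contain (the column names and values here are illustrative assumptions, not taken from a real table):

> SELECT * FROM events LIMIT 1;
 user_id |  event_name   |       properties       |      timestamp
---------+---------------+------------------------+---------------------
    1042 | Created Chart | {"chart_type": "line"} | 2016-01-01 17:27:15
(1 row)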
This is an example of a normalized table (you'll have to join it with the user table to find out user properties like name, location, etc.). The table is structured around a user, the event name and the properties related to that event. And of course, being a time series, there's a timestamp of when the event was generated.
Tool 1: Use date_trunc to Group & Aggregate by custom periods
The date_trunc function can be used to truncate a timestamp to a specified precision. Postgres supports a wide variety of precisions: second, minute, hour, day, week, month, quarter, year and so on.
> SELECT date_trunc('hour', '2016-01-01 17:27:15'::timestamp);
     date_trunc
---------------------
 2016-01-01 17:00:00
(1 row)

> SELECT date_trunc('month', '2016-01-01 17:27:15'::timestamp);
     date_trunc
---------------------
 2016-01-01 00:00:00
(1 row)
date_trunc, combined with GROUP BY and an aggregate function, comes in handy when you only want to look at a metric by a custom interval – for example, you can find the average of a metric for a month, or a quarter:
> SELECT date_trunc('month', timestamp), AVG(value) as metric
  FROM event_table
  GROUP BY 1;
Tool 2: Use generate_series to plug gaps in the time interval
Sometimes you'll have gaps in your time series because of a lack of data. For example, if you were to inspect our visitors table, there were days when we had no visitors at all. If we were to chart this time series, we'd get a distorted visual:
Ideally, we want to insert a zero for these missing days so that our chart isn’t misleading and is properly spaced out:
You can use generate_series to generate a continuous series – given a start, an end and a step interval:
> SELECT * FROM generate_series(2, 10, 2);
 generate_series
-----------------
               2
               4
               6
               8
              10
(5 rows)
What's cool is that Postgres natively understands datetime data types and can create a series with these data types:
> SELECT * FROM generate_series('2016-01-01'::date, '2016-01-05'::date, '1 day');
    generate_series
------------------------
 2016-01-01 00:00:00-08
 2016-01-02 00:00:00-08
 2016-01-03 00:00:00-08
 2016-01-04 00:00:00-08
 2016-01-05 00:00:00-08
(5 rows)
You can now use a LEFT JOIN with your data table to plug gaps and make sure the data points are regularly spaced out.
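A minimal sketch of this approach, assuming the visitors table mentioned above (the date range and column names are illustrative): the LEFT JOIN keeps every generated day, and COUNT over the joined rows returns zero for the days with no visitors.

> SELECT days.day as Date,
         COUNT(v.timestamp) as Count
  FROM generate_series('2016-01-01'::date, '2016-01-31'::date, '1 day') as days(day)
  LEFT JOIN visitors v ON date_trunc('day', v.timestamp) = days.day
  GROUP BY 1
  ORDER BY 1;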
generate_series is not supported on Redshift. We’ll cover a workaround in a future article.
Tool 3: Month-over-Month growth rates
To benchmark your business, you’ll want to compute month-over-month or week-over-week growth rates. In the case of Silota, we are not only interested in the number of charts created monthly, but also their growth rates on a month-to-month basis.
> SELECT date_trunc('month', timestamp) as Date,
         count(*) as Count
  FROM events
  WHERE event_name = 'Created Chart'
  GROUP BY 1
  ORDER BY 1;
The above query should give us a neat table with the number of charts created every month. To compute the growth rates, we use window functions and the lag function. First, let's understand how the lag function works:
> SELECT date_trunc('month', timestamp) as Date,
         count(*) as Count,
         lag(count(*), 1) over (order by date_trunc('month', timestamp))
  FROM events
  WHERE event_name = 'Created Chart'
  GROUP BY 1
  ORDER BY 1;
The lag function returns a value evaluated at the row that is a definable offset before the current row within the partition. In this particular case, we have simply picked the value from the previous row (an offset of 1). To compute growth rates, it's just a matter of subtracting the previous value from the current value:
> SELECT date_trunc('month', timestamp) as Date,
         count(*) as Count,
         100 * (count(*) - lag(count(*), 1) over (order by date_trunc('month', timestamp)))
             / lag(count(*), 1) over (order by date_trunc('month', timestamp)) || '%' as growth
  FROM events
  WHERE event_name = 'Created Chart'
  GROUP BY 1
  ORDER BY 1;
Tool 4: Running & Cumulative Sums
You might have a count of a particular event aggregated by day, for example with a query:
> SELECT date_trunc('day', timestamp) as Date,
         COUNT(metric) as Count
  FROM event_table
  GROUP BY 1
  ORDER BY 1;
Instead, you want to get a cumulative sum of the counts every day. In Postgres, window functions are the most efficient way to perform these kinds of operations.
> SELECT date_trunc('day', timestamp) as Date,
         COUNT(metric) as Count,
         SUM(COUNT(metric)) OVER (ORDER BY date_trunc('day', timestamp)) as CumSum
  FROM event_table
  GROUP BY 1
  ORDER BY 1;
In the above query, the OVER clause creates the window, and the ORDER BY means it sums up the counts in timestamp order.
If instead you want to reset the running total every month, you'd use the PARTITION BY clause inside the window function:
> SELECT date_trunc('day', timestamp) as Date,
         COUNT(metric) as Count,
         SUM(COUNT(metric)) OVER (
             PARTITION BY date_trunc('month', timestamp)
             ORDER BY date_trunc('day', timestamp)
         ) as CumSum
  FROM event_table
  GROUP BY date_trunc('month', timestamp), 1
  ORDER BY 1;