Modeling Exponential Growth in SQL

Quite often the output of a business – for example: revenue, number of employees – is not a straight line. A straight line has a constant slope. Instead, the revenue of a growing business is most likely a constant percentage. This is why magazines like Inc 5000 rank businesses by their growth rates rather than absolute revenue numbers.

If you want to value a business at a future time, you'll want to fit a trend line to past revenues to forecast future revenues. This is where the exponential growth model comes into play.


1. The equation behind exponential curves

The exponential function has the form:

where,

is the observation (independent variable which could be time), is the constant for the base of natural logarithms and is the dependent variable – which could be revenue.

The values for and determine the parameters of the exponential function.

2. Sample data to follow along

We are going to be working with a small data set that you can follow along step-by-step (or on paper or Excel if you wish):

Given a table with data for when a new user starts paying for a subscription:

select date_trunc('day', dt), count(*)
from transactions
group by 1;

we see something like this:

Assuming no user cancels their subscription, a cumulative/running sum of our subscribers gets us this:

select date_trunc('month', dt) as dt, 
       sum(count(*)) over (order by dt) as total_users_monthly
from transactions
group by 1;
dt total_users_monthly
Jan 2017 597
Feb 2017 595
Mar 2017 997
Apr 2017 1407
May 2017 2215
Jun 2017 3577

This looks exponential!

3. Linearizing the exponential curve

The slope of the exponential curve is not constant, therefore trying to fit a linear line to the curve would be awkward. Sometimes, the percentage slope of the curve is constant, which means that the actual slope of the curve is rapidly increasing. This is where an exponential fit would make sense.

One way to linearize the curve is to take the log of the data.

select date_trunc('month', dt) as dt, 
       sum(count(*)) over (order by dt) as total_users_monthly
       log(sum(count(*)) over (order by dt)) as log_total_users_monthly
from transactions
group by 1;
dt total_users_monthly log_total_users_monthly
Jan 2017 597 2.775974331
Feb 2017 595 2.774516966
Mar 2017 997 2.998695158
Apr 2017 1407 3.148294097
May 2017 2215 3.345373731
Jun 2017 3577 3.55351894

4. Finding the best fit line using Simple Linear Regression

We will now fit a simple linear regression over this line to figure out the coefficients of the best fit line. We've described the method of linear regression with SQL in great detail elsewhere on our site. On that page, we built the linear regression method from scratch, but here were are going to cheat and make use of PostgreSQL's regr_slope and regr_intercept inbuilt functions to calculate the coefficients of the best fit line.

One additional technicality here is that the regr_slope and regr_intercept functions expect numerical arguments and won't work with our datetime data types on the x-axis. Since the data points are equidistant, we can map these dates to contiguous integers which we'll do using the row_number window function. The row_number window analytical function enumerates each row in the result set with the partition, starting from 1.

select date_trunc('month', dt) as dt, 
       row_number() over (order by date_trunc('month', dt)) as dp, 
       sum(count(*)) over (order by dt) as total_users_monthly
       log(sum(count(*)) over (order by dt)) as log_total_users_monthly
from transactions
group by 1
dt dp total_users_monthly log_total_users_monthly
Jan 2017 1 597 2.775974331
Feb 2017 2 595 2.774516966
Mar 2017 3 997 2.998695158
Apr 2017 4 1407 3.148294097
May 2017 5 2215 3.345373731
Jun 2017 6 3577 3.55351894
select regr_slope(log_total_users_monthly, dp),
       regr_intercept(log_total_users_monthly, dp) 
from (
    select date_trunc('month', dt) as dt, 
           row_number() over (order by date_trunc('month', dt)) as dp, 
           log(sum(count(*)) over (order by dt)) as log_total_users_monthly
    from transactions
    group by 1
) b 
regr_slope regr_intercept
0.164282636534649 2.52440630934746

Our best-fit line has the equation:

5. Forecasting using the best-fit line

Next, we can extrapolate the line to the next few months to predict what our total new users will be in the future.

First, we generate a few more months with the generate_series function (here we generate six more months):

select 6 + generate_series(1, 6, 1);

Then, we use the equation of the best fit line to extrapolate:

with ext as (
    select generate_series(1, 6, 1) + 6 as x, 
           0.164282636534649 as slope, 
           2.5244063093474 as intercept
)
select ext.slope * ext.x + ext.intercept as y, 
       ext.x
from ext

Then, we reverse-map the x-axis back to date time formats:

with ext as (
    select generate_series(1, 6, 1) + 6 as x, 
           0.164282636534649 as slope, 
           2.5244063093474 as intercept
)
select '2017-01-01'::date +  (ext.x - 1) * '1 month'::interval as dt, 
       ext.slope * ext.x + ext.intercept as log_total_users_monthly, 
       ext.x as x           
from ext

which gives us:

dt log_total_users_monthly x
Jul 2017 3.674384765089943 7
Aug 2017 3.838667401624592 8
Sep 2017 4.002950038159241 9
Oct 2017 4.16723267469389 10
Nov 2017 4.331515311228539 11
Dec 2017 4.495797947763188 12

Then, we invert the linearized equation using the pow function to forecast total users for the next few months:

with ext as (
    select generate_series(1, 6, 1) + 6 as x, 
           0.164282636534649 as slope, 
           2.5244063093474 as intercept
)
select '2017-01-01'::date +  (ext.x -1 ) * '1 month'::interval as dt, 
       pow(10, ext.slope * ext.x + ext.intercept) as total_users_monthly
from ext

which gives us:

dt total_users_monthly
Jul 2017 4724.8145289282065
Aug 2017 6897.113956704491
Sep 2017 10068.158366961097
Oct 2017 14697.134705694105
Nov 2017 21454.34752656913
Dec 2017 31318.283257788567

6. The complete query

We can union the real and forecasted values to build a complete chart:

select date_trunc('month', dt) as dt, 
       sum(count(*)) over (order by dt) as total_users_monthly
from transactions
group by 1;

union
with ext as (
    select generate_series(1, 6, 1) + 6 as x, 
           0.164282636534649 as slope, 
           2.5244063093474 as intercept
)
select '2017-01-01'::date +  (ext.x - 1) * '1 month'::interval as dt, 
       pow(10, ext.slope * ext.x + ext.intercept) as total_users_monthly
from ext

To separate out the two curves, we can introduced another pivot column to label the series:

select date_trunc('month', dt) as dt, 
       'actual revenue' as revenue,
       sum(count(*)) over (order by dt) as total_users_monthly
from transactions
group by 1;

union
with ext as (
    select generate_series(1, 6, 1) + 6 as x, 
           0.164282636534649 as slope, 
           2.5244063093474 as intercept
)
select '2017-01-01'::date +  (ext.x - 1) * '1 month'::interval as dt, 
       'projected revenue' as revenue,    
       pow(10, ext.slope * ext.x + ext.intercept) as total_users_monthly
from ext


👋 No fuss, just SQL We are open sourcing everything from the experience working with our agency clients. They spend thousands of dollars to get this level of detailed analysis – which you can now get for free. We send one update every week. Join 400+ data analysts who are leveling up with our recipes. 👊

No spam, ever! Unsubscribe any time. See past emails here.

results matching ""

    No results matching ""