Multichannel Marketing Attribution Modeling in SQL

A big challenge a lot of marketing departments face is measuring and improving the effectiveness of their marketing campaigns. It's important to figure out ROI because you can then double-down on campaigns/channels that have positive returns and discard marketing programs that are ineffective.

With one or two channels, it's straightforward to attribute a conversion event (i.e., purchase or signup) to that channel. With three or more channels, it gets complicated fast, and we need an attribution model to fully credit the conversion event.

Outcomes of the attribution model:

  • budgeting – use historical attribution data to inform future spend
  • forecasting – expectations on how future spend will perform
  • branding – measure the impact of branding campaigns
  • channel ROI – measure the performance of a channel, key to computing cost of customer acquisition and lifetime value of a customer
  • affiliate payments – measure the impact of a partner channel on your conversion events

1. What is marketing attribution

Consider three users: Alice, Bob and Eve who've completed a conversion event – make a purchase. The timeline of events is as follows:

User Session 3 Session 2 Session 1 Purchase
Alice Adwords $2,000
Bob Email Direct $5,000
Eve SEO Facebook SEM $1,000

Let's enumerate what's happening here:

  • three users span five different channels: Adwords, Email, SEO, Direct (as in they type in the URL of your website) and Facebook display adverts
  • the session/channel mix can be multiple days apart
  • the conversion values ($2,000, $5,000 and $1,000) are different

The conversion event for Alice can be attributed to Adwords directly, but what about Bob and Eve who've had touch points over multiple channels and over multiple sessions. What's the return on each channel? How do you compute the cost of customer acquisition? Attribution modeling is about answering these types of questions.

2. Common attribution models (as seen in Google Analytics)

We'll use Google Analytics, the 800lb gorilla in the analytics space, to guide our models. Google Analytics lists seven kinds of attribution models:

Model Description Benefit
Last click 100% of the credit goes to the last click. Simple, but only useful for brand awareness campaigns.
First click 100% of the credit goes to the first click. Simple, but ignores the channel that ultimately drove the conversion.
Linear Evenly distribute credit to all clicks. Simple, but does not differentiate between high value events and low value events.
Time decay More credit to the most recent click, credit reduces with time. Recognizes the significance of every channel along the conversion path and gives the most credit to the channel that drove the conversion.
Position based More credit to the first and last, everything else is evenly distributed. Combines the best features of linear and time-decay models.
Last non-direct clicks 100% of the credit to the last non-direct channel and ignores all other channels. Remove the limitation of "direct" traffic.
Last AdWords click 100% of the credit to the last AdWords click. Simple to understand and easy to compute the cost of acquisition.

3. Why "last click" is the default attribution model

The "last click" attribution model is an industry standard and the default for many marketing departments. This model assigns 100% of the credit for a conversion event to the last marketing channel the customer used. For example, if you clicked on a sponsored search/banner advert for a product and then buy, that channel will get 100% of the credit. It ignores all previous clicks/interactions before that conversion.

The assumption is that had the advert not been shown, the probability of purchase would have been 0%. This is a big assumption to make. In a 2012 survey by Econsultancy and Google, only 14% of marketers believed that "last click" was effective. Yet, the same survey revealed that more than half of them continued to rely on it for their day-to-day reporting.

Hmm, what gives?

4. What's the best attribution model

There is no right answer to this question. Pick what makes sense for your business. What we do want to highlight is the profitability of a channel depends on the attribution model you pick. Let's take an example:

Last clickFirst clickLinearTime decay
(not profitable)
(not profitable)
(not profitable)
(not profitable)
(not profitable)
(not profitable)

Assuming a user goes sequentially through the channels Facebook, Banner, Email and SEM with the respective cost per click (CPC) of $0.75, $1.00, $1.50 and $2.00. If we followed the last click model or the time-decay model, SEM is the cheapest source of acquiring customers. In the first-click or linear models, Facebook is the cheapest source.

The cost of customer acquisition along with payback period is one of the fundamentals of valuing a business. Getting this wrong can put your business on the wrong path to profitability!

5. SQL for marketing attribution modeling

5a. Organizing the marketing touch points

Our first step is to organize the marketing touch points so that we can iterate on our attribution models. Given a table pageviews of web traffic logs with the schema:

timestamp user_id url
2017-01-01 1 http.../sql-z-score.html?utm_source=Silota+SQL+Updates&utm_medium=email
2017-01-02 2 http.../sql-z-score.html?utm_source=Adwords&utm_medium=cpc
2017-01-03 3 http.../sql-z-score.html?utm_source=facebook&utm_medium=social&utm_campaign=black-friday
... ... ...

where the url column contains the marketing channel source, medium. campaign and any other relevant information. We can parse this url to a friendly column with a simple case statement:

select timestamp, 
           when url ilike '%utm_medium=email%' then 'email'
           when url ilike '%utm_medium=cpc%' then 'paid'
       end as channel
from pageviews;

where, ilike is the case-insensitive pattern matching function.

The above is only meant for illustrative purposes – for a complete regular expression that can parse out UTM parameters, see our SQL Regexp recipe. We now have a table that looks something like this:

timestamp user_id channel
2017-01-01 1 email
2017-01-02 2 paid
2017-01-03 3 banner
2017-01-04 1 paid
... ... ...

We can perform one more level of aggregation usign array_agg to organize the channels into a PostgreSQL array for each user. The array_agg is a way to aggregate a set of string values in a comma separated format.

select user_id, 
               when url ilike '%utm_medium=email%' then 'email'
               when url ilike '%utm_medium=cpc%' then 'paid'
               else 'unknown'
           end order by timestamp
       )::text as channels
from pageviews;
user_id channels
1 {email, paid}
2 {paid, banner, organic}
3 {banner, organic, paid}
... ...

Let's call this table marketing_channels.

5b. Organizing the revenue and conversion events

The revenue and conversion data can be dumped from your lead database or payment processor. The structure of this table transactions will be something like this:

timestamp user_id amount
2017-01-01 1 200.0
2017-01-10 3 150.0
2017-01-15 4 100.0
... ... ...

We can aggregate the revenue amounts for each user to get total amounts.

-- saved as revenue_events
select user_id, sum(amount) as total
from transactions
group by user_id;

It's now just a matter of joining the revenue data revenue_events and marketing touches marketing_channels with the right attribution model to connect revenue to channel.

5c. First-click marketing attribution model

The first-click attribution model gives 100% to the credit to the first channel the user is seen. We can use PostgreSQL's array subscripts to pick out the first channel.

select mc.user_id, mc.channels[1],
from marketing_channels mc
inner join revenue_events re
on mc.user_id = re.user_id;
user_id channels total
1 email 200.0
3 banner 150.0
... ... ...

A group by channel and aggregating the total column will give us the total revenue attributed to each channel:

select channel, sum(total) from (
    select mc.user_id, mc.channels[1],
    from marketing_channels mc
    inner join revenue_events re
    on mc.user_id = re.user_id
) a
group by channel;

5d. Last-click marketing attribution model

Similar to the first-click model, we give 100% of the credit to the last channel the user is seen. We'll use PostgreSQL's array_length to find the length of the channels array and use that as a subscript.

select mc.user_id, mc.channels[array_length(mc.channels, 1)],
from marketing_channels mc
inner join revenue_events re
on mc.user_id = re.user_id;
user_id channels total
1 paid 200.0
3 paid 150.0
... ... ...

5e. Linear marketing attribution model

This model evenly distributes the credit to all marketing touch points. The approach we'll take here is to calculate the weight of each channel by the number of touch points and then aggregate by the weighted revenue. Taking this step by step:

select user_id, channels, 1.0 / array_length(channels ,1) as weights 
from (
    // ... query for marketing_channels as before

gives us:

user_id channels weights
1 {email, paid} 0.5
2 {paid, banner, organic} 0.3333
... ... ...

We'll now expand the list of channels into individual rows using PostgreSQL's array unnest function:

select user_id, channels, weights, unnest(channels) 
from (
    select user_id, channels, 1.0 / array_length(channels, 1) as weights 
    from (
        // ... query for marketing_channels as before
) b

which gives us:

user_id channels weights unnest
1 {email, paid} 0.5 email
1 {email, paid} 0.5 paid
2 {paid, banner, organic} 0.3333 paid
2 {paid, banner, organic} 0.3333 banner
2 {paid, banner, organic} 0.3333 organic
... ... ... ...

A join with the revenue_events on the user_id, multiplying the revenue by the weights and an aggregate by channel gives us the weighted revenue total by channel.

select unnest_channel, sum(c.weights *
from (
    select user_id, channels, weights, unnest(channels) as unnest_channel
    from (
        select user_id, channels, 1.0 / array_length(channels ,1) as weights 
        from (
            // ... query for marketing_channels as before
    ) b
) c
inner join revenue_events re
on c.user_id = re.user_id
group by unnest_channel;

5f. Time-decay marketing attribution model

As with the linear model, we can vary the weights by a decay function. We'll use as the decay function, so the weights will be , , and so on. We also have to make sure the weights add upto 1.

In general, is a geometric series and the sum has a closed form equation:

We'll do unnest to get individual rows, row_number to enumerate the rows partitioned by user_id, pow(2.0, row_number()) to generate the weights, and a sum with the closed form equation to scale the weights back to 1.

👋 No fuss, just SQL We are open sourcing everything from the experience working with our agency clients. They spend thousands of dollars to get this level of detailed analysis – which you can now get for free. You'll learn bucketing, finding outliers, calculating correlations, budgeting & forecasting, return on spend and investment, data smoothing and cleaning applied to freemium, transactional, advertising, two-sided marketplace and subscription based business models. We send one update every week. Join 400+ data analysts who are leveling up with our recipes. 👊

No spam, ever! Unsubscribe any time.