Analyze Mailchimp Data by Segmenting and Lead scoring your email list in SQL

Your email list is one of the biggest marketing assets your business can have. Email service providers like Mailchimp, AWeber and Campaign Monitor provide a lot of interesting stats out of the box, like open, click, bounce and unsubscribe rates.

While these stats might be interesting in aggregate, we'll want to calculate more fine-grained numbers to really understand the health of our list. The only way to do that is by exporting the raw data from your email service provider and calculating your own stats. Let's drive stick!

This recipe builds upon the previous RFM analysis, customized for email analytics. Being able to segment and score your list members based on their recency and frequency of engagement is a powerful tool in our toolbox. You'll be able to do things like:

cull dead members, which will increase open rates and click rates
special targeting to the most active members with offers
setup triggers when members increase or decrease their lead score

In the recipe, we'll use Mailchimp data, but the SQL should work with any other email provider.

Step 1: ETL your Mailchimp data

Before we can get started, we'll need to have our Mailchimp data in our database. There are many ways to do this – via downloading CSVs from the Mailchimp web interface, 3rd party ETL providers or by writing a script yourself. We took the script approach, which we have open-sourced. Our one-off script downloads all campaign level stats for each list member and dumps it as a CSV.

Here's the script: Export Mailchimp raw data.

The CSV file you get will have a row for each member and for each campaign. The row will have columns for the number of opens and the number of clicks, which in the case of a inactive member will be (0,0). The format looks something like this:

campaign_id	sent_time	subject	email	open_count	click_count
12abcd	2017-03-20T15:00:00+00:00	🙏 Your best customers in SQL	email_1	2	1
12abcd	2017-03-20T15:00:00+00:00	🙏 Your best customers in SQL	email_2	4	0
12wxyz	2017-03-27T15:00:00+00:00	📊 Histogram Summaries with SQL	email_1	0	0
12wxyz	2017-03-27T15:00:00+00:00	📊 Histogram Summaries with SQL	email_2	1	0
...	...	...	...	...	...

The CSV file should directly match the schema we'll create in our database:

create table stats (
    campaign_id varchar(16), 
    sent_time timestamp, 
    subject varchar(256), 
    email varchar(64), 
    open_count integer, 
    click_count integer
);

We can then use PostgreSQL's \copy command to copy the CSV file into the table we just created:

\copy stats from 'mailchimp.csv' CSV;

We are now ready to segment and score our list members.

Step 2: Calculating days since last open, recency, total open counts, and frequency

We are going to be building intermediate results with Common Table Expressions (CTE), which is a way to organize your queries.

Days since last opened email and open frequency

Our first task is to calculate the number of days lapsed since the last time the user opened an email, and the total number of email opens they have done thus far:

select email, 
       now()::date - max(sent_time)::date as last_opened_days,
       sum(open_count) as open_count
from stats
where open_count > 0
group by email;

The last_opened_days and the open_count columns are the recency and frequency parameters of interest. Let's wrap this in a CTE:

WITH rf_table AS (
    // ...query above...
)

With the CTE, you can access the columns as rf_table.last_opened_days and rf_table.open_count.

Bin members into quartiles based on recency and frequency

We don't know before hand the max count for email opens, or the age of the oldest inactive list member. So, we'll cheat and use percentiles to automatically bin the ranges:

select email,
       last_opened_days,
       open_count,
       ntile(4) over (order by last_opened_days) as recency,
       ntile(4) over (order by open_count desc) as frequency
from rf_table

This will create a matrix and assign each list member to a cell based on their recency and frequency. Let's wrap the above query in another CTE and call it rf_table_quartiles and we show the full query we've built so far for illustration:

with rf_table as (
        select email, 
            now()::date - max(sent_time)::date as last_opened_days,
            sum(open_count) as open_count
        from stats
        where open_count > 0
        group by email;
     ),
     rf_table_quartiles as (
        select email,
            last_opened_days,
            open_count,
            ntile(4) over (order by last_opened_days) as recency,
            ntile(4) over (order by open_count desc) as frequency
        from rf_table    
     )
select *;

Combine the quartiles into a score

There are many ways to combine and weigh the recency/frequency parameters. Tuning lead scores is a project in itself. For our purposes, we weigh the recency parameter more heavily (10x) than the frequency parameter:

select email,
       recency * 10 + frequency as score
from rf_table_quartiles
order by score desc

That's it! The resulting table will rank our list members from the most valuable (lowest score) to the least valuable (highest score).

The final lead scoring query

with rf_table as (
        select email, 
            now()::date - max(sent_time)::date as last_opened_days,
            sum(open_count) as open_count
        from stats
        where open_count > 0
        group by email;
     ),
     rf_table_quartiles as (
        select email,
            last_opened_days,
            open_count,
            ntile(4) over (order by last_opened_days) as recency,
            ntile(4) over (order by open_count desc) as frequency
        from rf_table    
     )
select email,
       recency * 10 + frequency as score
from rf_table_quartiles
order by score desc

Next steps

We have two ways of extending this analysis, either by breadth or depth:

breadth: adding more data sources as dimensions, like sales transactions, support load, etc.
depth: our email list will contain a mix of customers and non-customers – considering it as a funnel, you can take the analysis a further step down the funnel (with open rates, purchase rates, etc.) and your lead score can include a converstion rate as a factor.

👋 No fuss, just SQL We are open sourcing everything from the experience working with our agency clients. They spend thousands of dollars to get this level of detailed analysis – which you can now get for free. We send one update every week. Join 400+ data analysts who are leveling up with our recipes. 👊

No spam, ever! Unsubscribe any time. See past emails here.