Segmenting and Lead scoring your email list in SQL
Your email list is one of the biggest marketing assets your business can have. Email service providers like Mailchimp, AWeber and Campaign Monitor provide a lot of interesting stats out of the box, like open, click, bounce and unsubscribe rates.
While these stats might be interesting in aggregate, we'll want to calculate more fine-grained numbers to really understand the health of our list. The only way to do that is by exporting the raw data from your email service provider and calculating your own stats. Let's drive stick!
This recipe builds upon the previous RFM analysis, customized for email analytics. Being able to segment and score your list members based on their recency and frequency of engagement is a powerful tool in our toolbox. You'll be able to do things like:
- cull dead members, which will increase open rates and click rates
- special targeting to the most active members with offers
- setup triggers when members increase or decrease their lead score
In the recipe, we'll use Mailchimp data, but the SQL should work with any other email provider.
Step 1: ETL your Mailchimp data
Before we can get started, we'll need to have our Mailchimp data in our database. There are many ways to do this – via downloading CSVs from the Mailchimp web interface, 3rd party ETL providers or by writing a script yourself. We took the script approach, which we have open-sourced. Our one-off script downloads all campaign level stats for each list member and dumps it as a CSV.
Here's the script: Export Mailchimp raw data.
The CSV file you get will have a row for each member and for each campaign. The row will have columns for the number of opens and the number of clicks, which in the case of a inactive member will be (0,0). The format looks something like this:
|12abcd||2017-03-20T15:00:00+00:00||🙏 Your best customers in SQL||email_1||2||1|
|12abcd||2017-03-20T15:00:00+00:00||🙏 Your best customers in SQL||email_2||4||0|
|12wxyz||2017-03-27T15:00:00+00:00||📊 Histogram Summaries with SQL||email_1||0||0|
|12wxyz||2017-03-27T15:00:00+00:00||📊 Histogram Summaries with SQL||email_2||1||0|
The CSV file should directly match the schema we'll create in our database:
CREATE TABLE stats ( campaign_id varchar(16), sent_time timestamp, subject varchar(256), email varchar(64), open_count integer, click_count integer );
We can then use PostgreSQL's
\copy command to copy the CSV file into the table we just created:
\copy stats from 'mailchimp.csv' CSV;
We are now ready to segment and score our list members.
Step 2: Calculating days since last open, recency, total open counts, and frequency
We are going to be building intermediate results with Common Table Expressions (CTE), which is a way to organize your queries.
Days since last opened email and open frequency
Our first task is to calculate the number of days lapsed since the last time the user opened an email, and the total number of email opens they have done thus far:
SELECT email, now()::date - max(sent_time)::date AS last_opened_days, sum(open_count) AS open_count FROM stats WHERE open_count > 0 GROUP BY email;
last_opened_days and the
open_count columns are the recency and frequency parameters of interest. Let's wrap this in a CTE:
WITH rf_table AS ( // ...query above... )
With the CTE, you can access the columns as
Bin members into quartiles based on recency and frequency
We don't know before hand the max count for email opens, or the age of the oldest inactive list member. So, we'll cheat and use percentiles to automatically bin the ranges:
SELECT email, last_opened_days, open_count, ntile(4) over (ORDER BY last_opened_days) AS recency, ntile(4) over (ORDER BY open_count DESC) AS frequency FROM rf_table
This will create a matrix and assign each list member to a cell based on their recency and frequency. Let's wrap the above query in another CTE and call it
rf_table_quartiles and we show the full query we've built so far for illustration:
WITH rf_table AS ( SELECT email, now()::date - max(sent_time)::date AS last_opened_days, sum(open_count) AS open_count FROM stats WHERE open_count > 0 GROUP BY email; ), rf_table_quartiles AS ( SELECT email, last_opened_days, open_count, ntile(4) over (ORDER BY last_opened_days) AS recency, ntile(4) over (ORDER BY open_count DESC) AS frequency FROM rf_table ) SELECT *;
Combine the quartiles into a score
There are many ways to combine and weigh the recency/frequency parameters. Tuning lead scores is a project in itself. For our purposes, we weigh the recency parameter more heavily (10x) than the frequency parameter:
SELECT email, recency * 10 + frequency as score FROM rf_table_quartiles ORDER BY score DESC
That's it! The resulting table will rank our list members from the most valuable (lowest score) to the least valuable (highest score).
The final lead scoring query
WITH rf_table AS ( SELECT email, now()::date - max(sent_time)::date AS last_opened_days, sum(open_count) AS open_count FROM stats WHERE open_count > 0 GROUP BY email; ), rf_table_quartiles AS ( SELECT email, last_opened_days, open_count, ntile(4) over (ORDER BY last_opened_days) AS recency, ntile(4) over (ORDER BY open_count DESC) AS frequency FROM rf_table ) SELECT email, recency * 10 + frequency as score FROM rf_table_quartiles ORDER BY score DESC
We have two ways of extending this analysis, either by breadth or depth:
- breadth: adding more data sources as dimensions, like sales transactions, support load, etc.
- depth: our email list will contain a mix of customers and non-customers – considering it as a funnel, you can take the analysis a further step down the funnel (with open rates, purchase rates, etc.) and your lead score can include a converstion rate as a factor.
You are reading a series of Advanced SQL tricks.
Read this next: Calculating N-tiles
Join 300+ data analysts who are using our weekly SQL tips to improve their skills. No spam, promise!