Finding Patterns & Matching Substrings using Regular Expressions

Although there are multiple ways to do pattern matching in SQL, we will look at the most powerful and flexible of them all – regular expressions.

Fortunately, PostgreSQL and Redshift support POSIX regular expressions out of the box. We'll consider two examples that use regular expressions to pattern match.


1. Pattern Matching Business and Consumer Email Addresses

A common analysis task is to count the number of emails in your database that are of commercial value, i.e., not using disposable email addresses or free providers like gmail or hotmail.

We'll make an assumption that the @ character splits the email address to a name and domain name. If your database table named "emails" contains email addresses like these:

email
first@gmail.com
second@gmail.com
third@business.com
fourth@provider.com

Then the split_part function will split the email address to its respective parts:

SELECT email, 
       split_part(email, '@', 1) as name,
       split_part(email, '@', 2) as domain
FROM email;

which gives us:

email name domain
first@gmail.com first gmail.com
second@gmail.com second gmail.com
third@business.com third business.com
fourth@provider.com fourth provider.com

We'll next need a database of services that hand out free and disposable email addresses. We found a github project that has an actively maintained list of such services. If the data for these services is in a table named "free_email_domains", a simple join with this table will classify an email as a business user or a consumer.

SELECT emails.email as email, 
       CASE WHEN 
            free_email_domains.domain IS NOT NULL 
            THEN 'free'
            ELSE 'business'
        END as email_type
FROM emails
JOIN free_email_domains
ON free_email_domains.domain = split_part(emails.email, '@', 2);
email email_type
first@gmail.com free
second@gmail.com free
third@business.com business
fourth@provider.com business

2. Pattern Matching Google Analytics UTM parameters

If you've ever worked with marketers, you would have come across UTM parameters. These are small bits of information attached to every URL to track campaign and channel effectiveness amongst others. A tracked URL will look like something like this:

http://www.example.com/?utm_source=facebook&utm_medium=social&utm_campaign=black-friday

The parameters are attached as query parameters after the "?" and have standard definitions like source, medium, campaign, etc. You can use a handy tool like the Google Campaign URL Builder to build URLs like these.

Suppose we have a table named pageviews with the following schema:

dt url
2016-01-01 http://www.example.com/?utm_source=facebook&utm_medium=social&utm_campaign=black-friday
2016-01-02 http://www.example.com/?utm_source=google&utm_medium=cpc&utm_campaign=black-friday

and we want a result set that looks like this:

dt source medium campaign
2016-01-01 facebook social black-friday
2016-01-02 google cpc black-friday

We are going to make use the substring and regexp_replace functions to first extract a substring that matches the regular expression for campaign, source and medium and then replace the matched text to get what we want. We can do this in one step, but for illustration purposes, we'll do it in two steps. Our regular expression is going to take the form:

(?!&)utm_campaign=[^&]*(?=&) 

To first extract utm_campaign, we use the query:

SELECT dt, substring(url from '(?!&)utm_campaign=[^&]*(?=&)')
FROM pageviews;

which gives us:

dt substring
2016-01-01 utm_campaign=facebook
2016-01-02 utm_campaign=google

then we replace the utm_campaign= string with an empty string for the final result:

SELECT dt, regexp_replace(substring(url from '(?!&)utm_campaign=[^&]*(?=&)'), 
                          'utm_[^=]*=', '') as utm_campaign
FROM pageviews;

which gives us:

dt utm_campaign
2016-01-01 facebook
2016-01-02 google

We can now repeat this pattern for utm_medium and utm_source to get the final result.


👋 No fuss, just SQL We are open sourcing everything from the experience working with our agency clients. They spend thousands of dollars to get this level of detailed analysis – which you can now get for free. We send one update every week. Join 400+ data analysts who are leveling up with our recipes. 👊

No spam, ever! Unsubscribe any time.