Finding Patterns & Matching Substrings using Regular Expressions
Although there are multiple ways to do pattern matching in SQL, we will look at the most powerful and flexible of them all – regular expressions.
Fortunately, PostgreSQL and Redshift support POSIX regular expressions out of the box. We'll consider two examples that use regular expressions to pattern match.
1. Pattern Matching Business and Consumer Email Addresses
A common analysis task is to count the number of emails in your database that are of commercial value, i.e., not using disposable email addresses or free providers like gmail or hotmail.
We'll make an assumption that the
@ character splits the email address to a name and domain name. If your database table named "emails" contains email addresses like these:
split_part function will split the email address to its respective parts:
SELECT email, split_part(email, '@', 1) as name, split_part(email, '@', 2) as domain FROM email;
which gives us:
We'll next need a database of services that hand out free and disposable email addresses. We found a github project that has an actively maintained list of such services. If the data for these services is in a table named "free_email_domains", a simple join with this table will classify an email as a business user or a consumer.
SELECT emails.email as email, CASE WHEN free_email_domains.domain IS NOT NULL THEN 'free' ELSE 'business' END as email_type FROM emails JOIN free_email_domains ON free_email_domains.domain = split_part(emails.email, '@', 2);
2. Pattern Matching Google Analytics UTM parameters
If you've ever worked with marketers, you would have come across UTM parameters. These are small bits of information attached to every URL to track campaign and channel effectiveness amongst others. A tracked URL will look like something like this:
The parameters are attached as query parameters after the "?" and have standard definitions like source, medium, campaign, etc. You can use a handy tool like the Google Campaign URL Builder to build URLs like these.
Suppose we have a table named
pageviews with the following schema:
and we want a result set that looks like this:
We are going to make use the
regexp_replace functions to first extract a substring that matches the regular expression for campaign, source and medium and then replace the matched text to get what we want. We can do this in one step, but for illustration purposes, we'll do it in two steps. Our regular expression is going to take the form:
To first extract
utm_campaign, we use the query:
SELECT dt, substring(url from '(?!&)utm_campaign=[^&]*(?=&)') FROM pageviews;
which gives us:
then we replace the
utm_campaign= string with an empty string for the final result:
SELECT dt, regexp_replace(substring(url from '(?!&)utm_campaign=[^&]*(?=&)'), 'utm_[^=]*=', '') as utm_campaign FROM pageviews;
which gives us:
We can now repeat this pattern for
utm_source to get the final result.
You are reading a series of Advanced SQL tricks.
Read this next: Concatenating Rows of String Values for Aggregation
Join 300+ data analysts who are using our weekly SQL tips to improve their skills. No spam, promise!