Picking a Database for Analytics: PostgreSQL vs. MySQL
MySQL bills itself as "the world's most popular open source database"; PostgreSQL, as "the world's most advanced open source database."
Transactional Databases can be different from Analytical Databases
At Silota, we do database analytics using SQL. All of our educational content and expert services are aligned with fulfilling this need. A common question we get when clients get started is what database to use for analytical workloads. Regardless of the transactional database your application uses, be it MySQL, Oracle, SQL Server or the dozens of other choices available – when it comes to analytics, you have the option of going with a different database.
Often teams pick the same database as the one serving their transactional workloads. That's a fair choice, since your team already has the required expertise in-house. If you're using MySQL as your transactional database, however, our recommendation is to seriously consider PostgreSQL as your analytical database.
Let's see why.
1. Postgres has Database Schemas (namespacing)
PostgreSQL supports database schemas, an additional level of organization for your tables. Many of the clients we work with centralize their data into a single PostgreSQL database. For example, a client that uses Stripe for payment processing and MailChimp for email campaigns, in addition to their own application data, would give each data source its own schema.
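A minimal sketch of that layout, with illustrative schema and table names:

```sql
-- one schema per data source (names are illustrative)
CREATE SCHEMA stripe;
CREATE SCHEMA mailchimp;

-- each source's tables then live in their own namespace:
--   stripe.charges, stripe.customers,
--   mailchimp.campaigns, mailchimp.subscribers, ...
```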
You can then join across these tables using the fully qualified table name "mailchimp.campaigns", "stripe.charges", etc:
SELECT mailchimp.campaigns.campaign_name,
       avg(mailchimp.campaigns.open_rate) AS open_rate,
       sum(stripe.charges.charge) AS revenue
FROM stripe.charges
JOIN mailchimp.campaigns
  ON mailchimp.campaigns.customer_id = stripe.charges.customer_id
GROUP BY 1;
The above (illustrative) query tells you the total revenue and average open rates for each MailChimp campaign your marketing team sends out.
The MySQL way of achieving this level of categorization is to prefix each table name with its data source: stripe_charges, mailchimp_campaigns, and so on.
There is no question this is cumbersome to use and reason about. Another approach we've seen is to create a separate database for each data source. This gives you the same functionality as PostgreSQL's schemas, but it is a heavy-handed approach just for organization.
2. PostgreSQL has Window Functions
A lot of the sophistication we apply to data analysis comes from SQL's window functions. Our library makes extensive use of them: running totals, month-over-month growth rates, and finding and removing duplicate rows.
SQL window functions work differently from regular aggregate functions like AVG in that they do not collapse rows into a single aggregate. Instead, they let you define a window that can span preceding or following rows. Functions like ROW_NUMBER, and ordinary aggregates used with an OVER clause, make this possible.
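As a sketch, here is one of the patterns mentioned above, a running total, written with a window function (daily_revenue and its columns are hypothetical names for illustration):

```sql
-- running total of revenue by day (daily_revenue is a hypothetical table)
SELECT dt,
       revenue,
       sum(revenue) OVER (ORDER BY dt) AS running_total
FROM daily_revenue
ORDER BY dt;
```

Note that no rows are collapsed: every input row comes back, with the running total attached as an extra column.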
MySQL (prior to version 8.0) does not have window functions. Instead of using SQL's expressiveness to get to the point, you'll waste time on self-joins and shadow tables as workarounds.
3. PostgreSQL handles date and time formats better
One of the most important dimensions in data analysis is the datetime axis. Comparisons along date-time occur in every business domain, and PostgreSQL provides a rich set of functions to deal with them. Take, for example, the date_trunc function, which truncates a timestamp down to a given resolution:

date_trunc('quarter', dt) to truncate to the start of the quarter
date_trunc('week', dt) to truncate to the start of the week
For MySQL, the same functionality looks like this:

(MAKEDATE(YEAR(dt), 1) + INTERVAL QUARTER(dt) QUARTER - INTERVAL 1 QUARTER) to truncate to the start of the quarter
DATE(dt - INTERVAL DAYOFWEEK(dt) - 1 DAY) to truncate to the start of the week
Much more verbose and not consistent in any manner!
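Combined with GROUP BY, date_trunc makes time-bucketed rollups one-liners. A sketch, assuming a hypothetical payments table with created_at and amount columns:

```sql
-- revenue per month (payments is a hypothetical table)
SELECT date_trunc('month', created_at) AS month,
       sum(amount) AS revenue
FROM payments
GROUP BY 1
ORDER BY 1;
```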
4. PostgreSQL has Set Returning Functions
PostgreSQL supports functions that can return more than one row. One of the most important functions in this class is generate_series(start, stop, interval), which generates a row for each value from start to stop with a step size of interval. For example:
> SELECT * FROM generate_series(2, 10, 2);
 generate_series
-----------------
               2
               4
               6
               8
              10
(5 rows)
PostgreSQL can also generate a series with datetime data types, for example:
> SELECT * FROM generate_series('2016-01-01'::date, '2016-01-05'::date, '1 day');
    generate_series
------------------------
 2016-01-01 00:00:00-08
 2016-01-02 00:00:00-08
 2016-01-03 00:00:00-08
 2016-01-04 00:00:00-08
 2016-01-05 00:00:00-08
(5 rows)
Sadly, MySQL does not support the generate_series function, and you'll spend time building calendar and number tables of varying resolutions as workarounds.
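The most common analytical use is filling gaps in a sparse time series, so that days with no activity still show up as zeros. A sketch, assuming a hypothetical signups table with id and created_at columns:

```sql
-- one row per day, with zero for days that had no signups
SELECT d.day,
       count(s.id) AS signups
FROM generate_series('2016-01-01'::date, '2016-01-31'::date, '1 day') AS d(day)
LEFT JOIN signups s ON s.created_at::date = d.day
GROUP BY 1
ORDER BY 1;
```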
5. PostgreSQL has Common Table Expressions
A key component of your analytical queries is the future maintainability and readability of your SQL. Unlike transactional queries – which are often point-queries and quite concise – analytical queries will have many layers of nesting.
PostgreSQL supports Common Table Expressions (CTEs), which let you "refactor" subqueries out of a larger query. You can think of them as temporary tables that exist only for the duration of that query.
We have two examples in our library that make use of CTEs.
Without CTEs, such queries require self-joins or deeply nested subqueries, and become really messy to read and maintain.
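A sketch of the pattern, layering a window function over a CTE (payments is a hypothetical table with created_at and amount columns):

```sql
-- month-over-month revenue growth, built in two readable steps
WITH monthly AS (
    SELECT date_trunc('month', created_at) AS month,
           sum(amount) AS revenue
    FROM payments
    GROUP BY 1
)
SELECT month,
       revenue,
       revenue - lag(revenue) OVER (ORDER BY month) AS growth
FROM monthly
ORDER BY month;
```

Each step reads top to bottom, and the CTE can be run and tested in isolation before being folded into the larger query.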
While MySQL might be a robust and popular choice for transactional workloads, PostgreSQL is a much better choice for your next analytics project. The time invested in learning to use PostgreSQL is a great investment and will save you a lot of time down the line.