Modeling: Denormalized Dimension Tables with Materialized Views for Business Users

Your data warehouse has:

  • dimension tables containing categorization of people, products, place and time – generally modeled as one table per object.
  • fact tables containing measurements, metrics and facts of a business process.

For example, an e-commerce store will have an order and order_line fact tables recording each order and its associated order line items. The product and store dimension tables are related via the foreign keys.

This simplistic model is for illustration purposes only – a more complete model will contains fields for product categories, shipping options, sales persons, etc.

A data analyst can then calculate various metrics such as sum of orders per store, or sum of orders per category, or sum of orders per day/week/month etc. You get the idea.

While a normalized database model like the above makes sense to a data analyst and is absolutely necessary for your transaction / application database to ensure data consistency, it makes adhoc queries significantly harder for business users. Business users balk at joins and find it conceptually hard to understand. There are plenty of visual tools to create joins, but it's debatable whether it solves the problem of knowing what kind of join to use.

Let's investigate an approach to improve the usability of your data warehouse.


1. Approaches for business users to engage with data

Static reports at the end of the month are one thing and it's alright for a data analyst to work with the normalized table format. To truly unlock the value of your data – every line employee and manager should be able to engage with the data without the need to recruit a data analyst.

In our experience, there are three approaches of varying levels of success, each with it's own caveats:

  1. Denormalization – create a view (materialized for performance) that denormalizes the dimension tables and presents a flat structure. Business users are used to exploring data in spreadsheets, which is denormalized. They can then aggregate on measures, pivot on dimensions, etc.

  2. Data modeling – use a layer to build a semantic model of the relationships between the tables. Present a tree-like schema browser that allows the business user to "drill-down" a join path and create a tabular end result, without ever having to know or touch SQL.

  3. One query to rule them all – use a templating language over SQL to implement pluggable WHERE clauses and present a point and click interface to enable a business user to modify the query, without having to know or touch SQL.

2. Denormalized dimension tables

Our recommendation is to "cache" a denormalized version of the relational data model. You can accomplish this using technology that's already built into your database and no external software is needed. An additional advantage is that you could potentially see performance improvements with a specialized columnar database like Amazon Redshift or Google Bigquery.

The biggest win is usability. You've reduced numerous dimension tables into one table. It's not uncommon to see a business process contain 20 or more dimension tables.

To create a table like the above, you can join all of the dimension tables into one giant table. You can then refresh this table periodically depending on your needs. Every night seems to work for most cases. Any changes to the dimension tables, like store name, will be captured on the next refresh.

with products as (
    select * from products_table
), 
stores as (
    select * from stores_table
),
orders as (
    select * from orders_table
),
order_lines as (
    select * from order_lines_table
),

joined as (
    select 
        order_lines.order_id, 
        order_lines.amount, 
        order_lines.units,
        orders.date, 
        products.name, 
        products.other_data,
        stores.name,
        stores.other_data
    from order_lines
    left outer join orders on order_lines.order_id = orders.id, 
    left outer join products on orders.product_id = products.id
    left outer join stores on orders.store_id = stores.id
)

select * from joined

From the single joined table, it's still possible to calculate various metrics such as sum of orders per store, or sum of orders per category, or sum of orders per day/week/month as before. No functionality is lost.

3. Store the denormalized dimension table as a database View

A database view creates a pseudo-table and from the perspective of a select statement, it appears exactly as a regular table. In PostgreSQL, views are created with the create view statement:

create view <view_name> as 
    select <column1>, <column2>, .... 
    from <table_name>;

The view is now available to be queried with a select statement. As this is not a real table, you cannot delete or update it. The underlying query is run every time you query the view.

If you want to store the result of the underlying query – you’d just have to use the materialized keyword:

create materialized view <view_name> as 
    select <column1>, <column2>, .... 
    from <table_name>;

You now control the upgrade schedule of the view and can be refreshed at your convenience:

refresh materialized view <view_name>;

👋 No fuss, just SQL We are open sourcing everything from the experience working with our agency clients. They spend thousands of dollars to get this level of detailed analysis – which you can now get for free. We send one update every week. Join 400+ data analysts who are leveling up with our recipes. 👊

No spam, ever! Unsubscribe any time. See past emails here.

results matching ""

    No results matching ""