Common mistakes we see when companies setup their first Redshift cluster
4 Common mistakes we see Companies make when they build their first Redshift Cluster
Estimated reading time: 4 minutes.

We as an industry are in the early stages of figuring out how best to run an elastic data warehouse like Amazon Redshift. Some patterns are beginning to emerge, and best practices are being established. Here’s a catalogue of mistakes we often see that we’re excited to share…

1. Do not reinvent COPY

We often see teams investing engineering resources in building a system to load data into Redshift. Various techniques are used, most commonly using the INSERT command. This is futile.

Redshift provides a COPY command which is the most efficient way to load data into Redshift. The recommended approach is to stage your data as CSV files (preferrably gzipped) on Amazon S3 and use the COPY command to directly load data into Redshift.

COPY "schema"."table"
  FROM 's3://<bucket_name>/<object_name>.csv.gz'
  CREDENTIALS
    'aws_access_key_id=<aws_access_key>;
     aws_secret_access_key=<aws_secret_access_key>'
  REGION 'us-west-1'
  DELIMITER ','
  GZIP;

2. Indices are not what you’d expect on Redshift

Redshift uses a columnar storage, i.e., the values for a particular column are stored sequentially. This storage mechanism drastically reduces the amount of data required to be read from disk and stored in memory for processing. It also supports better compression ratios while stored on disk because similar data is stored sequentially together on disk.

As a result of these unique architectural choices, Redshift has no indices. Each query is a full table scan (though run in parallel across the shards.) It’s your responsibility to optimize your data for greatest throughput for which you have two tools:

  • sort key: your data is stored sorted on disk according to the sort key
  • dist key: your data is sharded across the Redshift nodes according the dist key

Most performance problems on Redshift boil down to incorrectly setting sort and dist keys.

3. Carefully design your Schemas

It’s tempting to store your data as a JSON field and use Redshift’s JSON functions to query it. Your data table will just have two columns:

id property
1 {“country”: “us”, “city”: “san francisco”}
2 {“country”: “us”, “city”: “boston”}
3 {“country”: “ca”, “city”: “toronto”}

The property column is a JSON field and you can use Redshift’s JSON functions to use any key in the JSON field:

SELECT json_extract_path_text(property, 'country') AS country
  FROM table_name;

What’s not to like? You don’t have to maintain the schema, it’s flexible to add new keys making the schema future proof.

The problem is that the performance of querying JSON fields is slow as these fields do not take advantage of Redshift’s unique architecture:

  1. These keys and values of the JSON field are stored sequentially not making use of Redshift’s columnar store architecture. Each query will have to load the entire JSON document, causing unneeded IO load.
  2. Defeats the query planner because Redshift cannot make use of statistical metadata on the keys of the field. With a first class column, Redshift knows how many NULL values and other distribution properties of a column. These column statistics are critical in the performance of columns participating in sorting and grouping operations, joins and query predicated.
  3. The table will be bigger in size because compression will be less effective. Since Redshift is a column-store, compression is available on a per-column basis. This allows tricks like storing ‘deltas’ between sequential values of a column, which wouldn’t be possible with a row-based storage.

Unless absolutely necessary, it’s worth the effort of converting the JSON field to a set of first class columns with the right data type.

4. Learn to use SQL

You found a way to load your data into Redshift and keep that data in sync with your primary data source. That’s great news!

Now you want to run some sophisticated analysis on your data on Redshift – like predicting your churn rate, or calculate your customer engagement / product usage KPIs.

Some data analysts are more familiar with R and Python – two popular data analysis frameworks for this kind of work – rather than SQL.

Sometimes we find customers executing SQL statements like a SELECT * and doing their joins and filtering in the Python or R application.

When they find that their code is running slow, they try to re-implement various optimization strategies like nested loop joins and hash joins.

A better and more scalable approach is to learn SQL to extract just the right amount of information, and in the right shape necessary for your analysis. This includes doing your joins, segmentation and filters in the SQL. Redshift excels at these kinds of tasks, so why not let it do its job?

Your turn

We’ve written about four common mistakes one can make as they begin to use a data warehouse product like Redshift. Anything else you can think of, please feel free to share in the comments below!

Photo Credit: Unsplash