Because Redshift is a columnar database, it is well suited to ETL loads and transformations, but extra care is needed in the presentation layer. Concurrent query execution on a Redshift cluster does not scale well, even on larger cluster sizes: AWS recommends running no more than about 15 concurrent queries on a cluster.
A reporting layer built on top of the Redshift database should therefore focus on performance and query tuning.
The typical approach to performance tuning in relational databases is to scale node capacity vertically, but in a distributed environment this often does not solve the problem.
A couple of alternative methods are available to tackle this problem. One is to create a single table containing all the metrics required for reporting and present the data from that layer; this reporting table is simply the final presentation-layer table.
An effective alternative way to tune queries in AWS Redshift is to use WLM (workload management) queues to allocate resources for query execution.
Before attempting to tune queries, the developer should make sure the points below are satisfied, since query performance differs significantly based on these parameters.
The following Amazon documentation gives an overview of distribution keys, sort keys, and column encoding.
https://docs.aws.amazon.com/redshift/latest/dg/t_Creating_tables.html
1. Distribution Key - Choose a proper distribution key for each table so that data retrieval is faster
2. Sort Key - Choose sort keys appropriately so that data is sorted properly in the database
3. Encoding - Column encoding ensures that database space is used efficiently
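As a sketch, all three of the settings above can be declared in the table DDL. The table, columns, and key choices here are hypothetical, chosen only to illustrate the syntax:

```sql
-- Hypothetical sales fact table illustrating DISTKEY, SORTKEY, and ENCODE.
CREATE TABLE sales_fact (
    sale_id     BIGINT        ENCODE az64,
    customer_id INTEGER       ENCODE az64,
    sale_date   DATE          ENCODE az64,
    amount      DECIMAL(12,2) ENCODE az64,
    region      VARCHAR(32)   ENCODE lzo
)
DISTSTYLE KEY
DISTKEY (customer_id)   -- co-locates rows for joins on customer_id
SORTKEY (sale_date);    -- speeds up range-restricted scans on sale_date
```

A good distribution key is one that is frequently joined on and has high cardinality, so rows are spread evenly across slices.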
Configuring the WLM queue is a four-step process:
1. Define new user groups
2. Define new queues along with workloads with memory allocation for each queue
3. Add users to the groups
4. Assign the rules to the queues along with the memory utilization
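The four steps above come together in the `wlm_json_configuration` parameter of the cluster's parameter group. A minimal sketch, with hypothetical user-group names, concurrency levels, and memory splits:

```json
[
  {
    "user_group": ["etl_users"],
    "query_concurrency": 5,
    "memory_percent_to_use": 60,
    "rules": [
      {
        "rule_name": "abort_long_etl",
        "action": "abort",
        "predicate": [
          {"metric_name": "query_execution_time", "operator": ">", "value": 3600}
        ]
      }
    ]
  },
  {
    "user_group": ["reporting_users"],
    "query_concurrency": 10,
    "memory_percent_to_use": 40
  }
]
```

Here ETL queries get more memory per slot (fewer concurrent slots, larger share), while reporting queries get more concurrency with less memory each; a query monitoring rule aborts ETL queries running longer than an hour.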
First, define the parameter group for the cluster and give it a name. A parameter group can be shared by multiple clusters, but each cluster uses one parameter group at a time.
Refer to the Redshift documentation on query tuning and WLM queue hopping - https://docs.aws.amazon.com/redshift/latest/dg/wlm-queue-hopping.html
Table hygiene is another important part of optimizing a Redshift cluster.
Regularly running VACUUM and ANALYZE on tables keeps table statistics up to date, which is an important step toward better query performance.
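A sketch of the maintenance statements, using the hypothetical table name from earlier:

```sql
-- Reclaim space and re-sort rows after heavy deletes and updates.
VACUUM FULL sales_fact;

-- Refresh planner statistics so the optimizer has current row counts.
ANALYZE sales_fact;
```

In practice these are scheduled during off-peak windows, since VACUUM competes with user queries for cluster resources.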
Query execution can also be made faster using alternative engines such as Athena and Redshift Spectrum.
Using Athena is another way to achieve performance gains for ad-hoc queries. Because Athena is serverless, query execution can be considerably faster than on a fixed-size cluster.
Data can be stored in S3 in Parquet format, and Athena can query it directly with good performance.
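For example, an Athena external table over Parquet files in S3 might look like this (the bucket, path, and columns are hypothetical):

```sql
-- Athena table over Parquet files; no data is loaded, S3 is read directly.
CREATE EXTERNAL TABLE sales_history (
    sale_id   BIGINT,
    sale_date DATE,
    amount    DECIMAL(12,2)
)
STORED AS PARQUET
LOCATION 's3://my-analytics-bucket/sales/';

-- Ad-hoc aggregation runs serverlessly against the Parquet files.
SELECT sale_date, SUM(amount) AS total
FROM sales_history
GROUP BY sale_date;
```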
https://aws.amazon.com/athena/faqs/
Spectrum - Redshift Spectrum extends query execution across both cluster tables and files in S3.
For example, historical data can live in files on S3 while current data stays in the Redshift cluster, and a single query can read both for better overall performance.
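A sketch of that Spectrum pattern, assuming a hypothetical external schema, IAM role, and the table names used above:

```sql
-- External schema pointing at an external data catalog database on S3.
CREATE EXTERNAL SCHEMA spectrum_sales
FROM DATA CATALOG DATABASE 'sales_archive'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';

-- Combine historical rows in S3 with current rows in the cluster.
SELECT sale_date, SUM(amount) AS total
FROM (
    SELECT sale_date, amount FROM spectrum_sales.sales_history  -- S3 files
    UNION ALL
    SELECT sale_date, amount FROM sales_fact                    -- local table
) AS combined
GROUP BY sale_date;
```

The external table is scanned by the Spectrum fleet, so large historical scans do not consume the cluster's own compute.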
https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html