Friday, 26 April 2019

Tuning the Reporting Layer using Redshift

Redshift, being a columnar database, is well suited for ETL loads and transformations, but extra care is needed for the presentation layer. Concurrent query execution on a Redshift cluster does not scale well even with larger cluster sizes; the maximum number of concurrent queries recommended by AWS is 15, regardless of the number of cluster nodes.

A reporting layer built on top of the Redshift database should therefore be strongly focussed on performance and query tuning.

The typical solution for performance tuning is to increase node capacity vertically, as is done in relational databases; however, in most distributed environments this does not hold true.

There are a couple of alternative methods available to tackle this problem. One is to create a single table with all the metrics required for reporting and present the data from that table; this reporting table is simply the final presentation-layer table.
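As an illustration of this approach, the sketch below builds one pre-aggregated reporting table with a CREATE TABLE AS statement executed from Python. The table and column names (daily_sales_report, fact_sales, dim_product) and the connection details are placeholders, not part of any particular project.

    import psycopg2  # any PostgreSQL-compatible driver can connect to Redshift

    # CTAS that rolls the metrics needed by the reports into one table
    BUILD_REPORT_TABLE = """
        CREATE TABLE daily_sales_report AS
        SELECT d.product_name,
               f.order_date,
               SUM(f.sales_amount)        AS total_sales,
               COUNT(DISTINCT f.order_id) AS order_count
        FROM   fact_sales f
        JOIN   dim_product d ON d.product_key = f.product_key
        GROUP  BY d.product_name, f.order_date;
    """

    # the connection context manager commits the transaction on exit
    with psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                          port=5439, dbname="analytics",
                          user="etl_user", password="***") as conn:
        with conn.cursor() as cur:
            cur.execute("DROP TABLE IF EXISTS daily_sales_report;")
            cur.execute(BUILD_REPORT_TABLE)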


Another effective way to tune queries in AWS Redshift is to use WLM (Workload Management) queues to allocate resources for query execution.


Before attempting to tune queries, the developer should make sure the points below are satisfied, since query performance differs significantly based on these parameters.

The document below from Amazon gives an overview of distribution keys, sort keys and column encoding.
https://docs.aws.amazon.com/redshift/latest/dg/t_Creating_tables.html


1. Distribution Key - Choose a proper distribution key for each table so that data retrieval is faster
2. Sort Key - Choose sort keys appropriately so that data is sorted properly in the database
3. Encoding - Encoding ensures the database space is utilized efficiently
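A minimal DDL sketch covering these three points follows. The fact_sales table, its columns, the chosen keys and the zstd encoding are all assumptions for illustration; the right choices depend on your own join and filter patterns. The statement can be executed with the same connection pattern as the earlier sketch.

    # Illustrative Redshift DDL: distribution key, compound sort key and
    # column encodings, kept as a plain SQL string.
    CREATE_FACT_SALES = """
        CREATE TABLE fact_sales (
            order_id      BIGINT         ENCODE zstd,
            product_key   INTEGER        ENCODE zstd,
            customer_key  INTEGER        ENCODE zstd,
            order_date    DATE           ENCODE raw,   -- leading sort key column left uncompressed
            sales_amount  DECIMAL(18,2)  ENCODE zstd
        )
        DISTSTYLE KEY
        DISTKEY (product_key)            -- column used most often in joins
        COMPOUND SORTKEY (order_date);   -- reports typically filter on date ranges
    """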


Configuring the WLM queue is a four-step process:

1. Define new user groups
2. Define new queues along with their workloads and the memory allocation for each queue
3. Add users to the groups
4. Assign rules to the queues along with the memory utilization

First, define the parameter group for the cluster and give it a name; the WLM queues are configured inside this parameter group, and the same parameter group can be shared by multiple clusters.
Refer to the Redshift documentation on WLM queue hopping to tune the queries - https://docs.aws.amazon.com/redshift/latest/dg/wlm-queue-hopping.html
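A hedged boto3 sketch of steps 2 and 4 is shown below. The parameter group name, the user group names and the concurrency/memory split are assumptions; the groups and users themselves (steps 1 and 3) would be created separately in the database with CREATE GROUP / ALTER GROUP statements.

    import json
    import boto3

    # one queue per user group plus the default queue; memory percentages total 100
    wlm_config = [
        {"user_group": ["etl_users"],    "query_concurrency": 3, "memory_percent_to_use": 40},
        {"user_group": ["report_users"], "query_concurrency": 8, "memory_percent_to_use": 50},
        {"query_concurrency": 2, "memory_percent_to_use": 10},   # default queue
    ]

    redshift = boto3.client("redshift")
    redshift.modify_cluster_parameter_group(
        ParameterGroupName="reporting-params",   # parameter group attached to the cluster
        Parameters=[{
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        }],
    )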

Table hygiene is another important step in optimizing the Redshift cluster.

Regularly running VACUUM and ANALYZE keeps the table statistics up to date, which is an important factor in getting better query performance.
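A small maintenance sketch follows, reusing the same placeholder connection details as the earlier example; the table names are hypothetical. Note that VACUUM cannot run inside a transaction block, so autocommit is enabled first.

    import psycopg2

    conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                            port=5439, dbname="analytics",
                            user="etl_user", password="***")
    conn.autocommit = True                         # VACUUM must run outside a transaction
    with conn.cursor() as cur:
        for table in ("fact_sales", "dim_product"):
            cur.execute(f"VACUUM FULL {table};")   # reclaim space and re-sort rows
            cur.execute(f"ANALYZE {table};")       # refresh the table statistics
    conn.close()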

 Query execution can be made faster using alternative methods like Athena and Spectrum.

Using Athena is another alternative method to achieve performance gains for ad-hoc queries. Athena is serverless, and query execution can be far faster than on fixed cluster nodes.

Data can be kept in the S3 layer in Parquet format, and query execution with Athena will be faster.

https://aws.amazon.com/athena/faqs/
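A hedged example of submitting an ad-hoc query through Athena with boto3 is shown below; the Glue database, table name and S3 result location are placeholders.

    import boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString="SELECT order_date, SUM(sales_amount) AS total_sales "
                    "FROM sales_parquet GROUP BY order_date",
        QueryExecutionContext={"Database": "analytics_lake"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/adhoc/"},
    )
    # poll get_query_execution() with this id until the query succeeds
    print(response["QueryExecutionId"])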

Spectrum - Redshift Spectrum extends query execution across both the cluster's tables and files in S3.
Historical data can be kept in files on S3 and current data in the Redshift cluster, and querying both together provides better overall query execution.

https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html
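The sketch below illustrates that pattern, assuming a Glue catalog database, an IAM role and an existing external (Parquet) table for the historical data; all names and the role ARN are placeholders.

    # External schema pointing at the data catalog, plus a view that unions the
    # historical data in S3 with the current data held in the Redshift table.
    SPECTRUM_SETUP = """
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
        FROM DATA CATALOG DATABASE 'analytics_lake'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
        CREATE EXTERNAL DATABASE IF NOT EXISTS;

        CREATE VIEW sales_all AS
        SELECT * FROM fact_sales                  -- current data in Redshift
        UNION ALL
        SELECT * FROM spectrum.fact_sales_hist    -- historical data in S3 (Parquet)
        WITH NO SCHEMA BINDING;
    """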






Tuesday, 22 January 2019

Data Modelling in Big Data World


Data Modelling -

Data modelling is more of a business-driven approach than a purely technical exercise.
A data modeller should not only have sound knowledge of the tools and technology but should also understand the business process deeply.
A bus matrix of the business processes helps define the entities and their relationships appropriately.

Bus Matrix Diagram

We hear from many people that data modelling is obsolete and no longer required in the world of big data processing. However, data modelling is still relevant in the big data world. Processing huge volumes of data has become easier thanks to new technologies and tools, but this does not diminish the importance of data modelling. The structure of the database design has a far greater impact on data storage and on the long-term accessibility of the data.

Data modelling is part of all the different phases of a data management system. Data management systems generally consist of four stages of data processing:
  1. Ingest 
  2. Storage 
  3. Transform 
  4. Presentation
The ideal solution for any project is an explicit separation between these layers with a loosely coupled architecture. Good design provides a robust methodology to handle these different stages of data management.

Data modelling should be kept in mind while designing all these layers. AWS has many different services which help in each phase of the data management cycle. S3, with its petabyte-scale storage, is the best fit for the data ingestion layer.

AWS Redshift provides the much-needed flexibility for the transformation and processing engine.
The AWS QuickSight service can be used for the presentation layer.

A report-driven approach to data modelling gives an effective way to justify the data needed from the OLTP/source systems. It helps narrow down the use cases to be served from the data lake. Table design and reporting-layer design become much easier with this bottom-up approach, which is the design method advocated by Kimball.

 Data modelling involves the following steps
  1. Source System Analysis
  2. Defining the process flow
  3. Defining the ER model - Logical Data model/Dimensional Model 
  4. Review the ER model/Dimensional Model with SME to understand if there are any gaps in the understanding
  5. Defining the Physical Data model and optimizing the model for the presentation layer
  1. Source System Analysis
  • Functional Knowledge of Domain
  • System technical knowledge
  • Access to the source system
  • Data extraction methodologies - identify the optimal method of extracting the data
  • Source system documentation - Knowledge transfer along with reverse KT
  • Knowing the business
  • Knowing the business process flow
  • Understand key terminology and Master entities from heterogeneous source systems
  • Determine the Golden source for the master entities
  • Sessions with SME to go over the process and data importance
  • Understand the end user need with respect to data 
  • How end user is using the data
  • Understand existing data irregularities
  • Keep moving the available data to the next steps, and stay agile and flexible to adapt to changes as more is understood about the systems and the data

      2. Defining the Process flow: A process could span multiple business units. The process flow could be sales order creation, invoice generation, item fulfilment, payment, etc. Each of these processes may involve data from one or more business units. For example, sales order creation starts from lead generation through opportunity closed/won, which is recorded in the CRM application, while the actual sales order is created in the ERP system. Thus, when data modelling is done for the DWH, it should be focussed on the process rather than a single system; invariably, data from multiple systems is combined to generate the required reports from the DWH.

    3. a. Defining the ER model for the process - List out all the entities involved in the process and draw the relationships between those entities. The relationship should clearly state the cardinality between the entities and whether the relationship is strong or weak.

    3. b. Defining the Dimensional model for the process - The bus matrix gives the data modeller a picture of which dimension and fact tables to consider for dimensional modelling.
The contexts of analysis become the dimensions, and the transactional data becomes the facts. Once the model is ready, it is reviewed with the customer to finalize it.
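As a hedged illustration, the DDL sketch below shows how a sales process from the bus matrix might translate into one fact table surrounded by dimension tables; all table names and columns are illustrative only.

    # Star schema sketch: contexts of analysis become dimensions, and the
    # transactional measures become the fact table.
    STAR_SCHEMA = """
        CREATE TABLE dim_date     (date_key INT PRIMARY KEY, calendar_date DATE, fiscal_quarter VARCHAR(8));
        CREATE TABLE dim_product  (product_key INT PRIMARY KEY, product_name VARCHAR(100), category VARCHAR(50));
        CREATE TABLE dim_customer (customer_key INT PRIMARY KEY, customer_name VARCHAR(100), region VARCHAR(50));

        CREATE TABLE fact_sales (
            date_key      INT REFERENCES dim_date(date_key),
            product_key   INT REFERENCES dim_product(product_key),
            customer_key  INT REFERENCES dim_customer(customer_key),
            sales_amount  DECIMAL(18,2),
            quantity      INT
        );
    """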

    4. Review the ER model with the Business SME: The next step after the ER/dimensional model is to review it with the business SME to identify any gaps in the model. The model should explain all the relationships unambiguously to the users and should answer all the questions the business is trying to answer.

  5. Defining the physical model and optimizing the data model for the presentation layer: The physical model may not be an exact replica of the logical ER model. A few entities might be merged to create a common table for optimization and consolidation. This process should ensure data integrity is maintained and preserved as we move forward. A few aggregated tables may also be created for the presentation layer for better reporting ability.

Documentation of the data model is as important as building it. It helps the support team understand the data model.
Versioning the data model along with its documentation lets the team understand what changes were made in each release.

Overall, data modelling is still relevant in the big data world, and there is plenty of scope for good data modellers.



References - The data warehouse bus matrix image is from Ralph Kimball's website. Much of the content is a summary of my learnings from various books and websites.