Monday, 26 November 2018

ETL and Redshift Data Loads for DWH

Cloud Data Warehouse Implementation using Redshift

 ETL Design for the Landing, Staging, DW and Reporting layers.

Landing Area - S3 (Simple Storage Service) from AWS will be used for the landing area load. A separate bucket can be created for each source system to handle its files.
Data ingestion into the S3 layer from the source systems has to be dealt with separately. If we are pulling data from relational databases, CSV or Excel files, or other flat files, then open-source ETL tools like Pentaho can be used for orchestration.

Daily incremental data is copied directly into Redshift tables using the COPY command from S3.
The COPY command from S3 to Redshift is another great tool for huge volumes of data.
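As an illustration, a daily incremental COPY might look like the sketch below; the landing table, bucket path and IAM role are hypothetical placeholders:

-- Load one day's incremental files from S3 into a landing table
COPY landing.orders_inc
FROM 's3://my-landing-bucket/orders/2018-11-26/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV
IGNOREHEADER 1;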

If there are specific requirements for archiving the files, we can use a separate bucket for that purpose. S3 versioning and access restrictions provide greater flexibility for the implementation.

With the S3 versioning capability we can overwrite existing files while still keeping a copy of each file for future reference or audit compliance.



Transformation/Processed Layer-
Data transformation rules are applied to the landing tables and the processed data is loaded into interim tables.
Generally, the transformation should be performed as a multi-step operation within a transaction, wrapped in BEGIN and END commands.
Create as many temp tables as required for the transformation; they help break the operation into manageable steps, as sketched below.
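A minimal sketch of such a multi-step transformation, with hypothetical landing and processed tables, might look like this:

BEGIN;

-- Temp table holding cleansed landing data
CREATE TEMP TABLE temp_orders_clean AS
SELECT order_id,
       UPPER(TRIM(country_code)) AS country_code,
       CAST(order_ts AS TIMESTAMP) AS order_ts,
       amount
FROM landing.orders_inc
WHERE order_id IS NOT NULL;

-- Apply the remaining business rules and load the interim/processed table
INSERT INTO processed.orders
SELECT order_id, country_code, order_ts, amount
FROM temp_orders_clean;

END;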

Data Mart/DW-

Data from the processed layer is loaded into the actual mart tables using Redshift commands.

Below are a few different DW table loading methodologies-

Dimensional Data Load 

SCD-1 

/*Step 1 - Unload the existing dimension data to S3 - temp_existing_records_dim.csv */
/*Step 2 - Delete the data from the dimension table whose keys match the incremental table */
/*Step 3 - Copy the unloaded data from S3 back into the actual dimension table */
/*Step 4 - Insert all the data from the incremental table into the dimension table */
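A minimal sketch of Steps 1, 2 and 4 follows; the copy-back in Step 3 uses the same COPY-from-S3 pattern. The dimension table dim_customer, the incremental table stg_customer_inc, the bucket and the IAM role are all hypothetical placeholders:

-- Step 1: archive the current dimension contents to S3
UNLOAD ('SELECT * FROM dim_customer')
TO 's3://my-archive-bucket/dim_customer/temp_existing_records_dim_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole';

BEGIN;

-- Step 2: delete dimension rows whose keys arrive in the incremental load
DELETE FROM dim_customer
USING stg_customer_inc s
WHERE dim_customer.customer_key = s.customer_key;

-- Step 4: insert the latest values from the incremental load
INSERT INTO dim_customer
SELECT * FROM stg_customer_inc;

END;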



SCD-2 

/*Step 1 - Copy the latest dimension records from the target dimension table into a temp table temp_dim_copy */
/*Step 2 - From the incremental data, find the existing records that have a new value compared to the dimension table and load them into temp_changed_existing_record with an effective end date for the existing records. This is an inner join between the dimension and staging tables */
/*Step 3 - Copy all the incremental data into the temp_new_changed_record table. This will contain the new and modified record set with a new effective date. This is a left join between the incremental table and the dimension table */
/*Step 4 - Unload all the data from temp_changed_existing_record to S3, ready to mark the deletions from the dimension table - temp_changed_existing_record.csv */
/*Step 5 - Delete the matching records from the dimension table by joining with temp_changed_existing_record on the key */
/*Step 6 - Copy the unloaded file temp_changed_existing_record.csv back into the dimension table; these rows now carry the updated end date for the previously existing records */
/*Step 7 - Insert both the new and the changed records into the dimension table from temp_new_changed_record */
/*Step 8 - Drop the temp tables */
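A condensed sketch of the same flow is shown below, assuming a hypothetical dim_customer with effective start/end date columns and a stg_customer_inc incremental table, and collapsing the S3 unload/copy-back round trip (Steps 4 and 6) into direct inserts from the temp table:

BEGIN;

-- Step 2: existing current rows whose tracked attribute changed, rewritten with an end date
CREATE TEMP TABLE temp_changed_existing_record AS
SELECT d.customer_key,
       d.customer_name,
       d.effective_start_date,
       CURRENT_DATE - 1 AS effective_end_date
FROM dim_customer d
JOIN stg_customer_inc s ON d.customer_key = s.customer_key
WHERE d.effective_end_date = DATE '9999-12-31'
  AND d.customer_name <> s.customer_name;

-- Step 3: new and changed rows from the increment with a fresh effective date
CREATE TEMP TABLE temp_new_changed_record AS
SELECT s.customer_key,
       s.customer_name,
       CURRENT_DATE      AS effective_start_date,
       DATE '9999-12-31' AS effective_end_date
FROM stg_customer_inc s
LEFT JOIN dim_customer d
  ON s.customer_key = d.customer_key
 AND d.effective_end_date = DATE '9999-12-31'
WHERE d.customer_key IS NULL
   OR d.customer_name <> s.customer_name;

-- Step 5: remove the old current versions of the changed rows
DELETE FROM dim_customer
USING temp_changed_existing_record t
WHERE dim_customer.customer_key = t.customer_key
  AND dim_customer.effective_end_date = DATE '9999-12-31';

-- Steps 6 and 7: re-insert the expired versions and add the new current versions
INSERT INTO dim_customer SELECT * FROM temp_changed_existing_record;
INSERT INTO dim_customer SELECT * FROM temp_new_changed_record;

END;
-- Step 8: temp tables are dropped automatically when the session ends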


Fact Table Load-

There are three different modes of fact table loads.

Transaction Fact - Insert only
This fact table load is simple: it is a plain insert of the records coming from the source table.
Data is loaded from the landing table to S3 and then copied into the actual table.

Accumulative Fact - Insert and Update
An accumulative fact is a fact in which data is updated for existing business keys, new records are inserted, and old records are copied back from the previous snapshot if they have not changed.

Snapshot Fact Table
The incoming records are loaded as the snapshot for a given day, giving the status of the fact for that day.



Loading strategy for the three different facts-

Transaction Fact -

1. Data from the landing table is simply copied to the target table with an INSERT INTO command
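For example, with a hypothetical landing table and fact table:

-- Plain insert of the incremental records into the transaction fact
INSERT INTO fact_sales
SELECT * FROM landing.sales_inc;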

Accumulative Fact

1. Data from the target table is loaded into a temp_target table
2. Data from the landing table, i.e. the incremental transaction data for the given day, is loaded into a temp_increment table
3. Create another temp_table1 by joining temp_target and the incremental table to identify the matching data in the current day's load
4. Unload all the records from temp_table1 to S3
5. Delete the matching records from the fact table by joining it with temp_table1
6. Copy the data back from S3 to restore the old unchanged records
7. Insert the new and changed records from the incremental temp table, as sketched below
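A condensed sketch of the delete-and-insert core of this flow (steps 5 and 7, joining directly to the landing table for brevity); fact_orders and landing.orders_inc are hypothetical names keyed on order_id, and the temp-table and S3 archive steps follow the same pattern as the SCD-1 sketch earlier:

BEGIN;

-- Remove fact rows whose business keys arrive in today's increment
DELETE FROM fact_orders
USING landing.orders_inc i
WHERE fact_orders.order_id = i.order_id;

-- Insert the new and changed rows from the increment
INSERT INTO fact_orders
SELECT * FROM landing.orders_inc;

END;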


Snapshot Fact Table Load

1. This fact stores a daily snapshot of the complete data set.
2. Load today's data into the table.
3. Copy any records that are missing from today's load by left joining the current data to the existing fact table, as sketched below.
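A small sketch with a hypothetical fact_inventory_snapshot (snapshot_date, item_id, quantity) and landing.inventory_inc table:

-- Step 2: load today's records as the snapshot for the day
INSERT INTO fact_inventory_snapshot (snapshot_date, item_id, quantity)
SELECT CURRENT_DATE, i.item_id, i.quantity
FROM landing.inventory_inc i;

-- Step 3: carry forward yesterday's rows that are missing from today's load
INSERT INTO fact_inventory_snapshot (snapshot_date, item_id, quantity)
SELECT CURRENT_DATE, f.item_id, f.quantity
FROM fact_inventory_snapshot f
LEFT JOIN landing.inventory_inc i ON f.item_id = i.item_id
WHERE f.snapshot_date = CURRENT_DATE - 1
  AND i.item_id IS NULL;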



Tuesday, 18 September 2018

Redshift for Beginner - Part -2

Designing the tables in Redshift

General table design principles of data modelling have to be followed to define the star schema/snowflake model. Redshift is better suited to a star schema, which is also easier for reporting and analytics. 

DBSchema is one of the data modelling tools that can be used for Redshift database design.

https://www.dbschema.com/index.html 

The general design of the Data warehouse is as shown below. 





A general rule of thumb to make ETL loads faster in the relational database world is to create the landing area tables without any keys defined. However, in the case of Redshift, we can use S3 itself as the landing area and avoid duplicating the data into the Redshift cluster. The virtually unlimited storage provided by S3 is a boon for the data warehouse developer. Using S3 this way not only saves a lot of space on the Redshift cluster, it also saves money by not using the comparatively expensive Redshift storage for landing data. 

Tools like Redshift Spectrum and Athena can query the data directly on S3, which is a further advantage. 
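As a hedged illustration, the landing files on S3 can be exposed to Redshift through a Spectrum external schema and table; the schema, catalog database, bucket and IAM role names below are hypothetical:

CREATE EXTERNAL SCHEMA landing_ext
FROM DATA CATALOG
DATABASE 'landing_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE landing_ext.orders (
    order_id BIGINT,
    order_ts TIMESTAMP,
    amount   DECIMAL(12,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-landing-bucket/orders/';

-- The external table can now be queried and joined like any local table
SELECT COUNT(*) FROM landing_ext.orders;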

Table design on Redshift depends on the following factors. 
  1. The uniqueness of the data 
  2. The volume of the data 
  3. The data most frequently queried by end users for analytical purposes
Distribution keys, sort keys and compression logic are the three major design factors when building efficient tables in a Redshift database. The Redshift documentation has an in-depth explanation of distribution keys, sort keys and compression logic. Tables must follow these design principles to get the best query performance.

The AWS Redshift GitHub page below provides a lot of utilities that developers can use to fix and design tables efficiently.

Link for the GitHub Redshift code repository, maintained by the AWS team.
https://github.com/awslabs/amazon-redshift-utils



  • Select Sort keys.


Redshift stores the data on disk. Sort keys help store the data in sorted order, and the sorted data improves I/O performance and query throughput. Below are the different scenarios to consider when choosing sort keys, followed by a small example.
  1. If the most recent data is queried most, specify the timestamp column as the leading sort key 
  2. If you frequently filter on one column, specify that column as the sort key
  3. If you frequently join on a dimension table, specify the join column as the sort key    
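A hedged example: a fact table where the most recent data is queried most often, so the timestamp column leads a compound sort key (all names hypothetical):

CREATE TABLE fact_web_events (
    event_ts    TIMESTAMP,
    customer_id BIGINT,
    page_id     INT,
    duration_ms INT
)
COMPOUND SORTKEY (event_ts, customer_id);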


  • Select Distribution Keys 


The leader node manages query execution. Depending on the data distribution and the distribution key, the leader node sends data to each slice of the compute nodes. The rules below need to be kept in mind while selecting the distribution key.
  1. Distribute the workload uniformly among the nodes in the cluster. Uneven distribution, or data distribution skew, forces some nodes to do more work than others, which impairs query performance
  2. Minimize data movement during query execution. If the rows that participate in joins or aggregates are already collocated on the nodes with their joining rows in other tables, the optimizer does not need to redistribute as much data during query execution


There are 3 types of distribution styles.

  1. Key - Rows are distributed according to the values in the defined key column. If 2 tables are involved in a join condition, defining the distribution key on the joining column ensures the matching rows of both tables are stored in the same physical location, improving query performance. 
  2. Even - The data is spread in a round-robin fashion across the different nodes in the cluster
  3. All - A full copy of the table data is kept on every node, which helps query execution for small, frequently joined tables. 


Suggested methods to choose distribution keys

  1. Define primary keys and foreign keys on the tables. Amazon Redshift doesn't enforce any constraints on the database; however, this enables the leader node to choose an optimal query execution plan. 
  2. Distribute the fact table and the largest dimension on the common joining "KEY" with the KEY distribution style to get better performance, as sketched below
  3. Give the other, smaller dimensions the "ALL" distribution style
  4. If a table is not involved in joins or it is denormalized, use "EVEN" distribution
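A hedged sketch of these guidelines, with hypothetical tables: the fact table and its largest dimension share a KEY distribution on the joining column, while a small dimension uses ALL:

CREATE TABLE dim_customer (
    customer_id   BIGINT,
    customer_name VARCHAR(200)
)
DISTSTYLE KEY
DISTKEY (customer_id);

CREATE TABLE dim_date (
    date_key   DATE,
    month_name VARCHAR(20)
)
DISTSTYLE ALL;

CREATE TABLE fact_sales (
    sale_ts     TIMESTAMP,
    customer_id BIGINT,
    date_key    DATE,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_ts);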



  • Select Compression Methods


Compression is a column-level feature that allows data to be stored on disk more efficiently, which also makes operations on that data more efficient. Compression can be applied manually to the table or automatically using the COPY command. AWS recommends using the COPY command to apply compression. 

Automatic compression using the COPY command can only be applied to an empty table. We cannot change the compression encoding of a column once it is defined; however, you can define a new encoding by adding a new column using an ALTER TABLE command.

ANALYZE COMPRESSION produces a report on the analyzed tables that suggests an encoding for each column, including an estimate of the potential reduction in disk space. ANALYZE COMPRESSION takes an exclusive lock on the table, so all reads and writes are blocked while the command runs.
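As a hedged example, the first load into an empty table can let COPY choose encodings automatically, and ANALYZE COMPRESSION can be run later to review suggestions (table, bucket and role names are hypothetical):

-- Initial load into an empty table; COMPUPDATE ON lets COPY apply automatic compression
COPY fact_sales
FROM 's3://my-landing-bucket/sales/initial/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV
COMPUPDATE ON;

-- Report suggested encodings; this takes an exclusive lock while it samples the table
ANALYZE COMPRESSION fact_sales;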


Encoding type | Keyword in CREATE TABLE and ALTER TABLE | Data types | Notes
Raw (no compression) | RAW | All | Default for columns defined as sort keys and for BOOLEAN, REAL and DOUBLE PRECISION columns
Byte dictionary | BYTEDICT | All except BOOLEAN | Best for columns with few unique values, such as gender, country or other lookup types
Delta | DELTA; DELTA32K | DELTA: SMALLINT, INT, BIGINT, DATE, TIMESTAMP, DECIMAL; DELTA32K: INT, BIGINT, DATE, TIMESTAMP, DECIMAL | Very useful for datetime columns. DELTA stores 1-byte deltas in the range -127 to +127; DELTA32K stores 2-byte deltas in the range -32K to +32K
LZO | LZO | All except BOOLEAN, REAL and DOUBLE PRECISION | Best for VARCHAR columns
Mostly | MOSTLY8; MOSTLY16; MOSTLY32 | MOSTLY8: SMALLINT, INT, BIGINT, DECIMAL; MOSTLY16: INT, BIGINT, DECIMAL; MOSTLY32: BIGINT, DECIMAL |
Run-length | RUNLENGTH | All |
Text | TEXT255; TEXT32K | VARCHAR only | When a column repeats the same words many times
Zstandard | ZSTD | All |









Wednesday, 12 September 2018

Data Virtualization

Data Virtualization
Definition
Data Virtualization is the art of managing data through a virtual data layer between the sources and the end points of the data, i.e. reports or dashboards.

Data virtualization is different from machine virtualization, and there is often confusion between the two. Machine virtualization involves sharing resources such as memory and CPU across many processes. Data virtualization involves data encapsulation, abstraction and data federation for data management. 
There are many vendors offering data virtualization, and the technique has been available in the industry for many years. Cisco Data Virtualization is one of the leading tools, and Denodo is another leading tool that provides data virtualization capabilities. 
How Data Virtualization Works:-
A data virtualization tool has a data management layer similar to a data warehouse project. However, a significant difference comes into play with data duplication and storage. In the typical data warehouse world, data is replicated at different stages: the landing area, where the source data is copied; the staging area, where data transformations are applied and the results stored; and finally the data mart or data warehouse, loaded as per the data model. This replication of the data takes a significant amount of ETL work, which in turn requires a lot of resources and money. 

Data Virtualization technology removes this redundant exercise of copying the data to different layers and reduces the ETL work.

Data virtualization uses virtual tables/virtual views created on top of the source data. The virtual table pulls the data from the real source whenever it is required for processing, thus avoiding the replication process.

Data virtualization is built on a metadata framework rather than on the creation of actual database objects. Whenever data is needed from the actual objects, the virtual table pulls it using the metadata associated with the real tables. The metadata is stored in the repository of the data virtualization server. It works similarly to database views; however, data virtualization tools provide methods that make query retrieval faster.

This technology has the ability to combine data from different sources: RDBMSs, Hadoop-based clusters, websites, web servers, log data and CRM data. Most data virtualization tools have built-in drivers to connect to this variety of data sources. If a default driver is not provided, a custom development framework helps build the drivers required for the connection.

Different flavors of Data Virtualization:-

Method 1 - The data sources are connected directly to the data virtualization software, and the BI applications access the data from the virtualization server.

Method 2 - Existing data warehouses/marts are combined using the data virtualization software, which acts as a collating layer combining data from two different warehouses or marts.

Method 3 - The data sources are connected directly to the data virtualization software, and the data warehouse gets its data from the virtualization layer, which acts as a single source for all BI reporting and dashboard needs.



Performance of Queries:-  

Performance issues are considered a major roadblock for any data warehousing or BI solution. Since data virtualization adds a layer between the sources and the targets, it raises a number of questions about performance. The performance of data retrieval is enhanced using caches and other optimization techniques within the data virtualization servers. The virtualization tools rewrite queries in their simplest forms to fetch the required data faster.

Different methods available for query optimization with the leading data virtualization tools are

1. Query Substitution

2. SQL Push down 

3. Distributed Joins 

4. Ship Joins 

5. SQL Override

6. Cache Refresh 

7. Cache Replication

Summary :-

1. Data Virtualization will not replace ETL. However, it helps ETL project execution by reducing the time it takes to complete

2. Data Virtualization helps ETL projects achieve ROI effectively

3. Data Virtualization helps the data management layer combine heterogeneous data sources

Data virtualization is a boon for organizations. It accelerates the time to market for users' change requests, and project development and maintenance costs reduce significantly. Data virtualization provides a platform to combine various data sources and helps build a unified data management platform that is scalable, economical, flexible and efficient.

References -
1. Data Virtualization for Business Intelligence Systems by Rick F. van der Lans

Monday, 10 September 2018

Redshift for Beginner -Part -1



Redshift is a relational columnar database provided by Amazon. It is a pretty powerful database with a lot of cool features. Redshift is an extension of PostgreSQL. The documentation for Redshift is very good and provides all the details a new developer needs to learn the product with ease.

AWS Free Tier provides a two-month free trial of the Redshift database.
https://aws.amazon.com/redshift/resources/. This amounts to 750 hrs of free usage.

The differences between the Amazon Redshift database and PostgreSQL are summarized here.

https://docs.aws.amazon.com/redshift/latest/dg/c_redshift-and-PostgreSQL.html


Redshift has a leader node that drives query execution along with the compute nodes. The power of the compute nodes makes Redshift a distributed system. We can scale the nodes to meet the required level of query performance.

The application sends a query to the leader node; the leader node parses the query, generates the execution plan and passes it to the compute nodes in parallel. Each compute node, upon receiving the execution plan, executes its part of the query and generates results. The intermediate results from the compute nodes are aggregated by the leader node and returned to the application for consumption.
The leader node coordinates query execution and query compilation with each of the compute nodes. Depending on the distribution keys in the table design, the data gets distributed to the compute nodes for the calculation.

Redshift clusters are categorised broadly into dense storage (ds) and dense compute (dc).
ds2 nodes are designed for large-volume data workloads using hard drives; the dc family is designed for intensive calculation and query performance, but provides much less storage.

Dense compute clusters come in 2 generations, dc1 and dc2. AWS recently released the dc2 family at the same cost as the dc1 family, and dc2 has better performance than dc1.

The link below lists the different cluster sizes AWS provides for the user to choose from.
https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html

Dense Storage Node Types


Node Size | vCPU | ECU | RAM (GiB) | Slices Per Node | Storage Per Node | Node Range | Total Capacity
ds2.xlarge | 4 | 13 | 31 | 2 | 2 TB HDD | 1–32 | 64 TB
ds2.8xlarge | 36 | 119 | 244 | 16 | 16 TB HDD | 2–128 | 2 PB


Dense Compute Node Types

Node Size | vCPU | ECU | RAM (GiB) | Slices Per Node | Storage Per Node | Node Range | Total Capacity
dc1.large | 2 | 7 | 15 | 2 | 160 GB SSD | 1–32 | 5.12 TB
dc1.8xlarge | 32 | 104 | 244 | 32 | 2.56 TB SSD | 2–128 | 326 TB
dc2.large | 2 | 7 | 15.25 | 2 | 160 GB NVMe-SSD | 1–32 | 5.12 TB
dc2.8xlarge | 32 | 99 | 244 | 16 | 2.56 TB NVMe-SSD | 2–128 | 326 TB

Node selection generally depends upon 3 factors
  1. The amount of data being loaded into Redshift
  2. The response time expected by applications for requests
  3. The complexity of the queries executed 


Data ingestion into a Redshift cluster can be done using multiple methods.

The most efficient method of data ingestion into the Redshift database is copying the data via S3.
Data loads to Redshift through S3 take advantage of the MPP architecture, so the loads are efficient.

Table design and performance tuning for the Redshift database will be covered in the next post.