Monday, 26 November 2018

ETL and Redshift Data Loads for DWH

Cloud Data Warehouse Implementation using Redshift

ETL Design for the Landing, Staging, DW and Reporting layers.

Landing Area (S3) - Amazon S3 (Simple Storage Service) will be used for the landing area load. We can create a separate bucket for each source system to handle its files.
Data ingestion into the S3 layer from the source systems has to be handled separately. If we are pulling data from relational databases, CSV or Excel files, or other flat files, an open-source ETL tool like Pentaho can be used for orchestration.

Daily incremental data is copied directly into Redshift tables using the COPY command from S3.
The COPY command from S3 to Redshift is another great tool, well suited for loading huge volumes of data.
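
A minimal sketch of loading a daily incremental file with COPY (the bucket path, table name and IAM role below are placeholders):

COPY landing.orders_incremental
FROM 's3://dwh-landing-orders/incremental/orders.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
CSV
IGNOREHEADER 1
DATEFORMAT 'auto'
TIMEFORMAT 'auto';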

If there is a specific requirement to archive the files, we can consider using a separate bucket for that purpose. S3 versioning and access restrictions provide greater flexibility for the implementation.

With S3 versioning enabled we can overwrite the existing files while still keeping a copy of each file for future reference or audit compliance.



Transformation/Processed Layer-
Data transformation rules are applied to the landing tables and the processed data is loaded into interim tables.
Generally, the transformation should be performed as a multi-step operation within a single transaction, wrapped with BEGIN and END commands.
Create as many temp tables as required for the transformation; they simplify each step of the operation, as sketched below.
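
A minimal sketch of a multi-step transformation inside one transaction (the landing, temp and processed table names and columns are placeholders):

BEGIN;

/* Clean and type-cast the raw landing data into a temp table */
CREATE TEMP TABLE temp_orders_clean AS
SELECT order_id,
       customer_id,
       CAST(order_amount AS DECIMAL(12,2)) AS order_amount,
       TRUNC(order_ts) AS order_date
FROM   landing.orders_incremental
WHERE  order_id IS NOT NULL;

/* Load the transformed data into the interim/processed table */
INSERT INTO processed.orders
SELECT * FROM temp_orders_clean;

END;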

Data Mart/DW-

Data from the processed layer is loaded into the actual mart tables using Redshift commands.

Below are a few different DW table loading methodologies-

Dimensional Data Load 

SCD-1 

/* Step 1 - Unload the existing dimension data to S3 - temp_existing_records_dim.csv */
/* Step 2 - Delete the records from the dimension table whose key matches the incremental table */
/* Step 3 - Copy the unloaded data from S3 back into the actual dimension table */
/* Step 4 - Insert all the data from the incremental table into the dimension table */
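
A minimal sketch of the delete-and-insert portion of the SCD-1 load (Steps 2 and 4); the dimension table, incremental table and key names are placeholders:

BEGIN;

/* Step 2 - remove dimension rows whose business key appears in the incremental load */
DELETE FROM dw.dim_customer
USING stg.dim_customer_incremental inc
WHERE dw.dim_customer.customer_key = inc.customer_key;

/* Step 4 - insert the latest version of every record from the incremental table */
INSERT INTO dw.dim_customer
SELECT * FROM stg.dim_customer_incremental;

END;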



SCD-2

/* Step 1 - Copy the latest dimension records from the target dimension table into a temp table temp_dim_copy */
/* Step 2 - Copy the existing records whose values have changed in the incremental data compared to the dimension table into temp_changed_existing_record, with an effective end date set for the existing records. This is an inner join between the dimension and the staging table */
/* Step 3 - Copy all of the incremental file data into the temp_new_changed_record table. This will contain the new and modified records with a new effective date. This is a left join between the incremental table and the dimension table */
/* Step 4 - Unload all the data from temp_changed_existing_record to S3 so the matching rows can be deleted from the dimension table - temp_changed_existing_record.csv */
/* Step 5 - Delete the matching records from the dimension table, joining with temp_changed_existing_record on the key */
/* Step 6 - Copy the unloaded file temp_changed_existing_record.csv back into the dimension table; these rows carry the updated end date for the records that existed earlier */
/* Step 7 - Insert both the new records and the changed records into the dimension table from temp_new_changed_record */
/* Step 8 - Drop the temp tables */
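
A minimal sketch of Steps 2 and 3, assuming a customer dimension with a single tracked attribute customer_name (all table and column names are placeholders):

/* Step 2 - existing dimension rows whose attributes changed, stamped with an effective end date */
CREATE TEMP TABLE temp_changed_existing_record AS
SELECT d.customer_sk,
       d.customer_key,
       d.customer_name,
       d.effective_start_date,
       CURRENT_DATE - 1 AS effective_end_date
FROM   dw.dim_customer d
JOIN   stg.dim_customer_incremental inc
       ON d.customer_key = inc.customer_key
WHERE  d.effective_end_date IS NULL
  AND  d.customer_name <> inc.customer_name;

/* Step 3 - new and changed records carrying a new effective start date */
CREATE TEMP TABLE temp_new_changed_record AS
SELECT inc.customer_key,
       inc.customer_name,
       CURRENT_DATE AS effective_start_date,
       CAST(NULL AS DATE) AS effective_end_date
FROM   stg.dim_customer_incremental inc
LEFT JOIN dw.dim_customer d
       ON inc.customer_key = d.customer_key
      AND d.effective_end_date IS NULL
WHERE  d.customer_key IS NULL
   OR  d.customer_name <> inc.customer_name;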


Fact Table Load-

There are three different modes of fact table loads.

Transaction Fact - Insert only
This fact table load is simple: it is a plain insert of the records coming from the source table.
Data is loaded from the landing table to S3 and then copied into the actual table.

Accumulating Fact - Insert and Update
An accumulating fact is a kind of fact in which data is updated for the existing business keys, new records are inserted, and old records are copied back from the previous snapshot if they have not changed.

Snapshot Fact Table
The incoming records are loaded as a snapshot for a given day, giving the status of the fact for that day.



Loading strategies for the three different facts-

Transaction Fact -

1. Data from the landing table is simply copied to the target table with an INSERT INTO command, as sketched below.
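
A minimal sketch of the transaction fact load (the fact and landing table names are placeholders):

INSERT INTO dw.fact_sales
SELECT *
FROM   landing.fact_sales_incremental;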

Accumulating Fact

1. Data from the target table is loaded into a temp_target table.
2. Data from the landing table, i.e. the incremental transaction data for the given day, is loaded into a temp_increment table.
3. Create another temp_table1 by joining temp_target and the incremental table to identify the matching data in the current day's load.
4. Unload all the records from temp_table1 to S3.
5. Delete the matching records from the fact table by joining it with temp_table1.
6. Copy the data back from S3 to restore the old unchanged records.
7. Insert the new and changed records from the incremental temp table (Steps 5 and 7 are sketched below).
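
A minimal sketch of Steps 5 and 7, assuming an order_key business key (the fact table and temp table names are placeholders):

BEGIN;

/* Step 5 - remove the fact rows that will be replaced by the current day's load */
DELETE FROM dw.fact_orders
USING temp_table1 t
WHERE dw.fact_orders.order_key = t.order_key;

/* Step 7 - insert the new and changed records from the incremental temp table */
INSERT INTO dw.fact_orders
SELECT * FROM temp_increment;

END;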


Snapshot Fact Table Load

1. This fact creates a daily snapshot of the complete data set.
2. Load today's data into the table.
3. Copy any records missed from today's load by left joining the current day's data to the existing fact table (Steps 2 and 3 are sketched below).
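
A minimal sketch of Steps 2 and 3 for an inventory snapshot (table and column names are placeholders):

/* Step 2 - load today's snapshot from the landing table */
INSERT INTO dw.fact_inventory_snapshot (item_key, quantity_on_hand, snapshot_date)
SELECT item_key,
       quantity_on_hand,
       CURRENT_DATE
FROM   landing.inventory_incremental;

/* Step 3 - carry forward yesterday's rows that are missing from today's load */
INSERT INTO dw.fact_inventory_snapshot (item_key, quantity_on_hand, snapshot_date)
SELECT prev.item_key,
       prev.quantity_on_hand,
       CURRENT_DATE
FROM   dw.fact_inventory_snapshot prev
LEFT JOIN landing.inventory_incremental inc
       ON prev.item_key = inc.item_key
WHERE  prev.snapshot_date = CURRENT_DATE - 1
  AND  inc.item_key IS NULL;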


