Designing the tables in Redshift
The general principles of data modelling still apply when defining a star or snowflake schema. Redshift is better suited to the star schema, which also makes reporting and analytics easier.
DBSchema is one of the data modelling tools that can be used for Redshift database design.
https://www.dbschema.com/index.html
The general design of the data warehouse is as shown below.
A general rule of thumb for making ETL loads faster in traditional relational databases is to keep landing-area tables free of any key definitions. In the case of Redshift, however, we can use S3 itself as the landing area and avoid duplicating the data into the Redshift cluster. The virtually unlimited storage S3 provides is a boon for the data warehouse developer: using S3 as the landing area not only saves a lot of space on the Redshift cluster, it also saves money by not spending expensive Redshift storage on landing data.
Tools like Amazon Redshift Spectrum and Athena can query the data directly in S3, which makes this approach even more advantageous, as the sketch below shows.
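For example, the S3 landing area can be exposed to Redshift through Spectrum as an external schema and table. The following is a minimal sketch; the schema, database, table, bucket, and IAM role names are all hypothetical:

```sql
-- Register the S3 landing area as an external schema backed by the
-- AWS Glue Data Catalog (database and role names are hypothetical).
CREATE EXTERNAL SCHEMA landing
FROM DATA CATALOG
DATABASE 'landing_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Expose raw CSV files in S3 as a queryable table; no data is copied
-- into the Redshift cluster.
CREATE EXTERNAL TABLE landing.orders_raw (
    order_id BIGINT,
    order_ts TIMESTAMP,
    amount   DECIMAL(12,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-landing-bucket/orders/';
```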
Table design on Redshift depends on the following factors:
- The uniqueness of the data
- The volume of the data
- How frequently end users query the data for analytical purposes
The distribution key, sort key, and compression encoding are the three major design factors in building an efficient table on the Redshift database. The Redshift documentation explains distribution keys, sort keys, and compression logic in depth. Tables must follow these design principles to get the best query performance.
The AWS Redshift GitHub repository below provides many utilities that developers can use to fix and design tables efficiently. It is maintained by the AWS team:
https://github.com/awslabs/amazon-redshift-utils
- Select Sort Keys
Redshift stores data on disk, and sort keys ensure it is stored in sorted order. Sorted data improves I/O performance and query throughput. The scenarios below guide the choice of sort keys; a sketch follows the list.
- If recent data is queried most often, specify the timestamp column as the leading sort key
- If queries frequently filter on one column, specify that column as the sort key
- If a table is frequently joined to a dimension table, specify the join column as the sort key
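As a minimal sketch of the first scenario (the table and column names are hypothetical), a compound sort key led by the timestamp column keeps the most recently queried data together on disk:

```sql
-- Recent data is queried most, so the timestamp column leads a
-- compound sort key.
CREATE TABLE sales_fact (
    sale_ts     TIMESTAMP,
    customer_id BIGINT,
    amount      DECIMAL(12,2)
)
COMPOUND SORTKEY (sale_ts, customer_id);
```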
- Select Distribution Keys
The leader node manages query execution and, depending on the distribution style and the distribution key, sends data to each slice of the compute nodes. The following rules need to be kept in mind while selecting the distribution key:
- To distribute the workload uniformly among the nodes in the cluster. Uneven distribution, or data distribution skew, forces some nodes to do more work than others, which impairs query performance
- To minimize data movement during query execution. If the rows that participate in joins or aggregates are already collocated on the nodes with their joining rows in other tables, the optimizer does not need to redistribute as much data during query execution
There are three distribution styles, illustrated in the sketch after this list:
- Key - Rows are distributed according to the values in the defined key column. If two tables share a join condition, defining the distribution key on the joining column ensures the data for both tables is stored in the same physical location, which increases query performance.
- Even - The data is spread in a round-robin fashion across the nodes in the cluster
- All - A full copy of the table is stored on every node, which helps query execution for small, frequently joined tables
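A minimal sketch of the three styles, with hypothetical table names:

```sql
-- KEY: rows are distributed by the values in the declared column.
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT DISTKEY,
    amount      DECIMAL(12,2)
);

-- EVEN: rows are spread round-robin across the slices.
CREATE TABLE click_log (
    event_ts TIMESTAMP,
    url      VARCHAR(1024)
)
DISTSTYLE EVEN;

-- ALL: a full copy of the table is stored on every node.
CREATE TABLE country_dim (
    country_code CHAR(2),
    country_name VARCHAR(100)
)
DISTSTYLE ALL;
```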
Suggested methods to choose distribution keys (a sketch follows this list):
- Define primary keys and foreign keys on the tables. Amazon Redshift does not enforce these constraints, but they enable the leader node to choose an optimal query execution plan.
- Distribute the fact table and the largest dimension on their common join column using the KEY distribution style to get better performance
- Use the ALL distribution style for the other, smaller dimensions
- If a table is not involved in joins, or is denormalized, use EVEN distribution
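Putting these suggestions together, a minimal sketch (with hypothetical names) collocates the fact table and its largest dimension on the join column and declares the informational key constraints:

```sql
-- Largest dimension: distributed on the join column.
CREATE TABLE customer_dim (
    customer_id BIGINT PRIMARY KEY DISTKEY,
    name        VARCHAR(200)
);

-- Fact table: same DISTKEY, so joining rows land on the same slices.
-- Redshift does not enforce the constraints, but the planner uses them.
CREATE TABLE store_sales (
    sale_id     BIGINT PRIMARY KEY,
    customer_id BIGINT DISTKEY REFERENCES customer_dim (customer_id),
    amount      DECIMAL(12,2)
);
```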
- Select Compression Methods
Compression is a column-level feature that stores data on disk more efficiently, which in turn makes operations on that data more efficient. Compression can be applied manually to the table or automatically using the COPY command. AWS recommends using the COPY command to apply compression.
Automatic compression using the COPY command can only be applied to an empty table. We cannot change the compression encoding of a column once it is defined; however, you can introduce a new encoding by adding a new column with the ALTER TABLE command.
ANALYZE COMPRESSION produces a report suggesting an encoding for each column of the analyzed tables, including an estimate of the potential reduction in disk space. The command takes an exclusive lock on the table, so all reads and writes are blocked while it runs. These commands are shown in the sketch below.
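A short sketch of these commands, assuming a hypothetical table and S3 path:

```sql
-- COPY into an empty table with COMPUPDATE ON samples the incoming
-- data and applies encodings automatically (names are hypothetical).
COPY store_sales
FROM 's3://my-landing-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/CopyRole'
COMPUPDATE ON;

-- Report suggested encodings and the estimated disk-space savings for
-- each column; this takes an exclusive lock on the table.
ANALYZE COMPRESSION store_sales;

-- An existing column's encoding cannot be changed in place, but a new
-- column can be added with the desired encoding.
ALTER TABLE store_sales ADD COLUMN sale_note VARCHAR(256) ENCODE lzo;
```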
| Encoding type | Keyword in CREATE TABLE and ALTER TABLE | Data types | Notes |
| --- | --- | --- | --- |
| Raw (no compression) | RAW | All | Default encoding for columns defined as sort keys and for BOOLEAN, REAL, and DOUBLE PRECISION columns |
| Byte dictionary | BYTEDICT | All except BOOLEAN | Best for columns with few unique values, such as gender, country, or other lookup types |
| Delta | DELTA, DELTA32K | DELTA: SMALLINT, INT, BIGINT, DATE, TIMESTAMP, DECIMAL. DELTA32K: INT, BIGINT, DATE, TIMESTAMP, DECIMAL | Very useful for datetime columns. DELTA stores 1-byte deltas in the range -127 to +127; DELTA32K stores 2-byte deltas in the range -32K to +32K |
| LZO | LZO | All except BOOLEAN, REAL, and DOUBLE PRECISION | Best for VARCHAR columns |
| Mostlyn | MOSTLY8, MOSTLY16, MOSTLY32 | MOSTLY8: SMALLINT, INT, BIGINT, DECIMAL. MOSTLY16: INT, BIGINT, DECIMAL. MOSTLY32: BIGINT, DECIMAL | |
| Run-length | RUNLENGTH | All | |
| Text | TEXT255, TEXT32K | VARCHAR only | When the column repeats the same words many times |
| Zstandard | ZSTD | All | |
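As a sketch, the encodings above can be applied explicitly with the ENCODE keyword in the column definition (the table and column names are hypothetical):

```sql
CREATE TABLE customer_profile (
    customer_id  BIGINT        ENCODE zstd,     -- general-purpose
    signup_date  DATE          ENCODE delta,    -- datetime column
    country_code CHAR(2)       ENCODE bytedict, -- few unique values
    bio          VARCHAR(1000) ENCODE lzo       -- long free text
);
```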