TransformData: Redshift for Beginner -Part -1

Redshift is the relational columnar database provided by Amazon. It is a pretty powerful database with a lot of cool features. Redshift is the extension of PostgreSQL. The documentation for redshift very nice and provides all the details required for a new developer to learn the product with ease.

AWS Free Tier provides 2 months exclusive offer for the Redshift database access.
https://aws.amazon.com/redshift/resources/. This is 750 hrs of free usage.

Difference between Amazon Redshift database and PostgreSQL are summarized here.

https://docs.aws.amazon.com/redshift/latest/dg/c_redshift-and-PostgreSQL.html

Redshift has the leader node which drives the query execution along with computing nodes. The power of compute nodes makes the redshift distributed system. We can scale the nodes to meet the performance level of query execution.

The application sends the query to the leader node, the leader node parses the queries, generates the execution plan and passes the same to compute nodes parallelly. The compute node upon receiving the execution plan, executes the queries and generates the results. This intermediate results from compute nodes aggregated by the leader node and returned to applications for consumption.
Leader node coordinates the query execution and query compilation logic with each of the compute node. Depending upon the distribution of keys of table design data gets distributed to compute node for the calculation.

Redshift clusters are categorised broadly into dense storage (ds) and dense compute (dc ).
ds2 nodes are designed for large volume workloads of data using hard drive, dc family cluster designed for intensive calculation and query performance, however, provides much lower space on the storage.

Redshift clusters are of 2 generations dc1 or dc2. AWS recently released dc2 family with the same cost as that of dc1 family. The dc2 family has the better performance compared to a dc1 family.

Below link gives the different size of clusters AWS provides for the user to choose.
https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html

Dense Storage Node Types

Node Size	vCPU	ECU	RAM (GiB)	Slices Per Node	Storage Per Node	Node Range	Total Capacity
ds2.xlarge	4	13	31	2	2 TB HDD	1–32	64 TB
ds2.8xlarge	36	119	244	16	16 TB HDD	2–128	2 PB

Dense Compute Node Types

Node Size	vCPU	ECU	RAM (GiB)	Slices Per Node	Storage Per Node	Node Range	Total Capacity
dc1.large	2	7	15	2	160 GB SSD	1–32	5.12 TB
dc1.8xlarge	32	104	244	32	2.56 TB SSD	2–128	326 TB
dc2.large	2	7	15.25	2	160 GB NVMe-SSD	1–32	5.12 TB
dc2.8xlarge	32	99	244	16	2.56 TB NVMe-SSD	2–128	326 TB

Node selection generally depends upon 3 categories

Amount of data being loaded to redshift
The response time expected by applications for requests
The complexity of the queries executed

Data ingestion to Redshift cluster can be done using multiple methods.

The efficient method of data ingestion to Redshift database is copying the data using S3.
Data load to redshift using S3 will use MPP architecture and loads will be efficient.

Table design and performance tuning for the redshift database will be covered as part of next post

1 comment:

My Journey12 November 2018 at 03:57
Dear Ram Ji,

First of all thanks for you that sharing and explore the Redshift database.
This blog will help for beginners like me,
The way you shared the DW system architecture and mentioned the detail about AWS free Redshift database access links and more it's very usable notes.

We expect more blogs like this.

Thanks

Monday, 10 September 2018

Redshift for Beginner -Part -1

1 comment: