Redshift is the relational columnar database provided by Amazon. It is a pretty powerful database with a lot of cool features. Redshift is the extension of PostgreSQL. The documentation for redshift very nice and provides all the details required for a new developer to learn the product with ease.
AWS Free Tier provides 2 months exclusive offer for the Redshift database access.
https://aws.amazon.com/redshift/resources/. This is 750 hrs of free usage.
Difference between Amazon Redshift database and PostgreSQL are summarized here.
https://docs.aws.amazon.com/redshift/latest/dg/c_redshift-and-PostgreSQL.html
Redshift has the leader node which drives the query execution along with computing nodes. The power of compute nodes makes the redshift distributed system. We can scale the nodes to meet the performance level of query execution.
The application sends the query to the leader node, the leader node parses the queries, generates the execution plan and passes the same to compute nodes parallelly. The compute node upon receiving the execution plan, executes the queries and generates the results. This intermediate results from compute nodes aggregated by the leader node and returned to applications for consumption.
Leader node coordinates the query execution and query compilation logic with each of the compute node. Depending upon the distribution of keys of table design data gets distributed to compute node for the calculation.
Redshift clusters are categorised broadly into dense storage (ds) and dense compute (dc ).
ds2 nodes are designed for large volume workloads of data using hard drive, dc family cluster designed for intensive calculation and query performance, however, provides much lower space on the storage.
Redshift clusters are of 2 generations dc1 or dc2. AWS recently released dc2 family with the same cost as that of dc1 family. The dc2 family has the better performance compared to a dc1 family.
Below link gives the different size of clusters AWS provides for the user to choose.
https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html
Dense Storage Node Types
Node Size
|
vCPU
|
ECU
|
RAM (GiB)
|
Slices Per Node
|
Storage Per Node
|
Node Range
|
Total Capacity
|
ds2.xlarge
|
4
|
13
|
31
|
2
|
2 TB HDD
|
1–32
|
64 TB
|
ds2.8xlarge
|
36
|
119
|
244
|
16
|
16 TB HDD
|
2–128
|
2 PB
|
Dense Compute Node Types
Node Size
|
vCPU
|
ECU
|
RAM (GiB)
|
Slices Per Node
|
Storage Per Node
|
Node Range
|
Total Capacity
|
dc1.large
|
2
|
7
|
15
|
2
|
160 GB SSD
|
1–32
|
5.12 TB
|
dc1.8xlarge
|
32
|
104
|
244
|
32
|
2.56 TB SSD
|
2–128
|
326 TB
|
dc2.large
|
2
|
7
|
15.25
|
2
|
160 GB NVMe-SSD
|
1–32
|
5.12 TB
|
dc2.8xlarge
|
32
|
99
|
244
|
16
|
2.56 TB NVMe-SSD
|
2–128
|
326 TB
|
Node selection generally depends upon 3 categories
- Amount of data being loaded to redshift
- The response time expected by applications for requests
- The complexity of the queries executed
Data ingestion to Redshift cluster can be done using multiple methods.
The efficient method of data ingestion to Redshift database is copying the data using S3.
Data load to redshift using S3 will use MPP architecture and loads will be efficient.
Table design and performance tuning for the redshift database will be covered as part of next post
Dear Ram Ji,
ReplyDeleteFirst of all thanks for you that sharing and explore the Redshift database.
This blog will help for beginners like me,
The way you shared the DW system architecture and mentioned the detail about AWS free Redshift database access links and more it's very usable notes.
We expect more blogs like this.
Thanks