Wednesday, 2 October 2024

Become a Big Data Engineer from a Traditional DW/BI Developer - Part 1




A general anxiety for the BI/DW developer is how to cope with changing technologies and what to learn next in order to move forward.

However, if you understand the evolution of technology, it has always been about learning new things and adapting to change. Someone who wants to embrace the change has to put in the required hours of dedicated learning.

Learning new things is always uncomfortable until you find the chord that makes the music pleasant to hear. We have to put in the required effort to get the music going, or find the right instrument suited to our skill.

I have listed a few general things a DW/BI developer should embrace to become a big data engineer.

The traditional BI/DW developer keeps asking one question: how can I survive the big data wave and still stay relevant in the industry? There are any number of technologies and buzzwords that make a developer uncomfortable. Part of the issue is that many customers and peers in the industry have borrowed these words from webinars, blogs and conversations with coworkers.


The phrase Big Data sounds scary to traditional data warehouse developers; I was one of them a few years ago. Not any more: big data, small data, streaming data, any kind of data is, at the end of the day, data to be managed, processed, analysed and presented for consumption. I expect our industry will slowly normalize and start looking simply for data engineers, irrespective of the amount of data they have to manage. The tools and technologies used to manage different volumes of data are different, which is what created the separate league of data management folks called big data engineers.

Here are a few tips I have penned down; these ideas will help a traditional DW developer become a big data developer, or at least stay relevant in the world of big data. These are open suggestions, not a hardcore list to be followed, and I am still working on the transition path myself.

1. Books
2. Blog
3. Webinars
4. Programming
5. Project exposure
6. Cloud Solution

Learn a programming language such as Python, Java or Scala. This is mandatory for anyone willing to jump into the big data world as a developer. It might feel a little difficult to learn programming initially, but it will help a lot in moving forward in that direction. Most of the technologies use programming for writing map-reduce functions or data processing functions with a few built-in libraries from the base language, whether those programs are written in Python, Scala or Java.


Books -
Read at least 3-4 books on the big data subject to understand the various jargon of the industry. There are terms that may be new for people moving from the DW world to the big data world; get familiar with that terminology.

Start with simple books. Don't try to buy all of them; initially you can borrow books from the library to read and explore the options.

Blog - 
Regularly read blogs related to the big data world and try to contribute to the blogging community with your own efforts. Understand the patterns and trends used to solve big data problems. Try to put your learnings into words, or teach someone what you have learnt. This helps in two ways: it consolidates what you have learned, and it inspires your coworkers to start learning the subject along with you.


Webinars -
Attend at least two webinars a month to understand what is trending in the industry and which new technologies are being released. A webinar delivers a great deal of understanding instantly, whereas on your own you might need to spend 3-4 hours to learn a few simple new things.

Ask questions in the webinar to solidify your understanding, read relevant posts and technical material related to the webinar, and follow the webinar series to get updates.


Programming:- 
Let me explain how I am learning to program, and I am still learning.

I started learning Python; here are a few things I started to work on.

1. Pick an online Python course; there are many courses available online for free on Coursera, edX, Udemy, etc.

2. Complete the course with all exercises and project work

3. Buy a book on Python. I am reading "Learn Python the Hard Way" by Shaw. This book is very good and teaches programming for the layman; it explains concepts from the basics and builds knowledge of the subject slowly.

4. Once you complete "Learn Python the Hard Way", get "Learn More Python the Hard Way" by the same author, the next book in the series from the same publisher.

5. Subscribe to the relevant Stack Overflow tags and check for problems you can solve.

6. Work on open-source projects so that you can help community developers.

Repeat steps 3 to 6 until you have mastered the subject. This is a time-consuming activity, but it is the only way to master the subject.


Project Exposure -
Start working on a project that applies the techniques learnt during this process. Don't over-engineer the project with everything you have learned; apply simple techniques and explore the way forward. Make sure you are part of all phases of the project and implement solutions. Put your hands and mind into places where you feel uncomfortable; the things that make you uncomfortable are the things that will put you on the right path.


Cloud Solution -
Pick any cloud platform and explore implementing solutions using its stack. It might take some time to absorb the concepts of the cloud relative to on-premise, licensed solution implementations.

AWS, Microsoft and Google are the leading cloud providers driving the great transformation towards cloud solution implementations.


Conclusion - learning or mastering big data engineering is not rocket science. Anybody willing to put in the required effort will master the techniques in no time.

Go out, play it fully and become a Big Data Engineer. All the best!


Wednesday, 18 March 2020

Data Lake on AWS - Version -1

The data lake is a buzzword in the data management world. Everybody talks about implementing a data lake and the best-practice methodology for building one.

I would like to put forward my thoughts on implementing a data lake.

The data lake is a large, common store for data from across the enterprise. It receives data from a wide variety of sources.

Data from multiple source systems is extracted and ingested into the data lake. The data lake can be object storage such as S3, Azure Blob Storage, ADLS or Google Cloud Storage. Source systems such as CRM, ERP, flat files, OLTP systems and other business applications feed into the data lake layer.

Each layer of the data lake is called a zone. Typically a data lake will contain a Raw Zone, a Processed Zone and a Reporting Zone.

Raw zone - data is loaded into this layer as is, and it provides an audit trail of the data extracted and loaded into the data lake. Whenever an audit of a source system's data load is needed, the raw zone provides the source of truth.

Processed zone - data transformation, i.e. applying business rules, data clean-up, data formatting etc., is done in this layer. The processed layer stores the transformed business data.

Reporting zone - this layer is used for end-user consumption, and the final visualization layers are built on top of it. The consolidated data layer for reporting is also built here.

Data scientists and other users extract data from whichever zones they need for further analysis.

Data in its raw format is of little use on its own; it is just a heap, and it needs to be transformed and stored effectively for usage.

Data processing on files is best done using Spark, PySpark or map-reduce code; Hive or any other query engine can also be used for the processing. A minimal illustrative PySpark sketch follows.
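For illustration only, here is a minimal PySpark sketch that reads raw-zone files from S3, applies a simple clean-up, and writes Parquet into the processed zone. The bucket name, prefixes and column names are placeholders, not from any real project, and the cluster is assumed to already have S3 access configured (e.g. on EMR).

# Minimal PySpark sketch: raw zone -> processed zone on S3.
# Bucket, prefixes and column names are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-to-processed").getOrCreate()

# Read the as-is CSV files landed in the raw zone
raw_df = spark.read.csv("s3://my-datalake/raw/sales/2020/03/18/",
                        header=True, inferSchema=True)

# Apply simple business rules / clean-up (placeholder logic)
processed_df = (raw_df
                .dropDuplicates(["order_id"])
                .filter(F.col("order_amount") > 0)
                .withColumn("load_date", F.current_date()))

# Store the transformed data in the processed zone as Parquet
processed_df.write.mode("overwrite").parquet("s3://my-datalake/processed/sales/")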


Data Ingestion

The data ingestion process loads data from the external world into S3, which is the data lake.
Different options for data ingestion include a traditional ETL tool, Python scripts, AWS Kinesis or Kinesis Data Firehose, Databricks on AWS, or Spark code. In an AWS environment, ingestion is made fairly easy with services like Kinesis; however, a lot more work is needed to make the ingestion effective.
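As a simple illustration of script-based ingestion, the boto3 sketch below uploads a source extract into a date-partitioned raw-zone prefix; the bucket, file path and prefix layout are assumptions, and the same idea applies whether the file comes from an ETL tool, a Firehose delivery stream or a manual extract.

# Minimal ingestion sketch with boto3: push a source extract into the
# raw zone of the data lake under a date-partitioned key.
# Bucket, file and prefix names are illustrative placeholders.
import datetime
import boto3

s3 = boto3.client("s3")

today = datetime.date.today()
key = (f"raw/crm/accounts/year={today.year}"
       f"/month={today.month:02d}/day={today.day:02d}/accounts.csv")

# Upload the local extract produced by the source-system pull
s3.upload_file(Filename="/tmp/accounts.csv", Bucket="my-datalake", Key=key)
print(f"Loaded s3://my-datalake/{key}")

The date-based key layout above is also one way to achieve the time-series organization described in the next step.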

Data Storage  
After ingesting data into the data lake, a very important step is to organize the data suitably in S3. Make sure files are partitioned and stored based on a time series or another logical separation; data access from S3 is much easier if the files are organized properly.


Data Grouping
Grouping data with respect to reporting is very important. Data is logically grouped and then used for reporting.

Data Exploration
Once the data is in the data lake, data scientists will use it to explore and analyze. Data exploration is done using various tools and programming languages.


Data Quality
Data quality is a very important part of the process. Data issues need to be addressed at the source to make sure the data is clean; trying to fix data quality issues outside the source system is usually challenging.

Data Governance 
Data governance for a data lake is a fairly challenging task. We need to take care of metadata tables, audit configuration, and identification of invalid records for error processing. A data governance team should be set up for data validation and to help build the process across the company. A process for monitoring the loads needs to be put in place, and the schedule should be monitored regularly.

Archiving the data
We might need to archive historical data. Older data that is not required for processing should be moved to low-cost storage.

The tools and technologies used for each of these processes depend on the ingestion methodology and platform chosen.

The data lake implementation should be driven by business use cases; how the data lake is used will follow from them. A few use cases should be defined up front, and a use case could relate to one process of a particular business unit.

e.g. if we are replacing the storage mechanism of an enterprise data warehouse with a data lake, then that specific problem can be solved: the data insights require data to be massaged and stored in a particular format, which helps in designing the final reporting layer and the DW layer.

If we don't have a specific use case to start with, data lake formation will be a never-ending process.


Friday, 26 April 2019

Tuning the Reporting Layer using Redshift

Redshift, being a columnar database, is good for ETL loads and transformations, but we need to take extra care with the presentation layer. Concurrent query execution on a Redshift cluster is not great even if you use larger clusters; the maximum number of concurrent queries recommended by AWS is 15, whatever the number of cluster nodes.

A reporting layer built on top of the Redshift database should therefore be strongly focused on performance and query tuning.

The typical solution for performance tuning is to increase node capacity vertically, as is done with relational databases; however, in most distributed environments this does not hold.

There are a couple of alternative methods available to tackle this problem. One is to create a single table with all the metrics required for reporting and present the data from that layer; the reporting-layer table is then just the final presentation table.


Another effective way to tune queries in AWS Redshift is to use the WLM queues to allocate resources for query execution.


Before attempting to tune queries, the developer should make sure the pointers below are satisfied; query performance differs significantly based on these parameters (a small DDL sketch follows the list).

The below document from Amazon gives an overview of the distribution and sort keys and column encoding.
https://docs.aws.amazon.com/redshift/latest/dg/t_Creating_tables.html


1. Distribution Key - choose a proper distribution key for each table so that data retrieval is faster
2. Sort Key - choose sort keys appropriately so that the data is sorted properly in the database
3. Encoding - encoding ensures the database space is utilized efficiently
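As an illustration of these three pointers together, here is a hedged DDL sketch; the table, columns, key choices and encodings are assumptions that would need validating against real data volumes and query patterns, and the connection details are placeholders.

# Illustrative Redshift DDL covering the three pointers above:
# distribution key, sort key and column encoding.
# Table, columns and connection details are placeholders.
import psycopg2

ddl = """
CREATE TABLE reporting.sales_fact (
    sale_date    DATE          ENCODE RAW,       -- leading sort key column left RAW
    customer_id  BIGINT        ENCODE ZSTD,
    product_id   BIGINT        ENCODE ZSTD,
    channel      VARCHAR(20)   ENCODE BYTEDICT,  -- few distinct values
    amount       DECIMAL(12,2) ENCODE ZSTD
)
DISTKEY (customer_id)   -- common join column
SORTKEY (sale_date);    -- recent data is queried most
"""

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="etl_user", password="...")
with conn.cursor() as cur:
    cur.execute(ddl)
conn.commit()
conn.close()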


Configuring the WLM queue is a 4-step process:

1. Define new user groups
2. Define new queues along with workloads with memory allocation for each queue
3. Add users to the groups
4. Assign the rules to the queues along with the memory utilization

First, define a parameter group for the cluster. Give the parameter group a name; a single cluster can have multiple parameter groups. A minimal sketch of setting the WLM configuration through the API follows the documentation link below.
Refer documentation from Redshift to tune the queries - https://docs.aws.amazon.com/redshift/latest/dg/wlm-queue-hopping.html
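Below is a minimal, hedged sketch of pushing a WLM configuration through the API with boto3; the parameter group name, user group, concurrency and memory splits are assumptions, and the same JSON can equally be entered through the console.

# Sketch: attach a two-queue WLM configuration to an existing parameter group.
# Group names, concurrency and memory splits are illustrative only.
import json
import boto3

wlm_config = [
    {   # queue for reporting users
        "user_group": ["reporting_users"],
        "query_concurrency": 10,
        "memory_percent_to_use": 60,
    },
    {   # default queue for ETL and everything else
        "query_concurrency": 5,
        "memory_percent_to_use": 40,
    },
]

redshift = boto3.client("redshift")
redshift.modify_cluster_parameter_group(
    ParameterGroupName="reporting-wlm-params",  # placeholder parameter group
    Parameters=[{
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(wlm_config),
    }],
)
# The cluster must use this parameter group, and a reboot is needed
# for a static WLM change to take effect.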

Table hygiene is another important step in optimizing a Redshift cluster.

Regularly running VACUUM and ANALYZE on the tables keeps the table statistics up to date, which is an important step towards better query performance.
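A small maintenance sketch is shown below; VACUUM cannot run inside a transaction block, so autocommit is enabled. The connection details and table names are placeholders.

# Routine table hygiene sketch: VACUUM and ANALYZE the reporting tables.
# Connection details and table names are placeholders.
import psycopg2

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="admin_user", password="...")
conn.autocommit = True  # VACUUM cannot run inside a transaction block

for table in ["reporting.sales_fact", "reporting.customer_dim"]:
    with conn.cursor() as cur:
        cur.execute(f"VACUUM {table};")   # re-sort rows and reclaim deleted space
        cur.execute(f"ANALYZE {table};")  # refresh planner statistics
conn.close()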

 Query execution can be made faster using alternative methods like Athena and Spectrum.

Using Athena is another way to achieve performance gains for ad-hoc queries. Athena is serverless, and for such workloads query execution can be considerably faster than on a fixed set of cluster nodes.

Data can be kept in the S3 layer in Parquet format, and query execution over it with Athena is fast.

https://aws.amazon.com/athena/faqs/
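For an ad-hoc query against Parquet data in S3, a minimal Athena call might look like the sketch below; the database, table and output location are assumptions, and the table itself would already be defined in the Glue/Athena catalog.

# Sketch: run an ad-hoc query with Athena against Parquet data in S3.
# Database, table and S3 locations are illustrative placeholders.
import time
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT sale_date, SUM(amount) FROM sales GROUP BY sale_date",
    QueryExecutionContext={"Database": "datalake_reporting"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/adhoc/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes (simplified; no error handling)
while True:
    state = athena.get_query_execution(
        QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)
print("Athena query finished with state:", state)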

Spectrum - Redshift Spectrum extends query execution across database tables and the file system.
Historical data can live in files on S3 while current data stays in the Redshift cluster, and the two can be queried together for better overall query execution.

https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html
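A hedged Spectrum sketch is shown below: an external schema pointing at the Glue data catalog plus an external table over historical Parquet files, which can then be joined or unioned with current data held in Redshift. The IAM role ARN, names and S3 paths are assumptions.

# Sketch: expose historical S3 data to Redshift through Spectrum so that
# old data in files and current data in the cluster can be queried together.
# IAM role ARN, names and paths are illustrative placeholders.
import psycopg2

statements = [
    """
    CREATE EXTERNAL SCHEMA spectrum_hist
    FROM DATA CATALOG
    DATABASE 'datalake_hist'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """,
    """
    CREATE EXTERNAL TABLE spectrum_hist.sales_history (
        sale_date   DATE,
        customer_id BIGINT,
        amount      DECIMAL(12,2)
    )
    STORED AS PARQUET
    LOCATION 's3://my-datalake/processed/sales_history/';
    """,
]

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="etl_user", password="...")
conn.autocommit = True  # external DDL cannot run inside a transaction block
with conn.cursor() as cur:
    for sql in statements:
        cur.execute(sql)
conn.close()

# Historical and current data can then be combined, for example:
# SELECT sale_date, customer_id, amount FROM spectrum_hist.sales_history
# UNION ALL
# SELECT sale_date, customer_id, amount FROM reporting.sales_fact;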






Tuesday, 22 January 2019

Data Modelling in Big Data World


Data Modelling -

Data modelling is more of a business-driven approach than purely technical work.
A data modeller should not only have sound knowledge of the tools and technology but should also understand the business process quite deeply.
A bus matrix of the business processes helps to define the entities and their relationships appropriately.

Bus Matrix Diagram

We hear from many people that data modelling is obsolete and not required in the world of big data processing. However, data modelling is still relevant in the big data world. Processing huge volumes of data has become easier thanks to new technologies and tools, but that does not diminish the importance of data modelling: the structure of the database design has a far greater impact on data storage and the longevity of data access.

Data modelling is part of all the different phases of a data management system. Data management systems generally consist of four stages of data processing:
  1. Ingest 
  2. Storage 
  3. Transform 
  4. Presentation
The ideal solution for any project would be an explicit separation between these layers with a loosely coupled architecture. Good design provides a robust methodology to handle these different data management processes.

Data modelling should be kept in mind while designing all these layers. AWS has many services that help in each phase of the data management cycle; S3, with its petabyte-scale storage, is the best fit for the data ingestion layer.

AWS Redshift provides the much-needed flexibility as the transformation and processing engine.
AWS QuickSight can be used for the presentation layer.

A report-driven approach to data modelling gives an effective way to justify the data needed from the OLTP/source systems. It helps narrow down the use cases to be served by the data lake. Table design and reporting-layer design become much easier with this bottom-up approach, which is the design method proposed by Kimball.

 Data modelling involves the following steps
  1. Source System Analysis
  2. Defining the process flow
  3. Defining the ER model - Logical Data model/Dimensional Model 
  4. Review the ER model/Dimensional Model with SME to understand if there are any gaps in the understanding
  5. Defining the Physical Data model and optimizing the model for the presentation layer
  1. Source System Analysis
  • Functional Knowledge of Domain
  • System technical knowledge
  • Access to the source system
  • Data extraction methodologies - identify the optimal method of data extraction
  • Source system documentation - Knowledge transfer along with reverse KT
  • Knowing the business
  • Knowing the business process flow
  • Understand key terminology and Master entities from heterogeneous source systems
  • Determine the Golden source for the master entities
  • Sessions with SME to go over the process and data importance
  • Understand the end user need with respect to data 
  • How end user is using the data
  • Understand existing data irregularities
  • Keep moving the available data to the next steps and stay agile and flexible, adapting to changes as we discover more about the systems and data

      2. Defining the process flow: a process can span multiple business units. A process flow could be sales order creation, invoice generation, item fulfilment, payment, etc. Each of these processes may involve data from one or more business units. E.g. sales order creation starts with lead generation and runs through opportunity closed/won, which is recorded in the CRM application, while the actual sales order gets created in the ERP system. Thus, when data modelling is done for a DWH, it should be focused on the process rather than on a system; invariably, data from multiple systems is combined to generate the reports required from the DWH.

    3a. Defining the ER model for the process - list out all the entities involved in the process and draw the relationships between them. The relationships should clearly state the cardinality between the entities and whether each relationship is strong or weak.

    3b. Defining the dimensional model for the process - the bus matrix provides the data modeller with a picture of which dimension and fact tables to consider for dimensional modelling. The context of analysis becomes the dimensions and the transactional data becomes the facts. Once the model is ready, it is reviewed with the customer to finalize it.

    4. Review the ER model with the business SME: the next step after the ER/dimensional model is to review it with the business SME to find any gaps. The model should explain all the relationships unambiguously to the users and should answer all the questions the business is trying to answer.

  5. Defining the physical model and optimizing it for the presentation layer: the physical model may not be an exact replica of the logical ER model. A few entities might be merged into a common table for optimization and consolidation; this process should ensure data integrity is maintained and preserved as we move forward. A few aggregated tables may also be created in the presentation layer for better reporting.

Documentation of the data model is as important as building it; it gives the support team an understanding of the model.
Versioning the data model along with its documentation lets the team understand what changed as part of each release.

Overall data modelling is still relevant in the big data world and there is a lot of scope for Good Data Modellers.



References - the data warehouse bus matrix image is from Ralph Kimball's website. Much of the content is a summary of my learnings from various books and websites.






Monday, 26 November 2018

ETL and Redshift Data Loads for DWH

Cloud Data Warehouse Implementation using Redshift

 ETL Design for the Landing, Staging, DW and Reporting layer.

Landing Area - S3, the Simple Storage Service from AWS, will be used for the landing-area load. We can create a different bucket for each source system to handle its files.
Data ingestion into the S3 layer from the source systems has to be dealt with separately. If we are pulling data from relational databases, CSV or Excel files, or flat files, then an open-source ETL tool like Pentaho can be used for orchestration.

Daily incremental data is copied directly into tables from S3 using the Redshift COPY command.
The COPY command from S3 to Redshift is a great tool for high-volume data loads; a minimal sketch follows.
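A minimal COPY sketch for a daily incremental file is shown below; the table, bucket, IAM role and file format options are placeholders and would depend on the actual extract.

# Sketch: load a daily incremental file from S3 into a Redshift landing table with COPY.
# Table, bucket and IAM role are illustrative placeholders.
import psycopg2

copy_sql = """
COPY landing.sales_incremental
FROM 's3://dw-landing/erp/sales/2018/11/26/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
IGNOREHEADER 1
DATEFORMAT 'auto';
"""

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="etl_user", password="...")
with conn.cursor() as cur:
    cur.execute(copy_sql)
conn.commit()
conn.close()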

If there are specific requirements for archiving the files, we can use a separate bucket for that. S3 versioning and access restrictions provide great flexibility for the implementation.

With S3 versioning we can overwrite existing files while keeping a copy of the old file for future reference or audit compliance.



Transformation/Processed Layer-
Data transformation rules are applied to the landing tables and the processed data is loaded into interim tables.
Generally, the transformation should be performed as a multi-step operation within a transaction, wrapped in BEGIN and END commands.
Create as many temp tables as required to help the operation.

Data Mart/DW-

Data from the processed layer is loaded into the actual mart tables using Redshift commands.

Below are a few different DW table loading methodologies.

Dimensional Data Load 

SCD-1 

/* Step 1 - Unload the existing dimension data to S3 - temp_existing_records_dim.csv */
/* Step 2 - Delete the rows from the dimension table whose keys match the incremental table */
/* Step 3 - Copy the data from S3 back into the actual dimension table */
/* Step 4 - Insert all data from the incremental table into the dimension table */
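As a hedged illustration, the core of the SCD-1 load can be condensed to the common delete-then-insert pattern shown below, run as a single transaction and leaving out the S3 unload/backup steps; the table, schema and column names are placeholders.

# Simplified SCD-1 sketch: overwrite changed dimension rows by deleting the
# matching keys and re-inserting from the incremental (staging) table.
# It omits the S3 unload steps listed above; all names are placeholders.
import psycopg2

delete_matching = """
DELETE FROM dw.customer_dim
USING stage.customer_incremental s
WHERE customer_dim.customer_key = s.customer_key;
"""

insert_incremental = """
INSERT INTO dw.customer_dim (customer_key, customer_name, city, updated_at)
SELECT customer_key, customer_name, city, GETDATE()
FROM stage.customer_incremental;
"""

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="etl_user", password="...")
with conn.cursor() as cur:   # psycopg2 runs both statements in one transaction
    cur.execute(delete_matching)
    cur.execute(insert_incremental)
conn.commit()
conn.close()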



SCD -2 

/* Step 1 - Copy the latest dimension records from the target dimension table into a temp table, temp_dim_copy */
/* Step 2 - From the incremental data, copy the existing records that have new values compared to the dimension table into temp_changed_existing_record, with an effective end date for the existing records. This is an inner join between the dimension and staging tables */
/* Step 3 - Copy all the incremental file data into the temp_new_changed_record table. This will contain the new and modified record set with a new effective date. This is a left join between the incremental table and the dimension table */
/* Step 4 - Unload all the data from temp_changed_existing_record to S3, to mark rows for deletion from the dimension table - temp_changed_existing_record.csv */
/* Step 5 - Delete the matching records from the dimension table, joining with temp_changed_existing_record on the key */
/* Step 6 - Copy the unloaded file temp_changed_existing_record.csv back into the dimension table; these rows carry the updated end date for the previously existing records */
/* Step 7 - Insert both the new records and the changed records into the dimension table from temp_new_changed_record */
/* Step 8 - Drop the temp tables */
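The steps above can also be expressed with plain UPDATE/INSERT statements. The hedged sketch below shows that condensed variant (expire the changed current rows, then insert the new versions) without the temp tables and S3 unload; the table, column and tracked-attribute names are placeholders.

# Condensed SCD-2 sketch: expire the current version of changed rows, then
# insert new versions from the incremental (staging) table.
# All names are placeholders; run both statements in one transaction,
# e.g. with the same psycopg2 pattern as the SCD-1 sketch above.

expire_changed = """
UPDATE dw.customer_dim
SET effective_end_date = CURRENT_DATE - 1,
    is_current = FALSE
FROM stage.customer_incremental s
WHERE customer_dim.customer_key = s.customer_key
  AND customer_dim.is_current = TRUE
  AND customer_dim.city <> s.city;            -- a tracked attribute changed
"""

insert_new_versions = """
INSERT INTO dw.customer_dim
    (customer_key, customer_name, city,
     effective_start_date, effective_end_date, is_current)
SELECT s.customer_key, s.customer_name, s.city,
       CURRENT_DATE, DATE '9999-12-31', TRUE
FROM stage.customer_incremental s
LEFT JOIN dw.customer_dim d
       ON d.customer_key = s.customer_key AND d.is_current = TRUE
WHERE d.customer_key IS NULL;                 -- brand-new keys or keys just expired
"""

print(expire_changed)
print(insert_new_versions)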


Fact Table Load-

There are three different modes of fact table loads.

Transaction fact - insert only
This fact table load is simple: a plain insert of the records coming from the source table.
Data is loaded from the landing table to S3 and then copied to the actual table.

Accumulative fact - insert and update
In an accumulative fact, data is updated for existing business keys, new records are inserted, and old records are copied back from the previous snapshot if they have not changed.

Snapshot fact table
The incoming records are loaded as a snapshot for a given day, giving the status of the fact for that day.



Loading strategy for the three different facts-

Transaction Fact -

1. Data from the landing table is simply copied to the target table with an INSERT INTO command

Accumulative Fact

1. Data from the target table is loaded into a temp_target table
2. Data from the landing table, i.e. the incremental transaction data for the given day, is loaded into a temp_increment table
3. Create another table, temp_table1, by joining temp_target and the incremental table to identify the matching data in the current day's load
4. Unload all the records from temp_table1 to S3
5. Delete the matching records from the fact table by joining it with temp_table1
6. Copy the data back from S3 to restore the old, unchanged records
7. Insert the new and changed records from the incremental temp table


Snapshot Fact Table Load

1. This fact creates a daily snapshot of the complete data set.
2. Load today's data into the table.
3. Copy over any records missed from today's load by left joining the current data to the existing fact table.



Tuesday, 18 September 2018

Redshift for Beginner - Part -2

Designing the tables in Redshift

The general table design principles of data modelling have to be followed to define the star schema/snowflake model. Redshift is better suited to a star schema and makes reporting analytics easier.

DBSchema is one of the data modelling tools that can be used for Redshift database design.

https://www.dbschema.com/index.html 

The general design of the Data warehouse is as shown below. 





A general rule of thumb to make the ETL load faster is to have landing-area tables without any keys defined, as in relational databases. In the case of Redshift, however, we can use S3 itself as the landing area and avoid duplicating the data onto the Redshift cluster. The practically unlimited storage provided by S3 is a boon for the data warehouse developer. This method not only saves a lot of space on the Redshift cluster, it also saves money by not using expensive Redshift storage for landing data.

Tools like Redshift Spectrum and Athena can connect to S3 directly, which makes this approach even more advantageous.

Table design on Redshift depends on the following factors:
  1. The uniqueness of the data 
  2. The Volume of the data 
  3. Frequently queried data by the end user for analytical purposes
Distribution key, sort key and compression logic are the three major design factors when building efficient tables in a Redshift database. The Redshift documentation has an in-depth explanation of distribution keys, sort keys and compression; tables must follow these design principles to get the best query performance.

The AWS Redshift GitHub page below provides a lot of utilities which developers can use to analyse and fix table designs efficiently.

Link to the GitHub Redshift utilities repository, maintained by the AWS team:
https://github.com/awslabs/amazon-redshift-utils



  • Select Sort keys.


Redshift stores the data on disk, and sort keys determine the order in which it is stored. Sorted data improves I/O performance and query throughput. Below are the different scenarios to consider when choosing sort keys:
  1. If the most recent data is queried most, specify the timestamp column as the leading sort key
  2. If you frequently filter on one column, specify that column as the sort key
  3. If you frequently join to a dimension table, specify the join column as the sort key


  • Select Distribution Keys 


The leader node manages query execution. Depending on the distribution of data and the distribution key, the leader node sends work to each slice of the nodes. The rules below need to be kept in mind while selecting the distribution key:
  1. Distribute the workload uniformly among the nodes in the cluster. Uneven distribution, or data distribution skew, forces some nodes to do more work than others, which impairs query performance
  2. Minimize data movement during query execution. If the rows that participate in joins or aggregates are already collocated on the nodes with their joining rows in other tables, the optimizer does not need to redistribute as much data during query execution


There are 3 types of distribution methods.

  1. Key - rows are distributed according to the values in the defined key column. If two tables are joined on a column, defining the distribution key on that joining column ensures the matching data of both tables is stored in the same physical location, increasing query performance.
  2. Even - the data is spread across the nodes in the cluster in a round-robin fashion
  3. All - a copy of the table data is kept on every node, which helps query execution for small tables


Suggested methods to choose distribution keys (a DDL sketch follows the list):

  1. Define primary keys and foreign keys on the tables. Amazon Redshift doesn't enforce these constraints; however, they enable the leader node to choose a better query execution plan.
  2. Distribute the fact table and the largest dimension on their common joining "KEY" using the KEY distribution style
  3. Give the other, smaller dimensions the "ALL" distribution style
  4. If a table is not involved in joins or is denormalized, use "EVEN" distribution
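A hedged DDL sketch illustrating these suggestions follows; the tables, the choice of customer_id as the common join key, and the keys themselves are assumptions.

# Sketch of the suggested distribution choices: fact and largest dimension on a
# common KEY, a small dimension as ALL, and a standalone table as EVEN.
# Table and column names are illustrative placeholders.

distribution_ddl = """
CREATE TABLE dw.sales_fact (
    sale_date   DATE,
    customer_id BIGINT,
    product_id  BIGINT,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY DISTKEY (customer_id)
SORTKEY (sale_date);

CREATE TABLE dw.customer_dim (     -- largest dimension, same join key as the fact
    customer_id   BIGINT PRIMARY KEY,
    customer_name VARCHAR(200)
)
DISTSTYLE KEY DISTKEY (customer_id);

CREATE TABLE dw.date_dim (         -- small dimension, copied to every node
    date_key   DATE PRIMARY KEY,
    month_name VARCHAR(20)
)
DISTSTYLE ALL;

CREATE TABLE dw.audit_log (        -- not used in joins, spread round-robin
    event_time TIMESTAMP,
    message    VARCHAR(500)
)
DISTSTYLE EVEN;
"""

print(distribution_ddl)  # run these statements with your Redshift client of choice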



  • Select Compression Methods


Compression is a column-level feature which allows data to be stored on disk efficiently; operations involving that data then become more efficient. Compression can be applied manually to a table or automatically by the COPY command, and AWS recommends using the COPY command to apply compression.

Automatic compression via the COPY command can only be applied to an empty table. We cannot change the compression encoding of a column once it is defined; however, you can introduce a new encoding by adding a new column with an ALTER TABLE command.

ANALYZE COMPRESSION produces a report suggesting encodings for the analysed tables; for each column, the report includes an estimate of the potential reduction in disk space. ANALYZE COMPRESSION takes an exclusive lock on the table, so all reads and writes are blocked while it runs.
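A small sketch of requesting suggested encodings is below; given the exclusive lock mentioned above, it is best run in a quiet window. The table name and connection details are placeholders.

# Sketch: ask Redshift for suggested column encodings on a table.
# ANALYZE COMPRESSION takes an exclusive lock, so run it in a quiet window.
# Connection details and the table name are placeholders.
import psycopg2

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="admin_user", password="...")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("ANALYZE COMPRESSION dw.sales_fact;")
    # Each row reports: table, column, suggested encoding, estimated reduction (%)
    for table, column, encoding, est_reduction in cur.fetchall():
        print(table, column, encoding, est_reduction)
conn.close()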


Encoding types, their keywords in CREATE TABLE / ALTER TABLE, and the data types they apply to:

Raw (no compression) - RAW - all data types. This is the default for columns defined as sort keys and for BOOLEAN and DOUBLE PRECISION columns.

Byte dictionary - BYTEDICT - all data types except BOOLEAN. Best for columns with few unique values, like gender, country or lookup types.

Delta - DELTA (SMALLINT, INT, BIGINT, DATE, TIMESTAMP, DECIMAL) and DELTA32K (INT, BIGINT, DATE, TIMESTAMP, DECIMAL). Very useful for datetime columns. DELTA stores 1-byte deltas in the range -127 to +127; DELTA32K stores 2-byte deltas in the range -32K to +32K.

LZO - LZO - all data types except BOOLEAN, REAL and DOUBLE PRECISION. Best for VARCHAR columns.

Mostly - MOSTLY8 (SMALLINT, INT, BIGINT, DECIMAL), MOSTLY16 (INT, BIGINT, DECIMAL), MOSTLY32 (BIGINT, DECIMAL).

Run-length - RUNLENGTH - all data types.

Text - TEXT255 and TEXT32K - VARCHAR only. Useful when a column repeats the same words many times.

Zstandard - ZSTD - all data types.









Wednesday, 12 September 2018

Data Virtualization

Data Virtualization
Definition
Data virtualization is the art of managing data through a virtual data layer between the source and the end point of the data, i.e. reports or dashboards.

Data virtualization is different from machine virtualization, and the two are often confused. Machine virtualization involves sharing resources such as memory and CPU across many processes. Data virtualization involves data encapsulation, abstraction and data federation in data management.
There are many vendors for data virtualization, and the technique has been available in the industry for many years. Cisco Data Virtualization is one of the leading tools, and Denodo is another leading tool providing data virtualization capabilities.
How data virtualization works:
A data virtualization tool has a data management layer similar to a data warehouse project. However, a significant difference comes into play with data duplication and storage. In the typical data warehouse world, data is replicated in different stages: a landing area where the source data is copied, a staging area where data transformations are applied and stored, and finally the data mart or data warehouse loaded as per the data model. This replication of data takes a significant amount of ETL work, which in turn needs a lot of resources and money.

Data virtualization technology removes this redundant exercise of copying data to different layers and reduces the ETL work.

Data virtualization uses the technique of virtual tables / virtual views created on top of the source data. A virtual table pulls the data from the real source whenever it is required for processing, thus avoiding the replication process.

Data virtualization is built on a metadata framework rather than the creation of actual database objects. Whenever data is needed from the real objects, the virtual table pulls it using the metadata associated with the real tables; the metadata is stored in the repository of the data virtualization server. It works similarly to database views; however, data virtualization tools have methods that make query retrieval faster.

This technology can combine data from different sources: RDBMSs, Hadoop-based clusters, websites, web servers, log data and CRM data. Most data virtualization tools have built-in drivers to connect to this variety of data sources; if a default driver is not provided, a custom development framework helps build the driver required for the connection.

Different flavors of Data Virtualization:-

Method 1 - The data sources are connected directly to the data virtualization software, and the BI application accesses the data from the virtualization server.

Method 2 - The existing data warehouses/marts are combined using the data virtualization software, which acts as a collating layer that combines data from two or more warehouses or marts.

Method 3 - The data sources are connected directly to the data virtualization software, and the data warehouse gets its data from the virtualization layer, which acts as the single source for all BI reporting and dashboard needs.



Performance of queries:

Performance issues are considered a major roadblock for any data warehousing or BI solution. Data virtualization, being an additional layer between sources and targets, gives rise to a number of questions about performance. The performance of data retrieval is enhanced using caching and other optimization techniques within the data virtualization servers; the virtualization tools rewrite queries into simpler forms to fetch the required data faster.

Different methods available for query optimization with the leading data virtualization tools are

1. Query Substitution

2. SQL Push down 

3. Distributed Joins 

4. Ship Joins 

5. SQL Override

6. Cache Refresh 

7. Cache Replication

Summary :-

1. Data virtualization will not replace ETL. However, it helps ETL project execution by reducing the time it takes to complete

2. Data virtualization helps ETL projects achieve ROI more effectively

3. Data virtualization helps the data management layer combine heterogeneous data sources

Data virtualization is a boon for organizations. It accelerates the time to market for users' change requests, and project development and maintenance costs reduce significantly. Data virtualization provides a platform to combine various data sources and helps build a unified data management platform which is scalable, economical, flexible and efficient.

References -
1. Data Virtualization for Business Intelligence Systems by Rick F. van der Lans