Wednesday, 12 September 2018

Data Virtualization

Definition
Data virtualization is the practice of managing data through a virtual data layer that sits between the data sources and the end points that consume the data, such as reports or dashboards.

Data virtualization is often confused with machine virtualization, but the two are different. Machine virtualization involves sharing resources such as memory and CPU across many processes. Data virtualization involves data encapsulation, abstraction and federation for the purpose of data management.
Many vendors offer data virtualization, and the technique has been available in the industry for many years. Cisco Data Virtualization is one of the leading tools, and Denodo is another leading tool that provides data virtualization capabilities.
How Data Virtualization Works:-
A data virtualization tool has a data management layer similar to that of a data warehouse project. The significant difference lies in data duplication and storage. In a typical data warehouse, data is replicated in several stages: a landing area, where the source data is copied; a staging area, where transformations are applied and the results stored; and finally the data mart or data warehouse, where the data is loaded according to the data model. This replication takes a significant amount of ETL work, which in turn consumes a lot of resources and money.

Data Virtualization technology removes this redundant exercise of copying the data to different layers and reduces the ETL work.

Data virtualization uses virtual tables (or virtual views) created on top of the source data. A virtual table pulls the data from the real source whenever it is needed for processing, which avoids the replication process.

Data virtualization is built on a metadata framework rather than on the creation of actual database objects. Whenever data is needed, the virtual table pulls it from the real tables using the metadata associated with them. The metadata is stored in the repository of the data virtualization server. This works much like database views. However, data virtualization tools offer additional methods that make query retrieval faster.
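
The idea of a virtual table that stores only metadata can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product implements it: the `VirtualTable` class, the `orders` table and the column names are all hypothetical, and an in-memory SQLite database stands in for the real source.

```python
import sqlite3

# Hypothetical source system: an in-memory SQLite database stands in for it.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 120.0), (2, "globex", 75.5)],
)


class VirtualTable:
    """Keeps only metadata; rows are pulled from the source on every read."""

    def __init__(self, connection, source_table, column_map):
        self.connection = connection
        self.source_table = source_table
        self.column_map = column_map  # virtual column name -> source column name

    def read(self):
        # Build the query from the stored metadata, then run it on the source.
        select_list = ", ".join(
            f"{src} AS {virt}" for virt, src in self.column_map.items()
        )
        cursor = self.connection.execute(
            f"SELECT {select_list} FROM {self.source_table} ORDER BY id"
        )
        names = [column[0] for column in cursor.description]
        return [dict(zip(names, row)) for row in cursor]


# The virtual table copies nothing; it only remembers where the data lives.
sales = VirtualTable(
    source, "orders", {"order_id": "id", "client": "customer", "total": "amount"}
)
```

Because `read()` goes back to the source each time, a row inserted into `orders` after the virtual table was defined shows up in the next read without any ETL step.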

This technology can combine data from different sources, such as RDBMSs, Hadoop-based clusters, websites, web servers, log data and CRM data. Most data virtualization tools have built-in drivers to connect to this variety of data sources. If a default driver is not provided, a custom development framework helps to build the driver required for the connection.
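
As a rough sketch of combining two kinds of sources, the Python below joins customer records from a relational database with page views from a CSV-style log. Everything here is assumed for illustration: the `customers` table, the log contents and the function names are hypothetical, SQLite stands in for the CRM database, and an in-memory string stands in for the log file.

```python
import csv
import io
import sqlite3

# Source 1 (assumed): a relational CRM database, here an in-memory SQLite database.
crm_db = sqlite3.connect(":memory:")
crm_db.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
crm_db.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
)

# Source 2 (assumed): a web log exported as CSV, here a plain string.
WEB_LOG_CSV = "customer_id,page\n1,/pricing\n2,/docs\n1,/contact\n"


def read_customers():
    """Fetch the customer lookup from the relational source."""
    query = "SELECT customer_id, name FROM customers"
    return {customer_id: name for customer_id, name in crm_db.execute(query)}


def read_page_views():
    """Parse the page views from the CSV source."""
    return list(csv.DictReader(io.StringIO(WEB_LOG_CSV)))


def federated_page_views():
    """Join the two sources at read time, without copying either one."""
    customers = read_customers()
    return [
        {"name": customers[int(view["customer_id"])], "page": view["page"]}
        for view in read_page_views()
    ]
```

A report calling `federated_page_views()` sees one combined result set and does not need to know that the rows came from two different systems.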

Different flavors of Data Virtualization:-

Method 1 - The data sources are connected directly to the data virtualization software, and the BI application accesses the data from the virtualization server.

Method 2 - Existing data warehouses or marts are combined using the data virtualization software, which acts as a collating layer that combines data from two different warehouses or marts.

Method 3 - The data sources are connected directly to the data virtualization software, and the data warehouse gets its data from the virtualization layer, which acts as a single source for all BI reporting and dashboard needs.



Performance of Queries:-  

Performance issues are considered a major roadblock for any data warehousing or BI solution. Because data virtualization adds a layer between the sources and the target, it raises a number of questions about performance. Data virtualization servers improve retrieval performance by using caching and other optimization techniques. The virtualization tools also rewrite queries into simpler forms so that the required data is fetched faster.
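
The caching idea can be illustrated with a small time-based cache in Python. This is only a sketch under assumptions: the `CachedView` class and its `ttl_seconds` parameter are hypothetical names, and real data virtualization servers offer their own cache configuration rather than this code.

```python
import time


class CachedView:
    """Serves rows from memory and goes back to the source only after expiry."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self.fetch = fetch              # function that queries the real source
        self.ttl_seconds = ttl_seconds  # how long cached rows stay valid
        self.clock = clock              # injectable so the cache can be tested
        self._rows = None
        self._loaded_at = None

    def read(self):
        now = self.clock()
        expired = (
            self._loaded_at is None
            or now - self._loaded_at >= self.ttl_seconds
        )
        if expired:
            # Cache miss or stale cache: hit the source and remember the result.
            self._rows = self.fetch()
            self._loaded_at = now
        return self._rows
```

Repeated calls to `read()` within the time window return the stored rows without touching the source. The trade-off is that reports may see data that is up to `ttl_seconds` old.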

The leading data virtualization tools offer the following methods for query optimization:

1. Query Substitution

2. SQL Push down 

3. Distributed Joins 

4. Ship Joins 

5. SQL Override

6. Cache Refresh 

7. Cache Replication
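
To give a feel for one of these methods, the Python sketch below contrasts a query with and without SQL push down, taking the term in its common sense of delegating filtering and aggregation to the source instead of doing that work in the virtualization layer. The `sales` table and both function names are hypothetical, and SQLite stands in for the source database.

```python
import sqlite3

# Hypothetical source database with a small sales table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (region TEXT, amount REAL)")
source.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU", 10.0), ("US", 20.0), ("EU", 5.0)],
)


def total_without_pushdown(region):
    """Fetch every row, then filter and sum in the virtualization layer."""
    rows = source.execute("SELECT region, amount FROM sales").fetchall()
    total = sum(amount for row_region, amount in rows if row_region == region)
    return len(rows), total  # (rows transferred from the source, result)


def total_with_pushdown(region):
    """Send the filter and the aggregation to the source database."""
    rows = source.execute(
        "SELECT SUM(amount) FROM sales WHERE region = ?", (region,)
    ).fetchall()
    return len(rows), rows[0][0]  # (rows transferred from the source, result)
```

Both functions return the same total, but the pushed-down version transfers a single row from the source instead of the whole table. On large tables that difference accounts for most of the gain.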

Summary :-

1. Data virtualization will not replace ETL. However, it helps ETL projects by reducing the time they take to complete.

2. Data virtualization helps ETL projects achieve their ROI more effectively.

3. Data virtualization helps the data management layer combine heterogeneous data sources.

Data virtualization is a boon for the organization. It shortens the time to market for users' change requests and can significantly reduce project development and maintenance costs. It provides a platform for combining various data sources and helps in building a unified data management platform that is scalable, economical, flexible and efficient.

References -
1. Data Virtualization for Business Intelligence Systems by Rick F. van der Lans
