A data lake is the buzzword in the data management world. Everyone talks about implementing a data lake and the best practices for building one.
I would like to put forward my thoughts on implementing a data lake.
A data lake is a large, common storage layer for data from across the enterprise, and it receives data from a wide variety of sources.
Data from multiple source systems is extracted and ingested into the data lake. The data lake is typically built on object storage such as Amazon S3, Azure Blob Storage, ADLS, or Google Cloud Storage. Source systems such as CRM, ERP, flat files, OLTP systems, and other business applications feed into the data lake layer.
Each layer of the data lake is called a zone. A data lake typically contains a Raw Zone, a Processed Zone, and a Reporting Zone.
Raw zone - data is loaded into this layer as is, and it provides an audit trail of the data extracted and loaded into the data lake. Whenever the source system loads need to be audited, the raw zone serves as the source of truth.
Processed zone - data transformation, i.e. applying business rules, data cleanup, data formatting, etc., is done in this layer. The processed zone stores the transformed business data.
Reporting zone - the reporting zone is used for end-user consumption, and the final visualization layers are built on top of it. The consolidated data layer for reporting is also built from this zone.
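To make the zones concrete, here is a minimal sketch of how they might map to prefixes in object storage; the bucket name "my-data-lake" and the path convention are assumptions for illustration, not a prescribed standard.

# Hypothetical zone-to-prefix mapping for an S3-backed data lake.
ZONES = {
    "raw": "s3://my-data-lake/raw",
    "processed": "s3://my-data-lake/processed",
    "reporting": "s3://my-data-lake/reporting",
}

def zone_path(zone: str, source: str, dataset: str, load_date: str) -> str:
    # Build a consistent object path: <zone>/<source>/<dataset>/<load_date>/
    return f"{ZONES[zone]}/{source}/{dataset}/{load_date}/"

print(zone_path("raw", "crm", "accounts", "2024-01-15"))
# -> s3://my-data-lake/raw/crm/accounts/2024-01-15/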
Data scientists and other consumers extract and use data from whichever zone suits their analysis.
Data in its raw format is of little use on its own; it remains an unstructured heap until it is transformed and stored effectively for consumption.
File-based data processing is typically done with Spark (or PySpark), MapReduce code, Hive, or any other query engine.
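As an example of that processing step, here is a hedged PySpark sketch that reads as-is files from the raw zone, applies a few cleanup rules, and writes the result to the processed zone; the paths, column names, and rules are assumptions for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-to-processed").getOrCreate()

# Read the as-is CSV files landed in the raw zone.
raw_df = (spark.read.option("header", True)
          .csv("s3://my-data-lake/raw/crm/accounts/2024-01-15/"))

# Apply simple business rules: trim text, standardize dates, drop bad rows.
processed_df = (raw_df
    .withColumn("account_name", F.trim(F.col("account_name")))
    .withColumn("created_date", F.to_date("created_date", "yyyy-MM-dd"))
    .dropna(subset=["account_id"])
    .dropDuplicates(["account_id"]))

# Store the transformed business data in a columnar format in the processed zone.
processed_df.write.mode("overwrite").parquet("s3://my-data-lake/processed/crm/accounts/")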
Data Ingestion
Data ingestion is the process of loading data from the outside world into S3, which serves as the data lake.
Data can be ingested in different ways: a traditional ETL tool, Python scripts, AWS Kinesis or Kinesis Data Firehose, Databricks on AWS, or Spark code. In an AWS environment, services like Kinesis make ingestion fairly easy, but considerable additional work is still needed to make the ingestion effective.
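As a minimal ingestion sketch, the snippet below uses boto3 to land one extracted file in the raw zone; the bucket, key convention, and file name are assumptions for illustration, and a production pipeline would add retries, checksums, and logging.

import boto3
from datetime import date

s3 = boto3.client("s3")

local_file = "accounts_extract.csv"  # file pulled from the source system
bucket = "my-data-lake"              # assumed data lake bucket
key = f"raw/crm/accounts/{date.today():%Y-%m-%d}/accounts_extract.csv"

s3.upload_file(local_file, bucket, key)
print(f"Ingested {local_file} to s3://{bucket}/{key}")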
Data Storage
After data is ingested into the data lake, a very important step is to organize it suitably in S3. Files should be sorted and stored by time series or by some logical separation; accessing data from S3 is much easier when the files are organized properly.
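A common way to organize the processed zone is date-based partitioning, so downstream queries can prune by date. The sketch below is one possible approach, with the paths and column names assumed for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("organize-storage").getOrCreate()

df = spark.read.parquet("s3://my-data-lake/processed/crm/accounts/")

# Derive partition columns from a date column and write one folder per day.
(df.withColumn("load_year", F.year("created_date"))
   .withColumn("load_month", F.month("created_date"))
   .withColumn("load_day", F.dayofmonth("created_date"))
   .write.mode("overwrite")
   .partitionBy("load_year", "load_month", "load_day")
   .parquet("s3://my-data-lake/processed/crm/accounts_by_day/"))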
Data Grouping
Grouping the data with reporting in mind is very important: data is logically grouped and then used for reporting.
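As an example of logical grouping for reporting, the sketch below aggregates processed order data by business unit and month and stores the result in the reporting zone; the table names and columns are assumptions for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("reporting-grouping").getOrCreate()

orders = spark.read.parquet("s3://my-data-lake/processed/erp/orders/")

# Group orders into a monthly summary per business unit for the reporting layer.
monthly_sales = (orders
    .groupBy("business_unit",
             F.date_format("order_date", "yyyy-MM").alias("order_month"))
    .agg(F.sum("order_amount").alias("total_sales"),
         F.countDistinct("customer_id").alias("unique_customers")))

monthly_sales.write.mode("overwrite").parquet("s3://my-data-lake/reporting/monthly_sales/")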
Data Exploration
Once the data is in the data lake, data scientists explore and analyze it using various tools and programming languages.
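A quick exploration pass might look like the sketch below, which profiles a processed dataset before any deeper analysis; the path and columns are assumptions for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("explore").getOrCreate()

df = spark.read.parquet("s3://my-data-lake/processed/crm/accounts/")

df.printSchema()                 # inspect column names and types
print("row count:", df.count())  # overall volume
df.describe().show()             # basic statistics for numeric columns
df.groupBy("account_status").count().orderBy(F.desc("count")).show()  # value distribution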
Data Quality
Data quality is a very important step in the process. Data quality issues need to be addressed at the source to keep the data clean; trying to fix them outside the source system is often challenging.
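Even when the root cause must be fixed at the source, simple checks in the lake help catch issues early. The sketch below counts null and duplicate business keys after a load, with the dataset, columns, and threshold assumed for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("quality-checks").getOrCreate()

df = spark.read.parquet("s3://my-data-lake/processed/crm/accounts/")

total = df.count()
null_keys = df.filter(F.col("account_id").isNull()).count()
duplicate_keys = total - df.dropDuplicates(["account_id"]).count()

print(f"rows={total}, null keys={null_keys}, duplicate keys={duplicate_keys}")
if total > 0 and null_keys / total > 0.01:
    raise ValueError("More than 1% of rows are missing account_id; check the source feed.")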
Data Governance
Data governance for a data lake is a fairly challenging task. We need to maintain metadata tables, audit configuration, and the identification of invalid records for error processing. A data governance team should be set up to validate data and to help build the process across the company. Regular load monitoring needs to be put in place, and the schedule should be watched appropriately.
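As a sketch of basic governance bookkeeping, the snippet below routes invalid records to an error prefix and appends a one-row audit entry for the load; the paths, columns, and validation rule are assumptions for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("governance-audit").getOrCreate()

df = spark.read.parquet("s3://my-data-lake/processed/crm/accounts/")

# Split valid and invalid records on a simple rule (non-null business key).
valid = df.filter(F.col("account_id").isNotNull())
invalid = df.filter(F.col("account_id").isNull())

# Park invalid records for error processing and review.
invalid.write.mode("append").parquet("s3://my-data-lake/errors/crm/accounts/")

# Append a one-row audit entry describing this load.
audit = (spark.createDataFrame(
            [("crm.accounts", valid.count(), invalid.count())],
            ["dataset", "valid_rows", "invalid_rows"])
         .withColumn("load_time", F.current_timestamp()))

audit.write.mode("append").parquet("s3://my-data-lake/governance/load_audit/")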
Data Archiving
We might need to archive historical data. Older data that is not required for processing should be moved to low-cost storage.
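One way to do this on S3 is a lifecycle rule that transitions old objects to Glacier. The sketch below moves everything under the raw prefix after a year, with the bucket, prefix, and retention period assumed for illustration.

import boto3

s3 = boto3.client("s3")

# Transition objects under raw/ to Glacier once they are a year old.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-after-one-year",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 365, "StorageClass": "GLACIER"}],
            }
        ]
    },
)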
Tools and technologies should be chosen for each of these processes, starting with the data ingestion methodology.
The data lake implementation should be driven by business use cases. How the data lake is used will depend on those use cases, so a few should be defined up front; a use case could relate to one process of a particular business unit.
For example, if we are replacing the storage mechanism of an enterprise data warehouse with a data lake, then that specific problem can be solved: the insights that require data to be massaged and stored in a particular format are identified, which helps in designing the final reporting layer and the data warehouse layer.
If we don't have a specific use case to start the data lake build, it will become a never-ending effort.