Tuesday, 22 January 2019

Data Modelling in Big Data World


Data Modelling -

Data modelling is more of a business-driven approach than a purely technical task.
A data modeller should not only have sound knowledge of the tools and technology but should also understand the business process deeply.
A bus matrix built around the business processes helps to define the entities and their relationships appropriately.

Bus Matrix Diagram
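As a rough illustration, a bus matrix can be represented as a mapping from business processes to the dimensions they touch; the process and dimension names below are hypothetical examples, not taken from any specific system:

```python
from collections import Counter

# A hypothetical bus matrix: each business process mapped to the
# dimensions it uses. Dimensions shared by more than one process are
# "conformed" dimensions and must be modelled consistently.
bus_matrix = {
    "sales_order":     {"date", "customer", "product", "sales_rep"},
    "invoicing":       {"date", "customer", "product"},
    "item_fulfilment": {"date", "product", "warehouse"},
}

# Count how many processes use each dimension.
dim_counts = Counter(d for dims in bus_matrix.values() for d in dims)
conformed = sorted(d for d, n in dim_counts.items() if n > 1)
print(conformed)  # → ['customer', 'date', 'product']
```

Reading the matrix this way makes the conformed dimensions obvious, which is exactly what the diagram is meant to surface.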

We hear from many people that data modelling is obsolete and no longer required in the world of Big Data processing. However, data modelling is still relevant in the big data world. Processing huge volumes of data has become easier thanks to new technologies and tools, but that does not diminish the importance of data modelling. The structure of the database design has a far greater impact on data storage and the longevity of data access.

Data modelling is part of all the different phases of a data management system. Data management systems generally consist of four stages of data processing:
  1. Ingest 
  2. Storage 
  3. Transform 
  4. Presentation
The ideal solution for any project is an explicit separation between these layers with a loosely coupled architecture. Good design provides a robust methodology to handle these different data management processes.
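The loose coupling between the four layers can be sketched as separate functions where each stage depends only on the output of the previous one; the record shapes and stage logic below are illustrative assumptions, not a prescribed interface:

```python
# A minimal sketch of the four loosely coupled layers. Because each
# stage only consumes the previous stage's output, any one of them can
# be swapped out (e.g. a different storage engine) independently.
def ingest(raw_records):
    # Accept raw input as-is; validation and shaping happen downstream.
    return list(raw_records)

def store(records):
    # Stand-in for a durable store, keyed by record id.
    return {r["id"]: r for r in records}

def transform(stored):
    # Shape the stored data for analysis, e.g. compute a derived field.
    return [{"id": k, "total": r["qty"] * r["price"]} for k, r in stored.items()]

def present(rows):
    # Presentation layer: order the results for reporting.
    return sorted(rows, key=lambda r: r["total"], reverse=True)

raw = [{"id": 1, "qty": 2, "price": 5.0}, {"id": 2, "qty": 1, "price": 20.0}]
report = present(transform(store(ingest(raw))))
print(report)  # highest total first
```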

Data modelling should be kept in mind while designing all of these layers. AWS has many services that help in each phase of the data management cycle. S3, with its petabyte-scale storage, is a good fit for the data ingestion layer.

AWS Redshift provides the flexibility required for the transformation and processing engine.
AWS QuickSight can be used for the presentation layer.
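A common pattern with this stack is staging files in S3 and loading them into Redshift with a COPY statement. The sketch below only builds the SQL string; the bucket, table, and IAM role names are placeholders, not real resources:

```python
# Build a Redshift COPY statement to load staged S3 data.
# All resource names below are hypothetical placeholders.
def build_copy_sql(table, bucket, prefix, iam_role):
    s3_path = f"s3://{bucket}/{prefix}"
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS CSV;"
    )

sql = build_copy_sql(
    table="staging.sales_orders",
    bucket="example-ingest-bucket",
    prefix="sales/2019/01/",
    iam_role="arn:aws:iam::123456789012:role/example-redshift-load",
)
print(sql)
```

In practice the statement would be executed against the cluster by your ETL tool or driver of choice.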

A report-driven approach to data modelling gives an effective way to justify the data needs from the OLTP/source systems. It helps narrow down the use cases to be served from the data lake. Table design and reporting-layer design become much easier with this bottom-up approach, which is the design method proposed by Kimball.

 Data modelling involves the following steps:
  1. Source System Analysis
  2. Defining the process flow
  3. Defining the ER model - Logical Data model/Dimensional Model 
  4. Review the ER model/dimensional model with SMEs to identify any gaps in understanding
  5. Defining the Physical Data model and optimizing the model for the presentation layer
  1. Source System Analysis
  • Functional Knowledge of Domain
  • System technical knowledge
  • Access to the source system
  • Data extraction methodologies - identify the optimal method of data extraction
  • Source system documentation - Knowledge transfer along with reverse KT
  • Knowing the business
  • Knowing the business process flow
  • Understand key terminology and Master entities from heterogeneous source systems
  • Determine the Golden source for the master entities
  • Sessions with SME to go over the process and data importance
  • Understand the end user need with respect to data 
  • How the end user is using the data
  • Understand existing data irregularities
  • Keep moving the available data to the next steps; stay agile and flexible to adapt to changes as understanding of the systems and data grows

      2. Defining the process flow: A process can span multiple business units. Process flows could be sales order creation, invoice generation, item fulfilment, payment, etc. Each of these processes may involve data from one or more business units. For example, sales order creation starts from lead generation and runs through opportunity closed/won, which is recorded in the CRM application, while the actual sales order is created in the ERP system. Thus, when data modelling is considered for a DWH, it should focus on the process rather than on a single system. Invariably, data from multiple systems is combined to generate the reports required from the DWH.

    3. a.
Defining the ER model for the process - List all the entities involved in the process and draw the relationships between them. Each relationship should clearly state the cardinality between the entities and whether the relationship is strong or weak.
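A one-to-many relationship with its cardinality enforced can be sketched in SQL; the entity names below (customer, sales_order) are illustrative, and sqlite3 is used here only as a convenient stand-in for a real database:

```python
import sqlite3

# Illustrative one-to-many relationship: one customer places many
# orders. The foreign key captures the cardinality, and the NOT NULL
# constraint makes sales_order existence-dependent on customer
# (the weaker side of the relationship).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE sales_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        amount      REAL NOT NULL
    )
""")
conn.execute("INSERT INTO customer VALUES (1, 'Acme Corp')")
conn.execute("INSERT INTO sales_order VALUES (10, 1, 250.0)")
conn.execute("INSERT INTO sales_order VALUES (11, 1, 75.0)")

# An order referencing a non-existent customer violates the relationship.
orphan_rejected = False
try:
    conn.execute("INSERT INTO sales_order VALUES (12, 999, 10.0)")
except sqlite3.IntegrityError:
    orphan_rejected = True
print("orphan order rejected:", orphan_rejected)
```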

3. b
Defining the dimensional model for the process - The bus matrix gives the data modeller a picture of which dimension and fact tables to consider for dimensional modelling.
The context of analysis becomes the dimensions, and the transactional data becomes the facts. Once the model is ready, it should be reviewed with the customer to finalize it.
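A tiny star schema can make this concrete: dimensions carry the context of analysis, and the fact table carries the transactional measures. The table and column names below are illustrative, and sqlite3 simply stands in for a warehouse engine:

```python
import sqlite3

# A minimal star schema sketch: two dimension tables and one fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        revenue     REAL
    );
    INSERT INTO dim_date    VALUES (20190101, 2019, 1), (20190201, 2019, 2);
    INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO fact_sales  VALUES
        (20190101, 1, 3, 30.0),
        (20190101, 2, 1, 50.0),
        (20190201, 1, 2, 20.0);
""")

# A typical dimensional query: revenue by month and product,
# joining the fact to its dimensions for context.
rows = conn.execute("""
    SELECT d.month, p.name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.month, p.name
    ORDER BY d.month, p.name
""").fetchall()
print(rows)
```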

    4. Review the ER model with the business SME: The next step after the ER/dimensional model is to review it with the business SMEs to identify any gaps. The model should explain all relationships unambiguously to the users, and it should answer all the questions the business is trying to answer.

  5. Defining the physical model and optimizing it for the presentation layer: The physical model may not be an exact replica of the logical ER model. A few entities might be merged into a common table to aid optimization and consolidation. This process should ensure that data integrity is maintained and preserved as we move forward. A few aggregated tables may be created for the presentation layer for better reporting ability.
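Pre-aggregating a detail table into a summary table for the presentation layer can be sketched as follows; the table and column names are illustrative, with sqlite3 again standing in for the warehouse:

```python
import sqlite3

# Sketch of deriving an aggregated presentation-layer table from a
# detail table: one summary row per region instead of one per order.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_detail (region TEXT, order_id INTEGER, revenue REAL);
    INSERT INTO sales_detail VALUES
        ('EAST', 1, 100.0), ('EAST', 2, 50.0), ('WEST', 3, 80.0);

    -- Aggregated table built from the detail for faster reporting.
    CREATE TABLE agg_sales_by_region AS
        SELECT region, COUNT(*) AS order_count, SUM(revenue) AS total_revenue
        FROM sales_detail
        GROUP BY region;
""")
rows = conn.execute(
    "SELECT region, order_count, total_revenue FROM agg_sales_by_region ORDER BY region"
).fetchall()
print(rows)  # → [('EAST', 2, 150.0), ('WEST', 1, 80.0)]
```

In a real warehouse such aggregates would be refreshed on a schedule from the detail table, so reports never recompute the rollup on the fly.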

Documentation of the data model is as important as building it; it gives the support team an understanding of the model.
Versioning the data model along with its documentation lets the team understand what changed as part of each release.

Overall, data modelling is still relevant in the big data world, and there is plenty of scope for good data modellers.



References - The data warehouse bus matrix image is from Ralph Kimball's website. Much of the content summarizes my learnings from various books and websites.





