Data vault modelling

Introduction

When looking to build out a new data lake, one of the most important factors is to establish the warehousing architecture that will be used as the foundation for the data platform. While there are several traditional methodologies to consider when establishing a new data lake (from Inmon and Kimball, for example), one alternative presents a unique opportunity: Data Vault. In this article, I will show basic knowledge of data vault and some pros and cons of it.

Overview of data vault

Data Vault is created by Dan Linstedt and his team at Lockheed Martin in the early 90s. It includes 3 main cores: Hub, Link and Satellite. Hub represents a core business concept, such as they represent Customer Id/Product Number/Vehicle identification number (VIN). Users will use a business key to get information about a Hub. Hubs don’t contain any context data or details about the entity. They only contain the defined business key and a few mandated Data Vault fields. A critical attribute of a Hub is that they contain only one row per key.

Meanwhile, A Link defines the relationship between business keys from two or more Hubs. Just like the Hub, a Link structure contains no contextual information about the entities. There should also be only one row representing the relationship between two entities.

In order to represent a relationship that no longer exists, we would need to create a satellite table off this Link table which would contain an is_deleted flag; this is known as an Effectivity Satellite.

Example of data vault

We can have an example of data vault model here. For example, we have relationship database as example:

Employee table:

employee_id(PK),
job_id(FK),
department_id(FK),
hire_Date,
salary,
first_name,
last_name,
manager_id

Job table:

job_id(PK),
job_name,
min_salary,
max_salary

Department table:

department_id(PK),
department_name,
location_id

Location table:

location_id(PK),
street,
city,
district

From these tables, we will create hub,link and sattelite:

Hub_employee:

HK_employee(hash key of employee_id),
employee_id,
load_dts,
source

Hub_job:

HK_job(hash key of job_id),
job_id,
load_dts,
source

Hub_department:

HK_department(hash key of department_id),
department_id,
load_dts,
source

Hub_location:

HK_location(hash key of location_id),
location_id,
load_dts,
source

Link_employee:

Emp_Job_Dep_HK(hash key of employee_id, job_id, department_id),
employee_id,
job_id,
department_id,
load_dts,
source

Link_department:

Dep_Loc_HK(hash key of department_id, location_id),
department_id,
location_id,
load_dts,
source

HAL_link:

HAL_HK(hash key of manager_id,employee_id),
manager_id,
employee_id,
load_Dts,
source

Sat_employee:

HK_employee(hash key of employee_id),
first_name,
last_name,
email,
phone_number

Sat_location:

HK_location(hash key of location_id),
street,city,
district

Each hub will represent for entity in database. The primary key of each table is hask key of business key. Then we will create dimension and fact table through join between link, hub and sat table.

Pros and cons of data vault

Pros: flexibility, maintainability and scalability both in terms of semantic complexity and sheer volume.

It aims to facilitate the above by introducing three major design principles that set it apart from an EDW based on 3NF or on dimensional modelling:

  • It decouples management of business keys from any attributes of business entities (Hubs/Links and their Satellites).
  • It expects any relationship between business entities within the system to be modelled as many-to-many (a link table is introduced for any such relationship).
  • It assumes that all attributes are maintained in a way, similar to type 2 slowly changing dimension of Kimball-style dimensional modelling.

These design principles allow the system to use MPP for any ETL processes (all entities can be loaded at the same time), facilitate Master Data Management (it offers great flexibility in adding/removing data sources) and provides a robust framework for recording historical data.

Cons:

  • Two to three-fold explosion of the number of tables compared to 3NF modelling
  • Data Vault EDW does not focus on read performance: large number of relationships between tables and complexity of joins often require separate layer of bridging tables to be maintained as materialised views of the data vault

References

sticker #1
Subscribe to Dwarves Memo

Receive the latest updates directly to your inbox.