Thoughts about Record Source in Data Vault Modelling

Today I was at the BDWM course at my customer. The teacher explained to us how we should use the SOURCE SYSTEM COMPONENT within the IBM Banking Data Warehouse Model (BDWM). The resamblence with the Record Source field in a Data Vault Datamodel stroke me.

Both are used, in essence, to provide information about which source system was responsible for delivering the information / data, but only one of them has a 3NF which is right in my (humble) opinion. I am actually saying that the Data Vault is not complete, from how I see it… Let me explain my view on this matter;

IBM’s BDWM contains a SOURCE SYSTEM COMPONENT for each row in, for instance, the Involved Party Identifier entity. The structure for this entity looks much like the structure for a HUB in a Data Vault Model. It contains the natural identifier, the providing source system, the type of identifier, and a datetime stamp on which the row was first loaded. For history reasons, there are also effective and enddate time stamps in this entity.


If you look at the standard elements for a HUB, you see a BUSINESS KEY, LOAD DATETIME STAMP, and a RECORD SOURCE. If we take a closer look at the elements SOURCE SYSTEM COMPONENT and RECORD SOURCE, these are both the delivering / providing / originating source systems. The only difference is that the SOURCE SYSTEM COMPONENT is identified by a surrogate key (OID) in the model and RECORD SOURCE within a DATA VAULT model contains the ‘natural key’ of a source system.

Does the natural key not get translated within the Data Vault as well ?!? No, it seems this natural key, is the only key that is allowed to be Data Vault Model Record Sourceused within the Data Vault datamodel. But doesn’t this RECORD SOURCE, also contain other descriptive information like the name of the source system, location of the source system, owner of the source system, etc., etc. And is the RECORD SOURCE / HUB combination (in this example) not a LINK between two HUBS, one with the business keys for e.g. Product and one with the business key for the source system.

In the figure I visualized the structure which, in my opinion, is the right presentation of the RECORD SOURCE attribute in the HUB. It is a LINK between the natural key of the HUB and the natural key for the source system. Most companies speak about their source systems in abbreviations or names, so it might be called a business keys, and these need to be in a HUB according to the Data Vault Philosophy.

The structure for the LINK might look like; 


Link LineItem ID HUBSequence ID Source RecordSequence ID Load DTS Source RecordLink Sequence ID
1 1 2 2007-11-05 22:30:00 3
2 2 2 2007-11-05 22:30:00 3
3 3 2 2007-11-05 22:30:00 3


The structure for the RECORD SOURCE HUB might look like; 
Source RecordSequence ID SourceName Load DTS Source RecordSequence ID
0 NULL 2007-11-05 22:30:00 1
1 Record Sources 2007-11-05 22:30:00 1
2 HUB source System 2007-11-05 22:30:00 1
3 LinkInfo HUB-Source 2007-11-05 22:30:00 1

Looking at the HUB ‘Record Source’, you see that Source Record Sequence ID = 1, is the source for all record source systems used within the Data Vault. It provides all the information about the RECORD SOURCE systems. In the LINK, there are combinations made between the Record Source and HUB. The record source for the HUB is a single Record Source System, namely HUB Source System (Sequence ID = 2), which contains all the business key information for the HUB. Finally the information about which HUB – Source Records are connected, comes from … this might be visualised during the load of the HUB.

Thinking about the whole idea of using a HUB-LINK-SOURCE RECORD HUB, it might be that SOURCE RECORD is candidate for a REFERENCE TABLE structure, because It is not possible to provide the source for the combination of HUB business keys and Record Sources in a single Source System. This information CAN be derived from the HUB, but is only valid untill the next HUB load is performed. So let’s update our picture then… Or is the picture still valid and does the LINK structure and example need some update…

I think the Record Source field in the HUB should point to the HUB Record Source without a LINK interconnecting these HUBs, it’s a reference table. But I would keep the history, so the Satelite information about the Record Source stays in…

Question;

  • Are there more people thinking about the Record Source field as I do, or am I the only one that tries to translate the business key in a sequence id, just for the smaller size of the record (More records can be record into one block!!!)
You can leave a response, or trackback from your own site.

6 Responses to “Thoughts about Record Source in Data Vault Modelling”

  1. Rob Mol says:

    For keeping the size small it will be enough to have a stand alone table.
    Why keep history? Maybe you want to track the release of the source system, but I doubt this is very useful (the release is more important when tracking the history of LINKS en SATS (also the LOAD DATE will help you to find the right release)).
    Another question: what will you do with the record source for LINKS and SATS (you cannot LINK a HUB Record source to a SAT).
    A last point: what about the load strategy? You want to load the LINKS for the Recordsource while loading the HUBS?
    Greetz,
    Rob

  2. [...] an article about Source System Component (BDWM) and the resamblance with the Record Source (See article). My colleague Rob Mol, asked a few questions which I did never answer, so here’s a post [...]

  3. Dan Linstedt says:

    Walter,

    Good to see you considering such options. I would like to raise the bar of thought in two directions:

    1) Metadata Vault which has been discussed in class, is definately needed. Now, what is a Metadata Vault? There are two flavors: technical metadata, and business metadata – so take your pick. They should be congruent and stored in the same place, however not enough work has been done (yet) to prove that this is a good concept…. Anyhow, on to the point: record sources can be “codes”, and can use reference tables – this is a no-brainer. They can also be “smart keys” and reference information in the Technical MetaVault. This is also a possibility

    2) Metrics Vault – this is something I’ve used for years. It’s technical metadata, but at a load-cycle level. It houses growth metrics, load metrics, and table set metrics. It also houses metadata about the record sources and how to decode them via SQL query.

    Now I’m not sure I understand your statement about record sources in link, and so on. But I must insist that some form of record source be housed in the link – it’s just as important to understand “which system first created the relationship” as it is to know “which system first created the business key”. Of course, the Satellites tell a historical story about the evolution of that relationship and business key over time.

    Hope this helps,
    Dan Linstedt

  4. AlexM says:

    Your blog is interesting!

    Keep up the good work!

  5. Bob says:

    Excelent. Your messages are really interesting. To have a good site you should not only to post smth, but do it interesting. You do your best.

  6. [...] my post about modelling the RECORD SOURCE in a data-vault-like way, in 2007 (See POST), I got a reaction by Rob Mol (see Comments on the POST). I reacted by posting my answers and [...]

Leave a Reply

Powered by WordPress & The Best MLM Companies