
Data Vault 2.0 and referential integrity

+2 votes
I have been using Data Vault modeling (a database modeling method designed to provide long-term storage of data coming from multiple operational systems). Here is my problem: I was adding some additional automation code to help support DV 2.0 and ran into problems concerning referential integrity (RI), the property that requires each foreign key value in one table to correspond to an existing key in the referenced table. In DV 1.0, I've always found RI to be constantly on, except in the very largest systems, and as I understand it the same approach applies in DV 2.0. RI in DV 1.0 is much easier to work with, because the ETL typically handles it through the primary key distribution and lookups. In DV 2.0, however, it's much easier to load your data in parallel. The key distribution bottleneck goes away because you can use the hash key, the foreign key and other elements directly. Just to clear up confusion: the hash key is produced by a hash function, a function that maps data of arbitrary size to data of a fixed size. What this means is that Satellites or Links can be loaded before the Hubs if you so desire, but that naturally conflicts with RI if you have it enabled.
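To make that decoupling concrete, here is a minimal Python sketch; the hashing and normalization rules shown are just illustrative conventions, not a prescribed standard. Because every loader can derive the hash key from the business key alone, a Satellite or Link load never needs a Hub lookup before inserting.

```python
import hashlib

def hash_key(*business_key_parts: str) -> str:
    """Compute a DV2.0-style hash key: MD5 over the normalized business key."""
    # Normalization convention (assumed): trim, upper-case, delimit multi-part keys.
    normalized = ";".join(part.strip().upper() for part in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest().upper()

# A Hub, Link or Satellite loader can each derive the same key independently,
# so none of them has to wait for a Hub lookup before inserting.
customer_hk = hash_key("CUST-4711")
link_customer_order_hk = hash_key("CUST-4711", "ORD-2013-1205")
```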

I personally don't feel comfortable reverting to the "disable before load / enable after load" approach to RI, because I think it adds additional problems once you take the parallel loading into account. With DV 2.0 you really do have the potential to jam up the system easily.

I was thinking about handling RI in a modeled but disabled way (so the optimizer can still use the constraints for efficient SQL plans) and relying on housekeeping jobs instead. Alternatively, I could add the Hub insertions to the templates and processes for Satellites and Links. But I don't like that idea, because it forces you to handle all the data in the same pass. In my opinion, separate templates make a lot more sense.
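As a rough sketch of the "modeled but disabled" idea on Oracle (table names, column names and the database connection are assumptions for illustration), the foreign key could be declared RELY DISABLE NOVALIDATE so it stays visible to the optimizer, while a separate housekeeping query reports Satellite rows whose Hub has not arrived yet:

```python
# Sketch only: assumes an open DB-API connection `conn` to the warehouse
# and hypothetical table/column names.
MODEL_RI_DISABLED = """
    ALTER TABLE sat_customer
      ADD CONSTRAINT fk_sat_customer_hub
      FOREIGN KEY (customer_hk) REFERENCES hub_customer (customer_hk)
      RELY DISABLE NOVALIDATE
"""

FIND_ORPHANS = """
    SELECT s.customer_hk
      FROM sat_customer s
      LEFT JOIN hub_customer h ON h.customer_hk = s.customer_hk
     WHERE h.customer_hk IS NULL
"""

def housekeeping_report(conn):
    """List Satellite rows whose Hub row has not arrived yet."""
    cur = conn.cursor()
    cur.execute(FIND_ORPHANS)
    return cur.fetchall()
```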

What is your opinion? Do you have any experience you could share?
asked Dec 21, 2013 in Relational by J_King (240 points)

3 Answers

+2 votes
Yes, I have actually considered doing this myself. I'm in the right position: my project is practically ready to adopt the hash key structure. I will keep the constraints in the database to a minimum and disabled, and am planning to keep all my loads separated. It's running on Oracle, so I need to check all the constraint options before making a final decision. I know for sure that there are some deferred constraint options available that I need to study to determine whether I can use them instead of the disable/enable approach.
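For illustration, the deferred-constraint option mentioned here could look roughly like the following sketch (table names, bind style and the connection object are assumptions): the foreign key is declared DEFERRABLE, so the RI check is postponed until COMMIT and a Satellite insert may land before its Hub row within the same transaction.

```python
# Sketch of Oracle's deferrable-constraint option; names and the bind style
# are assumptions, not a tested recommendation.
DEFERRABLE_FK = """
    ALTER TABLE sat_customer
      ADD CONSTRAINT fk_sat_customer_hub
      FOREIGN KEY (customer_hk) REFERENCES hub_customer (customer_hk)
      DEFERRABLE INITIALLY DEFERRED
"""

def load_in_any_order(conn, sat_rows, hub_rows):
    """Insert Satellite rows before Hub rows; RI is only checked at COMMIT."""
    cur = conn.cursor()
    cur.executemany("INSERT INTO sat_customer VALUES (:1, :2, :3)", sat_rows)
    cur.executemany("INSERT INTO hub_customer VALUES (:1, :2, :3)", hub_rows)
    conn.commit()  # the deferred foreign key is validated here
```

Note that a deferred check only postpones validation to the end of that one transaction; fully independent parallel sessions would still need a monitoring or housekeeping approach.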
answered Jan 28, 2014 by eve_182 (430 points)
+1 vote
Coming from an "Agile BI" angle, I believe that loading from source to DV should be fully automated, with no cracks or loopholes. It should run as a background process, much like an operating system: always there, always running, but requiring little to no attention. I see it this way: any record fallout that needs additional inspection or manual processing has the same drawback as Kimball's "audit dimension": it cannot be predicted. There is no way of knowing where your data will turn up, or when, which makes managing the work a challenge. Almost before you have written ten lines of script, the fallout area has become another recycling bin. My favorite Data Vault principle is "load all of the data, all of the time". DV 2.0 offers far more opportunity for parallel processing, and I believe its greatest gift to us developers is the ability to decouple sequential dependencies in load processes. It would be a sad thing indeed if we could not find a way to process fall-out with full automation.
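As a toy illustration of decoupling those sequential dependencies (the table list and the load routine are placeholders, not a real template engine), hash keys let Hubs, Links and Satellites be dispatched concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

def load_table(name: str) -> None:
    """Placeholder for a real template-driven load of one Data Vault table."""
    print(f"loading {name}")

# Because every loader derives its own hash keys, Hubs, Links and Satellites
# no longer have to run in a fixed Hub -> Link -> Satellite order.
tables = ["hub_customer", "hub_order", "link_customer_order", "sat_customer"]
with ThreadPoolExecutor() as pool:
    list(pool.map(load_table, tables))
```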
answered Mar 7, 2014 by Mij (940 points)
+1 vote
To my mind, when looking at referential integrity and data quality issues in general, it is important to make a clear distinction between the DV/EDW/Raw Vault, Business Data Vault (BDV) and EDW+/Staging Out layers. It is common for business entity attributes, such as a foreign key, to be derived with "nested if null" logic that picks a value from various satellites fed by a wide variety of source systems. In a case like yours, it is practically impossible to enforce RI rules at the Data Vault layer; you can only do it in the Business Data Vault. Additionally, enforcing RI constraints will not allow you to manage processing based on defined tolerances: with hard constraints an entire file load may fail because of a handful of bad rows, whereas with tolerances you decide how many row failures a given data file may contain before it is rejected.
The first reaction of most ETL/support engineers, when notified that a job failed due to an RI constraint violation, is to go to the ETL tool, add a filter for the problem records and redeploy the job. As you can see, if you do this at the DV or BDV layer, you lose the ability to monitor a large portion of the problem records, and downstream consumers are probably not prepared to lose any of their data.

I believe that referential integrity should not be enforced at the DV or BDV layers, although you should continuously monitor violations. The cost of a solution will vary greatly, depending on the pre-built functionality of your ETL tool as well as your data volumes.
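A monitoring-instead-of-enforcing approach could look roughly like this sketch (table and column names, the tolerance value and the alerting mechanism are all assumptions): count the orphaned rows and alert only when a defined tolerance is exceeded, rather than failing the load outright.

```python
# Tolerance-based RI monitoring instead of hard enforcement; table names,
# the tolerance value and the alerting mechanism are all assumptions.
ORPHAN_COUNT = """
    SELECT COUNT(*)
      FROM link_customer_order l
      LEFT JOIN hub_customer h ON h.customer_hk = l.customer_hk
     WHERE h.customer_hk IS NULL
"""

def check_ri(conn, tolerance: int = 0) -> int:
    cur = conn.cursor()
    cur.execute(ORPHAN_COUNT)
    orphans = cur.fetchone()[0]
    if orphans > tolerance:
        # Notify the load monitoring process instead of failing the whole load.
        print(f"RI warning: {orphans} orphaned Link rows exceed tolerance {tolerance}")
    return orphans
```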

Let me also clear up a common misconception about data volume. The "data quality scorecard" that some vendors offer with their data integration platforms is an excellent tool, but all it really does is transfer all, or a very large chunk, of your data to the ETL server, which makes it nearly impossible to monitor large data sets this way.
answered Apr 14, 2014 by peckheart (340 points)