In my mind, looking at data integrity and data quality issues in general, it is important to make a clear distinction between the Business Data Vault/Raw Vault layers and the EDW/Staging Out layers. It's common for business entity attributes, such as foreign keys, to be calculated with "nested if null" logic, picking their values from various satellites fed by a wide variety of source systems. In your case, it is practically impossible to enforce RI rules at the Data Vault layer; you can only do it in the Business Data Vault. Additionally, enforcing RI constraints will not allow you to manage processing based on defined tolerances. As an example, you may want an entire data file to be rejected only when too many of its rows fail.
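To make the tolerance idea concrete, here is a minimal sketch of a load step that rejects a file only when its failure rate crosses a threshold. The file layout, the `customer_id` check and the 2% budget are illustrative assumptions, not part of any particular tool:

```python
import csv

# Illustrative only: the tolerance, file layout and RI check below are assumptions.
ERROR_TOLERANCE = 0.02  # reject the whole file only if more than 2% of its rows fail


def row_violates_ri(row):
    """Placeholder RI check: the row references a business key we do not know."""
    return not row.get("customer_id")


def load_file(path, good_rows, problem_rows):
    """Load a delimited file, diverting RI failures instead of aborting on the first one."""
    total = failed = 0
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            total += 1
            if row_violates_ri(row):
                failed += 1
                problem_rows.append(row)   # keep the record so it can still be monitored
            else:
                good_rows.append(row)
    if total and failed / total > ERROR_TOLERANCE:
        raise ValueError(f"{path}: {failed} of {total} rows failed RI, above tolerance")
    return total, failed
```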
The first reaction of most ETL/support engineers when they are notified that a job failed due to an RI constraint violation is to go to the ETL tool, add a problem-records filter and redeploy the program. As you can see, if you do this at the DV or BDV layer, you can lose the ability to monitor a large portion of problem records. Downstream consumers are probably not ready to lose any of their data.
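A more defensible reaction than a silent filter is to divert the offending rows to a quarantine set that stays queryable, so nothing disappears and the volume of problems can still be reported. A rough sketch, with the function and field names (`route_record`, `dq_reason`, `dq_load_ts`) invented for illustration:

```python
from datetime import datetime, timezone


def route_record(record, target_rows, quarantine_rows, reason=None):
    """Route a record to the target load set, or to quarantine when it has an RI problem.

    Nothing is dropped: quarantined records keep their full payload plus the reason
    and timestamp, so downstream consumers and DQ reports can still see them.
    """
    if reason is None:
        target_rows.append(record)
    else:
        quarantine_rows.append({
            **record,
            "dq_reason": reason,                       # why the record was diverted
            "dq_load_ts": datetime.now(timezone.utc),  # when it was diverted
        })


# usage sketch
target, quarantine = [], []
route_record({"order_id": 42, "customer_id": None}, target, quarantine,
             reason="customer_id not found in hub_customer")
```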
I believe that referential integrity should not be enforced at the DV or BDV layers, although you should continuously monitor RI problems there. The cost of a monitoring solution will vary greatly, depending on the pre-built functionality of your ETL tool as well as your data volumes.
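As a hedged example of what "monitor, don't enforce" can look like: a periodic job that counts child rows whose hub parent is missing and publishes only the counts. The hub/link/satellite names below are generic placeholders rather than your model, and `sqlite3` stands in for whatever database driver you actually use:

```python
import sqlite3  # stand-in for your actual database driver

# child table -> (foreign key column, parent hub, hub key column); names are illustrative
ORPHAN_CHECKS = {
    "link_order_customer": ("customer_hkey", "hub_customer", "customer_hkey"),
    "sat_order_details":   ("order_hkey",    "hub_order",    "order_hkey"),
}


def count_orphans(conn):
    """Return {table: orphan_count}; only aggregates leave the database, never rows."""
    results = {}
    for child, (fk, parent, pk) in ORPHAN_CHECKS.items():
        sql = (
            f"SELECT COUNT(*) FROM {child} c "
            f"LEFT JOIN {parent} p ON c.{fk} = p.{pk} "
            f"WHERE p.{pk} IS NULL"
        )
        results[child] = conn.execute(sql).fetchone()[0]
    return results
```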
Now, let me clear up a common misconception about data volume. The "data quality scorecard" is an excellent tool that some vendors offer with their data integration platforms, but all it really does is transfer either all or a very large chunk of your data to the ETL server. That makes it nearly impossible to monitor large data sets.
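To make the volume point concrete, compare the two patterns below. The table and column names (`sat_customer_details`, `email`) are made up, `conn` is any DB-API connection, and the second function is only a sketch of the set-based alternative: compute the metric where the data lives and ship back a single number instead of the rows.

```python
def score_by_pulling(conn):
    """Pattern described above: every row travels to the ETL server just to be scored."""
    rows = conn.execute("SELECT email FROM sat_customer_details").fetchall()
    missing = sum(1 for (email,) in rows if email is None)
    return missing / len(rows) if rows else 0.0


def score_by_pushdown(conn):
    """Set-based alternative: the database computes the ratio and returns one value."""
    sql = (
        "SELECT AVG(CASE WHEN email IS NULL THEN 1.0 ELSE 0.0 END) "
        "FROM sat_customer_details"
    )
    return conn.execute(sql).fetchone()[0] or 0.0
```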