settingsLogin | Registersettings
Scalefree
Show Menu

Hub Hash Key

+1 vote
Hi everyone, I have a question about hash keys in DV 2.0. I've seen them being referred to in other DV 2.0 discussions, but I don't understand when and how they should be applied.

Can someone please clarify this for me?

Thanks!
asked Dec 17, 2013 in Implementation by M.Smith (280 points)
recategorized Jan 9, 2015 by molschimke

3 Answers

+1 vote
 
Best answer
Hashes are used to resolve performance issues that crop up when working with large volumes and parallel processing. When dealing with systems that process tens of terabytes per hour or that distribute dependencies across multiple systems, sequences just can't scale to handle the workload. They also tend to cause “hot spots” in MPP systems like Teradata and Hadoop if they're used as the data layout key.

To satisfy the performance and NoSQL requirements of DV 2.0, they are commonly replaced with hashes. Because hashes are used to map the value of the sequence to an index, the hash has to conform to a specific set of rules that make it work in all cases, cross-platform and cross-ETL.
answered Jan 20, 2014 by Will_Lo (560 points)
+3 votes
The normal way to approach this from a purely practical standpoint is to create a DWH key that's mapped to the business key, the BK of course being unique. That sequence is then referenced in other tables, particularly Satellites and Links. When those tables are loaded, you use the BK to retrieve the DWH key value from a Hub and insert the DWH key in the Satellite or Link. Keep in mind that all of this data accessing will lead to heavy resource usage, which causes the MPP performance problems mentioned above.

Instead, in DV 2.0, you can use a hash of the BK in place of the DWH key. This prevents you from having to perform a lookup which avoids the large performance hits and frequent database accesses. Hashes can also be integrated cleanly with NoSQL and Hadoop, although that's an entirely different topic and far too extensive to cover here. There are plenty of hashing functions to choose from, but the one that you decide to go with depends on how much you're willing to risk a chance of collision. For example, some of the most commonly used hashing functions – MD5 and SHA1 – have both had proof-of-concept attacks that proved they were vulnerable to collisions. SHA1 is a bit more secure than MD5, but is still more vulnerable than many modern hashing functions.
answered Feb 23, 2014 by Jack.the.Machine (350 points)
Interesting! I have used the "classical" MD5, SHA1 etc. What modern hashing functions would you use?
+2 votes
Thank you for your answers! From what I understand, hash keys increase parallel processing power by not having to look up business keys in separate tables. This makes sense, but it also raises another question. When loading data into the Satellite, wouldn't you have to perform a lookup to find out if the business key is already in the Hub or Link, since Satellites only contain metadata related to the Hub records? How would I avoid having to do an additional lookup to get that information?
answered Mar 29, 2014 by David77 (300 points)
Scalefree Scalefree

Upcoming Trainings

  • July 01 to 03 - Amsterdam (English) Data Vault 2.0 Boot Camp and Certification

  • August 05 to 07 - Berlin (German) Data Vault 2.0 Boot Camp and Certification

  • September 9 to 13 - Hanover WWDVC EU

  • September 16 to 18 - Brussels (English) Data Vault 2.0 Boot Camp and Certification

  • October 21 to 23 - Hanover (German) Data Vault 2.0 Boot Camp and Certification

  • November 18 to 20 - Amsterdam (English) Data Vault 2.0 Boot Camp and Certification

  • December 02 to 04 - Dusseldorf (German) Data Vault 2.0 Boot Camp and Certification

  • December 16 to 18 - Zurich (English) Data Vault 2.0 Boot Camp and Certification

  • Contact us for inhouse training engagements.
    Visual Data Vault: Logical Modelling for Data Vault - Free Download!
    Scalefree
    DataVault.guru is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
    Permissions beyond the scope of this license are available in our license.

    DataVault.guru | DWH Wiki | Recent questions RSS feed | Imprint | Provided by Michael Olschimke
    ...