
Hash key collisions

+1 vote
Can collisions occur when using the hash key approach in Data Vault 2.0? Research indicates that the chance of a key collision is extremely low (1/2^128 according to "DV 2.0 and Hash Keys"), but given enough data one is still almost guaranteed to happen eventually, especially now that we are in the Big Data era, where use cases with several terabytes of data and a very large range of keys are normal. What happens if a collision actually occurs and causes catastrophic issues in our business logic? We can't really say we are 100% confident in our system when we choose to accept the risk, particularly since we can't provide a course of action if a collision did occur. Are there alternative ways to mitigate this risk?
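For context, the 1/2^128 figure is the chance that two specific keys collide; across many keys the relevant estimate is the birthday bound, roughly k^2/2^(n+1) for k keys and an n-bit hash. A quick sketch of that estimate (plain Python, not part of any DV 2.0 tooling):

```python
import math

def birthday_collision_probability(num_keys: int, hash_bits: int) -> float:
    """Approximate probability of at least one collision among num_keys
    values drawn uniformly from a hash_bits-bit space, using the birthday
    bound p ~= 1 - exp(-k*(k-1) / 2^(n+1))."""
    space = 2 ** hash_bits
    # expm1 keeps precision when the exponent is tiny
    return -math.expm1(-num_keys * (num_keys - 1) / (2 * space))

# Even ten billion business keys barely move the needle for a 128-bit hash:
p = birthday_collision_probability(10_000_000_000, 128)
print(f"{p:.3e}")  # ~1.47e-19
```

So for MD5-sized keys the practical risk is tiny, but as the question notes, "tiny" is not "zero", and that is what the mitigation ideas below are about.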

I have an idea on how to handle this, and I would be interested to know what you think.

1. Add a new field alongside the hash key field that contains a collision sequence. Use the combination of the two fields as the hub's primary key and as the foreign key in links and satellites.
2. Store the business keys in the satellites.
3. While loading the data (following normal DV 2.0 practice), calculate the hash key and set the collision sequence to 0.
4. Add a detection process that runs in the background and scans the satellites for cases where the business key changes for a given hash value.
5. If a collision is detected, insert a new hub record that uses the same hash value but an incremented collision sequence.
6. Go back to the satellite and update the collision sequence to match the new parent of the changed business key value.
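The steps above could be sketched as follows (plain Python with illustrative in-memory structures; the table and field names are my own assumptions, not DV 2.0 standard):

```python
import hashlib

def hash_key(business_key: str) -> str:
    """DV 2.0-style hash of the upcased business key (MD5 here for brevity)."""
    return hashlib.md5(business_key.upper().encode("utf-8")).hexdigest()

# Hub keyed by (hash, collision_seq); satellite rows carry the business key.
hub = {}        # (hash, seq) -> business_key
satellite = []  # rows of {"hash": ..., "seq": ..., "business_key": ...}

def load(business_key: str) -> None:
    """Normal load path: collision sequence defaults to 0 (step 3)."""
    h = hash_key(business_key)
    hub.setdefault((h, 0), business_key)
    satellite.append({"hash": h, "seq": 0, "business_key": business_key})

def detect_and_repair() -> None:
    """Background detection (steps 4-6): if two different business keys
    share a hash, give the newcomer an incremented collision sequence."""
    for row in satellite:
        h, bk = row["hash"], row["business_key"]
        seq = 0
        while (h, seq) in hub and hub[(h, seq)] != bk:
            seq += 1                  # step 5: try the next sequence value
        hub.setdefault((h, seq), bk)  # insert the new hub record if needed
        row["seq"] = seq              # step 6: re-point the satellite row
```

As the question notes, this repairs hubs and satellites but says nothing about re-pointing link records.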

The problem with this solution is that it doesn't address links. I'm worried that I might have to go back to the source systems and manually patch the links to point to the correct parent records based on the original relationships.

If you have any other approaches, I'd be glad to hear them.
asked Nov 22, 2013 in Modelling by TheFlash (260 points)

4 Answers

+2 votes
I think generating a primary key based on a sequence would invalidate one of the main advantages of a hash-key-based PK, which is that it allows the different components to be loaded independently and in parallel. One solution I can think of is a routine, triggered after each load, that checks every hub to see whether different values (business keys) are getting the same hash key. This doesn't solve how to handle a collision once it's detected, but it could speed up the detection process itself.
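A post-load check like this could be as simple as grouping by hash and flagging any hash carrying more than one distinct business key. A minimal Python sketch (in a real warehouse this would be a GROUP BY ... HAVING query against the hub):

```python
from collections import defaultdict

def find_collisions(hub_rows):
    """Map each hash value to the distinct business keys that share it,
    keeping only hashes with more than one key. hub_rows is an iterable
    of (hash_key, business_key) pairs."""
    seen = defaultdict(set)
    for h, bk in hub_rows:
        seen[h].add(bk)
    return {h: keys for h, keys in seen.items() if len(keys) > 1}

rows = [("a1b2", "CUST-1"), ("c3d4", "CUST-2"), ("a1b2", "CUST-99")]
collisions = find_collisions(rows)  # flags 'a1b2' with two business keys
```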
answered Dec 1, 2013 by Franky (280 points)
+3 votes
This is the reason I chose SHA-1. The chance of an issue like this occurring is about the same as, or less than, the chance of a disk sector keeping the same checksum after it goes bad, which is a risk we already accept. There are also plenty of widely used systems, such as Git, that rely on SHA-1, and plenty of people trust them. Sure, there's always a chance of an issue, but consider that while colliding integer keys are trivial to produce, there hasn't yet been a single known SHA-1 string collision.

If it helps you sleep at night, you can build a simple detection job and run it nightly, weekly, or however often you want. The satellite should provide enough context data to identify records that share a hash but have different keys. Then, when you move the data to the business vault or the gold copy/star schema, you take action to resolve the issue. If you want to go over the top, you can add a second hash computed over the reversed string. Except in really bizarre situations, the reversed string will have a different hash than the forward string, and by checking both you should be able to avoid any problem of this nature. Of course, by that point you're doubling the number of keys and might as well just use SHA-512.
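The forward-plus-reversed-string idea can be sketched in a few lines (Python with `hashlib`; a false match would require both digests to collide at once, which compounds the odds):

```python
import hashlib

def double_hash(business_key: str) -> tuple[str, str]:
    """Return SHA-1 digests of the key and of its reversed characters.
    Both must match before two keys are treated as the same record."""
    forward = hashlib.sha1(business_key.encode("utf-8")).hexdigest()
    reverse = hashlib.sha1(business_key[::-1].encode("utf-8")).hexdigest()
    return forward, reverse

f1, r1 = double_hash("CUST-1")
f2, r2 = double_hash("CUST-2")
assert (f1, r1) != (f2, r2)
```

As the answer itself concedes, storing two digests costs about as much as one SHA-512 key, so this is mostly a thought experiment.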
answered Dec 10, 2013 by Harper (630 points)
0 votes

Another thought: use a hash algorithm such as SHA-1 and store the hash keys at full length in the staging area.

In the Data Vault, trim the hash to a smaller length to reduce storage requirements, potentially even starting with a very short length.

Detect hash collisions and prevent the import of colliding records; this then triggers an automated process that increases the length of the hash key in the target table and all dependent tables.

Once the new tables and loading processes are prepared, the loading processes switch over to the new tables with the increased hash key length.

This minimises storage requirements, since the keys are only ever as long as they need to be. It also gives you an early warning when you are running out of SHA-1 length and need to move to a longer base hash algorithm. The disadvantage may be a sudden delay and slowness during the switch-over on larger tables.
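The trim-and-widen idea might look like this (a Python sketch; the starting width, widening step, and in-memory hub are illustrative assumptions):

```python
import hashlib

def trimmed_hash(business_key: str, length: int) -> str:
    """Full SHA-1 lives in staging; the vault keeps the first `length` hex chars."""
    return hashlib.sha1(business_key.encode("utf-8")).hexdigest()[:length]

def load_with_widening(business_keys, start_length=4):
    """Load keys at the current trim length; on a collision, widen the key
    and rebuild the (here: in-memory) hub, as the answer describes."""
    length = start_length
    while True:
        hub = {}
        for bk in business_keys:
            h = trimmed_hash(bk, length)
            if h in hub and hub[h] != bk:
                length += 2  # collision detected: widen and reload from scratch
                break
            hub[h] = bk
        else:
            return hub, length
```

The rebuild loop is exactly the "sudden delay" the answer warns about: every widening reprocesses the whole table.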

answered Nov 21, 2015 by Liam Whales
0 votes

I like this concept; it is by far the best solution I have seen yet to the issue of hash collisions. Hash collision is, in my opinion, a massive issue that cannot simply be left to chance, or avoided by using a more complex (and therefore much larger) hash key, which means more storage and more load on the server. When a collision occurs, what do you do? Continue with all data bar the collided record until you can rework your whole Data Vault to use a new hash key length? What if that record happens to be business critical? I would hate to be supporting a system with such a ticking time bomb.

I would slightly change your approach, though (apologies if I have misread or misunderstood it). I would do the following:

Add a nullable collision-sequence field to the staging table for each hash key.

As a pre-step in updating each hub/link, run a process against the staging data that calculates the collision sequence wherever it is null.

Data with complete keys can then be imported as normal into the relevant hub/link, carrying its collision sequence; any data that is still incomplete would be omitted until the next import.

The only issue I see with this approach is that you cannot guarantee the same collision-sequence order will be generated if the data is reloaded, which ruins a large portion of the benefits of the hash keys.

Unless you used a secondary hashing algorithm to generate a collision hash for those records with hash collisions, and a zeroed collision key where there is no collision. Given the small chance of hitting a collision, this secondary hash could be as small as one byte, which should give you plenty of time to sort out a longer base hash key. I think this would make the process completely repeatable, although all of this is off the top of my head after reading your post.
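The repeatable secondary-hash variant might be sketched like this (Python; the choice of MD5's first byte as the collision hash, and the function names, are my assumptions based on the description):

```python
import hashlib

def primary_hash(business_key: str) -> str:
    """The normal SHA-1 hash key."""
    return hashlib.sha1(business_key.encode("utf-8")).hexdigest()

def collision_hash(business_key: str) -> str:
    """Secondary 1-byte hash (here: first byte of MD5), used only for
    records whose primary hash collides. Because it is derived from the
    business key rather than from load order, reloads are repeatable."""
    return hashlib.md5(business_key.encode("utf-8")).hexdigest()[:2]

def vault_key(business_key: str, colliding: bool) -> tuple[str, str]:
    """Composite key: (primary hash, collision key). Non-colliding
    records get the zeroed collision key '00'."""
    seq = collision_hash(business_key) if colliding else "00"
    return primary_hash(business_key), seq
```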

answered Nov 21, 2015 by Liam Whales
Scalefree is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License. Provided by Michael Olschimke.