
NoSQL Data Source Offers Too Many / Too Weak Business Keys

0 votes
The title says it: the new NoSQL system (from which I build source-mirror DW staging, and around which I intend to design and load a Raw, key-aligned Data Vault 2.0) not only does not enforce referential integrity, but is also itself weak in terms of the placement and validity of its business keys.  This DV will assuredly become the authoritative system of record around which this organization understands its operations and runs the business.  It will quickly become the data repository for operational monitoring, compliance and mission-critical reporting and, of course, analytics.

I see three related challenges:
1. The identifier fields actually define something less than a true business key would identify, and are also unable individually to identify a unique record, so I'll describe them herein as 'helper-identifiers'.  As an example, Customer_ID is not what business stakeholders consider a Customer, but is more aptly defined as "a system user who has initiated one or more of the processes in an intended sequence of website registration, login, card-authentication, offer qualification, offer acceptance, and purchase."  As such, the logic of a 'Customer' business key would be, in its simplest expression, something like 'SELECT Customer_ID ... WHERE AuthStatusCode = 52' (meaning 'completed enough of the sequential process for the business to call it a Customer'); a sketch of that derivation follows this list.  More on this in a moment.

2.  The NoSQL system's equivalent of tables often has a weakly defined granularity, with a large number of what I've described above as 'helper-identifier' columns that "hint" at, or, when used as a composite, get close to being a unique record identifier, but still do not fully achieve uniqueness.  The NoSQL team seem to have piled on many such identifiers not only as "convenient" foreign keys, but also in the hope that they would help define granularity.  As I said before, although they help to approach granularity (as a composite alternate key), they often do not, even as a composite, achieve it.  On top of these identifier fields, I often also see a single PK that does, by itself, define a unique record, but which is then referenced in few, if any, of the tables that should logically be dependent on it and use it as a foreign key.

3.  Nested JSON, flattened into my source-mirror staging tables: although the document is flattened into a single field, it often contains nested levels at a grain finer than the granularity of the staging table itself.  Ouch!
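
To make the first challenge concrete, here is a minimal sketch of the kind of soft-rule derivation I mean; the table name, the column names and the status value 52 are just illustrative placeholders from my description above, not the real source schema:

    -- Hedged sketch: derive the set of 'true' Customers from the helper-identifier,
    -- assuming AuthStatusCode = 52 marks 'completed enough of the sequence to be a Customer'.
    SELECT DISTINCT Customer_ID
    FROM   customer_staging        -- hypothetical source-mirror staging table
    WHERE  AuthStatusCode = 52;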

I regard this as a fairly typical 'real world' NoSQL data situation, wherein a NoSQL platform's inherent ability to relax strict data-integrity rules in the name of high availability and parallel processing allows an undisciplined data model to take root.

I think I know how I want to design my DV2 around this, but I'd like to hear from you people.

If you like, you can critique the approach I'm considering below, or perhaps you've already encountered this and have another method.  To summarize my proposed solution, I'll just say that I might...

Raw Data Vault: (Key aligned, but no soft business rules applied)
* Most source tables look to me like they will become Link Satellites.
* Because of the scenario described above, I foresee few, if any, Satellites hanging directly off Hubs.
* Links will relate many Hubs together according to units of work (a rough structural sketch follows this list).
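
For anyone who wants to picture what I mean, here is a minimal, hypothetical DDL sketch of that shape (hash-keyed Hubs, a unit-of-work Link, and a Link Satellite); every name here is invented for illustration:

    -- Hypothetical illustration only: a unit-of-work Link between two Hubs, with the
    -- descriptive attributes landing in a Link Satellite rather than a Hub Satellite.
    CREATE TABLE hub_customer (
      hub_customer_hk  CHAR(32)  NOT NULL,   -- MD5 of the business key
      customer_id      STRING    NOT NULL,   -- 'helper-identifier' from the NoSQL source
      load_dts         TIMESTAMP NOT NULL,
      record_source    STRING    NOT NULL
    );

    CREATE TABLE link_customer_offer (
      link_customer_offer_hk CHAR(32)  NOT NULL,   -- MD5 of the combined hub keys
      hub_customer_hk        CHAR(32)  NOT NULL,
      hub_offer_hk           CHAR(32)  NOT NULL,
      load_dts               TIMESTAMP NOT NULL,
      record_source          STRING    NOT NULL
    );

    CREATE TABLE lsat_customer_offer (
      link_customer_offer_hk CHAR(32)  NOT NULL,
      load_dts               TIMESTAMP NOT NULL,
      hash_diff              CHAR(32)  NOT NULL,   -- change detection over the payload
      offer_status           STRING,               -- example descriptive attribute
      record_source          STRING    NOT NULL
    );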

Business Data Vault:  
* Before looking at point-in-time and bridge tables for performance, I will likely use various applications of Hubs, Links and Satellites (see Chap. 5 of Dan's & Michael's EXCELLENT DV2 book!) in order to either simplify the DV schema or wrangle it into shape for consistency.  Yes, I know I'm being vague here.
* Here's where I'm still just thinking about what to do: to achieve real business keys, I envision either creating a second set of Hubs with de-duplicated records, or perhaps just employing Same-As Links for this purpose (a rough sketch follows this list).
* JSON parsing and/or normalizing: if I'm going to do it, the Business Vault feels like the place for it.  I suppose an ideal solution is for the underlying platform to support extensive JSON parsing on the fly, so that future changes in the JSON, such as nesting levels going ever deeper, do not break our shiny new data warehouse.
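
As a minimal sketch of the Same-As Link idea (all names here are hypothetical, and the matching rule is purely illustrative), the soft rule that decides two helper-identifiers represent the same real Customer would populate something like this in the Business Vault:

    -- Hypothetical Same-As Link: resolves duplicate Customer hub keys to a chosen master key.
    CREATE TABLE sal_customer (
      sal_customer_hk     CHAR(32)  NOT NULL,   -- MD5 over (master hub key, duplicate hub key)
      hub_customer_hk     CHAR(32)  NOT NULL,   -- the surviving 'master' key
      hub_customer_dup_hk CHAR(32)  NOT NULL,   -- the duplicate that maps to it
      load_dts            TIMESTAMP NOT NULL,
      record_source       STRING    NOT NULL
    );

    -- The matching rule (here a naive exact match on a cleansed e-mail held in a
    -- hypothetical sat_customer) is a soft business rule, applied only in the Business Vault.
    INSERT INTO TABLE sal_customer
    SELECT md5(concat(m.hub_customer_hk, d.hub_customer_hk)),
           m.hub_customer_hk,
           d.hub_customer_hk,
           current_timestamp,
           'BV_CUSTOMER_DEDUP'
    FROM   sat_customer m
    JOIN   sat_customer d
      ON   lower(m.email) = lower(d.email)          -- illustrative matching attribute
     AND   m.hub_customer_hk < d.hub_customer_hk;   -- arbitrary but deterministic master choice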

Once I gain clarity on these upcoming steps (with your help, that is), I also need to figure out how and when to perform the following:
* Migration of legacy EDW data (not DV): how much profiling and business-key discovery must I do NOW on that legacy system before finalizing my Raw DV2 design and initiating loading from the new NoSQL data source?  (A sketch of the kind of profiling I mean follows below.)  This is particularly daunting to me because, from what I'm told, the legacy DW system was built around a set of legacy business rules from an increasingly outdated business model.  As such, I fully expect that the business keys I end up discovering there will be very different from, perhaps even inconsistent with, those I'm working on discovering from the new NoSQL system now, so it's hard to scope these discovery efforts into Agile time-boxes.  Since I've not fully read the Agile-Scrum-DV2 material in Dan's and Michael's new book, perhaps it offers specific guidance on this.  True?
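
For the profiling itself, I'm picturing something like the following per candidate key; the legacy table and column names are placeholders, not the actual schema:

    -- Candidate business-key profiling on a legacy table (names are placeholders).
    -- If total_rows = distinct_keys and null_keys = 0, the candidate uniquely identifies the grain.
    SELECT COUNT(*)                           AS total_rows,
           COUNT(DISTINCT legacy_customer_no) AS distinct_keys,
           SUM(CASE WHEN legacy_customer_no IS NULL THEN 1 ELSE 0 END) AS null_keys
    FROM   legacy_edw.dim_customer;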

Anyway, I hope that some members here have encountered similar scenarios, and I look forward to learning a thing or two.  Thoughts?
asked Jan 29, 2016 in Modelling by danielupton (130 points)

1 Answer

+1 vote
Hi Daniel,
In most cases, this would require me to come on-site and assist with the model design process.  Generally these questions are handled through white-board sessions with the Subject Matter Expert and the implementation team, and at least one business analyst present.  I will do my best to give you "shoot from the hip" reactions / responses, but they will be FAR from what you actually need to implement properly.   Be that as it may, here goes my gut reaction / best guess:

Point a: ALL data (at some point of origination) *must* contain a unique identifier, be it mechanical (surrogate) or business focused (smart key with meaning).  If the data is arriving in NoSQL - or any other Landing Zone or staging area - without the original unique identifier, then it will be necessary to get the business to push back on the SLAs (service level agreements) to enforce a higher quality of data arriving from their service providers.  Without this, NO amount of warehousing on ANY platform can be done (but you know this already).

That said: if the NoSQL team is NOT carrying forward a truly unique identifier for the entire "file" / record / JSON object or XML object, then one must be constructed.  In most cases this equates to some form of Hash Key creation inside Hadoop (a minimal sketch follows below).  But again, see point a above: it *must* ultimately be pushed back up to the business to force a resolution, and even pushed back to the NoSQL team...  This screams of a lack of a Master Data Management initiative, and a lack of good governance at the enterprise level.  It also screams of missing hierarchies for enterprise concepts... an enterprise data alignment problem - most definitely.
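
A minimal sketch of that constructed Hash Key in Hive, assuming the record's identifying columns are col_a, col_b and col_c (placeholders, not your real schema), might look like:

    -- Hedged sketch: construct a surrogate Hash Key where the source carries no unique identifier.
    -- Column and table names are placeholders; the '||' delimiter and upper-casing are one common convention.
    SELECT md5(upper(concat_ws('||',
               coalesce(trim(col_a), ''),
               coalesce(trim(col_b), ''),
               coalesce(trim(col_c), '')))) AS record_hk,
           t.*
    FROM   nosql_stage.some_collection t;   -- hypothetical Hive table over the NoSQL extract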

Regarding a flattened staging area... I wouldn't do that, nope.  Not if you already have a NoSQL / Hadoop instance in place.  I would recommend leveraging Hive internal structures as the stage.  That would be my first stop.  Then, use Sqoop to get the data out - but not in a JSON hierarchical format... no no no, apply the hard rule of normalization (this is taught in my CDVP2 class in great depth).  I would also use Hive internal tables to build out and attach the necessary Hash Keys, then replicate the normalized, structured data IN to the relational DB to enable sub-second query response times.  A sketch of that normalization step follows below.
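
A hedged sketch of that hard-rule normalization in Hive - exploding one level of JSON nesting out to its own grain before the data moves on - assuming a JSON-SerDe staging table whose name and fields are purely illustrative:

    -- Assumption: the staging table was declared with a JSON SerDe so the nested array is typed.
    -- CREATE EXTERNAL TABLE raw_orders (
    --   order_id STRING,
    --   lines    ARRAY<STRUCT<sku:STRING, quantity:INT>>
    -- )
    -- ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    -- LOCATION '/landing/orders';

    SELECT r.order_id,
           line.sku,
           line.quantity
    FROM   raw_orders r
    LATERAL VIEW explode(r.lines) l AS line;   -- one row per order line: the finer grain made explicit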

Now: to try to answer your technical questions...
For the Raw Vault... you are mostly correct in your assumptions, based on what I've read in your post.  To go deeper or beyond would require me to be on-site, work through example data sets, and ask appropriate deep-dive questions.

There is a new evolution coming with regard to Hubs (or rather, I'm returning to best practices for data modeling) that falls in line with the rules and standards.  John Giles discussed this on his blog in some posts titled Universal Data Vault (with my blessing).  He will be presenting at WWDVC 2016 - you won't want to miss his talk.  All that to say: there are ways to consolidate (carefully) some Hubs, and in that situation remove SOME of the Links that might have resulted otherwise.  But again, this is not written into the new book (we ran out of time and space in the binding), though I'm happy to bring these techniques on-site.

Regarding JSON on the fly...  The IDEAL place for what you want to do is Hive & Hadoop: leave the JSON as JSON, build the Hubs & Links directly in Hive, and then join to the JSON Satellites on demand (a rough sketch follows below).  Just one problem with this approach: performance.  Sure - you can try to put all of this in memory (Spark?), but I doubt it will ALL fit.  So, if you need dedicated sub-second query response times, you have no choice (today) but to move that data to a structured RDBMS - which means normalization at some point.  Note: Sanjay Pande will be releasing a new hands-on DV on Hadoop / Hive class for http://LearnDataVault.com very soon that covers some of these things.  I also teach some of these elements, including pros and cons, in the newly revamped CDVP2 course.
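
To illustrate the 'leave the JSON as JSON' option (all names and JSON paths here are assumptions, not your actual model), the on-demand join might look like:

    -- Hedged sketch: Hubs & Links live as structured Hive tables, while the Satellite payload
    -- stays as raw JSON and is parsed only at query time - which is exactly the performance cost.
    SELECT h.customer_id,
           get_json_object(s.json_payload, '$.profile.segment')   AS segment,
           get_json_object(s.json_payload, '$.profile.riskScore') AS risk_score
    FROM   hub_customer h
    JOIN   sat_customer_json s                     -- hypothetical JSON-payload Satellite
      ON   s.hub_customer_hk = h.hub_customer_hk;  -- keyed on the Hub hash key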

Regarding the Legacy Migration... there is a path, and it's an incremental one.  I've done this with my customer Commonwealth Bank in Australia, and with a number of other organizations.  There is too much information to cover here (unfortunately).  All I can say (pressed for time at the moment) is: start small, and don't worry about the legacy business keys UNLESS they can be traced to actual business meaning and, at some point, originated in the actual source system.  If they originated in the legacy DW, then they *might* need to be carried in as simple Satellite attributes rather than Hubs & Links.  Also, when starting small, build new in the DV as requests come in, link to the old DW until the new DV can provide equivalent results, and then - and only then - sunset that piece of the old DW.  Depreciate the asset over time - it lowers the risk for everyone.

Hope this helps,
Dan Linstedt
PS: Feel free to contact me for further engagement
answered Mar 4, 2016 by dlinstedt (380 points)
Thank you for the detailed response, Dan!