The title says it: the new NoSQL system (the source of my source-mirror DW staging) from which I intend to design and load a Raw (key-aligned) Data Vault 2.0 not only fails to enforce referential integrity, but is also weak in the placement and validity of its business keys. This DV will assuredly become the authoritative system of record around which the organization understands its operations and runs the business. It will quickly become the data repository for operational monitoring, compliance, mission-critical reporting and, of course, analytics.
I see three related challenges:
1. The identifier fields actually define something less than a true business key would identify, and individually they cannot identify a unique record, so I'll call them 'helper-identifiers' herein. As an example, Customer_ID is not what business stakeholders consider a Customer, but is more aptly defined as "a system user who has initiated one or more of the processes in an intended sequence of website registration, login, card authentication, offer qualification, offer acceptance, and purchase." As such, the logic of a 'Customer' business key would be, in its simplest expression, something like SELECT Customer_ID ... WHERE AuthStatusCode = 52 (meaning 'completed enough of the sequential process for the business to call them a Customer'). More on this in a moment.
2. The NoSQL system's equivalent of tables often has a weakly defined granularity, with a large number of what I've described above as 'helper-identifier' columns that "hint" at, or, when used as a composite, get close to being a unique record identifier, but still do not fully achieve uniqueness. The NoSQL team seems to have piled on many such identifiers not only as "convenient" foreign keys, but also in the hope that they would help define granularity. As I said before, although they help to approach granularity (as a composite alternate key), they often do not, even as a composite, achieve it. On top of these identifier fields, I often also see a single PK that does, by itself, define a unique record, but which is then referenced by few, if any, of the tables that should logically depend on it and use it as a foreign key.
3. Nested JSON, flattened into my source-mirror staging tables: although each document is flattened into a single field, it often contains nesting levels at a granularity finer than that of the table itself. Ouch!
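To make point 1 concrete, here is a minimal Python sketch of deriving the 'Customer' business key from staged rows. The column names and the status code 52 come from the example above; the row structure is a hypothetical stand-in for real staging data.

```python
# Sketch only: real staging would be a table, not a list of dicts.
COMPLETED_SEQUENCE = 52  # AuthStatusCode: far enough through the process
                         # for the business to call the user a Customer

def derive_customer_keys(staging_rows):
    """Return the distinct Customer_IDs that qualify as real Customers."""
    return sorted({
        row["Customer_ID"]
        for row in staging_rows
        if row.get("AuthStatusCode") == COMPLETED_SEQUENCE
    })

rows = [
    {"Customer_ID": "C001", "AuthStatusCode": 52},
    {"Customer_ID": "C002", "AuthStatusCode": 10},  # registered, never bought
    {"Customer_ID": "C001", "AuthStatusCode": 52},  # duplicate event
]
print(derive_customer_keys(rows))  # ['C001']
```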
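For point 2, one quick way to quantify how close a set of helper-identifiers gets to defining the grain is a uniqueness ratio over candidate composite keys. A hedged sketch, with invented column names:

```python
from collections import Counter

def uniqueness_ratio(rows, columns):
    """Distinct composite values divided by row count;
    1.0 means the columns uniquely identify every record."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return len(counts) / len(rows)

sample = [
    {"order_id": 1, "cust_id": "A", "site": "US"},
    {"order_id": 1, "cust_id": "A", "site": "US"},  # same composite: grain leak
    {"order_id": 2, "cust_id": "B", "site": "US"},
]
# Close to unique, but not 1.0 -- exactly the "approaches granularity" smell:
print(uniqueness_ratio(sample, ["order_id", "cust_id"]))
```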
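Point 3 is easiest to see with a toy document. The field names here are invented, but the shape of the problem is as described: an array nested inside a single staged field, at a finer grain than the row that holds it.

```python
import json

staging_row = {
    "order_id": 42,
    "payload": json.dumps({
        "customer": "C001",
        "items": [                      # finer grain than the order row itself
            {"sku": "X1", "qty": 2},
            {"sku": "Y9", "qty": 1},
        ],
    }),
}

# Exploding the nested array yields one record per item -- the grain a
# child structure in the vault would actually need.
payload = json.loads(staging_row["payload"])
item_rows = [
    {"order_id": staging_row["order_id"], **item}
    for item in payload["items"]
]
print(item_rows)
# [{'order_id': 42, 'sku': 'X1', 'qty': 2}, {'order_id': 42, 'sku': 'Y9', 'qty': 1}]
```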
I regard this as a fairly typical 'real world' NoSQL data situation, wherein a NoSQL platform's inherent ability to relax strict data-integrity rules in the name of high availability and parallel processing allows an undisciplined data model to take root.
I think I know how I want to design my DV2 around this, but I'd like to hear from you people. If you like, you can consider / critique the following approach, or perhaps you've already encountered this and have another method. To summarize my considered solution, I'll just say that I might...
Raw Data Vault (key-aligned, but no soft business rules applied):
* Most source tables look to me like they will become Link Satellites.
* Because of the above-described scenario, I foresee few, if any, Satellites dependent directly on Hubs; Links will relate many Hubs together according to units of work.
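As a sketch of what key alignment looks like mechanically, here is one way to compute DV2-style hash keys for Hubs and a unit-of-work Link. The delimiter, upper-casing, and choice of SHA-256 are my assumptions for illustration, not a standard mandated by the source.

```python
import hashlib

def hash_key(*business_keys, delim="||"):
    """Normalize the business key parts, then hash the concatenation."""
    normalized = delim.join(str(k).strip().upper() for k in business_keys)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

hub_customer_hk = hash_key("C001")       # Hub_Customer
hub_order_hk = hash_key("42")            # Hub_Order
link_hk = hash_key("C001", "42")         # Link tying the two Hubs together

# Normalization makes dirty variants of the same key collide on purpose:
print(hash_key(" c001 ") == hub_customer_hk)  # True
```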
Business Data Vault:
* Before looking at point-in-time and bridge tables for performance, I will likely use various applications of Hubs (see Chap. 5 of Dan's & Michael's EXCELLENT DV2 book!) in order to either simplify the DV schema or wrangle it into shape for consistency. Yes, I know I'm being vague here.
* Here's where I'm still just thinking about what to do: to achieve real business keys, I envision either creating a second set of Hubs with de-duplicated records, or perhaps just employing 'Same-As Links' for this purpose.
* JSON parsing and/or normalizing: if I'm going to do it, the Business Vault feels like the place for it. I suppose an ideal solution is for the underlying platform to support extensive JSON parsing on the fly, so that future changes in the JSON, such as nesting levels going ever deeper, do not break our shiny new data warehouse.
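On the Same-As Link idea: a minimal sketch of how duplicate business keys might be paired with a surviving master. The matching rule here (normalized string equality) is a deliberate placeholder for whatever real survivorship logic applies.

```python
def same_as_pairs(keys):
    """Pair each duplicate key with the first-seen master it matches."""
    masters = {}   # normalized form -> surviving master key
    pairs = []     # (master, duplicate) rows for the Same-As Link
    for key in keys:
        norm = key.strip().upper()
        if norm in masters:
            pairs.append((masters[norm], key))
        else:
            masters[norm] = key
    return pairs

print(same_as_pairs(["C001", "c001 ", "C002"]))  # [('C001', 'c001 ')]
```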
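And on the JSON point: the "don't break when nesting deepens" property usually comes from flattening recursively rather than through hard-coded paths. A generic sketch; the dotted key-path naming convention is my own assumption:

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into dotted key paths, whatever the depth."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{i}."))
    else:
        flat[prefix.rstrip(".")] = obj
    return flat

doc = {"customer": {"id": "C001", "address": {"city": "Oslo"}}}
print(flatten(doc))
# {'customer.id': 'C001', 'customer.address.city': 'Oslo'}
```

Because the recursion handles any depth, a document that later grows another nesting level flattens with the same function and simply yields longer key paths.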
Once I gain clarity on these upcoming steps (with your help, that is), I also need to figure out how and when to perform the following:
* Migration of legacy EDW data (not DV): How much profiling and business-key discovery must I do NOW on that legacy system before finalizing my Raw DV2 design and initiating loading from the new NoSQL data source? This is particularly daunting to me because, from what I'm told, the legacy DW system was built around a set of legacy business rules from an increasingly outdated business model. As such, I fully expect that the business keys I end up discovering there will be very different from, perhaps even inconsistent with, those I'm now discovering in the new NoSQL system, so it's hard to scope these discovery efforts into Agile time-boxes. Since I've not fully read the Agile-Scrum-DV2 material in Dan's and Michael's new book, perhaps it offers specific guidance on this. True?
Anyway, I hope that some members here have encountered similar scenarios, and I look forward to learning a thing or two. Thoughts?