
String data in DV

+1 vote
I'm trying to solve an issue with storing strings in a Data Vault. Based on your experience, what's considered standard practice? Using a generic string data type like NVARCHAR makes sense, but which length should I use specifically? Should I use NVARCHAR(MAX) to account for strings of unknown length? Should I use NVARCHAR(4000), which is the largest non-pointered chunk that can fit on a page? Or maybe NVARCHAR(450), so the strings can still be indexed?

As an example, say you have a hub called VEHICLE and you're using a vehicle identification number (VIN) as the business key. The VIN is usually 17 characters, but it could be longer or shorter for some vehicles (in rare, internal cases). Now suppose that at some point down the line we decide to start using 20 characters for the VIN. Rather than rebuilding the vault, how can we prepare for a situation like that?
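For illustration, a hub along those lines might look like the sketch below (table and column names are made up; the key length of 50 is just an example of building in headroom so a later move from 17 to 20 characters doesn't force a rebuild):

    -- Hypothetical SQL Server DDL; names, lengths and the hash key format are illustrative only.
    CREATE TABLE dbo.HUB_VEHICLE (
        VEHICLE_HK    CHAR(32)      NOT NULL,  -- hash of the business key (e.g. MD5 as hex)
        VIN           NVARCHAR(50)  NOT NULL,  -- business key, sized with headroom beyond 17/20 characters
        LOAD_DATE     DATETIME2     NOT NULL,
        RECORD_SOURCE NVARCHAR(50)  NOT NULL,
        CONSTRAINT PK_HUB_VEHICLE PRIMARY KEY (VEHICLE_HK),
        CONSTRAINT UQ_HUB_VEHICLE_VIN UNIQUE (VIN)
    );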

On a related note, how many people actually mirror data types from the source to the vault (as opposed to leaving all data types as NVARCHAR)?
asked Dec 8, 2013 in Raw Data Vault by Jackhammer (290 points)
recategorized Jan 9, 2015 by molschimke

2 Answers

+1 vote
From my experience, it's done on a case-by-case basis. A lot of factors come into play, including the hardware, platform, architecture and DBMS vendor. For instance, I wouldn't use NVARCHAR(MAX) on SQL Server unless you have a strategy for compression and an implementation timeline. If you take a jack-of-all-trades approach, you'll run into a lot of issues further down the road.

I like to use NVARCHAR(MAX) in error satellites, but in SQL Server I prefer to type my domains appropriately. NVARCHAR(MAX) values can exceed the 900-byte maximum index key length, especially on clustered index fields.
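To make the limit concrete: NVARCHAR stores two bytes per character, so NVARCHAR(450) is the largest length whose full value still fits in a 900-byte index key. A small sketch (table and index names are made up; exact warnings and limits depend on the SQL Server version):

    -- Illustrative only: the long column can exceed the 900-byte index key limit.
    CREATE TABLE dbo.demo_keys (
        short_key NVARCHAR(450)  NOT NULL,  -- 450 * 2 bytes = 900 bytes, still fully indexable
        long_key  NVARCHAR(4000) NOT NULL   -- up to 8000 bytes, over the key-size limit
    );

    CREATE INDEX IX_demo_short ON dbo.demo_keys (short_key);  -- fine
    CREATE INDEX IX_demo_long  ON dbo.demo_keys (long_key);   -- created with a warning; inserts
                                                               -- longer than 900 bytes will fail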
answered Jan 2, 2014 by DV_Peter (320 points)
I ran some lab tests using MS SQL 2014's new CLUSTERED COLUMNSTORE INDEX and found out it uses heavy compression and indexes every column in the table. I created an NVARCHAR(4000) column in a sample software table: not only did it store only the unique values per column, it also compressed a 35GB table with 80M rows down to less than 500MB!

With this, I should be able to default each business key in the hubs and each attribute in the satellites to NVARCHAR(4000) and only handle the rare cases where a string is longer than 4,000 characters. That shouldn't be a problem, since the satellites hold descriptive attributes and strings that long are rare enough not to be an issue.

I get that this could vary for other platforms, especially since clustered columnstore is an enterprise feature, but this should work for us.
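For anyone who wants to try the same thing, the index itself is a one-liner. This is a sketch only (the table name is made up, and it assumes SQL Server 2014 Enterprise Edition, where a clustered columnstore cannot be combined with other indexes or key constraints on the same table):

    -- Rebuild a satellite as a clustered columnstore index (SQL Server 2014+).
    -- NVARCHAR(4000) columns are supported here, while NVARCHAR(MAX) is not (as of 2014).
    CREATE CLUSTERED COLUMNSTORE INDEX CCI_SAT_VEHICLE
        ON dbo.SAT_VEHICLE;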
+1 vote
When dealing with long string fields, you have to consider the basic block size. This might sound old-fashioned, but a lot of clients run on older hardware and need to account for resource usage. You also need to account for keys and indexes, which may be composed of multiple fields. Adding big text fields makes the I/O handling much less efficient. Honestly, using NVARCHAR for data that is pure ASCII is just wasting I/O resources.

We typically reserve NVARCHAR for open-ended text fields (descriptions, etc.). For code fields, it depends on your specific circumstances. NVARCHAR handles Unicode without problems, but it isn't necessary if the source system is older and only delivers ASCII strings. On the other hand, you'll probably want NVARCHAR when the source data contains non-ASCII characters, for example Russian, Chinese or Czech text.
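A quick way to see the difference (a sketch; the exact substitution depends on the column or database collation):

    -- Illustrative: with a Latin1 collation, non-Latin characters stored in VARCHAR
    -- are silently substituted, while NVARCHAR keeps them as UTF-16.
    DECLARE @v VARCHAR(50)  = N'Škoda, Россия, 中国';
    DECLARE @n NVARCHAR(50) = N'Škoda, Россия, 中国';
    SELECT @v AS varchar_value,   -- likely 'Skoda, ??????, ??' depending on collation
           @n AS nvarchar_value;  -- 'Škoda, Россия, 中国'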

What we usually do is pre-define a few standard string lengths (10, 20, 50, 100, 250, 1000 and 4000) and then assign each field to one of them based on its current maximum length. If the observed length is close to the limit, the field is moved up a level.
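A rough sketch of that bucketing rule (the function name and the 20% headroom threshold are assumptions; the standard lengths are the ones listed above):

    -- Hypothetical helper: map an observed maximum field length to the next standard
    -- length, bumping up one level when the value is within 20% of the bucket limit.
    CREATE FUNCTION dbo.fn_standard_length (@max_len INT)
    RETURNS INT
    AS
    BEGIN
        DECLARE @bucket INT;

        SELECT TOP (1) @bucket = std_len
        FROM (VALUES (10), (20), (50), (100), (250), (1000), (4000)) AS b(std_len)
        WHERE b.std_len >= @max_len
        ORDER BY b.std_len;

        -- If the observed length is close to the limit, move up one level.
        IF @max_len > @bucket * 0.8
            SELECT TOP (1) @bucket = std_len
            FROM (VALUES (10), (20), (50), (100), (250), (1000), (4000)) AS b(std_len)
            WHERE b.std_len > @bucket
            ORDER BY b.std_len;

        -- Anything beyond 4000 characters falls back to 4000 and needs manual handling.
        RETURN ISNULL(@bucket, 4000);
    END;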
answered Jan 27, 2014 by ADeWit (550 points)