Our mentor Rushabh B. Mehta has pointed me to an interesting Web site http://www.danlinstedt.com/, where you can read about a modeling technique the author named Data Vault Modeling (DVM).

I have not heard of this technique before. I read all of the five articled on the Data Vault technique, starting with http://www.tdan.com/view-articles/5054/ (you can find further links in the first article). I read slowly (not really thoroughly) articles 1, 2, 3, and 4, and briefly 5 (I am not interested in this comment in the ETL process, which is described in article 5). Ok, let me start with my comments.

I did not like the style of the author. There are many quite strong assertions without a proof. For example, in the basic article about the technique, the author claims that one of the benefits of the DVM is “in-database Data Mining”. I could not find any proof for this statement in further reading. I deal with Data Mining quite a lot. I could not find how DVM could help me with Data Mining; on the contrary, I found the DVM model very unsuitable for Data Mining. In article “Data Vault Series 1 – Data Vault Overview” the author claims “The Data Vault is based on mathematical principles”. I could not find those principles in further reading. Probably they are somewhere; at least I would expect a link.

The basics of the model can be described with three types of tables: Hubs, Links and Satellites. Hubs are the entities, with PKs, business keys and lineage info only. Links are resolution of n-ary relationships between Hubs. Satellites add additional attributes to Hubs and Links. What does this mean in terms of standard Dim and Fact tables? Dim tables are Hubs with one or more Satellites, Fact tables are Links with one or more Satellites. Without much ado, I would say that the only big difference to the dimensional model here is the separation of attributes from entities into separate tables. I do not think this is something new; I have seen many models like this in the past, and the authors did not claim this is something new that would need a patent. Yes, the author wanted to get a patent, and fortunately the patent has been denied as of January 2005. Anyway, to find all info you have in a Fact table, for example, you have to join the Link table with its Satellites. Simplification?

The DVM is focused on one problem mainly. The problem DVM is trying to solve is changes in the dimensional model, namely adding a new dimension and thus changing the grain of some fact tables. The solution is not very impressive. What happens if you add a Hub, i.e. a dimension? The author simply outdates an existing Link table and adds a new one, with additional FK, and then claims it is much easier solution that modifying an existing Fact table in a DW. Well… what happens to the history? Can you still do the analysis over time, comparing the data from two Link tables? I did not find an answer to that question. And, of course, nowadays it is not so painful modifying the grain of the Fact table anymore. You can simply add additional column which allows NULL values (meta-data operation only, so it works in TB databases), and then solve the problem in SSAS with handling the Unknown member. And you have a simple model that gives you all of the history. Finally, you can also model your Star schema the same way, by adding a new Fact table when the level of granularity changes. Therefore, I am not really impressed with this one and only solution the author really provides.

The author solves all of the problems with dimensions with Hub and their Satellite tables, again on a very basic way. For SCDs, simply split the attributes in associated Satellite tables based on their frequency of changes. Well… what is so new and revolutionary with this idea? AFAIK similar solution suggested Ralph Kimball for, how he called at that time, Monster dimensions in late eighties. For merging from multiple sources, the only suggested solution is to insert in Hubs all of the source members. And that’s it. So how do you get a complete story for a single customer? The author admits it would be useful to have a single row for a single customer, however, he also says this would be “out of the scope” of those articles. Oh well, it is so easy to deal with real problems in this way. Another nice simplification, directly copied from one of the texts, one of the basic rules of the “Data Vault” modeling: “Hub business keys never change.” So what if the business key changes? Well… I could not find any answer, it is simply not allowed by the author. Sancta simplicitas!

The author claims the “Data Vault” model is better than 3rd NF in many ways. I could not find a proof for this claim in his text. My conclusion is the author thinks it is better because of the Link tables, which are better than intermediate tables in resolving many-to-many relationships. Well… the author simply does not understand that the Relational model is not limited to binary relationships only. You can safely model relationships of any arity, without breaking not only 3rd, but also 5th NF, as long as you understand that you need to have uniqueness on at least n-1 foreign keys in the n-ary relationship. BTW, the author claims his model uses some mathematics, but I could not find any mathematics in his writings; however, Dr. Terry Halpin really mathematically proved that n-ary relationships are fully normalized if the uniqueness is on at least on n-1 keys.

The author talks all of the time bout surrogate, natural, business, smart keys. The author does not realize we have only keys, without any attribute. And, of course, so-called “smart” keys are usually the most stupid keys you can select, giving you enormous troubles with changes. Hmm… Wasn’t the main purpose of the “Data Vault” model to simplify dealing with changes?

The author spends a lot of time on the join problem. Of course, joins are slow in existing RDBMS products. The author admits that this problem is even worse in his model, as you need additional joins even for Dim and Fact tables. And what are the suggested solutions? I am mentioning just one of the proposals. The author proposes adding a Star schema above the Data Vault schema. Sic! So finally you have OLTP schema, “Data Vault” schema, and the Dimensional schema. So why do you need the intermediate “Data vault” schema? Where is the simplification?

Ok, I am going to stop here. I hope I did not overlook something important in DVM. From what I read, I am far less than impressed.

 

Dejan Sarka