Our Blogs
English
Español
Skip Navigation Links
News
Our Team
Mentoring
Dejan Sarka
Go to home page
Go Search

 
Links
SLODUG

Our Blogs > English > Dejan Sarka
Bleeding Edge Conference

I am very proud I can finally officially announce the Bleeding Edge conference. This is a conference I am preparing together with Matevž Gačnik and Dušan Zupančič. It is a conference prepared by developers for developers. This one-day highly technical event is going to take place on October 1st, 2008, in Portorož, Slovenia. Save the date, or even hurry up with registration!

Data Modeling and Creativity

No doubt data modeling is a process where you can add a lot of creativity. No doubt there is not a single solution for a business problem with relational model. However, does this mean that data modeling is more art than science? How much creativity is right?

I would say that database design is definitely not an art. It is based on science. Of course, you need some creativity. But is creativity limited to artists only? In any profession you need some creativity. However, would you buy a system that your company is going to depend on from an »artist«? A good model has predictable behavior. If the behavior is not predictable, it is not because db designing would be art; it is because the designer does not know her/his profession. I think people abuse word "art" just because they do not know about design enough.

I met so many times very »creative« models. There is a whole branch of modelers I call "inventors"; however, do please note that I mean this in a sarcastic way. It is very simple: if you do not know a lot, you are forced to invent. Interestingly, they mention "theory" a lot as well. Basically, they are "improving" the "theory". Typical examples would be people who invent "XML databases", "OO databases" and similar, and by the way they criticize relational model without really understanding it.

However, do we have to follow the established ways, without using brains? There is another kind of modelers I call "theorists". There are modelers who never use their mind and always follow the authorities in the area, explaining they are following the "theory". Note that they typically mix the real theory, which has foundations in science, mathematics, with words of well-known authorities. For example, types 1, 2, and 3 for handling the SCD problem have nothing to do with any theory; they are just proposals by Ralph Kimball, and since Kimball is an authority, this kind of modelers does not dare to use own brains to try to find a better solution.

So what is the right mix of established ways vs. creativity? First of all, you should always use your brains and common sense. A real intellectual doubts in everything. Do please doubt in words of authorities, scientists, politicians… Try to find better solutions. However, use extreme caution before you claim you invented something. It was probably already invented for thousand times; it is probably just you who does not know anything about it. Do not forget: if you don't know, you are doomed on inventing. Or on calling you an "artist".

Data Vault Modeling Critics

Our mentor Rushabh B. Mehta has pointed me to an interesting Web site http://www.danlinstedt.com/, where you can read about a modeling technique the author named Data Vault Modeling (DVM).

I have not heard of this technique before. I read all of the five articled on the Data Vault technique, starting with http://www.tdan.com/view-articles/5054/ (you can find further links in the first article). I read slowly (not really thoroughly) articles 1, 2, 3, and 4, and briefly 5 (I am not interested in this comment in the ETL process, which is described in article 5). Ok, let me start with my comments.

I did not like the style of the author. There are many quite strong assertions without a proof. For example, in the basic article about the technique, the author claims that one of the benefits of the DVM is "in-database Data Mining". I could not find any proof for this statement in further reading. I deal with Data Mining quite a lot. I could not find how DVM could help me with Data Mining; on the contrary, I found the DVM model very unsuitable for Data Mining. In article "Data Vault Series 1 - Data Vault Overview" the author claims "The Data Vault is based on mathematical principles". I could not find those principles in further reading. Probably they are somewhere; at least I would expect a link.

The basics of the model can be described with three types of tables: Hubs, Links and Satellites. Hubs are the entities, with PKs, business keys and lineage info only. Links are resolution of n-ary relationships between Hubs. Satellites add additional attributes to Hubs and Links. What does this mean in terms of standard Dim and Fact tables? Dim tables are Hubs with one or more Satellites, Fact tables are Links with one or more Satellites. Without much ado, I would say that the only big difference to the dimensional model here is the separation of attributes from entities into separate tables. I do not think this is something new; I have seen many models like this in the past, and the authors did not claim this is something new that would need a patent. Yes, the author wanted to get a patent, and fortunately the patent has been denied as of January 2005. Anyway, to find all info you have in a Fact table, for example, you have to join the Link table with its Satellites. Simplification?

The DVM is focused on one problem mainly. The problem DVM is trying to solve is changes in the dimensional model, namely adding a new dimension and thus changing the grain of some fact tables. The solution is not very impressive. What happens if you add a Hub, i.e. a dimension? The author simply outdates an existing Link table and adds a new one, with additional FK, and then claims it is much easier solution that modifying an existing Fact table in a DW. Well... what happens to the history? Can you still do the analysis over time, comparing the data from two Link tables? I did not find an answer to that question. And, of course, nowadays it is not so painful modifying the grain of the Fact table anymore. You can simply add additional column which allows NULL values (meta-data operation only, so it works in TB databases), and then solve the problem in SSAS with handling the Unknown member. And you have a simple model that gives you all of the history. Finally, you can also model your Star schema the same way, by adding a new Fact table when the level of granularity changes. Therefore, I am not really impressed with this one and only solution the author really provides.

The author solves all of the problems with dimensions with Hub and their Satellite tables, again on a very basic way. For SCDs, simply split the attributes in associated Satellite tables based on their frequency of changes. Well... what is so new and revolutionary with this idea? AFAIK similar solution suggested Ralph Kimball for, how he called at that time, Monster dimensions in late eighties. For merging from multiple sources, the only suggested solution is to insert in Hubs all of the source members. And that's it. So how do you get a complete story for a single customer? The author admits it would be useful to have a single row for a single customer, however, he also says this would be "out of the scope" of those articles. Oh well, it is so easy to deal with real problems in this way. Another nice simplification, directly copied from one of the texts, one of the basic rules of the "Data Vault" modeling: "Hub business keys never change." So what if the business key changes? Well... I could not find any answer, it is simply not allowed by the author. Sancta simplicitas!

The author claims the "Data Vault" model is better than 3rd NF in many ways. I could not find a proof for this claim in his text. My conclusion is the author thinks it is better because of the Link tables, which are better than intermediate tables in resolving many-to-many relationships. Well... the author simply does not understand that the Relational model is not limited to binary relationships only. You can safely model relationships of any arity, without breaking not only 3rd, but also 5th NF, as long as you understand that you need to have uniqueness on at least n-1 foreign keys in the n-ary relationship. BTW, the author claims his model uses some mathematics, but I could not find any mathematics in his writings; however, Dr. Terry Halpin really mathematically proved that n-ary relationships are fully normalized if the uniqueness is on at least on n-1 keys.

The author talks all of the time bout surrogate, natural, business, smart keys. The author does not realize we have only keys, without any attribute. And, of course, so-called "smart" keys are usually the most stupid keys you can select, giving you enormous troubles with changes. Hmm... Wasn't the main purpose of the "Data Vault" model to simplify dealing with changes?

The author spends a lot of time on the join problem. Of course, joins are slow in existing RDBMS products. The author admits that this problem is even worse in his model, as you need additional joins even for Dim and Fact tables. And what are the suggested solutions? I am mentioning just one of the proposals. The author proposes adding a Star schema above the Data Vault schema. Sic! So finally you have OLTP schema, "Data Vault" schema, and the Dimensional schema. So why do you need the intermediate "Data vault" schema? Where is the simplification?

Ok, I am going to stop here. I hope I did not overlook something important in DVM. From what I read, I am far less than impressed.

Data Warehousing in SQL Server 2008 Presentation

 

I am also uploading the What's New in SQL Server 2008 Data Warehousing presentation from the 13th NT Conference. Again, the demos are zipped and the zip file is renamed in order to be able to upload it.

Mastering Data Quality with SQL Server 2008 Presentation

I promised to upload the Data Quality presentation and demos on both conferences, 13th NTK in Slovenia and TechEd Developers in Orlando, FL. You can find the presentation here and demos here. Note that demos are zipped and the zip file extension is changed to .txt in order to e able to upload it to the SPPS site.

13th NT Conference

As usual, I am posting a short comment on our most important IT event of a year.

With more than 2,000 attendees, this was again an incredible event for such a small country. I have to congratulate local MS subsidiary again; IMO they are the best event organizers in the world.

The conference itself has changed substantially. It is not strictly technical anymore. About 30% of presentations were aimed on soft skills. It seems that this fact did not please technical attendees; however, the soft skills presentations were evaluated very well, and therefore obviously exists the other audience. Apparently MS Slovenia noticed the trends and reacted accordingly. Nevertheless, this fact also opens space for another, smaller and more technical conference in Slovenia.

I am not talking about technical conference without a reason. Together with MS Slovenia, Dušan Zupančič and Matevž Gačnik we are preparing another high-end developer's event called Bleeding Edge on October 1st. Developers, database developers and advanced DBAs, stay tuned and save the date!

Basketball, Beer and Age

These days I am in Seattle, attending MVP Summit. Everything is fine, works, and I am glad to be here, with Dev team and friends. However, a funny thing happened to me on Sunday; it was one of those things Europeans have hard time to understand. I had opportunity to experience a person to follow the rules so strictly it becomes senseless.

On Sunday, I watched Supersonics basketball game, together with Fernando, his daughter, and Salvador. During the break, we wanted to have a beer. The woman that was selling it asked me for a picture id. I gave her my personal card. Apparently, she was confused, because it was not from USA. She asked me for passport, which, of course, I did not have with me. Well, without passport, she simply did not want to sell a beer to me. However, since Fernando's daughter had her passport, she was allowed to buy a beer, although she is about 25 years younger than me. Even more, she could by two beers, and after we moved around the corner in order to get out of sight of the rule-enforcing dragon, I finally got my beerJ

SQL Server 2008 Launch in Slovenia

On March 19th, we had a joined launch of Windows Server 2008, SQL Server 2008 and Visual Studio 2008 in Portorož, Slovenia. I had a presentation on Business Intelligence in SQL Server 2008. I promised I would post slides and demos in my blog. However, I got information from Microsoft Slovenia that they are going to prepare a conference CD and send it to all attendees. Since this was a Microsoft conference, they own the content. Therefore, I am not going to post my presentation separately; if you attended the conference, you should receive the CD quite soon.

Is MDX Language Underestimated?

I am somehow surprised on the small number of BI developers and BI DBAs with MDX knowledge. I have a feeling that this language is underestimated, or may I say missunderestimatedJ I know most of the OLAP analyses are done graphically, with client tools. However, there are many business questions and problems you simply cannot resolve graphically. This is the place where MDX strength comes in, with queries and expressions. Let me enlist couple of problems:

  • You need a report for top 10 customers, and then for each customer top 5 products purchased.
  • You have financial cube, with balance measure showing outstanding customers' debts. Balance is a semi-additive measure, so you use LastNonEmpty aggregate function. However, most of customers have balance zero, so your aggregate shows zero, unless you find a specific customer with outstanding debt. Drilling down for each customer separately is tedious and far from regular OLAP experience.
  • You need to create a KPI. For any KPI, the trend value usually includes comparison of current with some historical data. How do you create a KPI?

The solution for all three problems is MDX. For the first problem, you can create a MDX query, i.e. a MDX SELECT statement. For the second one, you can create MDX expression to find only customers with outstanding debts and add it as a named set to your cube. With this named set, querying is simple again, and possible with graphical client tools like Excel 2007. For the third problem, you need, of course, a MDX expression that uses MDX functions that find historical members, and compare their value to the value of the current member.

All together, I think it is time for advanced BI implementers to learn MDX.

Logistic Regression Scorecard

In just couple of days, I have been asked twice about the same problem – how to get the Logistic Regression formula from the mining model content. Terefore, it makes sense to write a blog about itJ

Well, you can use a hard and a simple solution :-)

   

A hard solution would be to go through quite complicated process. You could create a linked server from SQL Server to Analysis Services, and then use T-SQL OPENQUERY function to browse the content of the model with DMX query inside. You would have to browse the model three times - to get output nodes, input nodes, and input attributes distribution. Then you need to calculate Z-scores from input attributes distribution to fit them in the formula. Then you could join all of the results in a final query. Something as this (pseudo code):

   

WITH   -- we need four CTEs

 OutputNodes AS      -- Output nodes CTE

  (SELECT col_list     -- T-SQL Select

      FROM OPENQUERY(linkedAS,

            'SELECT FLATTENED col_list FROM model.CONTENT    -- DMX Select to get output nodes
              WHERE NODE_TYPE=23')),

 InputNodes AS        -- Input nodes CTE

  (SELECT col_list     -- T-SQL Select

      FROM OPENQUERY(linkedAS,

            'SELECT FLATTENED col_list FROM model.CONTENT    -- DMX Select to get input nodes
              WHERE NODE_TYPE=21')),

 AttrDistribution AS   -- Input attributes distribution

  (SELECT col_list     -- T-SQL Select

      FROM OPENQUERY(linkedAS,

            'SELECT FLATTENED NODE_DISTRIBUTION        -- DMX Select to get input attributes distribution

               FROM model.CONTENT
              WHERE NODE_TYPE=24')),

 AttrZScore AS

  (SELECT col_list    -- This is pure T-SQL on previous CTE - calculate Z-scores here

     FROM AttrDistribution)

SELECT col_list       -- final T-SQL Select to put everything together

  FROM OutputNodes

       INNER JOIN InputNodes

        ON condition

       INNER JOIN AttrZScores

        ON condition;

 

The easy way would be to use something already preparedJ The MS Data Mining team has already prepared additional AS CLR stored procedure, which does all of this calculations for you. You can freely download it from SqlServerDataMining.com, if you register on the site. Besides the procedure you need, you can download also a prepared SSRS report - scorecard for logistic regression models. The link to the article is this: http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=102&Id=41. On the
page, you will find the link to download the source code (search for "The download package is available here and it contains a stored procedure that:". In the article, you will find detailed instructions how to build and deploy the procedure.

1 - 10 Next

 ‭(Hidden)‬ Admin Links