These days I am in Seattle, attending MVP Summit. Everything is fine, works, and I am glad to be here, with Dev team and friends. However, a funny thing happened to me on Sunday; it was one of those things Europeans have hard time to understand. I had opportunity to experience a person to follow the rules so strictly it becomes senseless.
On Sunday, I watched Supersonics basketball game, together with Fernando, his daughter, and Salvador. During the break, we wanted to have a beer. The woman that was selling it asked me for a picture id. I gave her my personal card. Apparently, she was confused, because it was not from USA. She asked me for passport, which, of course, I did not have with me. Well, without passport, she simply did not want to sell a beer to me. However, since Fernando's daughter had her passport, she was allowed to buy a beer, although she is about 25 years younger than me. Even more, she could by two beers, and after we moved around the corner in order to get out of sight of the rule-enforcing dragon, I finally got my beerJ
On March 19th, we had a joined launch of Windows Server 2008, SQL Server 2008 and Visual Studio 2008 in Portorož, Slovenia. I had a presentation on Business Intelligence in SQL Server 2008. I promised I would post slides and demos in my blog. However, I got information from Microsoft Slovenia that they are going to prepare a conference CD and send it to all attendees. Since this was a Microsoft conference, they own the content. Therefore, I am not going to post my presentation separately; if you attended the conference, you should receive the CD quite soon.
I am somehow surprised on the small number of BI developers and BI DBAs with MDX knowledge. I have a feeling that this language is underestimated, or may I say missunderestimatedJ I know most of the OLAP analyses are done graphically, with client tools. However, there are many business questions and problems you simply cannot resolve graphically. This is the place where MDX strength comes in, with queries and expressions. Let me enlist couple of problems:
The solution for all three problems is MDX. For the first problem, you can create a MDX query, i.e. a MDX SELECT statement. For the second one, you can create MDX expression to find only customers with outstanding debts and add it as a named set to your cube. With this named set, querying is simple again, and possible with graphical client tools like Excel 2007. For the third problem, you need, of course, a MDX expression that uses MDX functions that find historical members, and compare their value to the value of the current member.
All together, I think it is time for advanced BI implementers to learn MDX.
In just couple of days, I have been asked twice about the same problem – how to get the Logistic Regression formula from the mining model content. Terefore, it makes sense to write a blog about itJ
Well, you can use a hard and a simple solution :-)
A hard solution would be to go through quite complicated process. You could create a linked server from SQL Server to Analysis Services, and then use T-SQL OPENQUERY function to browse the content of the model with DMX query inside. You would have to browse the model three times - to get output nodes, input nodes, and input attributes distribution. Then you need to calculate Z-scores from input attributes distribution to fit them in the formula. Then you could join all of the results in a final query. Something as this (pseudo code):
WITH -- we need four CTEs
OutputNodes AS -- Output nodes CTE
(SELECT col_list -- T-SQL Select
FROM OPENQUERY(linkedAS,
'SELECT FLATTENED col_list FROM model.CONTENT -- DMX Select to get output nodes WHERE NODE_TYPE=23')),
InputNodes AS -- Input nodes CTE
'SELECT FLATTENED col_list FROM model.CONTENT -- DMX Select to get input nodes WHERE NODE_TYPE=21')),
AttrDistribution AS -- Input attributes distribution
'SELECT FLATTENED NODE_DISTRIBUTION -- DMX Select to get input attributes distribution
FROM model.CONTENT WHERE NODE_TYPE=24')),
AttrZScore AS
(SELECT col_list -- This is pure T-SQL on previous CTE - calculate Z-scores here
FROM AttrDistribution)
SELECT col_list -- final T-SQL Select to put everything together
FROM OutputNodes
INNER JOIN InputNodes
ON condition
INNER JOIN AttrZScores
ON condition;
The easy way would be to use something already preparedJ The MS Data Mining team has already prepared additional AS CLR stored procedure, which does all of this calculations for you. You can freely download it from SqlServerDataMining.com, if you register on the site. Besides the procedure you need, you can download also a prepared SSRS report - scorecard for logistic regression models. The link to the article is this: http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=102&Id=41. On the page, you will find the link to download the source code (search for "The download package is available here and it contains a stored procedure that:". In the article, you will find detailed instructions how to build and deploy the procedure.
With this post, I am starting a new topic in my blogs. I am pretty sure that this topic is going to be very interesting in nearby future. I am talking about data quality.
I am dealing with IT solutions for more than 20 years. I have seen many different solutions, and many different customer databases. I am focusing on BI solutions, especially Data Mining and OLAP. I am experiencing the same problem over and over again: garbage in, garbage out. In a BI project, you typically spend about 70% or even more time with ETL process, especially with over viewing, inspecting, cleansing, and merging data. Most of my customers did not even know how much garbage they had in their data.
However, the data is the key asset of a company. The asset is not in buildings or machines. People are valuable asset; nevertheless, people change companies, jobs, careers, or even die. Data is the asset that remains in a company. In data, the complete knowledge of a company is hidden. Therefore, data quality is a crucial issue.
I plan to write articles on data quality. I will describe data quality dimensions and models. I will show methods you can use to measure data quality, to inspect data, to measure the amount of information hidden in the data and more, with code supported by SQL Server 2008. Of course, I hope this is not just a promise; I will try to find time to accomplish what I started with this post.
I am just reminding all developers on the European Silverlight Challenge. You can get really nice awards. For Slovenia, S.Q.M. is one of the sponsors. You can find details about Slovenian challenge here.
It seems that my »Beyond Relational« blog achieved quite good attention, thanks to Paul Nielsen. You can find interesting discussion at Paul's blog site. I will continue to follow the discussion there.
I have to point out another nonsense terminology. Unfortunately this time it is in Books OnLine, from Microsoft, talking about catalog views. BOL use terminology »inheritance« for relationship between catalog views. For example, if you check the sys.tables catalog view in BOL, you will find description of columns of the view:
The description starts with »inherited columns«.
Let us check the definition of the inheritance in OOP. For example, in MSDN, Visual basic Language Concepts, you can read this definition: »Derived classes inherit, and can extend, the properties, methods, and events of the base class«. Or, In C# Programming Guide, you can read: »The new class—the derived class—then gains all the non-private data and behavior of the base class in addition to any other data or behaviors it defines for itself«.
Inheritance means gaining knowledge, i.e. operations of the base class, not private data (data of an object should be private anyway in a good OOP). Now, what kind of operations do we have defined on views? And do views hide any data, do they have any private members? Let us check how this »inheritance« is implemented physically. With
EXEC sp_helptext 'sys.objects';
EXEC sp_helptext 'sys.tables';
You get the following definitions:
CREATE VIEW sys.objects AS
SELECT name, object_id, principal_id, schema_id, parent_object_id,
type, type_desc, create_date, modify_date,
is_ms_shipped, is_published, is_schema_published
FROM sys.objects$
and
CREATE VIEW sys.tables AS
SELECT o.name, o.object_id, o.principal_id, o.schema_id, o.parent_object_id,
o.type, o.type_desc, o.create_date, o.modify_date,
o.is_ms_shipped, o.is_published, o.is_schema_published,
lob.lobds AS lob_data_space_id,
rfs.indepid AS filestream_data_space_id,
o.property AS max_column_id_used,
o.lock_on_bulk_load, o.uses_ansi_nulls, o.is_replicated, o.has_replication_filter,
o.is_merge_published, o.is_sync_tran_subscribed, o.has_unchecked_assembly_data,
lob.intprop AS text_in_row_limit,
o.large_value_types_out_of_row
FROM sys.objects$ o
LEFT JOIN sys.sysidxstats lob ON lob.id = o.object_id AND lob.indid <= 1
LEFT JOIN sys.syssingleobjrefs rfs ON rfs.depid = o.object_id AND rfs.class = 42 AND rfs.depsubid = 0 -- SRC_OBJTOFSDS
WHERE o.type = 'U'
Therefore, »inheritance« is implemented as copy and paste of column names from the columns list (SELECT part)! This is a new definition of the »inheritance« for me. I would appreciate if the author would write more about this type of »inheritance«; it seems like thousands of developers worldwide have to re-learn OOP basics.
Well, this was, of course, sarcasm. This is just another example of incorrect usage of terminology. Unifying object and relational world is a big fad nowadays. It seems like anybody who has five minutes time wants to catch that train, and uses object terminology for relational databases without having basic knowledge and understanding of either, relational model and object-oriented programming.
I've seen couple of presentations about SQL Server 2008. I noticed that many speakers use term "beyond relational" when they are describing the new data types that SQL Server 2008 supports, especially the spatial and filestream types. I am quite surprised with this term. So far, I did not realize the relational model is data type dependent. I would definitely like to get from the people that use this expression the list of the "relational" data types. It seems like datetime types are relational and spatial types are not relational for them. I guess we could call them "Zweistein". Einstein actually unified space and time, and they managed to separate them againJ It also seems like the filestream type bothers them, because the values of this type are stored differently than values of other types. Well, I did not realize the relational model depends on physical storage as well.
Of course, this was a bit sarcastic. I would simply say that the people that use the "beyond relational" expression do not know exactly what they are talking about. Relational model is neither data type nor physical storage dependent. I would really appreciate if people who are talking about SQL Server would have some knowledge about relational model. Or maybe is this a fad, going beyond relational, because the relational model is "obsolete"?
I just realized I did not publish yet the Open Schema presentation I had in Denver in the PASS Summit 2007 conference. It is based on the presentation we had together with Miha Valencic last year on the NT conference. I am using this opportunity to thank Miha again; it was a wonderful experience working with him. You can find the materials here.