Contingency tables are used to examine the relationship between subjects’ scores on two qualitative or categorical variables. They show the actual and expected distributions of cases in a cross-tabulated (pivoted) format for the two variables. Here is an example of the actual and expected distribution of cases over the Gender column (on rows) and the MaritalStatus column (on columns) of the dbo.vTargetMail view from the AdventureWorksDW2012 demo database:

Actual
Gender   Married   Single   Total
F           4745     4388    9133
M           5266     4085    9351
Total      10011     8473   18484

Expected
Gender   Married   Single   Total
F           4946     4187    9133
M           5065     4286    9351
Total      10011     8473   18484
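Each expected frequency is simply the row total times the column total, divided by the grand total. As a quick sanity check, the following Python sketch (illustrative only, not part of the SQL solution below) reproduces the Expected table from the Actual one:

```python
# Expected frequency = row total * column total / grand total,
# computed from the Actual table above (Gender x MaritalStatus).
observed = {("F", "Married"): 4745, ("F", "Single"): 4388,
            ("M", "Married"): 5266, ("M", "Single"): 4085}

row_totals = {"F": 4745 + 4388, "M": 5266 + 4085}             # 9133, 9351
col_totals = {"Married": 4745 + 5266, "Single": 4388 + 4085}  # 10011, 8473
grand_total = sum(observed.values())                          # 18484

expected = {(r, c): round(row_totals[r] * col_totals[c] / grand_total)
            for (r, c) in observed}
# expected -> F/Married 4946, F/Single 4187, M/Married 5065, M/Single 4286
```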

If the columns are not contingent on the rows, then the row and column frequencies are independent. The test of whether the columns are contingent on the rows is called the chi-square test of independence. The null hypothesis is that there is no relationship between row and column frequencies; therefore, there should be no difference between the observed and expected frequencies. Contingency tables are the basis for the chi-square test. However, even without the test, you might notice a relationship between two discrete variables just by looking at the contingency table.
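For illustration, here is the textbook chi-square computation applied to the Gender and MaritalStatus table above, as a Python sketch (my own illustration, not part of the SQL solution that follows). The statistic sums (observed − expected)² / expected over all cells:

```python
# Chi-square statistic for the Gender x MaritalStatus table above:
# chi2 = sum over all cells of (observed - expected)^2 / expected.
observed = {("F", "Married"): 4745, ("F", "Single"): 4388,
            ("M", "Married"): 5266, ("M", "Single"): 4085}

row_totals = {"F": 9133, "M": 9351}
col_totals = {"Married": 10011, "Single": 8473}
grand_total = 18484

chi2 = sum((obs - row_totals[r] * col_totals[c] / grand_total) ** 2
           / (row_totals[r] * col_totals[c] / grand_total)
           for (r, c), obs in observed.items())
# With (2 - 1) * (2 - 1) = 1 degree of freedom, a chi2 of about 35 is far
# beyond the 0.05 critical value of 3.84, so Gender and MaritalStatus are
# not independent in this data.
```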

Contingency tables are very simple to interpret, and therefore they are a very popular tool in statistics and data mining. Wouldn’t it be nice to be able to create such a contingency table, checking the association between the NumberCarsOwned and BikeBuyer columns from dbo.vTargetMail, as the following screenshot shows, with SQL Server tools?

Of course, I would not be writing this blog post if I did not have a solution. The screenshot above is actually a screenshot of a SQL Server 2012 Reporting Services (SSRS) report. The whole story starts with the following query.

WITH ObservedCombination_CTE AS
(
SELECT NumberCarsOwned AS OnRows,
       BikeBuyer AS OnCols,
       COUNT(*) AS ObservedCombination
FROM dbo.vTargetMail
GROUP BY NumberCarsOwned, BikeBuyer
)
SELECT OnRows, OnCols, ObservedCombination,
       SUM(ObservedCombination) OVER (PARTITION BY OnRows) AS ObservedOnRows,
       SUM(ObservedCombination) OVER (PARTITION BY OnCols) AS ObservedOnCols,
       SUM(ObservedCombination) OVER () AS ObservedTotal,
       CAST(ROUND(SUM(1.0 * ObservedCombination) OVER (PARTITION BY OnRows)
            * SUM(1.0 * ObservedCombination) OVER (PARTITION BY OnCols)
            / SUM(1.0 * ObservedCombination) OVER (), 0) AS INT) AS ExpectedCombination,
       CAST(ROUND(100.0 * ObservedCombination
            / SUM(ObservedCombination) OVER (), 0) AS INT) AS PctTotal,
       REPLICATE('*',
            CAST(ROUND(100.0 * ObservedCombination
                 / SUM(ObservedCombination) OVER (), 0) AS INT)) AS Histogram
FROM ObservedCombination_CTE
ORDER BY OnRows, OnCols;

Note that the query uses the window functions that are new in SQL Server 2012. The result of this query is not pivoted yet; however, it contains all of the data needed.
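If you want to verify the windowed arithmetic outside SQL Server, the following Python sketch (my own illustration, using the grouped counts from the result set) recomputes ExpectedCombination, PctTotal, and Histogram the same way the query does:

```python
# ObservedCombination counts for (NumberCarsOwned, BikeBuyer),
# as returned by the GROUP BY in the CTE.
counts = {(0, 0): 1551, (0, 1): 2687, (1, 0): 2187, (1, 1): 2696,
          (2, 0): 3868, (2, 1): 2589, (3, 0): 951, (3, 1): 694,
          (4, 0): 795, (4, 1): 466}

# These sums mirror SUM(...) OVER (PARTITION BY OnRows / OnCols) and
# SUM(...) OVER () in the query.
total = sum(counts.values())  # 18484
rows = {r: sum(v for (rr, c), v in counts.items() if rr == r) for r in range(5)}
cols = {c: sum(v for (r, cc), v in counts.items() if cc == c) for c in (0, 1)}

# Same formulas as the CAST(ROUND(...)) expressions in the query.
expected = {(r, c): round(rows[r] * cols[c] / total) for (r, c) in counts}
pct_total = {(r, c): round(100 * v / total) for (r, c), v in counts.items()}
histogram = {(r, c): "*" * pct_total[(r, c)] for (r, c) in counts}
```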

OnRows  OnCols  ObservedCombination  ObservedOnRows  ObservedOnCols  ObservedTotal  ExpectedCombination  PctTotal  Histogram
0       0       1551                 4238            9352            18484          2144                 8         ********
0       1       2687                 4238            9132            18484          2094                 15        ***************
1       0       2187                 4883            9352            18484          2471                 12        ************
1       1       2696                 4883            9132            18484          2412                 15        ***************
2       0       3868                 6457            9352            18484          3267                 21        *********************
2       1       2589                 6457            9132            18484          3190                 14        **************
3       0       951                  1645            9352            18484          832                  5         *****
3       1       694                  1645            9132            18484          813                  4         ****
4       0       795                  1261            9352            18484          638                  4         ****
4       1       466                  1261            9132            18484          623                  3         ***

The next part is to create the report. Use the Matrix (OK, a Tablix with row and column groups) control. The trick is what to put in the cells of the matrix. The following screenshot shows the expression used for the most detailed cell of the matrix.

Similar expressions are used for the other cells with data. The expression for the column total, i.e., for the cell at the intersection of the [OnCols] column and the Total row, is

=Sum(Fields!ObservedCombination.Value) & Chr(10) & Chr(13) &
Sum(Fields!ExpectedCombination.Value) & Chr(10) & Chr(13) &
Round(Sum(100 * Fields!ObservedCombination.Value) /
Sum(Fields!ObservedOnRows.Value), 2) & "%" & Chr(10) & Chr(13) &
Round(Sum(100 * Fields!ObservedCombination.Value) /
Sum(Fields!ObservedCombination.Value), 2) & "%" & Chr(10) & Chr(13) &
Round(Sum(100 * Fields!ObservedCombination.Value) /
Sum(Fields!ObservedOnRows.Value), 2) & "%" & Chr(10) & Chr(13)

The expression for the row totals is

=Sum(Fields!ObservedCombination.Value) & Chr(10) & Chr(13) &
Sum(Fields!ExpectedCombination.Value) & Chr(10) & Chr(13) &
Round(Sum(100 * Fields!ObservedCombination.Value) /
Sum(Fields!ObservedCombination.Value), 2) & "%" & Chr(10) & Chr(13) &
Round(Sum(100 * Fields!ObservedCombination.Value) /
Sum(Fields!ObservedOnCols.Value), 2) & "%" & Chr(10) & Chr(13) &
Round(Sum(100 * Fields!ObservedCombination.Value) /
Sum(Fields!ObservedOnCols.Value), 2) & "%" & Chr(10) & Chr(13)

The expression for the grand total cell is

=Sum(Fields!ObservedCombination.Value) & Chr(10) & Chr(13) &
Sum(Fields!ExpectedCombination.Value) & Chr(10) & Chr(13) &
Round(Sum(100 * Fields!ObservedCombination.Value) /
Sum(Fields!ObservedCombination.Value), 2) & "%" & Chr(10) & Chr(13) &
Round(Sum(100 * Fields!ObservedCombination.Value) /
Sum(Fields!ObservedCombination.Value), 2) & "%" & Chr(10) & Chr(13) &
Round(Sum(100 * Fields!ObservedCombination.Value) /
Sum(Fields!ObservedCombination.Value), 2) & "%" & Chr(10) & Chr(13)

The expression for the row header is

=Fields!OnRows.Value & "    Observed Frequency" & Chr(10) & Chr(13) &
"        Expected Frequency" & Chr(10) & Chr(13) &
"        Row Percent" & Chr(10) & Chr(13) &
"        Column Percent" & Chr(10) & Chr(13) &
"        Total Percent" & Chr(10) & Chr(13) &
Chr(10) & Chr(13) &
"        Histogram"
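To see how the row-header labels line up with the stacked values in a cell, here is a small Python sketch (my own illustration, using the values of the NumberCarsOwned = 0, BikeBuyer = 0 cell from the result set) of what one detail cell ends up showing:

```python
# Values for the cell (NumberCarsOwned = 0, BikeBuyer = 0) from the result set.
observed, row_total, col_total, grand_total, expected = 1551, 4238, 9352, 18484, 2144

cell_lines = [
    str(observed),                                 # Observed Frequency
    str(expected),                                 # Expected Frequency
    f"{round(100 * observed / row_total, 2)}%",    # Row Percent
    f"{round(100 * observed / col_total, 2)}%",    # Column Percent
    f"{round(100 * observed / grand_total, 2)}%",  # Total Percent
    "*" * round(100 * observed / grand_total),     # Histogram
]
cell_text = "\n".join(cell_lines)  # the SSRS expression concatenates
                                   # with Chr(10) & Chr(13) instead
```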

The expression for the value of the linear pointer of the gauge is

= Round(Sum(100 * Fields!ObservedCombination.Value) /

Sum(Fields!ObservedTotal.Value), 2)

And the expression for the max value for the gauge scale is

=Max(Fields!PctTotal.Value, "ContTable")

All together, this looks like a lot of work. However, it is not that bad. The source query is not too complicated, and you can easily change the columns and the table used for the analysis, as their names appear only once, in the common table expression. The report does not refer to the source names anywhere, so it is simple to copy it and use it for a new report with a contingency table for two different columns. You just need to change the report header and the text in the top-left cell of the matrix, which shows the columns used for the report.

 

Dejan Sarka