In one of my previous blogs I explained basic descriptive statistics procedures. When you prepare the data for Data Mining, there is never enough of over viewing and checking the data with statistical methods. Therefore, it would be nice to have more of them available in SQL Server. Very useful measures are the 3^{rd} and the 4^{th} population moments- Skewness and Kurtosis.

Skewness is a parameter that describes asymmetry in a random variable’s probability distribution. Skewness characterizes the degree of asymmetry of a distribution around its mean. Positive skewness indicates a distribution with an asymmetric tail extending toward more positive values. Negative skewness indicates a distribution with an asymmetric tail extending toward more negative values.

The formula for the skewness consists of sum of the deviations from the mean divided by standard deviation on the third degree. Additionally, the formula can include a corrective factor to shift the values of the skewness around 0, so 0 means no skewness at all. I am using the same formulas as MS Excel Skew function. So the formula for the skewness is:

Skew = n / (n-1) / (n-2) * SUM(((x-µ)/s)^{3})

Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the normal distribution. Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution. The formula for the kurtosis consists of sum of the deviations from the mean divided by standard deviation on the fourth degree. Again, the formula can have some corrective factor to centralize the kurtosis around 0, and again I am using the same formula as MS Excel. The formula:

Kurt = n * (n+1) / (n-1) / (n-2) / (n-3) * SUM(((x-µ)/s)^{4}) – 3 * (n-1)^{2} / (n-2) / (n-3)

In this first part I am adding the Transact-SQL solutions. Note how I add the Mean and StDev values to each row of the outer table – using a cross join with a derived table that calculates them. I could calculate them with a CTE, but this solution with a cross join works on SQL 2000 as well. Still, I have two passes through the data. In next blog I will show you how you can calculate skewness and kurtosis with CLR UDAs in a single pass through the data.

USE AdventureWorksDW;

Go

— Skewness

SELECT Skew = SUM(((Age*1.0–m.mean)/m.[StDev])*((Age*1.0–m.mean)/m.[StDev])*((Age*1.0–m.mean)/m.[StDev]))

* MIN(m.corrfact1)

FROM vTargetMail v CROSS JOIN

(SELECT mean = AVG(Age*1.0), [StDev] = STDEV(Age),

corrfact1 = COUNT(*)*1.0 / (COUNT(*)-1) / (COUNT(*)-2)

FROM vTargetMail v) AS m;

— Kurtosis

SELECT Kurt = SUM(SQUARE(SQUARE(((Age*1.0–m.mean)/m.[StDev])))) * MIN(m.corrfact2) – MIN(m.subfact)

FROM vTargetMail v CROSS JOIN

(SELECT mean = AVG(Age*1.0), [StDev] = STDEV(Age),

corrfact2 = COUNT(*)*1.0 * (COUNT(*)+1) / (COUNT(*)-1) / (COUNT(*)-2) / (COUNT(*)-3),

subfact = 3.0 * SQUARE((COUNT(*)-1)) / (COUNT(*)-2) / (COUNT(*)-3)

FROM vTargetMail v) AS m;

GO

- Python for SQL Server Specialists Part 4: Python and SQL Server - April 24, 2018
- Python for SQL Server Specialists Part 3: Graphs and Machine Learning - April 11, 2018
- Python for SQL Server Specialists Part 2: Working with Data - March 22, 2018

Your description “Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the normal distribution. Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution” is outdated and simply wrong.

Two simple counterexamples prove this point: (1) The beta(.5,1) distribution is infinitely peaked but has negative excess kurtosis, and (2) The Uniform distribution mixed with a Cauchy, with mixing probability .000001, appears perfectly flat-topped when graphed but has infinite kurtosis.

Kurtosis measures outliers (or outlier potential) relative to the normal distribution, and nothing about the peak of the distribution. Please see the current Wikipedia page for correct interpretations of kurtosis.