My good old friend Itzik Ben-Gan wrote an excellent article on interval queries in SQL Server (http://sqlmag.com/t-sql/sql-server-interval-queries) by using the Relational Interval Tree model. Based on the model developed by Kriegel, Pötke, and Seidl, and enhanced by Martin, Itzik fully developed a T-SQL solution. The solution is great, and makes interval queries efficient in all circumstances. However, the solution is quite complex. Itzik made a Microsoft Connect feature proposal (https://connect.microsoft.com/SQLServer/feedback/details/780746) to add SQL Server Engine support for interval queries. I fully agree that this would be the best solution; now when the theory is known and implementation made possible, it is time that Microsoft puts this in the database engine.

Unfortunately, it looks like temporal data support is not coming in SQL Server soon, at least not yet in the next version (2014). Therefore, it looks like we have to use our own solutions for a while. Itzik’s fantastic article made me think. I was wondering whether there could be some simpler solutions, even if they are not efficient in all circumstances, even if they are applicable for specific cases only. I found three solutions, and in addition, another good friend from SolidQ, Davide Mauri, presented me the fourth one. In this and next three blog posts I am going to present all four solutions. Please note that I am using the tables and data Itzik has prepared. In order to test the solutions, you can download the code from Itzik’s article by using the link at the beginning of this post.

Let me start with the classical solution. I am copying the code from Itzik’s article for creating and populating the table with 10,000,000 rows, and then for querying the middle of the table.

CREATE
TABLE
dbo.Intervals
(


id
INT
NOT
NULL,
lower
INT
NOT
NULL,
upper
INT
NOT
NULL,
CONSTRAINT
CHK_Intervals_upper_gteq_lower
CHECK(upper >=
lower)
);

INSERT

INTO
dbo.Intervals
WITH(TABLOCK)
(id,
lower, upper)

SELECT id,
lower, upper

FROM
dbo.Stage;

ALTER
TABLE
dbo.Intervals
ADD
CONSTRAINT
PK_Intervals
PRIMARY
KEY(id);

CREATE
INDEX
idx_lower
ON
dbo.Intervals(lower)
INCLUDE(upper);
CREATE
INDEX
idx_upper
ON
dbo.Intervals(upper)
INCLUDE(lower);
GO
— ~ 60 seconds needed for creating and populating the table

— query
SET
STATISTICS
IO
ON;
SET

STATISTICS
TIME
ON;
GO


— middle of data
DECLARE
@l
AS
INT =
5000000,
@u AS

INT
= 5000020;

SELECT
id
FROM
dbo.Intervals
WHERE
lower <=
@u
AND

upper
>= @l

OPTION (RECOMPILE);
— used to allow variable sniffing and prevent plan reuse
— logical reads: 11254, CPU time: 764 ms

As Itzik has shown, the problem with this query is that SQL Server can use only one index, and successfully eliminate rows that are not candidates for the result from one side only, and then scans the rest (approximately half of the) data. In Itzik’s and my case presented here, SQL Server used the idx_upper index to eliminate the intervals from the left, which satisfy the condition upper >= @l. Considering this limitation, usage of a single index for a single query, no matter whether it is idx_lower or idx_upper, I started to search for a solution which would use that index for elimination of the rows from both sides. The following figure shows an example of possible intervals.

Black color denotes the intervals in the table. The blue colored interval is the one I am checking for the overlaps. The intervals are sorted by the lower boundary, representing SQL Server’s usage of the idx_lower index. Eliminating intervals from the right side of the given (blue) interval is simple: just eliminate all intervals where the beginning is at least one unit bigger (more to the right) of the end of the given interval. You can see this boundary in the figure denoted with yellow line. However, eliminating from the left is more complex. In order to use the same index, the idx_lower index for eliminating from the left, I need to use the beginning of the intervals in the table in the WHERE clause of the query. I have to go to the left side away from the beginning of the given (blue) interval at least for the length of the longest interval in the table, which is marked with red color in the figure. The intervals that begin before the left yellow line cannot overlap with the given (blue) interval.

Of course, the figure could be turned around to cover the cases when the idx_upper index would be more useful. With this index, the elimination from the left is simple – eliminate all of the intervals which end at least one unit before the beginning of the given interval. This time, the elimination from the right is more complex – the end of the intervals in the table cannot be more to the right than the end of the given interval plus maximal length of all intervals in the table.

I created a new table with the same structure as the original Itzik’s table, with just one computed column (ilen) added. This column, which calculates the length of an interval, is persisted and indexed in order to speed up the search for the longest interval. The following code creates and populates this table.

CREATE
TABLE
dbo.IntervalsL
(


id
INT
NOT
NULL,
lower
INT
NOT
NULL,
upper
INT
NOT
NULL,
ilen
AS
upper

lower
PERSISTED,

CONSTRAINT
CHK_IntervalsL_upper_gteq_lower
CHECK(upper >=
lower)
);

INSERT

INTO
dbo.IntervalsL
WITH(TABLOCK)
(id,
lower, upper)

SELECT id,
lower, upper

FROM
dbo.Stage;

ALTER
TABLE
dbo.IntervalsL
ADD
CONSTRAINT
PK_IntervalsL
PRIMARY
KEY(id);

CREATE
INDEX
idx_lowerL
ON
dbo.IntervalsL(lower)
INCLUDE(upper);
CREATE
INDEX
idx_upperL
ON
dbo.IntervalsL(upper)
INCLUDE(lower);
CREATE
INDEX
ids_ilenL
ON
dbo.IntervalsL(ilen);
GO
— ~ 80 seconds needed for creating and populating the table

 

The creating and the population of the table was slightly longer. Of course, this is due to the creation of the additional index. In addition, because of the persisted computed column, the rows are longer and thus SQL Server needed to allocate more pages for the same 10,000,000 rows.

I find the maximal length of the intervals in advance and then store it in a variable. The query that checks for the overlaps in the middle of the data uses just slightly more complex WHERE clause, otherwise it is equal to the original query, as you can see from the following code.

— query
SET
STATISTICS
IO
ON;
SET

STATISTICS
TIME
ON;
GO

— middle of data
DECLARE
@l
AS
INT
= 5000000,
@u
AS
INT
= 5000020;

DECLARE
@max
AS
INT;
SET

@max
=


(SELECT
MAX(ilen)
AS maxlen

FROM

(SELECT
MAX(ilen)
AS ilen

FROM
dbo.IntervalsL

UNION

SELECT
@u

@l)
AS m1

);


SELECT
id
FROM
dbo.Intervals
WHERE
lower <=
@u
AND
lower
>=

@l

@max

AND
upper
>=
@l
AND
upper
<=
@u
+
@max
OPTION (RECOMPILE);
GO

— logical reads: 6 (3 + 3 to calculate the max length), CPU time: 0 ms

Note the performance of the query. With Itzik’s data, it needed only 6 logical reads, and was thus performing much better that even the fastest solution using the RI tree model, as you can see in the Itzik’s article. However, please note that this performance is the consequence of the specific data in the table. The maximal length of an interval is 20, and there are 10,000,000 intervals in the table. This way, SQL Server can very efficiently eliminate intervals from both sides. However, if there would be only one long interval in the table, the code would become much less efficient, because SQL Server would not be able to eliminate a lot of rows from one side, either left or right, depending which index it would use. For example, the following code sets the maximal length of the intervals in the table to a predefined size 100,000, and then executes the same query.

— Check with longer maximal interval
DECLARE
@l
AS
INT
= 5000000,
@u
AS
INT
= 5000020;

DECLARE
@max
AS
INT
= 100000;

SELECT
id
FROM
dbo.Intervals
WHERE
lower <=
@u

AND
lower
>=
@l

@max
AND
upper
>=
@l
AND
upper
<=
@u
+
@max
OPTION (RECOMPILE);
GO

— logical reads: 228, CPU time: 0 ms

The number of logical reads has grown substantially. In worst case, with a very long interval in the table, the performance of the query would be comparable to the performance of the classical solution presented in the beginning of this blog and in the beginning of the Itzik’s article.

To conclude, this solution might be very useful in case when your data has more or less uniform distribution of the lengths, or, as in the case with Itzik’s test data, when the distribution is not uniform but the maximal length is small. If this is the case, you might prefer this solution to the RI tree model solution, because it is much simpler. However, as I already stated in this blog post, Itzik’s solution does not depend on any data distribution. Therefore, if you don’t know or control the distribution of your data, you should go for the RI tree model solution.

I will show more solutions in the forthcoming blogs.

 

Dejan Sarka