Month: March 2009

Using Medians For Quartiling When NTILE Is Too Arbitrary

So, in a previous post on The Coolness of the T-SQL NTILE Function, I discuss that NTILE can perform quartiling (or any other –tiling not requiring a trip to Home Depot) for you. Very cool. However, there is something worth noting about this function that I discovered the hard way. I will continue with my quartile example.

If you are quartiling based on a single field (or a small number of fields) and there is a possibility that more than one record could have the same values in the fields by which you are quartiling, it is possible for records with the same value to end up in different quartiles. Doh.

The reason for this is that NTILE essentially just orders your records by the field specified in the ORDER BY clause and then places the first 25% of the records in quartile 1, the second 25% in quartile 2, etc. So, if you have 8 records, as in my example below, the first two records go in quartile 1, the second two go in quartile 2, etc.

Let’s see this in action to make sure I am not just making this stuff up.

USE tempdb

GO

 

CREATE TABLE TestValue

(

       Record char(1) NOT NULL 

     , Value1 int NOT NULL

)

 

INSERT INTO TestValue (Record, Value1) VALUES('A',2)

INSERT INTO TestValue (Record, Value1) VALUES('B',2)

INSERT INTO TestValue (Record, Value1) VALUES('C',2)

INSERT INTO TestValue (Record, Value1) VALUES('D',4)

INSERT INTO TestValue (Record, Value1) VALUES('E',6)

INSERT INTO TestValue (Record, Value1) VALUES('F',6)

INSERT INTO TestValue (Record, Value1) VALUES('G',8)

INSERT INTO TestValue (Record, Value1) VALUES('H',8)

 

In the code above, note that Record B and Record C have the same exact value in Value1.

/*

Quartiling with NTILE

*/

SELECT

       tv.Record

     , tv.Value1

     , NTILE(4) OVER(ORDER BY tv.Value1 ASC) AS Quartile_NTILE

FROM TestValue tv

 

The query above produces this result set:

 

Record Value1      Quartile_NTILE

------ ----------- --------------------

A      2           1

B      2           1

C      2           2

D      4           2

E      6           3

F      6           3

G      8           4

H      8           4

 

Note that although Records B and C have the same Value1 value, NTILE puts them in different quartiles. As previously stated, doh. Since I am doing my ORDER BY only a single column for which duplicate values are present, NTILE makes somewhat arbitrary decisions within the subsets of each Value1 value and puts Record C in quartile 2 instead of 1. If uniqueness were assured on my Value1 field, then NTILE would be an easy and accurate way to go. But not so much in this case.

I decided that the median would be my champion. Itzik Ben-Gan provides the basis for my median calculation here: http://www.sqlmag.com/Article/ArticleID/49827/sql_server_49827.html.

I ended up calculating 3 median values:

1. Median (I called Median50) across Value1 for all records, forming the 50th percentile boundary and hence the boundary between quartiles 2 and 3.

2. Median (I called Median25) across records whose Value1 <= the Median50, forming the 25th percentile boundary and hence the boundary between quartiles 1 and 2.

3. Median (I called Median75) across records whose Value1 > the Median50, forming the 75th percentile boundary and hence the boundary between quartiles 3 and 4.

Here is the final code for calculating my own quartiles and ensuring that records with the same value end up in the same quartile:

;

WITH

 

/*

Calculate the median over all records

This forms the boundary between quartiles 2 and 3

This must be calculated first since it is needed to calculate the 

other quartile boundaries

*/

 

Median50 AS

(

  SELECT Record, Value1,

    ROW_NUMBER() OVER(ORDER BY Value1) AS RowNum

  FROM TestValue

)

,

MedianResults50 AS

(

SELECT 1 AS JoinField,CAST(AVG(Value1) AS dec(7,3)) AS Median50

FROM Median50

)

,

 

/*

Calculate the median over records whose value is <= the overall median

This forms the boundary between quartiles 1 and 2

*/

 

Median25 AS

(

  SELECT Record, Value1,

    ROW_NUMBER() OVER(ORDER BY Value1) AS RowNum

  FROM TestValue

INNER JOIN MedianResults50 m

ON m.JoinField = 1

WHERE Value1 <= m.Median50

)

,

MedianResults25 AS

(

SELECT 1 AS JoinField,CAST(AVG(Value1) AS dec(7,3)) AS Median25

FROM Median25

) 

,

 

/*

Calculate the median over records whose value is < the overall median

This forms the boundary between quartiles 3 and 4

*/

 

Median75 AS

(

  SELECT Record, Value1,

    ROW_NUMBER() OVER(ORDER BY Value1) AS RowNum

  FROM TestValue

INNER JOIN MedianResults50 m

ON m.JoinField = 1

WHERE Value1 > m.Median50

)

,

MedianResults75 AS

(

SELECT 1 AS JoinField,CAST(AVG(Value1) AS dec(7,3)) AS Median75

FROM Median75

)

 

/*

Final SELECT using a CASE statement to assign quartiles based on 

my various medians

*/

SELECT

       tv.Record

     , tv.Value1

     , NTILE(4) OVER(ORDER BY tv.Value1 ASC) AS Quartile_NTILE

     , CASE

          WHEN tv.Value1 <= m25.Median25 THEN 1

          WHEN tv.Value1 BETWEEN m25.Median25 AND m50.Median50 THEN 2

          WHEN tv.Value1 BETWEEN m50.Median50 AND m75.Median75 THEN 3

          ELSE 4

       END AS MyQuartile

FROM TestValue tv

INNER JOIN MedianResults25 m25

ON m25.JoinField = 1

INNER JOIN MedianResults50 m50

ON m50.JoinField = 1

INNER JOIN MedianResults75 m75

ON m75.JoinField = 1

 

-- Cleaning up

DROP TABLE TestValue

 

Here is the result set produced:

 

Record Value1      Quartile_NTILE       MyQuartile

------ ----------- -------------------- -----------

A      2           1                    1

B      2           1                    1

C      2           2                    1

D      4           2                    2

E      6           3                    3

F      6           3                    3

G      8           4                    4

H      8           4                    4

 

You can see in the result set above that my median-style quartiling ensures that records with the same Value1 value end up in the same quartile.

Thus, in situations where the fields over which you are quartiling are guaranteed to produce unique values across your data set, then NTILE can be a great way to go. However, if there can be dupes, you may want to consider my approach above.