March 2009 – Mark Vaillancourt

Using Medians For Quartiling When NTILE Is Too Arbitrary

By Mark on March 26, 2009 • ( 5 Comments )

So, in a previous post on The Coolness of the T-SQL NTILE Function, I discuss that NTILE can perform quartiling (or any other –tiling not requiring a trip to Home Depot) for you. Very cool. However, there is something worth noting about this function that I discovered the hard way. I will continue with my quartile example.

If you are quartiling based on a single field (or a small number of fields) and there is a possibility that more than one record could have the same values in the fields by which you are quartiling, it is possible for records with the same value to end up in different quartiles. Doh.

The reason for this is that NTILE essentially just orders your records by the field specified in the ORDER BY clause and then places the first 25% of the records in quartile 1, the second 25% in quartile 2, etc. So, if you have 8 records, as in my example below, the first two records go in quartile 1, the second two go in quartile 2, etc.

Let’s see this in action to make sure I am not just making this stuff up.

USE tempdb

GO

CREATE TABLE TestValue

       Record char(1) NOT NULL

     , Value1 int NOT NULL

INSERT INTO TestValue (Record, Value1) VALUES('A',2)

INSERT INTO TestValue (Record, Value1) VALUES('B',2)

INSERT INTO TestValue (Record, Value1) VALUES('C',2)

INSERT INTO TestValue (Record, Value1) VALUES('D',4)

INSERT INTO TestValue (Record, Value1) VALUES('E',6)

INSERT INTO TestValue (Record, Value1) VALUES('F',6)

INSERT INTO TestValue (Record, Value1) VALUES('G',8)

INSERT INTO TestValue (Record, Value1) VALUES('H',8)

In the code above, note that Record B and Record C have the same exact value in Value1.

/*

Quartiling with NTILE

*/

SELECT

       tv.Record

     , tv.Value1

     , NTILE(4) OVER(ORDER BY tv.Value1 ASC) AS Quartile_NTILE

FROM TestValue tv

The query above produces this result set:

Record Value1      Quartile_NTILE

------ ----------- --------------------

A      2           1

B      2           1

C      2           2

D      4           2

E      6           3

F      6           3

G      8           4

H      8           4

Note that although Records B and C have the same Value1 value, NTILE puts them in different quartiles. As previously stated, doh. Since I am doing my ORDER BY only a single column for which duplicate values are present, NTILE makes somewhat arbitrary decisions within the subsets of each Value1 value and puts Record C in quartile 2 instead of 1. If uniqueness were assured on my Value1 field, then NTILE would be an easy and accurate way to go. But not so much in this case.

I decided that the median would be my champion. Itzik Ben-Gan provides the basis for my median calculation here: http://www.sqlmag.com/Article/ArticleID/49827/sql_server_49827.html.

I ended up calculating 3 median values:

1. Median (I called Median50) across Value1 for all records, forming the 50^th percentile boundary and hence the boundary between quartiles 2 and 3.

2. Median (I called Median25) across records whose Value1 <= the Median50, forming the 25^th percentile boundary and hence the boundary between quartiles 1 and 2.

3. Median (I called Median75) across records whose Value1 > the Median50, forming the 75^th percentile boundary and hence the boundary between quartiles 3 and 4.

Here is the final code for calculating my own quartiles and ensuring that records with the same value end up in the same quartile:

WITH

/*

Calculate the median over all records

This forms the boundary between quartiles 2 and 3

This must be calculated first since it is needed to calculate the

other quartile boundaries

*/

Median50 AS

  SELECT Record, Value1,

    ROW_NUMBER() OVER(ORDER BY Value1) AS RowNum

  FROM TestValue

MedianResults50 AS

SELECT 1 AS JoinField,CAST(AVG(Value1) AS dec(7,3)) AS Median50

FROM Median50

/*

Calculate the median over records whose value is <= the overall median

This forms the boundary between quartiles 1 and 2

*/

Median25 AS

  SELECT Record, Value1,

    ROW_NUMBER() OVER(ORDER BY Value1) AS RowNum

  FROM TestValue

INNER JOIN MedianResults50 m

ON m.JoinField = 1

WHERE Value1 <= m.Median50

MedianResults25 AS

SELECT 1 AS JoinField,CAST(AVG(Value1) AS dec(7,3)) AS Median25

FROM Median25

/*

Calculate the median over records whose value is < the overall median

This forms the boundary between quartiles 3 and 4

*/

Median75 AS

  SELECT Record, Value1,

    ROW_NUMBER() OVER(ORDER BY Value1) AS RowNum

  FROM TestValue

INNER JOIN MedianResults50 m

ON m.JoinField = 1

WHERE Value1 > m.Median50

MedianResults75 AS

SELECT 1 AS JoinField,CAST(AVG(Value1) AS dec(7,3)) AS Median75

FROM Median75

/*

Final SELECT using a CASE statement to assign quartiles based on

my various medians

*/

SELECT

       tv.Record

     , tv.Value1

     , NTILE(4) OVER(ORDER BY tv.Value1 ASC) AS Quartile_NTILE

     , CASE

          WHEN tv.Value1 <= m25.Median25 THEN 1

          WHEN tv.Value1 BETWEEN m25.Median25 AND m50.Median50 THEN 2

          WHEN tv.Value1 BETWEEN m50.Median50 AND m75.Median75 THEN 3

          ELSE 4

       END AS MyQuartile

FROM TestValue tv

INNER JOIN MedianResults25 m25

ON m25.JoinField = 1

INNER JOIN MedianResults50 m50

ON m50.JoinField = 1

INNER JOIN MedianResults75 m75

ON m75.JoinField = 1

-- Cleaning up

DROP TABLE TestValue

Here is the result set produced:

Record Value1      Quartile_NTILE       MyQuartile

------ ----------- -------------------- -----------

A      2           1                    1

B      2           1                    1

C      2           2                    1

D      4           2                    2

E      6           3                    3

F      6           3                    3

G      8           4                    4

H      8           4                    4

You can see in the result set above that my median-style quartiling ensures that records with the same Value1 value end up in the same quartile.

Thus, in situations where the fields over which you are quartiling are guaranteed to produce unique values across your data set, then NTILE can be a great way to go. However, if there can be dupes, you may want to consider my approach above.

Mark Vaillancourt

I drink coffee and know things

Month: March 2009

Using Medians For Quartiling When NTILE Is Too Arbitrary