So, in a previous post on The Coolness of the T-SQL NTILE Function, I discuss that NTILE can perform quartiling (or any other –tiling not requiring a trip to Home Depot) for you. Very cool. However, there is something worth noting about this function that I discovered the hard way. I will continue with my quartile example.

If you are quartiling based on a single field (or a small number of fields) and there is a possibility that more than one record could have the same values in the fields by which you are quartiling, it is possible for records with the same value to end up in different quartiles. Doh.

The reason for this is that NTILE essentially just orders your records by the field specified in the ORDER BY clause and then places the first 25% of the records in quartile 1, the second 25% in quartile 2, etc. So, if you have 8 records, as in my example below, the first two records go in quartile 1, the second two go in quartile 2, etc.

Let’s see this in action to make sure I am not just making this stuff up.

`USE tempdb`

`GO`

CREATE TABLE TestValue

(

Record char(1) NOT NULL

, Value1 int NOT NULL

)

INSERT INTO TestValue (Record, Value1) VALUES('A',2)

INSERT INTO TestValue (Record, Value1) VALUES('B',2)

INSERT INTO TestValue (Record, Value1) VALUES('C',2)

INSERT INTO TestValue (Record, Value1) VALUES('D',4)

INSERT INTO TestValue (Record, Value1) VALUES('E',6)

INSERT INTO TestValue (Record, Value1) VALUES('F',6)

INSERT INTO TestValue (Record, Value1) VALUES('G',8)

INSERT INTO TestValue (Record, Value1) VALUES('H',8)

In the code above, note that Record B and Record C have the same exact value in Value1.

`/*`

`Quartiling with NTILE`

`*/`

`SELECT`

tv.Record

, tv.Value1

, NTILE(4) OVER(ORDER BY tv.Value1 ASC) AS Quartile_NTILE

`FROM TestValue tv`

The query above produces this result set:

Record Value1 Quartile_NTILE

`------ ----------- --------------------`

A 2 1

B 2 1

C 2 2

D 4 2

E 6 3

F 6 3

G 8 4

H 8 4

Note that although Records B and C have the same Value1 value, NTILE puts them in different quartiles. As previously stated, doh. Since I am doing my ORDER BY only a single column for which duplicate values are present, NTILE makes somewhat arbitrary decisions within the subsets of each Value1 value and puts Record C in quartile 2 instead of 1. If uniqueness were assured on my Value1 field, then NTILE would be an easy and accurate way to go. But not so much in this case.

I decided that the median would be my champion. Itzik Ben-Gan provides the basis for my median calculation here: http://www.sqlmag.com/Article/ArticleID/49827/sql_server_49827.html.

I ended up calculating 3 median values:

1. Median (I called Median50) across Value1 for all records, forming the 50^{th} percentile boundary and hence the boundary between quartiles 2 and 3.

2. Median (I called Median25) across records whose Value1 <= the Median50, forming the 25^{th} percentile boundary and hence the boundary between quartiles 1 and 2.

3. Median (I called Median75) across records whose Value1 > the Median50, forming the 75^{th} percentile boundary and hence the boundary between quartiles 3 and 4.

Here is the final code for calculating my own quartiles and ensuring that records with the same value end up in the same quartile:

;

`WITH`

`/*`

`Calculate the median over all records`

`This forms the boundary between quartiles 2 and 3`

`This must be calculated first since it is needed to calculate the `

`other quartile boundaries`

`*/`

`Median50 AS`

(

` SELECT Record, Value1,`

ROW_NUMBER() OVER(ORDER BY Value1) AS RowNum

` FROM TestValue`

)

,

`MedianResults50 AS`

(

SELECT 1 AS JoinField,CAST(AVG(Value1) AS dec(7,3)) AS Median50

`FROM Median50`

)

,

`/*`

`Calculate the median over records whose value is <= the overall median`

`This forms the boundary between quartiles 1 and 2`

`*/`

`Median25 AS`

(

` SELECT Record, Value1,`

ROW_NUMBER() OVER(ORDER BY Value1) AS RowNum

` FROM TestValue`

INNER JOIN MedianResults50 m

`ON m.JoinField = 1`

`WHERE Value1 <= m.Median50`

)

,

`MedianResults25 AS`

(

SELECT 1 AS JoinField,CAST(AVG(Value1) AS dec(7,3)) AS Median25

`FROM Median25`

)

,

`/*`

`Calculate the median over records whose value is < the overall median`

`This forms the boundary between quartiles 3 and 4`

`*/`

`Median75 AS`

(

` SELECT Record, Value1,`

ROW_NUMBER() OVER(ORDER BY Value1) AS RowNum

` FROM TestValue`

INNER JOIN MedianResults50 m

`ON m.JoinField = 1`

`WHERE Value1 > m.Median50`

)

,

`MedianResults75 AS`

(

SELECT 1 AS JoinField,CAST(AVG(Value1) AS dec(7,3)) AS Median75

`FROM Median75`

)

`/*`

`Final SELECT using a CASE statement to assign quartiles based on `

`my various medians`

`*/`

`SELECT`

tv.Record

, tv.Value1

, NTILE(4) OVER(ORDER BY tv.Value1 ASC) AS Quartile_NTILE

` , CASE`

WHEN tv.Value1 <= m25.Median25 THEN 1

WHEN tv.Value1 BETWEEN m25.Median25 AND m50.Median50 THEN 2

WHEN tv.Value1 BETWEEN m50.Median50 AND m75.Median75 THEN 3

` ELSE 4`

END AS MyQuartile

`FROM TestValue tv`

INNER JOIN MedianResults25 m25

`ON m25.JoinField = 1`

INNER JOIN MedianResults50 m50

`ON m50.JoinField = 1`

INNER JOIN MedianResults75 m75

`ON m75.JoinField = 1`

`-- Cleaning up`

DROP TABLE TestValue

Here is the result set produced:

Record Value1 Quartile_NTILE MyQuartile

`------ ----------- -------------------- -----------`

A 2 1 1

B 2 1 1

C 2 2 1

D 4 2 2

E 6 3 3

F 6 3 3

G 8 4 4

H 8 4 4

You can see in the result set above that my median-style quartiling ensures that records with the same Value1 value end up in the same quartile.

Thus, in situations where the fields over which you are quartiling are guaranteed to produce unique values across your data set, then NTILE can be a great way to go. However, if there can be dupes, you may want to consider my approach above.

Categories: T-SQL

Hey Mark,

How about a DENSE_RANK() sol’n?

;WITH ValueTest AS (

select tv.Record,

tv.Value1,

DENSE_RANK() OVER(ORDER BY tv.Value1 ASC) AS Ranking

FROM TestValue tv

),

ValueAggregate AS (

SELECT

Ranking,

NTILE(4) OVER (ORDER BY Ranking) AS NTILEMod

FROM ValueTest

GROUP BY Ranking

)

SELECT

VT.Record,

VT.Value1,

NTILE(4) OVER(ORDER BY VT.Value1 ASC) AS Quartile_NTILE,

VA.NTILEMod

FROM ValueTest AS VT

INNER JOIN ValueAggregate AS VA

ON VT.Ranking = VA.Ranking

LikeLike

Actually, the dense rank is just an unnecessary distraction:

;WITH ValueAggregate AS (

SELECT

Value1,

NTILE(4) OVER (ORDER BY Value1) AS NTILEGrouped

FROM TestValue

GROUP BY Value1

)

SELECT

TV.Record,

TV.Value1,

NTILE(4) OVER(ORDER BY TV.Value1 ASC) AS Quartile_NTILE,

VA.NTILEGrouped

FROM TestValue AS TV

INNER JOIN ValueAggregate AS VA

ON TV.Value1 = VA.Value1

LikeLike

Interesting. I will have to play with this for sure. Thanks Mr. Strom.

LikeLike

Great site. A lot of useful information here. I’m sending it to some friends!

LikeLike

Thanks a ton, pell grants.

LikeLike