# Tag: median

## Using Medians For Quartiling When NTILE Is Too Arbitrary

So, in a previous post on The Coolness of the T-SQL NTILE Function, I discuss that NTILE can perform quartiling (or any other –tiling not requiring a trip to Home Depot) for you. Very cool. However, there is something worth noting about this function that I discovered the hard way. I will continue with my quartile example.

If you are quartiling based on a single field (or a small number of fields) and there is a possibility that more than one record could have the same values in the fields by which you are quartiling, it is possible for records with the same value to end up in different quartiles. Doh.

The reason for this is that NTILE essentially just orders your records by the field specified in the ORDER BY clause and then places the first 25% of the records in quartile 1, the second 25% in quartile 2, etc. So, if you have 8 records, as in my example below, the first two records go in quartile 1, the second two go in quartile 2, etc.

Let’s see this in action to make sure I am not just making this stuff up.

`USE tempdb`

`GO`

` `

`CREATE TABLE TestValue`

`(`

`       Record char(1) NOT NULL `

`     , Value1 int NOT NULL`

`)`

` `

`INSERT INTO TestValue (Record, Value1) VALUES('A',2)`

`INSERT INTO TestValue (Record, Value1) VALUES('B',2)`

`INSERT INTO TestValue (Record, Value1) VALUES('C',2)`

`INSERT INTO TestValue (Record, Value1) VALUES('D',4)`

`INSERT INTO TestValue (Record, Value1) VALUES('E',6)`

`INSERT INTO TestValue (Record, Value1) VALUES('F',6)`

`INSERT INTO TestValue (Record, Value1) VALUES('G',8)`

`INSERT INTO TestValue (Record, Value1) VALUES('H',8)`

In the code above, note that Record B and Record C have the same exact value in Value1.

`/*`

`Quartiling with NTILE`

`*/`

`SELECT`

`       tv.Record`

`     , tv.Value1`

`     , NTILE(4) OVER(ORDER BY tv.Value1 ASC) AS Quartile_NTILE`

`FROM TestValue tv`

` `

`The query above produces this result set:`

` `

`Record Value1      Quartile_NTILE`

`------ ----------- --------------------`

`A      2           1`

`B      2           1`

`C      2           2`

`D      4           2`

`E      6           3`

`F      6           3`

`G      8           4`

`H      8           4`

Note that although Records B and C have the same Value1 value, NTILE puts them in different quartiles. As previously stated, doh. Since I am doing my ORDER BY only a single column for which duplicate values are present, NTILE makes somewhat arbitrary decisions within the subsets of each Value1 value and puts Record C in quartile 2 instead of 1. If uniqueness were assured on my Value1 field, then NTILE would be an easy and accurate way to go. But not so much in this case.

I decided that the median would be my champion. Itzik Ben-Gan provides the basis for my median calculation here: http://www.sqlmag.com/Article/ArticleID/49827/sql_server_49827.html.

I ended up calculating 3 median values:

1. Median (I called Median50) across Value1 for all records, forming the 50th percentile boundary and hence the boundary between quartiles 2 and 3.

2. Median (I called Median25) across records whose Value1 <= the Median50, forming the 25th percentile boundary and hence the boundary between quartiles 1 and 2.

3. Median (I called Median75) across records whose Value1 > the Median50, forming the 75th percentile boundary and hence the boundary between quartiles 3 and 4.

Here is the final code for calculating my own quartiles and ensuring that records with the same value end up in the same quartile:

`;`

`WITH`

` `

`/*`

`Calculate the median over all records`

`This forms the boundary between quartiles 2 and 3`

`This must be calculated first since it is needed to calculate the `

`other quartile boundaries`

`*/`

` `

`Median50 AS`

`(`

`  SELECT Record, Value1,`

`    ROW_NUMBER() OVER(ORDER BY Value1) AS RowNum`

`  FROM TestValue`

`)`

`,`

`MedianResults50 AS`

`(`

`SELECT 1 AS JoinField,CAST(AVG(Value1) AS dec(7,3)) AS Median50`

`FROM Median50`

`)`

`,`

` `

`/*`

`Calculate the median over records whose value is <= the overall median`

`This forms the boundary between quartiles 1 and 2`

`*/`

` `

`Median25 AS`

`(`

`  SELECT Record, Value1,`

`    ROW_NUMBER() OVER(ORDER BY Value1) AS RowNum`

`  FROM TestValue`

`INNER JOIN MedianResults50 m`

`ON m.JoinField = 1`

`WHERE Value1 <= m.Median50`

`)`

`,`

`MedianResults25 AS`

`(`

`SELECT 1 AS JoinField,CAST(AVG(Value1) AS dec(7,3)) AS Median25`

`FROM Median25`

`) `

`,`

` `

`/*`

`Calculate the median over records whose value is < the overall median`

`This forms the boundary between quartiles 3 and 4`

`*/`

` `

`Median75 AS`

`(`

`  SELECT Record, Value1,`

`    ROW_NUMBER() OVER(ORDER BY Value1) AS RowNum`

`  FROM TestValue`

`INNER JOIN MedianResults50 m`

`ON m.JoinField = 1`

`WHERE Value1 > m.Median50`

`)`

`,`

`MedianResults75 AS`

`(`

`SELECT 1 AS JoinField,CAST(AVG(Value1) AS dec(7,3)) AS Median75`

`FROM Median75`

`)`

` `

`/*`

`Final SELECT using a CASE statement to assign quartiles based on `

`my various medians`

`*/`

`SELECT`

`       tv.Record`

`     , tv.Value1`

`     , NTILE(4) OVER(ORDER BY tv.Value1 ASC) AS Quartile_NTILE`

`     , CASE`

`          WHEN tv.Value1 <= m25.Median25 THEN 1`

`          WHEN tv.Value1 BETWEEN m25.Median25 AND m50.Median50 THEN 2`

`          WHEN tv.Value1 BETWEEN m50.Median50 AND m75.Median75 THEN 3`

`          ELSE 4`

`       END AS MyQuartile`

`FROM TestValue tv`

`INNER JOIN MedianResults25 m25`

`ON m25.JoinField = 1`

`INNER JOIN MedianResults50 m50`

`ON m50.JoinField = 1`

`INNER JOIN MedianResults75 m75`

`ON m75.JoinField = 1`

` `

`-- Cleaning up`

`DROP TABLE TestValue`

` `

`Here is the result set produced:`

` `

`Record Value1      Quartile_NTILE       MyQuartile`

`------ ----------- -------------------- -----------`

`A      2           1                    1`

`B      2           1                    1`

`C      2           2                    1`

`D      4           2                    2`

`E      6           3                    3`

`F      6           3                    3`

`G      8           4                    4`

`H      8           4                    4`

You can see in the result set above that my median-style quartiling ensures that records with the same Value1 value end up in the same quartile.

Thus, in situations where the fields over which you are quartiling are guaranteed to produce unique values across your data set, then NTILE can be a great way to go. However, if there can be dupes, you may want to consider my approach above.