Category: T-SQL

Transact-SQL related musings

TIE Fighters and sys.types: “There’s Too Many of Them!”

Y-Wing As the Y-Wing pilot in Star Wars: Episode VI – Return of the Jedi proclaimed, sometimes there’s too many of them. He was referring to TIE Fighters during the assault on the second Death Star. I echoed the sentiment while querying the system catalog for a client’s database. I was building the foundation for a data dictionary (at the end of the project, unfortunately) and started by pulling a list of all the fields in the database along with their data types. I was dismayed when my tables returned more records than there were fields. There were too many of them.

To show why this happenned, I will start by creating a simple table.

 

CREATE TABLE dbo.TooManyOfThem

(

      RecordID int NOT NULL

    , RecordName nvarchar(100) NULL

)

 

Notice that I have an nvarchar field in there. It turns out, that was the culprit. Let’s query the system catalog to return the columns from this table along with their associated data types.

SELECT

      o.name AS TableName

    , c.name As ColumnName

    , t.name AS DataTypeName

FROM sys.columns c

INNER JOIN sys.objects o

ON o.object_id = c.object_id

INNER JOIN sys.types t

ON t.system_type_id = c.system_type_id 

WHERE o.name = 'TooManyOfThem'

 

Since my TooManyOfThem table has two columns, I would expect to get two records returned by this query. Here’s is what I get.

TableName            ColumnName      DataTypeName

-------------------- --------------- ------------

TooManyOfThem        RecordID        int

TooManyOfThem        RecordName      nvarchar

TooManyOfThem        RecordName      sysname 

 

(3 row(s) affected)

 

I have a bad feeling about this. These aren’t the results I’m looking for. Are my fields riding single file to hide their numbers?

My RecordName field is showing up more than once. Notice that the second occurrence has DataTypeName of “sysname.” I reached out with my feelings and modified my query to get more fields returned.

SELECT

      o.name AS TableName

    , c.name As ColumnName

    , c.system_type_id AS C_system_type_id

    , c.user_type_id AS C_user_tpe_id

    , t.system_type_id AS T_system_type_id

    , t.user_type_id AS T_user_type_id

    , t.name AS DataTypeName

FROM sys.columns c

INNER JOIN sys.objects o

ON o.object_id = c.object_id

INNER JOIN sys.types t

ON t.system_type_id = c.system_type_id 

WHERE o.name = 'TooManyOfThem'

 

When I run this new query, I get the following results.

TableName            ColumnName      C_system_type_id C_user_tpe_id T_system_type_id T_user_type_id DataTypeName

-------------------- --------------- ---------------- ------------- ---------------- -------------- ------------

TooManyOfThem        RecordID        56               56            56               56             int

TooManyOfThem        RecordName      231              231           231              231            nvarchar

TooManyOfThem        RecordName      231              231           231              256            sysname

 

(3 row(s) affected)

 

We see that in sys.types, the system_type_id of 231 leads to both “nvarchar” and “sysname.” Let’s write a query to return the types whose system_type_id values appear more than once.

SELECT

      system_type_id

    , user_type_id

    , name

FROM sys.types

WHERE system_type_id IN

(

SELECT

      system_type_id

FROM sys.types

GROUP BY system_type_id

HAVING COUNT(*) > 1

)

We get the following results.

system_type_id user_type_id name

-------------- ------------ ---------------

240            128          hierarchyid

240            129          geometry

240            130          geography

231            231          nvarchar

231            256          sysname

 

(5 row(s) affected)

 

As we see, the system_type_id is not unique in sys.types. The user_type_id, however, does not repeat. So, I change my join to the sys.types view from system_type_id to user_type_id.

SELECT

      o.name AS TableName

    , name As ColumnName

    , name AS DataTypeName

FROM sys.columns c

INNER JOIN sys.objects o

ON o.object_id = c.object_id

INNER JOIN sys.types t

ON t.user_type_id = c.user_type_id 

WHERE o.name = 'TooManyOfThem'

 

Now I get the results I’m looking for.

TableName            ColumnName      DataTypeName

-------------------- --------------- ------------

TooManyOfThem        RecordID        int

TooManyOfThem        RecordName      nvarchar

 

(2 row(s) affected)

 

So, the path you choose will certainly affect your destiny. Here’s another path for you: Robot Chicken Star Wars. Seriously. It’s some of the funniest stuff I’ve ever seen.

 

 

 

 

 

A Hint on Using Indexed Views For Read Performance

Some time ago, I was on a project in which we implemented a report model. The data was located in several different databases, none of which were very conducive to reporting. The consequence was that the views upon which we based our entities for the model required lots of joins and performed poorly. We decided to implement indexed views in order to boost the read performance and improve the experience for the users. With indexed views, the data returned by the view is persisted in a clustered index, just like a clustered table, which can have great benefits in read performance.

NOTE: You should research and understand the requirements and cost of using Indexed Views before implementing them in your environment. Indexed views incur a cost in terms of write performance, so you will want to test extensively to make sure the gain on reads is worth that write cost.

You can read details about Creating Indexed Views here.

One thing I noticed is that when I first queried the indexed views, the performance didn’t really change. The reason is that the query optimizer typically defaults to expanding the view definition to access the underlying objects directly. I will demonstrate this in a moment. I did find, however, if my index on the view covered all the columns necessary to satisfy my query, then the optimizer did in fact go ahead and use the index on the view instead of expanding.

You can read details about Resolving Indexed Views here.

Let’s demonstrate this behavior I mentioned above where the indexed view does not cover the query.

CREATE TABLE dbo.Musketeer

(

      MusketeerID int NOT NULL 

    , MusketeerName varchar(20) NOT NULL

)

 

CREATE TABLE dbo.Lackey

(

      LackeyID int NOT NULL

    , MusketeerID int NOT NULL

    , LackeyName varchar(20) NOT NULL

)

 

CREATE TABLE dbo.Sword

(

      SwordID int NOT NULL

    , MusketeerID int NULL

    , SwordName varchar(20) NOT NULL

)

 

INSERT INTO dbo.Musketeer VALUES(1,'Athos')

INSERT INTO dbo.Musketeer VALUES(2,'Porthos')

INSERT INTO dbo.Musketeer VALUES(3,'Aramis')

 

INSERT INTO dbo.Lackey VALUES(1,1,'Grimaud')

INSERT INTO dbo.Lackey VALUES(2,2,'Mousqueton')

INSERT INTO dbo.Lackey VALUES(3,3,'Bazin')

 

INSERT INTO dbo.Sword VALUES(1,1,'Cutlass')

INSERT INTO dbo.Sword VALUES(2,2,'Rapier')

INSERT INTO dbo.Sword VALUES(3,3,'Epee')

 

In the code above, we create our tables and insert some data. I chose The Three Musketeers today. NOTE: I made up the data regarding the Sword preferences.

Let’s create our view.

CREATE VIEW dbo.vMusketeerInfo WITH SCHEMABINDING

AS

SELECT

      m.MusketeerID

    , m.MusketeerName

    , l.LackeyName

    , s.SwordName

FROM dbo.Musketeer m

INNER JOIN dbo.Lackey l

ON l.MusketeerID = m.MusketeerID

INNER JOIN dbo.Sword s

ON s.MusketeerID = m.MusketeerID

 

Let’s query our view, returning the Actual Execution Plan:

SELECT 

      MusketeerID

    , MusketeerName

    , LackeyName

    , SwordName      

 FROM dbo.vMusketeerInfo

 

/* RESULTS

MusketeerID MusketeerName  LackeyName  SwordName

----------- -------------- ----------- --------------------

1           Athos          Grimaud     Cutlass

2           Porthos        Mousqueton  Rapier

3           Aramis         Bazin       Epee

 

(3 row(s) affected)

*/

 

The plan returned is the following:

image

 

Notice that the Optimizer is doing a Table Scan on each table referenced.

Now let’s create an index on our view:

CREATE UNIQUE CLUSTERED INDEX ix_vMusketeerInfo_MusketeerID

ON dbo.vMusketeerInfo (MusketeerID)

GO

 

Great. Now that we have an index on our view, we should be all set, right? Let’s query the view again.

SELECT 

      MusketeerID

    , MusketeerName

    , LackeyName

    , SwordName      

 FROM dbo.vMusketeerInfo

 

/* RESULTS

MusketeerID MusketeerName  LackeyName  SwordName

----------- -------------- ----------- --------------------

1           Athos          Grimaud     Cutlass

2           Porthos        Mousqueton  Rapier

3           Aramis         Bazin       Epee

 

(3 row(s) affected)

*/

 

OK. We get the same results, as one would expect. How about the plan?

image

 

The plan is the same. The Optimizer is still performing Table Scans on each table. SQL Server has expanded the view and accessed the underlying objects directly. If you want to make sure that you are going to use the unexpended view, you can add the WITH (NOEXPAND) hint.

SELECT 

      MusketeerID

    , MusketeerName

    , LackeyName

    , SwordName      

 FROM dbo.vMusketeerInfo WITH (NOEXPAND)

 

/* RESULTS

MusketeerID MusketeerName  LackeyName  SwordName

----------- -------------- ----------- --------------------

1           Athos          Grimaud     Cutlass

2           Porthos        Mousqueton  Rapier

3           Aramis         Bazin       Epee

 

(3 row(s) affected)

*/

Again, the results are the same. However, we have a new plan.

image

 

The WITH (NOEXPAND) hint dictates to SQL Server that the view is not to be expanded to the underlying objects. Thus, in our implementation, we achieved a tremendous gain in READ performance. We did incur a cost on write performance, but in our implementation, the benefits to reads greatly outweighed that write cost.

Finding Foreign Keys

No, this is not a new initiative by United States Immigration and Customs Enforcement. I am talking databases.

I recently found myself in a situation where I wanted to find all the tables that referenced a particular table via foreign key relationships. I had heard that relying on the built-in dependency tracking (right-clicking the object in SSMS and choosing View Dependencies) was not always reliable. Indeed, SQL Server MVP Aaron Bertrand (Twitter) wrote a great blog post about ways in which sysdepends, sp_depends, and sys.sql_dependencies can lose touch with reality.

I figured I could find the info I wanted by querying the system catalog views. Here is the script I wrote using AdventureWorks as the sample DB.

USE AdventureWorks

GO

 

DECLARE 

      @ReferencedSchema varchar(1000)

    , @ReferencedObject varchar(1000)

 

SET @ReferencedSchema = 'HumanResources'

SET @ReferencedObject = 'Employee'

 

SELECT 

      rs.name + '.' + ro.name AS ReferencedObject

    , rc.name AS ReferencedColumn

    , ps.name + '.' + po.name AS ParentObject

    , pc.name AS ParentColumn

    , co.name AS ConstraintName

FROM sys.foreign_key_columns fkc

INNER JOIN sys.objects co

ON co.object_id = fkc.constraint_object_id

INNER JOIN sys.objects ro

ON ro.object_id = fkc.referenced_object_id

INNER JOIN sys.schemas rs

ON rs.schema_id = ro.schema_id

INNER JOIN sys.columns rc

ON rc.object_id = ro.object_id

AND rc.column_id = fkc.referenced_column_id

INNER JOIN sys.objects po

ON po.object_id = fkc.parent_object_id

INNER JOIN sys.schemas ps

ON ps.schema_id = po.schema_id

INNER JOIN sys.columns pc

ON pc.object_id = po.object_id

AND pc.column_id = fkc.parent_column_id

WHERE rs.name = @ReferencedSchema

AND ro.name = @ReferencedObject

ORDER BY 

      ps.name

    , po.name

 

Here are the results I get.

image

You can see that there are 7 foreign key constraints that reference the HumanResources.Employee table in my copy of Adventureworks. You can see that Parent object to which each of these constraints belong. And you can see the actual name of each constraint.

Feel free to take that query and tweak it to add information you would find useful. Also feel free to post it in comments here. I really want my blog to be a two-way street and would love to encourage debate.

I also encourage you to play around with querying the system catalog views. You can find some really cool stuff in there.

Jinkies! Passing MultiValue SSRS Parameters To Stored Procedures

SQL Server Reporting Services provides a great way to allows report users to select more than one value from the available values list in parameters, creating a multivalue parameter.

SSRS dynamically creates a comma separated list of the values selected by the user to pass in to the data source.

Great.

It also dynamically creates the Select All option at the top of the available values list.

Also great.

And, it is really easy to enable this feature.

Also also great.

But this great functionality does have a limitation right out of the box. You cannot pass a multivalue parameter to a stored procedure (see Using Single-Value and Multivalue Parameters).

Ruh roh. Not great.

In the past, I have had the data source sproc return all of the records and then used a filter on the data region to only show the records matching the values selected in the parameter. It got the job done, but returning records you are not going to use is suboptimal, to be sure.

Now, you may be thinking, dear reader, that I am a moron for choosing the above title for this post and then almost right away saying that you cannot do it. Well, provided that your database source is SQL 2005 or later, you actually CAN do it through the beauty of XML. And what is really really cool to me is that the amount of work you must perform to achieve this is miniscule. AND, it does not require much XML expertise, which I must confess, I currently lack. I have seen solutions for this obstacle involving the creation of user-defined functions, but I find this XML solution far more elegant and so easy to implement.

As I often do, I must shout out some thanks to Jason Strate (blog: www.jasonstrate.com, twitter: @stratesql) for showing me this.

You may be asking why I am so fond of having reporting data sources be stored procedures instead of just writing the query right into the report. While that is not truly the scope of this post, I will say that I really like having data access code in the database. Also, the security implications are great, since with sprocs, users need only Execute permission on the sproc and none against the underlying tables. There are other reasons, too, but those two are enough for now.

Setting Up The Data

Let’s start off by creating a table to house our data. I am using AdventureWorks (to make it easy) but creating my own schema (to keep things clean). NOTE: I looked but could not find any ScoobyDoo references in AdventureWorks.

USE AdventureWorks

GO

 

CREATE SCHEMA MarkVSQL

GO

 

CREATE TABLE MarkVSQL.ScoobyDooCharacter

(

      CharacterID int not null IDENTITY(1,1)

    , CharacterName varchar(10) not null

)

 

INSERT INTO MarkVSQL.ScoobyDooCharacter VALUES('Fred')

INSERT INTO MarkVSQL.ScoobyDooCharacter VALUES('Daphne')

INSERT INTO MarkVSQL.ScoobyDooCharacter VALUES('Velma')

INSERT INTO MarkVSQL.ScoobyDooCharacter VALUES('Shaggy')

INSERT INTO MarkVSQL.ScoobyDooCharacter VALUES('Scooby Doo')

NOTE: You may be thinking that Scrappy Doo is conspicuously absent from my list. There is a good reason for that. I can’t stand Scrappy Doo. He drives me nuts. I would rather bite down on a metal fork really hard with my front teeth than watch Scrappy Doo. </Scrappy Doo>

Creating Available Values List

Let’s create the stored procedure that will supply our Character parameter with available values. This is where we start to really have some fun.

CREATE PROCEDURE MarkVSQL.GetMysteryIncMemberList

 

AS

 

SELECT

      CharacterName

    , '<Character ID="'

        + CONVERT(varchar,CharacterID)

        + '"/>' AS CharacterIDXML

FROM MarkVSQL.ScoobyDooCharacter

This stored procedure returns the CharacterName along with the CharacterID field that has been modified to return the CharacterID value formatted as an xml node instead of just the int. Here is the fabulous output of our MarkVSQL.GetMysteryIncMemberList sproc.

CharacterName CharacterIDXML

------------- -------------------

Fred          <Character ID="1"/>

Daphne        <Character ID="2"/>

Velma         <Character ID="3"/>

Shaggy        <Character ID="4"/>

Scooby Doo    <Character ID="5"/>

Creating Data Source Sproc 

Now we create the stored procedure that will supply our report with data.

CREATE PROCEDURE MarkVSQL.GetMysteryIncMembers

 

    @CharacterID varchar(max)

    

AS

 

DECLARE @CharacterIDXML xml

 

SET @CharacterIDXML = REPLACE(@CharacterID,',','')

 

SELECT

      c.CharacterName

FROM MarkVSQL.ScoobyDooCharacter c

WHERE c.CharacterID IN

    (SELECT c.value('@ID','int') FROM @CharacterIDXML.nodes('//Character') AS x(c))

There are several things to note about this proc:

  • The parameter is a varchar(max) to allow as much space as we can for the string of comma separated XML nodes that SSRS will pass in.
  • We then declare an XML typed variable, @CharacterIDXML.
  • We remove the commas (which SSRS automatically adds as value separators) from the contents of the @CharacterID parameter and put the resulting list of XML nodes in the @CharacterIDXML variable.
  • Our WHERE clause uses the XML “nodes” method to shred out the CharacterID values from the XML.

Let’s run our MarkVSQL.GetMysteryIncMembers sproc with various parameters as they would be passed in from SSRS.

In this example, we just want to return Fred.

EXEC MarkVSQL.GetMysteryIncMembers '<Character ID="1"/>'

We get the following results:

CharacterName

-------------

Fred

 

(1 row(s) affected)

Now let’s get Fred and Daphne together. I think it is no secret that there is a spark between them.

EXEC MarkVSQL.GetMysteryIncMembers '<Character ID="1"/>,<Character ID="2"/>'

Here they are.

CharacterName

-------------

Fred

Daphne

 

(2 row(s) affected)

Now let’s get all of those meddling kids and their dog.

EXEC MarkVSQL.GetMysteryIncMembers '<Character ID="1"/>,<Character ID="2"/>,<Character ID="3"/>,<Character ID="4"/>,<Character ID="5"/>'

Here they are, in all of their mystery solving glory.

CharacterName

-------------

Fred

Daphne

Velma

Shaggy

Scooby Doo

 

(5 row(s) affected)

The configuration of the parameter in SSRS is nothing out of the ordinary, so I don’t feel it is necessary to go into the step by step process here. You can go here if you need more guidance on the SSRS piece, you can go here: Adding Parameters to Your Report. The article in this hyperlink is for 2008 but includes a hyperlink to the 2005 version as well. The configuration goals are the same, but interface differences between 2005 and 2008 make the exact steps a little different.

You just configure the @CharacterID parameter of the report to do the following:

  • Allow multiple values
  • Get its available values from the MarkVSQL.GetMysteryIncMemberList sproc.
  • Use the CharacterName field as the “Label” field so that the user is presented with the list of names.
  • Use the CharacterIDXML field as the “Value” field so that our XML nodes in a comma separated string are passed to the stored MarkVSQL.GetMysteryIncMembers sproc, just like in our examples above.

That’s really all there is to it. Roh boy!

Cool Tool For Formatting Code

Long time no blog.

A colleague showed me this cool code formatting tool the other day. It is somewhat configurable and is right within my price range. I still end up making a few edits for my preference, but the vast majority of the formatting I care about can be done with the click of a button. I have been working with SQL code that has never really followed any standard so this provides a great head start.

It costs $0 to use it right on the website. With the exchange rate as of July 8,2009, that translates to 0 Euros, too, I think.

It appears that you can also purchase a downloadable desktop version, if you wish.

Here is the link: http://www.dpriver.com/pp/sqlformat.htm

Happy formatting.

Enforcing Uniqueness On A Nullable Column

I recently found myself needing to enforce uniqueness on a column which was nullable. To make things interesting, there were already thousands of records for which the field was already NULL and would remain NULL. A unique index on the column was not possible since I had more than one NULL value. I therefore had to come up with another option. After many deep, profound, brain things inside my head, I had come up with a horribly convoluted solution involving duct tape, a Swiss Army knife, 17 sporks, and a hamster named Ginger. I think we can all see why this would not be the best option.

A colleague pointed me to this blog, featuring a comment by Mr. Adam Machanic.

http://weblogs.sqlteam.com/brettk/archive/2005/04/20/4592.aspx

It involved creating a view which returns only non-NULL values for the column in question and then placing an index on that view. Fabtastic! WAY easier than the solution I was hatching myself. Note that there are rules regarding the creation of indexed views to be aware of. Those rules are detailed here: http://msdn.microsoft.com/en-us/library/ms188783.aspx.

Here is how this works.

We start with a database table containing super heroes. One thing to note about these particular super heroes is that none of them have sidekicks.

USE tempdb

GO

 

CREATE TABLE Superhero

(

       SuperheroID int NOT NULL IDENTITY

     , SuperheroName varchar(25) NULL

)

 

INSERT INTO Superhero (SuperheroName) VALUES('Superman')

INSERT INTO Superhero (SuperheroName) VALUES('Wonder Woman')

INSERT INTO Superhero (SuperheroName) VALUES('Spiderman')

 

SELECT 

       SuperheroID

     , SuperheroName

FROM Superhero

 

/*

RESULTS:

 

SuperheroID SuperheroName

----------- -------------------------

1           Superman

2           Wonder Woman

3           Spiderman

 

*/

 

Now, we need to import super heroes with sidekicks and bring their sidekicks into our table as well. We are also tasked with ensuring that no two super heroes can have the same sidekick name.

Let’s add a SidekickName column to our table and toss some sidekicked superheroes into our table.

ALTER TABLE Superhero ADD SidekickName varchar(25)

 

INSERT INTO Superhero (SuperheroName, SidekickName) VALUES('Batman', 'Robin')

INSERT INTO Superhero (SuperheroName, SidekickName) VALUES('Dangermouse', 'Penfold')

 

SELECT 

       SuperheroID

     , SuperheroName

     , SidekickName

FROM Superhero

 

/*

RESULTS:

 

SuperheroID SuperheroName             SidekickName

----------- ------------------------- -------------------------

1           Superman                  NULL

2           Wonder Woman              NULL

3           Spiderman                 NULL

4           Batman                    Robin

5           Dangermouse               Penfold

 

*/

 

So now we have some sidekicks and some superheroes who tend to work alone and have NULL in the SidekickName field. Just to show I am not making things up when it comes to unique indexes, let’s go ahead and try putting a unique index on the Superhero table and see what happens.

CREATE UNIQUE NONCLUSTERED INDEX ixSidekick ON Superhero (SidekickName ASC)

 

Here is the error returned:

Msg 1505, Level 16, State 1, Line 51

The CREATE UNIQUE INDEX statement terminated because a duplicate key was found for the object name 'dbo.Superhero' and the index name 'ixSidekick'. The duplicate key value is (<NULL>).

The statement has been terminated.

 

Blast!

< insert villain-style monologue here concerning the feebleness of the superhero of your choice >

Let’s create our view. There are two important things to notice about this statement below. The view must be created using the WITH SCHEMABINDING option. Also, although I have not been using the two-part table name dbo.Superhero thus far, I MUST do so in order to create this schemabound view.

CREATE VIEW dbo.vSuperheroSidekick

WITH SCHEMABINDING

AS

SELECT

       SuperheroID

     , SuperheroName

     , SidekickName

FROM dbo.Superhero

WHERE SidekickName IS NOT NULL

 

Let’s query our view.

 

SELECT

       SuperheroID

     , SuperheroName

     , SidekickName

FROM vSuperheroSidekick

 

/*

RESULTS:

 

SuperheroID SuperheroName             SidekickName

----------- ------------------------- -------------------------

4           Batman                    Robin

5           Dangermouse               Penfold

 

*/

 

Fabtastical. So far so good. Now this is where it gets cool. Let’s create our UNIQUE CLUSTERED index on our new view.

CREATE UNIQUE CLUSTERED INDEX ix_vSuperheroSidekick_SidekickName_Unique

ON vSuperheroSidekick (SidekickName ASC)

GO

 

Now, let try inserting a new superhero, Safetyrat, with his diminutive sidekick, Penfold (sound familiar to you?):

 

INSERT INTO Superhero (SuperheroName, SidekickName) VALUES('Safetyrat', 'Penfold')

 

We get the following error:

Msg 2601, Level 14, State 1, Line 1

Cannot insert duplicate key row in object 'dbo.vSuperheroSidekick' with unique index 'ix_vSuperheroSidekick_SidekickName_Unique'.

The statement has been terminated.

 

HAZZAH! The attempt to inject a lame mockery of the ace, the greatest, has been foiled by our fabulous indexed view. Take THAT Baron Greenback!

NOTE: In my situation, I was dealing with a table that did not have a lot of writes, so that added overhead of maintaining this indexed view was not a concern. You will want to take this into consideration in your own environment.

Don’t forget to clean up after yourself:

DROP VIEW dbo.vSuperheroSidekick

DROP TABLE Superhero

Using Medians For Quartiling When NTILE Is Too Arbitrary

So, in a previous post on The Coolness of the T-SQL NTILE Function, I discuss that NTILE can perform quartiling (or any other –tiling not requiring a trip to Home Depot) for you. Very cool. However, there is something worth noting about this function that I discovered the hard way. I will continue with my quartile example.

If you are quartiling based on a single field (or a small number of fields) and there is a possibility that more than one record could have the same values in the fields by which you are quartiling, it is possible for records with the same value to end up in different quartiles. Doh.

The reason for this is that NTILE essentially just orders your records by the field specified in the ORDER BY clause and then places the first 25% of the records in quartile 1, the second 25% in quartile 2, etc. So, if you have 8 records, as in my example below, the first two records go in quartile 1, the second two go in quartile 2, etc.

Let’s see this in action to make sure I am not just making this stuff up.

USE tempdb

GO

 

CREATE TABLE TestValue

(

       Record char(1) NOT NULL 

     , Value1 int NOT NULL

)

 

INSERT INTO TestValue (Record, Value1) VALUES('A',2)

INSERT INTO TestValue (Record, Value1) VALUES('B',2)

INSERT INTO TestValue (Record, Value1) VALUES('C',2)

INSERT INTO TestValue (Record, Value1) VALUES('D',4)

INSERT INTO TestValue (Record, Value1) VALUES('E',6)

INSERT INTO TestValue (Record, Value1) VALUES('F',6)

INSERT INTO TestValue (Record, Value1) VALUES('G',8)

INSERT INTO TestValue (Record, Value1) VALUES('H',8)

 

In the code above, note that Record B and Record C have the same exact value in Value1.

/*

Quartiling with NTILE

*/

SELECT

       tv.Record

     , tv.Value1

     , NTILE(4) OVER(ORDER BY tv.Value1 ASC) AS Quartile_NTILE

FROM TestValue tv

 

The query above produces this result set:

 

Record Value1      Quartile_NTILE

------ ----------- --------------------

A      2           1

B      2           1

C      2           2

D      4           2

E      6           3

F      6           3

G      8           4

H      8           4

 

Note that although Records B and C have the same Value1 value, NTILE puts them in different quartiles. As previously stated, doh. Since I am doing my ORDER BY only a single column for which duplicate values are present, NTILE makes somewhat arbitrary decisions within the subsets of each Value1 value and puts Record C in quartile 2 instead of 1. If uniqueness were assured on my Value1 field, then NTILE would be an easy and accurate way to go. But not so much in this case.

I decided that the median would be my champion. Itzik Ben-Gan provides the basis for my median calculation here: http://www.sqlmag.com/Article/ArticleID/49827/sql_server_49827.html.

I ended up calculating 3 median values:

1. Median (I called Median50) across Value1 for all records, forming the 50th percentile boundary and hence the boundary between quartiles 2 and 3.

2. Median (I called Median25) across records whose Value1 <= the Median50, forming the 25th percentile boundary and hence the boundary between quartiles 1 and 2.

3. Median (I called Median75) across records whose Value1 > the Median50, forming the 75th percentile boundary and hence the boundary between quartiles 3 and 4.

Here is the final code for calculating my own quartiles and ensuring that records with the same value end up in the same quartile:

;

WITH

 

/*

Calculate the median over all records

This forms the boundary between quartiles 2 and 3

This must be calculated first since it is needed to calculate the 

other quartile boundaries

*/

 

Median50 AS

(

  SELECT Record, Value1,

    ROW_NUMBER() OVER(ORDER BY Value1) AS RowNum

  FROM TestValue

)

,

MedianResults50 AS

(

SELECT 1 AS JoinField,CAST(AVG(Value1) AS dec(7,3)) AS Median50

FROM Median50

)

,

 

/*

Calculate the median over records whose value is <= the overall median

This forms the boundary between quartiles 1 and 2

*/

 

Median25 AS

(

  SELECT Record, Value1,

    ROW_NUMBER() OVER(ORDER BY Value1) AS RowNum

  FROM TestValue

INNER JOIN MedianResults50 m

ON m.JoinField = 1

WHERE Value1 <= m.Median50

)

,

MedianResults25 AS

(

SELECT 1 AS JoinField,CAST(AVG(Value1) AS dec(7,3)) AS Median25

FROM Median25

) 

,

 

/*

Calculate the median over records whose value is < the overall median

This forms the boundary between quartiles 3 and 4

*/

 

Median75 AS

(

  SELECT Record, Value1,

    ROW_NUMBER() OVER(ORDER BY Value1) AS RowNum

  FROM TestValue

INNER JOIN MedianResults50 m

ON m.JoinField = 1

WHERE Value1 > m.Median50

)

,

MedianResults75 AS

(

SELECT 1 AS JoinField,CAST(AVG(Value1) AS dec(7,3)) AS Median75

FROM Median75

)

 

/*

Final SELECT using a CASE statement to assign quartiles based on 

my various medians

*/

SELECT

       tv.Record

     , tv.Value1

     , NTILE(4) OVER(ORDER BY tv.Value1 ASC) AS Quartile_NTILE

     , CASE

          WHEN tv.Value1 <= m25.Median25 THEN 1

          WHEN tv.Value1 BETWEEN m25.Median25 AND m50.Median50 THEN 2

          WHEN tv.Value1 BETWEEN m50.Median50 AND m75.Median75 THEN 3

          ELSE 4

       END AS MyQuartile

FROM TestValue tv

INNER JOIN MedianResults25 m25

ON m25.JoinField = 1

INNER JOIN MedianResults50 m50

ON m50.JoinField = 1

INNER JOIN MedianResults75 m75

ON m75.JoinField = 1

 

-- Cleaning up

DROP TABLE TestValue

 

Here is the result set produced:

 

Record Value1      Quartile_NTILE       MyQuartile

------ ----------- -------------------- -----------

A      2           1                    1

B      2           1                    1

C      2           2                    1

D      4           2                    2

E      6           3                    3

F      6           3                    3

G      8           4                    4

H      8           4                    4

 

You can see in the result set above that my median-style quartiling ensures that records with the same Value1 value end up in the same quartile.

Thus, in situations where the fields over which you are quartiling are guaranteed to produce unique values across your data set, then NTILE can be a great way to go. However, if there can be dupes, you may want to consider my approach above.

Using Dynamic Values in the PIVOT Operator

I recently worked on a report for a client in which a date input parameter was used to determine the end dates of the previous four calendar quarters. Each time the report ran, it returned data for all four of those quarters.

I ended up using PIVOT in the course of the stored procedure for the report. The first time I did so, I hard-coded the four quarter end dates I needed to work with. This worked fine for my testing, but I needed the PIVOT to use whatever four quarters end dates were derived from the input parameter. I was fortunate that I knew ahead of time that I would always be pivoting on four and only four date values.

With that consistency in mind, I was able to create a common tables expression with a derived column to consistently “alias” the first quarter as 1, the second quarter as 2, and so forth.

Here is the breakdown of how this works using data on the dining habits of Hobbits. In this example, the report involved requires looking at the first three days of the month. This provides the consistency required to make this particular technique work. It could be possible to modify this procedure to work on value with less consistency but I have not yet tried it.

-- Create the Hobbit table to house the meal data

CREATE TABLE Hobbit

(

        Hobbit varchar(20)

      , MealDate date

      , MealCount int

)

 

-- Populate Hobbit table with meal data

INSERT INTO Hobbit VALUES('Bilbo', '2009-01-01', 4)

INSERT INTO Hobbit VALUES('Bilbo', '2009-01-02', 5)

INSERT INTO Hobbit VALUES('Bilbo', '2009-01-03', 3)

INSERT INTO Hobbit VALUES('Bilbo', '2009-02-01', 6)

INSERT INTO Hobbit VALUES('Bilbo', '2009-02-02', 7)

INSERT INTO Hobbit VALUES('Bilbo', '2009-02-03', 3)

INSERT INTO Hobbit VALUES('Samwise', '2009-01-01', 4)

INSERT INTO Hobbit VALUES('Samwise', '2009-01-02', 8)

INSERT INTO Hobbit VALUES('Samwise', '2009-01-03', 7)

INSERT INTO Hobbit VALUES('Samwise', '2009-02-01', 4)

INSERT INTO Hobbit VALUES('Samwise', '2009-02-02', 8)

INSERT INTO Hobbit VALUES('Samwise', '2009-02-03', 7)

INSERT INTO Hobbit VALUES('Meriadoc', '2009-01-01', 6)

INSERT INTO Hobbit VALUES('Meriadoc', '2009-01-02', 9)

INSERT INTO Hobbit VALUES('Meriadoc', '2009-01-03', 12)

INSERT INTO Hobbit VALUES('Meriadoc', '2009-02-01', 4)

INSERT INTO Hobbit VALUES('Meriadoc', '2009-02-02', 8)

INSERT INTO Hobbit VALUES('Meriadoc', '2009-02-03', 7)

 

-- Pivot with hard coded January values

SELECT

        Hobbit

      , [2009-01-01]

      , [2009-01-02]

      , [2009-01-03]

FROM 

(

SELECT 

        Hobbit

      , MealDate

      , MealCount

FROM Hobbit) h

PIVOT

(

SUM(MealCount)

FOR MealDate IN

(

        [2009-01-01]

      , [2009-01-02]

      , [2009-01-03]

)

) AS pvt

 

Results of January hard coded PIVOT:

 

Hobbit               2009-01-01  2009-01-02  2009-01-03

-------------------- ----------- ----------- -----------

Bilbo                4           5           3

Meriadoc             6           9           12

Samwise              4           8           7

 

-- Pivot with hard coded February values

SELECT

        Hobbit

      , [2009-02-01]

      , [2009-02-02]

      , [2009-02-03]

FROM 

(

SELECT 

        Hobbit

      , MealDate

      , MealCount

FROM Hobbit) h

PIVOT

(

SUM(MealCount)

FOR MealDate IN

(

        [2009-02-01]

      , [2009-02-02]

      , [2009-02-03]

)

) AS pvt

 

Results of February hard coded PIVOT:

 

Hobbit               2009-02-01  2009-02-02  2009-02-03

-------------------- ----------- ----------- -----------

Bilbo                6           7           3

Meriadoc             4           8           7

Samwise              4           8           7

 

As you can see from this pivot tables above, Hobbits eat often. The hard coded date values form the column headers. This is nice and all but requires a code change to report on a different set of dates.

Here we see how you can make a CTE with a derived column to pass in to the PIVOT operator. This allows the actual PIVOT to remain exactly the same while reporting on different dates.

-- Varbiable to hold the month number on which to PIVOT

DECLARE @MealMonth int

 

SET @MealMonth = 1 -- Using 1 to prepresent January

 

-- Common table expression to derive MealDay column

;

WITH HobbitMealDays AS

(

SELECT

        Hobbit

      , MealDate

      , DAY(MealDate) AS MealDay -- Using day datepart as integer value

      , MealCount

FROM Hobbit

WHERE MONTH(MealDate) = @MealMonth

)

 

SELECT

        Hobbit

      , [1] AS Day1 -- Alias to stand in for hard coded column headers

      , [2] AS Day2

      , [3] AS Day3

FROM 

(

SELECT 

        Hobbit

      , MealDay

      , MealCount

FROM HobbitMealDays) h

PIVOT

(

SUM(MealCount)

FOR MealDay IN

(

        [1] -- these values are known possible values of MealDay derived column

      , [2]

      , [3]

)

) AS pvt

 

Results of dynamic using @Mealmonth = 1 to represent January in PIVOT:

 

Hobbit               Day1        Day2        Day3

-------------------- ----------- ----------- -----------

Bilbo                4           5           3

Meriadoc             6           9           12

Samwise              4           8           7

 

Just to make the comparison easier, here are the hard coded January PIVOT results again.

Results of January hard coded PIVOT:

 

Hobbit               2009-01-01  2009-01-02  2009-01-03

-------------------- ----------- ----------- -----------

Bilbo                4           5           3

Meriadoc             6           9           12

Samwise              4           8           7

 

If you set @MealMonth = 2 and run the PIVOT again, you will get the February results:

Results of dynamic using @Mealmonth = 2 to represent February in PIVOT:

 

 

Hobbit               Day1        Day2        Day3

-------------------- ----------- ----------- -----------

Bilbo                6           7           3

Meriadoc             4           8           7

Samwise              4           8           7

 

Again, for ease of comparison, here are the hard coded February results again.

Results of February hard coded PIVOT:

 

Hobbit               2009-02-01  2009-02-02  2009-02-03

-------------------- ----------- ----------- -----------

Bilbo                6           7           3

Meriadoc             4           8           7

Samwise              4           8           7

 

-- Cleaning up

DROP TABLE Hobbit

GO

 

As you can see above, the “dynamic” PIVOT returns the exact same values as the hard coded versions but offers the flexibility of reporting on different periods without changing code. You could just reference a parameter in the CTE instead of a variable as I did.

Have fun.

Things You Know Now…

I was tagged by Jason Strate for cool blog chain. I guess I’m it now.

  

Get the Last Date in a Calendar Quarter

I needed to accurately calculate the end dates for the four most recently completed calendar quarters prior to the date provided. I found this snippet online and thought it was a really slick way to accomplish finding the quarter end date. I tried to find the site again so i could give props to the person who posted it but cannot. So, my clever friend, I salute thee.

Let’s start by declaring our date variable and setting it to a fine day in history. My parents told me this was my birthday. I don’t really remember but I will give them the benefit of the doubt.

I have dissected the expression below.

DECLARE @testdate date

 

SET @testdate = '1975-01-26'

 

SELECT

 

-- Get the number of calendar quarters since 0

        DATEDIFF(qq,0,@testdate) AS Step1

 

-- Add 1 quarter

      , DATEDIFF(qq,0,@testdate)+1 AS Step2

 

-- Add the number of quarters in Step2 above to 0 to get the earliest date of next quarter

      , DATEADD(qq, DATEDIFF(qq,0,@testdate)+1,0) AS Step3

 

-- Subtract 3 milliseconds to get to the last datetime value of current quarter

      , dateadd(ms,-3,DATEADD(qq, DATEDIFF(qq,0,@testdate)+1,0)) AS Step4

 

-- Since i was working in SQL 2008, which supports separate data types for DATE and TIME, 

-- I converted the result in Step4 to DATE

      , CONVERT(date,dateadd(ms,-3,DATEADD(qq, DATEDIFF(qq,0,@testdate)+1,0))) AS QuarterEndDate

The query results are as follows:

Step1       Step2       Step3                   Step4                   QuarterEndDate

----------- ----------- ----------------------- ----------------------- --------------

300         301         1975-04-01 00:00:00.000 1975-03-31 23:59:59.997 1975-03-31

 

So, with this nifty expression, we have determined the end date of the quarter for my birthday (no gifts necessary).

We can also change the number of quarters added (or even subtracted) in Step2 to get the quarter end date for future or past quarters, too.

Make room for FLOOR and CEILING

On the same report referenced above, I also had to implement a business rule by rounding a value up to the nearest thousand. A colleague suggested the CEILING function, which returns the smallest integer that is greater than or equal to the provided numeric expression. As it turn out, the FLOOR function does the exact opposite, returning the largest integer that is less than or equal to the numeric expression.

Let’s look at both functions in action.

-- Let's create a tiny table

CREATE TABLE Smurf

(

        Smurf varchar(10)

      , HeightInApples dec(2,1)

)

 

-- Populate the table with a small number of records

INSERT INTO Smurf VALUES('Brainy',3.1)

INSERT INTO Smurf VALUES('Hefty',3.5)

INSERT INTO Smurf VALUES('Painter', 3.7)

 

-- Watch these smurfy functions do their thing

SELECT

        Smurf

      , HeightInApples

      , FLOOR(HeightInApples) AS FloorValue

      , CEILING(HeightInApples) AS CeilingValue

FROM Smurf

 

Results:

 

Smurf      HeightInApples    FloorValue  CeilingValue

---------- ----------------- ----------- -------------

Brainy     3.1               3           4

Hefty      3.5               3           4

Painter    3.7               3           4

 

-- Cleaning up after yourself is smurftastic

DROP TABLE Smurf

Notice that all values were rounded down by the FLOOR function and up by the CEILING function.

For my report, I divided my input values by 1000, ran them through CEILING, and the multiplied by 1000. This rounded my values up to the nearest thousand. I just figured the example above would be a little more fun.

I am tagging Dan English

The Coolness of the T-SQL NTILE Function

I had need of calculating metrics across an industry and then returning the quartile in which each company scored. So, i wrote up a function that took in the MinValue, MaxValue, and CompanyMetricValue as inputs and returned the quartile for the Company. Great. Then, when asking  someone else about this using this new function, the response was, “Why not just use NTILE?” Ug.

Here is the description from SQL 2008 Books Online:

Distributes the rows in an ordered partition into a specified number of groups. The groups are numbered, starting at one. For each row,  NTILE returns the number of the group to which the row belongs.

NTILE (integer_expression)    OVER ( [ <partition_by_clause> ] < order_by_clause > )

So, using NTILE, I was able to do the following:

SELECT NTILE(4) OVER(ORDER BY CompanyMetricValue ASC) AS Quartile

In the example above, i want quarters, so I use 4 as the (integer expression).

Lesson Learned: Before inventing a new wheel, check Books Online for wheels that already exist. 🙂