One of the most handy features introduced in SQL 2005 were the
ranking functions; ROW_NUMBER(), RANK(), and DENSE_RANK(). For anyone
who hasn’t been introduced to these syntactic gems, here’s a quick
rundown (for those of you who are very familiar with these functions
already, feel free to read through, or skip right down to “There’s No
Such Thing as a Free Ride” below).
OK – so the general syntax for any one of these commands is more or less the same:
So from the example above, we can see that:
There’s No Such Thing as a Free Ride
Ranking functions are not only useful for simple ranking – they’re also great for solving complex problems. In fact, once you get to know them, you’ll find that you’re using them for waaaaaay more than just ranking. In anything from splitting strings to deleting duplicates, ranking functions are the cat’s meow.
But just like most things inlife
SQL Server, less is more, when you can get away with it. For example,
let’s say that you need to get the top order (by TotalDue) for each
Customer in AdventureWorks (AdventureWorks2008 in my examples below).
You can definitely use ROW_NUMBER (or any of the other ranking
functions, for that matter) to do this:
But there are some caveats to the MAX solution – most notably, how in the world can we get the top 5 orders for each customer? Well…the short answer is, we can’t. We need to change the query up, and in doing so, we’re once again going to incur a sort. Once we get beyond a query that the MAX or MIN tricks can satisfy – for instance, if we need to fetch the top 5 orders for each customer, we may as well take advantage of the ease of coding, and the improved readability of the Row_Number solution. If we want a solution for the “top 5” problem without invoking a ranking function, we’re going to end up with something like this:
Now that said, my tests are against a single data set only, and based upon the nature of your data, your mileage may vary. As a general rule, I would use aggregate functions if I’m only looking for the highest or lowest data point in a series, and a ranking function for anything that can’t be solved by simple aggregation.
OK – so the general syntax for any one of these commands is more or less the same:
ROW_NUMBER() OVER ([<partition_by_clause>] <order_by_clause>) RANK() OVER ([<partition_by_clause>] <order_by_clause>) DENSE_RANK() OVER ([<partition_by_clause>] <order_by_clause>)So the PARTITION BY part is optional, but everything else is required. An example of a non partitioned, and then a partitioned ROW_NUMBER() clause are listed below:
ROW_NUMBER() OVER (ORDER BY TotalDue DESC) ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY TotalDue DESC)The difference between the three functions is best explained using an example. Here’s the data I’m using for this example, in case you want to follow the bouncing ball at home ;-)
CREATE TABLE OrderRanking ( OrderID INT IDENTITY(1,1) NOT NULL, CustomerID INT, OrderTotal decimal(15,2) ) INSERT OrderRanking (CustomerID, OrderTotal) SELECT 1, 1000 UNION SELECT 1, 500 UNION SELECT 1, 650 UNION SELECT 1, 3000 UNION SELECT 2, 1000 UNION SELECT 2, 2000 UNION SELECT 2, 500 UNION SELECT 2, 500 UNION SELECT 3, 500I’ll use the following (admittedly ugly) query to demonstrate the difference between each function:
SELECT *, ROW_NUMBER() OVER (ORDER BY OrderTotal DESC) AS RN, ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderTotal DESC) AS RNP, RANK() OVER (ORDER BY OrderTotal DESC) AS R, RANK() OVER (PARTITION BY CustomerID ORDER BY OrderTotal DESC) AS RP, DENSE_RANK() OVER (ORDER BY OrderTotal DESC) AS DR, DENSE_RANK() OVER (PARTITION BY CustomerID ORDER BY OrderTotal DESC) AS DRP FROM OrderRanking ORDER BY OrderTotal DESCExcuse the terrible aliases. Anything longer and the code snippets and output in this blog entry get really, really ugly. When we run the query, this is what we get:
So from the example above, we can see that:
- ROW_NUMBER() assigns sequential numbers to each partition in a result set (an unpartitioned result set simply has a single partition), based upon the order of the results as specified in the ORDER BY clause. If you look carefully, you’ll see that the values in column RN are based upon a simple sort of TotalDue, while the values in Column RNP (Row_Number partitioned) are first partitioned or “grouped” by CustomerID, and then numbered by TotalDue, with the row number resetting on change of customer.
- Contrary to popular belief, RANK() does not sort rows based upon how bad they smell. RANK() does much the same thing as ROW_NUMBER(), only it acknowledges ties in the columns specified in the ORDER BY clause, and assigns them the same rank. Where a tie occurs (as was the case for orders 6/3, and 1/5/8), the numbers that would otherwise have been “used up” are skipped, and numbering resumes at the next available number. As you can see, RANK() leaves a gap whenever there is a tie.
- DENSE_RANK() doesn’t like gaps. It’s more of an Abercrombie & Fitch kind of function (ba-dum-ching!). Ohhhhh…that was terrible. My sense of humour may give me up for lent. You might follow it. Anyway….DENSE_RANK() “fills in the gaps”. It starts from the next number after a tie occurs, so instead of 1, 2, 3, 3, 5 you get 1, 2, 3, 3, 4.
There’s No Such Thing as a Free Ride
Ranking functions are not only useful for simple ranking – they’re also great for solving complex problems. In fact, once you get to know them, you’ll find that you’re using them for waaaaaay more than just ranking. In anything from splitting strings to deleting duplicates, ranking functions are the cat’s meow.
But just like most things in
SELECT soh.* FROM (SELECT CustomerID, SalesOrderID, TotalDue, ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY TotalDue DESC) AS RowNumber FROM Sales.SalesOrderHeader) AS soh WHERE soh.RowNumber = 1The WHERE soh.RowNumber = 1 restricts our results to the top order for each customer. Lovely. And the really beautiful thing about this is, if you need the top 2 orders for each customer, or 3, or 4, or x, all you need to do is replace the = 1 with <=2 (for example), and you’re good to go. Now, with that in mind, let’s look at this query:
SELECT soh.CustomerID, soh.TotalDue FROM Sales.SalesOrderHeader soh JOIN (SELECT CustomerID, MAX(TotalDue) AS MaxTotalDue FROM Sales.SalesOrderHeader GROUP BY CustomerID) AS ttls ON soh.CustomerID = ttls.CustomerID AND soh.TotalDue = ttls.MaxTotalDueIf you plug this bad boy in, and run it, you might be surprised by the outcome. It’s actually about half as expensive as the Row_Number solution – but why? Well, as you may or may not know, sorts can be very, very expensive in SQL Server. If we’re only fetching the highest $ sales order for a given customer, the MAX solution does it without a sort, whereas the ROW_NUMBER solution needs to sort (the ORDER BY clause is mandatory, remember).
But there are some caveats to the MAX solution – most notably, how in the world can we get the top 5 orders for each customer? Well…the short answer is, we can’t. We need to change the query up, and in doing so, we’re once again going to incur a sort. Once we get beyond a query that the MAX or MIN tricks can satisfy – for instance, if we need to fetch the top 5 orders for each customer, we may as well take advantage of the ease of coding, and the improved readability of the Row_Number solution. If we want a solution for the “top 5” problem without invoking a ranking function, we’re going to end up with something like this:
SELECT soh.CustomerID, soh.TotalDue FROM Sales.SalesOrderHeader soh WHERE soh.SalesOrderID IN (SELECT TOP 5 SalesOrderID FROM Sales.SalesOrderHeader soh2 WHERE soh2.CustomerID = soh.CustomerID ORDER BY TotalDue DESC)Which in this case is a very, very crappy alternative to a ranking function. Not only is it uglier, but the query plan isn’t nearly as efficient, and the execution times were consistently about 20% longer in my tests.
Now that said, my tests are against a single data set only, and based upon the nature of your data, your mileage may vary. As a general rule, I would use aggregate functions if I’m only looking for the highest or lowest data point in a series, and a ranking function for anything that can’t be solved by simple aggregation.
No comments:
Post a Comment