4 Apr 2013

The trouble of combining DISTINCT with ORDER BY

Motivation


Yesterday one of my coworkers from India asked for my help in migrating the following SQL statement from Sybase to DB2.


SELECT DISTINCT c1, c2 FROM t ORDER BY c3

When this statement is run in DB2 (assuming the appropriate definition of "T") the following error is being returned:

SQL0214N An expression in the ORDER BY clause in the following position, or starting with "C3" in the "ORDER BY" clause is not valid. Reason code = "2".

Why does DB2 raise this error?

The reason is simple: It's not self evident what the answer should be!

To look into this issue a bit close and come up with a possible interpretation and implementation of the request we look at an example

Guestbook


My pals and I like to go out for lunch to a chine mall next to IBM. Most of the time we go the same vendor, so you might say we are regulars. The ladies taking the orders know us, and if one of us doesn't join we get a jovial: "Where is your friend today?"

As a business it pays to keep track of your regular guests. If they don't come for a while the reasons could be innocuous (vacation, sickness, travel etc) or they could be relevant to you as the owner. Maybe the food isn't as good anymore, or the menu has gotten boring. So a reasonable query may be:

Who are the guests I haven't seen for the longest time?

Here is the table and some data to illustrate this example:


CREATE TABLE guests(name VARCHAR(20), firstname VARCHAR(20), visit DATE);
INSERT INTO guests VALUES
('Jones'   , 'Jeremy'  , '2012-04-13'), 
('Beaver'  , 'Boris'   , '2012-04-14'),
('Gervais' , 'Gerald'  , '2012-04-13'),
('Jones'   , 'Jeremy'  , '2012-04-15'),
('Beaver'  , 'Boris'   , '2012-04-15'),
('Chambers', 'Charlene', '2012-04-15'),
('Dreyfus' , 'Don'     , '2012-04-14'),
('Jones'   , 'Jeremy'  , '2012-04-14'),
('Eakes'   , 'Edwin'   , '2012-04-14'),
('Fox'     , 'Frauke'  , '2012-04-14'),
('Dreyfus' , 'Don'     , '2012-04-13');

It is easy enough to provide a sorted list of customers with the oldest visits showing first:


SELECT  name, firstname FROM guests ORDER BY visit;
NAME                 FIRSTNAME           
-------------------- --------------------
Jones                Jeremy              
Gervais              Gerald              
Dreyfus              Don                 
Beaver               Boris               
Dreyfus              Don                 
Jones                Jeremy              
Eakes                Edwin               
Fox                  Frauke              
Jones                Jeremy              
Beaver               Boris               
Chambers             Charlene            

11 rows were retrieved.

Well, we now know that Jeremy visited a long time ago, but he actually keeps coming back.
Gerald, on the other hand came only once and never returned.
What happens when we add a DISTINCT to eliminate the duplicates?


SELECT  DISTINCT name, firstname FROM guests ORDER BY visit;
SQL0214N  An expression in the ORDER BY clause in the following position, or starting with "VISIT" in the "ORDER BY" clause is not valid.  Reason code = "2".

Right... we knew this would happen.
To get this query to compile we need to separate the ORDER BY from the DISTINCT.


SELECT DISTINCT name, firstname 
  FROM (SELECT  name, firstname FROM guests ORDER BY visit);
NAME                 FIRSTNAME           
-------------------- --------------------
Beaver               Boris               
Chambers             Charlene            
Dreyfus              Don                 
Eakes                Edwin               
Fox                  Frauke              
Gervais              Gerald              
Jones                Jeremy              

7 rows were retrieved.

The ORDER BY got lost in the outer query. This is to be expected and according to the rules of SQL.
The output is actually sorted by name which is a side effect of the DISTINCT processing.
We could flip the order around and first do DISTINCT, but then we would have to include the "VISIT" column which would only cause us to eliminate duplicate same-day visits.

So, are we stuck? Not at all!
There are two ways to attack this problem. One without and one with OLAP:

Going old school


The classic approach requires us to do a join. We will look up the most recent visit per customer and eliminate all other rows of that same customer:


SELECT name, firstname FROM guests g 
  WHERE visit IN (SELECT MAX(visit) FROM guests 
                   WHERE g.name = guests.name AND
                         g.firstname = guests.firstname)
  ORDER BY visit;

NAME                 FIRSTNAME           
-------------------- --------------------
Gervais              Gerald              
Dreyfus              Don                 
Eakes                Edwin               
Fox                  Frauke              
Jones                Jeremy              
Beaver               Boris               
Chambers             Charlene            

7 rows were retrieved.

That did the job, but it requires a nested loop join.
Although an index on (name, firstname, visit DESC) might to wonders to speed up this query.

OLAP to the rescue


Another approach to solve the problem uses some very basic OLAP function.
Essentially we will number each persons visits from the newest to the oldest starting with "1" and then eliminate everything but the first entry per person.
If that sounds familiar you are right. In this blog I did use the technique before for de-duping.
In the end this is precisely what we are doing here.


SELECT name, firstname 
  FROM(SELECT ROW_NUMBER() OVER(PARTITION BY name, firstname ORDER BY visit DESC) AS rn, 
     name, firstname, visit
         FROM guests)
  WHERE rn = 1
  ORDER BY visit;
NAME                 FIRSTNAME           
-------------------- --------------------
Gervais              Gerald              
Dreyfus              Don                 
Eakes                Edwin               
Fox                  Frauke              
Beaver               Boris               
Chambers             Charlene            
Jones                Jeremy              

7 rows were retrieved.

So how does this relate to DISTINCT and ORDER BY again?


Note how in the previous examples I used MAX(visit) and ORDER BY visit DESC?
That was appropriate for the business problem.
We picked the most recent version of the (name, firstname) tuple for the distinct processing.
But we could have taken the oldest (MIN(visit) and ORDER BY visit ASC respectively).
In fact a smart DBMS could process these semantics much faster since the same ordering used for ROW_NUMBER can be used to produce the result set.
We could also legally have taken the "first" or "last" row touched.
The SQL Standard does not tell which semantics would be right, so it's safer to return an error, then what might be perceived as a wrong result.

No comments:

Post a Comment