A couple of days ago, Aaron Bertrand posted about a method for calculating medians in SQL Server 2005 using the ROW_NUMBER function in conjunction with the COUNT aggregate. This method (credited to Itzik Ben-Gan) is interesting, but I discovered an even better way to attack the problem in Joe Celko's Analytics and OLAP in SQL.
Rather than using a COUNT aggregate in conjunction with the ROW_NUMBER function, Celko's method uses ROW_NUMBER twice: Once with an ascending sort, and again with a descending sort. The output rows can then be matched based on the ascending row number being within +/- 1 of the descending row number. This becomes clearer with a couple of small examples:
|
A |
1 |
5 |
|
B |
2 |
4 |
|
C |
3 |
3 |
|
D |
4 |
2 |
|
E |
5 |
1 |
In the first table (even number of rows), the median rows are B and C. These can be matched based on [Ascending Column] IN ([Descending Column] + 1, [Descending Column] - 1). In the second table (odd number of rows), the median row is C, which is matched where [Ascending Column] = [Descending Column]. Note that in the second table, the match criteria for the first table does not apply -- so the generic expression to match either case is the combination of the two: [Ascending Column] IN ([Descending Column], [Descending Column] + 1, [Descending Column] - 1).
We can apply this logic within the AdventureWorks database to find the median of the "TotalDue" amount in the Sales.SalesOrderHeader table, for each customer:
|