Thursday, July 15, 2010

Counting Parents and Children with Count Distinct

The aggregate functions in SQL Server (min, max, sum, count, average, etc.) are great tools for reporting and business analysis. But sometimes, you need to tweak them just a little bit to get exactly the results you need. For example, if your manager came to you and asked for a report on how many sales have been made to your clients and how large they were, would you know how to get the data you need efficiently? Mark ran into something like this recently and here's the approach he took to solve the problem.

A typical, simple table structure for tracking sales orders includes a Customer table with name and address information, an OrderHeader table with the Customer's ID, Order Date and other details about the order as a whole, and an OrderDetail table with the detail line items for the order. To simplify our example, we will eliminate the separate Customer table and just focus on the OrderHeader and OrderDetail tables.

To set the stage, run the BuildSampleTables.sql script. Now that we have some sample data to work with, let's discuss our options. We're looking for three measurements. First, how many orders have been placed by each client? Second, how many total line items have been ordered by each client? And third, how much money do all those orders add up to?
You could run separate queries on the OrderHeader and OrderDetail tables, but that would be more work than is necessary, and besides, it makes you look like a rookie. Not to mention that it does not lend itself toward use in a SQL Reporting Services report, which you know they're going to ask for sooner or later. So, you figure there must be a way to answer it all in one query. It briefly crosses your mind that you can probably get the results using a subquery, but you know that can have serious performance issues. So let's see what else we can come up with.
You immediately recognize that you are looking for some record counts and sums, based on two related tables, organized by Customer, so we're probably looking at a GROUP BY statement and a JOIN. Let's start forming the basics and see how close we get to our goal. Our initial attempt might look like the following:
COUNT(H.OrderID) as OrderCount,
COUNT(D.DetailID) as LineCount,
SUM(D.LineAmount) as TotalAmount
OrderHeader H
JOIN OrderDetail D ON H.OrderID = D.OrderID

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
CustomerName              OrderCount  LineCount   TotalAmount
------------------------- ----------- ----------- -----------
ABC Manufacturing                   9           9      387.93
ACME Supplies                       2           2      115.81
SQLTeam                            12          12      862.77
This is a good start, but as we look at the results, we see that the OrderCount and the LineCount are the same. Now that would imply that each order had only one line item which seems highly unlikely, and a quick review of the data proves it is not so. And on top of that, the Order Count looks unreasonably high. The reason for this is that when you join a child table to its parent, in a one-to-many relationship, the fields of the parent table are repeated for each child row, thus expanding the total number of rows to be that of the number of children, with repeating OrderIDs. You can confirm this by removing the GROUP BY and the aggregate functions.
Seeing the repeating OrderIDs gives you an idea -- if only there were some way to count only the distinct OrderHeader record IDs. And sure enough, that is exactly what we're going to do. We can insert the word DISTINCT inside the COUNT function of the header record OrderID, and we end up with:
H.CustomerName,COUNT(DISTINCT H.OrderID) as OrderCount,COUNT(D.DetailID) as LineCount,SUM(D.LineAmount) as TotalAmountFROM
OrderHeader HJOIN OrderDetail D ON H.OrderID = D.OrderIDGROUP BY

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

CustomerName              OrderCount  LineCount   TotalAmount
------------------------- ----------- ----------- -----------
ABC Manufacturing                   3           9      387.93
ACME Supplies                       1           2      115.81
SQLTeam                             2          12      862.77 
And now, we see that the OrderCount is accurate, and the other numbers remained the same. A simple and clean solution, just the way we like them. For added neatness, you could add an ORDER BY H.CustomerName clause to clean up the display, or at least do some sorting in your final report.

No comments:

Post a Comment