We've all made big mistakes in our career - and so we all know that one thing worse than making a big mistake, is making a big mistake that goes undetected for a long time!
The following is an excerpt from a lesson plan in our SetFocus SQL Server and Business Intelligence Master's Programs.
Consider the following: suppose we have 3 tables in a SQL Server database: JobMaster, JobHours, and JobMaterials. They represent a simple example of multiple one-to-many relationships. A single job in JobMaster can have multiple timesheet rows (or none at all) in JobHours, and multiple material purchase rows (or none at all) in JobMaterials.
So our tables might look something like this:
So there are four possibilities:
- A job can have both hours and material purchases (Job A)
- A job can have hours but no material purchases (Job B)
- A job can have material purchases but no hours (Job C)
- A job (likely a new job) may not have data in either child table (Job D)
Now suppose we need to write a query that retrieves each Job in JobMaster, along with the sum of Hours Worked (from JobHours) and the sum of Material Purchases (from JobMaterials). The result set should show all jobs - even those with no child data, and would look something like this:
JobDescription TotalHoursWorked TotalPurchaseAmount
Many SQL developers, recognizing that a LEFT JOIN is required to show all jobs, might write the code as follows (below). The code performs a LEFT JOIN between the parent table and two child tables, and aggregates (SUMs) the HoursWorked and PurchaseAmount from the two tables. This produces the result set in the result set window (below).
At first glance, the output above appears correct.
- Job B did not have material purchase amounts, so we see a NULL for the aggregation, and the correct number of 26 hours (7 + 7 + 6 + 6) for the aggregation of hours worked.
- Job C did not have hours worked, so we see a NULL for the aggregation, and the correct number of $30,000 ($10,000 + $20,000) for the aggregation of material purchases.
- Job D is a new job and doesn't have any child data, so we see NULLs for both aggregations.
At this point, some might be tempted to "bless" the code and results. Surely, the one Job we didn't test (with both hours and purchase amount) is the simplest scenario, and "has" to be correct, right? :)
Well, you probably see where this is going! Take a closer look at the raw data for Job A.
- The output shows 60 hours, though the raw data contains 30 hours ( 8 + 8 + 7 + 7)
- The output shows $12,000 in purchase amount, though the raw data contains $3,000 worth ($1,000 + $2,000). Talk about double (and quadruple) counting!
Further, we notice something else - there were 4 Job Hour rows for Job A, and the material purchase aggregation was off by a factor of 4.
Additionally, there were 2 Material Purchase rows for Job A, and the Job hour aggregation was off by a factor of 2.
COINCIDENCE? No, it's not a coincidence. As it turns out, when we use multiple JOIN statements (either INNER or OUTER joins), and try to perform multiple aggregations on multiple one-to-many relationships, you risk overstating the aggregations. In this case, the 2 instances of a JOIN between JobMaster and JobMaterials (for Job A) wind up joining 2 instances of Job A to the Job Hours table - thus the double-counting of hours. Likewise, the 4 instances of a JOIN between JobMaster and JobHours (for Job A) wind up joining 4 instances of Job A to the Job Materials table - thus overstating the aggregation of material purchases by a factor of 4.
I know developers who have written the same general code as the code above - and didn't realize until much later (usually, when a user or customer discovered high number) that this was a problem waiting to happen.
OK, so we've uncovered the problem - what's the solution?
Before we look at a solution, we need to realize one thing - just like a single long code listing, it's often difficult (or impossible) to solve a problem in one self-contained SQL statement. In this specific situation, we need to "pre-aggregate" JobHours and JobMaterials independently of each other, in some type of subquery. Then we can JOIN the pre-aggregations to the JobMaster as a 1-1 relationship.
In the code below, we can use the Common Table Expression capability in SQL Server 2005 to create 2 CTEs that pre-aggregate Hours and Materials. Then in the next line of code, we perform LEFT JOINs between JobMaster and the 2 CTEs, and that produces the correct results.
Of course, there's often more than 1 way to write a query. Some might prefer to write a 2 correlated subqueries, each of which retrieves a single aggregated amount for every job in the outer query (see below). Of course, there are always debates about whether a correlated query might run a bit slower and might take up additional resources. We cover these issues in the Master's Program classes.
If there's a moral of this story, it's that sometimes our safest assumption might be the very root of the problem!