In one particular report, we count groups of people carrying out actions. We then add and subtract the counts of those actions to get a total number of people doing aggregate actions.
I'm going to make up a hypothetical example. Let's say I'm counting the actions of grocery shoppers who bought certain items.
- 32 people bought paper towels
- 17 people bought napkins
- 12 people bought chicken
- 5 people bought crackers
Unfortunately, that is most likely incorrect! Why? A person can very easily buy both napkins and paper towerls.
Let's say I failed to mention this fact:
- 7 of the people who bought napkins also bought both napkins and paper towels
SELECT COUNT(customer_id) AS FROM ( SELECT customer_id FROM products WHERE product = 'paper towels' UNION SELECT customer_id FROM products WHERE product = 'napkins' ) unioned_products
The reason this works is because duplicate customer_id values are thrown away. UNION works by sorting both tables and then merging them together, throwing away any duplicates that it finds. In the second Venn diagram above, people who bought both napkins and paper towels were doubly counted when in fact they should have been singly counted.
Rest assured, this is an imagined scenario that came about because of a real situation. Our reports at work were most definitely reporting the wrong numbers and now we have to go back and rewrite a lot of them with this specific problem in mind.
What will you do the next time your CEO comes to you and asks you how -3 people bought your products this month? Embarassing. Just tell 'em "I inherited the report from the last report writer." Then fix it as fast as you can!