Thursday, March 13, 2014

Set Theory Mathematics: The Problem with Sums of Sums and Averages of Averages

Recently, it has come to my attention that the reports that have been written prior to my coming have an itsy bitsy flaw. Let me explain.

In one particular report, we count groups of people carrying out actions. We then add and subtract the counts of those actions to get a total number of people doing aggregate actions.

I'm going to make up a hypothetical example. Let's say I'm counting the actions of grocery shoppers who bought certain items.

  • 32 people bought paper towels
  • 17 people bought napkins
  • 12 people bought chicken 
  • 5 people bought crackers
Now, let's say I want to know how many people bought paper products. The answer should be easy. 

32 + 17 = 49

Right? That would be true if the sets were mutually exclusive, like the Venn diagram to the left. That means the sets don't overlap ever.

Unfortunately, that is most likely incorrect! Why? A person can very easily buy both napkins and paper towerls.

Let's say I failed to mention this fact:
  • 7 of the people who bought napkins also bought both napkins and paper towels
So the more accurate answer is:

32 + 17 - 7 = 42

Why does this happen? Because of the count of counts and the average of averages and the sum of sums (or any combination thereof) are not the same! It's a mathematical fact of life, and we need to take it into account whenever we're dealing with sets. 

Sets, like scalar values, can be "added together" and "subtracted" in a manner of speaking. There are lot of special terms and symbols used in set mathematics that I won't get into here today. What is relevant now is how this all applies to database sets, otherwise known as tables.

If you want to get the proper count of two intersecting (or exceptional) tables, you will need to use the set operators, UNION, INTERSECT, and EXCEPT. 

If you want to (distinctly) add two tables together, use UNION. If you want to get the rows where the tables intersect, use INTERSECT. Finally, if you want to subtract two tables, use EXCEPT.

Once you've completed your set operation, then you can summarize it. But if you want accurate numbers, never summarize before the set operation.

Here's some pseudo-sql as an example:

 COUNT(customer_id) AS 
   product = 'paper towels' 


   product = 'napkins'

) unioned_products

The reason this works is because duplicate customer_id values are thrown away. UNION works by sorting both tables and then merging them together, throwing away any duplicates that it finds. In the second Venn diagram above, people who bought both napkins and paper towels were doubly counted when in fact they should have been singly counted.

Rest assured, this is an imagined scenario that came about because of a real situation. Our reports at work were most definitely reporting the wrong numbers and now we have to go back and rewrite a lot of them with this specific problem in mind.

What will you do the next time your CEO comes to you and asks you how -3 people bought your products this month? Embarassing. Just tell 'em "I inherited the report from the last report writer." Then fix it as fast as you can! 

No comments:

Post a Comment