Space Vatican

Ramblings of a curious coder

Counting on Both Hands

This is more of an SQL hint than anything else, but it’s something I’ve found useful a number of times. If SQL scares you, now’s the time to turn back (or crack open a beer – dutch courage).

More often than not if I’m not displaying or editing something then I’m counting. How many unpaid invoices are there, how many customers to I have who were active in the last 6 months etc…

I’m sure I’m not teaching anyone anything if I say that Rails has some helpers for this, for example

1
2
3
Invoice.count :all, :conditions => {:status => 'unpaid'}
Customer.count :all, :conditions => ["updated_at > ?", 6.months.ago]
Invoice.sum :total, :conditions => {:status => 'unpaid'}

From top to bottom this counts the number of unpaid invoices, the number of customers updated in the last 6 months and the total value of all unpaid invoices. Pretty much any option you can stick in a find you can also use with the calculation helpers, for example:

1
2
  Customer.sum :weight, :joins => {:orders => :products}
  Customer.sum :weight, :joins => {:orders => :products}, :group => 'customers.id'

This sums the weight of all the items ordered by our customers (ignoring for now the quantity of a product in a given order). The second example groups this by customer.

But what if you want to sum (or count) more than one thing? For example I might want to know the number and the total value of the unpaid invoices. You could of course just do one and then the other but that feels a little wasteful: we’re asking the database to scan over the corresponding invoices once and then we turn around and ask it do it again. Luckily with some raw sql it’s not hard to do this in one go:

1
  connection.select_all "SELECT count(*), SUM(total) from invoices where status='unpaid'"

But what if you wanted to count more than one thing? For example I might want a count of all outstanding invoices and a count of those with non trivial value (say more than $10). Again easy enough to write as two queries, but it would be nice to get it all back in one go.

The key to this is understanding the link between summation and counting. To use a technical term, we’re interested in indicator functions. If we have: – a set X (which in our terms corresponds to all the rows in a table (or to be more correct, all the rows your query would return if it had no WHERE clause) – a subset A (the rows our WHERE clause would select). – a function I such that I(x) is 1 if x is in A and 0 if not

Then the cardinality (the number of things in it) of A is the sum of I(x) for x ranging over X.

That sounds complicated, but if you think about it, all it is saying is that to count the rows in a table you could use

1
SELECT SUM(1) from foos

Here our indicator function is dead simple: it always returns 1.

These three queries return the same thing

1
2
3
  SELECT COUNT(*) from invoices where status='pending'
  SELECT SUM(1) from invoices where status='pending'
  SELECT SUM( IF(status='pending', 1, 0) from invoices

The third form is the one we’re interested in. Again it shouldn’t be hard to spot why it works: for each row that matches we count 1, for all others we count 0. When we add these all up we’re just going to get the number of times we counted 1, i.e. the number of matching rows. We wouldn’t want to use that on its own (it won’t use an index and COUNT(*) probably takes some shortcuts) but in the context of our problem it’s just what we need:

1
  SELECT COUNT(*) as count, SUM(IF(total<10, 1, 0)) as small_invoices where status = 'pending'

will return the number of unpaid invoices and the number of invoices whose total is less than 10 dollars. Here our IF functions are our indicator functions: they return 1 if the condition evaluates to true and 0 if not.

Is it actually useful?

It’s going to depend on your data and queries. The ones shown here are probably too simple for it to be of any use since they also had to be easy to explain. When you push some of your conditions from the WHERE clause into the IF statements you are effectively stopping the database from using any indexes to solve those conditions. This can hurt you, but obviously if you weren’t using (or don’t have) indexes that the database could have used for those conditions then you haven’t lost anything.

The other basic premise is that if the database going over some set of data it might as well be counting more than one thing. So if in order to find rows satisfying condition A the database needs to scan some subset X and in order to satisfy condition B the database also needs to scan that same subset X then you’re onto a winner. If on the other hand the two are completely distinct (or if X is hopelessly big) then you won’t be saving much.

Typically I’ve used this the most in reporting style applications where having applied a common set of conditions I want to count the number of occurrences of a large number of features. Not something to be using willy-nilly, but a neat trick to have in your toolbox. Use it wisely. And as with all performance things, don’t do things blindly just because you read somewhere that it’s faster. Profile, measure and so on before and after and come to your own conclusions.