Counting on Both Hands

Aug 17th, 2008

This is more of an SQL hint than anything else, but it’s something I’ve found useful a number of times. If SQL scares you, now’s the time to turn back (or crack open a beer - dutch courage).

More often than not if I’m not displaying or editing something then I’m counting. How many unpaid invoices are there, how many customers to I have who were active in the last 6 months etc…

I’m sure I’m not teaching anyone anything if I say that Rails has some helpers for this, for example

Invoice.count :all, :conditions => {:status => 'unpaid'}
Customer.count :all, :conditions => ["updated_at > ?", 6.months.ago]
Invoice.sum :total, :conditions => {:status => 'unpaid'}

From top to bottom this counts the number of unpaid invoices, the number of customers updated in the last 6 months and the total value of all unpaid invoices. Pretty much any option you can stick in a find you can also use with the calculation helpers, for example:

  Customer.sum :weight, :joins => {:orders => :products}
  Customer.sum :weight, :joins => {:orders => :products}, :group => 'customers.id'

This sums the weight of all the items ordered by our customers (ignoring for now the quantity of a product in a given order). The second example groups this by customer.

But what if you want to sum (or count) more than one thing? For example I might want to know the number and the total value of the unpaid invoices. You could of course just do one and then the other but that feels a little wasteful: we’re asking the database to scan over the corresponding invoices once and then we turn around and ask it do it again. Luckily with some raw sql it’s not hard to do this in one go:

  connection.select_all "SELECT count(*), SUM(total) from invoices where status='unpaid'"

But what if you wanted to count more than one thing? For example I might want a count of all outstanding invoices and a count of those with non trivial value (say more than $10). Again easy enough to write as two queries, but it would be nice to get it all back in one go.

The key to this is understanding the link between summation and counting. To use a technical term, we’re interested in indicator functions. If we have: - a set X (which in our terms corresponds to all the rows in a table (or to be more correct, all the rows your query would return if it had no WHERE clause) - a subset A (the rows our WHERE clause would select). - a function I such that I(x) is 1 if x is in A and 0 if not

Then the cardinality (the number of things in it) of A is the sum of I(x) for x ranging over X.

That sounds complicated, but if you think about it, all it is saying is that to count the rows in a table you could use

SELECT SUM(1) from foos

Here our indicator function is dead simple: it always returns 1.

These three queries return the same thing

  SELECT COUNT(*) from invoices where status='pending'
  SELECT SUM(1) from invoices where status='pending'
  SELECT SUM( IF(status='pending', 1, 0) from invoices

The third form is the one we’re interested in. Again it shouldn’t be hard to spot why it works: for each row that matches we count 1, for all others we count 0. When we add these all up we’re just going to get the number of times we counted 1, i.e. the number of matching rows. We wouldn’t want to use that on its own (it won’t use an index and COUNT(*) probably takes some shortcuts) but in the context of our problem it’s just what we need:

  SELECT COUNT(*) as count, SUM(IF(total<10, 1, 0)) as small_invoices where status = 'pending'

will return the number of unpaid invoices and the number of invoices whose total is less than 10 dollars. Here our IF functions are our indicator functions: they return 1 if the condition evaluates to true and 0 if not.

Is it actually useful?

It’s going to depend on your data and queries. The ones shown here are probably too simple for it to be of any use since they also had to be easy to explain. When you push some of your conditions from the WHERE clause into the IF statements you are effectively stopping the database from using any indexes to solve those conditions. This can hurt you, but obviously if you weren’t using (or don’t have) indexes that the database could have used for those conditions then you haven’t lost anything.

The other basic premise is that if the database going over some set of data it might as well be counting more than one thing. So if in order to find rows satisfying condition A the database needs to scan some subset X and in order to satisfy condition B the database also needs to scan that same subset X then you’re onto a winner. If on the other hand the two are completely distinct (or if X is hopelessly big) then you won’t be saving much.

Typically I’ve used this the most in reporting style applications where having applied a common set of conditions I want to count the number of occurrences of a large number of features. Not something to be using willy-nilly, but a neat trick to have in your toolbox. Use it wisely. And as with all performance things, don’t do things blindly just because you read somewhere that it’s faster. Profile, measure and so on before and after and come to your own conclusions.

When Ducks Go Mutant

Aug 9th, 2008

Ruby is a permissive language. In general we don’t care what objects are, as long as they respond in certain ways to certain methods: “If it walks like a duck and quacks like a duck, I would call it a duck”.

As you may recall, in Rails 2.1 there was a rewrite of the :include mechanism, however the old mechanism persists for compatiblity reasons. When using the new mechanism, the :included associations are not joined, and so if any part of your query looks references the tables that would formerly have been joined it won’t work.

To work around this Rails looks at the conditions, order, select statements and so on to see if any of them mention tables other than the main table. If they do then the fallback to the old code is triggered and everything works fine. The code that detects tables in the order clause looks something like

   return [] unless order && order.is_a?(String)
   order.scan(/([\.\w]+).?\./).flatten

The code is quite simple: if we ever have “something.” then that means we’re using the something table.

The code that eventually adds the order to the statement looks something like

  sql << " ORDER BY #{order}"

The code for the other options is similar.

So what’s the problem here? If as the api docs indicate, you pass a string containing a fragment of sql then nothing at all is wrong. However some people (I assume that the :conditions option is the origin of this habit) have taken to doing things like

  Foo.find :all, :order => ['bars']

Before 2.1, this happens to work, because the default to_s on an array just joins the strings together. However the table scanning code won’t scan an array (My guess is that the explicit check for string was because people quite legitimately write :order => :name and things like that). So if you’ve got a select or order clause specified as an array that depends on an included able it will break when you move to rails 2.1.

It’s a complete accident that this ever worked (and it breaks if you were to try anything like :order =>[‘name desc’, ‘age desc’]), but that isn’t a huge amount of comfort when code that has been working suddenly stops working. You could probably waste a lot of time before working out that it was specifying the order option as an array, which obviously is not a good thing (and makes people scared of upgrading). On the other hand it’s hard to anticipate how people will use things and explicitly checking types and so on isn’t a very rubyish thing to do and could get in the way of legitimate uses.

I’m not sure how a framework provider should handle this in the general case. It’s a delicate balance between not stifling some of flexibility ruby offers and helping programmers not rely on things that only work by accident.

Nested Includes and Joins

Aug 8th, 2008

Nested eager loads are a not entirely obvious bit of rails syntax. It’s not hard once you get it though. In a nutshell you have to tell ActiveRecord how it should walk through the associations (i.e. just listing them isn’t enough). Before we get going it’s worth pointing out that although I’ve written :include everywhere, everything I’ve said applies equally to :include and :joins (see my previous post on the difference between the two).

When you are nesting includes, you’re building up a data structure that is inherently recursive. There are 3 rules for nested :includes: - Always use the association name. Not the table name, not the class name but the association name (whatever it is that you typed just after belongs_to, has_many etc…). A correlation is that if you don’t have all your associations set up, you’re dead in the water. If you’ve got one side of a relationship Rails won’t infer the other for you. - If you want to load multiple associations from a model, use an array. - If you want to load an association and some of its child associations then use a hash. The key should be the parent association name, the value should be a description of the child associations.

Now just combine those 3 rules and apply recursively. As an aid here’s a quick snippet that takes an include option (such as [:comments, {:posts => :authors}]) and describes which associations are loaded. The structure is the same as the code in activerecord that handles the :include option, so it should give some insight into how things work.

def describe associations, from = 'base'
  case associations
  when Symbol then puts "load #{associations} from #{from}"
  when Array then associations.each {|a| describe a, from}
  when Hash then
    associations.each do |parent, child|
      raise "Invalid hash - key must be an association name" unless parent.is_a?(Symbol)
      describe parent, from
      describe child, parent
    end
  end
end

This code isn’t too hard to understand. It’s all about the class of the associations parameter - The easy case is if it’s a string or a symbol: what we’ve got is just the name of an association, so just go ahead and load it. - If what we’ve got is an array, then just call ourselves recursively on the contents of that array. - if what we’ve got is a hash, then for each key value pair: - Load the association specified by the key (the parent association) - Load the associations specified by the value (from the from the parent association).

It produces (ugly) output like this:

  describe [{:comments => :user}, :category]
load comments from base
load user from comments
load category from base

which is exactly what activerecord would do. If you see something like “load user from comments” but instances of Comment don’t have an association named user then you’ve screwed up.

Examples

Still confused ? Here are some examples, from simple to complicated (these are purely examples of the :include syntax - don’t see this as a recommendation to actually load 10 layers of nested associations). The models are from a hypothetical book selling application and are: - Book - Author - Comment - User

Books belong to authors, and users of the site can leave a comment on any book. The obvious associations are defined. In addition user has a favourite_books association and through the friends association they can list other users who taste in books they generally share.

Book.find :all, :include => :author

I hope I don’t have to explain that one to anyone :-)

Book.find :all, :include => [:author, :comments]

We want to include both the author and comments, so we place the two names in an array

  Book.find :all, :include => [:author, {:comments => :user}]
  Book.find :all, :include => [{:comments => :user}]
  Book.find :all, :include => {:comments => :user}
  Book.find :all, :include => {:comments => [:user]}

In the first example we still want to include author and comments, but now we want to include an association from comments. From the 3rd rule we need a hash containing the key :comments and with corresponding value a description of the associations from the Comment model that we want to load (ie just :user in this case).

In the next 3 examples we just want to include the comments and the user from each comment. These three forms are entirely equivalent (which should be fairly obvious).

  Book.find :all, :include => {:comments => {:user => :favourite_books}}
  Book.find :all, :include => {:comments => {:user => {:favourite_books => :author}}}

Here we are loading the books’ comments, the user for each comment and the favourite books for each of those users. In the second example we’re also loading the author for each of those favourite books. You can keep on nesting these as far as you want.

  Book.find :all, :include => {:comments => {:user => {:favourite_books => [:author, :comments]}}}

Now we’ve come full circle - on each favourite book we’ve loaded the comments

  Book.find :all, :include => {:comments =>
    {:user => [
       :friends,
       {:favourite_books => [:author, :comments]}
]}}
  Book.find :all, :include => {:comments =>
    {:user => [
      {:friends => :favourite_books},
      {:favourite_books => [:author, :comments]}
]}}
  Book.find :all, :include => {:comments =>
    {:user =>
      {:friends => :favourite_books,
       :favourite_books => [:author, :comments]
}}}

Our final examples. In addition to the favourite_books association, we’re loading a user’s friends, and in the second case the favourite books of those friends. The last two examples are identical: we can either have an array with two 1 item hashes, or just one hash with 2 items. We can’t do that in the first example: because we’re not loading any associations from friends we can’t make it into a hash (what would the corresponding value be?)

Parametrised to the Max

Jul 18th, 2008

If you’ve played with Rails for more that about 3 minutes you’ll know that the params hash can itself contain hashes. If you use form_for/fields_for then you’re used to things like params[:user] containing those fields pertaining to the user and things like that. But it can also contain arrays, arrays of hashes and so on.

Fundamentally html forms don’t know about any sort of structured data. All they know about is name-value pairs. Rails tacks some conventions onto parameter names which it uses to express some structure.

A lot of the time you don’t really care about how this happens. You use the helpers, rails generates its magic form element names and more magic on the other side stuffs that into your params hash. Every now and again you need to head off the beaten track, and there it’s useful to understand how things all fit together (eg if you are generating data to submit via javascript).

Boring cases & parlour tricks

First off we’ll need a way of trying out things. You could of course just mock up a form with the appropriate input elements but this would massively slow down the process of trying out stuff which is never a good thing. Instead we can tap into what Rails uses. Just open up a script/console prompt:

  ActionController::AbstractRequest.parse_query_parameters "name=fred&phone=0123456789"
    => {"name"=>"fred", "phone"=>"0123456789"}

Just stick name=value pairs (joined with &) into a string and pass it to the above function. What you get back is what would be in the params hash in your controller. From now on, for the sake of brevity I’ll just write parse instead of ActionController::AbstractRequest.parse_query_parameters. If you’re on 2.3 or higher (ie after this commit then you’ll need to use this:

  ActionController::RequestParser.parse_query_parameters "name=fred&phone=0123456789"
    => {"name"=>"fred", "phone"=>"0123456789"}

And if you’re on Rails 3, you can use

Rack::Utils.parse_nested_query "name=fred&phone=0123456789"
    => {"name"=>"fred", "phone"=>"0123456789"}

If you’ve ever looked at the html your app generates you will have seen form inputs with names like user[name] (as generated by the text_field or form_for helpers). This notation indicates that the user parameter should be a hash, and that here we’re talking about the name key in that hash:

  parse "user[name]=fred&user[phone]=0123456789"
    => {"user"=>{"name"=>"fred", "phone"=>"0123456789"}}

The other basic structure is an array. By default, repeated parameters with the same name are ignored and only the first appearance counts.

  parse "name=fred&name=bob&name=henry"
    => {"name"=>"fred"}

If the name ends with [] then instead of discarding extra occurrences they are accumulated in an array:

  parse "aliases[]=fred&aliases[]=bob&aliases[]=dark+prince&things[]=foo"
    => {"aliases"=>["fred", "bob", "dark prince"], "things"=>["foo"]}

One use case might be a set of checkboxes which indicate which folders a user has to. Just create all your checkboxes with the name accessible_folder_ids[] and with the ids of the folders as values, for example:

  <% @folders.each do |f| %>
    <%= check_box_tag 'user[accessible_folder_ids][]', f.id %> <%= h f.name %> <br/>
  <% end %>

Stick this in a form and when you submit it, params[:user][:accessible_folder_ids] will be an array containing the ids of the folders the user can play with.

Nested parameters for dummies

We can nest hashes if we want. For example perhaps a user has an associated model such as an address:

  parse "user[name]=fred&user[address][town]=cambridge&
         user[address][line1]=4+Station+road"
    => {"user"=>{"name"=>"fred", "address"=>{"line1"=>"4 Station road", "town"=>"cambridge"}}}

A member of a hash can of course be an array. To do this you just append [] to the parameter name, as with a top level array parameter:

  parse "user[aliases][]=fred&user[aliases][]=bob&user[aliases][]=dark+prince"
    => {"user"=>{"aliases"=>["fred", "bob", "dark prince"]}}

The array can be several levels down: we can improve a previous example of a user with an address by saying that the address should have a lines array containing the lines from the address. Start with user[address][lines] and then just append [] to indicate the parameter should be an array:

  parse "user[name]=fred&user[address][town]=cambridge&user[address][lines][]=Random+house&
         user[address][lines][]=4+Station+road&user[address][lines][]=By+the+station"
    => {"user"=> {
      "name"=>"fred",
      "address"=>{
        "lines"=>["Random house", "4 Station road", "By the station"],
        "town"=>"cambridge"
    }}}

There’s another case in which nesting like this can be useful. Support for example that we have a list of users and we want the user to be able to change as many of them as they want with one edit operation. If we use the id of the records as the first key then this is easy:

  parse "users[1][name]=fred&users[1][email]=fred@example.com&
         users[2][name]=bob&users[2][email]=bob@example.com"
    =>{"users"=>{
      "1"=>{"name"=>"fred", "email"=>"fred@example.com"},
      "2"=>{"name"=>"bob", "email"=>"bob@example.com"
}}}

The corresponding controller code is similarly simple

  params[:users].each do |id, new_attributes|
    User.find(id).update_attributes new_attributes
  end

(Handling errors and invalid entries when editing multiple elements is an interesting problem in itself and is left as an exercise to the reader). If you are using the Rails form helpers, the :index option does precisely this:

  helper.text_field "person", "name"
    => '<input id="person_name" name="person[name]" size="30" type="text" />'
  helper.text_field "person", "name", "index" => 1
    => '<input id="person_1_name" name="person[1][name]" size="30" type="text" />'

Editing multiple objects

An interesting case is a form allowing us to create several models, for example to add several users to a mailing list. To do this we use parameters of the form users[][name]. This says that users is an array and we’re pushing an hash with key name onto it. So for example

  parse "users[][name]=fred&users[][email]=fred@example.com"
    => {"users"=>[{"name"=>"fred", "email"=>"fred@example.com"}]}

So how does rails know when one record is finished and the next start? Simple: if we’ve already had a users[][name] parameter and we see a new one then we can usually assume that the next one must belong to a new user

  parse "users[][name]=fred&users[][email]=fred@example.com&
         users[][name]=bob&users[][email]=bob@example.com"
    => {"users"=>[{"name"=>"fred", "email"=>"fred@example.com"},
                  {"name"=>"bob", "email"=>"bob@example.com"}]}

To continue with a previous example, a user might have several addresses:

  parse "user[name]=fred&user[addresses][][line]=24+bob+street&user[addresses][][town]=cambridge&
         user[addresses][][line]=1+market+square&user[addresses][][town]=bedford"
    => {"user"=>{"name"=>"fred", "addresses"=>[
        {"line"=>"24 bob street", "town"=>"cambridge"},
        {"line"=>"1 market square", "town"=>"bedford"}]
}}

Turtles all the way down

Hashes can be nested as much as you want:

  parse "users[a][b][c][d]=fred"
    => {"users"=>{"a"=>{"b"=>{"c"=>{"d"=>"fred"}}}}}

Arrays can’t be nested: you can can’t for example have a parameter which is an array of arrays. You can have a hash with an array parameter or an array of hashes but in general there can only be one level of ‘arrayness’. It’s easy enough to see why this is: arrays are built by repeating the same parameter name multiple times, however if you are inside an array then parameter name repetition is precisely what rails uses to determine whether it should move on to the next array element[1].

To an extent this can be sidestepped, for example instead of an array of users you can have a hash of users keyed by id. If your data structure is genuinely just an array you can always make it into a hash that is keyed by array index. For example if we had a form displaying a list of users and each user has a name and a list of aliases then the following query string parses into what we want:

  parse "users[1][name]=fred&users[1][aliases][]=joker&users[1][aliases][]=the+bat&
         users[2][name]=bob&users[2][aliases][]=bobbo&users[2][aliases][]=bobster"
  =>{"users"=>{
    "1"=>{"name"=>"fred", "aliases"=>["joker", "the bat"]},
    "2"=>{"name"=>"bob", "aliases"=>["bobbo", "bobster"]}
}}

Where it can go wrong

Consider the following query string:

  user[aliases][]=fred&user[aliases][name]=fred

This can’t work: user[aliases][] indicates that user has an aliases attribute that’s an array, but user[aliases][name] indicates that user has an aliases attribute that’s a hash. There are other ways in which this can happen, but the result is the same:

  TypeError: Conflicting types for parameter containers

[1] This is also the reason for the problem Xavier mentions in his comment. Checkboxes submit no value if they are not checked, however it’s rather convenient to create the illusion that they send (for example) 1 if the box is checked, 0 if not.

The rails check_box helper does this by adding a hidden field with the same parameter name. If the checkbox submits nothing then the hidden field “wins” and submits 0, if the checkbox is checked then it wins because it’s the first parameter. However since rails uses parameter name repetition to distinguish between elements of an array the hidden parameter causes rails to start a new array element. I don’t know of a good workaround other than using a hash instead of an array or using check_box_tag instead of check_box.

The Difference Between :include and :joins

Jun 22nd, 2008

I think some people occasionally mix up :include and :joins (or possibly don’t know of the existance of :joins).

:include is for loading associations. Before Rails 2.1 it will always left outer join the appropriate tables, starting with rails 2.1 it will either load them with a separate query or it will join the appropriate tables.

:joins is for joining (duh). It either takes an sql fragment (eg “INNER JOIN foos on foos.id = foo_id”) or association names in a variety of ways and joins the relevant tables. For example:

:joins => :products
:joins => [:products, :customers]
:joins => [:products, {:customers => [:friends, :foes]}]
:joins => [:products, {:customers => {:friends => :parents}}]
:joins => {:order => {:customer => {:address => :some_other_association}}}

and so on. You can nest these as deeply as you want (although bear in mind that if you end up requiring a 23 way join you may want to rethink your strategy). The same notation for nested associations is used by :include. The one difference is that :joins creates inner joins for you (if you desperately need those outer joins, you can always use the string form of :joins, but you will have to write the sql fragment explicitly).

If what you want is to use attributes from the joined tables for sorting or in your conditions then both :include and :joins will work since they both cause the relevant tables to be joined. But :include then does a lot more work massaging the results that come back from the database, instantiating lots of activerecord objects, gluing together all the relationships in the appropriate manner. If all you wanted was to order or filter results based on some of the joined attributes then this work is wasted.

Just to show that I’m not making things up, I performed the following unscientific test:

Benchmark.bm(7) do |x|
  x.report("include:") do
    Customer.find :all, :include => :customer_detail,
                 :conditions => "customer_details.date_of_birth >= '1980-01-1'"
  end
  x.report("joins:") do
    Customer.find :all, :joins => :customer_detail,
                 :conditions => "customer_details.date_of_birth >= '1980-01-1'"
  end
end

Some customers have provided extra details which we store in a separate table, and we want to get all customers born after 1980.

The results:

	user	system	total	real
include	1.310000	0.040000	1.350000	( 1.532910)
joins	0.350000	0.040000	0.390000	( 0.454001)

In short, don’t use :include unless you will actually be accessing those associations and want to avoid the hit caused by loading them from the database one by one.

← Older Blog Archives Newer →

Space Vatican

Ramblings of a curious coder