Faking cursors in ActiveRecord

Posted by Jamis on April 06, 2007 @ 08:15 AM

There are times (like, in a migration, or a cron job) where I want to operate on large numbers of rows in the database, such as for billing, where you want to select all accounts who are due for automatic renewal, or when adding a new column to a table that you need to prepopulate with computed data.

One way to do that is just to brute force it:

1
2
3
Account.find(:all).each do |account|
  # ...
end

The drawback here is obvious: when you’re dealing with hundreds of thousands or even millions of rows, selecting them all into memory at once is brutal. And since ActiveRecord doesn’t support cursor-based operations, you can’t just ask ActiveRecord to return the rows as it reads them.

Here’s a trick I’ve been using recently to query large result sets while being friendly to the computer:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
class <<ActiveRecord::Base
  def each(limit=1000)
    rows = find(:all, :conditions => ["id > ?", 0], :limit => limit)
    while rows.any?
      rows.each { |record| yield record }
      rows = find(:all, :conditions => ["id > ?", rows.last.id], :limit => limit)
    end
    self
  end
end

Account.each do |account|
  # ...
end

Sadly, this won’t work on every DBMS, or with every query; it exploits several idiosyncrasies of MySQL which might not be present on other DBMSs:

  • MySQL sorts indexes.
  • The primary key is an index.
  • Queries which MySQL determines can be best satisfied by the primary key, then, will be returned in sorted order.

This means that if you try to add additional conditions to the query, you’ll also need to add an :order clause to sort by the id…and this will more than likely cause the performance of the query to go down the tubes. But for those queries where you just want to select every row anyway, it works quite well. You could use OFFSET and LIMIT, but OFFSET begins to be really, really slow when the OFFSET is in the tens of thousands or higher because it has to count through that many rows before finding where to begin returning data. Basing the query on id, like this, has the advantage of speed, because the database can use indexes like it was meant to.

Posted in Tips & Tricks

Comments

Have something to add? Click here to leave a comment.

06 Apr 2007

1. Alex MacCaw said...

Great trick. How would you do a ‘collect’ rather than each?

2. Jacob Fugal said...

Alex:

Below is how I would implement collect, modeled off the above (and with the same caveats).

But remember—the reason we’re introducing this hack is to prevent having an array of hundreds of thousands or millions of ActiveRecord objects from one massive find. With the collect you’re just recreating another array of the same size. Hopefully the items returned by the block in the collect are more lightweight, so the size of the array will be more lightweight. But also note that building a huge array through Array#concat on several smaller arrays is probably going to reallocate memory for the array many, many times, which can cause other problems…

class <<ActiveRecord::Base def collect(limit=1000) results = [] rows = find(:all, :conditions => [“id > ?”, 0], :limit => limit) while rows.any? results.concat(rows.collect { |record| yield record }) rows = find(:all, :conditions => [“id > ?”, rows.last.id], :limit => limit) end results end end

3. Jacob Fugal said...

Sorry about the formatting above, hopefully this will work…

class <<ActiveRecord::Base def collect(limit=1000) results = [] rows = find(:all, :conditions => [“id > ?”, 0], :limit => limit) while rows.any? results.concat(rows.collect { |record| yield record }) rows = find(:all, :conditions => [“id > ?”, rows.last.id], :limit => limit) end results end end

4. Jamis said...

Alex, as Jacob said, the point of this exercise is to avoid mapping all of the rows into memory. If you need to use collect, you’re better of doing like Jacob demonstrated, or even simpler, just doing Foo.find(:all).collect { ... }.

5. Dr Nic said...

For DBs with inherent cursor support, could their adapters implement #each, #collect, etc to utilise these cursors? The model + association implementations of #each, et al, would need to delegate to the adapters to support this, I guess.

I should just go and look, but its so much easier to just ask :)

6. Giles Bowkett said...

I think the idea of adding each like that is a very good idea. How much more complicated would it to be to add cursors for Postgres?

7. Jamis said...

Adding cursor support to ActiveRecord is non-trivial, but that shouldn’t stop people from trying. :)

8. Dr Nic said...

@jamis – bah, your answer was as unhelpful :) as my question!

Now I’ll have to go investigate… :|

9. rise said...

Giles – most of our databases are small enough that returning the full result set isn’t an issue and so far I’ve been able to narrow down the rest to keep them efficient. However, in the long run this is pretty interesting to me and we’re a pure PostgreSQL shop with the luxury of dictating that’ll always be the case barring a buyout. I’m going to look into hacking this up with find_by_sql and perhaps PL/Ruby’s cursor support instead of PL/pgSQL. The biggest potential issue I can see is that declaring cursors apparently needs to happen in single transaction.

10. Michael Schuerig said...

Jamis, I think this time I’ve beat you to it :) Alas, as nobody reads my blog - and why should they anyway?—it might have gone unnoticed by the world.

A cursor or offset/limit based implementation of Enumerable for ActiveRecord

http://schuerig.de/michael/blog/index.php/2007/02/03/ar-enumerable/

11. Jamis said...

Very nice, Michael! Kudos for the first write-up. :) And the nice implementation.

12. JGeiger said...

Isn’t this essentially what this plugin is being used for as well? (Or at least his original use for it, now I just use it as a replacement for rails paginator)

http://cardboardrocket.com/pages/paginating_find

13. Jamis said...

JGeiger, sort of. The implementation you linked to uses LIMIT and OFFSET…and I wanted to specifically avoid an implementation that used OFFSET because that becomes really, really slow for large offsets. The solution I’m proposing here relies on MySQL’s ordering of primary keys to provide a way to (effectively) mimic OFFSET for a limited subset of queries, in a very efficient and fast manner.

14. Sean Geoghegan said...

This is a pretty good work-around Jamis. Although whoever is responsible for the “all_hashes” method in the database adapter interface should get a stern sideways glance.

I’d really like a solution that made use of database cursors and only created a single ActiveRecord object in memory at any one time without needing a query for each row in the database, although I pretty much understand that if I want that I should implement it and release it as a plugin, perhaps I will get time one day.

15. Justin said...

So, what does the call to self there at the end do?

16. Jamis said...

Justin, it’s not a method call, it’s just making sure that the class itself (Account, or whatever) is returned as the value of the call to “each”. It’s a habit I’ve gotten into, to make it easier to chain method calls.

11 Apr 2007

17. Pat Maddox said...

I wrote a paginator plugin about a week ago that people may find useful. We have a lot of custom finder queries in our app, and it bugged me that I’d have to duplicate stuff if I wanted to paginate over them. So this plugin makes it cake.

http://evang.eli.st/blog/2007/4/4/arpaginator-easily-paginate-existing-query-methods

Pretty handy when you’ve encapsulated your wild finds, so you can just do

paginate { Account.find_all_belonging_to_heineken_drinkers_and_recovering_php_addicts }

13 Apr 2007

18. dodik said...

This could be useful : http://cardboardrocket.com/pages/paginating_find

(thanks to Pat Maddox blog for pointing this).

16 Apr 2007

19. Neal Bergstrom said...

Jamis,

Several people have given me your name. I have some questions about Provo-Salt Lake based Ruby on Rails developers for a project I NEED to finish fast. Can you contact me?

N

29 Apr 2007

20. Stephen Touset said...

paginating_find as mentioned earlier is very useful. Besides providing pagination directly from a find (thus paginating over a group of objects matching your :conditions), it provides implicit cursor support with :auto => true.

However, with regard to your example, this could be better done in your each by providing a count of iterations to the each method itself, and using :offset in conjunction with :limit. This would make your cursor support naturally support just about every RDBMS. Also, it would be nice to support an options hash that gets passed along to the find.