Faking cursors in ActiveRecord
There are times (like, in a migration, or a cron job) where I want to operate on large numbers of rows in the database, such as for billing, where you want to select all accounts who are due for automatic renewal, or when adding a new column to a table that you need to prepopulate with computed data.
One way to do that is just to brute force it:
1 2 3 |
Account.find(:all).each do |account| # ... end |
The drawback here is obvious: when you’re dealing with hundreds of thousands or even millions of rows, selecting them all into memory at once is brutal. And since ActiveRecord doesn’t support cursor-based operations, you can’t just ask ActiveRecord to return the rows as it reads them.
Here’s a trick I’ve been using recently to query large result sets while being friendly to the computer:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
class <<ActiveRecord::Base def each(limit=1000) rows = find(:all, :conditions => ["id > ?", 0], :limit => limit) while rows.any? rows.each { |record| yield record } rows = find(:all, :conditions => ["id > ?", rows.last.id], :limit => limit) end self end end Account.each do |account| # ... end |
Sadly, this won’t work on every DBMS, or with every query; it exploits several idiosyncrasies of MySQL which might not be present on other DBMSs:
- MySQL sorts indexes.
- The primary key is an index.
- Queries which MySQL determines can be best satisfied by the primary key, then, will be returned in sorted order.
This means that if you try to add additional conditions to the query, you’ll also need to add an :order clause to sort by the id…and this will more than likely cause the performance of the query to go down the tubes. But for those queries where you just want to select every row anyway, it works quite well. You could use OFFSET and LIMIT, but OFFSET begins to be really, really slow when the OFFSET is in the tens of thousands or higher because it has to count through that many rows before finding where to begin returning data. Basing the query on id, like this, has the advantage of speed, because the database can use indexes like it was meant to.
Reader Comments
Great trick. How would you do a ‘collect’ rather than each?
6 Apr 2007
Alex:
Below is how I would implement collect, modeled off the above (and with the same caveats).
But remember—the reason we’re introducing this hack is to prevent having an array of hundreds of thousands or millions of ActiveRecord objects from one massive find. With the collect you’re just recreating another array of the same size. Hopefully the items returned by the block in the collect are more lightweight, so the size of the array will be more lightweight. But also note that building a huge array through Array#concat on several smaller arrays is probably going to reallocate memory for the array many, many times, which can cause other problems…
class <<ActiveRecord::Base def collect(limit=1000) results = [] rows = find(:all, :conditions => [“id > ?”, 0], :limit => limit) while rows.any? results.concat(rows.collect { |record| yield record }) rows = find(:all, :conditions => [“id > ?”, rows.last.id], :limit => limit) end results end end
6 Apr 2007
Sorry about the formatting above, hopefully this will work…
class <<ActiveRecord::Base def collect(limit=1000) results = [] rows = find(:all, :conditions => [“id > ?”, 0], :limit => limit) while rows.any? results.concat(rows.collect { |record| yield record }) rows = find(:all, :conditions => [“id > ?”, rows.last.id], :limit => limit) end results end end
6 Apr 2007
Alex, as Jacob said, the point of this exercise is to avoid mapping all of the rows into memory. If you need to use collect, you’re better of doing like Jacob demonstrated, or even simpler, just doing
Foo.find(:all).collect { ... }
.6 Apr 2007
For DBs with inherent cursor support, could their adapters implement #each, #collect, etc to utilise these cursors? The model + association implementations of #each, et al, would need to delegate to the adapters to support this, I guess.
I should just go and look, but its so much easier to just ask :)
6 Apr 2007
I think the idea of adding each like that is a very good idea. How much more complicated would it to be to add cursors for Postgres?
6 Apr 2007
Adding cursor support to ActiveRecord is non-trivial, but that shouldn’t stop people from trying. :)
6 Apr 2007
@jamis – bah, your answer was as unhelpful :) as my question!
Now I’ll have to go investigate… :|
6 Apr 2007
Giles – most of our databases are small enough that returning the full result set isn’t an issue and so far I’ve been able to narrow down the rest to keep them efficient. However, in the long run this is pretty interesting to me and we’re a pure PostgreSQL shop with the luxury of dictating that’ll always be the case barring a buyout. I’m going to look into hacking this up with find_by_sql and perhaps PL/Ruby’s cursor support instead of PL/pgSQL. The biggest potential issue I can see is that declaring cursors apparently needs to happen in single transaction.
6 Apr 2007
Jamis, I think this time I’ve beat you to it :
) Alas, as nobody reads my blog -and why should they anyway?—it might have gone unnoticed by the world.A cursor or offset/limit based implementation of Enumerable for ActiveRecord
http://schuerig.de/michael/blog/index.php/2007/02/03/ar-enumerable/
6 Apr 2007
Very nice, Michael! Kudos for the first write-up. :) And the nice implementation.
6 Apr 2007
Isn’t this essentially what this plugin is being used for as well? (Or at least his original use for it, now I just use it as a replacement for rails paginator)
http://cardboardrocket.com/pages/paginating_find
6 Apr 2007
JGeiger, sort of. The implementation you linked to uses LIMIT and OFFSET…and I wanted to specifically avoid an implementation that used OFFSET because that becomes really, really slow for large offsets. The solution I’m proposing here relies on MySQL’s ordering of primary keys to provide a way to (effectively) mimic OFFSET for a limited subset of queries, in a very efficient and fast manner.
6 Apr 2007
This is a pretty good work-around Jamis. Although whoever is responsible for the “all_hashes” method in the database adapter interface should get a stern sideways glance.
I’d really like a solution that made use of database cursors and only created a single ActiveRecord object in memory at any one time without needing a query for each row in the database, although I pretty much understand that if I want that I should implement it and release it as a plugin, perhaps I will get time one day.
6 Apr 2007
So, what does the call to self there at the end do?
6 Apr 2007
Justin, it’s not a method call, it’s just making sure that the class itself (Account, or whatever) is returned as the value of the call to “each”. It’s a habit I’ve gotten into, to make it easier to chain method calls.
6 Apr 2007
I wrote a paginator plugin about a week ago that people may find useful. We have a lot of custom finder queries in our app, and it bugged me that I’d have to duplicate stuff if I wanted to paginate over them. So this plugin makes it cake.
http://evang.eli.st/blog/2007/4/4/arpaginator-easily-paginate-existing-query-methods
Pretty handy when you’ve encapsulated your wild finds, so you can just do
paginate { Account.find_all_belonging_to_heineken_drinkers_and_recovering_php_addicts }
11 Apr 2007
This could be useful : http://cardboardrocket.com/pages/paginating_find
(thanks to Pat Maddox blog for pointing this).
13 Apr 2007
Jamis,
Several people have given me your name. I have some questions about Provo-Salt Lake based Ruby on Rails developers for a project I NEED to finish fast. Can you contact me?
N
16 Apr 2007
paginating_find as mentioned earlier is very useful. Besides providing pagination directly from a find (thus paginating over a group of objects matching your :conditions), it provides implicit cursor support with :auto => true.
However, with regard to your example, this could be better done in your each by providing a count of iterations to the each method itself, and using :offset in conjunction with :limit. This would make your cursor support naturally support just about every RDBMS. Also, it would be nice to support an options hash that gets passed along to the find.
29 Apr 2007