Indexing for DB performance
Isn’t Rails great? It makes interacting with your database so easy, and removes almost every vestige of SQL from the development process. You can build and mutate your entire database schema (thanks to ActiveRecord::Migration and ActiveRecord::Schema), go crazy shoving data into your database (with
ActiveRecord::Base.create and friends) and query your data in a very friendly Ruby DSL (
Wonderful! But I think most of us have experienced the puzzlement and frustration of wondering why our application, which ran so beautifully during testing and for the first few days or weeks after launch, is suddenly running slower and slower, and why our database is being so incredibly overworked. What happened?
Chances are, you forgot to add indexes to your tables. Rails won’t (and, honestly, can’t) do it for you. In fact, Rails doesn’t even try to tell you where those indexes might be needed. And without those indexes, the only recourse the database has when fulfilling your query is to do a “full table scan”, basically looking at each row in the table, one at a time, to find all matching records. That’s not too bad when there are only a few tens (or even thousands, on a fast machine) of rows, but when you starting getting tens of thousands, hundreds of thousands, or even millions of rows, just imagine how hard your database has to work to satisfy those queries!
So you may be wondering, “alright, I need indexes…how do I know what indexes to create?”
Here are a few general tips. My experience is primarily with MySQL, so that’s where my advice is directed, but I believe most of these tips apply regardless of your DBMS:
- If you have a foreign key on a table (or, phrased another way, you have a
has_and_belongs_to_manyassociation on a model), then you almost certainly need to add an index for it, because any time you access those associations, Rails is generating SQL under the covers that queries based on those foreign keys.
- If you find yourself frequently doing queries on a non-foreign-key column (like
permalink), you’ll definitely want an index on that column.
- If you frequently sort on a column or combination of columns, make sure the index that is being used for the query includes those sort columns, too (if at all possible). Indexes store the data in sorted order, so if your index includes the sort column, the database can return the sorted data at almost no extra cost.
- Many databases (like MySQL, or Postgres prior to 8.1) will only use a single index per table, per query, so make sure you have indexes defined for the column combinations that you will query on frequently. A common mistake is to define an index on “user_name” and an index on “account_id”, and then expect the database to use both indexes to satisfy a query that references both columns. (Some databases will use both indexes, though; be sure and understand how your DBMS uses indexes.)
- Don’t go crazy defining indexes. It is tempting to just add an index on every column that could conceivably be queried on, just to preemptively destroy any possible DB performance problems that may arise. This is bad. Too many indexes can be just as bad as too few, since the DB has to try and determine which of the myriad indexes to use to satisfy a particular query. Also, indexes consume disk space, and they have to be kept in sync every time an insert, delete, or update statement is executed. Lots of indexes means lots of overhead, so try to strike a good balance. Start with only the indexes you absolutely need, and try to use only those. As problem queries surface, see if they can be rewritten to use existing indexes, and only if they can’t should you go ahead and add indexes to fix them.
ANALYZE(Postgres) (or whatever means your DB provides) are your best friends. Get to know them. Learn how to read their output. They will tell you what indexes (if any) a query will use, and how the database expects to be able to fulfil the query. It is a good idea to play with these commands during testing, to try and locate problem spots before they become problems. Note, though, that the number of rows in a table can affect how the database chooses indexes, so just because your query looks fine with only a handful of test rows in the database, don’t expect it to perform well when there are thousands of rows. In a perfect world, you could test your app with a large corpus of real data. In an imperfect world, you just have to make do.
In short, know your database. As convenient as ActiveRecord makes things, never assume you can get along with zero knowledge of SQL and how your database will work. Find a good book about your DBMS of choice. Read up on it. Take the time to educate yourself—it will pay off handsomely in the long run.
Do you have have any more information about MySQL and PostgreSQL using single indexes? I remember reading some of this for MySQL, I’d like to try and get more information for PostgreSQL. Getting a database to use its indexes makes all the difference.
23 Oct 2006
Adam, I’m afraid I have very, very little experience with PostgreSQL, so I can’t help you there. For MySQL, I strongly recommend “High Performance MySQL” by Jeremy Zawodny and Derek Balling. It’s a little dated (written when 4.0 brand new) but the concepts are still very relevant.
23 Oct 2006
MySQL also has a certification program: they offer seperate certifications for developers and DBAs. Whatever you think about the certifcation process, the textbook is pretty good.
Another tip you might try: mysql> SELECT * FROM someTable PROCEDURE ANALYSE\G
That will give you the optimal field type for every column in the table. The two numeric arguments tells it the maximum number of elements you’re willing to accept for an ENUM field, and the maximum number of characters you’ll accept in its definition, respectively.
But indexes are definitely the place to start. Joins that can be accomplished very easily given a couple of good indices can become very taxing in their absence.
23 Oct 2006
Sam, that PROCEDURE ANALYSE thing… is that mysql 5 only? I’m not having any luck with it on 4.1.
23 Oct 2006
Ah, nevermind, the empty parens got eaten by the comemnt system. For the record, that is PROCEDURE ANALYSE followed by empty parentheses. :)
23 Oct 2006
Skimming the PostgreSQL documentation, it looks like they actually can handle multiple indexes at a time in 8.1
23 Oct 2006
I came across the Query Analyzer plugin a while back. For development it is brilliant as it outputs the EXPLAIN into the logs after each query. Great for checking to see what needs optimising.
23 Oct 2006
The thing I love about migrations is that you can define indexes in a migration with add_index. For example to create a unique index across 2 fields you can add the following to your migration:
add_index “table”, [“field1”, “field2”], :name => “index_name”, :unique => true
23 Oct 2006
The best thing to do is read all the documentation for your db so you become intimately familiar with it. I don’t use mysql, but at least for postgres the docs are very in depth. Also you should sign up for the mailing list for your db of choice.
24 Oct 2006
As far as I remember RDBMS’ should use multiple indexes depending on the types of things you’re doing. The more
For example if you’re doing a few joins, for each join an index will be used depending on what you have in the ‘on’ clause. Typically the RDBMS will pick what index it thinks is best.
Then you’ve got the where clause, which another index will be used again, if you have more than one condition the RDBMS will again pick which index it thinks it should use.
It may be different for different RDBMS’, but I think that’s typically how they work.
25 Oct 2006
Oh and I must suggest a book if you’re wanting to really learn SQL. This book is DB independent and even explains the differences between the popular vendors. If you’re like ‘oh I don’t really use joins…’ and ‘what’s a group by?’ then this book is unreal. It even has a good coverage of NULLs which many other books do not.
I promise you’ll emerge an expert. Enjoy!
25 Oct 2006
Regarding your point about adding indexes on fields that are foreign keys, I think it really depends on what side of the association you are coming from. For example, if I have two models called Server and Administrator:
class Server < ActiveRecord::Base belongs_to :administrator end
class Administrator < ActiveRecord::Base has_many :servers end
The servers table has a foreign key called administrator_id that holds references to the administrators table’s primary key, administrator_id.
If you do a:
Server.find(:all, :include => :administrator)
Having an index defined on servers.administrator_id is not going to do you any good. However, if you do:
Administrator.find(:all, :include => :servers)
then, the index on servers.administrator_id will be used.
25 Oct 2006
Well, it has always been the case that indexing is less about how to do it and more about when to do it. That has clearly been what seperates the champs from the chumps when it comes to fast retrieval database operations.
27 Oct 2006
Only one thing I would add which is really important.
Check the cardinality of the column before indexing.
MySQL: SHOW INDEX FROM table_name
The distribution of values in a column is important as well. In Database speak that is the cardinality of an column.
The more unique a value in a column the higher its cardinality. Example: types of sexes is extremely low, email address is very high.
Until PostgreSQL and MySQL get bitmap indexes like Oracle it is better to full scan the table than get data from the table via an index where the cardinality is very low.
Getting rid of file sorts and temp table stuff is important in MySQL, be careful of IN/EXIST subqueries, the MySQL optimiser is still not fully baked for those, from my experience.
28 Oct 2006
On the databases I’ve worked with in the past, its worth noting that indices incur a write penalty. So anytime you’re inserting a row (and probably updating), the DB has to generate data for and insert data into the index, possibly re-organizing it. In most scenarios, this is fine as reads often out-number writes 5:1 or so. However, for something like an audit table (for example), you might want to choose your indices sparingly.
Other than that, this has been a very valuable thread, Jamis. Thanks for starting it!
28 Oct 2006
For a good treatment of indexes, see “The Art of SQL” (http://www.amazon.com/Art-SQL-Stephane-Faroult/dp/0596008945) on O’Reilly, a refreshingly no-nonsense, high-level, database-agnostic view on relational databases, with much material devoted to how to make the relational model work for you within the constraints of the non-relational SQL language.
As Adam Sanderson points out, PostgreSQL 8.1 will use in-memory bitmap indexes to perform multi-index joins (a feature DB2 has had for years). Bizgres has implemented on-disk bitmap indexes similar to Oracle’s, scheduled for inclusion sometime after 8.2. Oracle’s compressed bitmap indexes, which are only available in the $30k-per-CPU enterprise version, yields ridiculously fast queries on high-cardinality attributes.
Oh, and the command to get explain output from PostgreSQL is “explain”, not “analyze”. Usually you will want to run “explain analyze”, which requests actual timings in addition to the estimated costs. PostgreSQL’s explain output, incidentally, is the best I have seen—Oracle’s is all right and MySQL’s is practically unreadable.
30 Oct 2006
Note to those who asked about MySQL only using one index per query: this restriction was removed in the 5.0 series. http://dev.mysql.com/doc/refman/5.0/en/index-merge-optimization.html
2 Nov 2006