Introducing the Capistrano Shell
Our cluster at 37signals currently consists of 12 machines. The first few weeks that we were running them live were rather bumpy, and I took to using Capistrano to do a bit of ad-hoc monitoring, using the uptime task (from the capistrano-ext gem) to keep an eye on things:
    cap -v uptime
However, with 12 machines, the wait for that command to run each time was about 20 to 30 seconds, which is far too long for my impatient self.
Part of that seemingly-interminable wait was alleviated by changing Capistrano so that it connects to servers in parallel, rather than serially. Prior to 1.2, when you executed a task that needed to connect to multiple servers, each connection was established one at a time, in a single thread. Now, they are established in parallel, one per thread. The connection overhead, which used to be 18 seconds on my laptop, dropped to just over 6 seconds. Not a bad savings!
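To make the difference concrete, here is a rough sketch of serial versus one-thread-per-host connection setup. It is not Capistrano's actual code: fake_connect is a hypothetical stand-in that just sleeps to imitate connection latency, and the host names are only examples.

```ruby
# Hypothetical stand-in for opening an SSH connection; a real
# implementation would talk to an SSH library instead of sleeping.
def fake_connect(host, delay = 0.05)
  sleep(delay) # simulate network and handshake latency
  "connection-to-#{host}"
end

# Serial: total time is roughly the sum of the per-host delays.
def connect_serially(hosts)
  hosts.each_with_object({}) { |host, conns| conns[host] = fake_connect(host) }
end

# Parallel: one thread per host, so total time is roughly that of the
# slowest single connection.
def connect_in_parallel(hosts)
  threads = hosts.map do |host|
    Thread.new { [host, fake_connect(host)] }
  end
  # Thread#value waits for the thread and returns its block's result.
  threads.map(&:value).to_h
end

hosts = %w[app1 app2 app3 web1 web2 db1]

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
connect_serially(hosts)
serial = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
connections = connect_in_parallel(hosts)
parallel = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts format("serial: %.2fs, parallel: %.2fs", serial, parallel)
```

With six simulated hosts, the serial figure should come out at several times the parallel one, because sleeping threads do not block each other.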
However, there was more that could be done. Why, when I am running this command every few minutes, should I need to reestablish the connection with every request? What if I could cache the connection somehow?
This line of thinking led to what I feel is the most exciting new feature in Capistrano 1.2: cap shell.
    jamis> cap -v shell
    ====================================================================
    Welcome to the interactive Capistrano shell! This is an experimental
    feature, and is liable to change in future releases.
    --------------------------------------------------------------------
    cap> !uptime
    [establishing connection(s) to db2, db1, db3, file1, app1, app2, app3, app4, app5, web1, web2, web3]
    querying servers...
    [app1 ] up 15 days, 7:40, load 3.53, 1.36, 0.83
    [app2 ] up 15 days, 8 hrs, load 0.23, 0.38, 0.44
    [app3 ] up 19 days, 11:13, load 0.29, 0.40, 0.43
    [app4 ] up 19 days, 12:02, load 0.58, 0.49, 0.51
    [app5 ] up 20 days, 9:50, load 0.50, 0.45, 0.43
    [db2 ] up 28 days, 22:29, load 0.05, 0.06, 0.07
    [file1] up 60 days, 12:42, load 0.06, 0.17, 0.17
    [web1 ] up 19 days, 22:31, load 0.17, 0.12, 0.13
    [web2 ] up 36 days, 22:18, load 0.06, 0.10, 0.11
    [db1 ] up 28 days, 21:07, load 1.27, 0.90, 0.82
    [db3 ] up 28 days, 22:53, load 1.19, 0.94, 0.85
    [web3 ] up 188 days, 19:36, load 0.09, 0.06, 0.01
    cap>
Subsequent invocations of the uptime task from within that shell will reuse the existing connections. The result? It now takes less than two seconds to give me my answers.
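The idea of reusing connections can be sketched as a small cache keyed by host name. This is only an illustration under stated assumptions: ConnectionCache is a hypothetical class, and the block passed to it stands in for whatever really opens an SSH session.

```ruby
# A minimal connection cache. The block given to the constructor is a
# hypothetical stand-in for code that opens a real SSH session.
class ConnectionCache
  def initialize(&connector)
    @connector = connector
    @connections = {}
  end

  # Returns the cached connection for host, opening it on first use.
  def fetch(host)
    @connections[host] ||= @connector.call(host)
  end

  def size
    @connections.size
  end
end

opened = [] # records every time a "connection" is really opened
cache = ConnectionCache.new do |host|
  opened << host
  "session:#{host}"
end

# Ask for the same two hosts three times; each is opened only once.
3.times { %w[app1 db1].each { |host| cache.fetch(host) } }
cache.fetch("web9") # a host not seen before is opened on demand

puts "opened #{opened.size} connections for #{cache.size} hosts"
```

Running it prints "opened 3 connections for 3 hosts": the repeated requests for app1 and db1 never reach the connector a second time.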
But the shell is good for more than just executing tasks. You can execute arbitrary commands and have them run on your servers:
    cap> uname -srp
    [db2] FreeBSD 6.1-STABLE amd64
    [web2] FreeBSD 6.1-STABLE amd64
    [web3] FreeBSD 6.0-RELEASE amd64
    [file1] Linux 2.6.9-34.0.2.ELsmp x86_64
    [app1] FreeBSD 6.0-RELEASE amd64
    [app2] FreeBSD 6.1-STABLE amd64
    [app3] FreeBSD 6.1-STABLE amd64
    [app4] FreeBSD 6.1-STABLE amd64
    [app5] FreeBSD 6.1-STABLE amd64
    [web1] FreeBSD 6.1-STABLE amd64
    [db1] FreeBSD 6.1-STABLE amd64
    [db3] FreeBSD 6.1-STABLE amd64
    cap>
(Note that prefixing the command with an exclamation point causes Capistrano to interpret it as the name of a task, and to execute it as such. Without the exclamation point, it is treated as a bare command and is simply executed verbatim on each host.)
Also, you can focus the task or command so that it only executes on a specific role or host:
    cap> with app uname -srp
    [app1] FreeBSD 6.0-RELEASE amd64
    [app2] FreeBSD 6.1-STABLE amd64
    [app3] FreeBSD 6.1-STABLE amd64
    [app4] FreeBSD 6.1-STABLE amd64
    [app5] FreeBSD 6.1-STABLE amd64
    cap> on web1,web2 uname -srp
    [web1] FreeBSD 6.1-STABLE amd64
    [web2] FreeBSD 6.1-STABLE amd64
    cap>
You can use "with role" to focus the command on all machines answering to the named role. Use a comma-delimited list to execute on machines in any of a list of roles.
If that is too general, you can get as specific as you wish using "on host". This lets you execute only on the named host or hosts. In fact, you can name hosts this way that aren’t even defined in your deploy.rb; Capistrano will establish connections on the fly to any machines it doesn’t recognize.
Lastly, you can use both "with" and "on" to set the scope for subsequent commands. Just leave the command off:
    cap> with app
    scoping with app
    cap> uname -srp
    [app1] FreeBSD 6.0-RELEASE amd64
    [app2] FreeBSD 6.1-STABLE amd64
    [app4] FreeBSD 6.1-STABLE amd64
    [app3] FreeBSD 6.1-STABLE amd64
    [app5] FreeBSD 6.1-STABLE amd64
    cap> hostname -s
    [app1] 82095-app1
    [app4] 82098-app4
    [app5] 82099-app5
    [app2] 82096-app2
    [app3] 82097-app3
    cap> with all
    scoping with all
    cap> hostname -s
    [db2] 82101-db2
    [web3] 82094-web3
    [app1] 82095-app1
    [app2] 82096-app2
    [db1] 82100-db1
    [file1] 82103-file1
    [app3] 82097-app3
    [app4] 82098-app4
    [app5] 82099-app5
    [web1] 82092-web1
    [web2] 82093-web2
    [db3] 82102-db3
    cap>
(Note the use of the special "all" keyword to return the scope to all roles and machines.)
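The rules described so far (a leading exclamation point for tasks, "with" and "on" prefixes for scoping, and a scope with no command) can be summarized in a small parser. This is a hypothetical sketch, not how Capistrano parses its input: parse_shell_line and the hash layout it returns are invented here for illustration.

```ruby
# Hypothetical parser for one line typed at the cap> prompt. It returns a
# hash describing what the line asks for; none of this is Capistrano's code.
def parse_shell_line(line)
  text = line.strip
  words = text.split(/\s+/, 3) # at most: keyword, names, rest of the line
  scope = nil
  rest = text

  # "with role[,role]" and "on host[,host]" narrow where the command runs.
  if %w[with on].include?(words.first) && words.length >= 2
    names = words[1].split(",")
    key = words[0] == "with" ? :roles : :hosts
    scope = names == ["all"] ? :all : { key => names }
    rest = words[2].to_s
  end

  if rest.empty?
    { action: :set_scope, scope: scope }               # scope with no command
  elsif rest.start_with?("!")
    { action: :task, name: rest[1..-1], scope: scope } # "!" marks a task name
  else
    { action: :command, command: rest, scope: scope }  # run verbatim
  end
end

p parse_shell_line("!uptime")
p parse_shell_line("with app uname -srp")
p parse_shell_line("on web1,web2 uname -srp")
p parse_shell_line("with all")
```

Limiting the split to three pieces keeps the remainder of the line intact, so a piped command such as "on app3 ps waux | head -n 5" reaches the host unchanged.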
So, armed with this new tool, I was able to keep an eye on the load of each machine in the cluster using the uptime task. When I noticed something anomalous, I focused on the box in question and looked for rogue processes:
    cap> on app3 ps waux | head -n 5
    ...
    cap> on app3 sudo kill 12345
    Password:
    ...
    cap>
It worked great. It’s definitely not a substitute for a real SSH shell, but it’s perfect for quick-and-dirty tasks that require you to switch between hosts frequently.
Which raises the question: why isn’t it a substitute for a real SSH shell?
Well, firstly, cap shell is stateless. Each command is executed in a new shell on the remote host. This means that commands like cd and export are pretty useless, since they won’t stick.
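The effect is easy to reproduce locally. The sketch below only imitates the behavior on your own machine by starting a fresh sh process for every command; run_in_fresh_shell is a hypothetical helper, and it assumes a Unix-like system with sh available.

```ruby
require "open3"
require "tmpdir"

# Runs one command in a brand-new local shell that starts in start_dir.
# This only imitates, on the local machine, how each cap shell command
# gets its own shell on the remote host.
def run_in_fresh_shell(command, start_dir)
  output, _status = Open3.capture2("sh", "-c", command, chdir: start_dir)
  output.strip
end

Dir.mktmpdir do |start_dir|
  # The cd happens in a shell that exits immediately afterwards...
  run_in_fresh_shell("cd /", start_dir)
  # ...so the next shell is back in the temporary directory, not in "/".
  puts run_in_fresh_shell("pwd", start_dir)

  # Chaining with && keeps both steps inside the same shell.
  puts run_in_fresh_shell("cd / && pwd", start_dir)
end
```

The same trick applies at the cap> prompt: if a command depends on a directory or an environment variable, put everything on one line, joined with &&.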
Secondly, cap shell isn’t intended to deal with interactive commands. You can’t, for instance, run IRB on multiple hosts simultaneously using cap shell. It does manage to deal with commands like tail -f, but that’s about the limit.
Thirdly, if you thought rm -rf / was dangerous when connected to a single host, imagine the damage you could do with cap shell! This is probably one of the biggest reasons it is still experimental. Until I can find a way to make it harder to accidentally wipe an entire cluster with a single command, you ought to use it with caution.
Still, even with all those caveats, cap shell has become an irreplaceable tool in my toolbox. I’d love to hear from you with ideas for how to make it better, and safer.