Wait Until it Hurts
There’s a story behind the recent release of Net::SSH 1.0.7 that I want to share, one that ties in nicely with the indoctrination I’ve been immersed in (and finding invaluable) at work.
For some time there has been a bug in Net::SSH that caused requests to die sporadically with a “corrupted mac detected” error. People reported it, sending me bug reports and stack traces, but I was never able to duplicate it. Because I couldn’t duplicate it, and because I wasn’t being flooded with reports of it, I felt no pain. Sure, I empathized with the people reporting the bug, in a “gee, I’m really sorry about that” kind of way, but I had no motivation to dig in and find the problem.
Last week I began playing with some fun SwitchTower tasks. For instance, I wanted a way to tail the Rails logs of all of our applications at once, so I could count the number of requests per second being handled. Our applications are distributed across four application servers, so this seemed like a great opportunity for SwitchTower. Here’s what I came up with:
    desc "Show application statistics in real-time"
    task :watch_status, :roles => :app do
      count = 0
      last = Time.now

      run "tail -f /first/rails.log /second/rails.log" do |ch, stream, out|
        # Report anything on stderr and stop watching. (puts returns nil,
        # so "puts ... and break" would never reach the break.)
        if stream == :err
          puts "#{ch[:host]}: #{out}"
          break
        end

        # Count this chunk of output if it contains a completed request.
        count += 1 if out =~ / (\w+) \w+\[(\d+)\]: Completed in/

        # Once a second, print the tally and start over.
        if Time.now - last >= 1
          puts "%2d rps" % count
          count = 0
          last = Time.now
        end
      end
    end
(It’s rough, but it really works—just replace the rails.log paths with the paths to your applications. Feel free to polish it off and make it more useful.)
While running this, I finally saw my first, honest-to-goodness, in-the-flesh “corrupted mac detected” error, and it happened reproducibly. Suddenly, it got personal. I felt the pain. I really wanted this feature, but it would never be practical as long as it could not be relied upon to work for extended periods.
Armed with this new motivation, as well as a way to reproduce the problem, I set out to ease the pain. It turned out to be a problem that only occurred in a multithreaded environment. The thread scheduler was periodically interrupting the socket read of the MAC value, so the read sometimes came back with fewer bytes than expected. All that was needed was to keep retrying the read until the full length of the data was obtained. And now it’s fixed!
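To make the "keep retrying the read" idea concrete, here is a minimal sketch of that kind of loop. It is not the actual Net::SSH patch: read_fully and ShortReadIO are hypothetical names made up for illustration, and ShortReadIO is only a stand-in for a socket whose reads can come back short.

```ruby
# Hypothetical helper (not from Net::SSH): keep reading until exactly
# `length` bytes have arrived, however many short reads that takes.
def read_fully(io, length)
  buffer = "".b
  while buffer.bytesize < length
    # readpartial may return fewer bytes than requested; it raises
    # EOFError if the stream ends before we have everything.
    buffer << io.readpartial(length - buffer.bytesize)
  end
  buffer
end

# Hypothetical stand-in for a socket that never returns more than
# `limit` bytes per call, to simulate interrupted reads.
class ShortReadIO
  def initialize(data, limit)
    @data = data.b
    @limit = limit
    @pos = 0
  end

  def readpartial(maxlen)
    raise EOFError, "end of stream" if @pos >= @data.bytesize
    chunk = @data.byteslice(@pos, [maxlen, @limit].min)
    @pos += chunk.bytesize
    chunk
  end
end

# Ask for 16 bytes from a source that gives up after 3 bytes at a time.
mac = read_fully(ShortReadIO.new("0123456789abcdef", 3), 16)
puts mac.bytesize
```

A single read here would hand back only three bytes, which is the sort of truncated value that would fail a MAC comparison. The loop simply keeps asking for whatever is still missing.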
The lesson? Wait until you’re feeling pain before you fix something. We all have lots of things vying for our time, and no one likes being forced to make something a priority. Wait until something hurts, whether because you’re being affected personally, or because you’re being flooded by support requests. It always feels nice to make something stop hurting.