The Buckblog

assorted ramblings by Jamis Buck

Wait Until it Hurts

27 January 2006 — 2-minute read

There’s a story behind the recent release of Net::SSH 1.0.7 that I want to share, and which ties in nicely with the indoctrination that I’ve been immersed in (and finding invaluable) at work.

For some time there has been a bug in Net::SSH that caused requests to die sporadically with a “corrupted mac detected” error. People have reported this, sending me bug reports and stack traces, but I was never able to duplicate it. Because I was never able to duplicate it, and because I wasn’t being flooded with reports of it, I felt no pain. Sure, I empathized with the people reporting the bug, in a “gee, I’m really sorry about that” kind of way, but I had no motivation to dig in and find the problem.

Last week I began playing with some fun SwitchTower tasks. For instance, I wanted a way to tail the Rails logs of all of our applications at once, so I could count the number of requests per second that were being handled. Our applications are distributed across four application servers, so this seemed like a great opportunity for SwitchTower. Here’s what I came up with:

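desc "Show application statistics in real-time"
task :watch_status, :roles => :app do
  count = 0        # requests completed in the current one-second window
  last = Time.now  # start of the current window
  run "tail -f /first/rails.log /second/rails.log" do |ch, stream, out|
    # Report anything that arrives on stderr, then stop watching.
    if stream == :err
      puts "#{ch[:host]}: #{out}"
      break
    end

    # Count this chunk if it contains a "Completed in" log line.
    count += 1 if out =~ / (\w+) \w+\[(\d+)\]: Completed in/

    # Once a second, print the tally and start a new window.
    if Time.now - last >= 1
      puts "%2d rps" % count
      count = 0
      last = Time.now
    end
  end
end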

(It’s rough, but it really works—just replace the rails.log paths with the paths to your applications. Feel free to polish it off and make it more useful.)

While running this, I finally saw my first, honest-to-goodness, in-the-flesh “corrupted mac detected” error, and it happened reproducibly. Suddenly, it got personal. I felt the pain. I really wanted this feature, but it would never be practical as long as it could not be relied upon to work for extended periods.

Armed with this new motivation, as well as a way to reproduce the problem, I set out to ease the pain. It turned out to be a problem that only occurred in a multithreaded environment. The scheduler was periodically interrupting the socket read of the mac value. All that was needed was to keep retrying the read until the full length of the data was obtained. And now it’s fixed!
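
To make the idea of "keep retrying the read" concrete, here is a minimal sketch in Ruby. It is not the actual Net::SSH patch. `read_fully` and `TrickleIO` are hypothetical names invented for this illustration, and the only assumption about the socket is that it responds to `readpartial`, which may hand back fewer bytes than were asked for.

```ruby
require "stringio"

# Hypothetical helper (not the real Net::SSH code): keep reading until
# exactly `length` bytes have been collected, however many reads it takes.
def read_fully(io, length)
  data = String.new # empty binary buffer
  while data.bytesize < length
    # readpartial may return fewer bytes than requested; it raises
    # EOFError at end of stream, so the loop cannot spin forever.
    data << io.readpartial(length - data.bytesize)
  end
  data
end

# Hypothetical stand-in for a socket whose reads come back short.
class TrickleIO
  def initialize(data, max_chunk)
    @io = StringIO.new(data)
    @max_chunk = max_chunk
  end

  def readpartial(maxlen)
    chunk = @io.read([maxlen, @max_chunk].min)
    raise EOFError, "end of file reached" if chunk.nil?
    chunk
  end
end

io = TrickleIO.new("0123456789abcdef-rest", 3)
mac = read_fully(io, 16) # six short reads, one complete 16-byte value
puts mac.bytesize        # prints 16
```

A single read that is assumed to return the full value would, in this sketch, see only the first three bytes. The loop is what makes the short reads harmless.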

The lesson? Wait until you’re feeling pain before you fix something. We all have lots of things vying for our time, and no one likes it when they have to make something a priority. Wait until something hurts, whether because you’re being affected personally, or because you’re being flooded by support requests. It always feels nice to make something stop hurting.

Reader Comments

I have a feeling that this sort of approach is why so many bits of RoR and Ruby in general are so wonky on Windows. I don't think anybody on the dev team uses Windows, so they don't ever really Feel The Pain. Oh well - the iBook should get here sometime next week and then I too will join the "not my problem, who cares" camp.
Must the devs wait until they feel the pain before resolving the problem? I don't know, I foresee problems with waiting until it becomes personal :-/
Note that I'm not saying you have to actually experience the problem in order to feel the pain. Pain can be felt by a super-abundance of problem tickets, too. If I had been overwhelmed by people reporting the "corrupted mac" error, I might have responded to it sooner. As it was, it was seldom reported, and I'd never been able to duplicate it, so there was no pain being felt.
This isn't a science. There's no formula that says "pain increases asymptotically as the number of tickets _t_ approaches _n_". I can tell you this, though--three or four reports (which is about how many I got on the "corrupted mac" thing) do not result in pain.
So if you had the idle time or extra resources to fix something like that, you wouldn't? I mean, I guess you pick and choose your customers, but avoiding problems that could potentially cause you even more problems down the road is disagreeable. Again, it's all about priorities. If you have bigger problems to deal with, then so be it. However, waiting until a wound becomes infected may not be the best idea.
Lance, of course it is all about priorities. I say as much in this article. "We all have lots of things vying for our time, and no one likes it when they have to make something a priority." For me, I've got a day job, a family, responsibilities at church, books I'm reading, and side projects that I want to work on. Frankly, a bug that I can't reproduce and which only a handful of people have reported isn't going to make it very high on my list of priorities. If I were to go out and make every reported bug my #1 priority, I would burn out in a hurry and would begin to _hate_ working on Net::SSH. Pain is a great rearranger of priorities, and it cuts both ways. If working on Net::SSH becomes too painful, I'll stop working on Net::SSH. Always use common sense. If something is critical, work on it. If it isn't, don't. And many times, you can use pain as an indicator of criticality. That's all I'm saying.
I agree with the concept, but kind of in the same way that I agree with the concept of socialism: the practice can only taint the principle. I moved away from PHP because I felt that certain groups were pushing pain killers more than cures. All I'm saying is that wait-until-it-hurts only works if you're disciplined enough to cure the cause of the pain rather than the symptoms.
A very good point, Jeff. Applying bandaids instead of solving the underlying problem is ultimately self-defeating.