A while back, I wrote a long article about server design. To this day, it seems to get a steady stream of hits, often showing up in log summaries right behind my platypus pictures and the Lord of the Rings personality test (don’t ask). Now I’d like to dive a little deeper and talk about actual coding. Here are some survival tips that I’ve picked up here and there. They’re not especially innovative or anything, but you’d be amazed how many otherwise-smart people cause themselves unnecessary pain by forgetting them. There’s also a bit of an emphasis on writing code that doesn’t just work once but keeps working for the long term, even after some inferior programmer has messed with it and it’s out in the field, where you’re forced to debug without the full suite of programming tools available.

Log Properly

One of the most critical debugging tools for code outside of the lab is a decent logging facility. Three particular properties are important:

  • It must have a minimal impact on performance, so it doesn’t hide timing-related problems.
  • It must be shared across all modules in the system, or at least on one node if your system happens to be distributed. This implies both thread safety and consistent timestamps; events must appear in the log in exactly the same order that they actually occurred.
  • It must allow run-time filtering of what gets logged. Debug on/off is not sufficient, and neither are debug levels. To avoid the twin evils of insufficient information and information overload, the person looking at a problem must be able to fine-tune the logging to at least the module level. This can be achieved by maintaining a simple bitmask of what’s being logged, one bit per module or logging point within a module (see the sketch right after this list).
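
To make that concrete, here’s a minimal sketch of bitmask-filtered logging in C. The names (log_mask, LOG_NET, tslog) and the bit assignments are invented for illustration, and a real facility would write into a shared buffer rather than straight to stdout.

```c
/* Minimal sketch of bitmask-filtered logging; all names here are
 * hypothetical, not from any real logging API. */
#include <stdarg.h>
#include <stdio.h>
#include <time.h>

/* One bit per module (or per logging point within a module). */
#define LOG_NET    (1u << 0)
#define LOG_DISK   (1u << 1)
#define LOG_CACHE  (1u << 2)

/* Changed at run time, e.g. from a /proc or ioctl hook, a signal handler,
 * or the exerciser's test scripts. */
static volatile unsigned int log_mask = LOG_NET | LOG_CACHE;

static void tslog(unsigned int bit, const char *fmt, ...)
{
    va_list ap;
    struct timespec ts;

    if (!(log_mask & bit))          /* filtered out: one test and a branch */
        return;

    clock_gettime(CLOCK_MONOTONIC, &ts);    /* consistent timestamp source */
    printf("[%ld.%09ld] ", (long)ts.tv_sec, ts.tv_nsec);

    va_start(ap, fmt);
    vprintf(fmt, ap);
    va_end(ap);
}
```

The point is that a filtered-out call costs one AND and a branch, so leaving the logging calls compiled in everywhere doesn’t distort timing much.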

Some systems provide pretty good logging facilities already; believe it or not, AIX is particularly good in this regard. Use what’s there if it has the necessary features. If not, it’s pretty easy to write one yourself – even for kernel use. Circular buffers are good, but make sure they’re large enough to capture a decent amount of activity.
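
In the same spirit, here’s a rough sketch of the circular-buffer side. The structure and function names are made up, and a kernel version would swap the pthread mutex for a spinlock or a lock-free reservation scheme.

```c
/* Hypothetical circular log buffer; a real version would also record
 * timestamps and lengths so a dump tool can walk the entries. */
#include <pthread.h>
#include <string.h>

#define LOGBUF_SIZE (1u << 20)  /* 1 MB: size it to hold a useful window of activity */

struct logbuf {
    char            data[LOGBUF_SIZE];
    size_t          head;       /* next write position */
    pthread_mutex_t lock;       /* serializes writers, keeps ordering consistent */
};

static struct logbuf lb = { .lock = PTHREAD_MUTEX_INITIALIZER };

static void logbuf_put(const char *msg, size_t len)
{
    pthread_mutex_lock(&lb.lock);
    for (size_t i = 0; i < len; i++) {
        lb.data[lb.head] = msg[i];
        lb.head = (lb.head + 1) % LOGBUF_SIZE;  /* wrap: oldest entries get overwritten */
    }
    pthread_mutex_unlock(&lb.lock);
}
```

Sizing is the part people get wrong: if the buffer only holds a few seconds of activity under load, it will have wrapped past the interesting part by the time anyone looks at it.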

Don’t Abuse the Stack

The most obvious reason for not abusing the stack is that your code might run in stack-limited environments, such as an OS kernel or embedded system. However, I do realize that some of the people reading this might be stuck writing wimpy user-level general-purpose-OS code (that’s a joke, son) so that’s not the real reason I advise against keeping lots of state on the stack. The real reason is that too much state on the stack is just harder to debug. Crawling the stack can be error-prone even for a debugger (if you happen to have one handy when that nasty bug hits), and you can still get stuck trying to piece together what happened from state splattered across half a dozen stack frames. Yech. Most state belongs in standalone objects or structures. If you’re handling some kind of request, put all state for that request in a neat little bundle, not in arguments everywhere. Every module or function that’s involved in processing the request is then guaranteed to know everything it needs to know, and you won’t have to go back time after time to add a new parameter through three levels of function calls so the one at the end can make a correct decision. This approach also makes it easier to look at dumps, either the raw-memory kind or the formatted dump-routine kind, and sooner or later that’s all you’ll have to look at while a customer – or your boss – is breathing down your neck.
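
Here’s a sketch of what that bundle might look like for a hypothetical request path. Every field and function name is invented; the shape is the point: each stage takes the whole structure, and the structure is also what shows up in a dump.

```c
/* Hypothetical per-request state bundle, kept off the stack. */
#include <stddef.h>
#include <stdint.h>

enum req_stage { REQ_PARSE, REQ_LOOKUP, REQ_REPLY, REQ_DONE };

struct request {
    uint64_t        id;
    enum req_stage  stage;      /* where we are in processing */
    int             client_fd;
    size_t          bytes_in;
    size_t          bytes_out;
    int             error;      /* first error seen, 0 if none */
    char            path[256];
};

/* Each stage takes the whole bundle, so adding a new field later doesn't
 * mean threading a new parameter through three levels of calls. */
static void parse_request(struct request *req) { req->stage = REQ_LOOKUP; }
static void lookup_object(struct request *req) { req->stage = REQ_REPLY;  }
static void send_reply(struct request *req)    { req->bytes_out = 0;      }

static void handle_request(struct request *req)
{
    parse_request(req);
    if (!req->error)
        lookup_object(req);
    if (!req->error)
        send_reply(req);
    req->stage = REQ_DONE;      /* the struct is what you'd inspect in a dump */
}
```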

Explicit state can be overdone. It’s often associated with a state-machine approach, which has a great deal to recommend it, but it can be a real pain to write loops without using local variables as indices (for example). There’s a definite judgement call involved, but it’s usually best to err on the side of putting/passing state in structures instead of using arguments and local variables everywhere. It’s usually more efficient, too.
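
For what it’s worth, here’s a hypothetical illustration of the tedious side of that trade-off: a step function that has to keep even its loop index in the request structure so it can yield and resume later.

```c
/* Sketch of explicit state taken to its logical conclusion: the loop
 * index lives in the structure, not on the stack, so the state machine
 * can stop partway through and pick up where it left off. */
struct scatter_req {
    int nchunks;
    int next_chunk;     /* the "local variable" that had to move here */
};

/* Returns nonzero when all chunks have been issued. */
static int issue_chunks(struct scatter_req *req, int budget)
{
    while (req->next_chunk < req->nchunks && budget-- > 0) {
        /* submit_chunk(req, req->next_chunk);  -- hypothetical I/O call */
        req->next_chunk++;
    }
    return req->next_chunk >= req->nchunks;
}
```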

Exercisers and Shims

We all know we’re supposed to keep modules separate, right? Right? Good. What’s less universally understood is that writing the code for the module itself is only half the job of producing a quality module that you or your team members can use with confidence. If you think of a module as having an “upper” interface through which others call it, plus one or more “lower” interfaces through which it calls others, the other half of the job is writing an exerciser for the upper interface and shims for the lower ones to simulate the modules it calls. I usually write the exerciser as a Python extension module, with Python objects corresponding to objects within the module, so I have the full benefit of Python’s data and control structures when I’m writing unit-test scripts or poking at things manually. Yes, that applies even to code that’s supposed to run in the kernel; it ultimately takes less time to write the exerciser than would otherwise be spent in that compile-debug-reboot cycle familiar to us all. In this particular case the shims have to include somewhat-accurate equivalents for the kernel functions that don’t exist at user level, and that can be tedious, but there’s no such thing as a free lunch. At the very least it forces you to appreciate and understand your dependencies, and that’s often valuable.
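
As a small example of the shim side, here’s a sketch of user-level stand-ins for kernel allocation calls; the UNIT_TEST guard is my own convention here, and the simplified flag handling is an assumption rather than anything resembling the real kernel headers.

```c
/* User-level shims so a module written against kmalloc/kfree can be
 * linked into a unit-test exerciser. Deliberately simplified. */
#ifdef UNIT_TEST

#include <stdlib.h>

#define GFP_KERNEL 0    /* flags are irrelevant at user level; accept and ignore */

static inline void *kmalloc(size_t size, int flags)
{
    (void)flags;
    return malloc(size);
}

static inline void *kzalloc(size_t size, int flags)
{
    (void)flags;
    return calloc(1, size);
}

static inline void kfree(void *ptr)
{
    free(ptr);
}

#endif /* UNIT_TEST */
```

A fancier shim could also track allocations so the exerciser can check for leaks after each test script runs.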

If you structure everything right, so that all the proper exercisers and shims exist for your modules, you should be able to test any contiguous subset of modules in your system without having to run the rest. This makes putting everything together – and actually having it work – a breeze compared to the common approach of testing everything by itself and then throwing the whole lot in one big blender. There’s also nothing wrong with having exercisers talk to shims, e.g. for error injection. The net result is a faster debug cycle, more complete test coverage, and less pain when other people find bugs in your code.
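
Error injection through a shim can be as simple as a knob the exerciser flips before calling into the module under test; the names below (disk_shim_fail_next, shim_disk_read) are hypothetical.

```c
/* Sketch of exerciser-to-shim error injection for a made-up "disk"
 * lower interface. */
#include <errno.h>
#include <string.h>

static int disk_shim_fail_next;     /* set by the exerciser, e.g. from a test script */

void disk_shim_fail_next_read(void)
{
    disk_shim_fail_next = 1;
}

/* The module under test calls this instead of the real lower-level module. */
int shim_disk_read(unsigned long block, void *buf, size_t len)
{
    (void)block;
    if (disk_shim_fail_next) {
        disk_shim_fail_next = 0;
        return -EIO;                /* injected failure */
    }
    memset(buf, 0xab, len);         /* canned "data" is fine for a unit test */
    return 0;
}
```

Driving that knob from the exerciser’s scripts is what lets you hit error paths that would otherwise only show up in the field.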