Zen and the art of scalability

Lately I’ve been giving a lot of thought to scalability.

You see, a good portion of my work in the last few years has been scaling things. Junior admins configure and build things. Senior admins and architects are past the point of “getting $thing to work” and are into the realm of “getting $thing to work well under a massive amount of traffic”. In this arena, an ounce of design beats a pound of infrastructure.

Recently, I migrated a rather large website for an online magazine. It ran an ancient CMS and a blog platform, both built years ago and heavily customized, with custom bits of code bolted on everywhere. All of it ran on a single FreeBSD box.

My duty was to take a sync of that single box and change it from a single piece of hardware into a database backend and three frontends, all of which would be VMs on VM hosts in the new datacenter. This was easier said than done.

The software was never designed to be clustered. Multiple dirs ended up being symlinked to NFS mounts. Code was rewritten to be able to use a DB that was not “localhost”. The blog software lost posts. The CMS took forever to update its caches, relying on locally triggered events tied to the act of editing. Finally, weeks of issues, rsyncs, new services, hacking, and tears later, we had a scalable environment.

The big issue with what happened is this: it was almost all avoidable. This project should not have taken as long as it did, but the people designing the software gave no thought to scalability, nor did the people implementing it.

In light of this, I thought I’d talk about a few common choke points I run into when scaling software across multiple servers.

1. Reliance on a local filesystem. Now, every program needs files. An app useful enough to require scaling usually needs a lot of them. Images, libraries, executables… those are not what I am talking about. I am talking about lockfiles, session state information, and other things the application as a whole needs to function. Make these files NFS-safe (e.g., not relying on flock() calls) and store them apart from the aforementioned fairly static files. Better yet, make the directory they live in configurable in the config files.

2. Heavy database use with a single monolithic database. Free hint: your app should have provisions for multiple DBs, possibly spanning multiple hosts. Offloading, say, session state data or authentication to another DB can buy some valuable breathing room for the app. Again, this should be in the config file.
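A sketch of what "in the config file" might look like, with made-up hostnames: each logical database gets its own section, so sessions or auth can move to another host by editing the config, not the code.

```python
import configparser

# Hypothetical app.conf contents; db2 could be split off later
# without touching a line of application code.
CONF = """
[db_main]
host = db1.example.com
name = magazine

[db_sessions]
host = db2.example.com
name = sessions
"""

def db_settings(role):
    """Return the connection settings for one logical database role."""
    cfg = configparser.ConfigParser()
    cfg.read_string(CONF)
    return dict(cfg["db_" + role])
```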

3. The Monolithic Application. As application size and complexity grow, it is VERY useful to have parts of the app independent of the others. This lets you run specialized servers for particular parts, and it lets you pull out one part and replace it easily. It also makes testing and troubleshooting easier. Modular design is one of the (many) big reasons Postfix and qmail beat the venerable Sendmail in every regard.

4. Dozens of confusing config files. You should have, at most, a small handful of config files. Ideally, each independent part of the app (see point 3) should have its own config file. There are reasons to have many files (example: Apache2’s /etc/apache2/sites-available), but for the most part one will do nicely.
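One hypothetical layout that follows this rule: a file per independent component, plus one for settings every component shares.

```
/etc/app/
    cms.conf       # the CMS frontend
    blog.conf      # the blog engine
    shared.conf    # common settings: DB hosts, state directory, relay
```

An admin can then reason about one component by reading one file, instead of grepping a dozen.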

5. Documentation and code readability. When I hit a wall in your application, I’d like to know about it. This means useful, informative errors, well-commented code, and readable, usable documentation. Yes, I am aware that documentation sucks to write. Just remember that it sucks more to have none when you need it. Going without makes your app needlessly arcane and creates the perception that it just plain sucks, because admins like me cannot find the $suck=true config.

6. Non-use of outside services. Syslog, for example, exists for a reason. So does SMTP. If your app needs to log, use syslog properly. If it needs to send mail, use SMTP and not the sendmail binary. These things will make my life much easier, and most probably yours too.
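As an illustration of both points (the hostnames and addresses are placeholders), Python’s standard library already speaks syslog and SMTP; there is no need for a private logfile format or a shelled-out sendmail binary:

```python
import logging.handlers
import smtplib
from email.message import EmailMessage

def make_logger(name="app", socket_path="/dev/log"):
    """Return a logger that writes to the local syslog daemon, so the
    admin can centralize and rotate logs with standard tools."""
    log = logging.getLogger(name)
    log.setLevel(logging.INFO)
    log.addHandler(logging.handlers.SysLogHandler(address=socket_path))
    return log

def build_alert(subject, body, sender="app@example.com",
                rcpt="admin@example.com"):
    """Build a plain-text alert message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = rcpt
    msg["Subject"] = subject
    msg.set_content(body)
    return msg

def send_alert(msg, relay="mail.example.com"):
    """Hand the message to a configurable SMTP relay instead of
    shelling out to /usr/sbin/sendmail."""
    with smtplib.SMTP(relay) as s:
        s.send_message(msg)
```

Because the relay is a parameter, each frontend in a cluster can point at whatever mail infrastructure the site actually runs.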

7. Ignorant use of filesystems. Just because your environment initially has 100 users does not mean that /var/lib/app/home/user/ is an appropriate directory. /var/lib/app/home/u/s/user/, for example, is not much harder to code or use, and it won’t hit those magic directory-index limits that pop up on most filesystems. Don’t count on filesystem features being present unless you explicitly specify they are needed; this goes for simple things like atime up to more advanced stuff like ext3 extended attributes.
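A layout scheme like that is only a few lines. Here is an illustrative version using the first two characters of the username as intermediate levels (the padding rule for one-character names is my own choice, not part of any standard):

```python
import os

def user_dir(base, username):
    """Spread user directories across two hash levels so no single
    directory accumulates tens of thousands of entries."""
    # Pad very short names so one-letter users still get two levels.
    a, b = (username + "__")[:2]
    return os.path.join(base, a, b, username)
```

So "user" lands in /var/lib/app/home/u/s/user, matching the layout above, and a million users spread across at most a few thousand intermediate directories.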

These are a few of the common pitfalls I see apps fall into, and sadly, some never recover. If you can avoid them, you’ll be well on your way to having a reasonably scalable system and you’ll be saving yourself (and the admins) from nightmares down the line. As my pal Larry says, it’s always easier to instantiate than mutate.
