December 30, 2003
Scale and Performance
Although there's a good argument that software performance matters per se, and other arguments that (in the long view) code performance really isn't worth worrying about... recently I've been working on some speedups, because they will help my applications scale. Performance is one of several gating factors (processor, memory, disk, network...) in the server-based application I've spent a lot of time with recently. Server-based? Yes: why, a Groove Server, natch. The Groove EIS is all the Groove peer-synchronization infrastructure pieces (comms, storage, crypto, awareness, dynamics, etc.) with minimal UI and some quite aggressive "passivation". The current release of EIS (2.5i) has some fairly low limits on the number of active shared spaces, and several people (the EIS development team, QA, a few very demanding customers, and some folks like myself kicking in from the sidelines) have worked hard to eliminate the big static limitations for the 2.5j release. So far, we're very pleased with the results. The process has shown me a few things about scale, optimisation and performance. I'm still not really sure how to quantify things, though.

Scale

Several things prevent an application from scaling. In the case of EIS, the first barrier has been memory: we've seen EIS hit Windows' 2GB address-space limit (with around a thousand simultaneously open Groove spaces), and it's not pretty: there just ain't no more memory. Adding RAM ($500 for a couple of gig extra seems cheap enough)? Makes no difference at all; 32-bit apps simply don't do that. It's possible to tweak the OS some, but the only way to escape the 32-bit address limits would appear to be some major esoteric low-level reconstruction, or a complete move to a 64-bit environment - and I'm not expecting that to happen overnight. Meanwhile, this is a wake-up call: even my work laptop is close to being "memmed-out". (Oh no! Another few years of thunks!) Fortunately, memory is only a problem when you use lots at once, and we've learned not to do that with our "bots".

After memory, CPU. My code (shuffling data between Groove shared space tools and an Oracle or SQLServer database) was eating processor cycles, even on a chunky Xeon box. While this wasn't an immediate showstopper, it did limit the number of shared spaces we could synchronize in a few-hour daily window (to the low thousands), and that in turn made it difficult to schedule various different things to happen at the right time. The integration code in this case is JavaScript, but I still scraped out somewhere above a fivefold performance gain in the test lab, and nailed at least one O(N^2) problem. (That was to be the topic of this entry; the gory details are interesting, I promise, but they'll wait.)
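The gory details will wait, but the general shape of that kind of O(N^2) fix is worth a sketch. This is purely illustrative - invented names, not the actual EIS code - but the pattern is the classic one: a nested scan over incoming records becomes a one-pass index plus constant-time lookups.

    // Before: for every incoming record we scanned the whole cached list.
    // N incoming records x N cached records = O(N^2) comparisons.
    function findByIdLinear(cached, id) {
        for (var i = 0; i < cached.length; i++) {
            if (cached[i].id == id) return cached[i];
        }
        return null;
    }

    // After: build a lookup table once (O(N)); each probe is then O(1),
    // so the whole pass is O(N) instead of O(N^2).
    function buildIndex(cached) {
        var byId = {};
        for (var i = 0; i < cached.length; i++) {
            byId[cached[i].id] = cached[i];
        }
        return byId;
    }

    // Usage sketch:
    //   var byId = buildIndex(cached);
    //   for (var j = 0; j < incoming.length; j++) {
    //       var match = byId[incoming[j].id];  // was findByIdLinear(...)
    //   }

At a few thousand records per space, that difference is exactly the sort of thing that shows up on a chunky Xeon.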
The kicker was to put the new code into production, and see... zip. Nada. Approximately zero (well, maybe 25%). Turns out, the lab environment has a gigabit network to a very lightly loaded and massively overspecified SQLServer. The production network to the production database is a little different, and all the code in the world won't make it much faster.

Is it worth it? The apparently trivial performance gain does mean a very significant gain in scalability. After all, the application was CPU-bound; now it's externality-bound, and some simple expedients (careful indexing, for example) can make a big difference on the database side.
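For the flavour of "careful indexing": since the integration code drives the database from JavaScript anyway, the fix can be as mundane as making sure the columns the synchronization queries filter on carry an index. Table, column and connection details below are made up for illustration.

    // Illustrative only: "SpaceData" and "SpaceId" are invented names.
    var conn = new ActiveXObject("ADODB.Connection");
    conn.Open("Provider=SQLOLEDB;Data Source=dbserver;" +
              "Initial Catalog=syncdb;Integrated Security=SSPI;");
    // Index the column every per-space sync query filters on.
    conn.Execute("CREATE INDEX IX_SpaceData_SpaceId ON SpaceData (SpaceId)");
    conn.Close();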
Since we need several distinct servers for this customer's environment, there's also a case for considering a virtual machine architecture. VMWare ESX, for example, which I've also been tinkering with: fascinating. That network-bound app? Just run two or three virtual servers on the one piece of hardware. The 32-bit address-space limit on a box with slots for 16GB? Run multiple machines on one.

Quantifying

Quantifying (setting guidelines for scalability) suddenly got a lot harder as a result of all this work. Previously, we could safely say: you'll have problems running more than 1500 shared spaces on a single device. Now, it's a multivariate problem. How much workload? What sorts of activity in those spaces? Are the users on the same LAN? What external systems are you talking to? What's the lifetime of your spaces? These are interesting questions, though.