Robocop: Superhero, but Not a Savior

Robocop can't always save you

This image was taken by Flickr user thevoicewithin and is used under a Creative Commons license.


Until recently, our processing has run on one of two types of machines: normal (16 gigs of RAM) or high-mem (32 gigs). We use these machines to grab a client out of a queue, update our copy of the client's data from each applicable data source, re-map the client's data into our schema, and upload it to our client-facing application.
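
In rough terms, each worker sits in a loop like the sketch below. The queue, data-source, and upload helpers here are hypothetical stand-ins, not our internal APIs; only the shape of the loop mirrors the steps above.

```python
from collections import deque

# Hypothetical stand-ins for the real queue, data sources, and uploader;
# only the shape of the loop mirrors the steps described above.
def fetch_from_source(source: str, client_id: str) -> dict:
    return {"source": source, "client": client_id}      # placeholder fetch

def map_to_schema(raw: dict) -> dict:
    return {"records": list(raw.values())}               # placeholder re-mapping

def upload_to_app(client_id: str, payload: dict) -> None:
    print(f"uploaded {len(payload['records'])} records for {client_id}")

def process_queue(queue: deque, sources: list) -> None:
    while queue:
        client_id = queue.popleft()                       # grab a client out of the queue
        raw = {s: fetch_from_source(s, client_id) for s in sources}
        upload_to_app(client_id, map_to_schema(raw))      # re-map and upload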

The combination of normal and high-mem servers has covered our customer base as well as the majority of our trials. Still, every so often a prospect comes through the door that our high-mem boxes can't handle. And so was born the idea of Robocop: a new server sporting a whopping 256 gigs of RAM, with the single-minded mission of clobbering even our largest customers' sprawling data with ease.

Having Robocop roaming the streets feels great. He lets us see how the rest of our infrastructure scales: engineering can recognize front-end performance issues, and product folks can determine which visualizations fail for data of that size and shape. "Does the customer care about seeing all 2,000+ opportunities expected to close in the next 90 days? Heck, let's just onboard 'em and see what they think without spending hours just to make it work!" In other words, let's just conquer scaling with hardware (and dollars).

A few weeks ago, we attempted to onboard a customer that challenged even Robocop: our hero had spent 36 hours processing and appeared to be deadlocked. A check of Nagios and Munin showed that memory was fine; we were using just 20 gigs out of 256. The bottleneck was the CPU, which was pegged at 100%. The logs revealed that it was processing leads, and it just so happened that this customer had half a million of them!

Oh no! Gasp!! Could it be that we had found Robocop's kryptonite?

A quick look at how we process leads revealed that for each lead, we checked a field for membership in a list and appended it if it was missing. Each membership check scans the list, so doing it once per lead made the whole pass O(n²); changing the underlying data structure from a list to a set dropped that down to O(n). Once that one-line change was made, the data processed in 125 minutes.
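
To illustrate the shape of that change, here is a before/after sketch in Python; the `leads` list and its "email" field are invented for the example and aren't our actual schema.

```python
# A scaled-down example; the real case had roughly 500,000 leads.
leads = [{"email": f"user{i}@example.com"} for i in range(50_000)]

# Before: `in` on a list scans element by element, so running it for every
# lead makes the whole pass O(n^2).
seen = []
for lead in leads:
    if lead["email"] not in seen:        # O(n) scan per lead
        seen.append(lead["email"])

# After: `in` on a set is an average O(1) hash lookup, so the pass is O(n).
seen = set()
for lead in leads:
    if lead["email"] not in seen:        # O(1) lookup per lead
        seen.add(lead["email"])
```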

So it turns out that throwing money and hardware at a problem doesn't always fix it. It certainly can, but more often it's a stop-gap.

The TL;DR conclusion: big hardware won’t always solve your problems, but when it doesn’t, it can still unveil what will.