
Can a software engineer candidate build resilient software? How to find out …

January 23, 2012

Hiring a stellar software engineer is a difficult journey. Stellar software engineers show a rare combination of deep problem-solving skills, mastery of the technical domain, passion for the customer problem, humble yet determined respect for releases and dates, and (my favorite) persistence, all while working well with team members who have the same qualities.

What if being stellar is not enough? For software companies building enterprise, mission-critical products that operate in a global, 24/7 environment, learning whether a candidate can build resilient software is often another critical goal of the recruiting process.

The definition of resilient software can be indirectly highlighted by an example everyone – and I mean everyone – would rather avoid.

Imagine being a pilot guiding a single-aisle, 300-passenger aircraft on final approach. 300 passengers cannot wait to get home. At an altitude of 4,000 feet, both multi-function displays (MFDs) suddenly show a message: Unhandled Exception.

What’s the problem?  The software controlling both MFDs is not resilient enough during the most critical operation.

When hiring senior and principal software engineers, I always try to learn to what extent the candidate respects the goal of resiliency and believes in building resilient software.   In my experience, there is no such thing as “just enough resiliency”.

So how can we find out if the candidate can develop resilient software?

I suggest structuring a simple problem that can be posed during the interview. It needs to be simple in order to get the most out of the first interview.

One of my favorite interview questions involves two threads:

– Thread 1 writes a file
– Thread 2 reads the same file and creates a copy


“How can you accelerate this process?” Most candidates will identify the need for Thread 1 to write data in blocks and to notify Thread 2 as each block completes, so that Thread 2 can process block N-1 while Thread 1 writes block N.
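One way a candidate might sketch this block-by-block handoff is shown below. It is only an illustration under stated assumptions: the function names (`write_blocks`, `copy_blocks`, `run_pipeline`) and the use of a `queue.Queue` for notifications are hypothetical choices, not part of the interview question itself.

```python
import queue
import threading


def write_blocks(path, blocks, notifications):
    """Thread 1: write blocks to the file and announce each completed block."""
    offset = 0
    with open(path, "wb") as f:
        for block in blocks:
            f.write(block)
            f.flush()  # push the block to the OS before announcing it
            notifications.put((offset, len(block)))
            offset += len(block)
    notifications.put(None)  # sentinel: nothing more will be written


def copy_blocks(src_path, dst_path, notifications):
    """Thread 2: copy each announced block while Thread 1 writes the next one."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            item = notifications.get()
            if item is None:
                break
            offset, length = item
            src.seek(offset)
            dst.write(src.read(length))


def run_pipeline(blocks, src_path, dst_path):
    """Hypothetical driver: run both threads and wait for them to finish."""
    open(src_path, "wb").close()  # make sure the file exists before Thread 2 opens it
    notifications = queue.Queue()
    t1 = threading.Thread(target=write_blocks, args=(src_path, blocks, notifications))
    t2 = threading.Thread(target=copy_blocks, args=(src_path, dst_path, notifications))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
```

The notification carries only an offset and a length, so Thread 2 still reads every block back from the file.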

“How can you accelerate this process further?” Fewer candidates will identify the need for an in-memory structure, created by Thread 1, which Thread 2 can use instead of reading the file.
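A minimal sketch of the in-memory variant follows, assuming the shared structure is a simple queue of byte blocks. The names `write_and_share`, `copy_from_memory`, and `copy_via_memory` are hypothetical; a candidate could equally propose a ring buffer or a memory-mapped region.

```python
import queue
import threading


def write_and_share(path, blocks, handoff):
    """Thread 1: write each block to the file and share the same bytes in memory."""
    with open(path, "wb") as f:
        for block in blocks:
            f.write(block)
            handoff.put(block)  # Thread 2 gets the data without touching the file
    handoff.put(None)  # sentinel: no more blocks


def copy_from_memory(dst_path, handoff):
    """Thread 2: build the copy from the shared blocks, never reading the source file."""
    with open(dst_path, "wb") as dst:
        while True:
            block = handoff.get()
            if block is None:
                break
            dst.write(block)


def copy_via_memory(blocks, src_path, dst_path):
    """Hypothetical driver: run both threads and wait for them to finish."""
    handoff = queue.Queue()
    t1 = threading.Thread(target=write_and_share, args=(src_path, blocks, handoff))
    t2 = threading.Thread(target=copy_from_memory, args=(dst_path, handoff))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
```

Note that the queue here is unbounded: if Thread 2 falls behind, every pending block stays in memory.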

“How can you squeeze out more performance?” Very few candidates can identify further performance improvements at this point. Yet there are more options.

Now we can start asking probing questions to learn whether the candidate can build resilient software.

“How can you avoid excessive memory consumption in the above scenario? How could you monitor the memory consumption? What if memory was very limited? How would you adjust your engineering approach?”
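One possible answer is to bound the in-memory structure so that Thread 1 has to wait whenever Thread 2 falls behind. The sketch below assumes a `queue.Queue` with a `maxsize`, and tracks the peak number of queued blocks as a crude memory gauge. The names `produce`, `consume`, and `run_bounded` and the default limit of 4 blocks are illustrative assumptions.

```python
import queue
import threading


def produce(blocks, handoff):
    """Thread 1: put() waits while the queue is full, which caps memory use."""
    for block in blocks:
        handoff.put(block)
    handoff.put(None)  # sentinel: no more blocks


def consume(handoff, copied, stats):
    """Thread 2: drain the queue and record the largest backlog observed."""
    while True:
        stats["peak_blocks"] = max(stats["peak_blocks"], handoff.qsize())
        block = handoff.get()
        if block is None:
            break
        copied.append(block)


def run_bounded(blocks, max_blocks_in_flight=4):
    """Hypothetical driver: returns the copied blocks and the peak backlog."""
    handoff = queue.Queue(maxsize=max_blocks_in_flight)
    copied = []
    stats = {"peak_blocks": 0}
    producer = threading.Thread(target=produce, args=(blocks, handoff))
    producer.start()
    consume(handoff, copied, stats)
    producer.join()
    return copied, stats["peak_blocks"]
```

With this shape, the memory held by the handoff is roughly the queue limit times the block size, so a very limited memory budget translates into a smaller limit or smaller blocks.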

“Let’s pull the power plug.  The server is down.   How can the above scenario continue running after the server restarts?”
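A candidate might answer with some form of checkpointing. The sketch below is one such approach, simplified to the copying side only and to a single thread: after each block is durably written, the committed offset is saved to a small checkpoint file, and a restart resumes from that offset. The function names, the JSON checkpoint format, and the `max_blocks` parameter (used here only to simulate an interruption) are all hypothetical.

```python
import json
import os


def save_checkpoint(ckpt_path, offset):
    """Persist the committed offset via a temp file and an atomic rename."""
    tmp_path = ckpt_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"offset": offset}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, ckpt_path)


def load_checkpoint(ckpt_path):
    """Return the last committed offset, or 0 if no checkpoint exists."""
    try:
        with open(ckpt_path) as f:
            return json.load(f)["offset"]
    except FileNotFoundError:
        return 0


def resume_copy(src_path, dst_path, ckpt_path, block_size=4096, max_blocks=None):
    """Copy src to dst from the last checkpoint; stop early after max_blocks."""
    offset = load_checkpoint(ckpt_path)
    if not os.path.exists(dst_path):
        offset = 0  # no copy on disk, so start over
    mode = "r+b" if offset > 0 else "wb"
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, mode) as dst:
        src.seek(offset)
        dst.seek(offset)
        dst.truncate()  # discard anything written after the last checkpoint
        while max_blocks is None or copied < max_blocks:
            block = src.read(block_size)
            if not block:
                break
            dst.write(block)
            dst.flush()
            os.fsync(dst.fileno())  # data must be durable before the checkpoint
            offset += len(block)
            save_checkpoint(ckpt_path, offset)
            copied += 1
    return offset
```

Syncing after every block is slow. A follow-up discussion could explore checkpointing every N blocks and accepting that up to N blocks are redone after a restart.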

“Imagine this scenario running on a remote server, 4,000 miles away.   Can you build a quick monitoring / alert management component?  What does it look like?”
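As a rough illustration of what a "quick" monitoring component could look like, the sketch below tracks progress and raises an alert when no progress has been reported for too long. The class name `ProgressWatchdog`, the `alert` callback, and the injectable `clock` are assumptions made for this example. A real deployment would hand the alert to whatever paging or messaging system is available.

```python
import time


class ProgressWatchdog:
    """Tracks copy progress and alerts when it stalls for too long."""

    def __init__(self, stall_seconds, alert, clock=time.monotonic):
        self.stall_seconds = stall_seconds
        self.alert = alert  # callable taking a message string
        self.clock = clock  # injectable so the watchdog can be tested
        self.last_progress = clock()
        self.bytes_copied = 0

    def report(self, nbytes):
        """Called by the copying thread after each block."""
        self.bytes_copied += nbytes
        self.last_progress = self.clock()

    def check(self):
        """Called periodically; returns False and alerts if progress stalled."""
        idle = self.clock() - self.last_progress
        if idle > self.stall_seconds:
            self.alert(f"copy stalled for {idle:.0f}s at {self.bytes_copied} bytes")
            return False
        return True
```

The copying thread would call `report()` after each block, while a separate timer or thread calls `check()` at a fixed interval.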

A simple problem, yet very useful for determining whether the candidate can build software that goes to great pains to avoid Unhandled Exceptions at the most inconvenient moment.