Can a software engineer candidate build resilient software? How to find out …

Hiring a stellar software engineer is a very difficult journey. Stellar software engineers show a rare combination of deep problem-solving skills, mastery of the technical domain, passion for the customer problem, humble yet determined respect for releases and dates, and (my favorite) persistence, all while working well with other team members who have the same qualities.

What if being stellar is not enough? For software companies building enterprise, mission-critical products that operate in a global, 24/7 environment, learning whether a candidate can build resilient software is often another critical discovery goal during the recruiting process.

Resilient software is best defined indirectly, through an example everyone (and I mean everyone) would rather avoid.

Imagine being a pilot guiding a single-aisle, 300-passenger aircraft on final approach. All 300 passengers cannot wait to get home. At an altitude of 4,000 feet, both MFDs (multi-function displays) suddenly show a message: Unhandled Exception.

What’s the problem?  The software controlling both MFDs is not resilient enough during the most critical operation.

When hiring senior and principal software engineers, I always try to learn to what extent the candidate respects the goal of resiliency and believes in building resilient software.   In my experience, there is no such thing as “just enough resiliency”.

So how can we find out if the candidate can develop resilient software?

I suggest structuring a simple problem that can be posed during the interview. It needs to be simple in order to get the most out of the first interview.

One of my favorite interview questions is about two threads:

– Thread 1 writes a file
– Thread 2 reads the same file and creates a copy


“How can you accelerate this process?” Most candidates will identify the need for Thread 1 to write data in blocks and to notify Thread 2 as each block completes, so that Thread 2 processes block N-1 while Thread 1 writes block N.
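The block-wise answer can be sketched roughly as follows. This is only an illustration in Python, not a reference solution: the names `writer`, `copier`, `run_pipeline`, and `BLOCK_SIZE` are hypothetical, and a `queue.Queue` stands in for whatever notification mechanism the candidate proposes.

```python
import queue
import threading

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration


def writer(path, blocks, ready):
    """Thread 1: write each block, then announce it to Thread 2."""
    try:
        with open(path, "wb") as f:
            offset = 0
            for block in blocks:
                f.write(block)
                f.flush()  # hand the block to the OS before notifying
                ready.put((offset, len(block)))
                offset += len(block)
    finally:
        ready.put(None)  # sentinel: always release the reader, even on error


def copier(src, dst, ready):
    """Thread 2: copy announced blocks while Thread 1 writes the next one."""
    with open(dst, "wb") as fout:
        item = ready.get()  # wait for the first block so src exists
        if item is None:
            return
        with open(src, "rb") as fin:
            while item is not None:
                offset, length = item
                fin.seek(offset)
                fout.write(fin.read(length))
                item = ready.get()


def run_pipeline(blocks, src, dst):
    ready = queue.Queue()
    t1 = threading.Thread(target=writer, args=(src, blocks, ready))
    t2 = threading.Thread(target=copier, args=(src, dst, ready))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
```

The sentinel sent from the `finally` clause is a small resilience detail in its own right: if Thread 1 fails, Thread 2 does not wait forever.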

“How can you accelerate this process further?” Fewer candidates will identify the need for an in-memory structure, created by Thread 1, which Thread 2 can use instead of reading the file.
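One possible shape for the in-memory answer, again as a hedged Python sketch: Thread 1 places each block on a shared queue as it writes it, and Thread 2 copies from memory without ever opening the source file. The function names `write_and_share`, `copy_from_memory`, and `run_in_memory` are made up for this example.

```python
import queue
import threading


def write_and_share(path, blocks, handoff):
    """Thread 1: write each block to disk and share the same bytes in memory."""
    try:
        with open(path, "wb") as f:
            for block in blocks:
                f.write(block)
                handoff.put(block)  # Thread 2 reads this, not the file
    finally:
        handoff.put(None)  # sentinel: no more blocks


def copy_from_memory(dst, handoff):
    """Thread 2: build the copy from the shared blocks, with no disk reads."""
    with open(dst, "wb") as fout:
        while (block := handoff.get()) is not None:
            fout.write(block)


def run_in_memory(blocks, src, dst):
    handoff = queue.Queue()  # unbounded here; memory use is not yet limited
    t1 = threading.Thread(target=write_and_share, args=(src, blocks, handoff))
    t2 = threading.Thread(target=copy_from_memory, args=(dst, handoff))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
```

The unbounded queue is deliberate in this sketch: it is fast, but it is exactly the design that invites the memory questions later in the interview.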

“How can you squeeze out more performance?” Very few candidates can identify more performance improvement options at this point. Yet there are more options.

Now we can start asking very probing questions intended to learn if the candidate can build resilient software.

“How can you avoid excessive memory consumption in the above scenario? How could you monitor memory consumption? What if memory were very limited? How would you adjust your engineering approach?”
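One answer a candidate might give to the memory questions is to bound the in-memory structure so that the producer is forced to wait when the consumer falls behind. The Python sketch below assumes a `queue.Queue` with a `maxsize`; `MAX_BLOCKS`, `produce`, `consume`, and `run_bounded` are hypothetical names, and the sampled queue depth is only a stand-in for real memory monitoring.

```python
import queue
import threading

MAX_BLOCKS = 4  # hypothetical cap on the number of blocks held in memory


def produce(blocks, handoff, samples):
    """Thread 1: put() blocks when the queue is full (backpressure)."""
    try:
        for block in blocks:
            handoff.put(block)
            samples.append(handoff.qsize())  # crude monitoring hook
    finally:
        handoff.put(None)  # sentinel: no more blocks


def consume(handoff, sink):
    """Thread 2: drain the queue; sink stands in for the copy target."""
    while (block := handoff.get()) is not None:
        sink.append(block)


def run_bounded(blocks, max_blocks=MAX_BLOCKS):
    handoff = queue.Queue(maxsize=max_blocks)
    samples, sink = [], []
    t1 = threading.Thread(target=produce, args=(blocks, handoff, samples))
    t2 = threading.Thread(target=consume, args=(handoff, sink))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    peak = max(samples, default=0)  # highest queue depth observed
    return sink, peak
```

With this design, memory held in the queue never exceeds roughly `max_blocks` times the block size. The trade-off is that Thread 1 may stall, which is a useful point to probe with the candidate.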

“Let’s pull the power plug.  The server is down.   How can the above scenario continue running after the server restarts?”
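A candidate might answer the power-plug question with some form of durable checkpoint. The following Python sketch is one such approach under stated assumptions, not the only answer: progress is recorded in a small checkpoint file that is replaced atomically, and on restart the copy resumes from the last recorded offset. The names `load_offset`, `save_offset`, and `resumable_copy` are hypothetical, and the `max_blocks` parameter exists only to simulate an interruption.

```python
import os


def load_offset(ckpt):
    """Return the last durable offset, or 0 if no usable checkpoint exists."""
    try:
        with open(ckpt) as f:
            return int(f.read())
    except (FileNotFoundError, ValueError):
        return 0


def save_offset(ckpt, offset):
    """Write the checkpoint to a temp file, then swap it in atomically."""
    tmp = ckpt + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(offset))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, ckpt)


def resumable_copy(src, dst, ckpt, block_size=4096, max_blocks=None):
    """Copy src to dst, resuming from the checkpoint after a restart."""
    offset = load_offset(ckpt)
    mode = "r+b" if os.path.exists(dst) else "wb"
    with open(src, "rb") as fin, open(dst, mode) as fout:
        fin.seek(offset)
        fout.seek(offset)
        fout.truncate()  # drop anything written after the last checkpoint
        copied = 0
        while max_blocks is None or copied < max_blocks:
            block = fin.read(block_size)
            if not block:
                break
            fout.write(block)
            fout.flush()
            os.fsync(fout.fileno())  # data must be durable before the checkpoint
            offset += len(block)
            save_offset(ckpt, offset)
            copied += 1
    return offset
```

Syncing after every block is slow. Checkpointing every N blocks trades a little repeated work after a restart for much better throughput, which is the kind of trade-off worth hearing a candidate reason about.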

“Imagine this scenario running on a remote server, 4,000 miles away.   Can you build a quick monitoring / alert management component?  What does it look like?”
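A quick monitoring component could be as small as a heartbeat plus a watchdog check. The Python sketch below is one hypothetical design: the copy threads call `beat()` whenever they make progress, and a periodic `check()` raises an alert if no progress has been reported within a timeout. The `Heartbeat` class, the `check` function, and the `alert` callable (a stand-in for email, paging, or whatever channel reaches the operator) are all assumptions made for illustration.

```python
import threading
import time


class Heartbeat:
    """Records when the pipeline last reported progress."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def beat(self, now=None):
        """Called by the worker threads after each block."""
        with self._lock:
            self._last = time.monotonic() if now is None else now

    def is_stalled(self, now=None):
        """True if no progress was reported within the timeout."""
        now = time.monotonic() if now is None else now
        with self._lock:
            return now - self._last > self.timeout_s


def check(heartbeat, alert, now=None):
    """Watchdog step: invoke alert(message) if the pipeline looks stalled."""
    if heartbeat.is_stalled(now):
        alert("no progress reported in the last %.0f s" % heartbeat.timeout_s)
        return True
    return False
```

In a real deployment `check` would run on a timer, ideally from a separate process or machine, so that the monitor does not go down together with the thing it monitors.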

A simple problem, yet very useful for determining whether the candidate can build software that goes to great pains to avoid Unhandled Exceptions at the most inconvenient moment.

  1. January 25, 2012 at 5:37 am

    Excellent article Leon.

    People who have gone through the pain know exactly where and why problems occur. In my personal experience at a big MNC, something as simple as a missing NULL check can create havoc. These things tend to surface at run time, with a customer go-live looming, so pressure situations arise in which young, smart engineers do not want to touch other engineers' code because it is not well documented. The failure was pinned down to one place and corrected, then it popped up in another. Ultimately we found around 100 such places in the module, which was mandated for cleanup.

    ‘Resilience’ is building your code to be ‘Fail-Safe’: in a worst-case scenario, how well can my code bring down the system gracefully and tell me exactly what happened?

    To your ‘Resilience’ I would like to add ‘Reusability’. If you do not build your code with reusability in mind and centralize such code, optimization tasks become more difficult, and in some cases a nightmare, because the same change must be repeated in multiple places with a chance of missing a few. With reusability you are, in a way, building your own product framework.

    I have written a few coding guidelines. They are basic, but they are not language specific and help ensure the right practices are followed: http://codevelle.wordpress.com/category/coding-general/
    Bright engineers are expected to know these, but it is always good to find out in an interview. Young engineers who start coding badly will continue to do so when there are no peer reviews, and end up with lower expectations of themselves and their juniors.
    Coding guidelines act like traffic lights: we can break them, but we have to be prepared for the penalty.

    Your examples help reveal the logical thinking and persistence of senior programmers and principal engineers, since they can offer a variety of options. The people with more options ‘might be’ the ones who think logically, have read about or experienced the problem, and are willing to try new ways to attack it. They are not pigeon-holed into believing that there is one ‘right way’ for every situation. In the threads question, for example, we can move on to multi-processor setups as well as fail-over configurations.


  2. September 13, 2012 at 6:47 am

    Great article. Love the way it expands from the initial two-thread scenario.
