Archive for the ‘Error Management’ Category

Can a software engineer candidate build resilient software? How to find out …

January 23, 2012

Hiring a stellar software engineer is a genuinely difficult journey.  Stellar software engineers show a rare combination of deep problem-solving skills, mastery of the technical domain, passion for the customer’s problem, humble yet determined respect for releases and dates, and (my favorite) persistence – all while working well with other team members who have the same qualities.

What if being stellar is not enough?  For software companies building enterprise, mission-critical products capable of operating in a global, 24/7 environment, learning if a candidate can build resilient software is often another critical discovery goal during the recruiting process.

The definition of resilient software is best illustrated indirectly, by an example everyone – and I mean everyone – would rather avoid.

Imagine being a pilot guiding a single-aisle, 300-passenger aircraft on final approach.  The 300 passengers cannot wait to get home.  At an altitude of 4,000 feet, both MFDs – multi-function displays – suddenly show a message:  Unhandled Exception.

What’s the problem?  The software controlling both MFDs is not resilient enough during the most critical operation.

When hiring senior and principal software engineers, I always try to learn to what extent the candidate respects the goal of resiliency and believes in building resilient software.   In my experience, there is no such thing as “just enough resiliency”.

So how can we find out if the candidate can develop resilient software?

I suggest structuring a simple problem that can be posed during the interview.  It needs to be simple in order to get the most out of the first interview.

One of my favorite interview questions is about 2 threads:

– Thread 1 writes a file
– Thread 2 reads the same file and creates a copy

Simple.

“How can you accelerate this process?”   Most candidates will identify the need for Thread 1 to write data in blocks while notifying Thread 2 to process block N-1 as block N is being written.

“How can you accelerate this process further?”   Fewer candidates will identify the option of an in-memory structure – filled by Thread 1 – which Thread 2 can use instead of reading the file.

“How can you squeeze out more performance?”   Very few candidates can identify further performance improvements at this point.   Yet there are more options.
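A minimal sketch of where those answers tend to converge – blocks plus a bounded in-memory queue – might look like this (all names and sizes are illustrative, not a definitive implementation):

```python
import queue
import threading

BLOCK_SIZE = 1 << 20                 # 1 MB blocks
blocks = queue.Queue(maxsize=8)      # bounded queue: also caps memory use

def writer(path, source):
    """Thread 1: writes the file in blocks and shares each block in memory."""
    with open(path, "wb") as f:
        for block in source:
            f.write(block)
            blocks.put(block)        # Thread 2 can copy block N-1 already
    blocks.put(None)                 # sentinel: no more blocks

def copier(path):
    """Thread 2: builds the copy from memory instead of re-reading the file."""
    with open(path, "wb") as f:
        while (block := blocks.get()) is not None:
            f.write(block)

source = (b"x" * BLOCK_SIZE for _ in range(100))   # stand-in data source
t1 = threading.Thread(target=writer, args=("data.bin", source))
t2 = threading.Thread(target=copier, args=("data.copy",))
t1.start(); t2.start(); t1.join(); t2.join()
```

The maxsize argument already anticipates the next question: when the reader falls behind, the writer blocks, so memory consumption stays bounded.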

Now we can start asking very probing questions intended to learn if the candidate can build resilient software.

“How can you avoid excessive memory consumption in the above scenario?  How would you monitor the memory consumption?  What if memory were very limited?  How would you adjust your engineering approach?”

“Let’s pull the power plug.  The server is down.   How can the above scenario continue running after the server restarts?”
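One hedged way to survive the power-plug test is to record progress durably so the copy can resume after a restart. A minimal checkpoint sketch, assuming a simple JSON file format (the file names are illustrative):

```python
import json
import os

CHECKPOINT = "copy.checkpoint"

def save_checkpoint(offset):
    """Durably record how many bytes of the copy are complete."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset}, f)
        f.flush()
        os.fsync(f.fileno())          # force to disk: survives a power loss
    os.replace(tmp, CHECKPOINT)       # atomic rename: never half-written

def load_checkpoint():
    """After a restart, resume from the last durable offset (0 if none)."""
    try:
        with open(CHECKPOINT) as f:
            return json.load(f)["offset"]
    except FileNotFoundError:
        return 0
```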

“Imagine this scenario running on a remote server, 4,000 miles away.   Can you build a quick monitoring / alert management component?  What does it look like?”
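For the remote server, a heartbeat that a watcher can alert on is one possible minimal answer. How the watcher reads the heartbeat 4,000 miles away (shared storage, HTTP, a message bus) is an implementation choice; this sketch assumes a file for brevity:

```python
import json
import time

def write_heartbeat(path, state, bytes_copied):
    """Copy process: record liveness and progress on every block."""
    with open(path, "w") as f:
        json.dump({"ts": time.time(), "state": state,
                   "bytes_copied": bytes_copied}, f)

def check_heartbeat(path, max_age_s=60):
    """Remote watcher: return an alert string, or None if all is well."""
    try:
        with open(path) as f:
            hb = json.load(f)
    except (FileNotFoundError, ValueError):
        return "ALERT: no readable heartbeat"
    if time.time() - hb["ts"] > max_age_s:
        return f"ALERT: heartbeat stale for {max_age_s}+ seconds"
    if hb["state"] == "failed":
        return f"ALERT: copy failed at {hb['bytes_copied']} bytes"
    return None
```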

Simple problem.  Yet very useful for determining if the candidate can build software that takes great pains to avoid Unhandled Exceptions at the most inconvenient moment.

Why software deployment is just like a first impression (very important)

About four weeks ago, the team was busy working on Release 6.5 (no need for actual product names …).

Key metrics about Release 6.5:

– 200: critical defects addressed;  these were eagerly awaited by customers
– 10:  components refactored, which in turn helped achieve …
– 70:  percent of critical functionality covered by automated tests
– 5:  late nights addressing “fixed one defect, created several new ones” symptoms
– 1:  weekend spent polishing release candidate into gold image

The team was very tired yet happy.  Quality assurance engineers ran all certification tests with no errors.

It was finally time to send the new product release to customers.

This particular product is installed by the customer in their own environment.    Within 2 days, the customer support team began to receive telephone calls.  Within 4 days, every customer that had received and tried to install the new product release had problems that prevented the product from being used.

Actual customer complaints:

– “Installation failed.  Unable to determine the root cause”
– “Tried to start the old product release.  No longer starts”
– “Finally got the old release to work.   Lost 3 days – very unhappy”

But the customer complaint which I thought carried the most valuable message was …

“If you can’t get the installation procedure to work correctly, how can I have faith that you fixed 200 defects in this release?”

That’s why the customer’s first experience – or first impression – with a software product starts with the installation.

Even before the customer has a chance to appreciate how the new release addressed 200 defects, the installation has to work.  Yet very often, installation procedures do not receive the same attention as the actual software being installed.

Release 6.5.1 was quickly delivered with significantly better installation instructions as well as installation software:

– Ability to verify the operating environment – OS releases, required frameworks, directory structures – and identify missing components

– Ability to verify the installation of the prior software release

– Ability to back up the prior software release and create a restore package

– Ability to restart the installation process in case of a failure

– Ability to configure all parameters using a configuration editor

– Ability to detect if the customer changed critical files, including databases

– Ability to verify the new installation
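A preflight check of this kind can start as a script that refuses to proceed while naming every missing piece. A hedged sketch follows; the tools, paths, and version numbers are examples, not the actual product’s requirements:

```python
import shutil
import sys
from pathlib import Path

def preflight():
    """Verify the operating environment before touching anything."""
    problems = []
    if sys.version_info < (3, 9):
        problems.append("Python 3.9+ required")
    for tool in ("pg_dump", "systemctl"):          # illustrative required tools
        if shutil.which(tool) is None:
            problems.append(f"missing tool: {tool}")
    for d in (Path("/opt/product"), Path("/var/log/product")):
        if not d.is_dir():
            problems.append(f"missing directory: {d}")
    return problems

issues = preflight()
if issues:
    # Name every missing component, then stop before modifying anything.
    print("Installation cannot proceed:\n  - " + "\n  - ".join(issues))
    sys.exit(1)
```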

Software installation is indeed like a first impression.  Do not underestimate how important it is for the customer to expect and experience a flawless installation process.

Error localization: from definition to discipline (why it really matters)

April 23, 2010

PerfectAnalytics Software Inc. sells PerfectReporter, a software product which enjoys a market leading position.  PerfectReporter has many customers.   These customers love the product.

What could be better than a software product which delights the customer and exceeds their expectations?

Answer: how the software product handles an unexpected error at the most critical moment.   Years of accumulated customer goodwill can instantly evaporate when the customer runs a report (3 hours before a deadline) and witnesses an error which should never have occurred in the first place.

The names are of course fictional.  The situation – although simplified for the purposes of this blog – is very real.  Let’s get started.

IndustryInsiders licensed PerfectReporter software to produce very sophisticated analytic reports.    PerfectReporter is perfectly capable of …

– Reading multiple data sources
– Executing data aggregation and summarization tasks
– Enabling users to create and execute both standard and user-customized reports

Jane runs several reports every morning at 9:00am.   She uses metrics produced by these reports, performs additional analysis, and submits her own analytic brief by 12 noon.

For the first time, Jane suspects that something is very wrong.  The reports seem OK.  Yet the data does not make any sense.

Jane calls the PerfectAnalytics Software support line and opens a ticket.  She needs the problem addressed in the next 3 hours in order for her financial institution to make decisions about million-dollar investment transactions.

6 days later, the problem is solved.  Highlights (or lowlights by any other name):

– Multiple calls between the customer and PerfectAnalytics Software support organization
– Escalation by customer’s senior executive, “get your engineers onsite and do not leave until the problem is solved”
– 2 engineers (actual developers, not support engineers) working at IndustryInsiders for 6 days
– 4 engineers at PerfectAnalytics working behind the scenes to determine the root cause
– 1 day (literally 24 hours)  to build and package an emergency release
– 1 day to install the new release at IndustryInsiders

Goodwill lost at IndustryInsiders – impossible to quantify …

So what was the root cause?  There were several contributing problems, and I’d like to spend a little time on each one.

1.  One of the data updates did not complete.

2.  Aggregation procedures did not verify that required data updates had completed successfully.

3.  Metrics were produced from a mix of good and missing / partially updated data.

4.  Reports did not check whether execution could occur (no data – do not execute – signal an alarm if applicable).

The contributing error occurred at a very low level but surfaced in a report.   There were many layers between the error and the symptom.   The product architecture did nothing to facilitate error localization – the ability to quickly find the root cause despite those layers.

That’s why every software engineering organization must treat error localization as a discipline.    Suggestions:

– It should take no more than 1 hour to find the root cause of the most complicated problem.   If it takes longer than 1 hour, the software does not have enough instrumentation and meaningful error management capabilities.   Define the required error localization and instrumentation capabilities, treat them like any other feature, and ensure they appear in the product release roadmap.

– The team should spend no more than 5% of development capacity on diagnostic time (the time required to find the root cause of problems).  Regardless of your defect tracking / resolution system, start collecting and measuring both Diagnostic Time and Resolution Time.   If you haven’t done this before, the results may surprise you.

– During design reviews, propose and discuss the likely error scenarios which have the greatest probability of cascading through the product architecture.  Every cascading junction (for example: from data updates to data aggregation) is an opportunity to measure whether the architecture supports rapid error localization.
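To make the cascading-junction idea concrete, here is a hedged sketch of an aggregation step that refuses to run – with a unique error code – when its upstream feed validation did not complete. All names and codes are invented for illustration:

```python
class UpstreamNotReady(Exception):
    """Raised at a cascading junction so the root cause is named at the boundary."""

def run_aggregation(feed_status):
    """Junction: data updates -> data aggregation."""
    if feed_status.get("validated") is not True:
        # Fail loudly here, not three layers later inside a report.
        raise UpstreamNotReady(
            f"AGG-0007E aggregation skipped: feed '{feed_status.get('name')}' "
            f"not validated (state={feed_status.get('state')})")
    # ... real aggregation work would run here ...

try:
    run_aggregation({"name": "Feed A", "validated": False, "state": "rejected"})
except UpstreamNotReady as err:
    print(err)   # the root cause is one line, one layer away
```

The error is raised where the condition is detected, so the root cause sits one layer away from the symptom instead of many.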

Some of you are already asking, “How did the emergency release address the problem?  What did you change?  How did the customer respond?”

This is the best part (and I’ll share in just a moment why this happens to be the best part).

PerfectReporter Version 2.0 included the following features:

– Detailed logging about every data feed:  received (timestamp), from (source, company), number of records, etc.

– Detailed run time metrics about data feed validation:  records accepted, records rejected, reasons

– Detailed run time metrics about each aggregation task

– New validation procedures to verify if aggregation tasks were completed correctly;  if validation procedures failed, data administrators were notified via multiple channels (including text messages)

– New report execution controls;  if the validation procedures failed, affected reports were marked ‘cannot execute’ with unique error codes

– New administration portal showing status of all data feeds, aggregation tasks, reports (last executed, by whom, and errors)

Here is what Jane experienced after PerfectReporter Version 2.0 was installed:

– The report execution started and immediately ended with an error message: “Unable to execute.  Report disabled by administrator due to data validation errors.  No user action required.  Click here for details.”

Jane was very upset.  She thought Version 2.0 of PerfectReporter would solve her problems.  She clicked on ‘details’ and began reading:

“Data Feed A arrived from vendor A on [date/time].  Found 1,000 records.  Accepted 0 records.  1,000 records were rejected / failed validation rules.  All data validation errors were logged.

“Data aggregation tasks will not be scheduled.

“Report automatically disabled until enabled by the data administrator.  Data administrators (Name 1, Name 2) were automatically notified.

“No user action required.”

Jane suddenly realized that Data Feed A was created and provided by an external company.   She also began to suspect that because all records from Data Feed A were rejected, something had to be wrong with the data feed – and not the PerfectReporter software.

Within 20 minutes, one of the data administrators called Jane and told her that the data feed was indeed incorrectly packaged.   Luckily, the new version of PerfectReporter had additional data validation features which allowed him to determine the root cause very quickly and contact the vendor.   The vendor apologized and retransmitted the file.  All steps – including data aggregation – executed successfully.   The report was automatically enabled and Jane received an email from PerfectReporter: “OK to run your reports”.

Everyone at IndustryInsiders was very impressed with PerfectReporter Version 2.0 and especially how the product handled unexpected errors.

The root cause in this case was traced to an external data provider.  Yet the problem surfaced as an internal error in the PerfectReporter software.   As a result, PerfectAnalytics Software had no choice but to own the problem resolution process.

Now – the best part:  the customer was very impressed by how PerfectReporter 2.0 was able to determine the root cause of errors that originated from external data providers and take steps to protect the customer from using bad data.   Goodwill – at a very low point – was instantly restored.

Closing thoughts:

– Think about error localization like any other critical design activity

– Treat error management and instrumentation capabilities like any feature;  they have to be visible in the product strategy and release roadmap

– Capture Diagnostic Time when closing defects and problems

– Start to worry when you encounter problems requiring more than a few hours to solve

– Set a goal of building software in such a way that it takes no more than 1 hour to find the root cause of even the most complex problem

Your customers will love your products.

Plan for negative testing (or your customers will make the plan for you)

February 7, 2010

How much negative testing does your testing plan include?  Good question.

Negative testing is a broad topic with plenty of coverage elsewhere (so this is not a tutorial).  I will cover the reasons why negative testing is important, the impact of insufficient negative testing, and several practical recommendations.

Why negative testing is important

Negative testing – or actively seeking what does not work – allows all of us to witness how our software handles small or potentially serious problems without the customer seeing these problems or – worse – actually experiencing these problems.

Negative testing is much more than determining if the user entered a blank password or a password not meeting minimum security standards, i.e. length, character mix, and previous passwords used.

Since the topic of negative testing has been covered in detail by others, I will limit the discussion to 2 very important targets of negative testing (see Recommendations):  state transitions and error handling.

Impact of insufficient negative testing

The biggest impact:  the customer’s confidence in your software.  “If this simple functionality fails, what else may be broken?”

Some of the most difficult problems (which negative testing should include) are state transition failures.  Imagine a new customer placing an order and then canceling it.  The software fails to recognize the cancellation request, accepts the order anyway, ships a large item to the wrong address (because the new customer registration process fails to detect a duplicate name), and charges the customer’s credit card.   In addition to the software problems, there are now real business problems:  a shipment that should never have occurred, credit card charges that need to be reversed, scheduling the pickup / return of a large item (which could be very expensive), and incorrect customer information.
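As an illustration, a negative test for the cancellation scenario might assert that the forbidden transition fails loudly. The tiny state machine below is invented for the example; a real order system would be far richer:

```python
import unittest

class InvalidTransition(Exception):
    pass

class Order:
    """Tiny state machine: only the listed transitions are legal."""
    ALLOWED = {("NEW", "CANCELLED"), ("NEW", "SHIPPED")}

    def __init__(self):
        self.state = "NEW"

    def transition(self, new_state):
        if (self.state, new_state) not in self.ALLOWED:
            raise InvalidTransition(f"{self.state} -> {new_state}")
        self.state = new_state

class CancelledOrderTests(unittest.TestCase):
    def test_cancelled_order_cannot_ship(self):
        order = Order()
        order.transition("CANCELLED")
        # Negative test: shipping a cancelled order must be rejected loudly.
        with self.assertRaises(InvalidTransition):
            order.transition("SHIPPED")

if __name__ == "__main__":
    unittest.main()
```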

Recommendations

First – always include the individuals responsible for creating and executing test scenarios as early as possible, especially during design activities.  I suggest identifying applicable negative testing scenarios during the normal course of design discussions.   Even the most elegant design will no longer seem elegant when a critical problem is discovered by a customer – and becomes critical enough that it must be solved yesterday.   This customer will be very disappointed by the delays caused by complex testing procedures.

Rule 1:  think of negative testing as another essential element of the design process.  Any design improved by testing considerations will save a lot of time later.

It’s important to note that planning negative tests usually requires more experienced testing engineers, especially if the target of negative testing is a series of dependent steps.  For example, a batch interface between 2 systems can be seen as a collection of distinct functional components:

– Schedule / initiate data extraction from source system
– Process data extraction
– Apply transformation rules / create a new entity
– Stage
– Transmit
– Stage (again)
– Schedule / initiate update of the target system
– Verify & update logs

Every step of the above process will require some thought about how to properly structure a negative testing scenario.

What happens when an empty file is generated?  Does it mean that no transactions were found?  Or does an empty file mean something unexpected took place?

What happens when the file is 8 times larger than previously anticipated?  Will the update of the target system take an extra 8 hours?  Is this acceptable?
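Both questions become answerable once the interface makes its assumptions explicit, so a negative test can drive the guard with an empty and an oversized file. A hedged sketch (the error codes and the ceiling are invented):

```python
import os

MAX_EXTRACT_BYTES = 500 * 2**20    # illustrative ceiling: 500 MB

def validate_extract(path):
    """Guard the batch interface against both questions above."""
    size = os.path.getsize(path)
    if size == 0:
        # Empty could mean "no transactions" or a failed extract; never guess.
        raise ValueError(f"BATCH-0001E extract {path} is empty; refusing to load")
    if size > MAX_EXTRACT_BYTES:
        raise ValueError(f"BATCH-0002E extract {path} is {size} bytes, above the "
                         f"{MAX_EXTRACT_BYTES}-byte ceiling; operator review required")
    return size
```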

Ensuring that errors receive proper attention (detection, logging) is also part of negative testing.  Every error is an opportunity to show the customer how much we care when an error takes place.   There is nothing more disappointing than a misleading error message.  But customers will appreciate accurate and descriptive error messages, even when an error has occurred.

Rule 2:  prioritize and plan specific negative testing scenarios based on the functional path of what’s being tested.  Do not forget to include error management and error messages in negative testing scenarios.

Finally, always include additional time in design and testing activities for negative testing.   I suggest defining specific negative user stories or use cases and including them in your project plan of choice.  Plan, plan, and plan again … which brings us to the final Rule 3.

Rule 3:  plan for negative testing, or your customers will plan it for you.  However, in the latter case you will no longer control the schedule.

Two perfect software engineering candidates: how do you choose?

November 18, 2009

After a significant investment of time in the recruiting process, it’s a wonderful problem to have two perfect software engineering candidates.

Both candidates have deep experience in the required technologies and a demonstrated passion for developing great software products.   Plus – the software engineering team liked both during the interview process.

How do you choose one of them, even if a background thought continues to remind you that perhaps hiring both may be a good idea?

I learned over the years that one’s willingness to find a way to delight the customer in any situation – even if the product generates an error – is one of those defining elements that can make one perfect candidate seem even more perfect.

It’s a lot easier to delight the customer by delivering required functionality.   All is well when the software works as expected.

However, what happens when the customer begins to experience unexpected errors, confusing error messages, and a lack of logging or diagnostic features?  The customer’s confidence in the software can quickly erode.

The truth:  the opportunity to delight the customer when the software does not work as expected is greater than when the product works as expected.

Although no one likes errors and unexpected functional behavior, the customer’s confidence in the software will actually increase when the software handles errors with grace and precision, while providing an appropriate level of feedback to the customer.

So – how did I choose between two perfect candidates?

I asked both to consider the following scenario:

– “You are starting next Monday”

– “The product you just inherited does not contain a single error message or a message of any kind”

– “Think about an approach to begin introducing messages in the product – informational, errors, exceptions, or whatever you think may be required”

– “Let’s discuss your ideas in 10 minutes”

The first candidate struggled with the scenario.   “J” continued to look for a specific technology approach and – when faced with uncertainty – went back to a comfort zone of knowing how to use a well-known logging framework.

The second candidate got it right. “G” asked me a lot of questions about product architecture, user experience, critical functional paths, known defects, the priority of defects, and the specific customers experiencing defects (and who was most vocal).  “G” even asked which customers were at risk of not renewing the software maintenance contract. Then “G” wanted to prioritize critical functional paths and correlate them with high-priority defects.   “G” talked about the order management process and the majority of defects that occurred after an order was submitted.  “G” also wanted to learn whether certain defects were clustered in certain sections of code.   In addition, “G” thought it would be a good idea to address these defects with “one surgery” and localize changes in one or a few related components.

It became very clear who turned out to be a perfect candidate.

Why runtime awareness in any code is important

November 1, 2009

This is a classic scenario that repeats itself over and over again.

The team spent a lot of time testing a new and very important software release.   Everyone is tired but happy.  It should be a good release.   But shortly after the long-awaited release is installed by THE customer (who waited two months for this release), one of the processes performing a critical function does not work as expected.   Worse – this process performs critical transaction integrity functions, and the transaction verification logs do not contain valid data.

What happened?  Exhaustive regression tests executed perfectly.  Yet, something in the customer environment caused the software to work incorrectly.

The problem was eventually traced to a missing configuration file.  When the process started in the customer’s own environment, the code looked for a specific configuration file and could not find it (even more troublesome – the installation procedure had failed and no one noticed).   The code then proceeded to read default configuration parameters from a file so old that some of the parameters caused multiple errors much later during execution.

First – the solution (and lessons learned at the end):

– This process was immediately changed to report critical information when it started, while it was running, and when it ended.

– When the process started, it reported – among many critical items – the location and contents of all configuration parameters read, accepted, rejected, defaulted, or ignored (and the action taken afterwards)

– In addition, process initialization was gracefully terminated – with detailed error messages – if a critical parameter was incorrect.  This is a good practice to ensure the customer maintains awareness of all elements essential to the operation of any mission-critical software product.

– While the process was running, any dynamic parameter changes were logged and reported
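A minimal sketch of what that startup report might look like, assuming a simple key=value configuration format (parameter names and message codes are invented):

```python
import sys

REQUIRED = {"db_url", "feed_dir"}                  # critical: no silent defaults
DEFAULTS = {"retry_limit": "3", "log_level": "INFO"}

def load_config(path):
    """Report the fate of every parameter when the process starts."""
    params = {}
    try:
        with open(path) as f:
            for line in f:
                if "=" in line:
                    key, value = line.strip().split("=", 1)
                    params[key] = value
                    print(f"CFG-0001I {path}: accepted {key}={value}")
    except FileNotFoundError:
        # Never silently fall back to stale defaults: say so, loudly.
        print(f"CFG-0002E configuration file not found: {path}")
    missing = REQUIRED - params.keys()
    if missing:
        # Graceful termination with a precise message beats errors hours later.
        sys.exit(f"CFG-0003E missing critical parameters: {sorted(missing)}")
    for key, value in DEFAULTS.items():
        if key not in params:
            params[key] = value
            print(f"CFG-0004W {key} not set; default used: {value}")
    return params
```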

Second – what was the cost of not doing this right the first time?

– 3 days of exhaustive research by an experienced software engineer who had to jump on a plane and visit the customer.  No one else could help other than the code owner.

– 6 person-days of 2 other software engineers working around the clock to support their colleague at the customer location

– Delayed customer reference to a prospective client

Third – lessons learned:

– Make runtime awareness one of the design objectives

– Challenge the team during design reviews (“how would you know if these conditions may be present …”)

– Reject the release if insufficient runtime awareness has been engineered in the code

Runtime awareness in software products is not a new concept.  Yet its importance has never been higher.

What does a good error message look like?

October 30, 2009

It’s not easy to write good error messages.  One of my mentors many years ago told me, “it’s easy to determine if an error message is good.  One week later, wake up at 2:30 in the morning and try to understand it.  If you can easily understand what the message indicates, then the error message passes the test of being a good one”.

It’s true.

Best-of-breed error messages share these attributes:

– Unique component or caller identifier + a unique number

– Severity level:  I = informational, W = warning, E = error

– Error description (which passes the ‘2:30 in the morning sanity test’)

– Action taken by the system as a result of the error

– Action recommended to the user – if applicable

– Detailed diagnostic information (if enabled) to quickly determine the root cause and reduce diagnostic time to the bare minimum

This is a real error message which was the subject of a recent design discussion.

Before:

– “Error:  transmission failed”

After:

– TRX-SEND-0012E Transmission failed.  File=<name>, Source=<location>, Destination=<new location>.   10 MB of 50 MB transmitted.  Transmission will restart in 3 minutes.  No action required from the user.

This message clearly passes the ‘2:30 in the morning sanity test’.   No one in the software engineering team will get a call from the client or the technical support team.
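For teams that prefer to enforce this shape rather than remember it, a small helper can assemble every message the same way. A hedged sketch (the component code and fields are illustrative):

```python
def format_error(component, number, severity, description,
                 action_taken, user_action="No action required from the user.",
                 **context):
    """Assemble a message with the attributes above: unique id, severity,
    description, diagnostic context, system action, and user action."""
    ctx = ", ".join(f"{k}={v}" for k, v in context.items())
    return (f"{component}-{number:04d}{severity} {description}. "
            f"{ctx}. {action_taken} {user_action}")

print(format_error("TRX-SEND", 12, "E", "Transmission failed",
                   "Transmission will restart in 3 minutes.",
                   File="report.dat", Source="nyc-01", Destination="lon-02"))
# TRX-SEND-0012E Transmission failed. File=report.dat, Source=nyc-01,
# Destination=lon-02. Transmission will restart in 3 minutes.
# No action required from the user.
```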