Error localization: from definition to discipline (why it really matters)

PerfectAnalytics Software Inc. sells PerfectReporter, a software product which enjoys a market-leading position. PerfectReporter has many customers, and these customers love the product.

What could be better than a software product which delights the customer and exceeds their expectations?

Answer: how the software product handles an unexpected error during the most critical moment.   Years of accumulated customer good will can instantly evaporate when the customer runs a report (3 hours before a deadline) and witnesses an error which should have never occurred in the first place.

The names are, of course, fictional. The situation – although simplified for the purposes of this blog – is very real. Let’s get started.

IndustryInsiders licensed PerfectReporter software to produce very sophisticated analytic reports.    PerfectReporter is perfectly capable of …

– Reading multiple data sources
– Executing data aggregation and summarization tasks
– Enabling users to create and execute both standard and user-customized reports

Jane runs several reports every morning at 9:00am.   She uses metrics produced by these reports, performs additional analysis, and submits her own analytic brief by 12 noon.

For the first time, Jane suspects that something is very wrong. The reports seem OK, yet the data does not make any sense.

Jane calls the PerfectAnalytics Software support line and opens a ticket. She needs the problem addressed within the next 3 hours so that her firm, a financial institution, can make decisions about million-dollar investment transactions.

6 days later, the problem is solved.  Highlights (or lowlights by any other name):

– Multiple calls between the customer and PerfectAnalytics Software support organization
– Escalation by the customer’s senior executive: “Get your engineers onsite and do not leave until the problem is solved”
– 2 engineers (actual developers, not support engineers) working at IndustryInsiders for 6 days
– 4 engineers at PerfectAnalytics working behind the scenes to determine the root cause
– 1 day (literally 24 hours)  to build and package an emergency release
– 1 day to install the new release at IndustryInsiders

Good will loss at IndustryInsiders – unable to quantify …

So what was the root cause? There were several contributing problems, and I’d like to spend a little time on each one.

1.  One of the data updates did not complete.

2.  Aggregation procedures did not verify whether required data updates had completed successfully

3.  Metrics were produced with good and missing / partially updated data

4.  Reports did not check if the execution could occur (no data – do not execute – signal alarm if applicable)

The contributing error occurred at a very low level but surfaced in a report. There were many layers between the error and the symptom. The product architecture did nothing to facilitate error localization: the ability to quickly find the root cause despite multiple layers between the error and the symptom.
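One way to picture what an architecture that supports localization might look like is exception chaining, where each layer adds its own context without discarding the cause below it. The following is a minimal sketch, not PerfectReporter's actual design; `LayerError`, the layer names, and `trace_to_root` are all hypothetical names introduced for illustration.

```python
class LayerError(Exception):
    """An error tagged with the layer that raised it (hypothetical helper)."""

    def __init__(self, layer, message):
        super().__init__(message)
        self.layer = layer


def update_data(feed):
    # Lowest layer: simulate the data update that did not complete.
    raise LayerError("data-update", f"update of {feed} did not complete")


def aggregate(feed):
    try:
        update_data(feed)
    except LayerError as exc:
        # Chain rather than swallow, so the original cause survives.
        raise LayerError("aggregation", f"cannot aggregate {feed}") from exc


def run_report(feed):
    try:
        aggregate(feed)
    except LayerError as exc:
        raise LayerError("report", "report cannot execute") from exc


def trace_to_root(exc):
    """Walk the __cause__ chain from the symptom down to the root cause."""
    chain = []
    while exc is not None:
        chain.append((getattr(exc, "layer", "unknown"), str(exc)))
        exc = exc.__cause__
    return chain


try:
    run_report("feed-A")
except LayerError as symptom:
    trail = trace_to_root(symptom)
    for layer, message in trail:
        print(f"{layer}: {message}")
```

The last entry in `trail` is the lowest-level failure, so whoever sees the report-level symptom can read the path to the root cause instead of reconstructing it layer by layer.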

That’s why every software engineering organization must treat error localization as a discipline.    Suggestions:

– It should take no more than 1 hour to find the root cause of the most complicated problem.   If it takes longer than 1 hour, the software does not have enough instrumentation and meaningful error management capabilities.   Define required error localization and instrumentation capabilities, treat them as any feature, and ensure they appear in the product release roadmap.

– The team should spend no more than 5% of development capacity on diagnostic time (the time required to find the root cause of problems). Regardless of which defect tracking / resolution system you use, start collecting and measuring both Diagnostic Time and Resolution Time. If you haven’t done this before, the results may surprise you.

– During design reviews, propose and discuss the error scenarios most likely to cascade through the product architecture. Every cascading junction (for example: from data updates to data aggregation) is an opportunity to measure whether the architecture supports rapid error localization.
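The data-update-to-aggregation junction mentioned in the last suggestion can be sketched as an explicit precondition check. This is an illustration under assumptions: `UpdateStatus`, `JunctionError`, and the update names `prices` and `positions` are hypothetical, and the "aggregation" is just a sum.

```python
from enum import Enum


class UpdateStatus(Enum):
    COMPLETED = "completed"
    INCOMPLETE = "incomplete"


class JunctionError(Exception):
    """Raised when an upstream step has not met a downstream precondition."""


def incomplete_updates(statuses):
    """Return the sorted names of required updates that did not complete."""
    return sorted(
        name for name, status in statuses.items()
        if status is not UpdateStatus.COMPLETED
    )


def aggregate(statuses, values):
    """Sum the values only if every required update completed."""
    failed = incomplete_updates(statuses)
    if failed:
        # Fail at the junction and name the upstream step, instead of
        # producing metrics from partially updated data.
        raise JunctionError(
            "aggregation blocked; incomplete updates: " + ", ".join(failed)
        )
    return sum(values)


statuses = {
    "prices": UpdateStatus.COMPLETED,
    "positions": UpdateStatus.INCOMPLETE,
}
try:
    aggregate(statuses, [10, 20, 30])
except JunctionError as exc:
    junction_message = str(exc)
    print(junction_message)
```

In a design review, the question for each junction is whether a check like this exists and whether its message names the upstream step that failed.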

Some of you are already asking, “How did the emergency release address the problem?  What did you change?  How did the customer respond?”

This is the best part (and I’ll share in just a moment why this happens to be the best part).

PerfectReporter Version 2.0 included the following features:

– Detailed logging about every data feed:  received (timestamp), from (source, company), number of records, etc.

– Detailed run time metrics about data feed validation:  records accepted, records rejected, reasons

– Detailed run time metrics about each aggregation task

– New validation procedures to verify if aggregation tasks were completed correctly;  if validation procedures failed, data administrators were notified via multiple channels (including text messages)

– New report execution controls;  if the validation procedures failed, affected reports were marked ‘cannot execute’ with unique error codes

– New administration portal showing status of all data feeds, aggregation tasks, reports (last executed, by whom, and errors)
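The feed validation metrics and report execution controls listed above could be modeled roughly as follows. This is a sketch under assumptions, not the product's implementation: `FeedSummary`, `validate_feed`, `report_status`, the validation rules, and the error code `E-FEED-REJECTED` are all hypothetical, and the rule that any rejected record blocks the report is an assumed policy.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FeedSummary:
    source: str
    received: int = 0
    accepted: int = 0
    rejected: Counter = field(default_factory=Counter)  # reason -> count

    @property
    def rejected_total(self):
        return sum(self.rejected.values())


def validate_feed(source, records, rules):
    """Apply (reason, predicate) rules and count accepted/rejected records."""
    summary = FeedSummary(source=source)
    for record in records:
        summary.received += 1
        # The first rule a record fails becomes its rejection reason.
        reason = next(
            (why for why, is_valid in rules if not is_valid(record)), None
        )
        if reason is None:
            summary.accepted += 1
        else:
            summary.rejected[reason] += 1
    return summary


def report_status(summary):
    """Gate report execution on the validation outcome (assumed policy)."""
    if summary.rejected_total > 0:
        return ("cannot execute", "E-FEED-REJECTED")  # hypothetical code
    return ("ok", None)


rules = [
    ("missing amount", lambda r: "amount" in r),
    ("negative amount", lambda r: r.get("amount", 0) >= 0),
]
records = [{"amount": 10}, {"amount": -5}, {}]
summary = validate_feed("vendor A", records, rules)
status, code = report_status(summary)
print(summary.received, summary.accepted, dict(summary.rejected), status, code)
```

Keeping the counts and reasons in one summary object means the same data can feed the log, the administrator notification, and the status shown to the report user.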

Here is what Jane experienced after PerfectReporter Version 2.0 was installed:

– The report execution started and immediately ended with an error message: “Unable to execute. Report disabled by administrator due to data validation errors. No user action required. Click here for details.”

Jane was very upset.  She thought Version 2.0 of PerfectReporter would solve her problems.  She clicked on ‘details’ and began reading:

“Data Feed A arrived from vendor A on [date/time].   Found 1,000 records.   Accepted 0 records.  1,000 records were rejected / failed validation rules.   All data validation errors were logged.

“Data aggregation tasks will not be scheduled.

“Report automatically disabled until enabled by the data administrator. Data administrators (Name 1, Name 2) were automatically notified.

“No user action required.”

Jane suddenly realized that Data Feed A was created and provided by an external company. She also began to suspect that because all records from Data Feed A were rejected, something had to be wrong with the data feed – and not with the PerfectReporter software.

Within 20 minutes, one of the data administrators called Jane and told her that the data feed was indeed incorrectly packaged. Luckily, the new version of PerfectReporter had additional data validation features which allowed him to determine the root cause very quickly and contact the vendor. The vendor apologized and retransmitted the file. All steps – including data aggregation – executed successfully. The report was automatically enabled and Jane received an email from PerfectReporter: “OK to run your reports”.

Everyone at IndustryInsiders was very impressed with PerfectReporter Version 2.0, and especially with how the product handled unexpected errors.

The root cause in this case was traced to an external data provider. Yet the problem surfaced as an internal error in the PerfectReporter software. As a result, PerfectAnalytics Software had no choice but to own the problem resolution process.

Now – the best part: the customer was very impressed by how PerfectReporter 2.0 was able to determine the root cause of errors that originated from external data providers and take steps to protect the customer from using bad data. Good will – at a very low point – was instantly restored.

Closing thoughts:

– Think about error localization like any other critical design activity

– Treat error management and instrumentation capabilities like any feature;  they have to be visible in the product strategy and release roadmap

– Capture Diagnostic Time when closing defects and problems

– Start to worry when you encounter problems that require more than a few hours to solve

– Set a goal of building software in such a way that it takes no more than 1 hour to find the root cause of even the most complex problem
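Capturing Diagnostic Time can start very simply. The sketch below assumes each closed defect records three timestamps; the field names, the split of Resolution Time as "root cause found to fix delivered", the sample dates, and the 160-hour capacity figure are all assumptions made for illustration. It also uses elapsed time as a stand-in for effort, which a real tracker may measure differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Defect:
    opened: datetime            # ticket opened
    root_cause_found: datetime  # diagnosis finished
    resolved: datetime          # fix delivered

    @property
    def diagnostic_time(self):
        return self.root_cause_found - self.opened

    @property
    def resolution_time(self):
        return self.resolved - self.root_cause_found


def diagnostic_share(defects, capacity):
    """Fraction of the given capacity that was spent finding root causes."""
    total = sum((d.diagnostic_time for d in defects), timedelta())
    return total / capacity


t0 = datetime(2010, 4, 1, 9, 0)  # made-up sample data
defects = [
    Defect(t0, t0 + timedelta(hours=6), t0 + timedelta(hours=8)),
    Defect(t0, t0 + timedelta(hours=2), t0 + timedelta(hours=3)),
]
share = diagnostic_share(defects, capacity=timedelta(hours=160))
print(f"{share:.1%} of capacity spent on diagnosis")
```

With these sample numbers, 8 hours of diagnosis against 160 hours of capacity sits exactly at the 5% threshold suggested earlier.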

Your customers will love your products.

  1. April 23, 2010 at 9:49 pm

    Fantastic post, Leon! You’ve articulated clearly something many of us feel but too few implement in our products. Error localization is crucial to providing effective customer support, but it’s hard to find the hours to do a proper job achieving it.

    This was the driving force behind Gibraltar — making it so easy to instrument applications, gather the log data and analyze it to root cause — that your goal of finding the root cause of even the most complex problem in 1 hour becomes the expected norm.

    Gibraltar isn’t a silver bullet — teams still need to implement the process steps you suggest — but Gibraltar is an affordable, easily integrated tool that provides the essential infrastructure needed for your vision:

    – Aspect-oriented aspects make it trivial to instrument methods with zero coding and get details on execution rates, execution time, parameters and results including exception details

    – A web-service enables efficient, secure, automated transmission of log data to the support team

    – Compression and indexing allow huge amounts of data to be managed and analyzed

    – Analysis and visualization tools make it easy to see the big picture and drill into the details.

    – An extensibility API allows Gibraltar to be tightly integrated with the software development process and infrastructure such as automatically opening tickets in the defect tracking system as new errors are detected.

    http://www.GibraltarSoftware.com

    Thanks for doing such a great job of explaining how important it is to put in place the infrastructure and processes that allow for rapid error localization and response.

  2. April 25, 2010 at 1:45 pm

    Leon,

    Thank you for such a well-written post. I’m a generalist specializing in business relationships, monitoring and managing reputations online, internal/external customer service, visual storytelling, etc. Collaboration between the “Hard Skills Folks/Soft Skills Folks” is critical.

    The frustration “Jane” must have felt was real and stressful. When problems surface, our customers don’t care who’s at fault; they simply want their problem fixed. I find it extremely frustrating when “professionals” waste my time by telling me who’s at fault.

    When internal challenges occur, it’s critical for each member of the team to work together, keep those lines of communication open and refrain from “ego-tripping.”

    “Anytime internal customers disrespect each other they in fact help the competition.”

    Leon, I also appreciate how you used a “clean format” as you placed your closing thoughts. As you know Design does matter.

    Cheers,
