Archive for April, 2010

Error localization: from definition to discipline (why it really matters)

April 23, 2010

PerfectAnalytics Software Inc. sells PerfectReporter, a software product that enjoys a market-leading position. PerfectReporter has many customers, and these customers love the product.

What could be better than a software product which delights the customer and exceeds their expectations?

Answer: how the software product handles an unexpected error at the most critical moment. Years of accumulated customer good will can evaporate instantly when the customer runs a report (3 hours before a deadline) and witnesses an error that should never have occurred in the first place.

The names are of course fictional. The situation, although simplified for the purposes of this blog, is very real. Let’s get started.

IndustryInsiders licensed PerfectReporter software to produce very sophisticated analytic reports.    PerfectReporter is perfectly capable of …

– Reading multiple data sources
– Executing data aggregation and summarization tasks
– Enabling users to create and execute both standard and user-customized reports

Jane runs several reports every morning at 9:00am.   She uses metrics produced by these reports, performs additional analysis, and submits her own analytic brief by 12 noon.

For the first time, Jane suspects that something is very wrong. The reports seem OK, yet the data does not make any sense.

Jane calls the PerfectAnalytics Software support line and opens a ticket. She needs the problem addressed within the next 3 hours so that the financial institution can make decisions about million-dollar investment transactions.

6 days later, the problem is solved.  Highlights (or lowlights by any other name):

– Multiple calls between the customer and the PerfectAnalytics Software support organization
– Escalation by the customer’s senior executive: “get your engineers onsite and do not leave until the problem is solved”
– 2 engineers (actual developers, not support engineers) working at IndustryInsiders for 6 days
– 4 engineers at PerfectAnalytics working behind the scenes to determine the root cause
– 1 day (literally 24 hours)  to build and package an emergency release
– 1 day to install the new release at IndustryInsiders

Good will loss at IndustryInsiders – unable to quantify …

So what was the root cause? There were several, and I’d like to spend a little time on each one.

1.  One of the data updates did not complete.

2.  Aggregation procedures did not verify whether the required data updates had completed successfully.

3.  Metrics were produced from a mix of good data and missing or partially updated data.

4.  Reports did not check whether execution could occur (no data – do not execute – signal alarm if applicable).

The contributing error occurred at a very low level but surfaced in a report. There were many layers between the error and the symptom. The product architecture did nothing to facilitate error localization: the ability to quickly find the root cause despite the multiple layers between the error and the symptom.
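To make the idea concrete, here is a minimal sketch of a layered pipeline in which each junction checks its inputs and the report surfaces the stage where the problem started. Everything in it is an assumption made for illustration (the `PipelineError` type, the function names, the dictionary layout of a feed); it is not how PerfectReporter was built.

```python
class PipelineError(Exception):
    """Hypothetical error type that records which stage detected the problem."""

    def __init__(self, stage, detail):
        super().__init__(f"[{stage}] {detail}")
        self.stage = stage
        self.detail = detail


def update_data(feed):
    # Layer 1: a data update either completes or says exactly why it did not.
    if not feed.get("complete", False):
        raise PipelineError("data_update", f"feed {feed['name']!r} did not complete")
    return feed["rows"]


def aggregate(feeds):
    # Layer 2: aggregation refuses to run on missing or partial inputs.
    rows = []
    for feed in feeds:
        rows.extend(update_data(feed))
    return {"total": sum(rows)}


def run_report(feeds):
    # Layer 3: the report names the originating stage instead of showing bad numbers.
    try:
        return {"status": "ok", "metrics": aggregate(feeds)}
    except PipelineError as err:
        return {"status": "cannot execute", "failed_stage": err.stage, "detail": err.detail}
```

With a structure like this, a failed update in the lowest layer reaches the report as "cannot execute" together with the name of the stage that failed, rather than as metrics that quietly make no sense.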

That’s why every software engineering organization must treat error localization as a discipline.    Suggestions:

– It should take no more than 1 hour to find the root cause of the most complicated problem.   If it takes longer than 1 hour, the software does not have enough instrumentation and meaningful error management capabilities.   Define required error localization and instrumentation capabilities, treat them as any feature, and ensure they appear in the product release roadmap.

– The team should spend no more than 5% of development capacity on diagnostic time (the time required to find the root cause of problems). Regardless of which defect tracking or resolution system you use, start collecting and measuring both Diagnostic Time and Resolution Time. If you haven’t done this before, the results may surprise you.

– During design reviews, propose and discuss the error scenarios that are most likely to cascade through the product architecture. Every cascading junction (for example, from data updates to data aggregation) is an opportunity to measure whether the architecture supports rapid error localization.
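The Diagnostic Time and Resolution Time measurements suggested above can start very small. The sketch below assumes you can export both numbers per defect from whatever tracking system you use; the `Defect` record, the field names, and the sample figures are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Defect:
    defect_id: str
    diagnostic_hours: float  # time spent finding the root cause
    resolution_hours: float  # time spent fixing, testing, and releasing


def diagnostic_share(defects, capacity_hours):
    """Fraction of development capacity that went into diagnosis."""
    if capacity_hours <= 0:
        raise ValueError("capacity_hours must be positive")
    return sum(d.diagnostic_hours for d in defects) / capacity_hours


def slow_to_localize(defects, limit_hours=1.0):
    """IDs of defects whose root cause took longer than the target to find."""
    return [d.defect_id for d in defects if d.diagnostic_hours > limit_hours]


# Illustrative numbers only: three defects closed during a 160-hour period.
sample = [Defect("D-1", 0.5, 3.0), Defect("D-2", 6.0, 2.0), Defect("D-3", 1.5, 8.0)]
print(f"Diagnostic share of capacity: {diagnostic_share(sample, 160):.1%}")
print("Over the 1 hour target:", slow_to_localize(sample))
```

Comparing the share against the 5% guideline, and reviewing every defect on the second list, is one simple way to turn the suggestions into a routine.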

Some of you are already asking, “How did the emergency release address the problem?  What did you change?  How did the customer respond?”

This is the best part (and I’ll share in just a moment why this happens to be the best part).

PerfectReporter Version 2.0 included the following features:

– Detailed logging about every data feed:  received (timestamp), from (source, company), number of records, etc.

– Detailed run time metrics about data feed validation:  records accepted, records rejected, reasons

– Detailed run time metrics about each aggregation task

– New validation procedures to verify if aggregation tasks were completed correctly;  if validation procedures failed, data administrators were notified via multiple channels (including text messages)

– New report execution controls;  if the validation procedures failed, affected reports were marked ‘cannot execute’ with unique error codes

– New administration portal showing status of all data feeds, aggregation tasks, reports (last executed, by whom, and errors)
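As a rough illustration of the feed validation metrics and report execution controls listed above, here is a small sketch. The required fields, the "any rejected record blocks the report" policy, and the `E-FEED-VALIDATION` error code are assumptions of mine for the example, not the actual PerfectReporter rules.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class FeedResult:
    source: str
    received_at: datetime
    accepted: int = 0
    rejected: int = 0
    reasons: list = field(default_factory=list)

    @property
    def usable(self):
        # Assumed policy: a feed is usable only if nothing was rejected.
        return self.accepted > 0 and self.rejected == 0


def validate_feed(source, records, required=("id", "value")):
    """Count accepted and rejected records and keep the rejection reasons."""
    result = FeedResult(source=source, received_at=datetime.now(timezone.utc))
    for index, record in enumerate(records):
        missing = [key for key in required if record.get(key) is None]
        if missing:
            result.rejected += 1
            result.reasons.append(f"record {index}: missing {', '.join(missing)}")
        else:
            result.accepted += 1
    return result


def report_status(feed_results):
    """Decide whether a report may run; attach an error code when it may not."""
    failed = [r.source for r in feed_results if not r.usable]
    if failed:
        return {"can_execute": False, "error_code": "E-FEED-VALIDATION", "failed_feeds": failed}
    return {"can_execute": True, "error_code": None, "failed_feeds": []}
```

The point of the design is that the report never has to guess: it asks for the status, and a failed feed arrives with its source, its counts, and its reasons already attached.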

Here is what Jane experienced after PerfectReporter Version 2.0 was installed:

– The report execution started and immediately ended with an error message, “Unable to execute. Report disabled by administrator due to data validation errors. No user action required. Click here for details”

Jane was very upset.  She thought Version 2.0 of PerfectReporter would solve her problems.  She clicked on ‘details’ and began reading:

“Data Feed A arrived from vendor A on [date/time].   Found 1,000 records.   Accepted 0 records.  1,000 records were rejected / failed validation rules.   All data validation errors were logged.

“Data aggregation tasks will not be scheduled.

“Report automatically disabled until enabled by the data administrator. Data administrators (Name 1, Name 2) were automatically notified.

“No user action required.”

Jane suddenly realized that Data Feed A was created and provided by an external company. She also began to suspect that because all records from Data Feed A were rejected, something had to be wrong with the data feed – and not with the PerfectReporter software.

Within 20 minutes, one of the data administrators called Jane and told her that the data feed was indeed incorrectly packaged. Luckily, the new version of PerfectReporter had additional data validation features which allowed him to determine the root cause very quickly and contact the vendor. The vendor apologized and retransmitted the file. All steps – including data aggregation – executed successfully. The report was automatically enabled and Jane received an email from PerfectReporter, “OK to run your reports”.

Everyone at IndustryInsiders was very impressed with PerfectReporter Version 2.0, and especially with how the product handled unexpected errors.

The root cause in this case was traced to an external data provider. Yet the problem surfaced as an internal error in the PerfectReporter software. As a result, PerfectAnalytics Software had no choice but to own the problem resolution process.

Now – the best part: the customer was very impressed by how PerfectReporter 2.0 was able to determine the root cause of errors that originated from external data providers and take steps to protect the customer from using bad data. Good will – at a very low point – was instantly restored.

Closing thoughts:

– Think about error localization like any other critical design activity

– Treat error management and instrumentation capabilities like any feature;  they have to be visible in the product strategy and release roadmap

– Capture Diagnostic Time when closing defects and problems

– Start to worry when you encounter problems that require more than a few hours to solve

– Set a goal of building software in such a way that it takes no more than 1 hour to find the root cause of even the most complex problem

Your customers will love your products.

To rewire, refactor, or rewrite

April 11, 2010

What is one of the major risks – one that cannot be taken lightly – that a software company may face during a period of rapid growth?


Let’s stop for a moment and ask:  what is refactoring?

Refactoring is a conscious decision to make fundamental changes in a product architecture (take one step backwards) in order to accelerate delivery of future releases (make several steps forward). These releases may include long-awaited functionality, new competitive features, and improved performance.

It’s important to note that the single most important reason which drives the decision to refactor is acceleration of future releases.    Mature products with a short list of pending features do not typically become the target of refactoring.

What are some of the symptoms that refactoring may be necessary?

– Fixing one problem leads to several new and unexpected problems

– The backlog of critical new features is growing longer

– It takes more time to add new features with each new product release

– It also takes more time to test each new product release

– Competition is able to deliver new product releases faster

Refactoring is very expensive and carries significant business risks. That’s why most refactoring efforts tend to be reactive, which in fact exacerbates those risks. Because refactoring is so disruptive and is often viewed as a measure of last resort, aggressive timelines and the pressure to correct what has not been dealt with for a long time (aka “technical debt”) escalate the risks exponentially.

Refactoring does not have to be the measure of last resort.

A complex enterprise software product evolves continuously, with each new release adding new functionality.

The product’s architecture and code base should also evolve with each new release.

The key to success is commitment to continuous refactoring – or better viewed as ‘rewiring’:  eliminating duplicate code, making tactical changes to improve automated testing, increasing separation of concerns.

When ‘rewiring’ no longer produces results, the two remaining options – refactor or rewrite – become the subject of many healthy debates.

Commitment to ‘rewiring’ – or dedicating a percentage of time spent on each software release to make tangible code improvements – will yield significant benefits, either delaying or reducing the scope of major refactoring efforts in the future.

At some point, every software company faces the decision to refactor.  Refactoring is inevitable.  The only question is the extent of future refactoring efforts.

Categories: Software Engineering

New superstar just joined the team: it’s ‘form, storm, norm, perform’ all over again

April 2, 2010

It’s been a while since I had a chance to write another blog entry in the ‘advice for the new CTO’ category.

To get started – let’s ask a question …

After an incredible amount of effort was spent to recruit and attract a superstar software engineer, what do you do in the first 60 days to ensure that this A+ hire will succeed?

It does take an incredible amount of effort to recruit and attract a superstar software engineer.   These are real metrics (courtesy of my recent project):

– 1 opening for a very senior software engineer with ‘failure not an option’ DNA

– 1,200 resumes

– 95 good resumes (it’s hard to believe but very true)

– 10 resumes selected for phone screens (20 person-hours invested)

– 3 candidates selected for on-site interviews

– 6 hours per candidate (2 sessions over a 2 day period), 4 team members present during each team interview session

– 72 person-hours invested to interview 3 final candidates.  NOTE:  this time was not spent on the next critical release (recruiting is both essential and expensive).

Everyone felt very comfortable with Candidate A, who had shown tremendous potential to succeed.

Three weeks later Candidate A joined the team.

In many organizations, one may hear a sigh of relief, “finally – we can return to building the next product with Candidate A on board”.

The truth:  the journey to ensure that Candidate A succeeds has just started.

When someone very senior joins a product development team, the team undergoes the same transformation as if it were being formed from the very beginning.

This transformation – Form, Storm, Norm, and only then Perform – cannot be ignored. It has to be acknowledged, or the entire recruiting investment will mean very little. Either the team will reject the candidate or the candidate will leave because the team has not found a way to accept someone just as sharp – or perhaps even sharper.

Form – has just ended.  Candidate A is now Employee A – a long awaited member of the team.

Storm – will be inevitable, with lots of clouds and thunder at times. Employee A will be trying to prove himself or herself, and the competition between ideas can create many difficulties if individuals have not had some time to work together. Monitor this phase very carefully. Meet with Employee A twice a week. Meet with other members of the team and get their feedback.

Norm – if Employee A is about to suggest something that may not work well, go against the grain and agree with Employee A. Then create an opportunity for Employee A to fail, but fail very graciously. Empower the team to help Employee A recover. This will allow Employee A to build the very bond that he / she seeks. It’s perfectly OK to fail, quickly learn, and move on. The knowledge that the team will be there to help – no matter what – is a powerful motivator.

Perform – will exceed expectations once Employee A successfully navigates through the first three phases.   The first 60 days will be very critical.

For those that are interested, Bruce Tuckman – an American psychologist – first proposed the model of group development (Form, Storm, Norm, Perform) in 1965.