Neglecting batch jobs can be very expensive

Batch jobs, the processes that typically run at night without anyone watching, are not glamorous. Yet they perform some of the most critical tasks in the business environment.

Every data center with batch-intensive processing at night deals with batch window constraints. If there are 7 hours during the night to post all customer transactions, that’s the batch window. If the sales team exceeds everyone’s expectations and the number of customer transactions quickly increases by a factor of 4, well … the batch window is still 7 hours.

Neglecting batch jobs can be a very costly proposition. But first – what are the signs that batch jobs are neglected?

– Batch jobs seem to never end and no one knows why
– Batch jobs produce only 2 messages: started, ended
– The amount of time spent by the engineering team to determine root causes of batch job failures continues to increase
– The Operations team warns that jobs now finish only 20 minutes before the batch window closes
– Engineering team warns that redesigning stored procedures developed by engineers who are no longer with the team is too risky (there are no comments anywhere in the stored procedure code)

How much of the above sounds all too familiar?

The scenario I described above, where the number of customer transactions suddenly quadrupled, is real. Here is how my team learned about it:

– One of the critical batch jobs started and continued to execute well past 7:00am without any meaningful messages
– It took an enormous amount of time to understand the code written by others (and it was not very well written)
– It took another heroic effort to examine database tables and realize that the data everyone thought was improperly submitted by the customer was in fact real and urgent

Root cause: the customer had acquired a company a few months earlier and added its transactions to the file transmitted every night. The batch jobs could neither detect nor inform the Operations team that the execution profile was very different from that of the previous night.

If you design, engineer, or operate mission critical batch jobs, I suggest you sit down and set very clear objectives for not only how batch jobs will be designed but also how they will be monitored during execution.

To get you started – a few design objectives:

– Identify the technical and business drivers which directly influence the batch job execution profile. Business drivers can be “number of transactions from customer A”, “lowest number of transactions during the last month”, “highest number of transactions during the last month”, “number of unique customers found”, etc. Technical drivers can be “amount of time required to update 1,000 transactions”, etc.

– Design an approach to store this information after execution and create automated reports which show how nightly batch job runs compare. Any significant deviation will be immediately seen and become the topic of discussion, hopefully well in advance of a crisis.

– Instrument, instrument, and instrument again. Batch jobs should produce meaningful and actionable messages about their execution. There is nothing more unfortunate than an error message at 2:30am which states “unknown data encountered – execution terminated”.
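To make the first two objectives more concrete, here is a minimal sketch in Python of how a run's drivers might be captured and compared against earlier nights. Everything in it is an assumption for illustration: the `RunProfile` fields, the `find_deviations` function, the sample numbers, and the rule of flagging anything more than three standard deviations from the historical mean. A real job would load the history from wherever the profiles are stored, and the right threshold depends on your own data.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class RunProfile:
    """Drivers captured for one nightly run (field names are illustrative)."""
    run_label: str
    transaction_count: int
    unique_customers: int
    seconds_per_1000_updates: float


DRIVERS = ("transaction_count", "unique_customers", "seconds_per_1000_updates")


def find_deviations(history, current, threshold=3.0):
    """Return one warning per driver that deviates strongly from past runs.

    A driver is flagged when it lies more than `threshold` standard
    deviations from the historical mean (an assumed, deliberately simple rule).
    """
    warnings = []
    if len(history) < 2:
        return warnings  # not enough history to judge
    for name in DRIVERS:
        past = [getattr(run, name) for run in history]
        avg, spread = mean(past), stdev(past)
        value = getattr(current, name)
        if spread == 0:
            # History never varied, so any change at all is worth a look.
            deviates = value != avg
        else:
            deviates = abs(value - avg) > threshold * spread
        if deviates:
            warnings.append(
                f"{name}={value} vs. historical mean {avg:.1f} "
                f"(min {min(past)}, max {max(past)})"
            )
    return warnings


# Made-up sample data: four ordinary nights, then a night with 4x the volume.
history = [
    RunProfile("night 1", 1000, 50, 2.0),
    RunProfile("night 2", 1010, 50, 2.1),
    RunProfile("night 3", 990, 50, 1.9),
    RunProfile("night 4", 1005, 50, 2.0),
]
tonight = RunProfile("night 5", 4000, 50, 2.0)

for warning in find_deviations(history, tonight):
    print("WARNING:", warning)
```

A check like this could run at the start of the job, as soon as the input file has been counted, so the Operations team hears about an unusual night before the batch window is at risk rather than after.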

The best batch job is one that never requires a phone call at 2:30am. Even when it encounters errors, it keeps going and leaves crystal clear messages such as:

“Unexpected data encountered in Record 543, Customer=A,Source File=A.File”
“Record will be ignored and saved in File B for manual review”
“Processing will continue. No operator action required”
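A job that behaves this way might look roughly like the Python sketch below. It is only an illustration under stated assumptions: `process_file`, its parameters, and the `handle` callback (a per-record function that raises `ValueError` on data it cannot process) are hypothetical names, and the review file is a plain tab-separated text file. The point is the shape: set the bad record aside, say exactly which record it was and what happens next, and carry on.

```python
import logging

logger = logging.getLogger("nightly_batch")


def process_file(records, customer, source_file, review_file, handle):
    """Process records one by one; set bad ones aside instead of aborting.

    `handle` is a hypothetical per-record function that raises ValueError
    on data it cannot process. Returns (processed, rejected) counts.
    """
    processed = rejected = 0
    with open(review_file, "a", encoding="utf-8") as review:
        for number, record in enumerate(records, start=1):
            try:
                handle(record)
                processed += 1
            except ValueError as exc:
                rejected += 1
                # Keep the record number and raw record for manual review.
                review.write(f"{number}\t{record}\n")
                logger.warning(
                    "Unexpected data encountered in Record %d, Customer=%s, "
                    "Source File=%s (%s). Record will be ignored and saved "
                    "in %s for manual review. Processing will continue. "
                    "No operator action required.",
                    number, customer, source_file, exc, review_file,
                )
    logger.info(
        "Finished %s: %d processed, %d rejected",
        source_file, processed, rejected,
    )
    return processed, rejected
```

Returning the processed and rejected counts also gives the job two more numbers to store and compare from night to night.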

Categories: Software Engineering
  1. Leo
    November 4, 2009 at 2:31 am

    Some parts of the system are more critical than others. It’s not just about the [low] likelihood of failure, it is also about the impact of that failure. Critical systems should be well designed, period. The airline industry does not kill its passengers very often, but it does lose your luggage. The same principle should be applied to IT – be able to tell “live beings” apart from the luggage and prioritize accordingly. Having undocumented code is bad; anyone will tell you that. However, there are some pieces of software that might be thrown away very soon – then who cares? The phrase “batch job” does not sound very important, but it seems to be in your environment. Treat it as such – solve a parking problem in downtown before sending a man to the moon. Explain to the management that due to this system being critical, the development process will move slow and steady instead of quick and dirty. If they don’t get it, then you are doomed. Just my 2 cents.
