Report Failed SQL Jobs

One of the first things I check when I get to the office in the morning is any SQL agent jobs that failed overnight.  Our monitoring system will also pick these up, but I like to have a condensed report with the failed production jobs, job step(s) that failed, and error message so that I can quickly diagnose what went wrong.  We already had a script in place which mostly worked, based on this, but I wanted to modify it a bit to return some more useful information.

Since all the bones were already there, all I really needed to do was modify the code to get a list of production instances from my custom inventory, add one more layer of checking to get the status of the steps, clean up the formatting, and fix a small bug.  I will follow my usual format of explaining the code section by section, with a full link at the end.

First things first, the variables:

Most of these are for the email message itself, although a few ( $get_instances_query, $inventoryserver, $inventorydatabase ) are used for my custom inventory to return the names of all my non-Express production instances.  If you prefer something simpler, you can populate the instances via a different method, like reading in a text file.  That is actually how I have it configured for the SMTP recipients.  Of course, you will need to modify most of these variables to fit your environment.

On to the script itself:

This is mostly for the email message itself as well.  I set up the message with the variables defined above, then add all the recipients I read in from the SMTPTo.txt file.  The first line loads the SMO namespace, which is used heavily for actually determining the failed jobs.  That is followed by executing my inventory script and storing the names of all my instances I need to monitor.

Now for the actual SQL bits:

The comments should explain what is going on here, but here is a high-level overview.

  1. Connect to the instance
  2. Loop through each enabled job that has run at some point in the past
  3. Identify if the job both ran in the last 24 hours and failed, writing a blurb to the email message if so
  4. Loop through each job step for this failed job
  5. Identify if the job step failed and pull the failure message from   msdb.dbo.sysjobhistory
  6. Formatting the message using replaces, as by default the messages returned are all one line
  7. Write another blurb to the email message containing the failed step and the message, unless it is a step that didn’t have a chance to run due to a prior step failing

The most interesting part is getting the message.  As far as I know, there is no way to pull this using SMO, so I wrote a small script to do it.  It is fairly simple, just querying msdb.dbo.sysjobhistory  and returning the last result that matches the job and step ID of the failed step.

Finally comes the sending of the email message itself:

All this bit of logic does is change the body of the email if no failures are detected, else it writes the contents of all the step errors to a text file and attaches it to the email.  Here is a little sample of what the results look like in an email:

Failed Job

And here is what the attachment looks like:

I have this set up as a scheduled task that fires off every morning just before I arrive.  Here is the full code: ScriptCenter GitHub 

2 Responses

  1. AK December 17, 2015 / 1:26 PM

    Spelled out nicely.

Leave a Reply

Your email address will not be published. Required fields are marked *