Report Failed SQL Jobs

One of the first things I check when I get to the office in the morning is any SQL agent jobs that failed overnight.  Our monitoring system will also pick these up, but I like to have a condensed report with the failed production jobs, job step(s) that failed, and error message so that I can quickly diagnose what went wrong.  We already had a script in place which mostly worked, based on this, but I wanted to modify it a bit to return some more useful information.

Since all the bones were already there, all I really needed to do was modify the code to get a list of production instances from my custom inventory, add one more layer of checking to get the status of the steps, clean up the formatting, and fix a small bug.  I will follow my usual format of explaining the code section by section, with a full link at the end.

First things first, the variables:

$get_instances_query = @"
SELECT + CASE WHEN = 'Default Instance' THEN '' ELSE '\' + END AS InstanceName
FROM [dbo].[SQLInstances] si 
JOIN [dbo].[Servers] s ON si.serverid = s.serverid 
WHERE s.environment = 'Production' AND si.code = 2 AND si.Edition not like 'Express%'
$datefull          = Get-Date
$today             = $datefull.ToShortDateString()
$smtp_server       = "FQDN of Exchange"
$smtp_from         = ""
$smtp_subject      = "Failed SQL Jobs for $today"
$smtp_body         = "Here is a list of failed Production SQL Jobs for $today (over the last 24 hours)"
$errormessage      = "Error messages for the failed jobs for $today (over the last 24 hours)`r`n-------------------------------------------------------"
$smtp_recipients   = get-content $PSScriptRoot\SMTPTo.txt
$smtp_attachment   = "$PSScriptRoot\ErrorMessages.txt"
$inventoryserver   = 'InventoryServer'
$inventorydatabase = 'ServerInventory'

Most of these are for the email message itself, although a few ($get_instances_query, $inventoryserver, $inventorydatabase ) are used for my custom inventory to return the names of all my non-Express production instances.  If you prefer something simpler, you can populate the instances via a different method, like reading in a text file.  That is actually how I have it configured for the SMTP recipients.  Of course, you will need to modify most of these variables to fit your environment.

On to the script itself:

[System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SqlServer.SMO") | out-null

# Pull all Production SQL instances from Server Inventory
$sql_instances = (invoke-sqlcmd -ServerInstance $inventoryserver -Database $inventorydatabase -Query $get_instances_query).InstanceName

# Set up email message
$msg         = new-object Net.Mail.MailMessage
$smtp        = new-object Net.Mail.SmtpClient($smtp_server)
$msg.Body    = $smtp_body
$msg.From    = $smtp_from
$msg.Subject = $smtp_subject

foreach($recipient in $smtp_recipients){$msg.To.Add($recipient)}

This is mostly for the email message itself as well.  I set up the message with the variables defined above, then add all the recipients I read in from the SMTPTo.txt file.  The first line loads the SMO namespace, which is used heavily for actually determining the failed jobs.  That is followed by executing my inventory script and storing the names of all my instances I need to monitor.

Now for the actual SQL bits:

# Loop through each instance
foreach($instance in $sql_instances){

    # Set up SMO server object to pull data from
    $srv = New-Object "Microsoft.SqlServer.Management.Smo.Server" $instance;
    # Loop through each job on the instance
    foreach ($job in $srv.Jobserver.Jobs){
        # Set up job variables
        $jobName           = $job.Name;
        $jobID             = $job.JobID;
        $jobEnabled        = $job.IsEnabled;
        $jobLastRunOutcome = $job.LastRunOutcome;
        $jobLastRun        = $job.LastRunDate;
        # Filter out jobs that are disabled or have never run
        if($jobEnabled -eq "true" -and $jobLastRun){  
            # Calculate the number of days ago the job ran
            $datediff = New-TimeSpan $jobLastRun $today
            # Check to see if the job failed in the last 24 hours   
            if($datediff.days -le 1 -and $jobLastRunOutcome -eq "Failed"){
                # Append failed job information to the message body
                $msg.body += "`n`nFAILED JOB INFO: `n`tSERVER`t= $instance `n`tJOB`t`t= $jobName `n`tLASTRUN`t= $jobLastRunOutcome `n`tLASTRUNDATE`t= $jobLastRun"                
                # Loop through each step in the job
                foreach ($step in $job.JobSteps){
                    # Set up step variables
                    $stepName           = $step.Name;
                    $stepID             = $step.ID;
                    $stepLastRunOutcome = $step.LastRunOutcome;

                    # Filter out steps that succeeded
                    if($stepLastRunOutcome -eq "Failed"){
                        # Get the latest message returned for the failed step
                        $stepMessage = (invoke-sqlcmd -ServerInstance $instance -Database msdb -Query "SELECT TOP 1 message FROM msdb.dbo.sysjobhistory WHERE job_id = '$jobID' AND step_id = '$stepID' ORDER BY instance_id DESC").message
                        # Filter out steps that didn't have a chance to run (have a failed status but no message)
                        if($stepMessage.length -gt 0){
                            # Format error messages a little bit
                            $stepMessage = $stepMessage -replace 'Source:', "`r`n`r`nSource:"
                            $stepMessage = $stepMessage -replace 'Description:', "`r`nDescription:"

                            # Append failed step information to the message body
                            $errormessage += "`r`nSERVER`t`t= $instance `r`nJOB`t`t= $jobName `r`nSTEP NAME`t= $stepName `r`nMESSAGE`t`t= $stepMessage `r`n`r`n-------------------------------------------------------"

The comments should explain what is going on here, but here is a high-level overview.

  1. Connect to the instance
  2. Loop through each enabled job that has run at some point in the past
  3. Identify if the job both ran in the last 24 hours and failed, writing a blurb to the email message if so
  4. Loop through each job step for this failed job
  5. Identify if the job step failed and pull the failure message from  msdb.dbo.sysjobhistory
  6. Formatting the message using replaces, as by default the messages returned are all one line
  7. Write another blurb to the email message containing the failed step and the message, unless it is a step that didn’t have a chance to run due to a prior step failing

The most interesting part is getting the message.  As far as I know, there is no way to pull this using SMO, so I wrote a small script to do it.  It is fairly simple, just querying msdb.dbo.sysjobhistory  and returning the last result that matches the job and step ID of the failed step.

Finally comes the sending of the email message itself:

# Change the message if there were no failed jobs detected
if(($msg.body | measure-object -Line).lines -eq 1){
    $msg.body = "There were no failed Production SQL Jobs for $today (over the last 24 hours)"
    # Send completed message
    # Appends all step error messages to attachment file
    $errormessage > $smtp_attachment
    # Create attachment object and attaches to email message
    $att = new-object Net.Mail.Attachment($smtp_attachment)
    # Send completed message
    # Clean up attachment from memory and text file created
    Remove-Item -path $smtp_attachment

All this bit of logic does is change the body of the email if no failures are detected, else it writes the contents of all the step errors to a text file and attaches it to the email.  Here is a little sample of what the results look like in an email:

Failed Job

And here is what the attachment looks like:

Error messages for the failed jobs for 5/6/2015 (over the last 24 hours)
SERVER =  servername
JOB = jobname
STEP NAME = jobstepname
MESSAGE = Executed as user: . Microsoft (R) SQL Server Execute Package Utility Version 10.0.2531.0 for 32-bit Copyright (C) Microsoft Corp 1984-2005. All rights reserved. Started: 7:20:00 AM Error: 2015-05-05 07:20:05.55

Code:  some code
Source: some ssis source 
Description: some helpful error message. End Error Error: 2015-05-05 07:20:06.31

I have this set up as a scheduled task that fires off every morning just before I arrive.  Here is the full code: ScriptCenter GitHub 

4 thoughts on “Report Failed SQL Jobs

  1. hi , I came across this article in search for a simpler way to get all my failed jobs in one report. Could you please help me better in understanding the $get_instances_query, $inventoryserver, $inventorydatabase pieces to set up to make this work.. I am brand new to powershell.

    1. Those variables are all somewhat specific to my setup. All you really need to do is get a list of all your SQL instance names into the $sql_instances variable. You could do that by just typing in an array of values, reading in from a text file, etc.

Leave a Reply

Your email address will not be published. Required fields are marked *