Site Monitoring Using A Lambda Function

Background

I use a site monitoring service that checks my site every 5 minutes and sends an e-mail alert when the site is down. The service has a paid plan if you want to check your site more frequently.

A Lambda function is a great way to monitor a site, and it also lets you run the site check more frequently.

The function can be run from any region or multiple regions and can be used to monitor any web site running in any environment.

I decided to deploy my own Lambda function, which can monitor my sites at any interval I choose. The function sends an e-mail notification when it detects that a site is down. To prevent spamming, I used DynamoDB to track when the last notification was sent out.

Requirements

Make sure you have installed the Serverless Framework on your workstation.

Ensure that the workstation from which you are going to run the Serverless Framework has all the necessary permissions to publish your Lambda function. The workstation should have an ‘IAM Role’ or ‘Access Keys’ configured correctly.

For the Python runtime, I have published a Lambda Layer already in my account for the requests package. Refer to my earlier post to create the Lambda Layer.

We need an SNS topic and a DynamoDB table. I show how to set them up below.

Procedure

SNS and DynamoDB

Let us go ahead and create an SNS topic with an e-mail subscription and a DynamoDB table first.

Refer to the shell script listed below and the JSON file, which contains the information needed to set up an index on the DynamoDB table.

I could have created the DynamoDB table using Terraform, but for this post, I decided to keep things simple and used the AWS CLI.

Make sure you edit line #4 and set your e-mail for SNS Subscription.

Once you run the shell script, you will receive an e-mail as shown below.

SNS Subscription

Go ahead and click on the link provided in the e-mail to confirm your SNS subscription.

During my testing, I noticed that the DynamoDB table sometimes took a while to be fully set up, and the time to live specification could not be set until it was ready. Adding a short sleep in the script helped overcome this issue.
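As an alternative to a fixed sleep, the table status can be polled until it reports ACTIVE. The following is a minimal sketch in Python rather than the shell script used in this post. The helper name wait_for_table_active is hypothetical, and describe_table stands for any callable that returns a response shaped like DynamoDB's DescribeTable output (for example, the describe_table method of a boto3 DynamoDB client).

```python
import time


def wait_for_table_active(describe_table, table_name, attempts=10, delay=3.0,
                          sleep=time.sleep):
    """Poll until the table status is ACTIVE; return True on success.

    describe_table is assumed to accept TableName=... and to return a dict
    shaped like DynamoDB's DescribeTable response.
    """
    for attempt in range(attempts):
        response = describe_table(TableName=table_name)
        if response["Table"]["TableStatus"] == "ACTIVE":
            return True
        if attempt < attempts - 1:
            sleep(delay)  # give DynamoDB time to finish creating the table
    return False
```

With boto3 this might be called as wait_for_table_active(client.describe_table, "site-monitor") before enabling time to live. boto3 also ships a table_exists waiter that serves the same purpose.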

So far, we have created:

  • An SNS topic called site-monitor
  • An e-mail subscriber to the SNS topic
  • A DynamoDB table called site-monitor
  • A time to live attribute called ‘expires’, enabled on the DynamoDB table

The time to live specification allows us to keep our table small and compact with only relevant data.

See my other post, where I use DynamoDB with the TTL field enabled, for another example.

Lambda Function

Now it is time for us to set up our Lambda function using the Serverless Framework.

To create the template for our function, I run the commands as shown below.


sls create --template aws-python3 --path sitemonitor-iac
cd sitemonitor-iac

Here you will see two files, serverless.yml and handler.py. We are going to update these files with our code as shown below.

The file serverless.yml contains information about our Lambda function and the IAM role that will be needed by the function.

File handler.py is where we will write the code which accomplishes our task.

Let us walk through the handler.py file.

But first, a few important edits in this file:

  • Make sure you edit your account number on line #14 to ensure your topic ARN is valid.
  • Line #18 is there to avoid spamming ourselves. If my site is down, I won't get an e-mail every time the function runs. I will get at most one e-mail every 5 minutes in this case.
  • If you want to monitor multiple sites, update all_urls with the sites you want to monitor.

The following is a brief description of the function and its workflow, without going into too much detail.

Method sitemon

The Lambda runtime invokes this method (set by us in serverless.yml).

A simple for loop processes every URL in the all_urls list, invoking the method check_sites for each one. If the method returns a notification status, a dictionary variable is populated for that site.

At the end, the sns_message dictionary is examined to see if any site-down message was recorded. If so, the method site_notification sends the SNS notification.
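The loop described above could look roughly like the sketch below. It is an illustration, not the code from handler.py. The names collect_down_sites and format_message are hypothetical, and check_site is assumed to return a status string when a notification is needed and None otherwise.

```python
def collect_down_sites(all_urls, check_site):
    """Run check_site on every URL and collect the sites that need an alert.

    check_site is assumed to return a non-empty status string when a
    notification should be sent, and None otherwise.
    """
    sns_message = {}
    for url in all_urls:
        status = check_site(url)
        if status:
            sns_message[url] = status
    return sns_message


def format_message(sns_message):
    """Build the e-mail body: one line per site, sorted for stable output."""
    return "\n".join(
        f"{url}: {status}" for url, status in sorted(sns_message.items())
    )
```

A notification would only be published when the returned dictionary is not empty.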

Method check_sites

Here I make a HEAD request to the URL to see if the site responds. If the site responds, I save a status in DynamoDB using the method site_ok. Otherwise, site_down is run.
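Here is a minimal sketch of such a HEAD check. The function in this post uses the requests package from a Lambda Layer. The sketch swaps in the standard library's urllib so that it stays self-contained. The name site_responds, the 5 second timeout, and the "status below 400" rule are my assumptions for illustration.

```python
import urllib.request


def site_responds(url, timeout=5, opener=urllib.request.urlopen):
    """Return True if a HEAD request to url succeeds with a status below 400."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status < 400
    except OSError:
        # URLError, HTTPError (raised for 4xx/5xx) and socket timeouts
        # all derive from OSError, so any of them counts as "down".
        return False
```

The opener parameter exists only so the check can be exercised without network access. In normal use it can be left at its default.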

Method site_ok

A record is saved in DynamoDB with a retention period set to 2 days. This can be changed if you want to keep the site-up record for a longer duration.
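The retention period translates into the value of the ‘expires’ attribute, which DynamoDB's time to live feature expects as a number holding epoch time in seconds. The sketch below shows one way to build such a record. Only the ‘expires’ attribute name comes from this post. The function name and the other attribute names (site, timestamp, status) are hypothetical.

```python
import time

TWO_DAYS = 2 * 24 * 60 * 60  # retention for "site up" records, in seconds


def build_site_ok_item(site, now=None, retention=TWO_DAYS):
    """Return a record for a healthy site that expires `retention` seconds from now.

    Only 'expires' is taken from the post; 'site', 'timestamp' and 'status'
    are hypothetical attribute names.
    """
    now = int(time.time()) if now is None else int(now)
    return {
        "site": site,
        "timestamp": now,
        "status": "UP",
        "expires": now + retention,  # TTL value: epoch time in seconds
    }
```

Changing the retention is then a matter of passing a different number of seconds.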

Method site_down

Control reaches this method when the site did not respond to the HEAD request. The first check done here is to see when the last down-notification was sent for this site.

The query looks for the most recent notification records in DynamoDB for the site. The FilterExpression makes sure we are not looking at expired records.

If a record is found then it means a notification was sent for this site recently and a new one should not be sent out.

If no record is found, a call is made to save the site-down status and a notification-sent status record.
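The decision described above can be sketched as a small function. The filter on ‘expires’ matters because DynamoDB does not delete expired items the instant they expire, so a query can still return them for a while. The name should_notify is hypothetical, and each record is assumed to be a dictionary carrying an ‘expires’ epoch timestamp.

```python
def should_notify(notification_records, now):
    """Return True when no unexpired notification record exists for the site.

    Each record is assumed to be a dict with an 'expires' epoch timestamp.
    """
    active = [r for r in notification_records if r["expires"] > now]
    return not active
```

In the real function the filtering happens in the DynamoDB query itself, so this only mirrors the logic.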

Method save_down_status

In this method, a record is saved in DynamoDB to indicate that the site is down. This record is saved with a long retention to keep track of site outages.

Another record is saved indicating that a down-notification is being sent. The retention for this record matches our e-mail notification interval.

Let us take a look at the serverless.yml file now.

A few important edits in this file:

  • Line #13: specify the name of your deployment bucket.
  • Line #15: set a tag value that makes sense to you.
  • Line #48: edit your account number so that the Lambda Layer ARN is valid.

In this file, we set the function runtime information, the bucket where Serverless will deploy the code, and the IAM role the function needs to query and update the DynamoDB table.

Line #46 specifies that the Lambda function should invoke the method named sitemon in handler.py.

We also specify the memory allocated to our function and the timeout period.

If you use this Lambda function to monitor multiple sites, you may have to adjust the timeout value to match your requirements. If all the site checks cannot complete in 20 seconds, your Lambda function will time out.
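A rough way to size the timeout is to assume the sites are checked one after another and that each request has its own timeout. The per-request timeout is an assumption here, since the post does not state one, and both function names are hypothetical. The estimate also ignores the time spent on DynamoDB and SNS calls, so treat it as a lower bound on what you need.

```python
def worst_case_runtime(site_count, request_timeout):
    """Upper bound in seconds when sites are checked one after another."""
    return site_count * request_timeout


def fits_in_lambda_timeout(site_count, request_timeout, lambda_timeout=20):
    """Return True if the worst case still fits in the function timeout."""
    return worst_case_runtime(site_count, request_timeout) <= lambda_timeout
```

For example, four sites with a 5 second request timeout just fit in 20 seconds, while five sites do not.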

The events section in the file specifies that this function should run every 3 minutes. Serverless will take care of setting up the Event and its frequency to invoke your function accordingly.

Deployment

It is now time to deploy our function. The deployment can be done by running the command sls deploy as shown below:


sls deploy
Serverless: Packaging service...
Serverless: Excluding development dependencies...
Serverless: Uploading CloudFormation file to S3...
Serverless: Uploading artifacts...
Serverless: Uploading service sitemonitor.zip file to S3 (1.55 KB)...
Serverless: Validating template...
Serverless: Creating Stack...
Serverless: Checking Stack create progress...
....................
Serverless: Stack create finished...
Service Information
service: sitemonitor
stage: prd
region: us-east-1
stack: sitemonitor-prd
resources: 6
api keys:
  None
endpoints:
  None
functions:
  sitemon: sitemonitor-prd-sitemon
layers:
  None

This will trigger the process of creating the CloudFormation scripts which in turn will create a Lambda Function, an IAM role for your function and a CloudWatch Events Rule.

After the resource creation completes, you can go to CloudFormation and review the stack information to confirm the details. This step is optional but worth a look. If there are errors in the execution, you will most likely see them on the command line itself.

If you make any changes to either of the two files, you can simply run ‘sls deploy’ again to push the latest code up to your account.

Testing

You can either wait for the CloudWatch Events rule to trigger your function or, to test it immediately from your workstation, run this CLI command with your account number set (command and output listed):


aws lambda invoke --function-name arn:aws:lambda:us-east-1:<your account>:function:sitemonitor-prd-sitemon response.json
{
    "StatusCode": 200,
    "ExecutedVersion": "$LATEST"
}

See my other posts to get an idea of how to do a test run of a Lambda function.

If the site you are monitoring is down, you should receive an e-mail notification as shown below.

SNS notification

To get this notification, I rebooted my site and ran the Lambda function from my workstation.

Remove Resources

If you no longer wish to keep this function, you can remove the resources that were created by running this command.

sls remove

This will remove all the resources that were created. Remember that you are responsible for all charges incurred. Leaving a Lambda function with a CloudWatch Events rule enabled will keep costing you, even if you no longer need the monitoring.

You may also have to clean up your CloudWatch Logs and the S3 bucket where this code was deployed.

To ensure you do not retain too many CloudWatch logs, see my post to set retention period for your logs.

Improvements

The code shown here is meant for demonstration purposes.

Review the exception handling, add comments, and so on to make the code more meaningful for you.

Change the retention periods to suit your needs.

This may not be the best way to accomplish site monitoring, but it works for me.

You may want to send a site-up notification after a period of downtime.

Pass the all_urls list or the SNS topic using environment variables to your function, so that they are not hard coded.
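A minimal sketch of reading these settings from the environment follows. The variable names ALL_URLS (a comma-separated list) and SNS_TOPIC_ARN are hypothetical, as is the load_settings helper. They would have to be defined for the function in serverless.yml.

```python
import os


def load_settings(environ=os.environ):
    """Read monitor settings from environment variables.

    ALL_URLS (comma-separated) and SNS_TOPIC_ARN are hypothetical names.
    """
    raw_urls = environ.get("ALL_URLS", "")
    # Drop surrounding whitespace and ignore empty entries.
    all_urls = [u.strip() for u in raw_urls.split(",") if u.strip()]
    topic_arn = environ.get("SNS_TOPIC_ARN", "")
    return all_urls, topic_arn
```

The handler could call load_settings() once at start-up instead of relying on hard-coded values.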

Please monitor only those sites that you own.

Photo Credit:

Photo by Tobias Tullius on Unsplash
