Lambda Function To Remove Old EC2 Instances

Background

The firm where I currently work places a strong emphasis on security for infrastructure in the public cloud. One of its security mandates is that every EC2 instance in an Auto Scaling group be terminated after a certain period of time.

As far as I know, a Lambda function is used to terminate EC2 instances during off-peak hours for the application.

I do not have access to the source code and do not know how this is implemented, so I decided to write a Lambda function of my own and see how I could accomplish the task.

My Method

Two approaches came to mind. The first was to write a Python script and run it on a schedule from my Jenkins server, which had a lot of spare capacity. The second was to run it as a Lambda function triggered by a CloudWatch Events rule.

Either way works, and people can argue the merits of one over the other. For this post, I am going the serverless route to demonstrate how this could be achieved.

For demonstration purposes, I chose a time of one day, but in reality you should consider what is more appropriate for your account and application.

Please make sure you test this in an account where you are certain that removing EC2 instances will not impact your workload. Always test code in a non-production environment before publishing to production.

Requirements

Make sure you have installed the Serverless Framework on your workstation.

Ensure that the workstation from which you are going to run the Serverless Framework has all the permissions needed to publish your Lambda function. The workstation should have an ‘IAM Role’ or ‘Access Keys’ configured correctly.

You need at least one AWS Auto Scaling Group with some servers in it for testing. The age check can be changed in handler.py, so that you don't have to run EC2 instances for a long time to test the function.

Procedure

To create the template for our work, I ran the commands shown below.

sls create -v --template aws-python3 --path asg-ec2-age-check
cd asg-ec2-age-check

Here you will see two files, serverless.yml and handler.py. We are going to update these files with our code as described below.

The file serverless.yml contains information about our Lambda function and the IAM role that will be needed by the function.

File handler.py is where we will write the code which accomplishes our task.

Let us take a look at the serverless.yml file.

  • In the provider section, we specify that this file is for the AWS environment.
  • The runtime specifies that the function will use Python 3.7.
  • Next comes the bucket that the Serverless Framework uses to store the code.
  • If you use ‘Tags’, specify them next.
  • Next is the IAM role and its permissions. When the Lambda function runs, the role created for it will have these permissions.
  • This is followed by the function section, where we define the entry point of our Lambda function.
  • The name of the function will be a combination of the service name, stage and function name.
  • I have specified that the maximum memory allocated to the function should be 128 MB.
  • The default timeout for a Lambda function should be enough, but I wanted to show where it is set, so I used 10s as the value.
  • Next is the cron schedule. This function will run at 00:01, 08:01 and 16:01 GMT. Again, this is for demonstration purposes only. Set a schedule that works for your application.
  • If at any time you want to disable the schedule, you can set enabled: false and run sls deploy. The function will remain in your account, but will not run on its own.
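The naming rule and the run times listed above can be checked with a few lines of Python. This is only a sketch: the helper names are mine, and the cron expression is my assumption of how the three run times would be written in the six-field AWS format, since the serverless.yml itself is not reproduced here.

```python
# Sketch only: both helpers are hypothetical and exist just for illustration.

def deployed_function_name(service, stage, function):
    """Combine service, stage and function name the way the post describes."""
    return f"{service}-{stage}-{function}"


def run_times(cron_expression):
    """Expand the minute and hour fields of a simple AWS cron expression.

    Only plain numbers and comma lists are handled, which is all this
    schedule needs.
    """
    inner = cron_expression[len("cron("):-1]
    minutes, hours = inner.split()[:2]
    return [
        f"{int(h):02d}:{int(m):02d}"
        for h in hours.split(",")
        for m in minutes.split(",")
    ]


# Assumed expression for 00:01, 08:01 and 16:01 GMT.
SCHEDULE = "cron(1 0,8,16 * * ? *)"

print(deployed_function_name("asgec2agecheck", "prd", "asgec2svc"))
print(run_times(SCHEDULE))
```

The first call reproduces the function name that appears in the deploy output, asgec2agecheck-prd-asgec2svc.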

Let us take a look at our code in handler.py.

  • The method ‘asg_ec2_age_check()’ is where Amazon hands over control to our code for execution. We set this in the serverless.yml file described above.
  • The ‘list_asg()’ method is where I have set a fixed days=7 time period for this function. Any EC2 instance that was launched more than seven days ago will be selected for termination.
  • First, I loop through all the Auto Scaling Groups and check whether the instances attached to each group are healthy.
  • Why do I do this? What if we accidentally run this Lambda function twice? The Auto Scaling Group would already have started removing an instance after the first run, and we do not want to remove another instance so soon. You may only have three servers in all, and removing two might impact your workload.
  • If an instance is found to be Unhealthy, I skip the Auto Scaling Group in this run. The function can pick it up on its next run.
  • If all the instances are healthy, I get the instance info from another method and compare the instance launch time to the previously computed date. If the instance is older, its ID is sent back to the calling method.
  • The ‘mark_unhealthy()’ method is where all the collected instance IDs are marked as ‘Unhealthy’.
  • After this is done, the Auto Scaling Group handles the process of de-registering the server and of adding and removing instances.
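The steps above could be sketched roughly as follows. This is not the original handler.py: the function names asg_ec2_age_check(), list_asg() and mark_unhealthy() come from the description, but their signatures, the helper select_old_instances() and the choice to pass the boto3 clients in as arguments are my assumptions.

```python
# Sketch only, written from the description above; not the original handler.py.
from datetime import datetime, timedelta, timezone


def select_old_instances(groups, launch_times, max_age):
    """Return {asg_name: [instance_id, ...]} for instances launched before max_age.

    groups       -- the 'AutoScalingGroups' list from describe_auto_scaling_groups
    launch_times -- {instance_id: timezone-aware launch datetime}
    """
    selected = {}
    for asg in groups:
        instances = asg["Instances"]
        # Skip the whole group if any instance is not healthy: a replacement
        # may already be in progress, and the next run can pick the group up.
        if any(i["HealthStatus"] != "Healthy" for i in instances):
            continue
        old = [i["InstanceId"] for i in instances
               if i["InstanceId"] in launch_times
               and launch_times[i["InstanceId"]] < max_age]
        if old:
            selected[asg["AutoScalingGroupName"]] = old
    return selected


def list_asg(autoscaling, ec2, days=7):
    """Collect old instances from every Auto Scaling group in the region."""
    max_age = datetime.now(timezone.utc) - timedelta(days=days)
    groups = []
    for page in autoscaling.get_paginator("describe_auto_scaling_groups").paginate():
        groups.extend(page["AutoScalingGroups"])
    ids = [i["InstanceId"] for g in groups for i in g["Instances"]]
    launch_times = {}
    if ids:
        pages = ec2.get_paginator("describe_instances").paginate(InstanceIds=ids)
        for page in pages:
            for reservation in page["Reservations"]:
                for inst in reservation["Instances"]:
                    launch_times[inst["InstanceId"]] = inst["LaunchTime"]
    return select_old_instances(groups, launch_times, max_age)


def mark_unhealthy(autoscaling, selected):
    """Mark each selected instance Unhealthy so its group replaces it."""
    for instance_ids in selected.values():
        for instance_id in instance_ids:
            print("Marking Unhealthy " + instance_id)
            autoscaling.set_instance_health(
                InstanceId=instance_id, HealthStatus="Unhealthy"
            )


def asg_ec2_age_check(event, context):
    """Lambda entry point."""
    import boto3  # imported here so the pure logic above can run without boto3

    autoscaling = boto3.client("autoscaling")
    mark_unhealthy(autoscaling, list_asg(autoscaling, boto3.client("ec2")))
```

Keeping the selection logic in a function that takes plain dictionaries means it can be exercised with made-up data, without touching a real account.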

Deploy

After editing the code, go ahead and deploy your function using the command shown below.

sls deploy

You should see output similar to that shown below, which tracks the progress of the deployment.

Serverless: Packaging service...
Serverless: Excluding development dependencies...
Serverless: Uploading CloudFormation file to S3...
Serverless: Uploading artifacts...
Serverless: Uploading service asgec2agecheck.zip file to S3 (1.43 KB)...
Serverless: Validating template...
Serverless: Creating Stack...
Serverless: Checking Stack create progress...
.....................
Serverless: Stack create finished...
Service Information
service: asgec2agecheck
stage: prd
region: us-east-1
stack: asgec2agecheck-prd
resources: 6
api keys:
  None
endpoints:
  None
functions:
  asgec2svc: asgec2agecheck-prd-asgec2svc
layers:
  None

This triggers the creation of a CloudFormation stack, which in turn creates a Lambda function, an IAM role for your function and a CloudWatch Events rule.

After the resource creation completes, go to CloudFormation and review the stack information to confirm the details. Note: This is not required, but it is good to look up. If there are errors in the execution, you will most likely see them on the command line itself.

If you make any changes to either of the two files, you can simply run ‘sls deploy’ again to push the latest code up to your account.

Testing

You can either wait for the CloudWatch Events rule to trigger your function, or you can go to the Lambda console and create a test event to manually run your function.
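A third option is to invoke the function from a script. The sketch below assumes you pass in a boto3 Lambda client and that the handler does not need anything from the event; invoke_once() is a hypothetical helper name of mine.

```python
# Sketch only: invoke_once() is a hypothetical helper. Pass it a boto3
# Lambda client, for example boto3.client("lambda").
import base64


def invoke_once(lambda_client, function_name):
    """Run the function synchronously and return (status code, log tail)."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the run to finish
        LogType="Tail",                    # ask for the tail of the execution log
        Payload=b"{}",                     # assumed: the handler ignores the event
    )
    log_tail = base64.b64decode(response.get("LogResult", "")).decode("utf-8")
    return response["StatusCode"], log_tail
```

For the deployment in this post, the call would be invoke_once(boto3.client("lambda"), "asgec2agecheck-prd-asgec2svc"). Keep in mind that this runs the real function, so any instance older than the limit will be marked Unhealthy.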

See my other posts to get an idea of how to do a test run of a Lambda function.

Here is the output for my Lambda function when I ran it from the console.

START RequestId: XXXXXX Version: $LATEST
Max age:2020-01-26 20:36:31.908895+00:00
ASG->sbali-asg1
i-XXXXXX launched at:2020-01-26 12:21:27+00:00 max age:2020-01-26 20:36:31.908895+00:00
ASG->sbali-asg2
i-XXXXXX launched at:2020-01-26 13:11:48+00:00 max age:2020-01-26 20:36:31.908895+00:00
{"sbali-asg1": ["i-XXXXXX"], "sbali-asg2": ["i-YYYYYY"]}
Marking Unhealthy i-XXXXXX
Marking Unhealthy i-YYYYYY
END RequestId: XXXXXX
REPORT RequestId: XXXXXXX	Duration: 1255.33 ms	Billed Duration: 1300 ms	Memory Size: 128 MB	Max Memory Used: 88 MB	Init Duration: 388.95 ms

Remove Resources

If you no longer wish to keep this function, you can remove the resources that were created by running this command.

sls remove

This will remove all the resources that were created. Remember that you are responsible for all charges incurred. Leaving a Lambda function with a CloudWatch Events rule enabled will cost you even if there are no servers to terminate.

Improvements

The code shown here is meant for demonstration purposes. I hope you don't use this code as is in production.

Review the exception handling, add comments and so on to make the code more meaningful for you.

You can improve the code by making it more flexible. The time period used to check the age of servers can be passed as a parameter.
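One way to do this, sketched below, is to look for the limit first in the event, then in an environment variable, and only then fall back to the hard-coded value. The key name max_age_days, the variable name MAX_AGE_DAYS and both helper functions are assumptions of mine, not part of the original code.

```python
# Sketch only: the key name, variable name and helpers are hypothetical.
import os
from datetime import datetime, timedelta, timezone

DEFAULT_MAX_AGE_DAYS = 7  # same value as the fixed days=7 in list_asg()


def max_age_days(event, environ=os.environ):
    """Pick the age limit from the event, then the environment, then the default."""
    fallback = environ.get("MAX_AGE_DAYS", DEFAULT_MAX_AGE_DAYS)
    raw = (event or {}).get("max_age_days", fallback)
    days = float(raw)
    if days <= 0:
        raise ValueError("max age must be positive, got %r" % (raw,))
    return days


def cutoff(event, now=None):
    """Return the launch time before which an instance counts as too old."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=max_age_days(event))


print(cutoff({"max_age_days": 1}))
```

Accepting fractions of a day also makes testing easier, since you do not have to keep instances running for a full day before the function selects them.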

Take a look at the IAM role created along with this function. I used a wildcard; you may want to list the exact permissions instead.
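As a rough illustration, a narrower statement might list only the calls the function makes. The three actions below are my assumption, based on the steps described in this post (describe the groups, describe the instances, set instance health), so compare them with what your handler actually calls. The wildcard_actions() helper is hypothetical.

```python
# Sketch only: the action list is an assumption based on the steps in this post.
import json

STATEMENT = {
    "Effect": "Allow",
    "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:SetInstanceHealth",
        "ec2:DescribeInstances",
    ],
    "Resource": "*",
}


def wildcard_actions(statement):
    """Return the actions in a statement that still contain a wildcard."""
    actions = statement["Action"]
    if isinstance(actions, str):
        actions = [actions]
    return [action for action in actions if "*" in action]


print(json.dumps(STATEMENT, indent=2))
print(wildcard_actions({"Action": "autoscaling:*"}))
```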

Use a CloudWatch schedule that works best for your workload. Do not implement without examining your auto scaling groups.

In one of my accounts, I have an auto scaling group which manages an EKS cluster. I use a different method to drain my EKS worker nodes, so I bypass the EKS cluster in this method.

You can add a ‘continue’ in the for loop that looks at the Auto Scaling Groups.

    asgName = asg['AutoScalingGroupName']

    # Leave the EKS worker groups alone; they are drained separately.
    if asgName.startswith('eks'):
        print('Skipping ' + asgName)
        continue

Consider passing in an Auto Scaling group name as a parameter if you want to run the function for different Auto Scaling groups on different schedules.
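A minimal sketch of that idea follows. It assumes the group names arrive in the event under a key called asg_names, which is a name I made up, and it builds the keyword arguments for the boto3 describe_auto_scaling_groups() call from them. Both helper names are hypothetical.

```python
# Sketch only: the event key "asg_names" and both helpers are hypothetical.

def describe_kwargs(event):
    """Build keyword arguments for describe_auto_scaling_groups().

    With no names in the event, the result is empty and every group is listed.
    """
    names = (event or {}).get("asg_names") or []
    if isinstance(names, str):
        names = [names]  # allow a single name as well as a list
    return {"AutoScalingGroupNames": list(names)} if names else {}


def should_skip(asg_name, skip_prefixes=("eks",)):
    """True for groups that are handled elsewhere, such as EKS worker groups."""
    return asg_name.startswith(tuple(skip_prefixes))


print(describe_kwargs({"asg_names": "sbali-asg1"}))
print(should_skip("eks-workers"))
```

Each schedule can then send its own constant event, so one deployed function can serve several groups with different timings.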
