Analyze Apache Logs Using Athena – Part II

In my earlier post, I demonstrated how to copy Apache log files to S3 and then use AWS Athena to query the log data.

See the Analyze Apache Logs Using Athena post.

This worked great for me, but it did not allow me to run more fine-grained SQL queries on the data. I wanted the ability to filter rows by request time with more precision.

Problem

The default format of the request time in the Apache log is not well suited to running range queries. I could have modified the SQL queries to apply a transformation on the fly, but I decided against it.

Instead, I decided to transform the date field in the file before uploading it to S3. This allowed me to continue with the existing TABLE definition I created in my earlier post.

Procedure

Python is available on most servers, so it was my choice for writing a small script that applies the transformation to each file before uploading it to the S3 bucket.

Here is my modified ~/bin/S3backup.sh file to show the changes.


#!/bin/bash -x

IN_DIR="$HOME/local-logs/in"
SRC_DIR="/opt/bitnami/apache2/logs/"

# Copy only the access logs from the Apache log directory.
rsync -av --include "access*" --exclude "*" "$SRC_DIR" "$IN_DIR"

cd "$IN_DIR" || exit 1

# The live access_log changes between runs, so always compress and
# transform it (-f overwrites the .gz, -k keeps the original).
if [ -f "${IN_DIR}/access_log" ]; then
   gzip -fk "${IN_DIR}/access_log"
   python3 "$HOME/bin/transform.py" access_log.gz
fi

# Rotated logs do not change, so transform each one only once.
for x in access_log-*.gz; do
  if [ ! -f "${IN_DIR}/mod_${x}" ]; then
     python3 "$HOME/bin/transform.py" "${x}"
  fi
done

aws s3 sync --exclude "*" --include "mod*.gz" --profile blog \
    "$IN_DIR" s3://sbali-blog-logs/apache-logs

Changes in Script

I decided to compress the access_log file to minimize its size. In my case it is not going to save me much money, but if the files were very large, you would save on S3 object storage costs.
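As a rough illustration of why gzip pays off on log data, here is a small sketch using Python's gzip module on a synthetic log line (an assumption on my part: real access logs compress similarly well because most fields repeat from line to line):

```python
import gzip

# A synthetic, repetitive log sample -- not an actual log file.
line = '172.0.0.1 - - [06/Feb/2021:19:27:18 +0000] "GET / HTTP/1.1" 200 512\n'
raw = (line * 1000).encode()

# Compress in memory and compare sizes.
packed = gzip.compress(raw)
print(len(raw), len(packed))
```

Highly repetitive text like this shrinks dramatically; real logs will compress less, but still substantially.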

I decided to write the transformed data to a new file named mod_access_log*.gz. This was helpful for debugging and keeps the original server-generated files intact and unmodified.

If a file was already transformed earlier, no transformation is applied. This also makes it easy to re-generate files: if I wish to change the transformation at a later date, I can delete all the mod_*.gz files and run the transform again on the original log files.

Here is the transform.py script I used to apply transformation to the date field.

I have kept the script very simple, so it is easy to understand. You are free to optimize the code and add error handling as needed.


import datetime
import gzip
import sys

fn_in = sys.argv[1]
fn_out = 'mod_' + fn_in

print('Transforming...' + fn_in + ' to ' + fn_out)

with gzip.open(fn_in, 'rt') as f, gzip.open(fn_out, 'wt') as f_out:
    for line in f:
        new = line.strip().split()

        # Field 3 holds the timestamp, e.g. [06/Feb/2021:19:27:18
        dt = new[3].strip('[')
        new_dt = datetime.datetime.strptime(dt, '%d/%b/%Y:%H:%M:%S')

        # Replace it with the ISO 8601 form and drop the timezone
        # offset field (+0000]) that follows it.
        new[3] = '[' + new_dt.isoformat() + ']'
        del new[4]
        new_line = ' '.join(new)

        f_out.write(new_line + '\n')
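If you do want to add error handling, one option is to pull the per-line logic into a helper that skips lines it cannot parse. This is only a sketch; the helper name and the skip-on-error behavior are my assumptions, not part of the original script:

```python
import datetime

def transform_line(line):
    """Rewrite the Apache timestamp field to ISO 8601.

    Returns the transformed line, or None if the line does not
    match the expected layout (hypothetical policy: skip bad lines).
    """
    fields = line.strip().split()
    try:
        dt = datetime.datetime.strptime(fields[3].strip('['),
                                        '%d/%b/%Y:%H:%M:%S')
    except (IndexError, ValueError):
        return None
    fields[3] = '[' + dt.isoformat() + ']'
    del fields[4]  # drop the now-redundant timezone offset field
    return ' '.join(fields)

print(transform_line('1.2.3.4 - - [06/Feb/2021:19:27:18 +0000] "GET / HTTP/1.1" 200 -'))
```

The main loop would then write only the lines for which the helper returns a value.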

Transformation

The date in the original Apache log file looked like this:


[06/Feb/2021:19:27:18 +0000]

and after the transformation it changes to:


[2021-02-06T19:27:18]
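The conversion boils down to a single strptime/isoformat round trip, which you can verify in an interactive Python session:

```python
import datetime

# Parse the Apache timestamp (without the timezone offset field)
# and emit it in ISO 8601 form.
dt = datetime.datetime.strptime('06/Feb/2021:19:27:18',
                                '%d/%b/%Y:%H:%M:%S')
print(dt.isoformat())  # 2021-02-06T19:27:18
```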

This allowed me to keep my regex and my table definitions the same. I just had to clear out the old log files from S3 and run the S3backup.sh script again.

Here is a full transformed sample log row.


172.XX.XX.XX - - [2021-02-06T19:27:18] "HEAD / HTTP/1.1" 200 - "https://blog.skbali.com" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)" 63.XX.XX.XX. "blog.skbali.com" "TLSv1.3"

Once the data has been loaded into Athena, we can start running finer-grained queries based on the request time.

Athena Queries

With the request time in the new format, I could run queries over narrow date ranges, as shown below.

-- Number of log lines in a 10-minute period
SELECT count(*)
FROM blog_access_logs
WHERE time > '2021-02-06T08:00:00'
        AND time < '2021-02-06T08:10:00' ;

-- Number of log lines after a certain time
SELECT count(*)
FROM blog_access_logs
WHERE time >= '2021-02-06T00:00:00' ;

-- Log lines for the last 7 days
SELECT *
FROM blog_access_logs
WHERE time > to_iso8601(current_date + interval '-7' day)
order by time;

-- Similar to the above, but using a timestamp
SELECT *
FROM blog_access_logs
WHERE time > to_iso8601(current_timestamp + interval '-7' day)
order by time;

-- Log lines for the last hour
SELECT *
FROM blog_access_logs
WHERE time > to_iso8601(current_timestamp + interval '-1' hour)
order by time;

-- 404 Requests
SELECT request_url
FROM blog_access_logs
WHERE time >= '2021-02-06T00:00:00' 
AND status = '404';

I hope this post was helpful. Let me know in the comments below if you have any questions or need help with this.

Note: There is a cost associated with uploading files to S3, keeping files in S3 and running Athena queries. You are responsible for all charges incurred in your account. See Athena Pricing.


Photo Credit

Photo by Markus Winkler on Unsplash
