
ELK to AWS: Managing Logs With Less Hassle

The ELK stack boasts a range of impressive capabilities, but in some scenarios, it can be difficult to configure and maintain.

In this article, Toptal DevOps Engineer Fabrice Triboix explains why he decided to shift to a serverless solution that requires less maintenance and allows for superior scaling.


Toptal authors are vetted experts in their fields and write on topics in which they have demonstrated experience. All of our content is peer reviewed and validated by Toptal experts in the same field.

Fabrice Triboix
Verified Expert in Engineering

Fabrice is a cloud architect and software developer with 20+ years of experience who’s worked for Cisco, Samsung, Philips, Alcatel, and Sagem.

Elasticsearch is a powerful software solution designed to quickly search information in a vast range of data. Combined with Logstash and Kibana, it forms the informally named “ELK stack”, which is often used to collect, temporarily store, analyze, and visualize log data. A few other pieces of software are usually needed, such as Filebeat to send the logs from the server to Logstash, and Elastalert to generate alerts based on the results of analysis run on the data stored in Elasticsearch.

The ELK Stack is Powerful, But…

My experience with using ELK for managing logs is quite mixed. On the one hand, it’s very powerful and the range of its capabilities is quite impressive. On the other hand, it’s tricky to set up and can be a headache to maintain.

The fact is that Elasticsearch is a very good general-purpose tool that can be used in a wide variety of scenarios; it can even be used as a search engine! But because it is not specialized for managing log data, it requires more configuration work to customize its behavior for the specific needs of managing such data.

Setting up the ELK cluster was quite tricky and required me to play around with a number of parameters in order to finally get it up and running. Then came the work of configuring it. In my case, I had five different pieces of software to configure (Filebeat, Logstash, Elasticsearch, Kibana, and Elastalert). This can be quite a tedious job, as I had to read through the documentation and debug whichever element of the chain was not talking to the next one. Even after you finally get your cluster up and running, you still need to perform routine maintenance operations on it: patching, upgrading the OS packages, checking CPU, RAM, and disk usage, making minor adjustments as required, etc.

My whole ELK stack stopped working after a Logstash update. Upon closer examination, it turned out that, for some reason, the ELK developers had decided to change a keyword in their config file and pluralize it. That was the last straw, and I decided to look for a better solution (at least a better solution for my particular needs).

I wanted to store logs generated by Apache and various PHP and node apps, and to parse them to find patterns indicative of bugs in the software. The solution I found was the following:

  • Install the CloudWatch Agent on the target.
  • Configure the CloudWatch Agent to ship the logs to CloudWatch Logs.
  • Trigger invocation of Lambda functions to process the logs.
  • Have the Lambda function post a message to a Slack channel if a pattern is found.
  • Where possible, apply a filter to the CloudWatch log groups to avoid calling the Lambda function for every single log (which could ramp up the costs very quickly).

And, at a high level, that’s it! A 100% serverless solution that will work fine without any need for maintenance and that would scale well without any additional effort. The advantages of such serverless solutions over a cluster of servers are numerous:

  • In essence, all routine maintenance operations that you would periodically perform on your cluster servers are now the responsibility of the cloud provider. Any underlying server will be patched, upgraded and maintained for you without you even knowing it.
  • You don’t need to monitor your cluster anymore, and you delegate all scaling issues to the cloud provider. Indeed, a serverless setup such as the one described above will scale automatically without you having to do anything!
  • The solution described above requires less configuration, and it is very unlikely that a breaking change will be brought into the configuration formats by the cloud provider.
  • Finally, it is quite easy to write some CloudFormation templates to put all that as infrastructure-as-code. Doing the same to set up a whole ELK cluster would require a lot of work.

Configuring Slack Alerts

So now let’s get into the details! Let’s explore what a CloudFormation template would look like for such a setup, complete with Slack webhooks for alerting engineers. We need to configure the Slack side first, so let’s dive into it.

AWSTemplateFormatVersion: 2010-09-09

Description: Setup log processing

Parameters:
  SlackWebhookHost:
    Type: String
    Description: Host name for Slack web hooks
    Default: hooks.slack.com

  SlackWebhookPath:
    Type: String
    Description: Path part of the Slack webhook URL
    Default: /services/YOUR/SLACK/WEBHOOK

You will need to set up your Slack workspace for this; check out the WebHooks for Slack guide for additional info.

Once you have created your Slack app and configured an incoming webhook, the webhook URL, split into its host and path, becomes a pair of parameters of your CloudFormation stack.
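Because the template takes the host and the path as two separate parameters, it can help to split the webhook URL programmatically rather than by hand. The following is a minimal sketch; the helper name `split_webhook_url` is hypothetical, and the URL is the placeholder from the template defaults, not a real webhook.

```python
from urllib.parse import urlsplit

def split_webhook_url(url):
    """Split a Slack webhook URL into the (host, path) pair used as stack parameters."""
    parts = urlsplit(url)
    if parts.scheme != "https" or not parts.hostname:
        raise ValueError(f"Expected an https:// webhook URL, got: {url!r}")
    return parts.hostname, parts.path

# Placeholder URL built from the template defaults (not a real webhook)
host, path = split_webhook_url("https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK")
print(host)  # value for SlackWebhookHost
print(path)  # value for SlackWebhookPath
```

The two values can then be passed as the SlackWebhookHost and SlackWebhookPath parameters when creating the stack.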

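Resources:
  ApacheAccessLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      # Or whatever is good for you, as long as it is a value CloudWatch Logs accepts (e.g., 30, 60, 90, 120)
      RetentionInDays: 90

  ApacheErrorLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      # Or whatever is good for you, as long as it is a value CloudWatch Logs accepts (e.g., 30, 60, 90, 120)
      RetentionInDays: 90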

Here we created two log groups: one for the Apache access logs, the other for the Apache error logs.

I did not configure any lifecycle mechanism for the log data because it is out of the scope of this article. In practice, you would probably want a shorter retention window, and you would export the logs to S3 and design S3 lifecycle policies to move them to Glacier after a certain period of time.

Lambda Function to Process Access Logs

Now let’s implement the Lambda function that will process the Apache access logs.

  BasicLambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

Here we created an IAM role that will be attached to the Lambda functions to allow them to perform their duties. Despite its name, AWSLambdaBasicExecutionRole is an IAM policy provided by AWS. It just allows the Lambda function to create its own log group and log streams within that group, and then to send its own logs to CloudWatch Logs.

  ProcessApacheAccessLogFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Role: !GetAtt BasicLambdaExecutionRole.Arn
      Runtime: python3.12  # Originally python3.7, which Lambda has since retired
      Timeout: 10
      Environment:
        Variables:
          SLACK_WEBHOOK_HOST: !Ref SlackWebhookHost
          SLACK_WEBHOOK_PATH: !Ref SlackWebhookPath
      Code:
        ZipFile: |
          import base64
          import gzip
          import json
          import os
          from http.client import HTTPSConnection

          def handler(event, context):
              tmp = event['awslogs']['data']
              # `awslogs.data` is base64-encoded gzip'ed JSON
              tmp = base64.b64decode(tmp)
              tmp = gzip.decompress(tmp)
              tmp = json.loads(tmp)
              for log_event in tmp['logEvents']:
                  raw_log = log_event['message']
                  # Each message is one JSON-formatted Apache access log line
                  log = json.loads(raw_log)
                  if log['status'][0] == "5":
                      # This is a 5XX status code
                      print(f"Received an Apache access log with a 5XX status code: {raw_log}")
                      slack_host = os.getenv('SLACK_WEBHOOK_HOST')
                      slack_path = os.getenv('SLACK_WEBHOOK_PATH')
                      print(f"Sending Slack post to: host={slack_host}, path={slack_path}, content={raw_log}")
                      cnx = HTTPSConnection(slack_host, timeout=5)
                      cnx.request("POST", slack_path, json.dumps({'text': raw_log}),
                                  {'Content-Type': 'application/json'})
                      # It's important to read the response; if the cnx is closed too quickly, Slack might not post the msg
                      resp = cnx.getresponse()
                      resp_content = resp.read()
                      resp_code = resp.status
                      # Fail the invocation if Slack did not accept the message
                      assert resp_code == 200

So here we are defining a Lambda function to process Apache access logs. Please note that I am not using the common log format, which is the default on Apache. I configured the access log format like so (and you will notice that it essentially generates logs formatted as JSON, which makes processing further down the line a lot easier):

LogFormat "{\"vhost\": \"%v:%p\", \"client\": \"%a\", \"user\": \"%u\", \"timestamp\": \"%{%Y-%m-%dT%H:%M:%S}t\", \"request\": \"%r\", \"status\": \"%>s\", \"size\": \"%O\", \"referer\": \"%{Referer}i\", \"useragent\": \"%{User-Agent}i\"}" json

This Lambda function is written in Python 3. It takes the log line sent from CloudWatch and can search for patterns. In the example above, it just detects HTTP requests that resulted in a 5XX status code and posts a message to a Slack channel.
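To exercise the handler locally, it is useful to build a test event in the same envelope that the function unpacks: JSON, gzip-compressed, then base64-encoded under `awslogs.data`. The sketch below is a simplified round trip; the helper names and the sample log line are hypothetical, and real events from CloudWatch Logs carry additional metadata fields besides `logEvents`.

```python
import base64
import gzip
import json

def encode_awslogs_event(messages):
    """Build a test event with the same envelope the Lambda handler unpacks."""
    payload = {"logEvents": [{"message": m} for m in messages]}
    compressed = gzip.compress(json.dumps(payload).encode("utf-8"))
    return {"awslogs": {"data": base64.b64encode(compressed).decode("ascii")}}

def decode_awslogs_event(event):
    """Reverse the encoding (base64, then gzip, then JSON) and return the raw messages."""
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))
    return [log_event["message"] for log_event in payload["logEvents"]]

# Hypothetical access log line using a few of the fields from the LogFormat above
sample = json.dumps({"vhost": "example.com:443", "request": "GET / HTTP/1.1", "status": "503"})
event = encode_awslogs_event([sample])
print(decode_awslogs_event(event) == [sample])
```

An event built this way can be passed directly to `handler(event, None)` in a local test, provided the Slack call is stubbed out.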

You can do anything you like in terms of pattern detection, and the fact that it’s a true programming language (Python), as opposed to just regex patterns in a Logstash or Elastalert config file, gives you a lot of opportunities to implement complex pattern recognition.
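As one illustration of logic that would be awkward to express as a regex, the sketch below counts 5XX responses per virtual host within a batch of log lines and reports only the hosts that reach a threshold. The function name `noisy_vhosts` and the threshold value are assumptions made for this example; they are not part of the setup described in this article.

```python
import json
from collections import Counter

def noisy_vhosts(raw_logs, threshold=3):
    """Return the vhosts with at least `threshold` 5XX responses in this batch."""
    counts = Counter()
    for raw_log in raw_logs:
        try:
            log = json.loads(raw_log)
        except json.JSONDecodeError:
            continue  # Skip lines that are not valid JSON
        if not isinstance(log, dict):
            continue  # Skip JSON values that are not objects
        if str(log.get("status", "")).startswith("5"):
            counts[log.get("vhost", "unknown")] += 1
    return sorted(vhost for vhost, n in counts.items() if n >= threshold)

# Hypothetical batch: three failures on one vhost, one on another
batch = (
    ['{"vhost": "shop.example.com:443", "status": "502"}'] * 3
    + ['{"vhost": "blog.example.com:443", "status": "500"}']
)
print(noisy_vhosts(batch))
```

Inside the handler, this kind of function would replace the per-line check, so that a single Slack message is sent per noisy host instead of one per failed request.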

Revision Control

A quick word about revision control: I found that keeping the code inline in CloudFormation templates is quite acceptable and convenient for small utility Lambda functions such as this one. Of course, for a large project involving many Lambda functions and layers, this would most probably be inconvenient and you would need to use SAM.

  ApacheAccessLogFunctionPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref ProcessApacheAccessLogFunction
      Action: lambda:InvokeFunction
      Principal: logs.amazonaws.com
      SourceArn: !Sub arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:*

The above gives CloudWatch Logs permission to call your Lambda function. One word of caution: I found that using the SourceAccount property can lead to conflicts with the SourceArn.

Generally speaking, I would suggest not including it when the service calling the Lambda function is in the same AWS account. The SourceArn will prevent other accounts from calling the Lambda function anyway.

  ApacheAccessLogSubscriptionFilter:
    Type: AWS::Logs::SubscriptionFilter
    DependsOn: ApacheAccessLogFunctionPermission
    Properties:
      LogGroupName: !Ref ApacheAccessLogGroup
      DestinationArn: !GetAtt ProcessApacheAccessLogFunction.Arn
      FilterPattern: "{$.status = 5*}"

The subscription filter resource is the link between CloudWatch Logs and Lambda. Here, logs sent to the ApacheAccessLogGroup will be forwarded to the Lambda function we defined above, but only those logs that pass the filter pattern. The filter pattern expects JSON as input (it starts with “{” and ends with “}”) and will match a log entry only if it has a status field that starts with “5”.

This means that we call the Lambda function only when the HTTP status code returned by Apache is a 5XX code, which usually means something quite bad is going on. This ensures that we don’t call the Lambda function too much and thereby avoid unnecessary costs.
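Before deploying, it can be handy to estimate how many sample log lines would get past the filter. The sketch below is a rough local stand-in for the pattern `{$.status = 5*}`, written in plain Python; it is not CloudWatch's matching engine, it assumes that non-JSON lines do not match a JSON pattern, and the function name and sample lines are hypothetical.

```python
import json

def passes_status_filter(raw_log):
    """Rough local approximation of the filter pattern {$.status = 5*}."""
    try:
        log = json.loads(raw_log)
    except json.JSONDecodeError:
        return False  # Assumption: non-JSON lines do not match a JSON pattern
    return isinstance(log, dict) and str(log.get("status", "")).startswith("5")

# Hypothetical sample lines
samples = [
    '{"status": "200", "request": "GET / HTTP/1.1"}',
    '{"status": "502", "request": "GET /api HTTP/1.1"}',
    'plain text line',
]
forwarded = [s for s in samples if passes_status_filter(s)]
print(len(forwarded), "of", len(samples), "lines would reach the Lambda function")
```

For an authoritative check, the same sample lines can be tested against the real pattern in the CloudWatch console.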

More information on filter patterns can be found in Amazon CloudWatch documentation. The CloudWatch filter patterns are quite good, although obviously not as powerful as Grok.

Note the DependsOn field, which ensures CloudWatch Logs is actually allowed to call the Lambda function before the subscription is created. This is just a cherry on the cake; it’s most probably unnecessary because, in a real-world scenario, Apache would probably not receive requests for at least a few seconds (e.g., the time needed to link the EC2 instance with a load balancer and to get the load balancer to recognise the EC2 instance as healthy).

Lambda Function to Process Error Logs

Now let’s have a look at the Lambda function that will process the Apache error logs.

  ProcessApacheErrorLogFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Role: !GetAtt BasicLambdaExecutionRole.Arn
      Runtime: python3.12  # Originally python3.7, which Lambda has since retired
      Timeout: 10
      Environment:
        Variables:
          SLACK_WEBHOOK_HOST: !Ref SlackWebhookHost
          SLACK_WEBHOOK_PATH: !Ref SlackWebhookPath
      Code:
        ZipFile: |
          import base64
          import gzip
          import json
          import os
          from http.client import HTTPSConnection

          def handler(event, context):
              tmp = event['awslogs']['data']
              # `awslogs.data` is base64-encoded gzip'ed JSON
              tmp = base64.b64decode(tmp)
              tmp = gzip.decompress(tmp)
              tmp = json.loads(tmp)
              for log_event in tmp['logEvents']:
                  raw_log = log_event['message']
                  # Each message is one JSON-formatted Apache error log line
                  log = json.loads(raw_log)
                  if log['level'] in ["error", "crit", "alert", "emerg"]:
                      # This is a serious error message
                      msg = log['msg']
                      if msg.startswith("PHP Notice") or msg.startswith("PHP Warning"):
                          print(f"Ignoring PHP notices and warnings: {raw_log}")
                      else:
                          print(f"Received a serious Apache error log: {raw_log}")
                          slack_host = os.getenv('SLACK_WEBHOOK_HOST')
                          slack_path = os.getenv('SLACK_WEBHOOK_PATH')
                          print(f"Sending Slack post to: host={slack_host}, path={slack_path}, content={raw_log}")
                          cnx = HTTPSConnection(slack_host, timeout=5)
                          cnx.request("POST", slack_path, json.dumps({'text': raw_log}),
                                      {'Content-Type': 'application/json'})
                          # It's important to read the response; if the cnx is closed too quickly, Slack might not post the msg
                          resp = cnx.getresponse()
                          resp_content = resp.read()
                          resp_code = resp.status
                          # Fail the invocation if Slack did not accept the message
                          assert resp_code == 200

This second Lambda function processes Apache error logs and will post a message to Slack only when a serious error is encountered. In this case, PHP notice and warning messages are not considered serious enough to trigger an alert.
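The triage rule itself is easy to isolate from the Slack plumbing, which makes it simple to unit test. The sketch below restates the same decision as a standalone function; the names `should_alert`, `SERIOUS_LEVELS`, and `IGNORED_PREFIXES` are hypothetical, and the sample entries are made up for illustration.

```python
# Log levels that the function above treats as serious
SERIOUS_LEVELS = {"error", "crit", "alert", "emerg"}
# Message prefixes that are never worth an alert, even at a serious level
IGNORED_PREFIXES = ("PHP Notice", "PHP Warning")

def should_alert(log):
    """Return True when a parsed Apache error log entry deserves a Slack alert."""
    if log.get("level") not in SERIOUS_LEVELS:
        return False
    # str.startswith accepts a tuple and matches any of its prefixes
    return not log.get("msg", "").startswith(IGNORED_PREFIXES)

# Hypothetical parsed log entries
print(should_alert({"level": "error", "msg": "PHP Fatal error: out of memory"}))
print(should_alert({"level": "error", "msg": "PHP Notice: undefined index"}))
print(should_alert({"level": "info", "msg": "server restarted"}))
```

Keeping the rule in one function also means that new exclusions only need to be added to a single tuple.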

Again, this function expects the Apache error log to be JSON-formatted. So here is the error log format string I have been using:

ErrorLogFormat "{\"vhost\": \"%v\", \"timestamp\": \"%{cu}t\", \"module\": \"%-m\", \"level\": \"%l\", \"pid\": \"%-P\", \"tid\": \"%-T\", \"oserror\": \"%-E\", \"client\": \"%-a\", \"msg\": \"%M\"}"

  ApacheErrorLogFunctionPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref ProcessApacheErrorLogFunction
      Action: lambda:InvokeFunction
      Principal: logs.amazonaws.com
      SourceArn: !Sub arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:*

This resource grants permissions to CloudWatch Logs to call your Lambda function.

  ApacheErrorLogSubscriptionFilter:
    Type: AWS::Logs::SubscriptionFilter
    DependsOn: ApacheErrorLogFunctionPermission
    Properties:
      LogGroupName: !Ref ApacheErrorLogGroup
      DestinationArn: !GetAtt ProcessApacheErrorLogFunction.Arn
      FilterPattern: '{$.msg != "PHP Warning*" && $.msg != "PHP Notice*"}'

Finally, we link CloudWatch Logs with the Lambda function using a subscription filter for the Apache error log group. Note the filter pattern, which ensures that logs with a message starting with either “PHP Warning” or “PHP Notice” do not trigger a call to the Lambda function.

Final Thoughts, Pricing, and Availability

One last word about costs: this solution is much cheaper than operating an ELK cluster. The logs stored in CloudWatch are priced at the same level as S3, and Lambda allows one million calls per month as part of its free tier. This would probably be enough for a website with moderate to heavy traffic (provided you use CloudWatch Logs filters), especially if you coded it well and it doesn’t have too many errors!

Also, please note that Lambda functions support up to 1,000 concurrent calls by default. This is an account-level quota per Region, and raising it requires a quota increase request to AWS. However, you can expect each call to the above functions to last about 30-40ms. This should be fast enough to handle rather heavy traffic. If your workload is so intense that you hit this limit, you probably need a more complex solution based on Kinesis, which I might cover in a future article.
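To put those two figures together, here is a back-of-the-envelope calculation of the throughput ceiling, using the 1,000 concurrent calls and the 30-40ms duration quoted above. It is only a sketch: it ignores cold starts and the fact that a single invocation can carry several log events, and the function name is hypothetical.

```python
def max_invocations_per_second(concurrency, duration_ms):
    """Upper bound on invocations per second for a given concurrency and call duration."""
    return concurrency * 1000.0 / duration_ms

for duration_ms in (30, 40):
    rate = max_invocations_per_second(1000, duration_ms)
    print(f"{duration_ms}ms per call -> about {round(rate)} invocations per second")
```

With these numbers the ceiling is roughly 25,000 to 33,000 invocations per second, and since only filtered log entries trigger a call, a site would need a very high error rate to get anywhere near it.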

Understanding the basics

  • What is the ELK stack?

    ELK is an acronym for Elasticsearch-Logstash-Kibana. Additional software items are often needed, such as Beats (a collection of tools to send logs and metrics to Logstash) and Elastalert (to generate alerts based on Elasticsearch time series data).

  • Is ELK stack free?

    The short answer is: yes. The various software items making up the ELK stack have various software licenses but usually have licenses that offer free usage without any support. It would be up to you, however, to set up and maintain the ELK cluster.

  • How does the ELK stack work?

    The ELK stack is highly configurable so there isn’t a single way to make it work. For example, here is the path of an Apache log entry: Filebeat reads the entry and sends it to Logstash, which parses it, and sends it to Elasticsearch, which saves and indexes it. Kibana can then retrieve the data and display it.

Fabrice Triboix

Verified Expert in Engineering

London, United Kingdom

Member since September 6, 2017
