How to avoid DoS and how to design resilient serverless applications are among the most common topics that come up when we discuss AWS Lambda security with organizations that are in the process of adopting serverless architectures.
In this blog post, I’ll cover the different methods for invoking AWS Lambda functions, why it’s important to be aware of things such as retry behavior and concurrency limits, how attackers can leverage poor application and software design to cause Denial of Service, and the recommended mitigation strategies.
The first thing that usually comes to mind when we think of the word “serverless” is scale. One of the biggest advantages of going serverless is that you don’t need to worry about scale or capacity planning anymore. The cloud provider does all the “heavy lifting” for you.
In reality, this is only partially true. When designed correctly, serverless applications are indeed much more resilient to spikes in traffic and can easily scale to handle high volumes. However, there are certain limitations you need to be aware of, and best practices you must follow, for that to happen as planned. Otherwise, serverless applications can be just as vulnerable to Denial of Service attacks as any other application out there.
Lambda functions can be invoked either synchronously or asynchronously. A synchronous invocation means that the service or API that invoked the Lambda function waits for the function to finish running. When a Lambda function is invoked asynchronously, on the other hand, the invoker does not wait for a result.
When you manually invoke a Lambda function (using either AWS CLI or AWS SDK) you can specify what invocation type you want to use:
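With the AWS CLI this is the `--invocation-type` flag (`RequestResponse` for synchronous, `Event` for asynchronous, plus `DryRun` for permission checks). The difference in caller behavior can be illustrated in plain Python, using a thread pool as a stand-in for the Lambda service (the handler below is an illustration, not a real Lambda function):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def lambda_handler(event):
    """Stand-in for a Lambda function: does a little work, returns a response."""
    time.sleep(0.1)
    return {"status": 200, "echo": event}

pool = ThreadPoolExecutor(max_workers=2)   # stands in for the Lambda service

# Synchronous (RequestResponse): the caller blocks until the function
# returns, and receives the function's response (or error) directly.
sync_result = pool.submit(lambda_handler, {"n": 1}).result()
print(sync_result["status"])  # 200

# Asynchronous (Event): the caller gets control back immediately;
# delivery, execution and retries happen behind the scenes.
async_future = pool.submit(lambda_handler, {"n": 2})
# ... the caller continues doing other work here ...
pool.shutdown(wait=True)      # the queued event still gets processed
```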
However, when you use an AWS service as a trigger, the invocation type is predetermined for each service. You have no control over the invocation type that these event sources use when they invoke your Lambda function. Below is a summary table describing the different services, their invocation types, and their behavior upon throttling:
“If the function is invoked synchronously and is throttled, Lambda returns a 429 error and the invoking service is responsible for retries. The ThrottledReason error code explains whether you ran into a function level throttle (if specified) or an account level throttle (see note below). Each service may have its own retry policy.” (AWS Documentation)
Let’s have a look at API Gateway events as an example for synchronous invocations.
An attacker who can control the number of requests sent to API Gateway will be able to cause throttling and, as a result, Denial of Service. Applications that use synchronous invocations are easier for an attacker to target, since the feedback is immediate and the attacker can quickly tell whether the attack is succeeding.
To demonstrate this, let’s run a small test. I created a very simple Lambda function that waits 5 seconds before returning a response. Then, I used a simple Bash script to execute 3 batches of concurrent invocations (50, 100 and 150):
I set a limit of 100 concurrent executions using the reserved concurrency feature (when not set, the function can consume the entire account limit). As you can see in the metrics below, the third batch of 150 concurrent executions was throttled.
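A rough model of what happened: a counting semaphore plays the role of the concurrency limit, and 150 simultaneous "invocations" compete for 100 slots. This is a simplified sketch of the throttling behavior, not how Lambda actually implements it:

```python
import threading

CONCURRENCY_LIMIT = 100
BATCH_SIZE = 150

slots = threading.Semaphore(CONCURRENCY_LIMIT)   # stands in for the concurrency limit
all_started = threading.Barrier(BATCH_SIZE)
all_attempted = threading.Barrier(BATCH_SIZE)
results, lock = [], threading.Lock()

def invoke():
    all_started.wait()                   # fire all 150 invocations at once
    got_slot = slots.acquire(blocking=False)
    all_attempted.wait()                 # hold the slot until every attempt was made
    with lock:
        results.append(200 if got_slot else 429)  # 429 == throttled
    if got_slot:
        slots.release()

threads = [threading.Thread(target=invoke) for _ in range(BATCH_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results.count(200), results.count(429))    # 100 50
```

With synchronous invocations those 50 throttled requests surface directly to the caller as 429 errors, which is exactly the immediate feedback an attacker is looking for.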
The same idea applies to other event sources in this category. An attacker can leverage PreAuthentication Cognito triggers, or mount an attack against a chat-bot application by causing throttling through the Lex intent integration.
“If your Lambda function is invoked asynchronously and is throttled, AWS Lambda automatically retries the throttled event for up to six hours, with delays between retries. For example, CloudWatch Logs retries the failed batch up to five times with delays between retries. Remember, asynchronous events are queued before they are used to invoke the Lambda function. You can configure a Dead Letter Queue (DLQ) to investigate why your function was throttled.”
Let’s take AWS S3 as an example. An application where the user controls the frequency at which objects are uploaded to the bucket, and as a result the number of concurrent executions of the Lambda function, has the potential to be throttled.
I repeated the previous test, this time with a Lambda function triggered by S3. Same scenario: a function that sleeps for 5 seconds, with 3 batches of concurrent events (50, 100 and 150):
Have a look at the results. When I tried to execute 150 concurrent S3 events while the function’s limit was 100, all of the events were processed successfully!
That’s the power of the AWS Lambda “retry” mechanism. We can also see that there were 71 throttles, meaning that for some events the Lambda service issued a retry more than once.
I then ran another test, similar to the third batch of 150 events, this time with a sleep time of 5 minutes instead of 5 seconds. Let’s see what happened.
The results show that at first, only 100 events were successfully processed. After 5 minutes, another 46 events went through, and after another 5 minutes the last 4 events were processed successfully as well. This demonstrates nicely how events are retried when the concurrency limit is reached.
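Those retry waves can be modeled with a simple loop. This simplified model ignores the delays and jitter Lambda actually applies between retries, which is why the real run split into 100/46/4 rather than a clean 100/50:

```python
def retry_waves(events, concurrency_limit):
    """Return how many events complete in each retry wave.

    Simplified model: every wave processes up to `concurrency_limit`
    events, and the throttled remainder is retried in the next wave.
    """
    waves, pending = [], events
    while pending:
        completed = min(pending, concurrency_limit)
        waves.append(completed)
        pending -= completed
    return waves

print(retry_waves(150, 100))  # [100, 50]
```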
AWS states that Lambda “automatically retries the throttled event for up to six hours…”, meaning that a sustained Denial of Service attack can eventually cause loss of data.
Another possible danger with asynchronous invocations, besides the possibility of being throttled, is unexpected application behavior due to the retry mechanism. If a Lambda function is invoked more than once while we designed and planned for only one execution, the application flow might break.
Poll-Based & Stream-Based Invocations
"AWS Lambda polls your stream and invokes your Lambda function. When your Lambda function is throttled, Lambda attempts to process the throttled batch of records until the time the data expires. This time period can be up to seven days for Amazon Kinesis. The throttled request is treated as blocking per shard, and Lambda doesn't read any new records from the shard until the throttled batch of records either expires or succeeds. If there is more than one shard in the stream, Lambda continues invoking on the non-throttled shards until one gets through" (AWS Documentation)
The potential victims here are applications with DynamoDB Streams or Kinesis Streams triggers. An attacker can send a malformed batch of events to the stream (meaning events that will trigger an error during the function’s execution), causing the retry mechanism to kick in. If not handled properly, this will cause a Denial of Service, since record processing is blocking per shard.
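The blocking behavior can be sketched like this: a single record that keeps failing stops everything behind it in the same shard. This is a simplified model, and `is_poison` stands in for any record that makes the function raise an error:

```python
def process_shard(records, is_poison):
    """Process a shard's records in order; stop at the first one that keeps failing.

    Stream processing is blocking per shard: a failing record is retried
    until it expires, and no records behind it are read in the meantime.
    """
    processed = []
    for i, record in enumerate(records):
        if is_poison(record):
            return processed, records[i:]   # everything from here on is stuck
        processed.append(record)
    return processed, []

done, stuck = process_shard(["r1", "r2", "bad", "r3"], lambda r: r == "bad")
print(done, stuck)  # ['r1', 'r2'] ['bad', 'r3']
```

Note that other shards keep flowing; the attacker only blocks the shards their malformed records land on.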
Poll-Based & Non-Stream-Based Invocations
"AWS Lambda polls your queue and invokes your Lambda function. When your Lambda function is throttled, Lambda attempts to process the throttled batch of records until it is successfully invoked (in which case the message is automatically deleted from the queue) or until the MessageRetentionPeriod set for the queue expires." (AWS Documentation)
According to AWS, the messages in an SQS queue are processed as follows: AWS Lambda automatically scales up polling activity until the number of concurrent function executions reaches 1000, the account concurrency limit, or the (optional) function concurrency limit, whichever is lower. Amazon Simple Queue Service supports an initial burst of 5 concurrent function invocations and increases concurrency by 60 concurrent invocations per minute.
This blog demonstrates really well how Lambda scales to process all the events in the queue.
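The ramp-up can be modeled directly from the figures quoted above, a burst of 5 plus 60 additional concurrent invocations per minute, capped by the lowest applicable limit:

```python
import math

def sqs_concurrency(minute, account_limit=1000, function_limit=None):
    """Approximate Lambda concurrency for an SQS trigger after `minute` minutes.

    Per the figures quoted above: initial burst of 5, +60 per minute,
    capped by the lowest of 1000, the account limit and the function limit.
    """
    cap = min(1000, account_limit, function_limit or account_limit)
    return min(5 + 60 * minute, cap)

def minutes_to_reach(target):
    """How many whole minutes until polling concurrency reaches `target`."""
    return math.ceil((target - 5) / 60)

print(sqs_concurrency(0), sqs_concurrency(10), minutes_to_reach(1000))  # 5 605 17
```

So even with an empty runway, it takes roughly 17 minutes of sustained load before the poller reaches the 1000-invocation ceiling.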
Assume we have a Lambda function that takes 5 minutes to process an event in the queue, and an account limit of 1000 concurrent executions. Since the default MessageRetentionPeriod for a message in the queue is 4 days, an attacker attempting to cause data loss would have to issue the following number of requests at once (assuming a DLQ is not configured):
Number of requests = Concurrency limit × (60 / function’s processing time in minutes) × 24 × 4
In our example that comes out to 1,152,000 requests. Definitely not an easy task.
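Plugging the numbers into the formula:

```python
concurrency_limit = 1000      # account concurrency limit
processing_minutes = 5        # how long the function holds a concurrency slot
retention_days = 4            # default SQS MessageRetentionPeriod

# events drained per hour = limit * (60 / processing time); over 4 days:
requests_needed = concurrency_limit * (60 // processing_minutes) * 24 * retention_days
print(f"{requests_needed:,}")  # 1,152,000
```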
Mitigations and Best Practices
Service Level Mitigations
- API Gateway provides the ability to set quota and throttling criteria. More information on request throttling can be found in the following AWS DOCUMENTATION.
- For relevant APIs, consider enabling API response caching, which will reduce the number of calls made to your API endpoint and also improve the latency of requests to your API. More information can be found HERE.
- For S3 specifically, consider using SQS as a broker in front of your Lambda function. By defining a queue as the destination instead of a Lambda function, you gain the ability to process multiple events at once (you can define the batch size):
- Make sure your code doesn’t “hang” when faced with unexpected input. You should carefully test all edge cases and think of possible inputs that might cause function timeouts (e.g. ReDoS attacks or long payloads). An attacker might be able to exploit such an application-layer weakness. More information on application-layer DoS / ReDoS can be found in the following SECURITY ADVISORY by PureSec.
- PureSec Serverless Security Platform provides behavioral runtime protection against a wide range of attacks, and can reduce the risk of application layer DoS and unauthorized malicious behavior. The platform also provides unparalleled forensic-level visibility. You can find out more about the platform in the following link
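The benefit of the SQS-as-broker suggestion above is batching: with a batch size of 10 (assumed here for illustration), the same burst of messages results in a tenth of the invocations. A minimal sketch of the grouping:

```python
def to_batches(messages, batch_size=10):
    """Group queued messages so one invocation handles up to `batch_size` of them."""
    return [messages[i:i + batch_size] for i in range(0, len(messages), batch_size)]

burst = [f"msg-{i}" for i in range(150)]
invocations = to_batches(burst)
print(len(invocations))  # 15 invocations instead of 150
```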
Architectural Design Considerations
- Design for retry - always build your Lambda functions in a way that takes into account the possibility of processing the same event more than once.
- Reduce the blast radius by defining a reserved concurrency limit for specific Lambda functions, so that an attacker won’t be able to leverage them to consume the entire account capacity.
- For Lambda functions with asynchronous event triggers (in SQS integrations - for the queue itself), set up a Dead Letter Queue. After retrying the event twice, Lambda will forward it to the DLQ destination (SQS queue or SNS topic) for further investigation.
- Monitor your account’s Concurrent Executions metric. More information on how to investigate spikes in AWS Lambda concurrency can be found here.
- Monitor your Lambda throttling metrics.
- Monitor your Lambda errors metric, specifically for timeouts.
- It is highly recommended to set up monitoring & alerts on your AWS charges & billing. More information can be found in the following link.
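The “design for retry” item above boils down to making handlers idempotent. A minimal sketch, using an in-memory set where a real function would use something durable such as a DynamoDB conditional write; the event shape and `event_id` key are assumptions for illustration:

```python
processed_ids = set()   # in production: a durable store, e.g. a DynamoDB conditional put

def handler(event, context=None):
    """Process an event exactly once, even if Lambda delivers it twice."""
    event_id = event["event_id"]
    if event_id in processed_ids:
        return "skipped-duplicate"      # a retry of an event we already handled
    processed_ids.add(event_id)
    # ... side effects (writes, notifications, payments) go here, exactly once ...
    return "processed"

print(handler({"event_id": "abc-123"}), handler({"event_id": "abc-123"}))
# processed skipped-duplicate
```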
The following BLOG POST from Yan Cui provides rich information, tips & tricks for logging and monitoring AWS Lambda functions.