Monitoring – the bigger picture
Before taking your application to production, it is critical to make sure it is fully observable, both at the component level and end to end. This is achieved through logging, gathering metrics at various levels of granularity, and tracing to understand system performance and the end-user experience. A good overview of monitoring distributed systems can be found in the monitoring chapter of the book “Site Reliability Engineering”, in which Google describes how it runs its production systems. The book defines the four golden signals of monitoring: latency, traffic, errors, and saturation. Many tools can help you capture them, including third-party monitoring software, AWS CloudWatch, and AWS X-Ray.
Out of the box, CloudWatch publishes various useful metrics that can serve as a starting point for your monitoring system. It also lets you publish your own application metrics, create alarms, and so on.
In this post we will focus only on some of the out-of-the-box metrics, and only for some serverless components. It is not a definitive guide to what data you should gather, but rather a starting point.
Metrics
1. Lambda
CloudWatch Lambda metrics will make little sense unless you understand what errors can occur, how many times an invocation will be retried, and when an event is sent to the dead letter queue (DLQ).
When you invoke a function, two types of error can occur:
- Invocation – occurs when the invocation request is rejected before your function receives it.
- Function – occurs when your function’s code or runtime returns an error (e.g. throws an exception or times out).
Depending on the type of error, the type of invocation, and the client or service that invokes the function, the retry behaviour and the strategy for managing errors vary.
Default retry attempts:
- Synchronous invocation – no retries.
- Asynchronous invocation – 2 retries (so 3 invocations overall), and then the event goes to the DLQ, if one is configured.
- Event source mappings – until the event expires.
For streams, Lambda retries the entire batch of items, and repeated errors block processing. For queues, you determine the length of time between retries and the destination for failed events by configuring the visibility timeout and redrive policy on the source queue (see below).
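The default asynchronous behaviour described above can be sketched as a small simulation. This is only an illustration in plain Python, not how Lambda is implemented: `invoke_async`, `always_fails`, and the in-memory `dlq` list are hypothetical names introduced here, and real retries also involve delays between attempts that this sketch ignores.

```python
from typing import Callable, List

MAX_ASYNC_RETRIES = 2  # default quoted in the text: 2 retries, 3 invocations overall


def invoke_async(handler: Callable[[dict], None], event: dict,
                 dead_letter_queue: List[dict]) -> int:
    """Simulate an asynchronous invocation and return the number of attempts made."""
    attempts = 0
    for _ in range(1 + MAX_ASYNC_RETRIES):
        attempts += 1
        try:
            handler(event)
            return attempts  # success: no further retries
        except Exception:
            continue  # function error: retry while attempts remain
    dead_letter_queue.append(event)  # every attempt failed, so the event goes to the DLQ
    return attempts


def always_fails(event: dict) -> None:
    """Hypothetical handler that raises on every invocation."""
    raise RuntimeError("simulated function error")


dlq: List[dict] = []  # stand-in for a real dead letter queue
attempts = invoke_async(always_fails, {"id": 1}, dlq)
print(attempts, dlq)  # 3 attempts, and the event ends up in the DLQ list
```

A handler that fails once and then succeeds would return after two attempts and leave the DLQ empty, which is why the Errors metric alone does not tell you whether events were lost.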
Dead letter queue
The dead letter queue lets you redirect failed events to an SQS queue, so that you can specifically catch the messages that are causing retries. Events are sent there after the two retries fail. The DLQ is not used for synchronous invocations.
Metrics
Metric | Description | Comments |
Errors | Measures the number of invocations that failed due to errors in the function (response code 4XX). | 1. A failed invocation may trigger a retry attempt that succeeds, so this metric can be greater than 0 even though all events were eventually processed. 2. This does not include invocations that fail due to invocation rates exceeding concurrency limits (error code 429) or service errors (error code 500). |
Throttles | Measures the number of Lambda function invocation attempts that were throttled due to invocation rates exceeding the customer’s concurrency limits (error code 429). | Whether to monitor this depends on your application architecture. |
NumberOfMessagesSent on DLQ (SQS metric) | The number of messages added to a queue. | If you invoke Lambda asynchronously and a DLQ is configured, this metric tells you how many events were not processed. It is an SQS metric reported for the DLQ itself, not a Lambda metric. |
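As a starting point, an alarm on one of these metrics could be described as below. This is a hedged sketch: `lambda_alarm_params` and the function name `my-function` are hypothetical, the one-minute period and zero threshold are arbitrary choices, and the dictionary keys are the parameter names of CloudWatch's PutMetricAlarm API. The block only builds the parameters and does not call AWS.

```python
def lambda_alarm_params(function_name: str, metric_name: str,
                        threshold: float = 0.0) -> dict:
    """Build keyword arguments for CloudWatch's PutMetricAlarm API."""
    return {
        "AlarmName": f"{function_name}-{metric_name}",
        "Namespace": "AWS/Lambda",
        "MetricName": metric_name,  # e.g. "Errors" or "Throttles"
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,               # seconds; an arbitrary choice for this sketch
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no invocations means no alarm
    }


params = lambda_alarm_params("my-function", "Errors")  # hypothetical function name
print(params["AlarmName"])

# With boto3 installed and credentials configured, the dict could be passed as:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Because retries can succeed, an alarm on Errors with a threshold of 0 may be noisy; pairing it with an alarm on the DLQ is one way to separate transient failures from lost events.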
2. SQS
Redrive policy
The redrive policy specifies the source queue, the dead-letter queue, and the conditions under which Amazon SQS moves messages from the former to the latter if the consumer of the source queue fails to process a message a specified number of times. For example, if the source queue has a redrive policy with maxReceiveCount set to 5, and the consumer of the source queue receives a message 6 times without ever deleting it, Amazon SQS moves the message to the dead-letter queue.
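The redrive policy is stored as a JSON string in the queue's RedrivePolicy attribute. The sketch below builds that attribute and restates the rule from the example above. The helper names and the queue ARN are hypothetical, and nothing here talks to AWS.

```python
import json


def redrive_policy_attribute(dlq_arn: str, max_receive_count: int) -> dict:
    """Build the SQS queue attribute that attaches a dead-letter queue."""
    policy = {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    return {"RedrivePolicy": json.dumps(policy)}  # SQS expects a JSON string


def is_moved_to_dlq(receive_count: int, max_receive_count: int) -> bool:
    """A message is moved once its receive count exceeds maxReceiveCount."""
    return receive_count > max_receive_count


# Hypothetical ARN, for illustration only.
attrs = redrive_policy_attribute("arn:aws:sqs:us-east-1:123456789012:my-dlq", 5)
print(attrs["RedrivePolicy"])
print(is_moved_to_dlq(5, 5), is_moved_to_dlq(6, 5))  # False True

# With boto3, the attribute could be applied to the source queue as:
#   boto3.client("sqs").set_queue_attributes(QueueUrl=source_queue_url, Attributes=attrs)
```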
Metrics
Metric | Description |
ApproximateAgeOfOldestMessage | The approximate age of the oldest non-deleted message in the queue. |
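NumberOfMessagesSent on DLQ (SQS metric) | The number of messages added to a queue. Note that this metric counts messages sent to the DLQ explicitly (for example, by Lambda for failed asynchronous events); messages that SQS itself moves there through a redrive policy are not counted, so for those watch ApproximateNumberOfMessagesVisible on the DLQ instead. |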
3. SNS
Retry policy
The delivery policy defines how Amazon SNS retries the delivery of messages when server-side errors occur (for example, when the system that hosts the subscribed endpoint becomes unavailable). When the delivery policy is exhausted, Amazon SNS stops retrying the delivery and discards the message, unless a dead-letter queue is attached to the subscription. By default, for AWS-managed endpoints such as Lambda and SQS, it will try to deliver the message 100,015 times over 23 days, so be careful and configure it according to your needs. More details can be found in the Amazon SNS documentation on message delivery retries.
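The 100,015 figure can be checked with simple arithmetic. The phase breakdown below is not taken from this post; it is an assumption based on the default policy for AWS-managed endpoints as described in the SNS documentation, so treat the exact numbers as illustrative.

```python
# Assumed default retry phases: (phase, retries, seconds between retries).
# None marks the exponential backoff phase, whose delay varies per retry.
PHASES = [
    ("immediate", 3, 0),
    ("pre-backoff", 2, 1),
    ("backoff", 10, None),          # exponential, between 1 and 20 seconds
    ("post-backoff", 100_000, 20),
]
BACKOFF_MIN, BACKOFF_MAX = 1, 20    # bounds used for the variable phase
SECONDS_PER_DAY = 86_400

total_attempts = sum(retries for _, retries, _ in PHASES)


def total_seconds(backoff_delay: int) -> int:
    """Total waiting time if every backoff retry used the given delay."""
    return sum(retries * (backoff_delay if delay is None else delay)
               for _, retries, delay in PHASES)


min_days = total_seconds(BACKOFF_MIN) / SECONDS_PER_DAY
max_days = total_seconds(BACKOFF_MAX) / SECONDS_PER_DAY
print(total_attempts)  # 100015
print(f"between {min_days:.2f} and {max_days:.2f} days")
```

Almost all of the 23 days comes from the post-backoff phase, so a failing endpoint keeps receiving a delivery attempt roughly every 20 seconds for weeks.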
Metrics
Metric | Description | Comments |
NumberOfNotificationsFailed | The number of messages that Amazon SNS failed to deliver. | For HTTP or HTTPS endpoints, the metric includes every failed delivery attempt, including retries that follow the initial attempt. |
NumberOfNotificationsRedrivenToDlq | The number of messages that have been moved to a dead-letter queue. | If a DLQ is configured, this metric tells you how many messages were never delivered. |
4. API Gateway
Metric | Description |
Latency | The time between when API Gateway receives a request from a client and when it returns a response to the client. The latency includes the integration latency and other API Gateway overhead. |
5XXError | The number of server-side errors captured in a given period. |
5. DynamoDB
Metric | Description |
SystemErrors | The requests to DynamoDB or Amazon DynamoDB Streams that generate an HTTP 500 status code during the specified time period. An HTTP 500 usually indicates an internal service error. |
UserErrors | Requests to DynamoDB or Amazon DynamoDB Streams that generate an HTTP 400 status code during the specified time period. An HTTP 400 usually indicates a client-side error, such as an invalid combination of parameters, an attempt to update a nonexistent table, or an incorrect request signature. |
ReadThrottleEvents, WriteThrottleEvents | Requests to DynamoDB that exceed the provisioned read/write capacity units for a table or a global secondary index. |
6. S3
Metric | Description |
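5xxErrors | The number of requests made to an Amazon S3 bucket that returned an HTTP 5xx server error status code. Each request is recorded with a value of either 0 or 1, so the Sum statistic gives the number of errors and the Average statistic gives the error rate. |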
TotalRequestLatency | The elapsed per-request time from the first byte received to the last byte sent to an Amazon S3 bucket. |
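The "0 or 1" wording for 5xxErrors is easier to read with numbers. In this minimal sketch the sample values are made up: each entry stands for one request, with 1 meaning the request returned a 5xx status.

```python
# Made-up data points: one value per request, 1 = the request returned a 5xx error.
samples = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

error_count = sum(samples)                # what the Sum statistic would report
error_rate = error_count / len(samples)   # what the Average statistic would report

print(error_count)  # 2
print(error_rate)   # 0.2
```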
Materials
Serverless Application Lens to the AWS Well-Architected Framework
Lambda error handling