Alerts

These are the default, out-of-the-box alerts that we configure for each supported platform. We can (and do) configure additional alerts for individual customers depending on their needs, but those customer-specific alerts are not documented here.

AWS

DynamoDB

Alert Name
Description

AWS Dynamo DB High Read Capacity Utilization

Detects when read capacity consumption is approaching or exceeding provisioned limits, which may lead to throttled requests and degraded application performance.

AWS Dynamo DB High Write Capacity Utilization

Detects when write capacity consumption is approaching or exceeding provisioned limits, potentially causing write throttling and data ingestion delays.

AWS Dynamo DB High Number Of Throttled Requests

Detects excessive throttled requests indicating insufficient provisioned capacity or inefficient access patterns that impact application availability.

AWS Dynamo DB Conditional Check Failed Requests

Detects high rates of conditional write failures, which may indicate concurrency conflicts, optimistic locking issues, or application logic problems.

AWS Dynamo DB High System Errors

Detects internal DynamoDB service errors that could indicate platform issues, availability problems, or configuration issues requiring investigation.
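As a rough illustration of the two capacity utilization alerts above, the sketch below turns a consumed-capacity sum into a utilization ratio. It assumes consumed capacity is reported as a sum over the sampling period while provisioned capacity is expressed per second. The function names and the 0.8 threshold are hypothetical and are not the values we actually configure.

```python
def capacity_utilization(consumed_sum: float, period_seconds: int,
                         provisioned_per_second: float) -> float:
    """Return consumed capacity as a fraction of provisioned capacity."""
    if period_seconds <= 0 or provisioned_per_second <= 0:
        raise ValueError("period and provisioned capacity must be positive")
    # Consumed units are summed over the period; convert to a per-second average.
    consumed_per_second = consumed_sum / period_seconds
    return consumed_per_second / provisioned_per_second


def is_high_utilization(utilization: float, threshold: float = 0.8) -> bool:
    """Hypothetical threshold check; 0.8 is an illustrative value only."""
    return utilization >= threshold
```

For example, 27,000 consumed read units over a 300-second period against 100 provisioned units per second averages 90 units per second, which is 90% utilization.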

EBS

Alert Name
Description

AWS EBS Low Idle Time

Detects when EBS volumes have consistently low idle time, indicating the volume is constantly busy and may be experiencing performance bottlenecks or saturation.

AWS EBS High Volume Utilization

Detects when storage capacity is approaching its limit, which could lead to out-of-space errors, failed writes, and application disruptions.

AWS EBS Low Burst Balance

Detects when burst balance credits are running low or depleted on burstable volumes such as gp2 (gp3 volumes have a fixed performance baseline and no burst balance), resulting in reduced IOPS performance and potential application slowdowns.

AWS EBS High Volume Queue Length

Detects when too many I/O operations are queued, indicating the volume cannot keep up with demand and may cause increased latency and degraded performance.
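To make "consistently low idle time" more concrete, here is a minimal sketch that converts idle seconds per sampling period into a percentage and fires only when every sample in a window is below a threshold. The helper names and the 10% threshold are assumptions made for illustration, not our configured values.

```python
from typing import Sequence


def idle_percentage(idle_seconds: float, period_seconds: float) -> float:
    """Percentage of the sampling period during which the volume was idle."""
    if period_seconds <= 0:
        raise ValueError("period must be positive")
    # Clamp so a slightly over-reported idle time never exceeds 100%.
    return 100.0 * min(idle_seconds, period_seconds) / period_seconds


def is_consistently_busy(idle_pcts: Sequence[float],
                         threshold_pct: float = 10.0) -> bool:
    """True only if every sample in the window is below the threshold."""
    return len(idle_pcts) > 0 and all(p < threshold_pct for p in idle_pcts)
```

Requiring every sample to breach the threshold is one simple way to avoid alerting on a single busy interval.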

EC2

Alert Name
Description

AWS EC2 High Cpu Utilization

Detects when CPU usage is consistently high, which may indicate resource constraints, inefficient code, or the need to scale up or out to maintain performance.

AWS EC2 Status Check Failed

Detects when EC2 instance or system status checks fail, indicating underlying hardware issues, network problems, or instance health degradation requiring intervention.

Lambda

Alert Name
Description

AWS Lambda Throttling Events

Detects when Lambda functions are being throttled due to exceeding concurrent execution limits, indicating the need to request limit increases or optimize function concurrency.

AWS Lambda Invocation Failures

Detects when Lambda function invocations fail due to code errors, timeout issues, resource constraints, or permission problems, impacting application reliability and functionality.

Managed Kafka

Alert Name
Description

AWS Kafka High System Cpu

Detects when broker CPU utilization is consistently high, which may impact message processing throughput, increase latency, and indicate the need for cluster scaling or optimization.

AWS Kafka High Root Disk Used

Detects when broker disk space is running low, which could lead to message loss, inability to accept new data, and potential cluster instability if not addressed.

AWS Kafka Active Controller Count

Detects when there is not exactly one active controller in the cluster, indicating split-brain scenarios or controller election failures that can disrupt cluster operations.

AWS Kafka Partition Under Replicated

Detects when partitions have fewer in-sync replicas than configured, indicating replication lag or broker failures that reduce fault tolerance and risk data loss.

AWS Kafka Offline Partitions

Detects when partitions have no available leader, making them unavailable for reads and writes, resulting in service disruption and potential data unavailability.
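The three cluster-state alerts above reduce to simple conditions on broker metrics. The following sketch expresses them as checks on a snapshot. The `ClusterSnapshot` type and its field names are hypothetical, and a real alert would also require the condition to persist for some time before firing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterSnapshot:
    active_controller_count: int      # summed across all brokers
    under_replicated_partitions: int  # summed across all brokers
    offline_partitions: int           # as reported by the controller


def firing_alerts(snapshot: ClusterSnapshot) -> List[str]:
    """Return the names of the alerts whose condition holds for this snapshot."""
    alerts = []
    # A healthy cluster has exactly one active controller, never zero or several.
    if snapshot.active_controller_count != 1:
        alerts.append("AWS Kafka Active Controller Count")
    if snapshot.under_replicated_partitions > 0:
        alerts.append("AWS Kafka Partition Under Replicated")
    if snapshot.offline_partitions > 0:
        alerts.append("AWS Kafka Offline Partitions")
    return alerts
```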

RDS

Alert Name
Description

AWS RDS High Cpu Spikes

Detects sudden, sharp increases in CPU utilization that may indicate inefficient queries, missing indexes, or unexpected workload patterns requiring investigation and optimization.

AWS RDS High Cpu Load

Detects sustained high CPU usage indicating the database is under heavy load, which may require query optimization, scaling up the instance, or implementing read replicas.

AWS RDS Running Out Of Cpu Credits

Detects when burstable instance types (T-class) are depleting CPU credits, which will result in baseline performance limitations and potential application slowdowns.

AWS RDS Disk IOPS Bottleneck

Detects when IOPS utilization is at or near limits, causing disk I/O queuing and increased query latency that impacts database performance and application responsiveness.

AWS RDS Read Replica Lag

Detects when read replicas are falling behind the primary instance, indicating replication delays that may serve stale data and impact read consistency for applications.
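The difference between the CPU spike and CPU load alerts above can be sketched as two checks over the same series of CPU percentage samples. This is only one possible interpretation: the function names, the 30-point jump, and the 80% threshold are illustrative assumptions rather than the rules we actually deploy.

```python
from statistics import mean
from typing import Sequence


def is_cpu_spike(samples: Sequence[float], jump: float = 30.0) -> bool:
    """True if the latest sample exceeds the mean of earlier samples by `jump`."""
    if len(samples) < 2:
        return False
    baseline = mean(samples[:-1])
    return samples[-1] - baseline >= jump


def is_sustained_load(samples: Sequence[float], threshold: float = 80.0) -> bool:
    """True if every sample in the window is at or above the threshold."""
    return len(samples) > 0 and all(s >= threshold for s in samples)
```

With samples `[20, 22, 21, 75]` the spike check fires but the sustained check does not. With `[85, 90, 88]` the reverse is true.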

Azure

Azure Databases

Alert Name
Description

Azure Database High Dtu Consumption

Detects when Database Transaction Unit (DTU) consumption is high, indicating the database is approaching resource limits and may need scaling or query optimization to maintain performance.

Azure Database High Storage Usage

Detects when database storage capacity is running low, which could lead to write failures, inability to grow transaction logs, and potential database unavailability.

Azure Database High Deadlock Count

Detects frequent database deadlocks indicating concurrency issues, poor transaction design, or missing indexes that impact application performance and user experience.

Azure Database High User Cpu Usage

Detects high CPU utilization from user queries, suggesting inefficient queries, missing indexes, or heavy workload that may require optimization or scaling.

Azure Database High System Failed Connections

Detects high rates of failed connections due to system-level issues like resource exhaustion, firewall rules, or service problems requiring immediate investigation.

Azure Database High User Failed Connections

Detects frequent failed connection attempts from users, indicating authentication issues, network problems, or connection pool exhaustion affecting application connectivity.

Azure Database High Worker Usage

Detects when worker thread utilization is high, indicating the database is handling many concurrent requests and may be approaching parallelism limits.

Azure Database High Data IO Usage

Detects high data I/O percentage indicating storage throughput bottlenecks that can slow query performance and impact overall database responsiveness.

Azure Database Low Temp DB Log Space

Detects when tempdb log space is running low, which can cause queries to fail, block operations, and disrupt database functionality requiring immediate attention.

Azure Firewall

Alert Name
Description

Azure Firewall Latency

Detects when firewall processing latency exceeds acceptable thresholds, indicating performance degradation that can slow network traffic and impact application response times.

Azure Firewall Health Percentage

Detects when firewall health percentage drops below acceptable thresholds, indicating degraded firewall functionality, capacity issues, or service problems affecting network security and connectivity.

Azure Virtual Machines

Alert Name
Description

Azure VM High Cpu Utilization

Detects when virtual machine CPU usage is consistently high, indicating resource constraints, inefficient workloads, or the need to scale up or out to maintain performance.

Azure VM Unavailable

Detects when a virtual machine becomes unavailable or unresponsive, indicating system failures, crashes, network issues, or underlying infrastructure problems requiring immediate attention.

Azure Virtual Network

Alert Name
Description

Azure VNet Subnet IP Exhaustion

Detects when available IP addresses in a subnet are running low, which could prevent new resources from being deployed and impact network scalability and growth.

Azure VNet Peering Connection Failures

Detects when VNet peering connections are failing, indicating network connectivity issues between virtual networks that can disrupt multi-region or multi-VNet architectures and application communication.
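For the subnet IP exhaustion alert, it helps to remember that Azure reserves five addresses in every subnet, so the usable pool is smaller than the CIDR size suggests. The sketch below estimates the remaining fraction of usable addresses. The function names are hypothetical, and the allocated count is assumed to come from whatever inventory source is available.

```python
import ipaddress

# Azure reserves five addresses in each subnet.
AZURE_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr: str) -> int:
    """Number of addresses in the subnet that can be assigned to resources."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AZURE_RESERVED_PER_SUBNET, 0)


def remaining_fraction(cidr: str, allocated: int) -> float:
    """Fraction of usable addresses still free (0.0 means exhausted)."""
    usable = usable_addresses(cidr)
    if usable == 0:
        return 0.0
    return max(usable - allocated, 0) / usable
```

A `/24` subnet therefore offers 251 assignable addresses, not 256.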

Kubernetes

External Secrets

Alert Name
Description

External Secret Status

Detects when external secrets are unhealthy and unable to sync, indicating the secret may be missing or values cannot be updated, potentially causing application configuration issues.

Secret Store Status

Detects when the Secret Store is unhealthy, typically indicating permission issues preventing access to the external secrets management system.

External Secrets Controller High Error Rate

Detects high error rates in the reconciliation of secrets from Secrets Manager, indicating potential performance degradation or permission issues affecting secret synchronization.

External Secrets Controller 90th Percentile Reconcile Time

Detects when reconcile time exceeds allowed thresholds, suggesting the controller may be under-provisioned or encountering errors that slow down secret synchronization.

External Secrets Workqueue Depth Too High

Detects when work queue depth is too high, indicating the need to increase concurrent controller reconciles or run multiple controllers to handle the secret synchronization workload.

Falco

Alert Name
Description

Falco Priority Alert

Detects when Falco raises security alerts with priority levels of Error, Critical, Alert, or Emergency, indicating suspicious runtime behavior or security policy violations.
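The priority filter described above can be sketched as a membership test. This example assumes Falco events arrive as dictionaries with a `priority` field, as in Falco's JSON output. The function name is hypothetical.

```python
# Priorities that trigger the alert, per the description above.
ALERTING_PRIORITIES = {"emergency", "alert", "critical", "error"}


def should_alert(event: dict) -> bool:
    """True if the event's priority is Error or more severe."""
    priority = str(event.get("priority", "")).lower()
    return priority in ALERTING_PRIORITIES
```

Lower priorities such as Warning, Notice, Informational, and Debug are ignored by this check.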

Kubernetes Apps

Alert Name
Description

Crash Loop Backoff

Detects when pods are repeatedly crashing and restarting, indicating application errors, misconfigurations, or resource issues that prevent stable operation.

Kube Pod Not Ready

Detects when pods remain in a not-ready state, indicating failed health checks, startup issues, or dependencies not being met that prevent the pod from serving traffic.

Kube Deployment Generation Mismatch

Detects when a Deployment's observed generation doesn't match the desired generation, indicating the deployment controller is unable to process updates properly.

Kube Deployment Replicas Mismatch

Detects when the actual number of running replicas doesn't match the desired count, indicating scheduling issues, resource constraints, or pod failures.

Kube Deployment Rollout Stuck

Detects when a deployment rollout is not progressing, indicating issues with the new pod version preventing successful updates and deployments.

Kube Stateful Set Replicas Mismatch

Detects when StatefulSet replicas don't match the desired count, indicating issues with persistent storage, pod scheduling, or initialization failures.

Kube Stateful Set Generation Mismatch

Detects when a StatefulSet's observed generation doesn't match the desired generation, indicating the controller cannot successfully apply configuration changes.

Kube Stateful Set Update Not Rolled Out

Detects when StatefulSet updates are not rolling out successfully, indicating problems with ordered pod updates, storage attachments, or pod readiness.

Kube Daemon Set Rollout Stuck

Detects when a DaemonSet rollout is not progressing across nodes, indicating issues preventing the daemon from being deployed to all intended nodes.

Kube Container Waiting

Detects when containers are stuck in a waiting state for extended periods, indicating image pull failures, configuration issues, or unavailable dependencies.

Kube Daemon Set Not Scheduled

Detects when DaemonSet pods are not scheduled to nodes where they should run, indicating node selector mismatches, taints, or resource constraints.

Kube Daemon Set Mis Scheduled

Detects when DaemonSet pods are running on nodes where they shouldn't be, indicating scheduling policy violations or configuration errors.

Kube Job Not Completed

Detects when Jobs fail to complete within expected timeframes, indicating job failures, infinite loops, or resource starvation preventing task completion.

Kube Job Failed

Detects when Jobs fail, indicating errors in batch processing, data pipelines, or scheduled tasks that require investigation and remediation.

Kube Hpa Replicas Mismatch

Detects when a Horizontal Pod Autoscaler (HPA) cannot scale to the desired replica count, indicating metric collection issues, resource limits, or scaling constraints.

Kube Hpa Maxed Out

Detects when an HPA has reached its maximum replica count while still under load, indicating the need to increase max replicas or optimize application performance.

Kube Pdb Not Enough Healthy Pods

Detects when the number of healthy pods falls below PodDisruptionBudget requirements, risking service availability during voluntary disruptions like node drains.
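Several of the workload alerts above compare a desired value against an observed one. The sketch below shows that pattern for a Deployment and an HPA. The `DeploymentStatus` type is hypothetical, with fields named after the Kubernetes object fields they mirror. Real alert rules also require the mismatch to persist for some time, which is omitted here.

```python
from dataclasses import dataclass


@dataclass
class DeploymentStatus:
    generation: int           # metadata.generation (desired)
    observed_generation: int  # status.observedGeneration
    desired_replicas: int     # spec.replicas
    available_replicas: int   # status.availableReplicas


def generation_mismatch(d: DeploymentStatus) -> bool:
    """The controller has not yet observed the latest spec change."""
    return d.observed_generation != d.generation


def replicas_mismatch(d: DeploymentStatus) -> bool:
    """Fewer or more replicas are available than the spec asks for."""
    return d.available_replicas != d.desired_replicas


def hpa_maxed_out(current_replicas: int, max_replicas: int) -> bool:
    """The autoscaler is running at its configured maximum."""
    return current_replicas >= max_replicas
```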

Kubernetes System

Alert Name
Description

Kube Version Mismatch

Detects when Kubernetes component versions are inconsistent across the cluster, indicating upgrade issues or version drift that can cause compatibility problems and instability.

Kube Client Errors

Detects high rates of client errors when communicating with the Kubernetes API, indicating authentication issues, authorization problems, or API server instability.

Kubernetes API Server

Alert Name
Description

Kube Client Certificate Expiration

Detects when client certificates used for API authentication are approaching expiration, which could cause service disruptions, authentication failures, and cluster access issues if not renewed.

Kube Aggregated API Errors

Detects errors in aggregated API services, indicating problems with extension API servers that provide custom resources and functionality beyond core Kubernetes APIs.
