Alerts
These are the default, out-of-the-box alerts configured for each supported platform. We can (and do) configure additional alerts for individual customers depending on their needs, but those customer-specific alerts are not documented here.
AWS
DynamoDB
AWS Dynamo DB High Read Capacity Utilization
Detects when read capacity consumption is approaching or exceeding provisioned limits, which may lead to throttled requests and degraded application performance.
AWS Dynamo DB High Write Capacity Utilization
Detects when write capacity consumption is approaching or exceeding provisioned limits, potentially causing write throttling and data ingestion delays.
AWS Dynamo DB High Number Of Throttled Requests
Detects excessive throttled requests indicating insufficient provisioned capacity or inefficient access patterns that impact application availability.
AWS Dynamo DB Conditional Check Failed Requests
Detects high rates of conditional write failures, which may indicate concurrency conflicts, optimistic locking issues, or application logic problems.
AWS Dynamo DB High System Errors
Detects internal DynamoDB service errors that could indicate platform issues, availability problems, or configuration issues requiring investigation.
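As a rough illustration of how the read and write capacity alerts above can be reasoned about, the sketch below turns a consumed-capacity total into a utilization fraction. It assumes the consumed figure is a sum over the sampling period (as with the `Sum` statistic of CloudWatch's `ConsumedReadCapacityUnits`). The function names and the 0.8 threshold are hypothetical and are not the values used by the actual alert rules.

```python
def capacity_utilization(consumed_units_sum: float,
                         period_seconds: int,
                         provisioned_units_per_second: float) -> float:
    """Return consumed capacity as a fraction of provisioned capacity."""
    if period_seconds <= 0 or provisioned_units_per_second <= 0:
        raise ValueError("period and provisioned capacity must be positive")
    # Convert the per-period total into an average per-second rate.
    consumed_per_second = consumed_units_sum / period_seconds
    return consumed_per_second / provisioned_units_per_second


def is_high_utilization(utilization: float, threshold: float = 0.8) -> bool:
    """Hypothetical threshold check; 0.8 is an illustrative value only."""
    return utilization >= threshold
```

For example, 27,000 read units consumed over a 300-second period against 100 provisioned units per second works out to 90% utilization.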
EBS
AWS EBS Low Idle Time
Detects when EBS volumes have consistently low idle time, indicating the volume is constantly busy and may be experiencing performance bottlenecks or saturation.
AWS EBS High Volume Utilization
Detects when storage capacity is approaching its limit, which could lead to out-of-space errors, failed writes, and application disruptions.
AWS EBS Low Burst Balance
Detects when burst balance credits are depleted on burstable volume types (gp2, st1, and sc1), resulting in reduced IOPS or throughput and potential application slowdowns. gp3 volumes do not use burst credits.
AWS EBS High Volume Queue Length
Detects when too many I/O operations are queued, indicating the volume cannot keep up with demand and may cause increased latency and degraded performance.
EC2
AWS EC2 High Cpu Utilization
Detects when CPU usage is consistently high, which may indicate resource constraints, inefficient code, or the need to scale up or out to maintain performance.
AWS EC2 Status Check Failed
Detects when EC2 instance or system status checks fail, indicating underlying hardware issues, network problems, or instance health degradation requiring intervention.
Lambda
AWS Lambda Throttling Events
Detects when Lambda functions are being throttled due to exceeding concurrent execution limits, indicating the need to request limit increases or optimize function concurrency.
AWS Lambda Invocation Failures
Detects when Lambda function invocations fail due to code errors, timeout issues, resource constraints, or permission problems, impacting application reliability and functionality.
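One common way to reason about invocation failures is as a ratio of errors to invocations over a window, rather than a raw error count. The sketch below shows that calculation under the assumption that both counts cover the same time window. The function name is hypothetical, and whether the actual alert uses a ratio or an absolute count is not stated here.

```python
def invocation_error_rate(errors: int, invocations: int) -> float:
    """Return the fraction of invocations that failed in a window."""
    if errors < 0 or invocations < 0:
        raise ValueError("counts must be non-negative")
    if invocations == 0:
        # No traffic in the window, so there is nothing to report as failing.
        return 0.0
    return errors / invocations
```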
Managed Kafka
AWS Kafka High System Cpu
Detects when broker CPU utilization is consistently high, which may impact message processing throughput, increase latency, and indicate the need for cluster scaling or optimization.
AWS Kafka High Root Disk Used
Detects when broker disk space is running low, which could lead to message loss, inability to accept new data, and potential cluster instability if not addressed.
AWS Kafka Active Controller Count
Detects when there is not exactly one active controller in the cluster, indicating split-brain scenarios or controller election failures that can disrupt cluster operations.
AWS Kafka Partition Under Replicated
Detects when partitions have fewer in-sync replicas than configured, indicating replication lag or broker failures that reduce fault tolerance and risk data loss.
AWS Kafka Offline Partitions
Detects when partitions have no available leader, making them unavailable for reads and writes, resulting in service disruption and potential data unavailability.
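The controller and replication checks in this section can be sketched as simple aggregations over per-broker samples. The example below is a minimal illustration: the `BrokerSample` type and its field names are hypothetical stand-ins for whatever per-broker metrics the monitoring backend exposes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BrokerSample:
    """Hypothetical per-broker metric sample."""
    broker_id: int
    active_controller_count: int
    under_replicated_partitions: int


def evaluate_cluster(samples: List[BrokerSample]) -> List[str]:
    """Return a list of human-readable findings; empty means healthy."""
    findings = []
    # A healthy cluster has exactly one broker acting as controller.
    controllers = sum(s.active_controller_count for s in samples)
    if controllers != 1:
        findings.append(
            f"active controller count is {controllers}, expected exactly 1"
        )
    # Any under-replicated partition reduces fault tolerance.
    under_replicated = sum(s.under_replicated_partitions for s in samples)
    if under_replicated > 0:
        findings.append(f"{under_replicated} partition(s) under-replicated")
    return findings
```

Summing the controller count across brokers catches both failure modes: a total of 0 means no controller was elected, and a total above 1 suggests a split-brain condition.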
RDS
AWS RDS High Cpu Spikes
Detects sudden, sharp increases in CPU utilization that may indicate inefficient queries, missing indexes, or unexpected workload patterns requiring investigation and optimization.
AWS RDS High Cpu Load
Detects sustained high CPU usage indicating the database is under heavy load, which may require query optimization, scaling up the instance, or implementing read replicas.
AWS RDS Running Out Of Cpu Credits
Detects when burstable instance types (T-class) are depleting CPU credits, which will result in baseline performance limitations and potential application slowdowns.
AWS RDS Disk IOPS Bottleneck
Detects when IOPS utilization is at or near limits, causing disk I/O queuing and increased query latency that impacts database performance and application responsiveness.
AWS RDS Read Replica Lag
Detects when read replicas are falling behind the primary instance, indicating replication delays that may serve stale data and impact read consistency for applications.
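The difference between the CPU spike and CPU load alerts in this section is easier to see in code. The sketch below treats a spike as a sharp jump of the latest sample over the recent baseline, and sustained load as every sample in the window staying above a threshold. The function names and the default values (30 percentage points, 80%) are hypothetical and are not the thresholds used by the actual alerts.

```python
from statistics import mean
from typing import Sequence


def is_cpu_spike(samples: Sequence[float], jump: float = 30.0) -> bool:
    """True when the latest sample exceeds the mean of the earlier
    samples by at least `jump` percentage points."""
    if len(samples) < 2:
        return False
    baseline = mean(samples[:-1])
    return samples[-1] - baseline >= jump


def is_sustained_load(samples: Sequence[float], threshold: float = 80.0) -> bool:
    """True when every sample in the window is at or above `threshold`."""
    return bool(samples) and all(s >= threshold for s in samples)
```

With samples of 20, 22, 21, and 75 percent, only the spike check fires; with 85, 90, 88, and 92 percent, only the sustained-load check fires.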
Azure
Azure Databases
Azure Database High Dtu Consumption
Detects when Database Transaction Unit (DTU) consumption is high, indicating the database is approaching resource limits and may need scaling or query optimization to maintain performance.
Azure Database High Storage Usage
Detects when database storage capacity is running low, which could lead to write failures, inability to grow transaction logs, and potential database unavailability.
Azure Database High Deadlock Count
Detects frequent database deadlocks indicating concurrency issues, poor transaction design, or missing indexes that impact application performance and user experience.
Azure Database High User Cpu Usage
Detects high CPU utilization from user queries, suggesting inefficient queries, missing indexes, or heavy workload that may require optimization or scaling.
Azure Database High System Failed Connections
Detects high rates of failed connections due to system-level issues like resource exhaustion, firewall rules, or service problems requiring immediate investigation.
Azure Database High User Failed Connections
Detects frequent failed connection attempts from users, indicating authentication issues, network problems, or connection pool exhaustion affecting application connectivity.
Azure Database High Worker Usage
Detects when worker thread utilization is high, indicating the database is handling many concurrent requests and may be approaching parallelism limits.
Azure Database High Data IO Usage
Detects high data I/O percentage indicating storage throughput bottlenecks that can slow query performance and impact overall database responsiveness.
Azure Database Low Temp DB Log Space
Detects when tempdb log space is running low, which can cause queries to fail, block operations, and disrupt database functionality requiring immediate attention.
Azure Firewall
Azure Firewall Latency
Detects when firewall processing latency exceeds acceptable thresholds, indicating performance degradation that can slow network traffic and impact application response times.
Azure Firewall Health Percentage
Detects when firewall health percentage drops below acceptable thresholds, indicating degraded firewall functionality, capacity issues, or service problems affecting network security and connectivity.
Azure Virtual Machines
Azure VM High Cpu Utilization
Detects when virtual machine CPU usage is consistently high, indicating resource constraints, inefficient workloads, or the need to scale up or out to maintain performance.
Azure VM Unavailable
Detects when a virtual machine becomes unavailable or unresponsive, indicating system failures, crashes, network issues, or underlying infrastructure problems requiring immediate attention.
Azure Virtual Network
Azure VNet Subnet IP Exhaustion
Detects when available IP addresses in a subnet are running low, which could prevent new resources from being deployed and impact network scalability and growth.
Azure VNet Peering Connection Failures
Detects when VNet peering connections are failing, indicating network connectivity issues between virtual networks that can disrupt multi-region or multi-VNet architectures and application communication.
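For the subnet IP exhaustion alert, it helps to remember that Azure reserves five addresses in every subnet, so the usable pool is smaller than the CIDR size suggests. The sketch below computes usage against the usable pool using Python's standard `ipaddress` module. The function names are hypothetical, and the allocated count is assumed to come from whatever inventory or metric source is available.

```python
import ipaddress

# Azure reserves 5 addresses in each subnet (network address, gateway,
# two for Azure DNS, and broadcast).
AZURE_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr: str) -> int:
    """Return the number of addresses available for resources."""
    network = ipaddress.ip_network(cidr)
    usable = network.num_addresses - AZURE_RESERVED_PER_SUBNET
    if usable <= 0:
        raise ValueError(f"{cidr} is too small to host any resources")
    return usable


def subnet_ip_usage(cidr: str, allocated: int) -> float:
    """Return allocated addresses as a fraction of usable addresses."""
    return allocated / usable_addresses(cidr)
```

A `/24` subnet therefore offers 251 usable addresses rather than 256.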
Kubernetes
External Secrets
External Secret Status
Detects when external secrets are unhealthy and unable to sync, indicating the secret may be missing or values cannot be updated, potentially causing application configuration issues.
Secret Store Status
Detects when the Secret Store is unhealthy, typically indicating permission issues preventing access to the external secrets management system.
External Secrets Controller High Error Rate
Detects high error rates in the reconciliation of secrets from Secrets Manager, indicating potential performance degradation or permission issues affecting secret synchronization.
External Secrets Controller 90th Percentile Reconcile Time
Detects when reconcile time exceeds allowed thresholds, suggesting the controller may be under-provisioned or encountering errors that slow down secret synchronization.
External Secrets Workqueue Depth Too High
Detects when work queue depth is too high, indicating the need to increase concurrent controller reconciles or run multiple controllers to handle the secret synchronization workload.
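The reconcile time alert above is based on a 90th percentile rather than an average, so a handful of slow reconciles is enough to trigger it even when most are fast. The sketch below shows a nearest-rank percentile over raw duration samples. This is an illustration only: the function name is hypothetical, and a metrics backend working from histogram buckets would estimate the percentile differently.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    `pct` percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```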
Falco
Falco Priority Alert
Detects when Falco raises security alerts with priority levels of Error, Critical, Alert, or Emergency, indicating suspicious runtime behavior or security policy violations.
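Falco priorities follow the syslog severity ordering, from Emergency (most severe) down to Debug. The sketch below shows how a filter for "Error and above" can be expressed against that ordering. The function name and the case-insensitive matching are assumptions made for illustration, not a description of how the alert rule itself is implemented.

```python
# Falco priority levels, ordered from most to least severe.
FALCO_PRIORITIES = [
    "Emergency", "Alert", "Critical", "Error",
    "Warning", "Notice", "Informational", "Debug",
]

_ORDER = {name.lower(): index for index, name in enumerate(FALCO_PRIORITIES)}


def should_alert(priority: str, minimum: str = "Error") -> bool:
    """True when `priority` is at least as severe as `minimum`."""
    try:
        return _ORDER[priority.lower()] <= _ORDER[minimum.lower()]
    except KeyError as exc:
        raise ValueError(f"unknown Falco priority: {exc.args[0]}") from None
```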
Kubernetes Apps
Crash Loop Backoff
Detects when pods are repeatedly crashing and restarting, indicating application errors, misconfigurations, or resource issues that prevent stable operation.
Kube Pod Not Ready
Detects when pods remain in a not-ready state, indicating failed health checks, startup issues, or dependencies not being met that prevent the pod from serving traffic.
Kube Deployment Generation Mismatch
Detects when a Deployment's observed generation doesn't match the desired generation, indicating the deployment controller is unable to process updates properly.
Kube Deployment Replicas Mismatch
Detects when the actual number of running replicas doesn't match the desired count, indicating scheduling issues, resource constraints, or pod failures.
Kube Deployment Rollout Stuck
Detects when a deployment rollout is not progressing, indicating issues with the new pod version preventing successful updates and deployments.
Kube Stateful Set Replicas Mismatch
Detects when StatefulSet replicas don't match the desired count, indicating issues with persistent storage, pod scheduling, or initialization failures.
Kube Stateful Set Generation Mismatch
Detects when a StatefulSet's observed generation doesn't match the desired generation, indicating the controller cannot successfully apply configuration changes.
Kube Stateful Set Update Not Rolled Out
Detects when StatefulSet updates are not rolling out successfully, indicating problems with ordered pod updates, storage attachments, or pod readiness.
Kube Daemon Set Rollout Stuck
Detects when a DaemonSet rollout is not progressing across nodes, indicating issues preventing the daemon from being deployed to all intended nodes.
Kube Container Waiting
Detects when containers are stuck in a waiting state for extended periods, indicating image pull failures, configuration issues, or unavailable dependencies.
Kube Daemon Set Not Scheduled
Detects when DaemonSet pods are not scheduled to nodes where they should run, indicating node selector mismatches, taints, or resource constraints.
Kube Daemon Set Mis Scheduled
Detects when DaemonSet pods are running on nodes where they shouldn't be, indicating scheduling policy violations or configuration errors.
Kube Job Not Completed
Detects when Jobs fail to complete within expected timeframes, indicating job failures, infinite loops, or resource starvation preventing task completion.
Kube Job Failed
Detects when Jobs fail, indicating errors in batch processing, data pipelines, or scheduled tasks that require investigation and remediation.
Kube Hpa Replicas Mismatch
Detects when a Horizontal Pod Autoscaler (HPA) cannot scale to the desired replica count, indicating metric collection issues, resource limits, or scaling constraints.
Kube Hpa Maxed Out
Detects when an HPA has reached its maximum replica count while still under load, indicating the need to increase the maximum replicas or optimize application performance.
Kube Pdb Not Enough Healthy Pods
Detects when the number of healthy pods falls below PodDisruptionBudget requirements, risking service availability during voluntary disruptions like node drains.
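Several alerts in this section compare a desired value in an object's spec with an observed value in its status. The sketch below illustrates three of those comparisons on plain dictionaries shaped like Kubernetes API objects. The field names (`metadata.generation`, `status.observedGeneration`, `spec.replicas`, `status.availableReplicas`, `status.currentReplicas`, `spec.maxReplicas`) are standard Kubernetes fields, while the function names are hypothetical. The real alerts also require a condition to persist for some time before firing, which is not modeled here.

```python
from typing import Any, Dict

KubeObject = Dict[str, Any]


def generation_mismatch(obj: KubeObject) -> bool:
    """True when the controller has not yet observed the latest spec."""
    return obj["metadata"]["generation"] != obj["status"].get("observedGeneration")


def replicas_mismatch(obj: KubeObject) -> bool:
    """True when available replicas differ from the desired count."""
    desired = obj["spec"].get("replicas", 1)  # Kubernetes defaults to 1
    available = obj["status"].get("availableReplicas", 0)
    return desired != available


def hpa_maxed_out(hpa: KubeObject) -> bool:
    """True when the HPA is running at its configured maximum."""
    return hpa["status"]["currentReplicas"] >= hpa["spec"]["maxReplicas"]
```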
Kubernetes System
Kube Version Mismatch
Detects when Kubernetes component versions are inconsistent across the cluster, indicating upgrade issues or version drift that can cause compatibility problems and instability.
Kube Client Errors
Detects high rates of client errors when communicating with the Kubernetes API, indicating authentication issues, authorization problems, or API server instability.
Kubernetes Api Server
Kube Client Certificate Expiration
Detects when client certificates used for API authentication are approaching expiration, which could cause service disruptions, authentication failures, and cluster access issues if not renewed.
Kube Aggregated API Errors
Detects errors in aggregated API services, indicating problems with extension API servers that provide custom resources and functionality beyond core Kubernetes APIs.
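The client certificate expiration alert in this section amounts to comparing the time remaining on a certificate with a warning window. The sketch below shows that comparison with standard-library datetimes. The function name and the seven-day default window are hypothetical and do not reflect the alert's actual threshold.

```python
from datetime import datetime, timedelta, timezone


def expires_soon(not_after: datetime,
                 now: datetime,
                 warning_window: timedelta = timedelta(days=7)) -> bool:
    """True when the certificate expires within `warning_window`
    of `now`, or has already expired."""
    return not_after - now <= warning_window
```

Passing `now` explicitly, for example `datetime.now(timezone.utc)`, keeps the check easy to test and avoids mixing naive and timezone-aware datetimes.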