AWS Cloud Operations Blog

AWS Observability ICYMI: Jan-May 2026

Welcome to the first edition of the AWS Observability ICYMI (In Case You Missed It) recap! The first five months of 2026 has been transformational for AWS observability with over 40 launches across Amazon CloudWatch, AWS X-Ray, Amazon Managed Grafana, and Amazon Managed Service for Prometheus. Two major themes defined this period: OpenTelemetry as the […]

Import Historical data from AWS CloudTrail Lake to Amazon CloudWatch

Organizations managing workloads on AWS rely on AWS CloudTrail to answer the fundamental questions: Who did what, where, and when? Since January 2022, customers have stored their CloudTrail activity logs in CloudTrail Lake, a managed data lake purpose-built for capturing, storing, querying user and API activity across their AWS environment.  As organizations scale across multiple […]

Shift-Left Tag Compliance using AWS Organizations and Terraform

In this post you will learn about AWS Organizations tag policies, the tag_policy_compliance Terraform provider setting, a reusable tagging module that automatically applies required tags, and a test-driven approach that dynamically validates against your organizational policies.

Simplifying Prometheus metrics collection across your AWS infrastructure

If you’re running services such as Amazon EC2 instances, Amazon Elastic Container Service (Amazon ECS) containers, and Amazon Managed Streaming for Apache Kafka (Amazon MSK) clusters in AWS, maintaining separate Prometheus servers for each environment creates significant operational burden. Managing scraper configurations, high availability, scaling, and security distracts you from building great applications. AWS managed […]

Introducing OpenTelemetry and PromQL support in Amazon CloudWatch

If you run Kubernetes or microservices workloads on AWS, your metrics likely carry dozens of labels: namespace, pod, container, node, deployment, replica set, and custom business dimensions. To get a complete picture of your environment, you may be splitting your metrics pipeline: Amazon CloudWatch for AWS metrics, and a separate Prometheus-compatible backend for high-cardinality (many […]

AWS Unified Operations: Building Resilient Operations for Mission-Critical Workloads

Achieve Mission-Critical Resiliency at Scale with AWS Unified Operations – The Top Tier of AWS Support to Achieve High Availability, Faster Migrations, and Accelerated Incident Resolution The Shift-Left Paradigm: From Reactive Firefighting to Proactive Prevention Organizations running mission-critical workloads face three critical operational gaps that undermine resilience and slow cloud adoption. Skills gaps make cloud-native […]

Announcing General Availability of AWS DevOps Agent 

Today, we’re announcing the general availability of AWS DevOps Agent. AWS DevOps Agent is your always-available operations teammate. It resolves and proactively prevents incidents, optimizes application reliability and performance, and handles on-demand SRE tasks across AWS, multicloud, and on-premises environments. Operations teams spend countless hours investigating incidents, correlating data across multiple tools, and manually triaging […]

Essential security controls to prevent unauthorized account removal in AWS Organizations

When AWS member accounts are compromised, attackers can remove them from your organization, disabling all governance controls. In this post, you’ll learn how to protect your AWS environment from account compromise leaving your AWS Organization using layered security controls, including service control policies, secure account migration, and centralized root access management. AWS secures the infrastructure […]

Adaptive sampling with AWS X-Ray to capture critical spans

Introduction Enterprise applications using AWS X-Ray generate large volumes of distributed tracing data across multiple services. Static sampling strategies keep costs down by capturing a fixed percentage of traffic. However, they frequently miss critical data during intermittent failures or sudden latency spikes. Tracing every request for maximum visibility at scale may increase sampling costs for […]

Automate AWS Systems Manager activation for hybrid-managed node registration

AWS Systems Manager (formerly known as SSM) is an AWS service that you can use to view and control your servers on AWS cloud and on-premises infrastructure. Systems Manager makes it easy to manage a hybrid environment. To set up servers and virtual machines (VMs) in your hybrid environment as Systems Manager managed instances, you […]