AWS Big Data Blog

Category: Advanced (300)

Real-time CDC from Aurora PostgreSQL to Amazon S3 Tables using Debezium and Firehose

In this post, we show you how to build a CDC pipeline that delivers query-ready Iceberg tables directly. The pipeline captures inserts, updates, and deletes from Aurora PostgreSQL and applies them as row-level operations in Amazon S3 Tables, a capability of Amazon Simple Storage Service (Amazon S3).

Upgrade PySpark from Spark 3.5 to Spark 4.0 with AWS Spark Upgrade Agent

In this post, we walk through a hands-on PySpark migration from Spark 3.5 to Spark 4.0 on Amazon EMR Serverless, using the AWS Spark Upgrade Agent. You’ll see how the agent iteratively validates your application on a live Amazon EMR Serverless application, automatically diagnosing and resolving failures from Amazon CloudWatch logs until the job succeeds.

Migrate JMS applications to Amazon MQ for RabbitMQ with minimal changes

This post shows you how to migrate your JMS applications and walks through a complete setup, from creating the broker to sending and receiving messages. You will also see a real-world scenario: migrating an existing Apache ActiveMQ workload to an Amazon MQ broker running RabbitMQ. The post covers configuration changes, monitoring with Amazon CloudWatch, and validation steps to make sure that your migration succeeds.

How Zynga scaled multi-warehouse data governance with Amazon Redshift federated permissions

In this post, we walk through how Zynga adopted Amazon Redshift federated permissions and AWS IAM Identity Center to enforce consistent, tiered data access across provisioned and serverless Amazon Redshift environments without building custom synchronization pipelines.

A systematic approach to benchmarking SQL processing engines on AWS

Selecting the right SQL processing solution for large-scale data analytics is a critical decision for organizations. As data volumes grow exponentially, the technology landscape has evolved to offer diverse options for processing and analyzing this information efficiently. This post presents a systematic framework for evaluating and benchmarking SQL processing engines on AWS, using Apache JMeter to conduct practical performance testing at scale.

Build petabyte-scale synthetic test data with Amazon EMR on EC2

As data volumes grow from terabytes to petabytes, the architecture for generating synthetic data must evolve to meet increasing demands for scale, performance, and data quality. In this post, we show how you can build a scalable synthetic data generation solution using Amazon EMR, Apache Spark, and the Faker library.

Securing client confidentiality at scale: Automated data discovery and governed analytics for legal workloads

In this post, we show you a reference architecture that automates sensitive data discovery across legal document repositories on Amazon Web Services (AWS), demonstrate how to capture structured findings as a compliance dataset, and guide you through building a governed analytics workspace that maintains your security boundaries. You walk away with a practical model for building security and analytics into the same lifecycle, without moving documents outside their system of record.