MySQL → AWS S3

Sync MySQL to AWS S3: snapshots and streaming CDC.

Export MySQL change events to S3 as partitioned Parquet files — feed Athena, Spark, or Glue without a separate ETL pipeline. Snapshot large tables, then stream incremental changes continuously.

TL;DR

rsync.ai reads MySQL binlog events in ROW format, batches them into Snappy-compressed Parquet files, and writes them to S3 with Hive-style date partitioning. Works with MySQL 5.7+, RDS, Aurora, Cloud SQL, and PlanetScale. IAM role or access key auth. Column-level PII masking before the S3 write. Athena-, Spark-, and Glue-compatible out of the box.

  • Parquet output with Snappy compression — Athena, Spark, Glue compatible
  • Hive-style date partitioning (YYYY-MM-DD) out of the box
  • Chunked snapshot for large tables, then streaming CDC
  • PII columns masked/nulled before S3 write — never in your data lake
How it works

How to sync MySQL to AWS S3 — 5 steps

From binlog setup to Parquet files in S3 — typically under 15 minutes.

  1. Enable MySQL binlog

    Set `binlog_format=ROW` and `log_bin=ON` in your MySQL config. Create a replication user with the REPLICATION SLAVE and REPLICATION CLIENT privileges. On RDS or Aurora, enable automated backups and set the `binlog_format` parameter to ROW. Keep at least 3 days of binlog retention, which leaves enough buffer for an rsync.ai restart without losing events.

    binlog_format=ROW · log_bin=ON · REPLICATION SLAVE + REPLICATION CLIENT
  2. Connect MySQL source

    Provide host, port, replication username, password, and the database name. rsync.ai connects as a MySQL replica and records the starting binlog position or GTID set before taking the initial snapshot. SSL/TLS and SSH tunnels are supported for MySQL instances inside a private VPC.

    mysql://rsync_cdc:pass@host:3306/mydb
  3. Connect AWS S3

    Provide your S3 bucket name and AWS credentials: either an IAM access key and secret, or an IAM role ARN if rsync.ai runs on EC2 or ECS with an instance profile. The IAM policy needs `s3:PutObject`, `s3:GetObject`, and `s3:ListBucket` on your target bucket. You can optionally specify a key prefix (e.g. `mysql/prod/`).

    s3://your-bucket/mysql/ — IAM role or access key
  4. Describe the sync in plain English

    Tell rsync.ai what you want: 'Snapshot the MySQL orders table to S3 Parquet daily, then stream changes hourly.' or 'Replicate all tables in mydb to S3 continuously as CDC events.' The AI pipeline planner decides whether to use a full snapshot, incremental CDC, or both — and proposes a partition layout for Athena compatibility.

    No SQL · No YAML · No DAGs
  5. Approve the S3 path layout and start

    rsync.ai shows the proposed S3 key structure for each table (e.g. `s3://bucket/mysql/mydb/orders/2026-05-30/part-0001.parquet`). You can adjust prefix, partition granularity (hourly / daily / monthly), and file format (Parquet or JSON-Lines). Approve, and the pipeline starts — snapshot first, then streaming CDC batches on the schedule you set.

    s3://bucket/mysql/{database}/{table}/YYYY-MM-DD/part-NNNN.parquet
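
The key structure in step 5 can be sketched as a small function. This is only an illustration of the path pattern, not rsync.ai's code: the helper name `build_s3_key`, its parameters, and the hourly and monthly date patterns are assumptions (the page only shows the daily `YYYY-MM-DD` form).

```python
from datetime import datetime, timezone

# Assumed date patterns per granularity; only the daily form appears in the docs.
PARTITION_FORMATS = {
    "hourly": "%Y-%m-%d-%H",
    "daily": "%Y-%m-%d",
    "monthly": "%Y-%m",
}


def build_s3_key(bucket, prefix, database, table, event_time, part, granularity="daily"):
    """Build s3://bucket/prefix/{database}/{table}/<partition>/part-NNNN.parquet."""
    partition = event_time.strftime(PARTITION_FORMATS[granularity])
    # Drop empty segments so an empty prefix does not produce a double slash.
    segments = [prefix.strip("/"), database, table, partition, f"part-{part:04d}.parquet"]
    return f"s3://{bucket}/" + "/".join(s for s in segments if s)


example_key = build_s3_key(
    "bucket", "mysql/", "mydb", "orders",
    datetime(2026, 5, 30, 14, 5, tzinfo=timezone.utc), part=1,
)
print(example_key)  # s3://bucket/mysql/mydb/orders/2026-05-30/part-0001.parquet
```

Keeping the partition segment a pure function of the event timestamp means a replayed batch lands under the same prefix as the original.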

MySQL → S3 output layout

Each MySQL table maps to a partitioned Parquet prefix in your S3 bucket. Paths shown for a 2026-05-30 run.

| MySQL table | S3 path | Notes |
| --- | --- | --- |
| orders | s3://bucket/mysql/mydb/orders/2026-05-30/part-0001.parquet | Partitioned by event date. Each Parquet file contains one day of inserts and updates. Deletes written as tombstone records with _deleted=true. |
| products | s3://bucket/mysql/mydb/products/2026-05-30/part-0001.parquet | Full snapshot on first run, then CDC-only files. Parquet schema inferred from MySQL column types. |
| customers | s3://bucket/mysql/mydb/customers/2026-05-30/part-0001.parquet | PII columns (email, phone) can be nulled or hashed before writing to S3. Configured per-column in the pipeline. |
| sessions | s3://bucket/mysql/mydb/sessions/2026-05-30/part-0001.parquet | High-volume table. rsync.ai batches CDC events and writes Parquet files every N minutes (configurable) rather than per-event. |
| transactions | s3://bucket/mysql/mydb/transactions/2026-05-30/part-0001.parquet | Append-only. rsync.ai writes INSERT events only. DECIMAL(18,8) preserved as Parquet DECIMAL with matching precision/scale. |
| audit_log | s3://bucket/mysql/mydb/audit_log/2026-05-30/part-0001.parquet | Append-only audit table. Each Parquet file is Snappy-compressed. Athena can query directly with no manifest needed. |
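
Because deletes arrive as tombstone records with `_deleted=true`, a consumer that wants the current state of a table has to fold the change events together. The sketch below shows one way to do that in plain Python. It assumes events are processed in commit order and that the primary key column is called `id`; both are assumptions for illustration, as is the function name `apply_cdc_events`.

```python
def apply_cdc_events(state, events, key="id"):
    """Fold CDC events into a dict of current rows keyed by primary key."""
    for event in events:
        pk = event[key]
        if event.get("_deleted"):
            # Tombstone: the row no longer exists in the source table.
            state.pop(pk, None)
        else:
            # Insert or update: the latest event for a key wins.
            state[pk] = {k: v for k, v in event.items() if k != "_deleted"}
    return state


events = [
    {"id": 1, "status": "new"},
    {"id": 2, "status": "new"},
    {"id": 1, "status": "paid"},
    {"id": 2, "_deleted": True},
]
current = apply_cdc_events({}, events)
print(current)  # {1: {'id': 1, 'status': 'paid'}}
```

The same last-write-wins logic is usually expressed in SQL as a window function over the primary key when querying from Athena or Spark.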

rsync.ai vs. Fivetran, Airbyte, mysqldump+cron for MySQL → S3

How the options stack up for getting MySQL data into your data lake.

Tools compared: rsync.ai, Fivetran, Airbyte, mysqldump+cron

Features compared:

  • Real-time MySQL binlog CDC to S3
  • Plain-English pipeline setup
  • Parquet output with Athena partitioning
  • Self-hosted (data stays in your VPC)
  • Source-available connector code (auditable)
  • Column-level PII masking before S3 write
  • No per-row / per-MAR pricing
  • Schema evolution handling

MySQL to S3 — frequently asked

What file formats does rsync.ai use when writing MySQL data to S3?

Parquet (Snappy-compressed) is the default and recommended format. It compresses well, supports predicate pushdown in Athena and Spark, and preserves MySQL column types as Parquet logical types. JSON-Lines (one JSON object per row, gzip-compressed) is available for consumers that cannot read Parquet. CSV is supported but not recommended for production because it loses type information.
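
To make the JSON-Lines option concrete, here is a minimal standard-library sketch of gzip-compressed JSON-Lines. It is not rsync.ai's writer: the function names are hypothetical, and serializing `DECIMAL` values as strings is an assumption made so that no precision is lost. It also shows why Parquet is preferred when types matter, since the decimal comes back as a plain string.

```python
import gzip
import io
import json
from decimal import Decimal


def _encode_extra(value):
    # json cannot serialize Decimal natively; emit a fixed-point string instead.
    if isinstance(value, Decimal):
        return format(value, "f")
    raise TypeError(f"unsupported type: {type(value).__name__}")


def write_jsonl_gz(rows):
    """Serialize rows as one JSON object per line, gzip-compressed, into bytes."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for row in rows:
            gz.write(json.dumps(row, default=_encode_extra).encode("utf-8") + b"\n")
    return buf.getvalue()


def read_jsonl_gz(payload):
    """Decode bytes produced by write_jsonl_gz back into a list of dicts."""
    with gzip.GzipFile(fileobj=io.BytesIO(payload), mode="rb") as gz:
        return [json.loads(line) for line in gz.read().decode("utf-8").splitlines()]


payload = write_jsonl_gz([{"id": 1, "amount": Decimal("12.50000000")}])
decoded = read_jsonl_gz(payload)
print(decoded)  # [{'id': 1, 'amount': '12.50000000'}]
```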

How are S3 files partitioned and can I query them with Athena?

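By default, rsync.ai writes to `s3://bucket/{prefix}/{database}/{table}/YYYY-MM-DD/part-NNNN.parquet`, one date-named prefix per day. To query it from Athena, register the partitions with an AWS Glue Crawler, `ALTER TABLE ADD PARTITION`, or partition projection. `MSCK REPAIR TABLE` only discovers prefixes written in `key=value` form (for example `dt=2026-05-30/`), so it will not pick up bare date prefixes like the one shown. You can change partition granularity to hourly or monthly, or use a custom prefix pattern. Athena prunes partitions by prefix, and the column statistics in every Parquet file let it skip non-matching row groups within a partition.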
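
For date prefixes that are not written in `key=value` form, partitions can be registered in Athena explicitly. The sketch below generates one `ALTER TABLE ... ADD PARTITION` statement per day. The table name, the partition column name `dt`, and the helper name are assumptions for illustration; the statements would still need to be submitted to Athena by whatever client you use.

```python
from datetime import date, timedelta


def add_partition_statements(table, base_uri, start, end, column="dt"):
    """Return one Athena ADD PARTITION statement per day in [start, end]."""
    base = base_uri.rstrip("/")
    statements = []
    day = start
    while day <= end:
        iso = day.isoformat()
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION ({column} = '{iso}') LOCATION '{base}/{iso}/'"
        )
        day += timedelta(days=1)
    return statements


ddl = add_partition_statements(
    "orders", "s3://bucket/mysql/mydb/orders/", date(2026, 5, 30), date(2026, 5, 31)
)
for statement in ddl:
    print(statement)
```

`IF NOT EXISTS` makes the statements safe to re-run after each sync.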

Should I use an IAM role or access keys for S3 access?

An IAM role is strongly preferred. If rsync.ai runs on EC2 or ECS, attach an instance profile (EC2) or task role (ECS) with the required S3 permissions (`s3:PutObject`, `s3:GetObject`, `s3:ListBucket`). IAM roles rotate credentials automatically and never expose long-lived secrets. If you run rsync.ai outside AWS (e.g. on-prem), use an IAM user with an access key scoped to only your target bucket via a resource-level policy. Long-lived keys should be rotated every 90 days.
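
A policy scoped to one bucket with the three permissions named above might look like the document this sketch generates. The helper name, the `Sid` values, and the bucket and prefix are placeholders. Note that `s3:ListBucket` is granted on the bucket ARN, while the object actions are granted on the object ARN.

```python
import json


def s3_writer_policy(bucket, prefix=""):
    """Build an IAM policy document limited to one bucket and key prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadWriteObjects",
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject"],
                # Object-level actions apply to keys under the prefix.
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
            {
                "Sid": "ListTargetBucket",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                # Bucket-level action applies to the bucket itself.
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


policy = s3_writer_policy("your-bucket", "mysql/")
print(json.dumps(policy, indent=2))
```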

How does rsync.ai handle large MySQL tables — full snapshot vs. streaming?

For large tables (millions of rows), rsync.ai takes a chunked snapshot — it reads the table in primary-key-ordered chunks and writes each chunk as a Parquet file to S3. While the snapshot runs, binlog events are buffered. Once the snapshot completes, rsync.ai switches to streaming mode and applies buffered + new CDC events. The switch is atomic — you get a consistent point-in-time snapshot plus all changes since.
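
Reading a table in primary-key-ordered chunks is commonly done with keyset pagination: each query resumes after the last key seen, so no chunk requires scanning the rows before it. The sketch below illustrates the idea against an in-memory SQLite table as a stand-in for MySQL. It is not rsync.ai's implementation; the function name and chunk size are assumptions, and the table and column names must be trusted identifiers because they are interpolated into the SQL.

```python
import sqlite3


def snapshot_chunks(conn, table, pk, chunk_size):
    """Yield lists of rows in primary-key order, resuming after the last key seen."""
    last_pk = None
    while True:
        if last_pk is None:
            cur = conn.execute(
                f"SELECT * FROM {table} ORDER BY {pk} LIMIT ?", (chunk_size,)
            )
        else:
            cur = conn.execute(
                f"SELECT * FROM {table} WHERE {pk} > ? ORDER BY {pk} LIMIT ?",
                (last_pk, chunk_size),
            )
        rows = cur.fetchall()
        if not rows:
            return
        yield rows
        last_pk = rows[-1][pk]  # needs row_factory = sqlite3.Row for name access


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany(
    "INSERT INTO orders (id, total) VALUES (?, ?)", [(i, i * 1.5) for i in range(1, 8)]
)
chunk_sizes = [len(chunk) for chunk in snapshot_chunks(conn, "orders", "id", 3)]
print(chunk_sizes)  # [3, 3, 1]
```

In a real pipeline each yielded chunk would become one Parquet file, and the last key would be persisted so an interrupted snapshot can resume.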

How does rsync.ai handle MySQL schema changes (ALTER TABLE) for S3 output?

When rsync.ai detects a DDL event in the binlog, it closes the current in-flight Parquet file, then starts a new file with the updated schema. Older Parquet files retain the old schema. Athena handles the resulting mix of schemas if the table resolves Parquet columns by name rather than by position. For breaking changes (column type change, column drop), rsync.ai surfaces a review prompt in the UI.
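
The distinction between additive and breaking changes can be sketched as a comparison of two column-to-type mappings. This is an illustration under assumptions, not rsync.ai's logic: the function name and the dict representation of a schema are hypothetical, and it treats only column drops and type changes as breaking, as described above.

```python
def classify_schema_change(old, new):
    """Compare two {column: type} mappings and flag breaking changes."""
    added = sorted(set(new) - set(old))
    dropped = sorted(set(old) - set(new))
    retyped = sorted(c for c in set(old) & set(new) if old[c] != new[c])
    return {
        "added": added,
        "dropped": dropped,
        "retyped": retyped,
        # Added columns are safe for by-name readers; drops and retypes are not.
        "breaking": bool(dropped or retyped),
    }


old_schema = {"id": "BIGINT", "total": "DECIMAL(18,8)", "note": "VARCHAR(255)"}
new_schema = {"id": "BIGINT", "total": "DECIMAL(20,8)", "coupon": "VARCHAR(32)"}
change = classify_schema_change(old_schema, new_schema)
print(change)
```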

Is the MySQL → S3 output compatible with Apache Spark and AWS Glue?

Yes. The Parquet files rsync.ai writes use standard Parquet column statistics and Hive-style partitioning, which Spark, Glue ETL, AWS Glue Crawler, Presto, and DuckDB all understand natively. Spark can read the S3 prefix directly: `spark.read.parquet('s3://bucket/mysql/mydb/orders/')`. No manifest files or special configuration needed.

Can I redact or mask PII columns before they land in S3?

Yes. In the pipeline configuration you mark individual columns as Null, Hashed (SHA-256), or Truncated. Masking is applied in rsync.ai's processing layer before the Parquet file is written — the raw value never reaches S3 or any intermediate log. This is especially useful for the customers table where email and phone should not land in a data lake that many team members can query.
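
The three masking modes can be illustrated with a short function applied to each row before it is written. This is a sketch only: the rule names, the `mask_row` helper, and the truncation length are assumptions, not rsync.ai's configuration format.

```python
import hashlib


def mask_row(row, rules, truncate_to=4):
    """Return a copy of row with each configured column nulled, hashed, or truncated."""
    masked = dict(row)
    for column, rule in rules.items():
        value = masked.get(column)
        if value is None:
            continue  # nothing to mask
        if rule == "null":
            masked[column] = None
        elif rule == "hash":
            masked[column] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        elif rule == "truncate":
            masked[column] = str(value)[:truncate_to]
        else:
            raise ValueError(f"unknown masking rule: {rule}")
    return masked


customer = {"id": 7, "email": "ada@example.com", "phone": "+1-555-0100", "name": "Ada Lovelace"}
masked_customer = mask_row(customer, {"email": "hash", "phone": "null", "name": "truncate"})
print(masked_customer)
```

Hashing keeps the column usable as a join key, but an unsalted SHA-256 of a guessable value such as an email address can be matched by hashing candidate values, so nulling is the safer choice when the column is not needed downstream.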

Is rsync.ai self-hosted for MySQL → S3?

Yes. rsync.ai runs on your own compute via `docker compose up`. Your MySQL credentials, AWS credentials, and row data never leave your infrastructure. The S3 connector writes directly from your network to your bucket — no relay through rsync.ai's servers. License is Elastic License 2.0 — free to self-host, cannot be resold as a managed service.