AWS S3 integration

AWS S3 data pipelines,
described in plain English.

Land PostgreSQL, MySQL, and SaaS data in S3 as Parquet, JSON, or CSV — CDC streaming or scheduled snapshots — and move raw files byte-for-byte between S3, GCS, and Azure. Live on rsync.ai Cloud, no per-row fees.

TL;DR

rsync.ai writes to AWS S3 two ways: structured output (Parquet, JSON, or CSV with a schema manifest, from Postgres/MySQL CDC or snapshots) and byte-identical blob passthrough (copy any object between S3, GCS, and Azure). Path layout, partitioning, and format are configurable; PII columns are masked before the write. Works with S3-compatible stores like MinIO, R2, and Wasabi.

  • Parquet, JSON, or CSV — date / hour / Hive partitioning
  • CDC streaming or scheduled snapshots — or both
  • IAM role or access-key auth — MinIO & R2 supported
  • Blob passthrough between S3, GCS, and Azure Blob

What the S3 connector does

Structured exports for analytics, and raw blob passthrough for everything else.

Structured exports

Postgres & MySQL tables to Parquet, JSON, or CSV with a schema manifest.

Blob passthrough

Copy any object byte-for-byte between S3, GCS, and Azure — SHA-256 verified.

PII-safe

Mask or hash sensitive columns before a single byte lands in your bucket.

  • Parquet (Snappy), JSON, or CSV output
  • Date / hour / Hive-style partitioning
  • IAM role or access-key auth
  • PII masking before the S3 write
  • Resumable multi-part uploads
  • SHA-256 integrity on every object
  • Blob passthrough: S3 ↔ GCS ↔ Azure
  • No per-row or per-GB pricing

rsync.ai vs. Fivetran, Airbyte, custom scripts for S3

What you give up — and gain — choosing rsync.ai for pipelines into AWS S3.

Feature comparison across rsync.ai, Fivetran, Airbyte, and custom scripts:

  • Plain-English pipeline setup
  • CDC streaming to S3 (Postgres & MySQL)
  • Parquet output with schema manifest
  • Blob passthrough (S3 ↔ GCS ↔ Azure)
  • PII masking before write
  • No per-row / per-MAR pricing
  • Resumable snapshots (no restart on failure)

AWS S3 pipelines — frequently asked

What can rsync.ai write to AWS S3?

Two things. First, relational and SaaS data as structured files: PostgreSQL and MySQL tables (via CDC or snapshot), plus data from other supported sources, written as Parquet, JSON, or CSV with a schema manifest alongside each batch. Second, raw files via blob passthrough: any object can be copied byte-for-byte from GCS or Azure Blob into S3 without re-encoding.

Is AWS S3 a source or a destination?

Both. S3 is most commonly a destination for data-lake ingestion and compliance archiving, but rsync.ai can also read objects from S3 and move them to another store (GCS, Azure Blob, or a different S3 bucket) using byte-identical blob passthrough. Pipelines from a blob source to a relational database are intentionally rejected, because a raw binary can't be written to a table row without parsing.
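
To illustrate what a SHA-256 check on a byte-identical copy involves, here is a minimal Python sketch. It is not rsync.ai's implementation: the function names and the 1 MiB chunk size are assumptions made for illustration, and in-memory streams stand in for the source and destination stores.

```python
import hashlib
import io
from typing import BinaryIO

# Assumption: 1 MiB chunks, so large objects are never held fully in memory.
CHUNK_SIZE = 1024 * 1024


def sha256_of_stream(stream: BinaryIO) -> str:
    """Return the hex SHA-256 digest of everything left to read in the stream."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(CHUNK_SIZE), b""):
        digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source: BinaryIO, destination: BinaryIO) -> str:
    """Copy source to destination unchanged, then re-read and compare digests."""
    source_digest = hashlib.sha256()
    for chunk in iter(lambda: source.read(CHUNK_SIZE), b""):
        destination.write(chunk)       # bytes are written exactly as read
        source_digest.update(chunk)    # hash the source while streaming
    destination.seek(0)                # rewind to hash what was actually written
    written_digest = sha256_of_stream(destination)
    if written_digest != source_digest.hexdigest():
        raise ValueError("destination bytes differ from source")
    return written_digest


if __name__ == "__main__":
    src = io.BytesIO(b"any object, copied without re-encoding")
    dst = io.BytesIO()
    print(copy_and_verify(src, dst))
```

Hashing the source while it streams and then re-reading the destination means the comparison covers the bytes that were actually stored, not just the bytes that were sent.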

How does S3 path partitioning work?

By default rsync.ai partitions by date: s3://bucket/{schema}/{table}/YYYY-MM-DD/part-NNNN.parquet. For high-volume tables you can switch to hourly partitioning, or Hive-style (year=2026/month=06/day=24/) for Athena partition projection and AWS Glue crawlers. You set the strategy in plain English or at the approval step.
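
The layouts above can be sketched as a small key-building function. This is an illustrative Python sketch, not rsync.ai's code: the function name `object_key` and the strategy names are hypothetical, the hourly layout (an extra `HH` directory under the date) is an assumption because the page does not spell it out, and the part number is zero-padded to four digits to match `part-NNNN`.

```python
from datetime import datetime, timezone


def object_key(schema: str, table: str, ts: datetime, part: int,
               strategy: str = "date") -> str:
    """Build the object key (the path after s3://bucket/) for one Parquet part."""
    if strategy == "date":
        # Default layout from the text: {schema}/{table}/YYYY-MM-DD/
        partition = ts.strftime("%Y-%m-%d")
    elif strategy == "hour":
        # Assumption: hourly partitioning nests an HH directory under the date.
        partition = ts.strftime("%Y-%m-%d/%H")
    elif strategy == "hive":
        # Hive-style key=value directories, as read by Athena and Glue.
        partition = f"year={ts:%Y}/month={ts:%m}/day={ts:%d}"
    else:
        raise ValueError(f"unknown partition strategy: {strategy!r}")
    return f"{schema}/{table}/{partition}/part-{part:04d}.parquet"


if __name__ == "__main__":
    ts = datetime(2026, 6, 24, 13, 0, tzinfo=timezone.utc)
    for strategy in ("date", "hour", "hive"):
        print("s3://bucket/" + object_key("public", "orders", ts, 7, strategy))
```

Hive-style directories put the partition column names in the path itself, which is why Athena and Glue can discover the partitions without extra configuration.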

IAM roles or access keys — which should I use?

IAM roles are strongly preferred. If rsync.ai runs on EC2, ECS, or Lambda inside your AWS account, attach an instance profile, task role, or execution role, and no long-lived credentials are stored. Otherwise use a scoped IAM user with s3:PutObject, s3:GetObject, and s3:ListBucket limited to your specific bucket and prefix. MinIO and other S3-compatible stores work too: point rsync.ai at your custom endpoint.
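
A scoped policy of the kind described above might look like the following Python sketch, which builds the policy document as a dictionary and prints it as JSON. The bucket name, prefix, statement IDs, and the helper `scoped_s3_policy` are all hypothetical. It grants only the three actions named in the text; whether your setup needs anything more is not stated here.

```python
import json


def scoped_s3_policy(bucket: str, prefix: str) -> dict:
    """Return an IAM policy document limited to one bucket and one key prefix."""
    prefix = prefix.strip("/")
    if not bucket or not prefix:
        raise ValueError("bucket and prefix must both be non-empty")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # s3:ListBucket is a bucket-level action, so the resource is
                # the bucket ARN and the prefix is enforced with a condition.
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Object-level actions are scoped by the object ARN pattern.
                "Sid": "ReadWriteObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and prefix, for illustration only.
    print(json.dumps(scoped_s3_policy("example-data-lake", "rsync/"), indent=2))
```

The list permission and the object permissions sit in separate statements because they apply to different resource types: the bucket itself versus the objects under the prefix.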

Does rsync.ai support MinIO and other S3-compatible stores?

Yes. Anything that speaks the S3 API — MinIO, Cloudflare R2, Wasabi, Backblaze B2 — works by setting a custom endpoint. The same Parquet/JSON/CSV output, partitioning, and PII masking apply.

Do I have to deploy anything to use the S3 connector?

No. rsync.ai Cloud is live at app.rsync.ai — sign up free and build an S3 pipeline in minutes, nothing to provision. If you'd rather run the whole stack inside your own VPC, self-hosting (source-available, Elastic License 2.0) arrives July 2026.