PostgreSQL → AWS S3

Export PostgreSQL to S3 as Parquet, JSON, or CSV.

Stream PostgreSQL CDC changes to S3 for data lake ingestion, compliance archiving, or cost-effective long-term storage, using pipelines you describe in plain English. Output is Parquet by default, Snappy-compressed and date-partitioned.

TL;DR

rsync.ai reads PostgreSQL via logical replication (CDC mode) or consistent COPY (snapshot mode) and writes to S3 as Parquet (Snappy), JSON, or CSV. Path layout is configurable: date or hour partitioning, Hive-style or flat. Schema is inferred from Postgres column types. Run an initial snapshot followed by incremental CDC, or schedule full snapshots. Works with self-hosted Postgres, RDS, Aurora, Google Cloud SQL, Supabase, and Neon.

  • Parquet (Snappy-compressed), JSON, or CSV — your choice
  • Date or hour partitioning — Hive-style for Athena / Glue
  • CDC streaming or scheduled snapshots — or both
  • Self-hosted, source-available under Elastic License 2.0

How it works

How to sync PostgreSQL to AWS S3 — 5 steps

From enabling logical replication to Parquet files landing in your bucket.

  1. Enable logical replication on PostgreSQL

    Set `wal_level=logical` in postgresql.conf (or via the RDS/Aurora parameter group with `rds.logical_replication=1`). Set `max_replication_slots` ≥ 2. Create a publication for the tables you want to export: `CREATE PUBLICATION rsync_pub FOR ALL TABLES;`. For batch snapshot mode only (no CDC streaming), you can skip the replication slot — rsync.ai will use a consistent COPY snapshot instead.

    wal_level=logical · pgoutput · or batch snapshot mode
  2. Connect PostgreSQL

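    Provide host, port (default 5432), user, password, and database. For CDC streaming mode also provide the replication slot name and publication name. The user needs SELECT on target tables (snapshot mode) or the REPLICATION privilege (CDC mode); because CDC mode starts with an initial snapshot, that user also needs SELECT on the same tables. On RDS and Aurora, replication rights are granted through the `rds_replication` role. rsync.ai supports self-hosted Postgres 10+, Amazon RDS, Aurora, Google Cloud SQL, Supabase, and Neon.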

    CDC mode: REPLICATION role · Snapshot mode: SELECT only
  3. Connect AWS S3

    Provide the S3 bucket name and your preferred authentication: AWS access key + secret, or an IAM role ARN (recommended for EC2/ECS deployments). The IAM policy needs `s3:PutObject`, `s3:GetObject`, and `s3:ListBucket` on your bucket. If rsync.ai runs inside AWS, attach an instance profile or ECS task role — no long-lived credentials needed.

    Access key / secret · or IAM role (recommended)
  4. Describe the sync in plain English

    Type what you want: 'Snapshot all tables from the public schema to S3 as Parquet daily at 2am, then stream CDC changes hourly.' rsync.ai determines the S3 path layout, file format, and partitioning strategy. You can also specify a custom path prefix, format (Parquet, JSON, CSV), or compression (Snappy, GZIP, uncompressed).

    No SQL · No YAML · Parquet / JSON / CSV · Snappy / GZIP
  5. Approve path layout and start the pipeline

    rsync.ai shows the proposed S3 path structure — e.g. `s3://your-bucket/postgres/public/orders/2026-05-30/part-0001.parquet` — and the Parquet schema inferred from PostgreSQL column types. Review partitioning (date, hour), file naming, and column types. Approve and the pipeline takes an initial consistent snapshot, then streams incremental changes on the schedule you set.

    s3://bucket/{prefix}/{schema}/{table}/YYYY-MM-DD/part-NNNN.parquet
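The prerequisites named in step 1 can be checked before a CDC pipeline is started. The following is a minimal sketch, not rsync.ai's own code: `replication_issues` is a hypothetical helper, and it assumes you have already read the two settings from the server (for example with `SHOW wal_level;` and `SHOW max_replication_slots;`) into a plain dictionary.

```python
def replication_issues(settings: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the step 1 settings look right."""
    issues = []

    # Step 1 asks for wal_level=logical.
    if settings.get("wal_level") != "logical":
        issues.append("wal_level must be 'logical'")

    # Step 1 asks for max_replication_slots >= 2. SHOW returns text, so parse it.
    try:
        slots = int(settings.get("max_replication_slots", "0"))
    except ValueError:
        slots = 0
    if slots < 2:
        issues.append("max_replication_slots must be at least 2")

    return issues


# Example values as they might come back from SHOW on a default install.
print(replication_issues({"wal_level": "replica", "max_replication_slots": "10"}))
```

An empty result only confirms the two settings; the publication and the user's privileges still need to be in place.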

PostgreSQL → S3 path layout

Default path structure rsync.ai proposes. Customize prefix, format, and partitioning before approving.

| PostgreSQL table | S3 path (Parquet) | Notes |
| --- | --- | --- |
| public.users | s3://bucket/postgres/public/users/YYYY-MM-DD/part-0001.parquet | UUID columns stored as STRING in Parquet. TIMESTAMPTZ→TIMESTAMP (UTC). |
| public.orders | s3://bucket/postgres/public/orders/YYYY-MM-DD/part-0001.parquet | NUMERIC(10,2)→DECIMAL(10,2) in Parquet. JSONB columns stored as STRING. |
| public.products | s3://bucket/postgres/public/products/YYYY-MM-DD/part-0001.parquet | TEXT[]→repeated STRING group. JSONB metadata→STRING. |
| public.events | s3://bucket/postgres/public/events/YYYY-MM-DD/part-0001.parquet | High-volume table; hourly partitioning recommended. BIGSERIAL→INT64. |
| public.sessions | s3://bucket/postgres/public/sessions/YYYY-MM-DD/part-0001.parquet | TIMESTAMPTZ expires_at→TIMESTAMP. UUID columns→STRING. |
| public.audit_logs | s3://bucket/postgres/public/audit_logs/YYYY-MM-DD/part-0001.parquet | JSONB row_data→STRING. Append-only table; no UPDATE/DELETE CDC needed. |
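
The type notes in the table above can be collected into a small lookup. This is an illustrative sketch that covers only the mappings listed on this page; `parquet_type` is a hypothetical helper, the plain `TEXT` entry is inferred from the `TEXT[]` row, and the real inference in rsync.ai may handle many more types.

```python
import re

# Only the mappings named in the table above; TEXT is implied by the TEXT[] row.
SIMPLE_TYPES = {
    "UUID": "STRING",
    "JSONB": "STRING",
    "TEXT": "STRING",
    "TIMESTAMPTZ": "TIMESTAMP (UTC)",
    "BIGSERIAL": "INT64",
}


def parquet_type(pg_type: str) -> str:
    """Map a PostgreSQL type name to the Parquet type described in the table."""
    name = pg_type.strip().upper()

    # Arrays become a repeated group of the element type.
    if name.endswith("[]"):
        return f"repeated {parquet_type(name[:-2])}"

    # NUMERIC(p,s) keeps its precision and scale as DECIMAL(p,s).
    match = re.fullmatch(r"NUMERIC\((\d+),\s*(\d+)\)", name)
    if match:
        return f"DECIMAL({match.group(1)},{match.group(2)})"

    if name in SIMPLE_TYPES:
        return SIMPLE_TYPES[name]
    raise ValueError(f"no mapping listed for {pg_type!r}")


print(parquet_type("numeric(10,2)"), parquet_type("text[]"))
```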

rsync.ai vs. Fivetran, Airbyte, custom pg_dump scripts for PostgreSQL → S3

What you give up — and gain — choosing rsync.ai for Postgres to S3 pipelines.

Products compared: rsync.ai, Fivetran, Airbyte, pg_dump scripts.

Features compared:

  • Real-time CDC streaming to S3
  • Plain-English pipeline setup
  • Parquet output with schema manifest
  • Self-hosted (data stays in your network)
  • Source-available connector code (auditable)
  • No per-row / per-MAR pricing
  • Resumable snapshots (no restart on failure)
  • PII masking before S3 write

PostgreSQL to AWS S3 — frequently asked

What file formats does rsync.ai write to S3?

Parquet (default, Snappy-compressed), newline-delimited JSON, and CSV. Parquet is recommended for data lake use cases — it's columnar, compressed, and natively supported by Athena, Redshift Spectrum, BigQuery, Databricks, and DuckDB. JSON is useful for event streaming or downstream systems that consume JSON. CSV is available for legacy tooling that doesn't speak Parquet.
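
To make the difference between the two text formats concrete, here is a small sketch that serializes the same rows as newline-delimited JSON and as CSV using only the Python standard library. The function names and sample rows are hypothetical, and Parquet is left out because writing it needs a third-party library.

```python
import csv
import io
import json


def to_ndjson(rows: list[dict]) -> str:
    """One JSON object per line; default=str covers values such as dates."""
    return "".join(json.dumps(row, default=str) + "\n" for row in rows)


def to_csv(rows: list[dict]) -> str:
    """Header row followed by one line per record, with column order from the first row."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}]
print(to_ndjson(rows))
print(to_csv(rows))
```

NDJSON repeats the column names in every record and keeps nested values intact, while CSV states them once in the header and flattens everything to text.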

How does S3 path partitioning work?

By default rsync.ai partitions by date: `s3://bucket/{prefix}/{schema}/{table}/YYYY-MM-DD/part-NNNN.parquet`, where the example paths on this page use the prefix `postgres`. For high-volume tables you can choose hourly partitioning: `.../YYYY-MM-DD/HH/`. Hive-style partitioning (`year=2026/month=05/day=30/`) is also available for compatibility with Athena partition projection and AWS Glue crawlers. You set the strategy in plain English or in the approve step.
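
The three layouts described above can be sketched as a single key-building function. This is an assumption-laden illustration rather than rsync.ai's implementation: `object_key` and its parameters are hypothetical, the `postgres` prefix mirrors the example paths on this page, and the `hour=HH` segment for hourly Hive-style paths is a guess at how the two options would combine.

```python
from datetime import datetime, timezone


def object_key(schema: str, table: str, ts: datetime, part: int, *,
               prefix: str = "postgres", hourly: bool = False, hive: bool = False) -> str:
    """Build the S3 object key (everything after s3://bucket/) for one Parquet part."""
    if hive:
        # Hive-style: key=value directories that Athena and Glue can read as partitions.
        partition = [f"year={ts:%Y}", f"month={ts:%m}", f"day={ts:%d}"]
        if hourly:
            partition.append(f"hour={ts:%H}")  # assumed naming, not confirmed by the page
    else:
        partition = [f"{ts:%Y-%m-%d}"]
        if hourly:
            partition.append(f"{ts:%H}")

    pieces = [prefix, schema, table, *partition, f"part-{part:04d}.parquet"]
    return "/".join(piece for piece in pieces if piece)


ts = datetime(2026, 5, 30, 14, tzinfo=timezone.utc)
print(object_key("public", "orders", ts, 1))
print(object_key("public", "events", ts, 1, hourly=True))
print(object_key("public", "orders", ts, 1, hive=True))
```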

IAM roles vs access keys — which should I use?

IAM roles are strongly preferred. If rsync.ai runs on EC2, ECS, or Lambda inside your AWS account, attach an instance profile, task role, or execution role, so that no long-lived credentials are stored anywhere. If rsync.ai runs outside AWS (self-hosted on-premise or in another cloud), use an IAM user with a scoped policy: `s3:PutObject`, `s3:GetObject`, `s3:ListBucket` on your specific bucket and path prefix. Rotate the access key regularly.
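
A scoped policy of the kind described above might look like the output of this sketch. The helper name `s3_policy` and the bucket name `my-lake` are hypothetical. The structure follows standard IAM rules: `s3:ListBucket` applies to the bucket ARN (narrowed here with the `s3:prefix` condition key), while `s3:PutObject` and `s3:GetObject` apply to object ARNs under the prefix.

```python
import json


def s3_policy(bucket: str, prefix: str = "") -> dict:
    """Build a least-privilege policy for the three actions named in the answer above."""
    prefix = prefix.strip("/")
    object_path = f"{prefix}/*" if prefix else "*"

    # ListBucket is a bucket-level action, so its resource is the bucket itself.
    list_statement = {
        "Effect": "Allow",
        "Action": ["s3:ListBucket"],
        "Resource": [f"arn:aws:s3:::{bucket}"],
    }
    if prefix:
        # Restrict listing to keys under the prefix.
        list_statement["Condition"] = {"StringLike": {"s3:prefix": [f"{prefix}/*"]}}

    # PutObject and GetObject are object-level actions.
    object_statement = {
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:GetObject"],
        "Resource": [f"arn:aws:s3:::{bucket}/{object_path}"],
    }
    return {"Version": "2012-10-17", "Statement": [list_statement, object_statement]}


print(json.dumps(s3_policy("my-lake", "postgres"), indent=2))
```

The same document can be attached to an IAM role or to an IAM user, so the choice between roles and access keys does not change the permissions themselves.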

How does rsync.ai handle large tables during the initial snapshot?

The initial snapshot uses a consistent PostgreSQL COPY (or a repeatable read transaction for logical replication mode) so you get a point-in-time consistent snapshot even for tables with billions of rows. rsync.ai streams the COPY output directly to S3 in multi-part upload chunks, so memory usage is constant regardless of table size. Progress is checkpointed per table so an interrupted snapshot resumes from the last committed chunk rather than restarting from scratch.
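
The bounded-memory idea can be illustrated with a generator that cuts any byte stream into numbered parts and holds only one part at a time. This is a simplified sketch, not rsync.ai's uploader: `iter_parts` is a hypothetical helper, it assumes a buffered stream whose `read(n)` returns `n` bytes until the end of the data, and a real uploader would send each part to S3 instead of collecting lengths.

```python
import io
from typing import BinaryIO, Iterator

# S3 requires every part of a multipart upload except the last to be at least 5 MiB.
MIN_PART_SIZE = 5 * 1024 * 1024


def iter_parts(stream: BinaryIO, part_size: int = 8 * 1024 * 1024) -> Iterator[tuple[int, bytes]]:
    """Yield (part_number, data) pairs, keeping at most one part in memory."""
    part_number = 1  # S3 part numbers start at 1
    while True:
        data = stream.read(part_size)
        if not data:
            break
        yield part_number, data
        part_number += 1


# Tiny demonstration: a 25-byte in-memory stream cut into 10-byte parts.
# (A real upload would keep part_size at or above MIN_PART_SIZE.)
demo = [(number, len(data)) for number, data in iter_parts(io.BytesIO(b"x" * 25), part_size=10)]
print(demo)
```

Because each part is handed off before the next one is read, peak memory depends on the part size and not on the table size.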

How does schema evolution work in Parquet files?

Parquet files are immutable once written. When rsync.ai detects a new column in PostgreSQL (via schema drift detection), it starts writing new files with the expanded schema. Old files retain the old schema. Tools like Athena, Spark, and DuckDB can read the mixed files by merging schemas (union of all columns, NULL-filled for missing fields); in Spark and DuckDB this is opt-in, through the `mergeSchema` and `union_by_name` options respectively. rsync.ai also writes a JSON schema manifest alongside each batch so downstream tools can discover column additions.
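
What "union of all columns, NULL-filled" means in practice can be shown with plain dictionaries standing in for rows read from old and new files. This sketch is illustrative only; `merge_batches` and the sample columns are hypothetical and do not reflect how any particular query engine implements merging.

```python
def merge_batches(batches: list[list[dict]]) -> tuple[list[str], list[dict]]:
    """Union the columns of all batches and fill missing values with None."""
    # Collect column names in first-seen order so older columns come first.
    columns: list[str] = []
    for batch in batches:
        for row in batch:
            for name in row:
                if name not in columns:
                    columns.append(name)

    # Re-emit every row with the full column set; absent fields become None (NULL).
    rows = [{name: row.get(name) for name in columns} for batch in batches for row in batch]
    return columns, rows


old_file = [{"id": 1, "email": "a@example.com"}]                 # written before the new column
new_file = [{"id": 2, "email": "b@example.com", "plan": "pro"}]  # written after it was added
columns, rows = merge_batches([old_file, new_file])
print(columns)
print(rows)
```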

What happens if a snapshot is interrupted mid-table?

rsync.ai checkpoints progress at the S3 object level. If the process is interrupted, the next run skips already-completed tables and resumes the in-progress table from the last committed multipart upload chunk. Incomplete multipart uploads left behind by an interruption are aborted automatically by an S3 lifecycle rule, which rsync.ai configures on first run if you grant the required permissions. Note that creating a bucket lifecycle rule requires `s3:PutLifecycleConfiguration`, while `s3:AbortMultipartUpload` covers aborting an individual upload directly.
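
The resume behaviour described above amounts to a small planning step at startup. The sketch below assumes a checkpoint shaped as a dictionary of per-table state with `done` and `last_part` fields; that format and the `plan_resume` helper are hypothetical and are not taken from rsync.ai.

```python
def plan_resume(tables: list[str], checkpoint: dict[str, dict]) -> list[tuple[str, int]]:
    """Return (table, next_part_number) for every table that still has work to do."""
    plan = []
    for table in tables:
        # Tables with no checkpoint entry have not started, so they begin at part 1.
        state = checkpoint.get(table, {"done": False, "last_part": 0})
        if state["done"]:
            continue  # fully exported on an earlier run; skip it
        plan.append((table, state["last_part"] + 1))
    return plan


checkpoint = {
    "public.users": {"done": True, "last_part": 12},
    "public.orders": {"done": False, "last_part": 7},  # interrupted after part 7
}
print(plan_resume(["public.users", "public.orders", "public.events"], checkpoint))
```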

Can rsync.ai mask PII columns in Parquet files written to S3?

Yes. Mark any column as `masked` in the pipeline configuration — rsync.ai replaces the value with a SHA-256 hash or NULL before writing to S3. The raw value is never written to any S3 object. This is useful for GDPR/CCPA compliance: you keep full event history in your data lake without exposing email, phone, or payment data.
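
Column masking of this kind can be sketched in a few lines. The helpers `mask_value` and `mask_row`, the mode names `hash` and `null`, and the sample row are all hypothetical; the sketch also assumes a plain unsalted SHA-256 over the value's text form, which may differ from what rsync.ai does.

```python
import hashlib


def mask_value(value, mode: str = "hash"):
    """Replace a value with its SHA-256 hex digest, or with None."""
    if value is None or mode == "null":
        return None
    if mode != "hash":
        raise ValueError(f"unknown mask mode: {mode!r}")
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def mask_row(row: dict, masked: dict[str, str]) -> dict:
    """Apply the configured mask to each listed column and pass the rest through."""
    return {
        name: mask_value(value, masked[name]) if name in masked else value
        for name, value in row.items()
    }


row = {"id": 1, "email": "a@example.com", "phone": "555-0100"}
print(mask_row(row, {"email": "hash", "phone": "null"}))
```

Hashing keeps equal inputs equal, so masked columns can still be joined or counted, whereas NULL removes the value entirely.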

Is rsync.ai self-hosted for PostgreSQL to S3?

Yes. The full rsync.ai stack runs on your infrastructure via `docker compose up`. Your PostgreSQL credentials and AWS keys or IAM role are stored in a Postgres control plane that you also host. Nothing leaves your network. License is Elastic License 2.0 — free to self-host, cannot be resold as a managed service.