Postgres at Scale: Lessons from Running Multi-Terabyte Clusters
Presented by:
Roneel Kumar
Roneel Kumar is a Senior Relational Databases Specialist Solutions Architect in the AWS Worldwide Database Services Organization (DBSO). His core area of expertise is designing, building, and implementing database modernization platforms for customers running PostgreSQL databases.
Sameer Kumar
Sameer Kumar is a Database Specialist at Amazon Web Services. He focuses on Amazon RDS, Amazon Aurora, and Amazon DocumentDB, and works with enterprise customers, providing technical assistance on operational database performance and sharing database best practices.
PostgreSQL has matured into one of the most trusted databases for mission-critical workloads, but scaling it to tens or even hundreds of terabytes is a different game altogether. At this scale, the challenges shift from simple tuning to deep architectural decisions: vacuum pressure, index bloat, replication lag, and the limits of storage and I/O subsystems all start to matter.
In this talk, we'll dive into hard-earned lessons from real multi-terabyte PostgreSQL clusters. We'll explore how teams have tackled performance bottlenecks, survived schema changes on large tables, optimized partitioning and parallelism, and balanced autovacuum with business SLAs. You'll see how replication strategies, failover design, and backup approaches evolve at scale, and how to make Postgres predictable when you push it to its limits.
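As a concrete illustration of the autovacuum-versus-SLA balancing mentioned above, the sketch below sets per-table autovacuum storage parameters on a hypothetical `orders` table; the table name and every threshold value are placeholders for illustration, not recommendations from the talk.

```sql
-- Minimal sketch: per-table autovacuum overrides for a large, write-heavy table.
-- "orders" and all threshold values are hypothetical placeholders.
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0,         -- don't wait for 20% of a multi-TB table to become dead tuples
    autovacuum_vacuum_threshold    = 1000000,   -- trigger after roughly 1M dead tuples instead
    autovacuum_vacuum_cost_limit   = 2000       -- allow the worker more I/O per cost cycle
);

-- Verify the effect: dead-tuple counts and last autovacuum times per table.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM   pg_stat_user_tables
ORDER  BY n_dead_tup DESC
LIMIT  10;
```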
Whether you're running a 2 TB analytics cluster or planning for 50 TB+ OLTP workloads, this session will give you practical strategies, tuning tips, and architectural patterns to keep PostgreSQL fast, stable, and resilient, no matter how big your data grows.

The session covers:

- Why scaling Postgres is fundamentally different from small-scale operations: the transition from GB to TB workloads and the distinct challenges of OLTP versus analytics systems.
- Storage and architecture patterns: partitioning strategies (range, hash, and list), practical parallel query execution techniques, and WAL and I/O considerations at scale (a brief SQL sketch follows this outline).
- Autovacuum and bloat management: tuning strategies for multi-TB tables and methods for detecting and preventing index bloat.
- Replication and high availability: streaming replication at scale, common pitfalls such as replication lag and cascading replica issues, and trade-offs between logical and physical replication.
- Schema evolution and upgrades: safe methods for applying schema changes to massive tables and rolling upgrades with zero-downtime patterns.
- Monitoring and observability: essential metrics including locks, LWLocks, replication slots, and vacuum progress, and designing dashboards and alerts that provide actionable insights (see the monitoring sketch below).
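To make the partitioning discussion concrete, here is a minimal sketch of declarative range partitioning on a hypothetical `events` table; the table, column names, and monthly boundaries are assumptions for illustration only.

```sql
-- Minimal sketch: monthly range partitioning, so old data can be detached or
-- dropped as a whole partition instead of deleted row by row.
CREATE TABLE events (
    event_id   bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2026_01 PARTITION OF events
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
CREATE TABLE events_2026_02 PARTITION OF events
    FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');

-- Retiring a month becomes a metadata operation rather than a huge DELETE.
ALTER TABLE events DETACH PARTITION events_2026_01;
```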
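For the observability topics, the queries below sketch two of the metrics named in the outline, vacuum progress and replication-slot WAL retention; they use only standard catalog views, and any thresholds or alerting wiring are left to the reader.

```sql
-- Minimal sketch: watch long-running vacuums on big tables.
SELECT p.pid, c.relname, p.phase, p.heap_blks_scanned, p.heap_blks_total
FROM   pg_stat_progress_vacuum p
JOIN   pg_class c ON c.oid = p.relid;

-- Minimal sketch: how much WAL each replication slot is holding back.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM   pg_replication_slots;
```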
- Date:
- Duration: 45 min
- Room:
- Conference: PGConf India, 2026
- Language:
- Track: Case Study
- Difficulty: Hard