Database Optimization for Voice Analysis: Schema Design, Indexing, and Performance at Scale
Voice analysis systems generate massive amounts of data: acoustic features (6,000+ per recording), ML predictions, audio files, user metadata, and session logs. A production system processing 10,000 recordings/day creates 60+ million feature values daily.
Poor database design leads to:
- Slow queries: 10+ second response times for simple profile lookups
- Storage bloat: 10TB database for 1 million recordings (should be ~500GB)
- Failed writes: Analysis jobs timing out during feature insertion
- High costs: $5,000/month in database hosting (should be ~$500)
This guide shares battle-tested optimization techniques for PostgreSQL (also applicable to MySQL, MongoDB) that reduce query times by 50-100x and storage costs by 80%.
1. Schema Design: Relational vs Denormalized
Core Entity Model
5 main tables:
sessions -- User recording sessions
├── acoustic_features -- 6,000+ features per session
├── predictions -- ML model outputs
├── audio_metadata -- Duration, sample rate, format
└── analysis_logs -- Processing timestamps, errors
Option A: Fully Normalized (Traditional)
CREATE TABLE sessions (
id UUID PRIMARY KEY,
user_id UUID NOT NULL REFERENCES users(id),
started_at TIMESTAMP NOT NULL,
completed_at TIMESTAMP,
status VARCHAR(20) NOT NULL, -- 'in_progress', 'completed', 'failed'
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE acoustic_features (
id BIGSERIAL PRIMARY KEY,
session_id UUID NOT NULL REFERENCES sessions(id),
feature_name VARCHAR(100) NOT NULL, -- 'f0_mean', 'jitter_local', etc.
feature_value FLOAT NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_features_session ON acoustic_features(session_id);
CREATE INDEX idx_features_name ON acoustic_features(feature_name);
Pros:
- Flexible: Easy to add new features without schema changes
- No duplication: Feature names stored once
- Traditional: Familiar to most developers
Cons:
- Massive row count: 6,000 rows per session → 6M rows for 1,000 sessions
- Slow queries: Retrieving all features requires 6,000-row join
- Index bloat: Indexes on 6M rows consume 500MB+ per index
Query performance: 8-15 seconds to fetch features for 100 sessions
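There's a hidden cost beyond the join itself: the application has to pivot those thousands of tall rows back into one record per session on every read. A minimal sketch of that reshaping step (function name is illustrative):

```javascript
// Pivot tall (session_id, feature_name, feature_value) rows into one
// object per session -- the reshaping Option A forces on every read.
function pivotFeatures(rows) {
  const bySession = new Map();
  for (const { session_id, feature_name, feature_value } of rows) {
    if (!bySession.has(session_id)) bySession.set(session_id, { session_id });
    bySession.get(session_id)[feature_name] = feature_value;
  }
  return [...bySession.values()];
}
```

With 6,000 features per session, this loop runs 600,000 times for a 100-session fetch, on top of the database join.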
Option B: Wide Table (Recommended for Voice Analysis)
CREATE TABLE sessions (
id UUID PRIMARY KEY,
user_id UUID NOT NULL REFERENCES users(id),
started_at TIMESTAMP NOT NULL,
completed_at TIMESTAMP,
status VARCHAR(20) NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE acoustic_features (
session_id UUID PRIMARY KEY REFERENCES sessions(id),
-- Pitch features (F0)
f0_mean FLOAT,
f0_std FLOAT,
f0_min FLOAT,
f0_max FLOAT,
f0_range FLOAT,
-- Voice quality
jitter_local FLOAT,
shimmer_local FLOAT,
hnr_mean FLOAT,
-- Formants
f1_mean FLOAT,
f2_mean FLOAT,
f3_mean FLOAT,
-- MFCCs (1-13)
mfcc_1_mean FLOAT,
mfcc_2_mean FLOAT,
-- ... (88 total columns for eGeMAPS)
created_at TIMESTAMP DEFAULT NOW()
);
-- No extra index needed here: session_id is already the primary key, so
-- per-session feature lookups are index-backed. User-level queries filter
-- sessions on user_id (indexed in the next section) and join across.
Pros:
- Fast queries: Single row per session, no joins needed
- Small row count: 1 row per session → 1,000 rows for 1,000 sessions
- Efficient indexes: Indexes on 1,000 rows consume <5MB
- Predictable performance: Query time scales linearly with sessions, not features
Cons:
- Wide schema: 88+ columns (but PostgreSQL handles this well)
- Schema changes required: Adding new features needs ALTER TABLE (mitigated with JSONB for experimental features)
Query performance: 50-200ms to fetch features for 100 sessions (50-100x faster)
Hybrid Approach: Wide Table + JSONB for Flexibility
CREATE TABLE acoustic_features (
session_id UUID PRIMARY KEY REFERENCES sessions(id),
-- Core features (always present, indexed)
f0_mean FLOAT NOT NULL,
f0_std FLOAT NOT NULL,
jitter_local FLOAT NOT NULL,
shimmer_local FLOAT NOT NULL,
hnr_mean FLOAT NOT NULL,
-- Experimental/optional features (JSONB)
additional_features JSONB,
created_at TIMESTAMP DEFAULT NOW()
);
-- Index for JSONB queries
CREATE INDEX idx_additional_features ON acoustic_features
USING GIN (additional_features);
-- Insert example
INSERT INTO acoustic_features (session_id, f0_mean, f0_std, additional_features)
VALUES (
'uuid-here',
120.5,
15.3,
'{"tremor_ratio": 0.15, "formant_precision": 0.82}'::jsonb
);
-- Query JSONB (->> returns text, so cast before comparing numerically).
-- Note: the GIN index accelerates containment (@>) and key-existence
-- queries, not range predicates like this one.
SELECT * FROM acoustic_features
WHERE (additional_features->>'tremor_ratio')::float > 0.10;
Best of both worlds: Fast queries on core features, flexibility for experiments
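In application code, the split between core columns and the JSONB catch-all might look like this (a sketch; `CORE_FEATURES` mirrors the five indexed columns above):

```javascript
// Split a flat feature map into core columns vs the JSONB catch-all.
// CORE_FEATURES mirrors the indexed columns in the hybrid schema;
// everything else rides along in additional_features.
const CORE_FEATURES = new Set([
  'f0_mean', 'f0_std', 'jitter_local', 'shimmer_local', 'hnr_mean',
]);

function splitFeatures(all) {
  const core = {};
  const additional = {};
  for (const [name, value] of Object.entries(all)) {
    (CORE_FEATURES.has(name) ? core : additional)[name] = value;
  }
  return { core, additional };
}
```

Promoting an experimental feature to a first-class column is then just an ALTER TABLE plus adding its name to the set.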
2. Indexing Strategies
Index Types for Voice Data
B-tree Indexes (Default, Most Common)
-- Single column (user lookups)
CREATE INDEX idx_sessions_user ON sessions(user_id);
-- Composite (common query pattern: user + date range)
CREATE INDEX idx_sessions_user_date ON sessions(user_id, started_at DESC);
-- Partial index (only index completed sessions)
CREATE INDEX idx_completed_sessions ON sessions(user_id, started_at)
WHERE status = 'completed';
Use case: Equality and range queries (WHERE user_id = 'X', WHERE started_at > '2025-01-01')
Performance: O(log n) lookups, efficient for high-cardinality columns
GIN Indexes (JSONB, Arrays, Full-Text Search)
-- JSONB features
CREATE INDEX idx_features_gin ON acoustic_features
USING GIN (additional_features);
-- Array of tags
CREATE INDEX idx_session_tags ON sessions
USING GIN (tags);
-- Full-text search on transcripts
CREATE INDEX idx_transcript_fts ON sessions
USING GIN (to_tsvector('english', transcript));
Use case: JSONB queries, array containment, full-text search
Performance: Slower inserts (GIN indexes are large), but fast searches
BRIN Indexes (Time-Series Data)
-- Very efficient for time-ordered data
CREATE INDEX idx_sessions_created_brin ON sessions
USING BRIN (created_at);
-- Tiny index size (1-2% of B-tree), fast range queries
Use case: Large tables with naturally ordered data (timestamps, IDs)
Performance: 100x smaller than B-tree, slightly slower queries (acceptable trade-off for analytics)
Index Maintenance
-- Find unused indexes (candidates for removal)
SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan AS index_scans,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
    AND indexrelname NOT LIKE '%_pkey'
ORDER BY pg_relation_size(indexrelid) DESC;
-- Rebuild bloated indexes after heavy updates/deletes (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_sessions_user;
-- Analyze tables (update statistics for query planner)
ANALYZE acoustic_features;
3. Partitioning for Large Datasets
When to partition: >10M rows, time-based queries common (e.g., "last 30 days")
Time-Based Partitioning (Most Common)
-- Parent table (empty, defines schema)
CREATE TABLE sessions (
id UUID NOT NULL,
user_id UUID NOT NULL,
started_at TIMESTAMP NOT NULL,
status VARCHAR(20) NOT NULL,
PRIMARY KEY (id, started_at) -- started_at must be in primary key
) PARTITION BY RANGE (started_at);
-- Monthly partitions
CREATE TABLE sessions_2025_01 PARTITION OF sessions
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE sessions_2025_02 PARTITION OF sessions
FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
-- Auto-create partitions with the pg_partman extension (4.x syntax shown)
CREATE EXTENSION IF NOT EXISTS pg_partman;
SELECT partman.create_parent(
p_parent_table => 'public.sessions',
p_control => 'started_at',
p_type => 'native',
p_interval => 'monthly',
p_premake => 3 -- Create 3 months ahead
);
Benefits:
- Query pruning: Query last 7 days only scans 1 partition (not all 12 months)
- Faster maintenance: Drop old partitions instantly (vs slow DELETE)
- Parallel scans: PostgreSQL scans partitions in parallel
Query performance: 10-50x faster for time-range queries
Automated Partition Management
-- Daily cron job: create future partitions, drop old ones
SELECT partman.run_maintenance('public.sessions');
-- Configuration
UPDATE partman.part_config
SET retention_keep_table = false, -- Drop old partitions (don't just detach)
retention = '365 days' -- Keep 1 year of data
WHERE parent_table = 'public.sessions';
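If pg_partman isn't available, a small script can emit the monthly DDL instead. A hedged sketch (helper name is mine; run the output on a schedule a few months ahead of time):

```javascript
// Generate CREATE TABLE ... PARTITION OF statements for the next N months --
// a plain-SQL fallback when pg_partman is not installed.
function monthlyPartitionDDL(parent, startYear, startMonth, count) {
  const iso = (d) => d.toISOString().slice(0, 10);
  const stmts = [];
  for (let i = 0; i < count; i++) {
    // Date.UTC handles year rollover (month 13 -> January of next year)
    const from = new Date(Date.UTC(startYear, startMonth - 1 + i, 1));
    const to = new Date(Date.UTC(startYear, startMonth + i, 1));
    const name = `${parent}_${from.getUTCFullYear()}_` +
      String(from.getUTCMonth() + 1).padStart(2, '0');
    stmts.push(
      `CREATE TABLE ${name} PARTITION OF ${parent}\n` +
      `  FOR VALUES FROM ('${iso(from)}') TO ('${iso(to)}');`
    );
  }
  return stmts;
}
```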
4. Storage Optimization
Audio File Storage: Database vs Object Storage
Option A: Store in Database (NOT RECOMMENDED)
CREATE TABLE audio_files (
session_id UUID PRIMARY KEY,
audio_data BYTEA NOT NULL, -- 1-10 MB per file
created_at TIMESTAMP DEFAULT NOW()
);
-- Problems:
-- - Database size explodes (10 MB × 1M sessions = 10 TB)
-- - Slow backups (must backup 10 TB database)
-- - Expensive (database storage costs 10x more than object storage)
Option B: Object Storage + Database Metadata (RECOMMENDED)
CREATE TABLE audio_files (
session_id UUID PRIMARY KEY,
storage_path VARCHAR(500) NOT NULL, -- 's3://bucket/recordings/2025/01/15/uuid.ogg'
file_size_bytes BIGINT NOT NULL,
duration_seconds FLOAT NOT NULL,
sample_rate INTEGER NOT NULL,
format VARCHAR(10) NOT NULL, -- 'ogg', 'wav', 'mp3'
created_at TIMESTAMP DEFAULT NOW()
);
-- Storage costs:
-- - S3: $0.023/GB/month = $230/month for 10 TB
-- - RDS: $0.115/GB/month = $1,150/month for 10 TB
-- - 5x savings
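The `storage_path` convention above (bucket/recordings/year/month/day/id) is easy to generate deterministically, which keeps paths reconstructable from metadata alone. A sketch, with the bucket name as a placeholder:

```javascript
// Build the object-storage key for a recording, following the
// date-sharded layout shown in storage_path (bucket name is illustrative).
function storagePath(bucket, sessionId, recordedAt, format) {
  const y = recordedAt.getUTCFullYear();
  const m = String(recordedAt.getUTCMonth() + 1).padStart(2, '0');
  const d = String(recordedAt.getUTCDate()).padStart(2, '0');
  return `s3://${bucket}/recordings/${y}/${m}/${d}/${sessionId}.${format}`;
}
```

Date-sharded prefixes also make lifecycle rules trivial: expire or archive whole `recordings/2024/...` prefixes instead of listing objects.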
Feature Storage: FLOAT vs REAL
-- FLOAT (8 bytes, double precision)
f0_mean FLOAT -- -1.7976931348623157E+308 to 1.7976931348623157E+308
-- REAL (4 bytes, single precision)
f0_mean REAL -- -3.40282347E+38 to 3.40282347E+38
-- For acoustic features, REAL is sufficient:
-- - F0 typically 50-500 Hz (needs ~3 decimal places)
-- - Saves 50% storage (4 bytes vs 8 bytes × 6,000 features = 24 KB vs 48 KB per session)
Use REAL for acoustic features: Precision is adequate, saves 50% storage
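Why is single precision enough? `Math.fround` rounds a JS double to float32, the same precision loss REAL introduces, so you can check the error directly:

```javascript
// REAL keeps ~7 significant digits. Math.fround rounds a JS number to
// float32 -- the same rounding REAL would apply on insert.
const f0 = 234.56789; // an F0 value in Hz
const asReal = Math.fround(f0);
const roundingError = Math.abs(asReal - f0);
// roundingError is well under 0.001 Hz -- far below pitch-tracking
// measurement noise, so nothing meaningful is lost
```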
Compression with TOAST
PostgreSQL automatically compresses large columns (>2KB) using TOAST (The Oversized-Attribute Storage Technique).
-- EXTENDED (the default for JSONB) allows both compression and
-- out-of-line storage for large values
ALTER TABLE acoustic_features
ALTER COLUMN additional_features SET STORAGE EXTENDED;
-- Check TOAST statistics
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS table_size,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename) -
pg_relation_size(schemaname||'.'||tablename)) AS toast_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
5. Query Optimization
Common Query: Get User's Recent Sessions with Features
Slow query (no indexes, cross-partition scan):
SELECT
s.id,
s.started_at,
af.f0_mean,
af.f0_std,
p.age_prediction,
p.gender_prediction
FROM sessions s
LEFT JOIN acoustic_features af ON s.id = af.session_id
LEFT JOIN predictions p ON s.id = p.session_id
WHERE s.user_id = 'user-uuid'
ORDER BY s.started_at DESC
LIMIT 10;
-- Execution time: 8-12 seconds (scans all partitions)
Optimized query:
-- 1. Add composite index
CREATE INDEX idx_sessions_user_date ON sessions(user_id, started_at DESC);
-- 2. Constrain time range (partition pruning)
SELECT
s.id,
s.started_at,
af.f0_mean,
af.f0_std,
p.age_prediction,
p.gender_prediction
FROM sessions s
LEFT JOIN acoustic_features af ON s.id = af.session_id
LEFT JOIN predictions p ON s.id = p.session_id
WHERE s.user_id = 'user-uuid'
AND s.started_at > NOW() - INTERVAL '90 days' -- Prune to last 3 months
ORDER BY s.started_at DESC
LIMIT 10;
-- Execution time: 50-150ms (50-100x faster)
Using EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS)
SELECT ...;
-- Output shows:
-- - Index usage (Seq Scan = bad, Index Scan = good)
-- - Partition pruning (which partitions scanned)
-- - Join strategy (Hash Join, Nested Loop, Merge Join)
-- - Execution time breakdown
Red flags to look for:
- Seq Scan on large table: Add index on WHERE clause columns
- Hash Join on millions of rows: Optimize join keys or add indexes
- High "Rows Removed by Filter" counts: The predicate isn't index-backed; add or extend an index to cover it
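These red flags are mechanical enough to scan for in CI or a monitoring job. A toy scanner over EXPLAIN ANALYZE text output (patterns are illustrative, not exhaustive):

```javascript
// Scan EXPLAIN ANALYZE text output for the red flags listed above.
function planRedFlags(planText) {
  const flags = [];
  if (/Seq Scan/.test(planText)) {
    flags.push('sequential scan -- check for a missing index');
  }
  if (/Rows Removed by Filter: \d{4,}/.test(planText)) {
    flags.push('large post-scan filter -- push the predicate into an index');
  }
  return flags;
}
```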
Materialized Views for Analytics
Problem: Dashboard query aggregates 1M sessions every page load (5-10 seconds)
-- Slow: Run every page load
SELECT
DATE_TRUNC('day', started_at) AS day,
COUNT(*) AS session_count,
AVG(f0_mean) AS avg_pitch,
AVG(age_prediction) AS avg_age
FROM sessions s
JOIN acoustic_features af ON s.id = af.session_id
JOIN predictions p ON s.id = p.session_id
WHERE started_at > NOW() - INTERVAL '30 days'
GROUP BY DATE_TRUNC('day', started_at)
ORDER BY day DESC;
Solution: Materialized view (pre-computed aggregates)
-- Create materialized view
CREATE MATERIALIZED VIEW daily_stats AS
SELECT
DATE_TRUNC('day', started_at) AS day,
COUNT(*) AS session_count,
AVG(f0_mean) AS avg_pitch,
AVG(age_prediction) AS avg_age
FROM sessions s
JOIN acoustic_features af ON s.id = af.session_id
JOIN predictions p ON s.id = p.session_id
GROUP BY DATE_TRUNC('day', started_at);
-- Index for fast queries
CREATE UNIQUE INDEX idx_daily_stats_day ON daily_stats(day DESC);
-- Query materialized view (instant)
SELECT * FROM daily_stats
WHERE day > NOW() - INTERVAL '30 days'
ORDER BY day DESC;
-- Execution time: 5-10ms (1,000x faster)
-- Refresh daily (cron job)
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_stats;
6. Connection Pooling
Problem: Each API request opens new database connection (slow, resource-intensive)
Solution: Connection pooler (PgBouncer, AWS RDS Proxy)
; PgBouncer configuration (pgbouncer.ini) -- note PgBouncer does not strip
; inline comments, so they go on their own lines
[databases]
voice_analysis = host=localhost port=5432 dbname=voice_analysis
[pgbouncer]
listen_port = 6432
listen_addr = *
auth_type = md5
; transaction pooling is fastest for stateless API requests
pool_mode = transaction
; the API can accept 1,000 concurrent clients...
max_client_conn = 1000
; ...backed by only 20 actual database connections
default_pool_size = 20
; extra connections for spikes
reserve_pool_size = 5
Performance:
- Without pooling: 50 requests/sec (connection overhead = 20-30ms per request)
- With pooling: 500+ requests/sec (connection reuse = <1ms overhead)
Application-Level Pooling (Node.js)
import { Pool } from 'pg';
const pool = new Pool({
host: 'localhost',
port: 5432,
database: 'voice_analysis',
user: 'postgres',
password: 'password',
max: 20, // Maximum connections in pool
idleTimeoutMillis: 30000, // Close idle connections after 30s
connectionTimeoutMillis: 2000, // Fail if can't connect within 2s
});
// Reuse connections
async function getSession(sessionId) {
const client = await pool.connect(); // Reuses existing connection
try {
const result = await client.query(
'SELECT * FROM sessions WHERE id = $1',
[sessionId]
);
return result.rows[0];
} finally {
client.release(); // Return to pool (doesn't close)
}
}
7. Batch Inserts for Feature Data
Problem: Inserting 6,000 features one-by-one = 6,000 round-trips (slow)
Slow approach (one round-trip per feature):
// 6,000 individual INSERTs
for (const feature of features) {
  await db.query(
    'INSERT INTO acoustic_features (session_id, feature_name, feature_value) VALUES ($1, $2, $3)',
    [sessionId, feature.name, feature.value]
  );
}
// Time: 5-10 seconds
Fast approach (single multi-row INSERT):
// Single INSERT with 88 values (wide table)
await db.query(`
INSERT INTO acoustic_features (
session_id, f0_mean, f0_std, jitter_local, shimmer_local, ...
) VALUES ($1, $2, $3, $4, $5, ...)
`, [sessionId, features.f0_mean, features.f0_std, ...]);
// Time: 10-50ms (100-500x faster)
For a normalized schema, batch all rows into one parameterized INSERT (string-interpolating values into SQL invites injection):
// Batch insert 6,000 rows in a single round-trip
const params = [];
const placeholders = features
  .map((f, i) => {
    params.push(sessionId, f.name, f.value);
    return `($${3 * i + 1}, $${3 * i + 2}, $${3 * i + 3})`;
  })
  .join(', ');
await db.query(
  `INSERT INTO acoustic_features (session_id, feature_name, feature_value)
   VALUES ${placeholders}`,
  params
);
// Time: 200-500ms (10-50x faster than individual INSERTs)
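One caveat with multi-row INSERTs: PostgreSQL's wire protocol allows at most 65,535 bind parameters per statement, so 6,000 rows × 3 columns fits comfortably, but larger batches need chunking. A sketch (helper name is mine):

```javascript
// Split rows into chunks small enough to stay under PostgreSQL's
// 65,535 bind-parameter limit for a single parameterized statement.
function chunkForInsert(rows, paramsPerRow, maxParams = 65535) {
  const rowsPerChunk = Math.floor(maxParams / paramsPerRow);
  const chunks = [];
  for (let i = 0; i < rows.length; i += rowsPerChunk) {
    chunks.push(rows.slice(i, i + rowsPerChunk));
  }
  return chunks;
}
```

Each chunk then becomes one multi-row INSERT; wrap the chunks in a transaction if partial writes would leave a session half-loaded.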
8. Monitoring & Alerting
Key Metrics to Track
-- Slow queries (>1 second). Requires the pg_stat_statements extension,
-- preloaded via shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
SELECT
    query,
    calls,
    total_exec_time,  -- total_time on PostgreSQL 12 and earlier
    mean_exec_time,   -- mean_time on PostgreSQL 12 and earlier
    max_exec_time     -- max_time on PostgreSQL 12 and earlier
FROM pg_stat_statements
WHERE mean_exec_time > 1000 -- 1 second
ORDER BY total_exec_time DESC
LIMIT 20;
-- Table sizes
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
-- Index hit ratio (should be >95%)
SELECT
sum(idx_blks_hit) / NULLIF(sum(idx_blks_hit + idx_blks_read), 0) AS index_hit_rate
FROM pg_statio_user_indexes;
-- Cache hit ratio (should be >99%)
SELECT
sum(heap_blks_hit) / NULLIF(sum(heap_blks_hit + heap_blks_read), 0) AS cache_hit_rate
FROM pg_statio_user_tables;
Alert Thresholds
- Slow query: >5 seconds → Investigate missing indexes
- Index hit rate: <95% → Add more RAM or optimize queries
- Cache hit rate: <99% → Increase shared_buffers
- Connection count: >80% of max → Scale database or add pooling
- Disk usage: >80% → Archive old data or add storage
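These thresholds are simple enough to encode directly in a monitoring job. A sketch, with the metric field names as assumptions to wire up to your own monitoring source:

```javascript
// Evaluate the alert thresholds above against a metrics snapshot.
// Field names are illustrative; populate them from your monitoring stack.
function checkAlerts(m) {
  const alerts = [];
  if (m.slowestQuerySeconds > 5) alerts.push('slow query: check for missing indexes');
  if (m.indexHitRate < 0.95) alerts.push('index hit rate low: add RAM or optimize queries');
  if (m.cacheHitRate < 0.99) alerts.push('cache hit rate low: increase shared_buffers');
  if (m.connections / m.maxConnections > 0.8) alerts.push('connection pressure: scale or pool');
  if (m.diskUsedFraction > 0.8) alerts.push('disk filling: archive old data or add storage');
  return alerts;
}
```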
9. Backup & Disaster Recovery
Logical Backups (pg_dump)
# Full database backup
pg_dump -h localhost -U postgres -Fc voice_analysis > backup_$(date +%Y%m%d).dump
# Restore
pg_restore -h localhost -U postgres -d voice_analysis backup_20250127.dump
# Incremental backups (WAL archiving)
# postgresql.conf:
archive_mode = on
archive_command = 'cp %p /backups/wal/%f'
Point-in-Time Recovery (PITR)
# Base backup + WAL logs = restore to any point in time
pg_basebackup -h localhost -D /backups/base -U postgres -Fp -Xs -P
# Restore to 2025-01-27 14:30:00: set in postgresql.conf (PostgreSQL 12+)
# and create an empty recovery.signal file in the data directory
recovery_target_time = '2025-01-27 14:30:00'
10. Production Configuration Tuning
PostgreSQL Settings for Voice Analysis Workload
# postgresql.conf
# Memory (for 16GB server)
shared_buffers = 4GB # 25% of RAM
effective_cache_size = 12GB # 75% of RAM
work_mem = 64MB # Per-query memory (for sorts/joins)
maintenance_work_mem = 1GB # For VACUUM, CREATE INDEX
# Connections
max_connections = 200 # Total connections (use pooler for more)
# Query planner
random_page_cost = 1.1 # SSD (default 4.0 is for HDD)
effective_io_concurrency = 200 # SSD parallel I/O
# Write-ahead log
wal_buffers = 16MB
checkpoint_completion_target = 0.9
max_wal_size = 4GB
# Autovacuum (critical for performance)
autovacuum = on
autovacuum_max_workers = 4
autovacuum_naptime = 10s # Check for vacuum every 10 seconds
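The memory settings scale linearly with RAM, so they're easy to derive programmatically, e.g. when templating configs for different instance sizes. Starting points only, not a tuning oracle:

```javascript
// Derive the 25%/75% memory settings above from total RAM (in GB).
// These are conventional starting points -- tune against real workloads.
function memorySettings(ramGb) {
  return {
    shared_buffers: `${Math.round(ramGb * 0.25)}GB`,
    effective_cache_size: `${Math.round(ramGb * 0.75)}GB`,
  };
}
```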
Vacuum Strategy
-- Analyze query planner statistics (daily)
ANALYZE;
-- Vacuum to reclaim space (weekly)
VACUUM ANALYZE;
-- Aggressive vacuum (monthly, during low-traffic window)
VACUUM FULL; -- Rewrites table, requires exclusive lock
-- Check for bloat
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
n_dead_tup AS dead_rows
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
The Bottom Line: Database Optimization Checklist
For production voice analysis systems:
- Schema: Use wide table (88 columns) for acoustic features, JSONB for experimental features
- Indexes:
- B-tree on user_id, (user_id, started_at DESC)
- BRIN on created_at (for large tables)
- GIN on JSONB columns
- Partitioning: Monthly partitions for sessions (>10M rows)
- Storage:
- Audio files: S3/object storage (not database)
- Features: REAL (4 bytes) instead of FLOAT (8 bytes)
- Queries:
- Always add time range for partition pruning
- Use materialized views for dashboard aggregates
- Monitor with EXPLAIN ANALYZE
- Connections: Use PgBouncer (20 DB connections for 1,000 API clients)
- Inserts: Single multi-column INSERT (not 6,000 individual INSERTs)
- Monitoring: Track slow queries, index hit rate, cache hit rate
Expected performance:
- Insert features: 10-50ms (was 5-10 seconds)
- Fetch user sessions: 50-200ms (was 8-15 seconds)
- Dashboard aggregates: 5-10ms via materialized view (was 5-10 seconds)
- Storage costs: $500/month for 1M sessions (was $5,000/month)
Voice Mirror's database architecture uses PostgreSQL with wide-table schema (88-column acoustic_features), monthly partitioning, BRIN indexes on timestamps, and S3 for audio storage. Our optimized schema handles 10,000 sessions/day with 50-200ms query times and $0.50/session storage costs.