Database Optimization for Voice Analysis: Schema Design, Indexing, and Performance at Scale
Voice analysis systems generate massive amounts of data: acoustic features (6,000+ per recording), ML predictions, audio files, user metadata, and session logs. A production system processing 10,000 recordings/day creates 60+ million feature values daily.
Poor database design leads to:
- Slow queries: 10+ second response times for simple profile lookups
- Storage bloat: 10TB database for 1 million recordings (should be ~500GB)
- Failed writes: Analysis jobs timing out during feature insertion
- High costs: $5,000/month in database hosting (should be ~$500)
This guide shares battle-tested optimization techniques for PostgreSQL (also applicable to MySQL, MongoDB) that reduce query times by 50-100x and storage costs by 80%.
1. Schema Design: Relational vs Denormalized
Core Entity Model
5 main tables:
sessions -- User recording sessions
├── acoustic_features -- 6,000+ features per session
├── predictions -- ML model outputs
├── audio_metadata -- Duration, sample rate, format
└── analysis_logs -- Processing timestamps, errors
Option A: Fully Normalized (Traditional)
CREATE TABLE sessions (
id UUID PRIMARY KEY,
user_id UUID NOT NULL REFERENCES users(id),
started_at TIMESTAMP NOT NULL,
completed_at TIMESTAMP,
status VARCHAR(20) NOT NULL, -- 'in_progress', 'completed', 'failed'
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE acoustic_features (
id BIGSERIAL PRIMARY KEY,
session_id UUID NOT NULL REFERENCES sessions(id),
feature_name VARCHAR(100) NOT NULL, -- 'f0_mean', 'jitter_local', etc.
feature_value FLOAT NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_features_session ON acoustic_features(session_id);
CREATE INDEX idx_features_name ON acoustic_features(feature_name);
Pros:
- Flexible: Easy to add new features without schema changes
- No duplication: Feature names stored once
- Traditional: Familiar to most developers
Cons:
- Massive row count: 6,000 rows per session → 6M rows for 1,000 sessions
- Slow queries: Retrieving all features requires 6,000-row join
- Index bloat: Indexes on 6M rows consume 500MB+ per index
Query performance: 8-15 seconds to fetch features for 100 sessions
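There's a hidden cost beyond the join itself: the application has to pivot those thousands of tall rows back into one record per session on every read. A minimal sketch of that reshaping step (function name is illustrative):

```javascript
// Pivot tall (session_id, feature_name, feature_value) rows into one
// object per session -- the reshaping Option A forces on every read.
function pivotFeatures(rows) {
  const bySession = new Map();
  for (const { session_id, feature_name, feature_value } of rows) {
    if (!bySession.has(session_id)) bySession.set(session_id, { session_id });
    bySession.get(session_id)[feature_name] = feature_value;
  }
  return [...bySession.values()];
}
```

With 6,000 features per session, this loop runs 600,000 times for a 100-session fetch, on top of the database join.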
Option B: Wide Table (Recommended for Voice Analysis)
CREATE TABLE sessions (
id UUID PRIMARY KEY,
user_id UUID NOT NULL REFERENCES users(id),
started_at TIMESTAMP NOT NULL,
completed_at TIMESTAMP,
status VARCHAR(20) NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE acoustic_features (
session_id UUID PRIMARY KEY REFERENCES sessions(id),
-- Pitch features (F0)
f0_mean FLOAT,
f0_std FLOAT,
f0_min FLOAT,
f0_max FLOAT,
f0_range FLOAT,
-- Voice quality
jitter_local FLOAT,
shimmer_local FLOAT,
hnr_mean FLOAT,
-- Formants
f1_mean FLOAT,
f2_mean FLOAT,
f3_mean FLOAT,
-- MFCCs (1-13)
mfcc_1_mean FLOAT,
mfcc_2_mean FLOAT,
-- ... (88 total columns for eGeMAPS)
created_at TIMESTAMP DEFAULT NOW()
);
-- No extra index needed here: session_id is already the primary key, so
-- per-session feature lookups are index-backed. User-level queries filter
-- sessions on user_id (indexed in the next section) and join across.
Pros:
- Fast queries: Single row per session, no joins needed
- Small row count: 1 row per session → 1,000 rows for 1,000 sessions
- Efficient indexes: Indexes on 1,000 rows consume <5MB
- Predictable performance: Query time scales linearly with sessions, not features
Cons:
- Wide schema: 88+ columns (but PostgreSQL handles this well)
- Schema changes required: Adding new features needs ALTER TABLE (mitigated with JSONB for experimental features)
Query performance: 50-200ms to fetch features for 100 sessions (50-100x faster)
Hybrid Approach: Wide Table + JSONB for Flexibility
CREATE TABLE acoustic_features (
session_id UUID PRIMARY KEY REFERENCES sessions(id),
-- Core features (always present, indexed)
f0_mean FLOAT NOT NULL,
f0_std FLOAT NOT NULL,
jitter_local FLOAT NOT NULL,
shimmer_local FLOAT NOT NULL,
hnr_mean FLOAT NOT NULL,
-- Experimental/optional features (JSONB)
additional_features JSONB,
created_at TIMESTAMP DEFAULT NOW()
);
-- Index for JSONB queries
CREATE INDEX idx_additional_features ON acoustic_features
USING GIN (additional_features);
-- Insert example
INSERT INTO acoustic_features (session_id, f0_mean, f0_std, additional_features)
VALUES (
'uuid-here',
120.5,
15.3,
'{"tremor_ratio": 0.15, "formant_precision": 0.82}'::jsonb
);
-- Query JSONB (->> returns text, so cast before comparing numerically).
-- Note: the GIN index accelerates containment (@>) and key-existence
-- queries, not range predicates like this one.
SELECT * FROM acoustic_features
WHERE (additional_features->>'tremor_ratio')::float > 0.10;
Best of both worlds: Fast queries on core features, flexibility for experiments
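In application code, the split between core columns and the JSONB catch-all might look like this (a sketch; `CORE_FEATURES` mirrors the five indexed columns above):

```javascript
// Split a flat feature map into core columns vs the JSONB catch-all.
// CORE_FEATURES mirrors the indexed columns in the hybrid schema;
// everything else rides along in additional_features.
const CORE_FEATURES = new Set([
  'f0_mean', 'f0_std', 'jitter_local', 'shimmer_local', 'hnr_mean',
]);

function splitFeatures(all) {
  const core = {};
  const additional = {};
  for (const [name, value] of Object.entries(all)) {
    (CORE_FEATURES.has(name) ? core : additional)[name] = value;
  }
  return { core, additional };
}
```

Promoting an experimental feature to a first-class column is then just an ALTER TABLE plus adding its name to the set.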
2. Indexing Strategies
Index Types for Voice Data
B-tree Indexes (Default, Most Common)
-- Single column (user lookups)
CREATE INDEX idx_sessions_user ON sessions(user_id);
-- Composite (common query pattern: user + date range)
CREATE INDEX idx_sessions_user_date ON sessions(user_id, started_at DESC);
-- Partial index (only index completed sessions)
CREATE INDEX idx_completed_sessions ON sessions(user_id, started_at)
WHERE status = 'completed';
Use case: Equality and range queries (WHERE user_id = 'X', WHERE started_at > '2025-01-01')
Performance: O(log n) lookups, efficient for high-cardinality columns
GIN Indexes (JSONB, Arrays, Full-Text Search)
-- JSONB features
CREATE INDEX idx_features_gin ON acoustic_features
USING GIN (additional_features);
-- Array of tags
CREATE INDEX idx_session_tags ON sessions
USING GIN (tags);
-- Full-text search on transcripts
CREATE INDEX idx_transcript_fts ON sessions
USING GIN (to_tsvector('english', transcript));
Use case: JSONB queries, array containment, full-text search
Performance: Slower inserts (GIN indexes are large), but fast searches
BRIN Indexes (Time-Series Data)
-- Very efficient for time-ordered data
CREATE INDEX idx_sessions_created_brin ON sessions
USING BRIN (created_at);
-- Tiny index size (1-2% of B-tree), fast range queries
Use case: Large tables with naturally ordered data (timestamps, IDs)
Performance: 100x smaller than B-tree, slightly slower queries (acceptable trade-off for analytics)
Index Maintenance
-- Find unused indexes (candidates for removal)
SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan AS index_scans,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
    AND indexrelname NOT LIKE '%_pkey'
ORDER BY pg_relation_size(indexrelid) DESC;
-- Rebuild bloated indexes after heavy updates/deletes (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_sessions_user;
-- Analyze tables (update statistics for query planner)
ANALYZE acoustic_features;
3. Partitioning for Large Datasets
When to partition: >10M rows, time-based queries common (e.g., "last 30 days")
Time-Based Partitioning (Most Common)
-- Parent table (empty, defines schema)
CREATE TABLE sessions (
id UUID NOT NULL,
user_id UUID NOT NULL,
started_at TIMESTAMP NOT NULL,
status VARCHAR(20) NOT NULL,
PRIMARY KEY (id, started_at) -- started_at must be in primary key
) PARTITION BY RANGE (started_at);
-- Monthly partitions
CREATE TABLE sessions_2025_01 PARTITION OF sessions
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE sessions_2025_02 PARTITION OF sessions
FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
-- Auto-create partitions with the pg_partman extension (4.x syntax shown)
CREATE EXTENSION IF NOT EXISTS pg_partman;
SELECT partman.create_parent(
p_parent_table => 'public.sessions',
p_control => 'started_at',
p_type => 'native',
p_interval => 'monthly',
p_premake => 3 -- Create 3 months ahead
);
Benefits:
- Query pruning: Query last 7 days only scans 1 partition (not all 12 months)
- Faster maintenance: Drop old partitions instantly (vs slow DELETE)
- Parallel scans: PostgreSQL scans partitions in parallel
Query performance: 10-50x faster for time-range queries
Automated Partition Management
-- Daily cron job: create future partitions, drop old ones
SELECT partman.run_maintenance('public.sessions');
-- Configuration
UPDATE partman.part_config
SET retention_keep_table = false, -- Drop old partitions (don't just detach)
retention = '365 days' -- Keep 1 year of data
WHERE parent_table = 'public.sessions';
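If pg_partman isn't available, a small script can emit the monthly DDL instead. A hedged sketch (helper name is mine; run the output on a schedule a few months ahead of time):

```javascript
// Generate CREATE TABLE ... PARTITION OF statements for the next N months --
// a plain-SQL fallback when pg_partman is not installed.
function monthlyPartitionDDL(parent, startYear, startMonth, count) {
  const iso = (d) => d.toISOString().slice(0, 10);
  const stmts = [];
  for (let i = 0; i < count; i++) {
    // Date.UTC handles year rollover (month 13 -> January of next year)
    const from = new Date(Date.UTC(startYear, startMonth - 1 + i, 1));
    const to = new Date(Date.UTC(startYear, startMonth + i, 1));
    const name = `${parent}_${from.getUTCFullYear()}_` +
      String(from.getUTCMonth() + 1).padStart(2, '0');
    stmts.push(
      `CREATE TABLE ${name} PARTITION OF ${parent}\n` +
      `  FOR VALUES FROM ('${iso(from)}') TO ('${iso(to)}');`
    );
  }
  return stmts;
}
```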
4. Storage Optimization
Audio File Storage: Database vs Object Storage
Option A: Store in Database (NOT RECOMMENDED)
CREATE TABLE audio_files (
session_id UUID PRIMARY KEY,
audio_data BYTEA NOT NULL, -- 1-10 MB per file
created_at TIMESTAMP DEFAULT NOW()
);
-- Problems:
-- - Database size explodes (10 MB × 1M sessions = 10 TB)
-- - Slow backups (must backup 10 TB database)
-- - Expensive (database storage costs 10x more than object storage)
Option B: Object Storage + Database Metadata (RECOMMENDED)
CREATE TABLE audio_files (
session_id UUID PRIMARY KEY,
storage_path VARCHAR(500) NOT NULL, -- 's3://bucket/recordings/2025/01/15/uuid.ogg'
file_size_bytes BIGINT NOT NULL,
duration_seconds FLOAT NOT NULL,
sample_rate INTEGER NOT NULL,
format VARCHAR(10) NOT NULL, -- 'ogg', 'wav', 'mp3'
created_at TIMESTAMP DEFAULT NOW()
);
-- Storage costs:
-- - S3: $0.023/GB/month = $230/month for 10 TB
-- - RDS: $0.115/GB/month = $1,150/month for 10 TB
-- - 5x savings
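The `storage_path` convention above (bucket/recordings/year/month/day/id) is easy to generate deterministically, which keeps paths reconstructable from metadata alone. A sketch, with the bucket name as a placeholder:

```javascript
// Build the object-storage key for a recording, following the
// date-sharded layout shown in storage_path (bucket name is illustrative).
function storagePath(bucket, sessionId, recordedAt, format) {
  const y = recordedAt.getUTCFullYear();
  const m = String(recordedAt.getUTCMonth() + 1).padStart(2, '0');
  const d = String(recordedAt.getUTCDate()).padStart(2, '0');
  return `s3://${bucket}/recordings/${y}/${m}/${d}/${sessionId}.${format}`;
}
```

Date-sharded prefixes also make lifecycle rules trivial: expire or archive whole `recordings/2024/...` prefixes instead of listing objects.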
Feature Storage: FLOAT vs REAL
-- FLOAT (8 bytes, double precision)
f0_mean FLOAT -- -1.7976931348623157E+308 to 1.7976931348623157E+308
-- REAL (4 bytes, single precision)
f0_mean REAL -- -3.40282347E+38 to 3.40282347E+38
-- For acoustic features, REAL is sufficient:
-- - F0 typically 50-500 Hz (needs ~3 decimal places)
-- - Saves 50% storage (4 bytes vs 8 bytes × 6,000 features = 24 KB vs 48 KB per session)
Use REAL for acoustic features: Precision is adequate, saves 50% storage
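Why is single precision enough? `Math.fround` rounds a JS double to float32, the same precision loss REAL introduces, so you can check the error directly:

```javascript
// REAL keeps ~7 significant digits. Math.fround rounds a JS number to
// float32 -- the same rounding REAL would apply on insert.
const f0 = 234.56789; // an F0 value in Hz
const asReal = Math.fround(f0);
const roundingError = Math.abs(asReal - f0);
// roundingError is well under 0.001 Hz -- far below pitch-tracking
// measurement noise, so nothing meaningful is lost
```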
Compression with TOAST
PostgreSQL automatically compresses large columns (>2KB) using TOAST (The Oversized-Attribute Storage Technique).
-- EXTENDED (the default for JSONB) allows both compression and
-- out-of-line storage for large values
ALTER TABLE acoustic_features
ALTER COLUMN additional_features SET STORAGE EXTENDED;
-- Check TOAST statistics
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS table_size,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename) -
pg_relation_size(schemaname||'.'||tablename)) AS toast_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
5. Query Optimization
Common Query: Get User's Recent Sessions with Features
Slow query (no indexes, cross-partition scan):
SELECT
s.id,
s.started_at,
af.f0_mean,
af.f0_std,
p.age_prediction,
p.gender_prediction
FROM sessions s
LEFT JOIN acoustic_features af ON s.id = af.session_id
LEFT JOIN predictions p ON s.id = p.session_id
WHERE s.user_id = 'user-uuid'
ORDER BY s.started_at DESC
LIMIT 10;
-- Execution time: 8-12 seconds (scans all partitions)
Optimized query:
-- 1. Add composite index
CREATE INDEX idx_sessions_user_date ON sessions(user_id, started_at DESC);
-- 2. Constrain time range (partition pruning)
SELECT
s.id,
s.started_at,
af.f0_mean,
af.f0_std,
p.age_prediction,
p.gender_prediction
FROM sessions s
LEFT JOIN acoustic_features af ON s.id = af.session_id
LEFT JOIN predictions p ON s.id = p.session_id
WHERE s.user_id = 'user-uuid'
AND s.started_at > NOW() - INTERVAL '90 days' -- Prune to last 3 months
ORDER BY s.started_at DESC
LIMIT 10;
-- Execution time: 50-150ms (50-100x faster)
Using EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS)
SELECT ...;
-- Output shows:
-- - Index usage (Seq Scan = bad, Index Scan = good)
-- - Partition pruning (which partitions scanned)
-- - Join strategy (Hash Join, Nested Loop, Merge Join)
-- - Execution time breakdown
Red flags to look for:
- Seq Scan on large table: Add index on WHERE clause columns
- Hash Join on millions of rows: Optimize join keys or add indexes
- High "Rows Removed by Filter" counts: The predicate isn't index-backed; add or extend an index to cover it
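These red flags are mechanical enough to scan for in CI or a monitoring job. A toy scanner over EXPLAIN ANALYZE text output (patterns are illustrative, not exhaustive):

```javascript
// Scan EXPLAIN ANALYZE text output for the red flags listed above.
function planRedFlags(planText) {
  const flags = [];
  if (/Seq Scan/.test(planText)) {
    flags.push('sequential scan -- check for a missing index');
  }
  if (/Rows Removed by Filter: \d{4,}/.test(planText)) {
    flags.push('large post-scan filter -- push the predicate into an index');
  }
  return flags;
}
```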
Materialized Views for Analytics
Problem: Dashboard query aggregates 1M sessions every page load (5-10 seconds)
-- Slow: Run every page load
SELECT
DATE_TRUNC('day', started_at) AS day,
COUNT(*) AS session_count,
AVG(f0_mean) AS avg_pitch,
AVG(age_prediction) AS avg_age
FROM sessions s
JOIN acoustic_features af ON s.id = af.session_id
JOIN predictions p ON s.id = p.session_id
WHERE started_at > NOW() - INTERVAL '30 days'
GROUP BY DATE_TRUNC('day', started_at)
ORDER BY day DESC;
Solution: Materialized view (pre-computed aggregates)
-- Create materialized view
CREATE MATERIALIZED VIEW daily_stats AS
SELECT
DATE_TRUNC('day', started_at) AS day,
COUNT(*) AS session_count,
AVG(f0_mean) AS avg_pitch,
AVG(age_prediction) AS avg_age
FROM sessions s
JOIN acoustic_features af ON s.id = af.session_id
JOIN predictions p ON s.id = p.session_id
GROUP BY DATE_TRUNC('day', started_at);
-- Index for fast queries
CREATE UNIQUE INDEX idx_daily_stats_day ON daily_stats(day DESC);
-- Query materialized view (instant)
SELECT * FROM daily_stats
WHERE day > NOW() - INTERVAL '30 days'
ORDER BY day DESC;
-- Execution time: 5-10ms (1,000x faster)
-- Refresh daily (cron job)
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_stats;
6. Connection Pooling
Problem: Each API request opens new database connection (slow, resource-intensive)
Solution: Connection pooler (PgBouncer, AWS RDS Proxy)
; PgBouncer configuration (pgbouncer.ini) -- note PgBouncer does not strip
; inline comments, so they go on their own lines
[databases]
voice_analysis = host=localhost port=5432 dbname=voice_analysis
[pgbouncer]
listen_port = 6432
listen_addr = *
auth_type = md5
; transaction pooling is fastest for stateless API requests
pool_mode = transaction
; the API can accept 1,000 concurrent clients...
max_client_conn = 1000
; ...backed by only 20 actual database connections
default_pool_size = 20
; extra connections for spikes
reserve_pool_size = 5
Performance:
- Without pooling: 50 requests/sec (connection overhead = 20-30ms per request)
- With pooling: 500+ requests/sec (connection reuse = <1ms overhead)
Application-Level Pooling (Node.js)
import { Pool } from 'pg';
const pool = new Pool({
host: 'localhost',
port: 5432,
database: 'voice_analysis',
user: 'postgres',
password: 'password',
max: 20, // Maximum connections in pool
idleTimeoutMillis: 30000, // Close idle connections after 30s
connectionTimeoutMillis: 2000, // Fail if can't connect within 2s
});
// Reuse connections
async function getSession(sessionId) {
const client = await pool.connect(); // Reuses existing connection
try {
const result = await client.query(
'SELECT * FROM sessions WHERE id = $1',
[sessionId]
);
return result.rows[0];
} finally {
client.release(); // Return to pool (doesn't close)
}
}
7. Batch Inserts for Feature Data
Problem: Inserting 6,000 features one-by-one = 6,000 round-trips (slow)
Slow approach (one round-trip per feature):
// 6,000 individual INSERTs
for (const feature of features) {
  await db.query(
    'INSERT INTO acoustic_features (session_id, feature_name, feature_value) VALUES ($1, $2, $3)',
    [sessionId, feature.name, feature.value]
  );
}
// Time: 5-10 seconds
Fast approach (single multi-row INSERT):
// Single INSERT with 88 values (wide table)
await db.query(`
INSERT INTO acoustic_features (
session_id, f0_mean, f0_std, jitter_local, shimmer_local, ...
) VALUES ($1, $2, $3, $4, $5, ...)
`, [sessionId, features.f0_mean, features.f0_std, ...]);
// Time: 10-50ms (100-500x faster)
For a normalized schema, batch all rows into one parameterized INSERT (string-interpolating values into SQL invites injection):
// Batch insert 6,000 rows in a single round-trip
const params = [];
const placeholders = features
  .map((f, i) => {
    params.push(sessionId, f.name, f.value);
    return `($${3 * i + 1}, $${3 * i + 2}, $${3 * i + 3})`;
  })
  .join(', ');
await db.query(
  `INSERT INTO acoustic_features (session_id, feature_name, feature_value)
   VALUES ${placeholders}`,
  params
);
// Time: 200-500ms (10-50x faster than individual INSERTs)
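One caveat with multi-row INSERTs: PostgreSQL's wire protocol allows at most 65,535 bind parameters per statement, so 6,000 rows × 3 columns fits comfortably, but larger batches need chunking. A sketch (helper name is mine):

```javascript
// Split rows into chunks small enough to stay under PostgreSQL's
// 65,535 bind-parameter limit for a single parameterized statement.
function chunkForInsert(rows, paramsPerRow, maxParams = 65535) {
  const rowsPerChunk = Math.floor(maxParams / paramsPerRow);
  const chunks = [];
  for (let i = 0; i < rows.length; i += rowsPerChunk) {
    chunks.push(rows.slice(i, i + rowsPerChunk));
  }
  return chunks;
}
```

Each chunk then becomes one multi-row INSERT; wrap the chunks in a transaction if partial writes would leave a session half-loaded.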
8. Monitoring & Alerting
Key Metrics to Track
-- Slow queries (>1 second). Requires the pg_stat_statements extension,
-- preloaded via shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
SELECT
    query,
    calls,
    total_exec_time,  -- total_time on PostgreSQL 12 and earlier
    mean_exec_time,   -- mean_time on PostgreSQL 12 and earlier
    max_exec_time     -- max_time on PostgreSQL 12 and earlier
FROM pg_stat_statements
WHERE mean_exec_time > 1000 -- 1 second
ORDER BY total_exec_time DESC
LIMIT 20;
-- Table sizes
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
-- Index hit ratio (should be >95%)
SELECT
sum(idx_blks_hit) / NULLIF(sum(idx_blks_hit + idx_blks_read), 0) AS index_hit_rate
FROM pg_statio_user_indexes;
-- Cache hit ratio (should be >99%)
SELECT
sum(heap_blks_hit) / NULLIF(sum(heap_blks_hit + heap_blks_read), 0) AS cache_hit_rate
FROM pg_statio_user_tables;
Alert Thresholds
- Slow query: >5 seconds → Investigate missing indexes
- Index hit rate: <95% → Add more RAM or optimize queries
- Cache hit rate: <99% → Increase shared_buffers
- Connection count: >80% of max → Scale database or add pooling
- Disk usage: >80% → Archive old data or add storage
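These thresholds are simple enough to encode directly in a monitoring job. A sketch, with the metric field names as assumptions to wire up to your own monitoring source:

```javascript
// Evaluate the alert thresholds above against a metrics snapshot.
// Field names are illustrative; populate them from your monitoring stack.
function checkAlerts(m) {
  const alerts = [];
  if (m.slowestQuerySeconds > 5) alerts.push('slow query: check for missing indexes');
  if (m.indexHitRate < 0.95) alerts.push('index hit rate low: add RAM or optimize queries');
  if (m.cacheHitRate < 0.99) alerts.push('cache hit rate low: increase shared_buffers');
  if (m.connections / m.maxConnections > 0.8) alerts.push('connection pressure: scale or pool');
  if (m.diskUsedFraction > 0.8) alerts.push('disk filling: archive old data or add storage');
  return alerts;
}
```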
9. Backup & Disaster Recovery
Logical Backups (pg_dump)
# Full database backup
pg_dump -h localhost -U postgres -Fc voice_analysis > backup_$(date +%Y%m%d).dump
# Restore
pg_restore -h localhost -U postgres -d voice_analysis backup_20250127.dump
# Incremental backups (WAL archiving)
# postgresql.conf:
archive_mode = on
archive_command = 'cp %p /backups/wal/%f'
Point-in-Time Recovery (PITR)
# Base backup + WAL logs = restore to any point in time
pg_basebackup -h localhost -D /backups/base -U postgres -Fp -Xs -P
# Restore to 2025-01-27 14:30:00: set in postgresql.conf (PostgreSQL 12+)
# and create an empty recovery.signal file in the data directory
recovery_target_time = '2025-01-27 14:30:00'
10. Production Configuration Tuning
PostgreSQL Settings for Voice Analysis Workload
# postgresql.conf
# Memory (for 16GB server)
shared_buffers = 4GB # 25% of RAM
effective_cache_size = 12GB # 75% of RAM
work_mem = 64MB # Per-query memory (for sorts/joins)
maintenance_work_mem = 1GB # For VACUUM, CREATE INDEX
# Connections
max_connections = 200 # Total connections (use pooler for more)
# Query planner
random_page_cost = 1.1 # SSD (default 4.0 is for HDD)
effective_io_concurrency = 200 # SSD parallel I/O
# Write-ahead log
wal_buffers = 16MB
checkpoint_completion_target = 0.9
max_wal_size = 4GB
# Autovacuum (critical for performance)
autovacuum = on
autovacuum_max_workers = 4
autovacuum_naptime = 10s # Check for vacuum every 10 seconds
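The memory settings scale linearly with RAM, so they're easy to derive programmatically, e.g. when templating configs for different instance sizes. Starting points only, not a tuning oracle:

```javascript
// Derive the 25%/75% memory settings above from total RAM (in GB).
// These are conventional starting points -- tune against real workloads.
function memorySettings(ramGb) {
  return {
    shared_buffers: `${Math.round(ramGb * 0.25)}GB`,
    effective_cache_size: `${Math.round(ramGb * 0.75)}GB`,
  };
}
```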
Vacuum Strategy
-- Analyze query planner statistics (daily)
ANALYZE;
-- Vacuum to reclaim space (weekly)
VACUUM ANALYZE;
-- Aggressive vacuum (monthly, during low-traffic window)
VACUUM FULL; -- Rewrites table, requires exclusive lock
-- Check for bloat
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
n_dead_tup AS dead_rows
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
The Bottom Line: Database Optimization Checklist
For production voice analysis systems:
- Schema: Use wide table (88 columns) for acoustic features, JSONB for experimental features
- Indexes:
- B-tree on user_id, (user_id, started_at DESC)
- BRIN on created_at (for large tables)
- GIN on JSONB columns
- Partitioning: Monthly partitions for sessions (>10M rows)
- Storage:
- Audio files: S3/object storage (not database)
- Features: REAL (4 bytes) instead of FLOAT (8 bytes)
- Queries:
- Always add time range for partition pruning
- Use materialized views for dashboard aggregates
- Monitor with EXPLAIN ANALYZE
- Connections: Use PgBouncer (20 DB connections for 1,000 API clients)
- Inserts: Single multi-column INSERT (not 6,000 individual INSERTs)
- Monitoring: Track slow queries, index hit rate, cache hit rate
Expected performance:
- Insert features: 10-50ms (was 5-10 seconds)
- Fetch user sessions: 50-200ms (was 8-15 seconds)
- Dashboard aggregates: 5-10ms via materialized view (was 5-10 seconds)
- Storage costs: $500/month for 1M sessions (was $5,000/month)
Voice Mirror's database architecture uses PostgreSQL with wide-table schema (88-column acoustic_features), monthly partitioning, BRIN indexes on timestamps, and S3 for audio storage. Our optimized schema handles 10,000 sessions/day with 50-200ms query times and $0.50/session storage costs.