Feature Engineering Best Practices: Turning Voice Data Into Predictive Gold

In machine learning, features often matter more than algorithms. A Random Forest trained on well-engineered features will often outperform a sophisticated deep learning model trained on raw or poorly processed features, especially when data is limited.

In voice analysis, feature engineering is the art of transforming raw acoustic measurements (pitch, energy, MFCCs) into meaningful predictors that capture the nuances of age, emotion, health, and personality.

This guide shares battle-tested practices from production voice analysis systems—lessons learned from 100+ research papers and real-world deployments.

The Feature Engineering Pipeline

Raw Audio → Acoustic Features → Normalization → Derived Features → Feature Selection → Model Input

5 stages:

Acoustic extraction: openSMILE, librosa, or Wav2vec 2.0 embeddings
Normalization: Speaker-independent scaling, z-score, or per-speaker normalization
Derived features: Ratios, deltas, domain-specific combinations
Feature selection: Reduce 6,000+ features to 50-500 most informative
Validation: Cross-validation, outlier detection, missing value handling

Let's dive into each stage with practical examples.

1. Acoustic Feature Extraction: Choosing Your Base

Your choice of base features determines the ceiling of your model's performance.

Option A: Handcrafted Acoustic Features (openSMILE)

Best for: Classical ML, interpretability, limited data (<5,000 samples)

import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

features = smile.process_file('audio.wav')  # (1, 88)

Pros:

Interpretable: Each feature has clear acoustic meaning (pitch mean, jitter, formants)
Fast: CPU-only, 10-minute audio in <1 second
Validated: eGeMAPS used in 200+ papers with established benchmarks
Low sample requirement: 100-1,000 samples sufficient for Random Forest

Cons:

Limited ceiling: 2-5% lower accuracy than deep learning on large datasets
Manual engineering required: Need domain knowledge to add task-specific features

Option B: Deep Learning Embeddings (Wav2vec 2.0)

Best for: Large datasets (>5,000 samples), SOTA accuracy requirements

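import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# The feature extractor only prepares raw audio; no tokenizer is needed for embeddings
processor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

# Load audio (the model expects 16 kHz mono)
audio, sr = librosa.load('audio.wav', sr=16000)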

# Extract embeddings
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state  # (1, time_steps, 768)

# Pool across time (mean pooling)
features = embeddings.mean(dim=1).squeeze().numpy()  # (768,)

Pros:

SOTA performance: 2-5% higher accuracy than handcrafted features
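No manual engineering: Pre-trained on unlabeled speech (960 hours of LibriSpeech for the wav2vec2-base checkpoint above; larger checkpoints use tens of thousands of hours), learns generic speech representations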
Transfer learning: Fine-tune on your task with 500-2,000 samples

Cons:

Black box: 768-dimensional embeddings lack interpretability
GPU recommended: 10-100ms inference latency on GPU; CPU inference works but is slower
Larger models: 300MB+ model size vs 50MB for openSMILE

Option C: Hybrid Approach (Recommended)

Best for: Production systems balancing accuracy and interpretability

import numpy as np

# Extract both types
# `smile` is the openSMILE extractor from Option A; `extract_wav2vec_embeddings`
# is a helper that wraps the Option B code and returns the pooled (768,) vector
opensmile_features = smile.process_file('audio.wav')  # (1, 88)
wav2vec_features = extract_wav2vec_embeddings('audio.wav')  # (768,)

# Concatenate
combined = np.concatenate([
    opensmile_features.values.flatten(),
    wav2vec_features
])  # (856,)

# Train ensemble
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Model 1: Classical ML on openSMILE
model_classical = RandomForestClassifier(n_estimators=200)
model_classical.fit(X_opensmile_train, y_train)

# Model 2: Logistic Regression on Wav2vec embeddings
model_deep = LogisticRegression(max_iter=1000)
model_deep.fit(X_wav2vec_train, y_train)

# Ensemble: weighted soft voting (late fusion).
# VotingClassifier refits every estimator on one shared feature matrix, so it
# cannot combine models trained on different feature sets. Average the
# predicted class probabilities directly instead.
weights = [0.4, 0.6]  # Favor deep learning slightly
proba = (
    weights[0] * model_classical.predict_proba(X_opensmile_test)
    + weights[1] * model_deep.predict_proba(X_wav2vec_test)
)
predictions = model_classical.classes_[proba.argmax(axis=1)]

Result: 1-3% accuracy boost over either approach alone, plus interpretable fallback.

2. Speaker Normalization: Accounting for Individual Differences

The problem: Male speakers have F0 ~100 Hz, female speakers ~200 Hz. Raw pitch predicts gender with 98% accuracy—but obscures other signals (emotion, health, age).

Solution: Normalize features to remove speaker-specific characteristics.

Technique 1: Z-Score Normalization (Per-Speaker)

When to use: You have multiple recordings per speaker (longitudinal data, conversation turns)

import numpy as np

def zscore_normalize_per_speaker(features, speaker_ids):
    """
    Normalize features per speaker: (x - speaker_mean) / speaker_std

    Args:
        features: (n_samples, n_features)
        speaker_ids: (n_samples,) - speaker ID for each sample

    Returns:
        normalized_features: (n_samples, n_features)
    """
    normalized = np.zeros_like(features)

    for speaker_id in np.unique(speaker_ids):
        speaker_mask = speaker_ids == speaker_id
        speaker_features = features[speaker_mask]

        # Compute per-speaker mean and std
        mean = speaker_features.mean(axis=0)
        std = speaker_features.std(axis=0) + 1e-8  # Avoid division by zero

        # Normalize
        normalized[speaker_mask] = (speaker_features - mean) / std

    return normalized

Example: Depression detection from conversation

# Speaker A: 5 turns, Speaker B: 5 turns
features = np.array([...])  # (10, 88)
speaker_ids = np.array(['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'])

normalized = zscore_normalize_per_speaker(features, speaker_ids)

# Now F0 mean is relative to each speaker's baseline
# Speaker A high F0 turn: +1.5 std above their mean
# Speaker B high F0 turn: +1.5 std above their mean
# Both are comparable despite different absolute F0

Research validation: Improves depression detection accuracy from 68% → 74% (AVEC 2014, Valstar et al.)

Technique 2: Percentile Normalization (Population-Based)

When to use: Single recording per speaker, large training set (>1,000 samples)

from sklearn.preprocessing import QuantileTransformer

# Fit on training data
qt = QuantileTransformer(output_distribution='normal', n_quantiles=1000)
qt.fit(X_train)

# Transform train and test
X_train_norm = qt.transform(X_train)
X_test_norm = qt.transform(X_test)

How it works: Maps feature distributions to standard normal (mean=0, std=1) using empirical quantiles.

Advantage over z-score: Robust to outliers (extreme pitch values don't skew normalization)
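
The robustness claim is easy to check on synthetic data. The sketch below uses made-up pitch values (not real measurements) with one corrupted reading, then compares how much spread the clean samples keep after each transform. The variable names and the 5000 Hz outlier are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer, StandardScaler

rng = np.random.default_rng(0)

# Synthetic F0 values in Hz (illustrative only) plus one corrupted reading
f0 = rng.normal(loc=150.0, scale=20.0, size=500)
f0[0] = 5000.0  # e.g. a pitch-tracking error
X = f0.reshape(-1, 1)

z = StandardScaler().fit_transform(X).ravel()
q = QuantileTransformer(
    output_distribution='normal', n_quantiles=500
).fit_transform(X).ravel()

# Spread of the 499 clean samples after each transform
clean_std_z = z[1:].std()
clean_std_q = q[1:].std()
print(f"z-score spread of clean samples:  {clean_std_z:.2f}")
print(f"quantile spread of clean samples: {clean_std_q:.2f}")
```

With z-scoring, the single outlier inflates the standard deviation and squeezes every clean sample toward zero; the quantile transform keeps the clean samples spread at roughly unit scale.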

Technique 3: Gender-Specific Normalization

When to use: Gender information available, analyzing gender-independent traits (emotion, personality)

def normalize_by_gender(features, genders):
    """
    Normalize features separately for male/female speakers
    """
    normalized = np.zeros_like(features)

    for gender in ['male', 'female']:
        gender_mask = genders == gender
        gender_features = features[gender_mask]

        # Z-score normalize within gender
        mean = gender_features.mean(axis=0)
        std = gender_features.std(axis=0) + 1e-8
        normalized[gender_mask] = (gender_features - mean) / std

    return normalized

Research validation: Emotion recognition accuracy 78% → 83% (gender-specific norms remove gender confound, IEMOCAP dataset)

Technique 4: Formant Position Normalization

Specific to formants (F1, F2, F3): Account for vocal tract length differences

def lobanov_normalize_formants(f1, f2, f3):
    """
    Lobanov normalization: z-score per formant, per speaker

    Pass the formant values of ONE speaker; call once per speaker.
    Standard in phonetics research (Lobanov, 1971)
    """
    # Stack formants
    formants = np.column_stack([f1, f2, f3])  # (n_samples, 3)

    # Z-score per formant dimension
    mean = formants.mean(axis=0)
    std = formants.std(axis=0)
    normalized = (formants - mean) / std

    return normalized[:, 0], normalized[:, 1], normalized[:, 2]

# Or use Bark scale (psychoacoustic)
def hz_to_bark(f_hz):
    """Convert Hz to Bark scale (perceptual frequency)"""
    return 26.81 * f_hz / (1960 + f_hz) - 0.53

f1_bark = hz_to_bark(f1_hz)
f2_bark = hz_to_bark(f2_hz)
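
Lobanov normalization is defined per speaker, so with several speakers in one table the z-scoring has to happen inside each speaker group. A minimal sketch with pandas follows; the column names (`speaker_id`, `F1`, `F2`, `F3`) and the formant values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def lobanov_by_speaker(df, speaker_col='speaker_id', formant_cols=('F1', 'F2', 'F3')):
    """Z-score each formant within each speaker (Lobanov normalization)."""
    cols = list(formant_cols)
    grouped = df.groupby(speaker_col)[cols]
    mean = grouped.transform('mean')
    std = grouped.transform(lambda s: s.std(ddof=0))
    out = df.copy()
    out[cols] = (df[cols] - mean) / (std + 1e-8)  # epsilon guards constant columns
    return out

# Made-up formant values (Hz) for two speakers with different vocal tract lengths
df = pd.DataFrame({
    'speaker_id': ['A', 'A', 'A', 'B', 'B', 'B'],
    'F1': [300.0, 500.0, 700.0, 400.0, 600.0, 800.0],
    'F2': [2200.0, 1500.0, 1100.0, 2500.0, 1700.0, 1200.0],
    'F3': [2900.0, 2500.0, 2400.0, 3200.0, 2800.0, 2600.0],
})
normalized = lobanov_by_speaker(df)
print(normalized.round(3))
```

Speaker B's F1 values are speaker A's shifted up by 100 Hz, so after normalization the two speakers get identical F1 scores.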

3. Temporal Features: Capturing Change Over Time

Static features (mean F0, mean energy) miss dynamic patterns: How does pitch change? How fast does energy rise?

Delta Features (First-Order Derivatives)

What they capture: Rate of change

import numpy as np

def compute_delta_features(features, window=2):
    """
    Compute delta (first derivative) of features

    Args:
        features: (time_steps, n_features) - frame-by-frame features
        window: Number of frames for derivative computation

    Returns:
        delta: (time_steps, n_features)
    """
    delta = np.zeros_like(features)

    for t in range(window, len(features) - window):
        # Regression slope over 2*window+1 frames
        numerator = sum(i * (features[t + i] - features[t - i])
                       for i in range(1, window + 1))
        denominator = 2 * sum(i**2 for i in range(1, window + 1))
        delta[t] = numerator / denominator

    return delta

# Extract frame-level features
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)
llds = smile.process_file('audio.wav')  # (time_steps, 25)

# Compute deltas
delta = compute_delta_features(llds.values)
delta_delta = compute_delta_features(delta)  # Second derivative (acceleration)

# Concatenate
features_with_dynamics = np.concatenate([
    llds.values,
    delta,
    delta_delta
], axis=1)  # (time_steps, 75) - 3x feature count

Use cases:

Emotion: Rising pitch (delta > 0) indicates excitement/stress
Parkinson's: Reduced pitch variation (low delta std) indicates vocal rigidity
Depression: Monotone speech (low delta range)
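
The use cases above rely on utterance-level summaries of the delta track (its standard deviation and range) rather than on the frame-level deltas themselves. A minimal sketch, using `np.gradient` as a simple stand-in for the regression-based delta and synthetic pitch contours; the function name and the 10 ms frame step are assumptions.

```python
import numpy as np

def delta_summary(track, frame_step_s=0.01):
    """Summarize the frame-to-frame change of one feature track."""
    track = np.asarray(track, dtype=float)
    delta = np.gradient(track) / frame_step_s  # change per second
    return {
        'delta_mean': float(delta.mean()),
        'delta_std': float(delta.std()),
        'delta_range': float(delta.max() - delta.min()),
    }

t = np.arange(200) * 0.01                                # 2 seconds at 10 ms per frame
monotone = np.full(200, 30.0)                            # flat pitch (semitones)
expressive = 30.0 + 3.0 * np.sin(2 * np.pi * 1.5 * t)    # pitch moving at 1.5 Hz

monotone_stats = delta_summary(monotone)
expressive_stats = delta_summary(expressive)
print(monotone_stats)
print(expressive_stats)
```

The flat contour yields zero delta spread, while the moving contour yields a large one, which is the contrast the monotone-speech markers depend on.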

Slope Features (Linear Regression)

What they capture: Overall trend across utterance

from scipy.stats import linregress

def compute_slope_features(features):
    """
    Compute linear regression slope for each feature over time

    Returns:
        slopes: (n_features,) - slope per feature
        intercepts: (n_features,) - intercept per feature
    """
    n_features = features.shape[1]
    slopes = np.zeros(n_features)
    intercepts = np.zeros(n_features)

    time = np.arange(len(features))

    for i in range(n_features):
        result = linregress(time, features[:, i])
        slopes[i] = result.slope
        intercepts[i] = result.intercept

    return slopes, intercepts

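# Example: F0 slope (eGeMAPS reports F0 = 0 for unvoiced frames, so fit voiced frames only)
f0 = llds['F0semitoneFrom27.5Hz_sma3nz'].values
voiced = f0 > 0
# linregress returns slope, intercept, rvalue, pvalue, stderr; read fields by name
result = linregress(np.arange(len(f0))[voiced], f0[voiced])
slope, intercept = result.slope, result.intercept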

if slope > 0:
    print("Rising pitch over utterance (excitement, question intonation)")
else:
    print("Falling pitch over utterance (statement, fatigue)")
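
When only slopes and intercepts are needed, the per-feature loop can be replaced by a single least-squares fit over all columns. A sketch using `np.polyfit`, which accepts a 2-D `y` and fits each column independently; the function name and the demo data are illustrative.

```python
import numpy as np

def compute_slopes_vectorized(features):
    """Least-squares slope and intercept per column, in one polyfit call."""
    features = np.asarray(features, dtype=float)
    time = np.arange(features.shape[0])
    # For 2-D y, polyfit returns an array of shape (deg + 1, n_features)
    slopes, intercepts = np.polyfit(time, features, deg=1)
    return slopes, intercepts

frames = np.arange(100)
demo = np.column_stack([0.5 * frames + 2.0, -0.1 * frames + 7.0])
slopes, intercepts = compute_slopes_vectorized(demo)
print(np.round(slopes, 3), np.round(intercepts, 3))
```

Unlike `linregress`, this does not return correlation coefficients or p-values, so it fits best when the slope itself is the feature.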

Turn-Taking Features (Conversation Dynamics)

For multi-speaker recordings (interviews, conversations)

def extract_turn_taking_features(speaker_segments):
    """
    Args:
        speaker_segments: List of (start_time, end_time, speaker_id) tuples

    Returns:
        features: Dict of turn-taking metrics
    """
    turns = []
    for i, (start, end, speaker) in enumerate(speaker_segments):
        duration = end - start

        # Gap/overlap with next turn
        if i < len(speaker_segments) - 1:
            next_start = speaker_segments[i + 1][0]
            gap = next_start - end  # Positive = pause, negative = overlap
        else:
            gap = 0

        turns.append({
            'speaker': speaker,
            'duration': duration,
            'gap': gap,
        })

    # Aggregate features
    features = {}

    for speaker in set(t['speaker'] for t in turns):
        speaker_turns = [t for t in turns if t['speaker'] == speaker]

        features[f'{speaker}_turn_count'] = len(speaker_turns)
        features[f'{speaker}_avg_turn_duration'] = np.mean([t['duration'] for t in speaker_turns])
        features[f'{speaker}_talk_time_ratio'] = sum(t['duration'] for t in speaker_turns) / sum(t['duration'] for t in turns)
        features[f'{speaker}_avg_gap'] = np.mean([t['gap'] for t in speaker_turns])
        features[f'{speaker}_interruptions'] = sum(1 for t in speaker_turns if t['gap'] < -0.2)  # Overlap > 200ms

    return features

Research validation: Turn-taking features improve depression detection (patient talks less, longer pauses: 71% → 76% accuracy, Cummins et al. 2015)

4. Domain-Specific Feature Engineering

Generic features (eGeMAPS) work well across tasks—but task-specific features can provide 3-8% accuracy boost.

Age Detection: Voice Aging Markers

def extract_age_features(features_df):
    """
    Domain-specific features for age detection
    """
    age_features = {}

    # 1. Tremor (age-related vocal instability)
    f0 = features_df['F0semitoneFrom27.5Hz_sma3nz']
    age_features['f0_tremor'] = compute_tremor(f0, tremor_band=(4, 8))  # 4-8 Hz modulation

    # 2. Jitter (pitch perturbation increases with age)
    age_features['jitter_mean'] = features_df['jitterLocal_sma3nz'].mean()

    # 3. Shimmer (amplitude perturbation increases with age)
    age_features['shimmer_mean'] = features_df['shimmerLocaldB_sma3nz'].mean()

    # 4. HNR (harmonics-to-noise ratio decreases with age)
    age_features['hnr_mean'] = features_df['HNRdBACF_sma3nz'].mean()

    # 5. Formant centralization (vowel space shrinks with age)
    f1 = features_df['F1frequency_sma3nz']
    f2 = features_df['F2frequency_sma3nz']
    age_features['vowel_space_area'] = compute_vowel_space_area(f1, f2)

    # 6. Voiced-time ratio (rough proxy for speaking rate, which decreases with age)
    # eGeMAPS LLDs have no voicing-probability column; F0 > 0 marks voiced frames
    voiced_frames = features_df['F0semitoneFrom27.5Hz_sma3nz'] > 0
    total_time = len(features_df) * 0.01  # 10ms per frame
    voiced_time = voiced_frames.sum() * 0.01
    age_features['articulation_rate'] = voiced_time / total_time

    return age_features

def compute_tremor(f0, tremor_band=(4, 8), frame_rate=100):
    """
    Detect 4-8 Hz tremor (age-related)
    """
    from scipy.integrate import trapezoid  # np.trapz is deprecated in NumPy 2.0
    from scipy.signal import welch

    # Compute power spectral density (frame_rate = LLD frames per second)
    freqs, psd = welch(f0, fs=frame_rate, nperseg=256)

    # Integrate power in tremor band
    tremor_mask = (freqs >= tremor_band[0]) & (freqs <= tremor_band[1])
    tremor_power = trapezoid(psd[tremor_mask], freqs[tremor_mask])

    # Normalize by total power
    total_power = trapezoid(psd, freqs)
    tremor_ratio = tremor_power / total_power

    return tremor_ratio

Research validation: Age-specific features reduce MAE from 6.4 → 5.1 years (Bahari et al., 2014)
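
The `compute_vowel_space_area` helper used above is not defined in this guide. One plausible definition, assumed here, is the area of the convex hull of the voiced (F1, F2) points; other definitions (for example, the triangle spanned by corner vowels) are also in use, so treat this as a sketch rather than the reference method.

```python
import numpy as np
from scipy.spatial import ConvexHull

def compute_vowel_space_area(f1, f2):
    """Convex-hull area (Hz^2) of the (F1, F2) points; an assumed definition."""
    points = np.column_stack([
        np.asarray(f1, dtype=float),
        np.asarray(f2, dtype=float),
    ])
    # Drop frames where either formant is missing or zero (unvoiced)
    valid = np.isfinite(points).all(axis=1) & (points > 0).all(axis=1)
    points = points[valid]
    if len(points) < 3:
        return 0.0
    # For 2-D input, ConvexHull.volume is the enclosed area (.area is the perimeter)
    return float(ConvexHull(points).volume)

# Four rectangle corners plus one interior point: 400 Hz wide, 1000 Hz tall
f1 = [300, 700, 700, 300, 500]
f2 = [1000, 1000, 2000, 2000, 1500]
area = compute_vowel_space_area(f1, f2)
print(area)
```

Perfectly collinear points make the hull degenerate and raise an error in SciPy, so real pipelines may want to catch that case.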

Parkinson's Detection: Motor Speech Markers

def extract_parkinsons_features(features_df):
    """
    Domain-specific features for Parkinson's screening
    """
    pd_features = {}

    # 1. Reduced pitch variation (vocal rigidity)
    f0 = features_df['F0semitoneFrom27.5Hz_sma3nz']
    pd_features['f0_range'] = f0.max() - f0.min()
    pd_features['f0_std'] = f0.std()
    pd_features['f0_iqr'] = f0.quantile(0.75) - f0.quantile(0.25)

    # 2. Increased jitter (irregular vibration)
    pd_features['jitter_mean'] = features_df['jitterLocal_sma3nz'].mean()
    pd_features['jitter_std'] = features_df['jitterLocal_sma3nz'].std()

    # 3. Increased shimmer (irregular amplitude)
    pd_features['shimmer_mean'] = features_df['shimmerLocaldB_sma3nz'].mean()

    # 4. Reduced HNR (breathier voice)
    pd_features['hnr_mean'] = features_df['HNRdBACF_sma3nz'].mean()

    # 5. Micropauses (brief voicing breaks)
    voiced = features_df['F0semitoneFrom27.5Hz_sma3nz'] > 0  # F0 > 0 marks voiced frames in eGeMAPS
    voicing_changes = np.diff(voiced.astype(int))
    micropauses = (voicing_changes == -1).sum()  # Voiced → unvoiced transitions
    pd_features['micropause_rate'] = micropauses / (len(features_df) * 0.01)  # Per second

    # 6. Formant precision (reduced in PD)
    f1_std = features_df['F1frequency_sma3nz'].std()
    f2_std = features_df['F2frequency_sma3nz'].std()
    pd_features['formant_precision'] = 1 / (f1_std + f2_std)  # Inverse of variability

    return pd_features

Research validation: PD-specific features achieve 90-97% F1-score vs 78-85% with generic features (Tsanas et al., 2012)

Depression Detection: Affective Prosody

def extract_depression_features(features_df):
    """
    Domain-specific features for depression screening
    """
    dep_features = {}

    # 1. Reduced pitch variation (flat affect)
    f0 = features_df['F0semitoneFrom27.5Hz_sma3nz']
    dep_features['f0_range'] = f0.max() - f0.min()
    dep_features['f0_std'] = f0.std()

    # 2. Lower mean pitch
    dep_features['f0_mean'] = f0.mean()

    # 3. Reduced loudness variation
    loudness = features_df['Loudness_sma3']
    dep_features['loudness_std'] = loudness.std()

    # 4. Longer pauses
    voiced = features_df['F0semitoneFrom27.5Hz_sma3nz'] > 0  # F0 > 0 marks voiced frames in eGeMAPS
    unvoiced_runs = get_run_lengths(~voiced)
    dep_features['pause_mean_duration'] = np.mean(unvoiced_runs) * 0.01  # Seconds
    dep_features['pause_max_duration'] = np.max(unvoiced_runs) * 0.01

    # 5. Slower articulation rate
    voiced_time = voiced.sum() * 0.01
    total_time = len(features_df) * 0.01
    dep_features['articulation_rate'] = voiced_time / total_time

    # 6. Reduced spectral energy (less expressive)
    mfcc1_std = features_df['mfcc1_sma3'].std()
    dep_features['spectral_variability'] = mfcc1_std

    return dep_features

def get_run_lengths(binary_array):
    """Get lengths of consecutive True runs"""
    changes = np.diff(np.concatenate(([0], binary_array.astype(int), [0])))
    run_starts = np.where(changes == 1)[0]
    run_ends = np.where(changes == -1)[0]
    return run_ends - run_starts
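
To make the run-length trick concrete, here is a small worked example on a hand-written voicing mask. It restates the same padding-and-diff logic in a self-contained function (named `run_lengths` here to keep the example standalone) and assumes 10 ms frames.

```python
import numpy as np

def run_lengths(mask):
    """Lengths of consecutive True runs in a boolean array."""
    # Pad with zeros so runs touching either edge still produce a start and an end
    padded = np.concatenate(([0], np.asarray(mask, dtype=int), [0]))
    changes = np.diff(padded)
    starts = np.where(changes == 1)[0]
    ends = np.where(changes == -1)[0]
    return ends - starts

# True = voiced frame, False = unvoiced frame (10 ms per frame)
voiced = np.array([1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0], dtype=bool)
pauses = run_lengths(~voiced)
print(pauses)         # frames per unvoiced run
print(pauses * 0.01)  # seconds per unvoiced run
```

The mask contains three unvoiced runs of 3, 1, and 2 frames. Very short runs are often unvoiced consonants rather than true pauses, so a minimum-length filter may be worth adding before computing pause statistics.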

5. Feature Selection: Reducing Dimensionality

The curse of dimensionality: 6,373 ComParE features with 500 samples → severe overfitting.

Goal: Reduce to 50-500 most informative features.

Method 1: Random Forest Feature Importance

from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Train Random Forest
rf = RandomForestClassifier(n_estimators=500, max_depth=10, random_state=42)
rf.fit(X_train, y_train)

# Get feature importances
importances = rf.feature_importances_
feature_names = X_train.columns

# Sort by importance
indices = np.argsort(importances)[::-1]

# Select top K features
K = 100
selected_features = feature_names[indices[:K]]

X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

# Retrain on selected features
rf_final = RandomForestClassifier(n_estimators=500)
rf_final.fit(X_train_selected, y_train)

print(f"Accuracy (all features): {rf.score(X_test, y_test):.3f}")
print(f"Accuracy (top {K} features): {rf_final.score(X_test_selected, y_test):.3f}")
# Often: Selected features perform BETTER (less overfitting)

Typical result: 6,373 features (72% accuracy) → 100 features (78% accuracy)

Method 2: Lasso Regularization (L1)

from sklearn.linear_model import LassoCV

# Cross-validated Lasso (automatic alpha selection)
lasso = LassoCV(cv=5, random_state=42, max_iter=10000)
lasso.fit(X_train, y_train)

# Get non-zero coefficients (selected features)
selected_mask = lasso.coef_ != 0
selected_features = X_train.columns[selected_mask]

print(f"Selected {selected_mask.sum()} / {len(X_train.columns)} features")

X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

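Advantage: Automatic. LassoCV picks the regularization strength (alpha) by cross-validation, so no manual tuning is needed. Standardize features first, since the L1 penalty is scale-sensitive.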
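
Lasso is a regression model, so for class labels an L1-penalized classifier is the closer match. The sketch below swaps in `LinearSVC` with an L1 penalty inside `SelectFromModel` and standardizes first; the synthetic data and the value `C=0.01` are assumptions chosen for illustration, not tuned settings.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic data: only the first two of 30 features carry signal
X = rng.normal(size=(300, 30))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

selector = make_pipeline(
    StandardScaler(),  # L1 penalties are scale-sensitive
    SelectFromModel(LinearSVC(C=0.01, penalty='l1', dual=False, max_iter=5000)),
)
selector.fit(X, y)

support = selector[-1].get_support()
print(f"Selected {support.sum()} / {X.shape[1]} features:", np.flatnonzero(support))
```

Smaller `C` means stronger regularization and fewer surviving features; in practice `C` would be chosen by cross-validation, just as LassoCV chooses alpha.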

Method 3: Correlation-Based Feature Selection

Remove redundant features: If two features are highly correlated (|r| > 0.9), keep only one.

import pandas as pd

def remove_correlated_features(X, threshold=0.9):
    """
    Remove features with correlation > threshold
    """
    corr_matrix = X.corr().abs()

    # Get upper triangle (avoid duplicate pairs)
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

    # Find features with correlation > threshold
    to_drop = [column for column in upper.columns if any(upper[column] > threshold)]

    print(f"Dropping {len(to_drop)} correlated features")
    return X.drop(columns=to_drop)

X_train_uncorrelated = remove_correlated_features(X_train, threshold=0.9)

Method 4: Sequential Feature Selection

Greedy search: Add features one-by-one, keep those that improve CV score

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier

# Forward selection: start with 0 features, add best one at each step
sfs = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=100),
    n_features_to_select=50,
    direction='forward',
    cv=5,
    n_jobs=-1
)

sfs.fit(X_train, y_train)

# Get selected features
selected_features = X_train.columns[sfs.get_support()]
X_train_selected = X_train[selected_features]

Warning: Computationally expensive. Forward selection evaluates roughly K × F candidate feature sets (K = n_features_to_select, F = total features), and each evaluation fits the model once per CV fold.

6. Handling Missing Data

Common scenario: Pitch (F0) undefined during unvoiced segments (consonants, silence).

Strategy 1: Forward Fill

import pandas as pd

# Forward fill: carry last valid value forward
# (fillna(method=...) is deprecated in recent pandas; use ffill()/bfill())
df['F0_filled'] = df['F0'].ffill()

# Or backward fill
df['F0_filled'] = df['F0'].bfill()

Strategy 2: Interpolation

# Linear interpolation
df['F0_interpolated'] = df['F0'].interpolate(method='linear')

# Or cubic spline (smoother)
df['F0_interpolated'] = df['F0'].interpolate(method='cubic')

Strategy 3: Use Unvoiced Flag as Feature

# Instead of filling, use voicing as separate feature
df['is_voiced'] = (~df['F0'].isna()).astype(int)

# Fill with mean (but model learns voicing separately)
df['F0_filled'] = df['F0'].fillna(df['F0'].mean())

# Now model can learn:
# - F0 value (when voiced)
# - Voicing ratio (% time voiced)
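
At the utterance level, the same idea can be applied without any filling: compute pitch statistics over voiced frames only and report the voicing ratio as its own feature. A minimal sketch, assuming unvoiced frames are marked as NaN; the function name and sample values are illustrative.

```python
import numpy as np
import pandas as pd

def summarize_f0(f0):
    """Utterance-level F0 summary that keeps voicing as a separate feature."""
    f0 = pd.Series(f0, dtype=float)
    voiced = f0.notna()
    return {
        'f0_mean_voiced': f0[voiced].mean(),  # NaN frames are excluded, not imputed
        'f0_std_voiced': f0[voiced].std(),
        'voicing_ratio': voiced.mean(),       # fraction of frames that are voiced
    }

f0_track = [np.nan, 110.0, 112.0, np.nan, np.nan, 118.0, 120.0, np.nan]
summary = summarize_f0(f0_track)
print(summary)
```

This avoids pulling the pitch statistics toward an imputed value, at the cost of returning NaN for a fully unvoiced clip.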

Strategy 4: Exclude Missing-Heavy Features

def drop_features_with_missing(X, threshold=0.5):
    """
    Drop features with >threshold fraction missing
    """
    missing_fraction = X.isna().mean()
    to_drop = missing_fraction[missing_fraction > threshold].index

    print(f"Dropping {len(to_drop)} features with >{threshold*100}% missing")
    return X.drop(columns=to_drop)

X_clean = drop_features_with_missing(X, threshold=0.5)

7. Feature Validation & Quality Checks

Check 1: Distribution Sanity

import matplotlib.pyplot as plt

def plot_feature_distributions(X, features_to_plot=10):
    """
    Plot histograms of first N features
    """
    fig, axes = plt.subplots(2, 5, figsize=(15, 6))
    axes = axes.flatten()

    for i, feature in enumerate(X.columns[:features_to_plot]):
        axes[i].hist(X[feature].dropna(), bins=50)
        axes[i].set_title(feature)
        axes[i].set_xlabel('Value')
        axes[i].set_ylabel('Count')

    plt.tight_layout()
    plt.savefig('feature_distributions.png')

plot_feature_distributions(X_train)

Red flags:

All zeros: Feature not computed correctly
Single value: No variance, won't help model
Extreme outliers: May need clipping or log transform

Check 2: Train/Test Distribution Shift

from scipy.stats import ks_2samp

def check_distribution_shift(X_train, X_test, threshold=0.05):
    """
    Use Kolmogorov-Smirnov test to detect train/test shift
    """
    shifted_features = []

    for feature in X_train.columns:
        statistic, p_value = ks_2samp(
            X_train[feature].dropna(),
            X_test[feature].dropna()
        )

        if p_value < threshold:
            shifted_features.append((feature, p_value))

    if shifted_features:
        print(f"⚠️ {len(shifted_features)} features show train/test shift:")
        for feature, p_value in sorted(shifted_features, key=lambda x: x[1])[:10]:
            print(f"  {feature}: p={p_value:.4f}")
    else:
        print("✓ No significant distribution shift detected")

check_distribution_shift(X_train, X_test)
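
One caveat when running this check over thousands of features: at a 0.05 threshold, some features will be flagged by chance alone. A simple, conservative remedy is a Bonferroni correction, sketched below on synthetic NumPy arrays (not DataFrames) with one deliberately shifted column. The function name is hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

def shifted_features_bonferroni(X_train, X_test, alpha=0.05):
    """Indices of columns whose KS p-value is below alpha / n_features."""
    n_features = X_train.shape[1]
    p_values = np.array([
        ks_2samp(X_train[:, j], X_test[:, j]).pvalue
        for j in range(n_features)
    ])
    return np.flatnonzero(p_values < alpha / n_features), p_values

rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 50))
X_test = rng.normal(size=(200, 50))
X_test[:, 3] += 1.0  # inject a real shift into one feature

flagged, p_values = shifted_features_bonferroni(X_train, X_test)
print("Flagged after correction:", flagged)
print("Below 0.05 without correction:", int((p_values < 0.05).sum()))
```

Bonferroni is strict; a false-discovery-rate procedure is a reasonable alternative when a few false alarms are acceptable.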

Check 3: Feature-Label Correlation

import numpy as np

def analyze_feature_label_correlation(X, y, top_k=20):
    """
    Find features most correlated with target
    """
    correlations = {}

    for feature in X.columns:
        corr = np.corrcoef(X[feature].fillna(X[feature].mean()), y)[0, 1]
        correlations[feature] = abs(corr)

    # Sort by absolute correlation
    sorted_corr = sorted(correlations.items(), key=lambda x: x[1], reverse=True)

    print(f"Top {top_k} features correlated with target:")
    for i, (feature, corr) in enumerate(sorted_corr[:top_k], 1):
        print(f"{i}. {feature}: {corr:.3f}")

    return sorted_corr

correlations = analyze_feature_label_correlation(X_train, y_train)

8. Production Feature Pipeline

Putting it all together: reproducible, version-controlled pipeline

import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class VoiceFeatureExtractor(BaseEstimator, TransformerMixin):
    """
    Custom transformer for voice feature extraction
    """
    def __init__(self, feature_set='eGeMAPS'):
        self.feature_set = feature_set
        self.smile = None

    def fit(self, X, y=None):
        # Initialize openSMILE
        import opensmile
        self.smile = opensmile.Smile(
            feature_set=getattr(opensmile.FeatureSet, self.feature_set),
            feature_level=opensmile.FeatureLevel.Functionals,
        )
        return self

    def transform(self, X):
        # X is a DataFrame with 'path' (audio file) and 'speaker_id' columns
        features_list = []
        for audio_path in X['path']:
            features = self.smile.process_file(audio_path)
            features_list.append(features)

        features = pd.concat(features_list, ignore_index=True)
        # Carry speaker IDs along so SpeakerNormalizer can group by them
        features['speaker_id'] = X['speaker_id'].to_numpy()
        return features

class SpeakerNormalizer(BaseEstimator, TransformerMixin):
    """
    Z-score normalization per speaker
    """
    def __init__(self, speaker_id_col='speaker_id'):
        self.speaker_id_col = speaker_id_col
        self.speaker_stats = {}

    def fit(self, X, y=None):
        # Compute per-speaker means and stds
        for speaker_id in X[self.speaker_id_col].unique():
            speaker_data = X[X[self.speaker_id_col] == speaker_id]
            numeric_cols = speaker_data.select_dtypes(include=[np.number]).columns

            self.speaker_stats[speaker_id] = {
                'mean': speaker_data[numeric_cols].mean(),
                'std': speaker_data[numeric_cols].std(),
            }

        return self

    def transform(self, X):
        X_norm = X.copy()
        numeric_cols = X.select_dtypes(include=[np.number]).columns

        for speaker_id in X[self.speaker_id_col].unique():
            mask = X[self.speaker_id_col] == speaker_id

            if speaker_id in self.speaker_stats:
                stats = self.speaker_stats[speaker_id]
            else:
                # Unseen speaker: fall back to the statistics of the current batch
                stats = {
                    'mean': X[numeric_cols].mean(),
                    'std': X[numeric_cols].std(),
                }

            X_norm.loc[mask, numeric_cols] = (
                (X.loc[mask, numeric_cols] - stats['mean']) / (stats['std'] + 1e-8)
            )

        # Drop the speaker ID so downstream estimators see only numeric features
        return X_norm.drop(columns=[self.speaker_id_col])

# Build pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

pipeline = Pipeline([
    ('extract', VoiceFeatureExtractor(feature_set='eGeMAPSv02')),
    ('normalize', SpeakerNormalizer()),
    ('select', SelectFromModel(RandomForestClassifier(n_estimators=100), threshold='median')),
    ('classify', RandomForestClassifier(n_estimators=500, max_depth=10)),
])

# Train: one row per recording (speaker IDs below are placeholders)
train_df = pd.DataFrame({
    'path': ['audio1.wav', 'audio2.wav', ...],
    'speaker_id': ['spk1', 'spk1', ...],
})
pipeline.fit(train_df, y_train)

# Predict
test_df = pd.DataFrame({
    'path': ['test1.wav', 'test2.wav', ...],
    'speaker_id': ['spk9', 'spk9', ...],
})
predictions = pipeline.predict(test_df)

9. Common Pitfalls & How to Avoid Them

Pitfall 1: Data Leakage via Normalization

Mistake: Normalize entire dataset before train/test split

# WRONG
X_normalized = (X - X.mean()) / X.std()  # Uses test data statistics!
X_train, X_test = train_test_split(X_normalized)

Fix: Fit scaler on train only, apply to test

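# CORRECT
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test = train_test_split(X)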

scaler = StandardScaler()
scaler.fit(X_train)  # Only train data

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use train statistics
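
The same leak can reappear inside cross-validation if the scaler is fit once on all folds. Wrapping the scaler and the model in one scikit-learn pipeline avoids it, because the scaler is then refit on the training part of each fold. A minimal sketch on synthetic data (the features and labels are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=25.0, size=(200, 10))  # synthetic, unscaled features
y = (X[:, 0] > 100.0).astype(int)

# The scaler is refit on the training part of every fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f}")
```

Any other fitted preprocessing step (quantile transforms, feature selection) belongs inside the pipeline for the same reason.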

Pitfall 2: Speaker Leakage in Train/Test Split

Mistake: Same speaker in train and test

# WRONG: Random split may put same speaker in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y)

Fix: Use GroupKFold for speaker-independent evaluation

# CORRECT: Ensure train and test have different speakers
from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)

for train_idx, test_idx in gkf.split(X, y, groups=speaker_ids):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Train and evaluate
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
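
For a single held-out split rather than K folds, `GroupShuffleSplit` gives the same speaker-disjoint guarantee, and a short check confirms that no speaker appears on both sides. The speaker IDs and data below are synthetic placeholders.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
speaker_ids = np.repeat([f"spk{i:02d}" for i in range(20)], 5)  # 20 speakers x 5 clips
X = rng.normal(size=(len(speaker_ids), 8))
y = rng.integers(0, 2, size=len(speaker_ids))

# test_size is a fraction of speakers (groups), not of samples
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=speaker_ids))

train_speakers = set(speaker_ids[train_idx])
test_speakers = set(speaker_ids[test_idx])
assert train_speakers.isdisjoint(test_speakers), "speaker leakage"
print(len(train_speakers), "train speakers,", len(test_speakers), "test speakers")
```

Keeping the disjointness assertion in the training script is a cheap guard against a later refactor reintroducing the leak.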

Pitfall 3: Overfitting to Functionals

Problem: Using 98 functionals on 25 LLDs → 2,450 features, many redundant

Fix: Start with minimal functionals (mean, std, min, max, range), add others only if CV score improves

# Minimal functionals (interpretable)
functionals_minimal = ['mean', 'std', 'min', 'max', 'range']

# Extended functionals (add if needed)
functionals_extended = functionals_minimal + [
    'quartile1', 'quartile2', 'quartile3',
    'linregc1',  # Slope
    'skewness', 'kurtosis'
]
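
The minimal set can be computed directly from a frame-level LLD table with pandas. This is a sketch, not openSMILE's own functional extraction; the column names (`F0`, `loudness`) and values are made up for illustration.

```python
import pandas as pd

def minimal_functionals(llds):
    """Collapse frame-level LLDs (frames x features) into one row of statistics."""
    stats = llds.agg(['mean', 'std', 'min', 'max'])  # rows: statistic, columns: LLD
    stats.loc['range'] = stats.loc['max'] - stats.loc['min']
    # Flatten to names like 'F0_mean', 'F0_std', ...
    flat = {
        f"{col}_{stat}": stats.loc[stat, col]
        for col in stats.columns
        for stat in stats.index
    }
    return pd.DataFrame([flat])

llds = pd.DataFrame({
    'F0': [100.0, 110.0, 120.0, 130.0],
    'loudness': [0.2, 0.4, 0.4, 0.6],
})
functionals = minimal_functionals(llds)
print(functionals.shape)
print(functionals.T)
```

Two LLDs with five statistics each give ten features per utterance, which keeps the feature count easy to reason about before adding the extended set.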

The Bottom Line: Feature Engineering Workflow

For most voice analysis tasks, follow this workflow:

Extract eGeMAPS features (88 features, best baseline)
Add domain-specific features (5-20 features based on task)
Normalize per-speaker or per-gender (if multiple recordings available)
Add delta features (if using LLDs, not functionals)
Train Random Forest baseline (no feature selection)
If accuracy insufficient:
- Extract ComParE (6,373 features)
- Use Random Forest feature importance to select top 100-500
- Retrain on selected features
If still insufficient:
- Add Wav2vec 2.0 embeddings (768 features)
- Ensemble classical ML + deep learning

Expected performance progression:

Raw features (no engineering): 60-70% accuracy
+ Normalization: 65-75%
+ Domain features: 70-80%
+ Feature selection: 72-83%
+ Deep learning ensemble: 75-88%

Remember: Feature engineering is iterative. Start simple (eGeMAPS + Random Forest), then add complexity only where CV score improves.

Voice Mirror's feature engineering pipeline combines eGeMAPS baseline (88 features), task-specific features (age tremor, PD micropauses, depression prosody), per-speaker normalization, and Random Forest feature selection from ComParE (6,373 → 200 features). Our hybrid approach delivers 78-89% accuracy across 20+ voice analysis tasks while maintaining interpretability for clinical applications.