Feature Engineering Best Practices for Voice Analysis: From Raw Audio to Predictive Power
In machine learning, features often matter more than algorithms. A Random Forest trained on well-engineered features will often outperform a sophisticated deep learning model trained on raw or poorly processed features, especially when data is limited.
In voice analysis, feature engineering is the art of transforming raw acoustic measurements (pitch, energy, MFCCs) into meaningful predictors that capture the nuances of age, emotion, health, and personality.
This guide shares battle-tested practices from production voice analysis systems—lessons learned from 100+ research papers and real-world deployments.
The Feature Engineering Pipeline
Raw Audio → Acoustic Features → Normalization → Derived Features → Feature Selection → Model Input
5 stages:
- Acoustic extraction: openSMILE, librosa, or Wav2vec 2.0 embeddings
- Normalization: Speaker-independent scaling, z-score, or per-speaker normalization
- Derived features: Ratios, deltas, domain-specific combinations
- Feature selection: Reduce 6,000+ features to 50-500 most informative
- Validation: Cross-validation, outlier detection, missing value handling
Let's dive into each stage with practical examples.
1. Acoustic Feature Extraction: Choosing Your Base
Your choice of base features determines the ceiling of your model's performance.
Option A: Handcrafted Acoustic Features (openSMILE)
Best for: Classical ML, interpretability, limited data (<5,000 samples)
import opensmile
smile = opensmile.Smile(
feature_set=opensmile.FeatureSet.eGeMAPSv02,
feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file('audio.wav') # (1, 88)
Pros:
- Interpretable: Each feature has clear acoustic meaning (pitch mean, jitter, formants)
- Fast: CPU-only, 10-minute audio in <1 second
- Validated: eGeMAPS used in 200+ papers with established benchmarks
- Low sample requirement: 100-1,000 samples sufficient for Random Forest
Cons:
- Limited ceiling: 2-5% lower accuracy than deep learning on large datasets
- Manual engineering required: Need domain knowledge to add task-specific features
Option B: Deep Learning Embeddings (Wav2vec 2.0)
Best for: Large datasets (>5,000 samples), SOTA accuracy requirements
import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
# The pre-trained base checkpoint has no tokenizer, so load the feature extractor directly
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()
# Load audio (resampled to 16 kHz mono)
audio, sr = librosa.load('audio.wav', sr=16000)
# Extract embeddings
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state # (1, time_steps, 768)
# Pool across time (mean pooling)
features = embeddings.mean(dim=1).squeeze().numpy() # (768,)
Pros:
- SOTA performance: 2-5% higher accuracy than handcrafted features
- No manual engineering: Pre-trained on unlabeled speech (960 hours for wav2vec2-base, roughly 60,000 hours for the larger LV-60 checkpoints), learns generic speech representations
- Transfer learning: Fine-tune on your task with 500-2,000 samples
Cons:
- Black box: 768-dimensional embeddings lack interpretability
- GPU recommended: 10-100ms inference latency on GPU; CPU inference works but is slower
- Larger models: 300MB+ model size vs 50MB for openSMILE
Option C: Hybrid Approach (Recommended)
Best for: Production systems balancing accuracy and interpretability
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
# Extract both types (smile is the openSMILE extractor from Option A;
# extract_wav2vec_embeddings wraps the Option B code and returns a (768,) vector)
opensmile_features = smile.process_file('audio.wav') # (1, 88)
wav2vec_features = extract_wav2vec_embeddings('audio.wav') # (768,)
# Concatenate when a single model should see both views
combined = np.concatenate([
    opensmile_features.values.flatten(),
    wav2vec_features
]) # (856,)
# Train one model per feature view
# Model 1: Classical ML on openSMILE
model_classical = RandomForestClassifier(n_estimators=200)
model_classical.fit(X_opensmile_train, y_train)
# Model 2: Logistic Regression on Wav2vec embeddings
model_deep = LogisticRegression(max_iter=1000)
model_deep.fit(X_wav2vec_train, y_train)
# Ensemble: weighted soft voting done by hand. VotingClassifier refits every
# estimator on one shared feature matrix, so it cannot combine models that
# were trained on different feature sets.
weights = [0.4, 0.6] # Favor deep learning slightly
proba = (
    weights[0] * model_classical.predict_proba(X_opensmile_test)
    + weights[1] * model_deep.predict_proba(X_wav2vec_test)
)
predictions = model_classical.classes_[proba.argmax(axis=1)]
Result: 1-3% accuracy boost over either approach alone, plus interpretable fallback.
2. Speaker Normalization: Accounting for Individual Differences
The problem: Male speakers have F0 ~100 Hz, female speakers ~200 Hz. Raw pitch predicts gender with 98% accuracy—but obscures other signals (emotion, health, age).
Solution: Normalize features to remove speaker-specific characteristics.
Technique 1: Z-Score Normalization (Per-Speaker)
When to use: You have multiple recordings per speaker (longitudinal data, conversation turns)
import numpy as np
def zscore_normalize_per_speaker(features, speaker_ids):
    """
    Normalize features per speaker: (x - speaker_mean) / speaker_std
    Args:
        features: (n_samples, n_features)
        speaker_ids: (n_samples,) - speaker ID for each sample
    Returns:
        normalized_features: (n_samples, n_features)
    """
    normalized = np.zeros_like(features)
    for speaker_id in np.unique(speaker_ids):
        speaker_mask = speaker_ids == speaker_id
        speaker_features = features[speaker_mask]
        # Compute per-speaker mean and std
        mean = speaker_features.mean(axis=0)
        std = speaker_features.std(axis=0) + 1e-8 # Avoid division by zero
        # Normalize
        normalized[speaker_mask] = (speaker_features - mean) / std
    return normalized
Example: Depression detection from conversation
# Speaker A: 5 turns, Speaker B: 5 turns
features = np.array([...]) # (10, 88)
speaker_ids = np.array(['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'])
normalized = zscore_normalize_per_speaker(features, speaker_ids)
# Now F0 mean is relative to each speaker's baseline
# Speaker A high F0 turn: +1.5 std above their mean
# Speaker B high F0 turn: +1.5 std above their mean
# Both are comparable despite different absolute F0
Research validation: Improves depression detection accuracy from 68% → 74% (AVEC 2014, Valstar et al.)
Technique 2: Percentile Normalization (Population-Based)
When to use: Single recording per speaker, large training set (>1,000 samples)
from sklearn.preprocessing import QuantileTransformer
# Fit on training data
qt = QuantileTransformer(output_distribution='normal', n_quantiles=1000)
qt.fit(X_train)
# Transform train and test
X_train_norm = qt.transform(X_train)
X_test_norm = qt.transform(X_test)
How it works: Maps feature distributions to standard normal (mean=0, std=1) using empirical quantiles.
Advantage over z-score: Robust to outliers (extreme pitch values don't skew normalization)
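To make the robustness claim concrete, here is a minimal sketch on synthetic data. The "pitch" values and the single 5000 Hz tracking error are made up for illustration; the point is only to compare how each scaler treats the typical values when one extreme value is present.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer, StandardScaler

rng = np.random.default_rng(0)
# Synthetic "pitch" feature: 999 typical values plus one extreme tracking error
pitch = rng.normal(loc=150.0, scale=20.0, size=999)
pitch_with_outlier = np.append(pitch, 5000.0).reshape(-1, 1)

z = StandardScaler().fit_transform(pitch_with_outlier).ravel()
q = QuantileTransformer(
    output_distribution="normal", n_quantiles=1000, random_state=0
).fit_transform(pitch_with_outlier).ravel()

# Spread of the 999 typical values after each transform
z_typical_std = z[:-1].std()
q_typical_std = q[:-1].std()
print(f"z-score : typical-value std = {z_typical_std:.2f}, outlier = {z[-1]:.1f}")
print(f"quantile: typical-value std = {q_typical_std:.2f}, outlier = {q[-1]:.1f}")
```

With z-scoring, the outlier inflates the standard deviation, so the typical values get squeezed into a narrow band. The quantile transform works on ranks, so the typical values keep roughly unit spread and the outlier is capped at the edge of the output range.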
Technique 3: Gender-Specific Normalization
When to use: Gender information available, analyzing gender-independent traits (emotion, personality)
def normalize_by_gender(features, genders):
    """
    Normalize features separately for male/female speakers
    """
    normalized = np.zeros_like(features)
    for gender in ['male', 'female']:
        gender_mask = genders == gender
        gender_features = features[gender_mask]
        # Z-score normalize within gender
        mean = gender_features.mean(axis=0)
        std = gender_features.std(axis=0) + 1e-8
        normalized[gender_mask] = (gender_features - mean) / std
    return normalized
Research validation: Emotion recognition accuracy 78% → 83% (gender-specific norms remove gender confound, IEMOCAP dataset)
Technique 4: Formant Position Normalization
Specific to formants (F1, F2, F3): Account for vocal tract length differences
def lobanov_normalize_formants(f1, f2, f3):
    """
    Lobanov normalization: z-score per formant, per speaker
    Standard in phonetics research (Lobanov, 1971)
    Pass the formant tracks of ONE speaker; call once per speaker
    """
    # Stack formants
    formants = np.column_stack([f1, f2, f3]) # (n_samples, 3)
    # Z-score per formant dimension
    mean = formants.mean(axis=0)
    std = formants.std(axis=0) + 1e-8 # Avoid division by zero
    normalized = (formants - mean) / std
    return normalized[:, 0], normalized[:, 1], normalized[:, 2]
# Or use Bark scale (psychoacoustic)
def hz_to_bark(f_hz):
    """Convert Hz to Bark scale (perceptual frequency)"""
    return 26.81 * f_hz / (1960 + f_hz) - 0.53
f1_bark = hz_to_bark(f1_hz)
f2_bark = hz_to_bark(f2_hz)
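Because Lobanov normalization is defined per speaker, a dataset with several speakers needs the z-scoring applied within each speaker group. The sketch below shows one way to do that with pandas; the column names (`speaker_id`, `F1`, `F2`, `F3`) and the formant values are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

def lobanov_per_speaker(df, formant_cols=("F1", "F2", "F3"), speaker_col="speaker_id"):
    """Z-score each formant within each speaker (hypothetical column names)."""
    cols = list(formant_cols)
    grouped = df.groupby(speaker_col)[cols]
    mean = grouped.transform("mean")
    # A zero std (constant formant) becomes NaN instead of dividing by zero
    std = grouped.transform("std").replace(0, np.nan)
    return (df[cols] - mean) / std

# Speaker B's formants are speaker A's scaled by 1.2 (a shorter vocal tract)
df = pd.DataFrame({
    "speaker_id": ["A", "A", "A", "B", "B", "B"],
    "F1": [300.0, 500.0, 700.0, 360.0, 600.0, 840.0],
    "F2": [900.0, 1500.0, 2100.0, 1080.0, 1800.0, 2520.0],
    "F3": [2400.0, 2500.0, 2600.0, 2880.0, 3000.0, 3120.0],
})
normalized = lobanov_per_speaker(df)
print(normalized)
```

In this toy case the two speakers end up with identical normalized values, which is the intended effect: the uniform scaling from vocal tract length is removed and only the relative vowel positions remain.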
3. Temporal Features: Capturing Change Over Time
Static features (mean F0, mean energy) miss dynamic patterns: How does pitch change? How fast does energy rise?
Delta Features (First-Order Derivatives)
What they capture: Rate of change
import numpy as np
def compute_delta_features(features, window=2):
    """
    Compute delta (first derivative) of features
    Args:
        features: (time_steps, n_features) - frame-by-frame features
        window: Number of frames for derivative computation
    Returns:
        delta: (time_steps, n_features)
    """
    delta = np.zeros_like(features)
    for t in range(window, len(features) - window):
        # Regression slope over 2*window+1 frames
        numerator = sum(i * (features[t + i] - features[t - i])
                        for i in range(1, window + 1))
        denominator = 2 * sum(i**2 for i in range(1, window + 1))
        delta[t] = numerator / denominator
    return delta
# Extract frame-level features
smile = opensmile.Smile(
feature_set=opensmile.FeatureSet.eGeMAPSv02,
feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)
llds = smile.process_file('audio.wav') # (time_steps, 25)
# Compute deltas
delta = compute_delta_features(llds.values)
delta_delta = compute_delta_features(delta) # Second derivative (acceleration)
# Concatenate
features_with_dynamics = np.concatenate([
llds.values,
delta,
delta_delta
], axis=1) # (time_steps, 75) - 3x feature count
Use cases:
- Emotion: Rising pitch (delta > 0) indicates excitement/stress
- Parkinson's: Reduced pitch variation (low delta std) indicates vocal rigidity
- Depression: Monotone speech (low delta range)
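The use cases above rely on utterance-level summaries of the delta contour (its std or range), not on the frame-level deltas themselves. The following sketch shows that summarization step on two synthetic pitch contours. It uses `np.gradient` (central differences) as a simpler stand-in for the regression-based delta, and the contour values and the `delta_summary` helper are illustrative assumptions.

```python
import numpy as np

def delta_summary(contour, frame_rate=100):
    """Summarize frame-to-frame change of one contour (e.g. F0 in semitones)."""
    contour = np.asarray(contour, dtype=float)
    # Central differences, scaled from "per frame" to "per second"
    delta = np.gradient(contour) * frame_rate
    return {
        "delta_mean": float(delta.mean()),
        "delta_std": float(delta.std()),
        "delta_range": float(delta.max() - delta.min()),
    }

t = np.arange(200) / 100.0  # 2 seconds at 100 frames per second
monotone = 30.0 + 0.1 * np.sin(2 * np.pi * 1.0 * t)    # nearly flat pitch
expressive = 30.0 + 3.0 * np.sin(2 * np.pi * 1.0 * t)  # wide pitch movement

flat_stats = delta_summary(monotone)
lively_stats = delta_summary(expressive)
print("monotone  :", flat_stats)
print("expressive:", lively_stats)
```

The monotone contour yields a much smaller `delta_std` and `delta_range` than the expressive one, which is the kind of contrast the depression and Parkinson's markers are meant to pick up.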
Slope Features (Linear Regression)
What they capture: Overall trend across utterance
from scipy.stats import linregress
def compute_slope_features(features):
    """
    Compute linear regression slope for each feature over time
    Returns:
        slopes: (n_features,) - slope per feature
        intercepts: (n_features,) - intercept per feature
    """
    n_features = features.shape[1]
    slopes = np.zeros(n_features)
    intercepts = np.zeros(n_features)
    time = np.arange(len(features))
    for i in range(n_features):
        result = linregress(time, features[:, i])
        slopes[i] = result.slope
        intercepts[i] = result.intercept
    return slopes, intercepts
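# Example: F0 slope (voiced frames only; the "nz" contour is 0 where unvoiced)
f0 = llds['F0semitoneFrom27.5Hz_sma3nz'].values
voiced = f0 > 0
# linregress returns a result object with five fields, so read the slope by attribute
slope = linregress(np.flatnonzero(voiced), f0[voiced]).slope
if slope > 0:
    print("Rising pitch over utterance (excitement, question intonation)")
else:
    print("Falling pitch over utterance (statement, fatigue)")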
Turn-Taking Features (Conversation Dynamics)
For multi-speaker recordings (interviews, conversations)
def extract_turn_taking_features(speaker_segments):
    """
    Args:
        speaker_segments: List of (start_time, end_time, speaker_id) tuples,
            sorted by start_time
    Returns:
        features: Dict of turn-taking metrics
    """
    turns = []
    for i, (start, end, speaker) in enumerate(speaker_segments):
        duration = end - start
        # Gap/overlap with next turn
        if i < len(speaker_segments) - 1:
            next_start = speaker_segments[i + 1][0]
            gap = next_start - end # Positive = pause, negative = overlap
        else:
            gap = 0
        turns.append({
            'speaker': speaker,
            'duration': duration,
            'gap': gap,
        })
    # Aggregate features
    features = {}
    total_talk_time = sum(t['duration'] for t in turns)
    for speaker in set(t['speaker'] for t in turns):
        speaker_turns = [t for t in turns if t['speaker'] == speaker]
        features[f'{speaker}_turn_count'] = len(speaker_turns)
        features[f'{speaker}_avg_turn_duration'] = np.mean([t['duration'] for t in speaker_turns])
        features[f'{speaker}_talk_time_ratio'] = sum(t['duration'] for t in speaker_turns) / total_talk_time
        features[f'{speaker}_avg_gap'] = np.mean([t['gap'] for t in speaker_turns])
        # Counts this speaker's turns that the NEXT turn cut into by more than 200ms,
        # i.e. how often this speaker was interrupted
        features[f'{speaker}_interruptions'] = sum(1 for t in speaker_turns if t['gap'] < -0.2)
    return features
Research validation: Turn-taking features improve depression detection (patient talks less, longer pauses: 71% → 76% accuracy, Cummins et al. 2015)
4. Domain-Specific Feature Engineering
Generic features (eGeMAPS) work well across tasks—but task-specific features can provide 3-8% accuracy boost.
Age Detection: Voice Aging Markers
def extract_age_features(features_df):
    """
    Domain-specific features for age detection
    Expects frame-level eGeMAPSv02 LLDs (10ms hop)
    """
    age_features = {}
    # 1. Tremor (age-related vocal instability)
    f0 = features_df['F0semitoneFrom27.5Hz_sma3nz']
    age_features['f0_tremor'] = compute_tremor(f0, tremor_band=(4, 8)) # 4-8 Hz modulation
    # 2. Jitter (pitch perturbation increases with age)
    age_features['jitter_mean'] = features_df['jitterLocal_sma3nz'].mean()
    # 3. Shimmer (amplitude perturbation increases with age)
    age_features['shimmer_mean'] = features_df['shimmerLocaldB_sma3nz'].mean()
    # 4. HNR (harmonics-to-noise ratio decreases with age)
    age_features['hnr_mean'] = features_df['HNRdBACF_sma3nz'].mean()
    # 5. Formant centralization (vowel space shrinks with age)
    f1 = features_df['F1frequency_sma3nz']
    f2 = features_df['F2frequency_sma3nz']
    age_features['vowel_space_area'] = compute_vowel_space_area(f1, f2)
    # 6. Speaking rate proxy: fraction of voiced frames (decreases with age)
    # The eGeMAPSv02 LLDs carry no voicing-probability column, so treat F0 > 0 as voiced
    voiced_frames = f0 > 0
    total_time = len(features_df) * 0.01 # 10ms per frame
    voiced_time = voiced_frames.sum() * 0.01
    age_features['articulation_rate'] = voiced_time / total_time
    return age_features
def compute_tremor(f0, tremor_band=(4, 8), frame_rate=100):
    """
    Detect 4-8 Hz tremor (age-related)
    Returns the fraction of F0 modulation power inside tremor_band
    """
    from scipy.integrate import trapezoid # np.trapz is deprecated in NumPy 2.0
    from scipy.signal import welch
    f0 = np.asarray(f0, dtype=float)
    # Compute power spectral density (cap the segment length for short clips)
    freqs, psd = welch(f0, fs=frame_rate, nperseg=min(256, len(f0)))
    # Integrate power in tremor band
    tremor_mask = (freqs >= tremor_band[0]) & (freqs <= tremor_band[1])
    tremor_power = trapezoid(psd[tremor_mask], freqs[tremor_mask])
    # Normalize by total power
    total_power = trapezoid(psd, freqs)
    tremor_ratio = tremor_power / total_power
    return tremor_ratio
Research validation: Age-specific features reduce MAE from 6.4 → 5.1 years (Bahari et al., 2014)
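The age features call `compute_vowel_space_area`, which the article does not define. One plausible definition, offered here only as an assumption, is the area of the convex hull of voiced frames in the F1-F2 plane. Published vowel space measures often use corner-vowel centroids instead, which requires phonetic labels that frame-level features do not provide.

```python
import numpy as np
from scipy.spatial import ConvexHull

def compute_vowel_space_area(f1, f2):
    """Convex-hull area of frames in the F1-F2 plane, in Hz^2 (hypothetical helper)."""
    points = np.column_stack([np.asarray(f1, dtype=float), np.asarray(f2, dtype=float)])
    # Drop frames with missing or zero formants (zero marks unvoiced frames in "nz" contours)
    valid = np.isfinite(points).all(axis=1) & (points > 0).all(axis=1)
    points = points[valid]
    if len(points) < 3:
        return 0.0
    # Collinear points span no area and would make the hull computation fail
    if np.linalg.matrix_rank(points - points.mean(axis=0)) < 2:
        return 0.0
    # For 2-D input, ConvexHull.volume is the enclosed area (.area is the perimeter)
    return float(ConvexHull(points).volume)

# Made-up formant frames: four corners, one interior point, one unvoiced frame
f1_demo = [300, 300, 800, 800, 550, 0]
f2_demo = [800, 2200, 800, 2200, 1500, 0]
area = compute_vowel_space_area(f1_demo, f2_demo)
print(f"Vowel space area: {area:.0f} Hz^2")
```

A convex hull is sensitive to formant tracking errors, so in practice it may be worth clipping each formant to a plausible range or trimming extreme percentiles before computing the area.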
Parkinson's Detection: Motor Speech Markers
def extract_parkinsons_features(features_df):
    """
    Domain-specific features for Parkinson's screening
    Expects frame-level eGeMAPSv02 LLDs (10ms hop)
    """
    pd_features = {}
    # 1. Reduced pitch variation (vocal rigidity), measured on voiced frames only
    f0 = features_df['F0semitoneFrom27.5Hz_sma3nz']
    voiced = f0 > 0 # "nz" contours are 0 in unvoiced frames
    f0_voiced = f0[voiced]
    pd_features['f0_range'] = f0_voiced.max() - f0_voiced.min()
    pd_features['f0_std'] = f0_voiced.std()
    pd_features['f0_iqr'] = f0_voiced.quantile(0.75) - f0_voiced.quantile(0.25)
    # 2. Increased jitter (irregular vibration)
    pd_features['jitter_mean'] = features_df['jitterLocal_sma3nz'].mean()
    pd_features['jitter_std'] = features_df['jitterLocal_sma3nz'].std()
    # 3. Increased shimmer (irregular amplitude)
    pd_features['shimmer_mean'] = features_df['shimmerLocaldB_sma3nz'].mean()
    # 4. Reduced HNR (breathier voice)
    pd_features['hnr_mean'] = features_df['HNRdBACF_sma3nz'].mean()
    # 5. Micropauses (brief voicing breaks)
    voicing_changes = np.diff(voiced.astype(int))
    micropauses = (voicing_changes == -1).sum() # Voiced → unvoiced transitions
    pd_features['micropause_rate'] = micropauses / (len(features_df) * 0.01) # Per second
    # 6. Formant precision (reduced in PD)
    f1_std = features_df['F1frequency_sma3nz'].std()
    f2_std = features_df['F2frequency_sma3nz'].std()
    pd_features['formant_precision'] = 1 / (f1_std + f2_std) # Inverse of variability
    return pd_features
Research validation: PD-specific features achieve 90-97% F1-score vs 78-85% with generic features (Tsanas et al., 2012)
Depression Detection: Affective Prosody
def extract_depression_features(features_df):
    """
    Domain-specific features for depression screening
    Expects frame-level eGeMAPSv02 LLDs (10ms hop)
    """
    dep_features = {}
    # 1. Reduced pitch variation (flat affect), measured on voiced frames only
    f0 = features_df['F0semitoneFrom27.5Hz_sma3nz']
    voiced = f0 > 0 # "nz" contours are 0 in unvoiced frames
    f0_voiced = f0[voiced]
    dep_features['f0_range'] = f0_voiced.max() - f0_voiced.min()
    dep_features['f0_std'] = f0_voiced.std()
    # 2. Lower mean pitch
    dep_features['f0_mean'] = f0_voiced.mean()
    # 3. Reduced loudness variation (the LLD column is spelled with a capital L)
    loudness = features_df['Loudness_sma3']
    dep_features['loudness_std'] = loudness.std()
    # 4. Longer pauses
    unvoiced_runs = get_run_lengths(~voiced)
    if len(unvoiced_runs) > 0:
        dep_features['pause_mean_duration'] = np.mean(unvoiced_runs) * 0.01 # Seconds
        dep_features['pause_max_duration'] = np.max(unvoiced_runs) * 0.01
    else:
        dep_features['pause_mean_duration'] = 0.0
        dep_features['pause_max_duration'] = 0.0
    # 5. Slower articulation rate (proxy: fraction of voiced frames)
    voiced_time = voiced.sum() * 0.01
    total_time = len(features_df) * 0.01
    dep_features['articulation_rate'] = voiced_time / total_time
    # 6. Reduced spectral energy (less expressive)
    mfcc1_std = features_df['mfcc1_sma3'].std()
    dep_features['spectral_variability'] = mfcc1_std
    return dep_features
def get_run_lengths(binary_array):
    """Get lengths of consecutive True runs"""
    values = np.asarray(binary_array).astype(int)
    changes = np.diff(np.concatenate(([0], values, [0])))
    run_starts = np.where(changes == 1)[0]
    run_ends = np.where(changes == -1)[0]
    return run_ends - run_starts
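A small worked example helps show what the run-length helper returns. The function is restated here so the snippet runs on its own; the voicing mask and the two-frame minimum pause length are made-up values for illustration.

```python
import numpy as np

def get_run_lengths(binary_array):
    """Get lengths of consecutive True runs."""
    values = np.asarray(binary_array).astype(int)
    # Pad with zeros so runs touching either end still produce a start and an end
    changes = np.diff(np.concatenate(([0], values, [0])))
    run_starts = np.where(changes == 1)[0]
    run_ends = np.where(changes == -1)[0]
    return run_ends - run_starts

# 11 frames at 10 ms each: True = voiced, False = unvoiced
voiced = np.array([1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0], dtype=bool)
pauses = get_run_lengths(~voiced)
print(pauses)  # prints [3 1 2]

# Very short unvoiced runs are usually consonants, not pauses, so filter them
min_pause_frames = 2  # assumed threshold for this toy example
long_pauses = pauses[pauses >= min_pause_frames]
print(long_pauses * 0.01)  # pause durations in seconds
```

On real speech a minimum pause length in the range of 100-200 ms is a common starting point, since shorter unvoiced stretches are mostly stop closures and fricatives.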
5. Feature Selection: Reducing Dimensionality
The curse of dimensionality: 6,373 ComParE features with 500 samples → severe overfitting.
Goal: Reduce to 50-500 most informative features.
Method 1: Random Forest Feature Importance
from sklearn.ensemble import RandomForestClassifier
import numpy as np
# Train Random Forest
rf = RandomForestClassifier(n_estimators=500, max_depth=10, random_state=42)
rf.fit(X_train, y_train)
# Get feature importances
importances = rf.feature_importances_
feature_names = X_train.columns
# Sort by importance
indices = np.argsort(importances)[::-1]
# Select top K features
K = 100
selected_features = feature_names[indices[:K]]
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]
# Retrain on selected features
rf_final = RandomForestClassifier(n_estimators=500)
rf_final.fit(X_train_selected, y_train)
print(f"Accuracy (all features): {rf.score(X_test, y_test):.3f}")
print(f"Accuracy (top {K} features): {rf_final.score(X_test_selected, y_test):.3f}")
# Often: Selected features perform BETTER (less overfitting)
Typical result: 6,373 features (72% accuracy) → 100 features (78% accuracy)
Method 2: Lasso Regularization (L1)
from sklearn.linear_model import LassoCV
# Cross-validated Lasso (automatic alpha selection)
lasso = LassoCV(cv=5, random_state=42, max_iter=10000)
lasso.fit(X_train, y_train)
# Get non-zero coefficients (selected features)
selected_mask = lasso.coef_ != 0
selected_features = X_train.columns[selected_mask]
print(f"Selected {selected_mask.sum()} / {len(X_train.columns)} features")
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]
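Advantage: Automatic; alpha is chosen by cross-validation, so there is nothing to tune by hand. Standardize features first so the penalty treats them equally. Note that LassoCV fits a regression, so for class labels an L1-penalized logistic regression is the more natural choice.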
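For a classification target, the same idea can be sketched with an L1-penalized logistic regression. This is an illustrative variant on synthetic data from `make_classification`, not the article's original method; the sample and feature counts are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a voice feature matrix: 50 features, 5 informative
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=5, n_redundant=0, random_state=0
)

# Scaling lives inside the pipeline so the L1 penalty treats features equally
selector = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=10, cv=5, penalty="l1", solver="liblinear", random_state=0),
)
selector.fit(X, y)

coef = selector[-1].coef_.ravel()
selected_idx = np.flatnonzero(coef != 0)
print(f"Selected {len(selected_idx)} / {X.shape[1]} features")
```

The regularization strength is again chosen by cross-validation, and the features with non-zero coefficients form the selected subset.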
Method 3: Correlation-Based Feature Selection
Remove redundant features: If two features are highly correlated (|r| > 0.9), keep only one.
import numpy as np
import pandas as pd
def remove_correlated_features(X, threshold=0.9):
    """
    Remove features with correlation > threshold
    """
    corr_matrix = X.corr().abs()
    # Get upper triangle (avoid duplicate pairs)
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    # Find features with correlation > threshold
    to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
    print(f"Dropping {len(to_drop)} correlated features")
    return X.drop(columns=to_drop)
X_train_uncorrelated = remove_correlated_features(X_train, threshold=0.9)
Method 4: Sequential Feature Selection
Greedy search: Add features one-by-one, keep those that improve CV score
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier
# Forward selection: start with 0 features, add best one at each step
sfs = SequentialFeatureSelector(
RandomForestClassifier(n_estimators=100),
n_features_to_select=50,
direction='forward',
cv=5,
n_jobs=-1
)
sfs.fit(X_train, y_train)
# Get selected features
selected_features = X_train.columns[sfs.get_support()]
X_train_selected = X_train[selected_features]
Warning: Computationally expensive (roughly K × F candidate evaluations, each one cross-validated, where K = n_features_to_select, F = total features)
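The cost can be worked out exactly. At step k of forward selection, F - k candidate features remain, and each candidate is scored with `cv` model fits, so the total is cv × (K·F − K(K−1)/2). The helper name below is made up for this sketch.

```python
def forward_sfs_fit_count(n_features, n_select, cv=5):
    """Number of model fits performed by forward sequential selection."""
    # Step k evaluates every feature not yet selected: n_features - k candidates
    candidates = sum(n_features - k for k in range(n_select))
    return candidates * cv

egemaps_fits = forward_sfs_fit_count(n_features=88, n_select=50, cv=5)
compare_fits = forward_sfs_fit_count(n_features=6373, n_select=50, cv=5)
print(f"eGeMAPS (88 features):    {egemaps_fits:,} fits")
print(f"ComParE (6,373 features): {compare_fits:,} fits")
```

Selecting 50 features from eGeMAPS takes 15,875 fits, while the same selection from ComParE takes 1,587,125, which is why a cheaper filter (importance ranking or correlation pruning) is usually applied first.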
6. Handling Missing Data
Common scenario: Pitch (F0) undefined during unvoiced segments (consonants, silence).
Strategy 1: Forward Fill
import pandas as pd
# Forward fill: carry last valid value forward
# (fillna(method=...) is deprecated in recent pandas; use ffill()/bfill())
df['F0_filled'] = df['F0'].ffill()
# Or backward fill
df['F0_filled'] = df['F0'].bfill()
Strategy 2: Interpolation
# Linear interpolation
df['F0_interpolated'] = df['F0'].interpolate(method='linear')
# Or cubic spline (smoother)
df['F0_interpolated'] = df['F0'].interpolate(method='cubic')
Strategy 3: Use Unvoiced Flag as Feature
# Instead of filling, use voicing as separate feature
df['is_voiced'] = (~df['F0'].isna()).astype(int)
# Fill with mean (but model learns voicing separately)
df['F0_filled'] = df['F0'].fillna(df['F0'].mean())
# Now model can learn:
# - F0 value (when voiced)
# - Voicing ratio (% time voiced)
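When the end goal is utterance-level statistics, filling may not be needed at all: pandas skips NaN in its aggregations, so the F0 statistics can be computed over voiced frames only and the voicing ratio kept as its own feature. A minimal sketch with a made-up F0 track:

```python
import numpy as np
import pandas as pd

# Made-up frame-level F0 in Hz; NaN marks unvoiced frames
f0 = pd.Series([np.nan, 110.0, 112.0, np.nan, np.nan, 118.0, 120.0, np.nan])
is_voiced = f0.notna()

summary = {
    "f0_mean_voiced": f0.mean(),       # pandas skips NaN by default
    "f0_std_voiced": f0.std(),
    "voicing_ratio": is_voiced.mean(),  # fraction of frames that are voiced
}
print(summary)
```

This avoids the bias that mean-filling introduces into variance estimates, since the filled frames would all sit exactly at the mean.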
Strategy 4: Exclude Missing-Heavy Features
def drop_features_with_missing(X, threshold=0.5):
    """
    Drop features with >threshold fraction missing
    """
    missing_fraction = X.isna().mean()
    to_drop = missing_fraction[missing_fraction > threshold].index
    print(f"Dropping {len(to_drop)} features with >{threshold*100}% missing")
    return X.drop(columns=to_drop)
X_clean = drop_features_with_missing(X, threshold=0.5)
7. Feature Validation & Quality Checks
Check 1: Distribution Sanity
import matplotlib.pyplot as plt
def plot_feature_distributions(X, features_to_plot=10):
    """
    Plot histograms of first N features
    """
    n_cols = 5
    n_rows = -(-features_to_plot // n_cols) # Ceiling division, so any N fits the grid
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 3 * n_rows), squeeze=False)
    axes = axes.flatten()
    for i, feature in enumerate(X.columns[:features_to_plot]):
        axes[i].hist(X[feature].dropna(), bins=50)
        axes[i].set_title(feature)
        axes[i].set_xlabel('Value')
        axes[i].set_ylabel('Count')
    plt.tight_layout()
    plt.savefig('feature_distributions.png')
plot_feature_distributions(X_train)
Red flags:
- All zeros: Feature not computed correctly
- Single value: No variance, won't help model
- Extreme outliers: May need clipping or log transform
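With hundreds of features, eyeballing histograms does not scale, so the red flags above can also be checked programmatically. The sketch below is one possible implementation; the function name, the robust z-score based on the median absolute deviation, and the cutoff of 6 are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def flag_suspicious_features(X, outlier_z=6.0):
    """Return {feature: [flags]} for columns matching the red flags (hypothetical helper)."""
    report = {}
    for col in X.columns:
        values = X[col].dropna()
        flags = []
        if len(values) == 0:
            flags.append("all missing")
        elif (values == 0).all():
            flags.append("all zeros")
        elif values.nunique() == 1:
            flags.append("single value")
        else:
            # Robust z-score: median and MAD are not distorted by the outliers themselves
            median = values.median()
            mad = (values - median).abs().median()
            if mad > 0:
                robust_z = 0.6745 * (values - median).abs() / mad
                if (robust_z > outlier_z).any():
                    flags.append("extreme outliers")
        if flags:
            report[col] = flags
    return report

X_demo = pd.DataFrame({
    "zeros": np.zeros(100),
    "constant": np.full(100, 3.2),
    "clean": np.linspace(0.0, 1.0, 100),
    "spiky": np.append(np.linspace(0.0, 1.0, 99), 1000.0),
})
report = flag_suspicious_features(X_demo)
print(report)
```

Flagged features still deserve a look at the histogram before being dropped or transformed, since heavy-tailed acoustic measures such as jitter can trip an outlier check legitimately.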
Check 2: Train/Test Distribution Shift
from scipy.stats import ks_2samp
def check_distribution_shift(X_train, X_test, threshold=0.05):
    """
    Use Kolmogorov-Smirnov test to detect train/test shift
    """
    shifted_features = []
    for feature in X_train.columns:
        statistic, p_value = ks_2samp(
            X_train[feature].dropna(),
            X_test[feature].dropna()
        )
        if p_value < threshold:
            shifted_features.append((feature, p_value))
    if shifted_features:
        print(f"⚠️ {len(shifted_features)} features show train/test shift:")
        for feature, p_value in sorted(shifted_features, key=lambda x: x[1])[:10]:
            print(f" {feature}: p={p_value:.4f}")
    else:
        print("✓ No significant distribution shift detected")
check_distribution_shift(X_train, X_test)
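One caveat with testing every feature at p < 0.05: with hundreds of features, about 5% will be flagged by chance even when train and test come from the same distribution. A simple guard is a Bonferroni correction, sketched below on synthetic data. The function name and the simulated shift on `feat_0` are assumptions for this example.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def shifted_features_bonferroni(X_train, X_test, alpha=0.05):
    """KS test per feature with a Bonferroni-corrected threshold (hypothetical helper)."""
    corrected_alpha = alpha / len(X_train.columns)
    shifted = []
    for feature in X_train.columns:
        result = ks_2samp(X_train[feature].dropna(), X_test[feature].dropna())
        if result.pvalue < corrected_alpha:
            shifted.append((feature, result.statistic, result.pvalue))
    return shifted

rng = np.random.default_rng(0)
cols = [f"feat_{i}" for i in range(20)]
X_train = pd.DataFrame(rng.normal(size=(300, 20)), columns=cols)
X_test = pd.DataFrame(rng.normal(size=(300, 20)), columns=cols)
X_test["feat_0"] += 2.0  # simulate a microphone or gain change on one feature

shifted = shifted_features_bonferroni(X_train, X_test)
for name, statistic, pvalue in shifted:
    print(f"{name}: KS={statistic:.2f}, p={pvalue:.2e}")
```

Reporting the KS statistic next to the p-value is useful because, on large datasets, tiny and harmless shifts can still reach significance; the statistic shows how large the shift actually is.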
Check 3: Feature-Label Correlation
import numpy as np
import pandas as pd
def analyze_feature_label_correlation(X, y, top_k=20):
    """
    Find features most correlated with target
    """
    correlations = {}
    for feature in X.columns:
        corr = np.corrcoef(X[feature].fillna(X[feature].mean()), y)[0, 1]
        correlations[feature] = abs(corr)
    # Sort by absolute correlation
    sorted_corr = sorted(correlations.items(), key=lambda x: x[1], reverse=True)
    print(f"Top {top_k} features correlated with target:")
    for i, (feature, corr) in enumerate(sorted_corr[:top_k], 1):
        print(f"{i}. {feature}: {corr:.3f}")
    return sorted_corr
correlations = analyze_feature_label_correlation(X_train, y_train)
8. Production Feature Pipeline
Putting it all together: reproducible, version-controlled pipeline
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
class VoiceFeatureExtractor(BaseEstimator, TransformerMixin):
    """
    Custom transformer for voice feature extraction
    Input: DataFrame with one row per recording (audio path + speaker ID)
    """
    def __init__(self, feature_set='eGeMAPSv02', path_col='path', speaker_id_col='speaker_id'):
        self.feature_set = feature_set
        self.path_col = path_col
        self.speaker_id_col = speaker_id_col
    def fit(self, X, y=None):
        # Initialize openSMILE
        import opensmile
        self.smile_ = opensmile.Smile(
            feature_set=getattr(opensmile.FeatureSet, self.feature_set),
            feature_level=opensmile.FeatureLevel.Functionals,
        )
        return self
    def transform(self, X):
        features_list = []
        for audio_path in X[self.path_col]:
            features = self.smile_.process_file(audio_path)
            features_list.append(features)
        features = pd.concat(features_list, ignore_index=True)
        # Keep speaker IDs so the next step can normalize per speaker
        features[self.speaker_id_col] = X[self.speaker_id_col].to_numpy()
        return features
class SpeakerNormalizer(BaseEstimator, TransformerMixin):
    """
    Z-score normalization per speaker
    Drops the speaker ID column so later steps see numeric features only
    """
    def __init__(self, speaker_id_col='speaker_id'):
        self.speaker_id_col = speaker_id_col
    def fit(self, X, y=None):
        feature_cols = X.columns.drop(self.speaker_id_col)
        # Compute per-speaker means and stds, plus global stats for unseen speakers
        self.speaker_stats_ = {
            speaker_id: {'mean': group[feature_cols].mean(), 'std': group[feature_cols].std()}
            for speaker_id, group in X.groupby(self.speaker_id_col)
        }
        self.global_stats_ = {'mean': X[feature_cols].mean(), 'std': X[feature_cols].std()}
        return self
    def transform(self, X):
        feature_cols = X.columns.drop(self.speaker_id_col)
        X_norm = X[feature_cols].astype(float)
        for speaker_id in X[self.speaker_id_col].unique():
            mask = X[self.speaker_id_col] == speaker_id
            # Unseen speaker: use global training stats
            stats = self.speaker_stats_.get(speaker_id, self.global_stats_)
            # std is NaN for a speaker with a single recording
            std = stats['std'].fillna(0) + 1e-8
            X_norm.loc[mask, feature_cols] = (X.loc[mask, feature_cols] - stats['mean']) / std
        return X_norm
# Build pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
pipeline = Pipeline([
    ('extract', VoiceFeatureExtractor(feature_set='eGeMAPSv02')),
    ('normalize', SpeakerNormalizer()),
    ('select', SelectFromModel(RandomForestClassifier(n_estimators=100), threshold='median')),
    ('classify', RandomForestClassifier(n_estimators=500, max_depth=10)),
])
# Train (paths and speaker IDs below are placeholders)
train_df = pd.DataFrame({
    'path': ['audio1.wav', 'audio2.wav', ...],
    'speaker_id': ['A', 'A', ...],
})
pipeline.fit(train_df, y_train)
# Predict
test_df = pd.DataFrame({
    'path': ['test1.wav', 'test2.wav', ...],
    'speaker_id': ['C', 'D', ...],
})
predictions = pipeline.predict(test_df)
9. Common Pitfalls & How to Avoid Them
Pitfall 1: Data Leakage via Normalization
Mistake: Normalize entire dataset before train/test split
# WRONG
X_normalized = (X - X.mean()) / X.std() # Uses test data statistics!
X_train, X_test = train_test_split(X_normalized)
Fix: Fit scaler on train only, apply to test
# CORRECT
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test = train_test_split(X)
scaler = StandardScaler()
scaler.fit(X_train) # Only train data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use train statistics
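The same rule applies inside cross-validation: the scaler has to be refit on each training fold. Wrapping the scaler and the model in a scikit-learn `Pipeline` handles this automatically. The sketch below uses synthetic data from `make_classification` as a stand-in for a voice feature matrix.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a voice feature matrix
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# The scaler is fit on the training part of every fold, never on the held-out part
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Feature selection steps belong inside the pipeline for the same reason: selecting features on the full dataset before cross-validation leaks label information into the held-out folds.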
Pitfall 2: Speaker Leakage in Train/Test Split
Mistake: Same speaker in train and test
# WRONG: Random split may put same speaker in train and test
X_train, X_test = train_test_split(X, y)
Fix: Use GroupKFold for speaker-independent evaluation
# CORRECT: Ensure train and test have different speakers
# (X, y, speaker_ids are NumPy arrays here; use .iloc[...] for pandas objects)
from sklearn.model_selection import GroupKFold
gkf = GroupKFold(n_splits=5)
scores = []
for train_idx, test_idx in gkf.split(X, y, groups=speaker_ids):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Train and evaluate
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
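For a single held-out test set instead of K folds, `GroupShuffleSplit` gives a speaker-disjoint split, and a one-line check confirms that no speaker appears on both sides. The speaker IDs and data below are synthetic placeholders.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic setup: 10 speakers with 5 recordings each
speaker_ids = np.repeat([f"spk{i}" for i in range(10)], 5)
X = np.random.default_rng(0).normal(size=(50, 4))
y = np.tile([0, 1], 25)

# test_size applies to groups, so 20% of the speakers are held out
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=speaker_ids))

# Sanity check: no speaker may appear in both train and test
overlap = set(speaker_ids[train_idx]) & set(speaker_ids[test_idx])
print(f"Train speakers: {len(set(speaker_ids[train_idx]))}, "
      f"test speakers: {len(set(speaker_ids[test_idx]))}, overlap: {len(overlap)}")
```

Keeping the overlap check as an assertion in the training script is a cheap safeguard against a later refactor silently reintroducing a random split.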
Pitfall 3: Overfitting to Functionals
Problem: Using 98 functionals on 25 LLDs → 2,450 features, many redundant
Fix: Start with minimal functionals (mean, std, range), add others only if CV score improves
# Minimal functionals (interpretable)
functionals_minimal = ['mean', 'std', 'min', 'max', 'range']
# Extended functionals (add if needed)
functionals_extended = functionals_minimal + [
'quartile1', 'quartile2', 'quartile3',
'linregc1', # Slope
'skewness', 'kurtosis'
]
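As a sketch of how the minimal functionals could be computed from a frame-level LLD table, the helper below uses pandas aggregation. The function name, the column names, and the values are made up for illustration; it is not openSMILE's own functional extraction.

```python
import numpy as np
import pandas as pd

def summarize_llds(llds, functionals=("mean", "std", "min", "max")):
    """Collapse frame-level LLDs into one row of utterance-level functionals."""
    # One row per functional, one column per LLD
    stats = llds.agg(list(functionals))
    stats.loc["range"] = llds.max() - llds.min()
    flat = {
        f"{col}_{stat}": stats.loc[stat, col]
        for col in llds.columns
        for stat in stats.index
    }
    return pd.Series(flat)

# Toy LLD table: 4 frames, 2 descriptors
demo_llds = pd.DataFrame({
    "F0": [1.0, 2.0, 3.0, 4.0],
    "Loudness": [0.0, 2.0, 0.0, 2.0],
})
functionals_row = summarize_llds(demo_llds)
print(functionals_row)
```

With 25 LLDs this yields 125 features per utterance, a far smaller starting point than the 2,450 produced by the full functional set.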
The Bottom Line: Feature Engineering Workflow
For most voice analysis tasks, follow this workflow:
- Extract eGeMAPS features (88 features, best baseline)
- Add domain-specific features (5-20 features based on task)
- Normalize per-speaker or per-gender (if multiple recordings available)
- Add delta features (if using LLDs, not functionals)
- Train Random Forest baseline (no feature selection)
- If accuracy insufficient:
- Extract ComParE (6,373 features)
- Use Random Forest feature importance to select top 100-500
- Retrain on selected features
- If still insufficient:
- Add Wav2vec 2.0 embeddings (768 features)
- Ensemble classical ML + deep learning
Expected performance progression:
- Raw features (no engineering): 60-70% accuracy
- + Normalization: 65-75%
- + Domain features: 70-80%
- + Feature selection: 72-83%
- + Deep learning ensemble: 75-88%
Remember: Feature engineering is iterative. Start simple (eGeMAPS + Random Forest), then add complexity only where CV score improves.
Voice Mirror's feature engineering pipeline combines eGeMAPS baseline (88 features), task-specific features (age tremor, PD micropauses, depression prosody), per-speaker normalization, and Random Forest feature selection from ComParE (6,373 → 200 features). Our hybrid approach delivers 78-89% accuracy across 20+ voice analysis tasks while maintaining interpretability for clinical applications.