openSMILE Configuration Guide: Mastering Feature Extraction for Voice Analysis
Complete guide to openSMILE configuration, from standard feature sets (GeMAPS, eGeMAPS) to custom pipelines. Learn component architecture, Python integration, and production optimization.
openSMILE Configuration Guide: Your Complete Feature Extraction Toolkit
When you're building voice analysis systems, openSMILE (open-source Speech and Music Interpretation by Large-space Extraction) is your Swiss Army knife. This powerful toolkit extracts 6,000+ acoustic features from audio with millisecond precision—but mastering its configuration unlocks the real power.
In this guide, you'll learn how to configure openSMILE for any voice analysis task, from using standard feature sets to building custom extraction pipelines.
Why openSMILE?
The gold standard for voice research: openSMILE has been used in 1,000+ academic papers, including winning systems in the INTERSPEECH Computational Paralinguistics Challenge (ComParE) since 2009.
Key advantages:
- Comprehensive: 6,000+ features covering pitch, energy, spectral, cepstral, voice quality
- Fast: Real-time processing (processes 10-minute audio in <1 second on modern CPU)
- Research-validated: Standard feature sets (GeMAPS, eGeMAPS) used across 200+ studies
- Flexible: Modular component system lets you build custom pipelines
- Production-ready: C++ core with Python bindings, runs on server or edge devices
openSMILE Architecture Overview
openSMILE uses a component-based pipeline architecture:
Audio Input → Framing → LLD Extraction → Functionals → Feature Vector Output
1. Audio Input: Reads WAV, MP3, or live audio stream
2. Framing: Splits audio into overlapping windows (e.g., 25ms windows, 10ms shift)
3. Low-Level Descriptors (LLDs): Frame-by-frame features extracted every 10ms:
- F0 (pitch)
- Energy (loudness)
- MFCCs (spectral shape)
- Jitter, shimmer (voice quality)
- Spectral flux, rolloff, centroid
4. Functionals: Statistical summaries of LLDs across larger time windows:
- Mean, standard deviation, min, max, range
- Quartiles (25th, 50th, 75th percentile)
- Slopes (linear regression of feature over time)
- Moments (skewness, kurtosis)
5. Output: Fixed-length feature vector (e.g., 62 features for GeMAPS, 88 for eGeMAPS) for machine learning
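The pipeline above can be sketched in a few lines of plain Python. This is a toy illustration, not openSMILE's implementation: the helper names (frame_signal, rms_energy, functionals) are made up for this example, RMS energy stands in for the real LLDs, and a synthetic sine wave stands in for audio.

```python
import math
import statistics

def frame_signal(samples, sample_rate, frame_size=0.025, frame_step=0.010):
    """Split samples into overlapping frames (sizes given in seconds)."""
    size = int(round(frame_size * sample_rate))
    step = int(round(frame_step * sample_rate))
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, step)]

def rms_energy(frame):
    """One toy LLD: root-mean-square energy of a single frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def functionals(lld_values):
    """Collapse a variable-length LLD track into fixed-length statistics."""
    return {
        "mean": statistics.fmean(lld_values),
        "stddev": statistics.pstdev(lld_values),
        "min": min(lld_values),
        "max": max(lld_values),
    }

# One second of a 200 Hz sine at 16 kHz stands in for real audio.
sample_rate = 16000
signal = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(sample_rate)]

frames = frame_signal(signal, sample_rate)        # step 2: framing
energy_track = [rms_energy(f) for f in frames]    # step 3: one LLD value per frame
feature_vector = functionals(energy_track)        # step 4: functionals
print(len(frames), feature_vector)                # step 5: fixed-length output
```

However long the input is, the output has the same four entries, which is what makes functionals convenient as input for classical ML models.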
Standard Feature Sets
openSMILE provides pre-configured feature sets optimized for different tasks. You don't need to understand every component—just pick the right set for your use case.
1. GeMAPS (Geneva Minimalistic Acoustic Parameter Set)
Who should use it: Voice analysis beginners, applications prioritizing interpretability
Features: 62 features (18 LLDs summarized by functionals, plus 6 temporal features)
LLDs extracted:
- F0 (fundamental frequency / pitch)
- Loudness
- Jitter, shimmer (voice quality)
- HNR (harmonics-to-noise ratio)
- Alpha ratio (spectral slope)
- Hammarberg index (spectral balance)
- Spectral slopes (0-500Hz, 500-1500Hz)
- F1, F2, F3 (first 3 formant frequencies)
- F1 bandwidth
- F1, F2, F3 relative energy
- Harmonic differences (H1-H2, H1-A3)
Temporal features:
- Rate of loudness peaks (speech rhythm)
- Mean length/duration of continuous voicing/unvoicing (pause patterns)
Research validation: Developed by the University of Geneva and TUM, used in 100+ papers. Performs comparably to larger feature sets for emotion (78% vs 81%), depression (68% vs 72%), and personality (r=0.24 vs r=0.28).
When to use GeMAPS:
- Interpretability matters: Clinical applications where you need to explain predictions
- Limited data: 62 features reduce overfitting risk with <500 samples
- Fast inference: Lower dimensionality → faster predictions
2. eGeMAPS (extended GeMAPS)
Who should use it: Most voice analysis applications (best balance of performance vs complexity)
Features: 88 features (25 LLDs summarized by functionals, plus 6 temporal features and the equivalent sound level)
Additional LLDs vs GeMAPS:
- MFCC 1-4 (mel-frequency cepstral coefficients)
- Spectral flux (rate of spectral change)
- F2, F3 bandwidths
Research validation: The most widely used feature set in voice analysis research (2016-2024). Used in AVEC Depression Challenge (71% accuracy), ComParE Emotion Challenge (78% accuracy), and 50+ health screening studies.
Performance vs GeMAPS:
- Emotion recognition: 81% vs 78% (eGeMAPS better)
- Depression screening: 72% vs 68% (eGeMAPS better)
- Gender classification: 98% vs 98% (no difference)
- Age estimation: MAE 5.9 vs 6.4 years (eGeMAPS better)
When to use eGeMAPS:
- Default choice: Best starting point for 90% of voice analysis tasks
- Sufficient data: >500 samples to avoid overfitting
- Need SOTA performance: Extra features capture subtle variations
3. ComParE (Computational Paralinguistics Challenge)
Who should use it: Researchers, applications with >5,000 samples, when you need maximum information
Features: 6,373 features (functionals applied to 65 LLDs and their deltas)
LLDs extracted (in addition to eGeMAPS):
- MFCC 1-14 (all coefficients)
- Delta coefficients (temporal derivatives of MFCCs)
- LSP (Line Spectral Pairs) frequencies 0-7
- F0 envelope (pitch variation patterns)
- Voicing probability
- Spectral harmonicity, psychoacoustic sharpness, spectral variance
Functionals (98 statistical summaries):
- All GeMAPS functionals (mean, std, min, max, range, quartiles, slope)
- Additional moments: skewness, kurtosis
- Position of extrema (argmin, argmax)
- Linear/quadratic regression coefficients
- Percentile ranges (95th-5th, 75th-25th)
- Up-level time (% time above mean)
Research validation: Official feature set for INTERSPEECH ComParE challenges (2009-present). Winning systems have used ComParE for emotion (85% accuracy), speaker traits (83%), and medical conditions (89%).
Performance gains vs eGeMAPS:
- Complex tasks: 3-8% accuracy improvement (emotion, depression, personality)
- Simple tasks: No improvement (gender, age)
- Requires feature selection: Use Random Forest importance or Lasso to reduce to 100-500 features
When to use ComParE:
- Large datasets: >5,000 samples to avoid overfitting
- Complex tasks: Subtle conditions (depression, cognitive load)
- Feature selection: Plan to use dimensionality reduction
- Research context: Benchmarking against prior studies
Python Integration: The Easy Way
The opensmile Python package (by audEERING, openSMILE creators) provides zero-configuration feature extraction for standard sets.
Installation
pip install opensmile
Basic Usage: Extract eGeMAPS Features
import opensmile
import numpy as np
# Initialize feature extractor
smile = opensmile.Smile(
feature_set=opensmile.FeatureSet.eGeMAPSv02,
feature_level=opensmile.FeatureLevel.Functionals,
)
# Extract features from audio file
features = smile.process_file('audio.wav')
# Returns pandas DataFrame with 88 columns (eGeMAPS features)
print(features.shape) # (1, 88)
print(features.columns[:5])
# ['F0semitoneFrom27.5Hz_sma3nz_amean',
# 'F0semitoneFrom27.5Hz_sma3nz_stddevNorm',
# 'F0semitoneFrom27.5Hz_sma3nz_percentile20.0', ...]
Processing Multiple Files
import os
import pandas as pd
audio_files = ['speaker1.wav', 'speaker2.wav', 'speaker3.wav']
all_features = []
for audio_path in audio_files:
    features = smile.process_file(audio_path)
    all_features.append(features)
# Concatenate into single DataFrame
df = pd.concat(all_features, ignore_index=True)
print(df.shape) # (3, 88)
Switching Feature Sets
# GeMAPS (minimal)
smile_gemaps = opensmile.Smile(
feature_set=opensmile.FeatureSet.GeMAPSv01b,
feature_level=opensmile.FeatureLevel.Functionals,
)
# ComParE (comprehensive)
smile_compare = opensmile.Smile(
feature_set=opensmile.FeatureSet.ComParE_2016,
feature_level=opensmile.FeatureLevel.Functionals,
)
# eGeMAPS with LLDs (frame-by-frame features)
smile_llds = opensmile.Smile(
feature_set=opensmile.FeatureSet.eGeMAPSv02,
feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)
features_llds = smile_llds.process_file('audio.wav')
print(features_llds.shape) # (time_frames, 25) - 25 LLDs per frame
Real-Time Processing with Live Audio
import sounddevice as sd
# Initialize extractor
smile = opensmile.Smile(
feature_set=opensmile.FeatureSet.eGeMAPSv02,
feature_level=opensmile.FeatureLevel.Functionals,
)
# Record 5 seconds
sample_rate = 16000
audio = sd.rec(int(5 * sample_rate), samplerate=sample_rate, channels=1)
sd.wait()
# Process numpy array directly
features = smile.process_signal(audio.flatten(), sample_rate)
print(features.shape) # (1, 88)
Custom Configuration Files
For advanced use cases—custom feature combinations, real-time processing, or integrating with C++ applications—you'll work with openSMILE configuration files (.conf).
Configuration files define the component graph: which components run, in what order, and how data flows between them.
Configuration File Structure
openSMILE configs use INI-style syntax:
[component_name:component_type]
reader.dmLevel = wave
writer.dmLevel = llds
parameter1 = value1
parameter2 = value2
Key concepts:
- Data Memory (DM): Internal buffer where components read/write data. Each "level" is a named buffer (e.g., "wave", "pitch", "llds").
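- Components: Processing units that read from one or more DM levels and write to a level of their own. Each level has exactly one writer; a reader can combine several levels by listing them separated by semicolons (e.g., reader.dmLevel = pitch;energy).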
- Component types: Pre-defined classes (cWaveSource, cFramer, cPitchShs, cFunctionals).
Example: Minimal Pitch Extraction Config
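;; pitch_extraction.conf
;; Extract F0 (pitch) from audio
;; Stage layout follows the prosody configs shipped with openSMILE;
;; confirm option names for your version with: SMILExtract -H cPitchShs
[componentInstances:cComponentManager]
instance[dataMemory].type = cDataMemory
instance[waveSource].type = cWaveSource
instance[framer].type = cFramer
instance[windower].type = cWindower
instance[fft].type = cTransformFFT
instance[fftMag].type = cFFTmagphase
instance[specScale].type = cSpecScale
instance[pitchExtractor].type = cPitchShs
instance[csvSink].type = cCsvSink
;;;;;;;;;; 1. Audio Input ;;;;;;;;;;
[waveSource:cWaveSource]
filename = input.wav
writer.dmLevel = wave
;;;;;;;;;; 2. Framing ;;;;;;;;;;
; 25ms windows, 10ms shift (overlap)
[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames
frameSize = 0.025
frameStep = 0.010
frameCenterSpecial = left
;;;;;;;;;; 3. Spectrum (cPitchShs reads a scaled magnitude spectrum) ;;;;;;;;;;
[windower:cWindower]
reader.dmLevel = frames
writer.dmLevel = windowed
winFunc = han
[fft:cTransformFFT]
reader.dmLevel = windowed
writer.dmLevel = fft
[fftMag:cFFTmagphase]
reader.dmLevel = fft
writer.dmLevel = fftmag
[specScale:cSpecScale]
reader.dmLevel = fftmag
writer.dmLevel = specscale
scale = octave
sourceScale = lin
;;;;;;;;;; 4. Pitch Extraction ;;;;;;;;;;
; Search range in Hz; F0C1 writes the best candidate as a single F0 value
[pitchExtractor:cPitchShs]
reader.dmLevel = specscale
writer.dmLevel = pitch
minPitch = 50
maxPitch = 500
F0C1 = 1
;;;;;;;;;; 5. Output to CSV ;;;;;;;;;;
[csvSink:cCsvSink]
reader.dmLevel = pitch
filename = pitch_output.csv
append = 0
timestamp = 1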
Running the config:
SMILExtract -C pitch_extraction.conf
Output: CSV file with time-stamped pitch values (one row per 10ms frame).
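To use the pitch track downstream, the CSV can be read back in Python. The sketch below makes a few assumptions to check against your own output file: a semicolon delimiter, a time column named frameTime, a pitch column named F0, and unvoiced frames written as 0. The read_pitch_csv helper is illustrative and not part of openSMILE.

```python
import csv
import io

def read_pitch_csv(handle, delimiter=";", time_col="frameTime", f0_col="F0"):
    """Return (time_seconds, f0_hz) pairs from a cCsvSink-style file.

    Column names and delimiter are assumptions; adjust them to match
    the header row of your own output file.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [(float(row[time_col]), float(row[f0_col])) for row in reader]

# An in-memory sample stands in for pitch_output.csv.
sample = io.StringIO("frameTime;F0\n0.00;0.0\n0.01;118.5\n0.02;121.0\n")
track = read_pitch_csv(sample)

# Keep voiced frames only (assumes unvoiced frames are written as 0).
voiced = [f0 for _, f0 in track if f0 > 0]
print(len(track), voiced)
```

For a real file, replace the StringIO object with open('pitch_output.csv', newline='').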
Example: Gender Classification Feature Set
Let's build a custom config that extracts 6 features optimal for gender classification:
- F0 (pitch) mean, std
- Formant 1 (F1) mean
- Formant 2 (F2) mean
- Spectral centroid mean
- HNR (harmonics-to-noise ratio) mean
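;; gender_features.conf
;; Stage layout follows the configs shipped with openSMILE; confirm option
;; names for your version with: SMILExtract -H <componentType>
[componentInstances:cComponentManager]
instance[dataMemory].type = cDataMemory
instance[waveSource].type = cWaveSource
instance[framer].type = cFramer
instance[windower].type = cWindower
instance[fft].type = cTransformFFT
instance[fftMag].type = cFFTmagphase
instance[specScale].type = cSpecScale
instance[pitchExtractor].type = cPitchShs
instance[lpc].type = cLpc
instance[formantExtractor].type = cFormantLpc
instance[spectralExtractor].type = cSpectral
instance[hnrExtractor].type = cHarmonics
instance[functionals].type = cFunctionals
instance[csvSink].type = cCsvSink
;;;;;;;;;; Audio Input ;;;;;;;;;;
[waveSource:cWaveSource]
filename = \cm[inputfile(I){input.wav}:name of input file]
writer.dmLevel = wave
;;;;;;;;;; Framing ;;;;;;;;;;
[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames
frameSize = 0.025
frameStep = 0.010
;;;;;;;;;; Spectrum (shared by pitch and spectral features) ;;;;;;;;;;
[windower:cWindower]
reader.dmLevel = frames
writer.dmLevel = windowed
[fft:cTransformFFT]
reader.dmLevel = windowed
writer.dmLevel = fft
[fftMag:cFFTmagphase]
reader.dmLevel = fft
writer.dmLevel = fftmag
[specScale:cSpecScale]
reader.dmLevel = fftmag
writer.dmLevel = specscale
scale = octave
sourceScale = lin
;;;;;;;;;; F0 Extraction ;;;;;;;;;;
[pitchExtractor:cPitchShs]
reader.dmLevel = specscale
writer.dmLevel = pitch
minPitch = 50
maxPitch = 500
F0C1 = 1
;;;;;;;;;; Formant Extraction ;;;;;;;;;;
; cFormantLpc reads LPC coefficients, so cLpc runs first (p = LPC order)
[lpc:cLpc]
reader.dmLevel = windowed
writer.dmLevel = lpc
p = 8
saveLPCoeff = 1
[formantExtractor:cFormantLpc]
reader.dmLevel = lpc
writer.dmLevel = formants
nFormants = 3
saveIntensity = 0
;;;;;;;;;; Spectral Features ;;;;;;;;;;
[spectralExtractor:cSpectral]
reader.dmLevel = fftmag
writer.dmLevel = spectral
centroid = 1
flux = 0
;;;;;;;;;; HNR Extraction ;;;;;;;;;;
; Placeholder: in the GeMAPS configs, HNR is computed by cPitchJitter
; (logHNR = 1) from the waveform and an F0 track, so treat this block as a sketch
[hnrExtractor:cHarmonics]
reader.dmLevel = frames
writer.dmLevel = hnr
outputHnr = 1
;;;;;;;;;; Functionals (Statistical Summaries) ;;;;;;;;;;
; Each level has one writer, so cFunctionals reads the extractor levels concatenated
[functionals:cFunctionals]
reader.dmLevel = pitch;formants;spectral;hnr
writer.dmLevel = functionals
frameMode = full
functionalsEnabled = Moments
Moments.amean = 1
Moments.stddev = 1
;;;;;;;;;; Output ;;;;;;;;;;
[csvSink:cCsvSink]
reader.dmLevel = functionals
filename = \cm[outputfile(O){output.csv}:output CSV file]
append = 0
timestamp = 0
instanceName = 1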
Running with custom input/output:
SMILExtract -C gender_features.conf -I speaker1.wav -O speaker1_features.csv
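Output: CSV with one row of functionals. cFunctionals applies every enabled functional to every LLD field, so the file holds a mean and standard deviation per field; keep the six features listed above (F0 mean/std, F1 mean, F2 mean, spectral centroid mean, HNR mean) when loading it.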
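When many files need the same config, the command line above can be driven from Python. The following is a minimal sketch: it assumes a SMILExtract binary on the PATH, the helper names (build_smilextract_cmd, extract_all) are made up for this example, and only the -C, -I and -O options shown above are used.

```python
import subprocess
from pathlib import Path

def build_smilextract_cmd(config, wav_path, out_csv, binary="SMILExtract"):
    """Build the argument list for one SMILExtract run (-C config, -I input, -O output)."""
    return [binary, "-C", str(config), "-I", str(wav_path), "-O", str(out_csv)]

def extract_all(config, wav_paths, out_dir, run=subprocess.run):
    """Run the config once per file; returns the list of output CSV paths.

    `run` is injectable so the function can be exercised without the binary.
    """
    out_dir = Path(out_dir)
    outputs = []
    for wav in wav_paths:
        out_csv = out_dir / (Path(wav).stem + "_features.csv")
        run(build_smilextract_cmd(config, wav, out_csv), check=True)
        outputs.append(out_csv)
    return outputs

# Dry run: record the commands instead of executing them.
recorded = []
outputs = extract_all(
    "gender_features.conf",
    ["data/speaker1.wav", "data/speaker2.wav"],
    "features",
    run=lambda cmd, check: recorded.append(cmd),
)
print(recorded[0])
```

Passing check=True makes a failed extraction raise instead of silently producing a missing CSV.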
Key Components Reference
Here are the most useful openSMILE components for voice analysis:
Audio Input/Output
| Component | Purpose | Key Parameters |
|---|---|---|
| cWaveSource | Read audio file | filename, monoMixdown=1 (convert stereo to mono) |
| cPortaudioSource | Live microphone input | audioBuffersize, sampleRate |
| cCsvSink | Write features to CSV | filename, append, timestamp |
Preprocessing
| Component | Purpose | Key Parameters |
|---|---|---|
| cFramer | Split audio into frames | frameSize=0.025 (25ms), frameStep=0.010 (10ms shift) |
| cWindower | Apply window function | winFunc=ham (Hamming), winFunc=han (Hanning) |
| cPreemphasis | High-pass filter | k=0.97 (pre-emphasis coefficient) |
Feature Extraction
| Component | Purpose | Key Parameters |
|---|---|---|
| cPitchShs | F0 (pitch) via subharmonic summation | minPitch, maxPitch, voicingCutoff |
| cIntensity | Intensity and loudness | intensity=1, loudness=1 |
| cMfcc | MFCCs (spectral shape) | nMfcc=13, htkcompatible=1 |
| cFormantLpc | Formants (F1, F2, F3) via LPC | nFormants=3 (LPC order is set with p on the upstream cLpc) |
| cSpectral | Spectral features (centroid, flux, rolloff) | centroid=1, flux=1, rollOff |
| cHarmonics | Harmonic amplitudes and differences (e.g., H1-H2, H1-A3) | nHarmonics, harmonicDifferences |
| cPitchJitter | Jitter, shimmer, and HNR | jitterLocal=1, jitterDDP=1, shimmerLocal=1, logHNR=1, F0reader.dmLevel (requires pitch) |
Functionals (Statistical Summaries)
| Component | Purpose | Key Parameters |
|---|---|---|
| cFunctionals | Apply statistical functions to LLDs | functionalsEnabled=Moments;Extremes;Percentiles;Regression (each group has its own switches, e.g., Moments.stddev=1) |
| cDeltaRegression | Temporal derivatives (delta, delta-delta) | nameAppend=de (suffix for delta features) |
Advanced Configuration Patterns
1. Real-Time Feature Extraction
For live applications (e.g., voice assistants, real-time emotion detection), you need incremental functionals that update every N frames instead of waiting for full audio.
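[componentInstances:cComponentManager]
instance[dataMemory].type = cDataMemory
instance[portaudioInput].type = cPortaudioSource
instance[framer].type = cFramer
instance[windower].type = cWindower
instance[fft].type = cTransformFFT
instance[fftMag].type = cFFTmagphase
instance[specScale].type = cSpecScale
instance[pitchExtractor].type = cPitchShs
instance[functionals].type = cFunctionals
instance[realtimeOutput].type = cCsvSink
[portaudioInput:cPortaudioSource]
writer.dmLevel = wave
audioBuffersize = 4000
sampleRate = 16000
[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames
frameSize = 0.025
frameStep = 0.010
; Spectrum chain: cPitchShs reads a scaled magnitude spectrum
[windower:cWindower]
reader.dmLevel = frames
writer.dmLevel = windowed
[fft:cTransformFFT]
reader.dmLevel = windowed
writer.dmLevel = fft
[fftMag:cFFTmagphase]
reader.dmLevel = fft
writer.dmLevel = fftmag
[specScale:cSpecScale]
reader.dmLevel = fftmag
writer.dmLevel = specscale
scale = octave
sourceScale = lin
[pitchExtractor:cPitchShs]
reader.dmLevel = specscale
writer.dmLevel = llds
[functionals:cFunctionals]
reader.dmLevel = llds
writer.dmLevel = functionals
; Key setting: 1-second windows, recomputed every 100ms
frameMode = fixed
frameSize = 1.0
frameStep = 0.1
functionalsEnabled = Moments
Moments.amean = 1
Moments.stddev = 1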
[realtimeOutput:cCsvSink]
reader.dmLevel = functionals
filename = realtime_features.csv
append = 1
timestamp = 1
Result: Features update every 100ms (frameStep=0.1), computing statistics over sliding 1-second windows.
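The windowing arithmetic can be checked offline with a short Python sketch. It assumes an LLD track sampled at 100 frames per second (one value per 10ms step); sliding_functionals is an illustrative helper, not an openSMILE API.

```python
import statistics

def sliding_functionals(values, frame_rate=100, window_s=1.0, step_s=0.1):
    """Mean and stddev over overlapping windows of an LLD track.

    frame_rate is the number of LLD frames per second (100 for a 10 ms step).
    """
    win = int(round(window_s * frame_rate))
    hop = int(round(step_s * frame_rate))
    out = []
    for start in range(0, len(values) - win + 1, hop):
        chunk = values[start:start + win]
        out.append({
            "start_s": start / frame_rate,
            "mean": statistics.fmean(chunk),
            "stddev": statistics.pstdev(chunk),
        })
    return out

# Three seconds of a synthetic F0 track stand in for real LLD output.
f0_track = [100.0 + (i % 10) for i in range(300)]
windows = sliding_functionals(f0_track)
print(len(windows), windows[0])
```

Three seconds of input give 21 windows here: the first becomes available after one full second, then one more every 100ms.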
2. Multi-Level Feature Extraction
Extract features at multiple time scales (frame-level, utterance-level, turn-level):
[componentInstances:cComponentManager]
; ... (framer, pitch extractor, etc.)
instance[frameLevelFunctionals].type = cFunctionals
instance[utteranceLevelFunctionals].type = cFunctionals
; Frame-level: 1-second windows
[frameLevelFunctionals:cFunctionals]
reader.dmLevel = llds
writer.dmLevel = frame_features
frameMode = fixed
frameSize = 1.0
frameStep = 0.5
functionalsEnabled = Moments
Moments.amean = 1
Moments.stddev = 1
; Utterance-level: Full audio
[utteranceLevelFunctionals:cFunctionals]
reader.dmLevel = llds
writer.dmLevel = utterance_features
frameMode = full
functionalsEnabled = Moments;Extremes;Percentiles;Regression
; Each group has its own switches (e.g., Extremes.min, Percentiles.quartiles,
; Regression.linregc1); list them with: SMILExtract -H cFunctionals
3. Feature Normalization
Normalize features for speaker independence:
[componentInstances:cComponentManager]
; ... (feature extraction)
instance[normalizer].type = cVectorMVN
[normalizer:cVectorMVN]
reader.dmLevel = llds
writer.dmLevel = llds_norm
; Z-score normalization: (x - mean) / std, with running statistics
; updated as frames arrive
mode = incremental
meanEnable = 1
stdEnable = 1
; For min-max scaling and other modes, check: SMILExtract -H cVectorMVN
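When features are already extracted, the same normalization can be applied offline. Below is a minimal sketch of per-speaker z-scoring, (x - mean) / std, in plain Python; zscore_per_speaker and the (speaker_id, value) row layout are assumptions made for this example.

```python
import statistics
from collections import defaultdict

def zscore_per_speaker(rows):
    """Z-score one feature within each speaker.

    rows: list of (speaker_id, value) pairs.
    Returns the normalized values in input order.
    """
    by_speaker = defaultdict(list)
    for speaker, value in rows:
        by_speaker[speaker].append(value)

    stats = {}
    for speaker, values in by_speaker.items():
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        # Guard against division by zero for constant features.
        stats[speaker] = (mean, std if std > 0 else 1.0)

    return [(value - stats[s][0]) / stats[s][1] for s, value in rows]

# Mean F0 in Hz for two speakers with very different baselines.
rows = [("a", 100.0), ("a", 120.0), ("b", 200.0), ("b", 240.0)]
normalized = zscore_per_speaker(rows)
print(normalized)
```

After normalization both speakers span the same range, so a model sees deviation from each speaker's own baseline rather than absolute pitch.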
Performance Optimization
1. Reduce Feature Count
Problem: 6,373 ComParE features cause overfitting and slow inference.
Solution: Extract minimal LLDs, then add only essential functionals.
import opensmile
# Extract only LLDs (no functionals)
smile = opensmile.Smile(
feature_set=opensmile.FeatureSet.eGeMAPSv02,
feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)
llds = smile.process_file('audio.wav') # (time_frames, 25)
# Compute custom functionals in Python
import numpy as np
features = {
'f0_mean': llds['F0semitoneFrom27.5Hz_sma3nz'].mean(),
'f0_std': llds['F0semitoneFrom27.5Hz_sma3nz'].std(),
'f0_range': llds['F0semitoneFrom27.5Hz_sma3nz'].max() - llds['F0semitoneFrom27.5Hz_sma3nz'].min(),
'loudness_mean': llds['loudness_sma3'].mean(),
# ... (only features you need)
}
Result: 10-20 custom features instead of 6,373.
2. Batch Processing
Problem: Processing files one-by-one is slow (repeated model initialization).
Solution: Process multiple files in single session.
import glob
import opensmile
import pandas as pd
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
audio_files = glob.glob('dataset/**/*.wav', recursive=True)
all_features = []
for i, audio_path in enumerate(audio_files):
    if i % 100 == 0:
        print(f"Processed {i}/{len(audio_files)}")
    features = smile.process_file(audio_path)
    features['filename'] = audio_path
    all_features.append(features)
df = pd.concat(all_features, ignore_index=True)
df.to_csv('all_features.csv', index=False)
Performance: Processes 10-minute audio in ~1 second on modern CPU (single-threaded). For large datasets, parallelize with multiprocessing:
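from multiprocessing import Pool
import glob
import opensmile
import pandas as pd
def extract_features(audio_path):
    # Each worker process builds its own extractor
    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.eGeMAPSv02,
        feature_level=opensmile.FeatureLevel.Functionals,
    )
    return smile.process_file(audio_path)
# The guard keeps worker processes from re-running the pool setup on import
if __name__ == "__main__":
    audio_files = glob.glob('dataset/**/*.wav', recursive=True)
    with Pool(processes=8) as pool:
        all_features = pool.map(extract_features, audio_files)
    df = pd.concat(all_features)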
3. Caching for Real-Time Applications
Problem: Re-extracting features for every prediction is wasteful.
Solution: Cache features in Redis or in-memory.
import redis
import pickle
import opensmile
redis_client = redis.Redis(host='localhost', port=6379)
# Build the extractor once and reuse it across calls
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
def get_features_cached(audio_path):
    cache_key = f"features:{audio_path}"
    # Check cache
    cached = redis_client.get(cache_key)
    if cached:
        return pickle.loads(cached)
    # Extract features
    features = smile.process_file(audio_path)
    # Store in cache (24-hour expiry)
    redis_client.setex(cache_key, 86400, pickle.dumps(features))
    return features
Common Configuration Mistakes
1. Incorrect Frame Size/Shift
Mistake:
frameSize = 25 ; Interpreted as 25 SECONDS (way too large)
frameStep = 10
Fix:
frameSize = 0.025 ; 25 milliseconds
frameStep = 0.010 ; 10 milliseconds
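A quick way to sanity-check frame settings is to convert them to samples and count the frames they produce. The sketch below counts only complete frames; how openSMILE pads the end of a file may differ, and frame_layout is an illustrative helper rather than part of the toolkit.

```python
def frame_layout(duration_s, sample_rate, frame_size_s=0.025, frame_step_s=0.010):
    """Return (samples per frame, samples per step, number of complete frames)."""
    size = round(frame_size_s * sample_rate)
    step = round(frame_step_s * sample_rate)
    total = round(duration_s * sample_rate)
    n_frames = 0 if total < size else 1 + (total - size) // step
    return size, step, n_frames

# Correct settings: 25 ms frames, 10 ms step, 3 seconds of 16 kHz audio.
print(frame_layout(3.0, 16000))            # (400, 160, 298)

# The mistake above: values read as seconds, so no frame fits in 3 seconds.
print(frame_layout(3.0, 16000, 25, 10))    # (400000, 160000, 0)
```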
2. Missing Data Memory Levels
Mistake:
[pitchExtractor:cPitchShs]
reader.dmLevel = frames
writer.dmLevel = pitch
[functionals:cFunctionals]
; ERROR: No component writes to 'llds'
reader.dmLevel = llds
Fix: Ensure writer of one component matches reader of next.
[pitchExtractor:cPitchShs]
reader.dmLevel = frames
; Changed to 'llds'
writer.dmLevel = llds
[functionals:cFunctionals]
; Now matches
reader.dmLevel = llds
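This kind of mismatch can be caught before running SMILExtract with a small check. The sketch below is a simplified reader, not openSMILE's parser: it assumes one setting per line, comments on their own lines, and semicolon-separated level lists. The unmatched_readers helper is hypothetical.

```python
def unmatched_readers(conf_text):
    """Return (section, level) pairs whose reader level has no writer."""
    writers, readers = set(), []
    section = "?"
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and full-line comments.
        if not line or line.startswith((";", "#", "//", "%")):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].split(":", 1)[0]
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        levels = [v.strip() for v in value.split(";") if v.strip()]
        if key.strip() == "writer.dmLevel":
            writers.update(levels)
        elif key.strip() == "reader.dmLevel":
            readers.extend((section, level) for level in levels)
    return [(sec, lvl) for sec, lvl in readers if lvl not in writers]

broken_conf = """
[waveSource:cWaveSource]
writer.dmLevel = wave
[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames
[pitchExtractor:cPitchShs]
reader.dmLevel = frames
writer.dmLevel = pitch
[functionals:cFunctionals]
reader.dmLevel = llds
"""
fixed_conf = broken_conf.replace("writer.dmLevel = pitch", "writer.dmLevel = llds")

print(unmatched_readers(broken_conf))
print(unmatched_readers(fixed_conf))
```

The first call reports the functionals section reading from llds; the second returns an empty list.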
3. Pitch Range Too Narrow
Mistake:
[pitchExtractor:cPitchShs]
; 250 Hz ceiling clips female speakers at the top of their range (F0 often 180-250 Hz)
; 80 Hz floor leaves little room below typical male F0 (often 85-180 Hz)
minPitch = 80
maxPitch = 250
Fix: Use wide range for mixed-gender datasets.
; Floor below typical male F0; ceiling above typical female F0, with margin
minPitch = 50
maxPitch = 500
4. Forgetting Mono Conversion
Mistake: Processing stereo audio without mixing to mono causes feature duplication.
Fix:
[waveSource:cWaveSource]
monoMixdown = 1 ; Convert stereo to mono
openSMILE vs Custom Feature Extraction
When to use openSMILE:
- Research comparison: Need to benchmark against prior studies
- Interpretability: Clinical applications requiring explainable features
- Low-latency: Real-time processing on CPU
- Production C++ integration: Embed in mobile/edge devices
When to use custom deep learning features (Wav2vec 2.0, etc.):
- SOTA performance: Need maximum accuracy (2-5% improvement over openSMILE)
- End-to-end training: Fine-tune feature extraction with task-specific loss
- Large datasets: >10,000 samples to train deep models
- GPU availability: Can afford 10-100ms inference latency
Hybrid approach (recommended):
- openSMILE for classical ML: Fast baseline (Random Forest on eGeMAPS: 70-85% accuracy)
- Wav2vec 2.0 for deep learning: SOTA ensemble model (75-95% accuracy)
- Combine predictions: Weighted average or stacking (additional 1-3% boost)
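The "combine predictions" step can be as simple as a weighted average of class probabilities. This is a minimal sketch with made-up numbers; weighted_average is an illustrative helper, and in practice the weight would be tuned on a validation set.

```python
def weighted_average(probs_a, probs_b, weight_a=0.5):
    """Blend two {label: probability} dicts; weight_a applies to probs_a."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    if probs_a.keys() != probs_b.keys():
        raise ValueError("both models must predict the same labels")
    return {
        label: weight_a * probs_a[label] + (1.0 - weight_a) * probs_b[label]
        for label in probs_a
    }

# Hypothetical outputs for one utterance.
rf_on_egemaps = {"neutral": 0.6, "stressed": 0.4}   # classical model
wav2vec_model = {"neutral": 0.3, "stressed": 0.7}   # deep model

blended = weighted_average(rf_on_egemaps, wav2vec_model, weight_a=0.4)
prediction = max(blended, key=blended.get)
print(blended, prediction)
```

Stacking replaces the fixed weight with a small model trained on both sets of probabilities.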
Production Deployment
Option 1: Python Package (Easiest)
# Dockerfile
FROM python:3.10-slim
RUN pip install opensmile scikit-learn
COPY extract_features.py /app/
CMD ["python", "/app/extract_features.py"]
Pros: Zero configuration, easy debugging, integrates with Python ML stack
Cons: Slower startup (~100ms to initialize), larger Docker image (~500MB)
Option 2: C++ Binary (Fastest)
# Dockerfile
FROM ubuntu:22.04
# Assumes a SMILExtract binary built from source or unpacked from a release archive
COPY SMILExtract /usr/local/bin/SMILExtract
COPY custom_config.conf /app/
CMD ["SMILExtract", "-C", "/app/custom_config.conf", "-I", "/input/audio.wav", "-O", "/output/features.csv"]
Pros: Fastest startup (<10ms), smallest image (~50MB), lowest memory
Cons: Harder to debug, requires config file maintenance
Option 3: FastAPI Microservice
# feature_service.py
from fastapi import FastAPI, UploadFile
import opensmile
import tempfile
import os
app = FastAPI()
smile = opensmile.Smile(
feature_set=opensmile.FeatureSet.eGeMAPSv02,
feature_level=opensmile.FeatureLevel.Functionals,
)
@app.post("/extract_features")
async def extract_features(audio: UploadFile):
    # Save uploaded audio to temp file
    with tempfile.NamedTemporaryFile(delete=False, suffix='.wav') as temp_audio:
        temp_audio.write(await audio.read())
        temp_path = temp_audio.name
    try:
        # Extract features
        features = smile.process_file(temp_path)
    finally:
        # Remove the temp file even if extraction fails
        os.unlink(temp_path)
    # Return as JSON
    return features.to_dict(orient='records')[0]
Usage:
curl -X POST "http://localhost:8000/extract_features" \
-F "audio=@audio.wav"
Pros: Language-agnostic API, scales horizontally, easy monitoring
Debugging openSMILE
1. Enable Verbose Logging
SMILExtract -C config.conf -l 3 # Log level 3 (verbose)
Output shows: Component initialization, data flow between levels, timing per component.
2. Inspect Intermediate Data
Add multiple cCsvSink components to dump intermediate features:
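; Register both sinks in [componentInstances:cComponentManager]:
;   instance[lld_dump].type = cCsvSink
;   instance[functionals_dump].type = cCsvSink
; Dump LLDs before functionals
[lld_dump:cCsvSink]
reader.dmLevel = llds
filename = debug_llds.csv
; Dump final features
[functionals_dump:cCsvSink]
reader.dmLevel = functionals
filename = debug_functionals.csv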
3. Python Debugging with LLDs
import opensmile
import matplotlib.pyplot as plt
# Extract frame-by-frame features
smile = opensmile.Smile(
feature_set=opensmile.FeatureSet.eGeMAPSv02,
feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)
llds = smile.process_file('audio.wav')
# Plot pitch over time
plt.figure(figsize=(12, 4))
plt.plot(llds.index, llds['F0semitoneFrom27.5Hz_sma3nz'])
plt.xlabel('Time (s)')
plt.ylabel('F0 (semitones from 27.5 Hz)')
plt.title('Pitch Contour')
plt.savefig('pitch_contour.png')
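The F0 column is expressed in semitones relative to 27.5 Hz, which is harder to read than Hz. Taking the column name at face value, the conversion is st = 12 * log2(f / 27.5). The sketch below inverts it; treating values of 0 or less as unvoiced frames is an assumption, and the helper names are illustrative.

```python
import math

def semitone_to_hz(semitones, ref_hz=27.5):
    """Invert st = 12 * log2(f / ref_hz)."""
    return ref_hz * 2.0 ** (semitones / 12.0)

def hz_to_semitone(hz, ref_hz=27.5):
    """Semitones above the reference frequency."""
    return 12.0 * math.log2(hz / ref_hz)

def track_to_hz(semitone_track):
    """Convert an F0 track, mapping unvoiced frames (assumed <= 0) to None."""
    return [semitone_to_hz(st) if st > 0 else None for st in semitone_track]

print(semitone_to_hz(24.0))      # two octaves above 27.5 Hz
print(hz_to_semitone(220.0))     # three octaves above 27.5 Hz
print(track_to_hz([0.0, 24.0, 36.0]))
```

With the DataFrame above, track_to_hz(llds['F0semitoneFrom27.5Hz_sma3nz']) would give a contour that can be plotted on a Hz axis.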
Resources & Further Learning
Official Documentation:
- openSMILE Documentation - Official docs from audEERING
- Python Package Docs - opensmile Python API reference
- GitHub Repository - Source code and example configs
Key Research Papers:
- openSMILE architecture: Eyben et al. (2010) - "openSMILE - The Munich Versatile and Fast Open-Source Audio Feature Extractor" (ACM MM)
- GeMAPS feature set: Eyben et al. (2016) - "The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing" (IEEE TASLP)
- ComParE feature set: Schuller et al. (2013) - "The INTERSPEECH 2013 Computational Paralinguistics Challenge" (INTERSPEECH)
Example Configs:
- /path/to/opensmile/config/ - Standard feature set configs (installed with openSMILE)
- gemaps/v01b/eGeMAPSv01b.conf - eGeMAPS standard config
- compare/ComParE_2016.conf - ComParE challenge config
The Bottom Line
openSMILE is your feature extraction Swiss Army knife—but you don't need to master every component to be effective.
For 90% of voice analysis tasks:
- Use the Python package (pip install opensmile)
- Start with eGeMAPS (88 features, best balance)
- Extract functionals (not LLDs) for utterance-level classification
- Train classical ML model (Random Forest) as baseline
- If accuracy insufficient, try ComParE (6,373 features) with feature selection
For custom pipelines (real-time, edge devices, C++ integration):
- Start with minimal example config (pitch extraction)
- Add components incrementally (formants, spectral, voice quality)
- Test intermediate outputs with cCsvSink at each stage
- Optimize: reduce LLDs to essential features, use sliding functionals for real-time
The goal isn't to extract 6,000 features—it's to extract the right features for your specific task. openSMILE gives you the tools; your domain knowledge determines which ones to use.
Voice Mirror uses openSMILE eGeMAPS (88 features) for interpretable baseline models and ComParE (6,373 features with Random Forest feature selection) for SOTA ensemble models. Our hybrid approach combines classical ML interpretability with deep learning accuracy across 20+ voice analysis tasks.