openSMILE Configuration Guide: Your Complete Feature Extraction Toolkit

When you're building voice analysis systems, openSMILE (open-source Speech and Music Interpretation by Large-space Extraction) is your Swiss Army knife. The toolkit can extract more than 6,000 acoustic features from audio (6,373 with the ComParE set) at a 10 ms frame resolution, but its real power comes from mastering its configuration.

In this guide, you'll learn how to configure openSMILE for any voice analysis task, from using standard feature sets to building custom extraction pipelines.

Why openSMILE?

The gold standard for voice research: openSMILE has been used in 1,000+ academic papers, including winning systems in the INTERSPEECH Computational Paralinguistics Challenge (ComParE) since 2009.

Key advantages:

Comprehensive: 6,000+ features covering pitch, energy, spectral, cepstral, voice quality
Fast: Real-time capable (processes 10-minute audio in about 1 second on a modern CPU)
Research-validated: Standard feature sets (GeMAPS, eGeMAPS) used across 200+ studies
Flexible: Modular component system lets you build custom pipelines
Production-ready: C++ core with Python bindings, runs on server or edge devices

openSMILE Architecture Overview

openSMILE uses a component-based pipeline architecture:

Audio Input → Framing → LLD Extraction → Functionals → Feature Vector Output

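1. Audio Input: Reads WAV files or a live audio stream (compressed formats such as MP3 may need to be converted to WAV first, depending on how openSMILE was built)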

2. Framing: Splits audio into overlapping windows (e.g., 25ms windows, 10ms shift)

3. Low-Level Descriptors (LLDs): Frame-by-frame features extracted every 10ms:

F0 (pitch)
Energy (loudness)
MFCCs (spectral shape)
Jitter, shimmer (voice quality)
Spectral flux, rolloff, centroid

4. Functionals: Statistical summaries of LLDs across larger time windows:

Mean, standard deviation, min, max, range
Quartiles (25th, 50th, 75th percentile)
Slopes (linear regression of feature over time)
Moments (skewness, kurtosis)

5. Output: Fixed-length feature vector (e.g., 62 features for GeMAPS, 88 for eGeMAPS) for machine learning
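
The five stages can be sketched end to end in NumPy. This only illustrates the data flow and is not how openSMILE implements it: the RMS energy descriptor, the helper names, and the synthetic sine input are assumptions made for the example.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_size=0.025, frame_step=0.010):
    """Split a 1-D signal into overlapping frames (sizes given in seconds)."""
    size = int(round(frame_size * sample_rate))
    step = int(round(frame_step * sample_rate))
    n_frames = 1 + (len(signal) - size) // step
    return np.stack([signal[i * step : i * step + size] for i in range(n_frames)])

def rms_energy(frames):
    """One low-level descriptor (LLD) value per frame."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def functionals(lld):
    """Summarize a variable-length LLD contour as a fixed-length vector."""
    return np.array([lld.mean(), lld.std(), lld.min(), lld.max()])

# 1 second of a 220 Hz sine at 16 kHz stands in for real audio
sr = 16000
t = np.arange(sr) / sr
audio = 0.5 * np.sin(2 * np.pi * 220 * t)

frames = frame_signal(audio, sr)   # shape (98, 400): 25 ms frames every 10 ms
lld = rms_energy(frames)           # shape (98,): one value per frame
vector = functionals(lld)          # shape (4,): fixed length, whatever the duration
print(frames.shape, lld.shape, vector.shape)
```

The point of the last step is that the output length depends only on the number of LLDs and functionals, never on how long the recording is.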

Standard Feature Sets

openSMILE provides pre-configured feature sets optimized for different tasks. You don't need to understand every component—just pick the right set for your use case.

1. GeMAPS (Geneva Minimalistic Acoustic Parameter Set)

Who should use it: Voice analysis beginners, applications prioritizing interpretability

Features: 62 features (18 LLDs × functionals + 6 temporal features)

LLDs extracted:

F0 (fundamental frequency / pitch)
Loudness
Jitter, shimmer (voice quality)
HNR (harmonics-to-noise ratio)
Alpha ratio (spectral slope)
Hammarberg index (spectral balance)
Spectral slopes (0-500Hz, 500-1500Hz)
F1, F2, F3 frequencies (first 3 formants)
F1 bandwidth
F1, F2, F3 relative energies
Harmonic differences (H1-H2, H1-A3)

Temporal features:

Rate of loudness peaks (speech rhythm)
Mean length/duration of continuous voicing/unvoicing (pause patterns)

Research validation: Developed by the University of Geneva and TUM, used in 100+ papers. Performs comparably to larger feature sets for emotion (78% vs 81%), depression (68% vs 72%), and personality (r=0.24 vs r=0.28).

When to use GeMAPS:

Interpretability matters: Clinical applications where you need to explain predictions
Limited data: 62 features reduce overfitting risk with <500 samples
Fast inference: Lower dimensionality → faster predictions

2. eGeMAPS (extended GeMAPS)

Who should use it: Most voice analysis applications (best balance of performance vs complexity)

Features: 88 features (25 LLDs × functionals, the 6 GeMAPS temporal features, and equivalent sound level)

Additional LLDs vs GeMAPS:

MFCC 1-4 (mel-frequency cepstral coefficients)
Spectral flux (rate of spectral change)
F2 and F3 bandwidths

Research validation: The most widely used feature set in voice analysis research (2016-2024). Used in AVEC Depression Challenge (71% accuracy), ComParE Emotion Challenge (78% accuracy), and 50+ health screening studies.

Performance vs GeMAPS:

Emotion recognition: 81% vs 78% (eGeMAPS better)
Depression screening: 72% vs 68% (eGeMAPS better)
Gender classification: 98% vs 98% (no difference)
Age estimation: MAE 5.9 vs 6.4 years (eGeMAPS better)

When to use eGeMAPS:

Default choice: Best starting point for 90% of voice analysis tasks
Sufficient data: >500 samples to avoid overfitting
Need SOTA performance: Extra features capture subtle variations

3. ComParE (Computational Paralinguistics Challenge)

Who should use it: Researchers, applications with >5,000 samples, when you need maximum information

Features: 6,373 features (functionals applied to 65 LLDs and their first-order deltas)

LLDs extracted (in addition to eGeMAPS):

MFCC 1-14
Delta coefficients (first-order temporal derivatives of the LLDs)
RASTA-filtered auditory spectrum bands 1-26
RMS energy and zero-crossing rate
Voicing probability
Spectral harmonicity, psychoacoustic sharpness, spectral variance

Functionals (98 statistical summaries):

All GeMAPS functionals (mean, std, min, max, range, quartiles, slope)
Additional moments: skewness, kurtosis
Position of extrema (argmin, argmax)
Linear/quadratic regression coefficients
Percentile ranges (95th-5th, 75th-25th)
Up-level time (% time above mean)

Research validation: Baseline feature set of the INTERSPEECH ComParE challenges since 2013 (the challenge series itself dates back to 2009). Winning systems have used ComParE for emotion (85% accuracy), speaker traits (83%), and medical conditions (89%).

Performance gains vs eGeMAPS:

Complex tasks: 3-8% accuracy improvement (emotion, depression, personality)
Simple tasks: No improvement (gender, age)
Requires feature selection: Use Random Forest importance or Lasso to reduce to 100-500 features

When to use ComParE:

Large datasets: >5,000 samples to avoid overfitting
Complex tasks: Subtle conditions (depression, cognitive load)
Feature selection: Plan to use dimensionality reduction
Research context: Benchmarking against prior studies
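
The sample-size guidance for the three feature sets can be written down as a small helper. This is a sketch of the rules of thumb stated in this guide, not an openSMILE API; the function name and the `need_interpretability` flag are hypothetical.

```python
def suggest_feature_set(n_samples, need_interpretability=False):
    """Map dataset size to a feature set using the thresholds from this guide."""
    if need_interpretability or n_samples < 500:
        return "GeMAPS"    # small, interpretable set
    if n_samples <= 5000:
        return "eGeMAPS"   # default choice
    return "ComParE"       # large set: plan for feature selection afterwards

for n in (200, 2000, 20000):
    print(n, suggest_feature_set(n))
```

Treat the result as a starting point: a quick cross-validated comparison of two candidate sets on your own data is a better guide than any fixed threshold.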

Python Integration: The Easy Way

The opensmile Python package (by audEERING, openSMILE creators) provides zero-configuration feature extraction for standard sets.

Installation

pip install opensmile

Basic Usage: Extract eGeMAPS Features

import opensmile
import numpy as np

# Initialize feature extractor
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# Extract features from audio file
features = smile.process_file('audio.wav')

# Returns pandas DataFrame with 88 columns (eGeMAPS features)
print(features.shape)  # (1, 88)
print(features.columns[:5])
# ['F0semitoneFrom27.5Hz_sma3nz_amean',
#  'F0semitoneFrom27.5Hz_sma3nz_stddevNorm',
#  'F0semitoneFrom27.5Hz_sma3nz_percentile20.0', ...]

Processing Multiple Files

import os
import pandas as pd

audio_files = ['speaker1.wav', 'speaker2.wav', 'speaker3.wav']

all_features = []
for audio_path in audio_files:
    features = smile.process_file(audio_path)
    all_features.append(features)

# Concatenate into single DataFrame
df = pd.concat(all_features, ignore_index=True)
print(df.shape)  # (3, 88)

Switching Feature Sets

# GeMAPS (minimal)
smile_gemaps = opensmile.Smile(
    feature_set=opensmile.FeatureSet.GeMAPSv01b,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# ComParE (comprehensive)
smile_compare = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# eGeMAPS with LLDs (frame-by-frame features)
smile_llds = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)

features_llds = smile_llds.process_file('audio.wav')
print(features_llds.shape)  # (time_frames, 25) - 25 LLDs per frame

Real-Time Processing with Live Audio

import sounddevice as sd

# Initialize extractor
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# Record 5 seconds
sample_rate = 16000
audio = sd.rec(int(5 * sample_rate), samplerate=sample_rate, channels=1)
sd.wait()

# Process numpy array directly
features = smile.process_signal(audio.flatten(), sample_rate)
print(features.shape)  # (1, 88)

Custom Configuration Files

For advanced use cases—custom feature combinations, real-time processing, or integrating with C++ applications—you'll work with openSMILE configuration files (.conf).

Configuration files define the component graph: which components run, in what order, and how data flows between them.

Configuration File Structure

openSMILE configs use INI-style syntax:

[component_name:component_type]
reader.dmLevel = wave
writer.dmLevel = llds
parameter1 = value1
parameter2 = value2

Key concepts:

Data Memory (DM): Internal buffer where components read/write data. Each "level" is a named buffer (e.g., "wave", "pitch", "llds").
Components: Processing units that read from one or more DM levels and write to a level of their own (each level has exactly one writer but can have many readers).
Component types: Pre-defined classes (cWaveSource, cFramer, cPitchShs, cFunctionals).

Example: Minimal Pitch Extraction Config

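;; pitch_extraction.conf
;; Extract F0 (pitch) from audio.
;; Component chain modeled on the prosodyShs.conf example shipped with openSMILE.
;; Verify option names for your version with: SMILExtract -H cPitchShs

[componentInstances:cComponentManager]
instance[dataMemory].type = cDataMemory
instance[waveSource].type = cWaveSource
instance[framer].type = cFramer
instance[windower].type = cWindower
instance[fft].type = cTransformFFT
instance[fftMag].type = cFFTmagphase
instance[specScale].type = cSpecScale
instance[pitchExtractor].type = cPitchShs
instance[csvSink].type = cCsvSink

;;;;;;;;;; 1. Audio Input ;;;;;;;;;;
[waveSource:cWaveSource]
filename = input.wav
writer.dmLevel = wave

;;;;;;;;;; 2. Framing ;;;;;;;;;;
;; 25ms windows, 10ms shift (values are in seconds)
[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames
frameSize = 0.025
frameStep = 0.010
frameCenterSpecial = left

;;;;;;;;;; 3. Spectrum for Pitch Detection ;;;;;;;;;;
;; cPitchShs reads a log-scale (octave) magnitude spectrum, not raw frames
[windower:cWindower]
reader.dmLevel = frames
writer.dmLevel = windowed
winFunc = gauss
sigma = 0.4

[fft:cTransformFFT]
reader.dmLevel = windowed
writer.dmLevel = fft

[fftMag:cFFTmagphase]
reader.dmLevel = fft
writer.dmLevel = fftmag

[specScale:cSpecScale]
reader.dmLevel = fftmag
writer.dmLevel = hps
scale = octave
sourceScale = lin

;;;;;;;;;; 4. Pitch Extraction ;;;;;;;;;;
;; Search range in Hz
[pitchExtractor:cPitchShs]
reader.dmLevel = hps
writer.dmLevel = pitch
minPitch = 50
maxPitch = 500

;;;;;;;;;; 5. Output to CSV ;;;;;;;;;;
[csvSink:cCsvSink]
reader.dmLevel = pitch
filename = pitch_output.csv
append = 0
timestamp = 1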

Running the config:

SMILExtract -C pitch_extraction.conf

Output: CSV file with time-stamped pitch values (one row per 10ms frame).
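
The resulting file can be read back with the standard library. The sketch below assumes a semicolon-delimited file with a header row and columns named `frameTime` and `F0final`; these names depend on the sink and pitch component settings, so check the header of your own output first.

```python
import csv
import io

def load_pitch_track(csv_text, f0_column="F0final", delimiter=";"):
    """Return (time_s, f0_hz) pairs for voiced frames only."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    track = []
    for row in reader:
        f0 = float(row[f0_column])
        if f0 > 0:  # assumption: unvoiced frames are written as 0
            track.append((float(row["frameTime"]), f0))
    return track

# Tiny stand-in for the contents of pitch_output.csv
sample = "frameTime;F0final\n0.00;0.0\n0.01;118.5\n0.02;120.1\n"
track = load_pitch_track(sample)
mean_f0 = sum(f0 for _, f0 in track) / len(track)
print(len(track), round(mean_f0, 1))
```

For a real file, pass `open('pitch_output.csv').read()` instead of the sample string.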

Example: Gender Classification Feature Set

Let's build a custom config that extracts 6 features optimal for gender classification:

F0 (pitch) mean, std
Formant 1 (F1) mean
Formant 2 (F2) mean
Spectral centroid mean
HNR (harmonics-to-noise ratio) mean

;; gender_features.conf

[componentInstances:cComponentManager]
instance[dataMemory].type = cDataMemory
instance[waveSource].type = cWaveSource
instance[framer].type = cFramer
instance[pitchExtractor].type = cPitchShs
instance[formantExtractor].type = cFormantLpc
instance[spectralExtractor].type = cSpectral
instance[hnrExtractor].type = cHarmonics
instance[functionals].type = cFunctionals
instance[csvSink].type = cCsvSink

;;;;;;;;;; Audio Input ;;;;;;;;;;
[waveSource:cWaveSource]
filename = \cm[inputfile(I){input.wav}:name of input file]
writer.dmLevel = wave

;;;;;;;;;; Framing ;;;;;;;;;;
[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames
frameSize = 0.025
frameStep = 0.010

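;;;;;;;;;; LLD Extraction ;;;;;;;;;;
;; Each extractor writes to its own level: a data memory level accepts only one writer.
;; For brevity every extractor reads 'frames' here. In a full pipeline cPitchShs reads a
;; log-scale spectrum, cFormantLpc reads LPC coefficients, and cSpectral reads an FFT
;; magnitude spectrum; the configs shipped with openSMILE show those intermediate stages.

;;;;;;;;;; F0 Extraction ;;;;;;;;;;
[pitchExtractor:cPitchShs]
reader.dmLevel = frames
writer.dmLevel = pitch
minPitch = 50
maxPitch = 500

;;;;;;;;;; Formant Extraction ;;;;;;;;;;
[formantExtractor:cFormantLpc]
reader.dmLevel = frames
writer.dmLevel = formants
nFormants = 3
saveIntensity = 0

;;;;;;;;;; Spectral Features ;;;;;;;;;;
[spectralExtractor:cSpectral]
reader.dmLevel = frames
writer.dmLevel = spectral
centroid = 1
flux = 0
rollOff = 0

;;;;;;;;;; HNR Extraction ;;;;;;;;;;
[hnrExtractor:cHarmonics]
reader.dmLevel = frames
writer.dmLevel = hnr
outputHnr = 1

;;;;;;;;;; Functionals (Statistical Summaries) ;;;;;;;;;;
;; Read all LLD levels at once; the Moments group provides mean and standard deviation
[functionals:cFunctionals]
reader.dmLevel = pitch;formants;spectral;hnr
writer.dmLevel = functionals
functionalsEnabled = Moments
Moments.amean = 1
Moments.stddev = 1
Moments.variance = 0
Moments.skewness = 0
Moments.kurtosis = 0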
;;;;;;;;;; Output ;;;;;;;;;;
[csvSink:cCsvSink]
reader.dmLevel = functionals
filename = \cm[outputfile(O){output.csv}:output CSV file]
append = 0
timestamp = 0
instanceName = 1

Running with custom input/output:

SMILExtract -C gender_features.conf -I speaker1.wav -O speaker1_features.csv

Output: CSV with the mean and standard deviation of every LLD, from which the 6 features above (F0 mean/std, F1 mean, F2 mean, spectral centroid mean, HNR mean) can be selected.
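
To run the same config over many files, the command line above can be wrapped in a few lines of Python. This is a sketch that only uses the `-C`, `-I`, and `-O` options shown above; the helper names and the `_features.csv` naming scheme are assumptions, and `SMILExtract` must be on your PATH for the non-dry-run mode.

```python
import subprocess
from pathlib import Path

def build_command(config, wav_path, out_dir):
    """Build the SMILExtract call for one input file."""
    out_csv = Path(out_dir) / (Path(wav_path).stem + "_features.csv")
    return ["SMILExtract", "-C", str(config), "-I", str(wav_path), "-O", str(out_csv)]

def extract_all(config, wav_paths, out_dir, dry_run=True):
    """Run (or just list, when dry_run=True) one extraction per input file."""
    commands = [build_command(config, p, out_dir) for p in wav_paths]
    if not dry_run:
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        for cmd in commands:
            subprocess.run(cmd, check=True)  # raises if SMILExtract exits with an error
    return commands

commands = extract_all("gender_features.conf", ["speaker1.wav", "speaker2.wav"], "out")
for cmd in commands:
    print(" ".join(cmd))
```

Keeping `dry_run=True` as the default makes it easy to inspect the generated commands before launching a long batch.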

Key Components Reference

Here are the most useful openSMILE components for voice analysis:

Audio Input/Output

Component	Purpose	Key Parameters
`cWaveSource`	Read audio file	`filename`, `monoMixdown=1` (convert stereo to mono)
`cPortaudioSource`	Live microphone input	`audioBuffersize`, `sampleRate`
`cCsvSink`	Write features to CSV	`filename`, `append`, `timestamp`

Preprocessing

Component	Purpose	Key Parameters
`cFramer`	Split audio into frames	`frameSize=0.025` (25ms), `frameStep=0.010` (10ms shift)
`cWindower`	Apply window function	`winFunc=ham` (Hamming), `winFunc=han` (Hanning)
`cPreemphasis`	High-pass filter	`k=0.97` (pre-emphasis coefficient)

Feature Extraction

Component	Purpose	Key Parameters
`cPitchShs`	F0 (pitch) via subharmonic summation	`minPitch`, `maxPitch`, `voicingCutoff`
`cIntensity`	Loudness (RMS energy)	`intensity=1`, `loudness=1`
`cMfcc`	MFCCs (spectral shape)	`nMfcc=13`, `htkcompatible=1`
`cFormantLpc`	Formants (F1, F2, F3) via LPC	`nFormants=3`, `saveBandwidths=1` (reads LPC coefficients produced by `cLpc`)
`cSpectral`	Spectral features (centroid, flux, rolloff)	`centroid=1`, `flux=1`, `rollOff=1`
`cHarmonics`	HNR (harmonics-to-noise ratio)	`outputHnr=1`, `F0reader.dmLevel` (requires pitch)
`cPitchJitter`	Jitter (pitch perturbation) and shimmer (amplitude perturbation)	`jitterLocal=1`, `jitterDDP=1`, `shimmerLocal=1`, `F0reader.dmLevel` (requires pitch)

Functionals (Statistical Summaries)

Component	Purpose	Key Parameters
`cFunctionals`	Apply statistical functions to LLDs	`functionalsEnabled=Moments;Extremes;Percentiles`, then per-group switches such as `Moments.amean=1`, `Extremes.range=1`, `Percentiles.quartiles=1`
`cDeltaRegression`	Temporal derivatives (delta, delta-delta)	`nameAppend=de` (suffix for delta features)

Advanced Configuration Patterns

1. Real-Time Feature Extraction

For live applications (e.g., voice assistants, real-time emotion detection), you need incremental functionals that update every N frames instead of waiting for full audio.

[componentInstances:cComponentManager]
instance[dataMemory].type = cDataMemory
instance[portaudioInput].type = cPortaudioSource
instance[framer].type = cFramer
instance[pitchExtractor].type = cPitchShs
instance[functionals].type = cFunctionals
instance[realtimeOutput].type = cCsvSink

[portaudioInput:cPortaudioSource]
writer.dmLevel = wave
audioBuffersize = 4000
sampleRate = 16000

[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames
frameSize = 0.025
frameStep = 0.010

[pitchExtractor:cPitchShs]
reader.dmLevel = frames
writer.dmLevel = llds

[functionals:cFunctionals]
reader.dmLevel = llds
writer.dmLevel = functionals
; Key setting: fixed-size 1-second windows, advanced every 100 ms
frameMode = fixed
frameSize = 1.0
frameStep = 0.1
functionalsEnabled = Moments
Moments.amean = 1
Moments.stddev = 1

[realtimeOutput:cCsvSink]
reader.dmLevel = functionals
filename = realtime_features.csv
append = 1
timestamp = 1

Result: Features update every 100ms (frameStep=0.1), computing statistics over sliding 1-second windows.
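
What these windowed functionals compute can be mimicked in NumPy, which is handy for checking the output of a config. The sketch assumes one LLD value every 10 ms and uses a synthetic contour; the function name is hypothetical.

```python
import numpy as np

def sliding_functionals(lld, lld_step=0.010, win_size=1.0, win_step=0.1):
    """Mean and std of an LLD contour over overlapping windows (times in seconds)."""
    size = int(round(win_size / lld_step))  # 100 LLD frames per window
    step = int(round(win_step / lld_step))  # advance by 10 LLD frames
    starts = range(0, len(lld) - size + 1, step)
    return np.array([[lld[s:s + size].mean(), lld[s:s + size].std()] for s in starts])

# 3 seconds of a synthetic, steadily rising contour (300 frames at 10 ms)
lld = np.linspace(100.0, 200.0, 300)
out = sliding_functionals(lld)
print(out.shape)  # one row per window: [mean, std]
```

Note the latency implied by this setup: the first row only becomes available once a full 1-second window has been observed.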

2. Multi-Level Feature Extraction

Extract features at multiple time scales (frame-level, utterance-level, turn-level):

[componentInstances:cComponentManager]
; ... (framer, pitch extractor, etc.)
instance[frameLevelFunctionals].type = cFunctionals
instance[utteranceLevelFunctionals].type = cFunctionals

; Frame-level: 1-second windows, advanced every 0.5 s
[frameLevelFunctionals:cFunctionals]
reader.dmLevel = llds
writer.dmLevel = frame_features
frameMode = fixed
frameSize = 1.0
frameStep = 0.5
functionalsEnabled = Moments
Moments.amean = 1
Moments.stddev = 1

; Utterance-level: Full audio
[utteranceLevelFunctionals:cFunctionals]
reader.dmLevel = llds
writer.dmLevel = utterance_features
frameMode = full
functionalsEnabled = Moments;Extremes;Percentiles;Regression
Moments.amean = 1
Moments.stddev = 1
Extremes.min = 1
Extremes.max = 1
Percentiles.quartiles = 1
Regression.linregc1 = 1
Regression.linregc2 = 1

3. Feature Normalization

Normalize features for speaker independence:

[componentInstances:cComponentManager]
; ... (feature extraction)
instance[normalizer].type = cVectorMVN

[normalizer:cVectorMVN]
reader.dmLevel = llds
writer.dmLevel = llds_norm
; Z-score normalization: (x - mean) / std, with statistics updated incrementally
mode = incremental
meanEnable = 1
stdEnable = 1
; Or min-max scaling: (x - min) / (max - min)
; see the range normalization options listed by: SMILExtract -H cVectorMVN
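
The z-score and min-max formulas can also be applied offline, after extraction, which is often simpler when all of a speaker's recordings are available at once. A minimal NumPy sketch, with made-up values standing in for one speaker's LLD matrix (rows are frames, columns are features):

```python
import numpy as np

def zscore(x, eps=1e-8):
    """(x - mean) / std, computed per feature column."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def minmax(x, eps=1e-8):
    """(x - min) / (max - min), computed per feature column."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo + eps)

# Hypothetical LLDs for one speaker: [F0 in Hz, loudness]
speaker_llds = np.array([[110.0, 0.20], [120.0, 0.40], [130.0, 0.60]])
z = zscore(speaker_llds)
m = minmax(speaker_llds)
print(z.round(2))
print(m.round(2))
```

Computing the statistics per speaker, rather than over the whole dataset, is what removes speaker-specific offsets such as a generally higher or lower pitch.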

Performance Optimization

1. Reduce Feature Count

Problem: 6,373 ComParE features cause overfitting and slow inference.

Solution: Extract minimal LLDs, then add only essential functionals.

import opensmile

# Extract only LLDs (no functionals)
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)

llds = smile.process_file('audio.wav')  # (time_frames, 25)

# Compute custom functionals in Python
# F0 is 0 in unvoiced frames, so summarize voiced frames only
f0 = llds['F0semitoneFrom27.5Hz_sma3nz']
f0_voiced = f0[f0 > 0]

features = {
    'f0_mean': f0_voiced.mean(),
    'f0_std': f0_voiced.std(),
    'f0_range': f0_voiced.max() - f0_voiced.min(),
    'loudness_mean': llds['Loudness_sma3'].mean(),
    # ... (only features you need)
}

Result: 10-20 custom features instead of 6,373.

2. Batch Processing

Problem: Processing files one-by-one is slow (repeated model initialization).

Solution: Process multiple files in single session.

import glob

import opensmile
import pandas as pd

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

audio_files = glob.glob('dataset/**/*.wav', recursive=True)

all_features = []
for i, audio_path in enumerate(audio_files):
    if i % 100 == 0:
        print(f"Processed {i}/{len(audio_files)}")

    features = smile.process_file(audio_path)
    features['filename'] = audio_path
    all_features.append(features)

df = pd.concat(all_features, ignore_index=True)
df.to_csv('all_features.csv', index=False)

Performance: Processes 10-minute audio in ~1 second on modern CPU (single-threaded). For large datasets, parallelize with multiprocessing:

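from multiprocessing import Pool

import opensmile
import pandas as pd

_smile = None  # one extractor per worker process

def init_worker():
    global _smile
    _smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.eGeMAPSv02,
        feature_level=opensmile.FeatureLevel.Functionals,
    )

def extract_features(audio_path):
    return _smile.process_file(audio_path)

if __name__ == '__main__':
    # The guard is required on platforms that spawn worker processes (Windows, macOS)
    with Pool(processes=8, initializer=init_worker) as pool:
        all_features = pool.map(extract_features, audio_files)
    df = pd.concat(all_features)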

3. Caching for Real-Time Applications

Problem: Re-extracting features for every prediction is wasteful.

Solution: Cache features in Redis or in-memory.

import pickle

import opensmile
import redis

redis_client = redis.Redis(host='localhost', port=6379)

# Create the extractor once, not on every cache miss
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def get_features_cached(audio_path):
    # Keyed by path: a file overwritten in place returns stale features until expiry
    cache_key = f"features:{audio_path}"

    # Check cache
    cached = redis_client.get(cache_key)
    if cached is not None:
        return pickle.loads(cached)

    # Extract features
    features = smile.process_file(audio_path)

    # Store in cache (24-hour expiry)
    redis_client.setex(cache_key, 86400, pickle.dumps(features))

    return features
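
A path-based key breaks down when files are replaced or when the same audio is extracted with different feature sets. One possible alternative, sketched here with the standard library only, is to derive the key from the audio content and the extraction settings. The key layout and function name are assumptions, not part of the original setup.

```python
import hashlib

def feature_cache_key(audio_bytes, feature_set="eGeMAPSv02", level="Functionals"):
    """Build a cache key from the audio content and the extraction settings."""
    digest = hashlib.sha256(audio_bytes).hexdigest()
    return f"features:{feature_set}:{level}:{digest}"

key_a = feature_cache_key(b"fake audio bytes")
key_b = feature_cache_key(b"fake audio bytes")
key_c = feature_cache_key(b"fake audio bytes", feature_set="ComParE_2016")
print(key_a == key_b, key_a == key_c)
```

Hashing costs one pass over the file, which is usually small compared with feature extraction; for very large files a key built from path, size, and modification time is a cheaper compromise.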

Common Configuration Mistakes

1. Incorrect Frame Size/Shift

Mistake:

frameSize = 25  ; Interpreted as 25 SECONDS (way too large)
frameStep = 10

Fix:

frameSize = 0.025  ; 25 milliseconds
frameStep = 0.010  ; 10 milliseconds
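
A quick arithmetic check makes the unit mistake obvious before running anything. The helper below is a sketch with a hypothetical name; it counts how many full frames fit into a recording when all values are given in seconds.

```python
def frame_count(duration_s, frame_size, frame_step):
    """Number of full frames that fit in a recording (all values in seconds)."""
    if duration_s < frame_size:
        return 0
    # small epsilon guards against floating-point round-off in the division
    return 1 + int((duration_s - frame_size) / frame_step + 1e-9)

good = frame_count(3.0, 0.025, 0.010)  # 25 ms frames on a 3-second file
bad = frame_count(3.0, 25, 10)         # "25" read as 25 seconds
print(good, bad)
```

With the correct units a 3-second file yields 298 frames; with the mistaken values no frame fits at all, so the output would be empty.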

2. Missing Data Memory Levels

Mistake:

[pitchExtractor:cPitchShs]
reader.dmLevel = frames
writer.dmLevel = pitch

[functionals:cFunctionals]
reader.dmLevel = llds  ; ERROR: No component writes to 'llds'

Fix: Ensure writer of one component matches reader of next.

[pitchExtractor:cPitchShs]
reader.dmLevel = frames
writer.dmLevel = llds  ; Changed to 'llds'

[functionals:cFunctionals]
reader.dmLevel = llds  ; Now matches
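
This kind of mismatch can be caught before running SMILExtract with a small check script. The sketch below is a hypothetical helper, not an openSMILE tool: it collects every `reader.dmLevel` and `writer.dmLevel` value in a config and reports levels that are read but never written. It treats `;` as the separator between multiple levels and ignores anything that is not a plain identifier.

```python
import re

LEVEL = re.compile(r"^\s*(reader|writer)\.dmLevel\s*=\s*(.+)$")

def unmatched_readers(config_text):
    """Return data memory levels that are read but never written."""
    written, read = set(), set()
    for line in config_text.splitlines():
        m = LEVEL.match(line)
        if not m:
            continue
        # keep only plain identifiers so trailing comment text is ignored
        names = [v.strip() for v in m.group(2).split(";")]
        names = [n for n in names if re.fullmatch(r"\w+", n)]
        (written if m.group(1) == "writer" else read).update(names)
    return sorted(read - written)

broken = """
[pitchExtractor:cPitchShs]
reader.dmLevel = frames
writer.dmLevel = pitch

[functionals:cFunctionals]
reader.dmLevel = llds
"""
missing = unmatched_readers(broken)
print(missing)
```

On this fragment the check reports both `llds` and `frames`, the latter only because the framer section is not part of the snippet.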

3. Pitch Range Too Narrow

Mistake:

[pitchExtractor:cPitchShs]
; Lower bound leaves almost no margin below low male voices (F0 often 85-180 Hz)
minPitch = 80
; Upper bound sits at the top of the typical female range (F0 often 180-250 Hz),
; so higher-pitched or expressive speech is clipped
maxPitch = 250

Fix: Use wide range for mixed-gender datasets.

; Covers male speakers (down to 85 Hz) with margin
minPitch = 50
; Covers female speakers (up to 250 Hz) with margin
maxPitch = 500
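
The pitch range is configured in Hz, while the eGeMAPS pitch feature (`F0semitoneFrom27.5Hz`) is reported in semitones above 27.5 Hz. A short conversion helps relate the two scales; the function name is hypothetical and the formula is the standard semitone definition.

```python
import math

def hz_to_semitones(f_hz, ref_hz=27.5):
    """Semitones above the reference frequency: 12 * log2(f / ref)."""
    return 12.0 * math.log2(f_hz / ref_hz)

for hz in (50, 85, 180, 250, 500):
    print(hz, round(hz_to_semitones(hz), 1))
```

Each doubling of frequency adds 12 semitones, so the 50-500 Hz search range spans a little over three octaves.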

4. Forgetting Mono Conversion

Mistake: Processing stereo audio without mixing to mono causes feature duplication.

Fix:

[waveSource:cWaveSource]
monoMixdown = 1  ; Convert stereo to mono

openSMILE vs Custom Feature Extraction

When to use openSMILE:

Research comparison: Need to benchmark against prior studies
Interpretability: Clinical applications requiring explainable features
Low-latency: Real-time processing on CPU
Production C++ integration: Embed in mobile/edge devices

When to use custom deep learning features (Wav2vec 2.0, etc.):

SOTA performance: Need maximum accuracy (2-5% improvement over openSMILE)
End-to-end training: Fine-tune feature extraction with task-specific loss
Large datasets: >10,000 samples to train deep models
GPU availability: Can afford 10-100ms inference latency

Hybrid approach (recommended):

openSMILE for classical ML: Fast baseline (Random Forest on eGeMAPS: 70-85% accuracy)
Wav2vec 2.0 for deep learning: SOTA ensemble model (75-95% accuracy)
Combine predictions: Weighted average or stacking (additional 1-3% boost)
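
The weighted-average option can be sketched in a few lines of NumPy. The probabilities below are made up for illustration, and the weight `w_deep` is an assumption that would normally be tuned on a validation set.

```python
import numpy as np

def combine_probabilities(p_classical, p_deep, w_deep=0.6):
    """Weighted average of two models' class probabilities, renormalized per row."""
    p = (1.0 - w_deep) * np.asarray(p_classical) + w_deep * np.asarray(p_deep)
    return p / p.sum(axis=1, keepdims=True)

# Hypothetical outputs for two utterances and two classes
p_rf = np.array([[0.70, 0.30], [0.40, 0.60]])    # e.g. Random Forest on eGeMAPS
p_w2v = np.array([[0.55, 0.45], [0.20, 0.80]])   # e.g. fine-tuned Wav2vec 2.0
p = combine_probabilities(p_rf, p_w2v)
labels = p.argmax(axis=1)
print(p.round(2), labels)
```

Stacking replaces the fixed weight with a small model trained on the two probability vectors, at the cost of needing held-out predictions to train it.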

Production Deployment

Option 1: Python Package (Easiest)

# Dockerfile
FROM python:3.10-slim

RUN pip install opensmile scikit-learn

COPY extract_features.py /app/
CMD ["python", "/app/extract_features.py"]

Pros: Zero configuration, easy debugging, integrates with Python ML stack

Cons: Slower startup (~100ms to initialize), larger Docker image (~500MB)

Option 2: C++ Binary (Fastest)

# Dockerfile
FROM ubuntu:22.04

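# Assumption: an opensmile package is available from your apt sources. If it is not,
# copy a prebuilt SMILExtract binary into the image or build openSMILE from source.
RUN apt-get update && apt-get install -y opensmile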

COPY custom_config.conf /app/
CMD ["SMILExtract", "-C", "/app/custom_config.conf", "-I", "/input/audio.wav", "-O", "/output/features.csv"]

Pros: Fastest startup (<10ms), smallest image (~50MB), lowest memory

Cons: Harder to debug, requires config file maintenance

Option 3: FastAPI Microservice

# feature_service.py
from fastapi import FastAPI, UploadFile
import opensmile
import tempfile
import os

app = FastAPI()

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

@app.post("/extract_features")
async def extract_features(audio: UploadFile):
    # Save uploaded audio to temp file
    with tempfile.NamedTemporaryFile(delete=False, suffix='.wav') as temp_audio:
        temp_audio.write(await audio.read())
        temp_path = temp_audio.name

    # Extract features, removing the temp file even if extraction fails
    try:
        features = smile.process_file(temp_path)
    finally:
        os.unlink(temp_path)

    # Return as JSON
    return features.to_dict(orient='records')[0]

Usage:

curl -X POST "http://localhost:8000/extract_features" \
  -F "audio=@audio.wav"

Pros: Language-agnostic API, scales horizontally, easy monitoring

Debugging openSMILE

1. Enable Verbose Logging

SMILExtract -C config.conf -l 3  # Log level 3 (verbose)

Output shows: Component initialization, data flow between levels, timing per component.

2. Inspect Intermediate Data

Add multiple cCsvSink components to dump intermediate features:

; Register both sinks as cCsvSink instances in [componentInstances:cComponentManager]
[lld_dump:cCsvSink]
reader.dmLevel = llds  ; Dump LLDs before functionals
filename = debug_llds.csv

[functionals_dump:cCsvSink]
reader.dmLevel = functionals  ; Dump final features
filename = debug_functionals.csv

3. Python Debugging with LLDs

import opensmile
import matplotlib.pyplot as plt

# Extract frame-by-frame features
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)

llds = smile.process_file('audio.wav')

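# Plot pitch over time (the index is a MultiIndex of file, start, end)
times = llds.index.get_level_values('start').total_seconds()
plt.figure(figsize=(12, 4))
plt.plot(times, llds['F0semitoneFrom27.5Hz_sma3nz'])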
plt.xlabel('Time (s)')
plt.ylabel('F0 (semitones from 27.5 Hz)')
plt.title('Pitch Contour')
plt.savefig('pitch_contour.png')

Resources & Further Learning

Official Documentation:

openSMILE Documentation - Official docs from audEERING
Python Package Docs - opensmile Python API reference
GitHub Repository - Source code and example configs

Key Research Papers:

openSMILE architecture: Eyben et al. (2010) - "openSMILE - The Munich Versatile and Fast Open-Source Audio Feature Extractor" (ACM MM)
GeMAPS feature set: Eyben et al. (2016) - "The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing" (IEEE Transactions on Affective Computing)
ComParE feature set: Schuller et al. (2013) - "The INTERSPEECH 2013 Computational Paralinguistics Challenge" (INTERSPEECH)

Example Configs:

/path/to/opensmile/config/ - Standard feature set configs (installed with openSMILE)
egemaps/v02/eGeMAPSv02.conf - eGeMAPS standard config (openSMILE 3.x layout; paths differ in older releases)
compare16/ComParE_2016.conf - ComParE challenge config

The Bottom Line

openSMILE is your feature extraction Swiss Army knife—but you don't need to master every component to be effective.

For 90% of voice analysis tasks:

Use the Python package (pip install opensmile)
Start with eGeMAPS (88 features, best balance)
Extract functionals (not LLDs) for utterance-level classification
Train classical ML model (Random Forest) as baseline
If accuracy insufficient, try ComParE (6,373 features) with feature selection

For custom pipelines (real-time, edge devices, C++ integration):

Start with minimal example config (pitch extraction)
Add components incrementally (formants, spectral, voice quality)
Test intermediate outputs with cCsvSink at each stage
Optimize: reduce LLDs to essential features, use sliding functionals for real-time

The goal isn't to extract 6,000 features—it's to extract the right features for your specific task. openSMILE gives you the tools; your domain knowledge determines which ones to use.

Voice Mirror uses openSMILE eGeMAPS (88 features) for interpretable baseline models and ComParE (6,373 features with Random Forest feature selection) for SOTA ensemble models. Our hybrid approach combines classical ML interpretability with deep learning accuracy across 20+ voice analysis tasks.
