Voice AI Technology · February 17, 2025 · 16 min read

openSMILE Configuration Guide: Mastering Feature Extraction for Voice Analysis

Complete guide to openSMILE configuration, from standard feature sets (GeMAPS, eGeMAPS) to custom pipelines. Learn component architecture, Python integration, and production optimization.

Dr. Marcus Chen
Audio Signal Processing Engineer & Voice Analysis Specialist

openSMILE Configuration Guide: Your Complete Feature Extraction Toolkit

When you're building voice analysis systems, openSMILE (open-source Speech and Music Interpretation by Large-space Extraction) is your Swiss Army knife. The toolkit can extract more than 6,000 acoustic features from audio at a 10ms frame resolution, but the real power comes from mastering its configuration.

In this guide, you'll learn how to configure openSMILE for any voice analysis task, from using standard feature sets to building custom extraction pipelines.

Why openSMILE?

The gold standard for voice research: openSMILE has been used in 1,000+ academic papers, including winning systems in the INTERSPEECH challenge series, which started with the 2009 Emotion Challenge and has run as the Computational Paralinguistics Challenge (ComParE) since 2013.

Key advantages:

  • Comprehensive: 6,000+ features covering pitch, energy, spectral, cepstral, voice quality
  • Fast: Real-time processing (processes 10-minute audio in <1 second on modern CPU)
  • Research-validated: Standard feature sets (GeMAPS, eGeMAPS) used across 200+ studies
  • Flexible: Modular component system lets you build custom pipelines
  • Production-ready: C++ core with Python bindings, runs on server or edge devices

openSMILE Architecture Overview

openSMILE uses a component-based pipeline architecture:

Audio Input → Framing → LLD Extraction → Functionals → Feature Vector Output

1. Audio Input: Reads WAV files or a live audio stream (compressed formats such as MP3 generally need FFmpeg support or conversion to WAV first)

2. Framing: Splits audio into overlapping windows (e.g., 25ms windows, 10ms shift)

3. Low-Level Descriptors (LLDs): Frame-by-frame features extracted every 10ms:

  • F0 (pitch)
  • Energy (loudness)
  • MFCCs (spectral shape)
  • Jitter, shimmer (voice quality)
  • Spectral flux, rolloff, centroid

4. Functionals: Statistical summaries of LLDs across larger time windows:

  • Mean, standard deviation, min, max, range
  • Quartiles (25th, 50th, 75th percentile)
  • Slopes (linear regression of feature over time)
  • Moments (skewness, kurtosis)

5. Output: Fixed-length feature vector (e.g., 88 features for eGeMAPS) for machine learning
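To make the five stages concrete, here is a minimal pure-Python sketch of the same idea: frame a signal, compute one toy LLD per frame (RMS energy), and summarize it with a few functionals. It is an illustration of the pipeline shape only, not how openSMILE is implemented; the function names and the synthetic test tone are assumptions made for this example.

```python
import math
import statistics


def frame_signal(samples, sample_rate, frame_size=0.025, frame_step=0.010):
    """Split samples into overlapping frames (sizes given in seconds)."""
    size = int(round(frame_size * sample_rate))
    step = int(round(frame_step * sample_rate))
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, step)]


def rms_energy(frame):
    """A toy low-level descriptor: root-mean-square energy of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def functionals(contour):
    """Collapse a per-frame contour into a fixed-length set of statistics."""
    return {
        "mean": statistics.fmean(contour),
        "stddev": statistics.pstdev(contour),
        "min": min(contour),
        "max": max(contour),
        "range": max(contour) - min(contour),
    }


# Synthetic input: one second of a 220 Hz tone whose amplitude ramps up from 0
sample_rate = 16000
signal = [
    (n / sample_rate) * math.sin(2 * math.pi * 220 * n / sample_rate)
    for n in range(sample_rate)
]

frames = frame_signal(signal, sample_rate)   # stage 2: framing
energy = [rms_energy(f) for f in frames]     # stage 3: one LLD per frame
summary = functionals(energy)                # stage 4: functionals

print(len(frames), len(frames[0]))
print(summary)
```

With a 25ms window and a 10ms shift at 16 kHz, one second of audio yields 98 frames of 400 samples each, and the summary dictionary has the same length no matter how long the input is. That fixed length is what makes the output usable as a machine learning feature vector.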

Standard Feature Sets

openSMILE provides pre-configured feature sets optimized for different tasks. You don't need to understand every component—just pick the right set for your use case.

1. GeMAPS (Geneva Minimalistic Acoustic Parameter Set)

Who should use it: Voice analysis beginners, applications prioritizing interpretability

Features: 62 features (18 LLDs summarized by functionals, plus 6 temporal features)

LLDs extracted:

  • F0 (fundamental frequency / pitch)
  • Loudness
  • Jitter, shimmer (voice quality)
  • HNR (harmonics-to-noise ratio)
  • Alpha ratio (spectral slope)
  • Hammarberg index (spectral balance)
  • Spectral slopes (0-500Hz, 500-1500Hz)
  • F1, F2, F3 (first 3 formant frequencies)
  • F1 bandwidth
  • F1, F2, F3 relative energy
  • H1-H2 and H1-A3 harmonic differences

Temporal features:

  • Rate of loudness peaks (speech rhythm)
  • Mean length/duration of continuous voicing/unvoicing (pause patterns)

Research validation: Developed by the University of Geneva and TUM, used in 100+ papers. Performs comparably to larger feature sets for emotion (78% vs 81%), depression (68% vs 72%), and personality (r=0.24 vs r=0.28).

When to use GeMAPS:

  • Interpretability matters: Clinical applications where you need to explain predictions
  • Limited data: 62 features reduce overfitting risk with <500 samples
  • Fast inference: Lower dimensionality → faster predictions

2. eGeMAPS (extended GeMAPS)

Who should use it: Most voice analysis applications (best balance of performance vs complexity)

Features: 88 features (the 62 GeMAPS parameters plus 26 more, computed from 25 LLDs)

Additional LLDs vs GeMAPS:

  • MFCC 1-4 (mel-frequency cepstral coefficients)
  • Spectral flux (rate of spectral change)
  • F2, F3 bandwidths (F1 bandwidth is already part of GeMAPS)

Research validation: The most widely used feature set in voice analysis research (2016-2024). Used in AVEC Depression Challenge (71% accuracy), ComParE Emotion Challenge (78% accuracy), and 50+ health screening studies.

Performance vs GeMAPS:

  • Emotion recognition: 81% vs 78% (eGeMAPS better)
  • Depression screening: 72% vs 68% (eGeMAPS better)
  • Gender classification: 98% vs 98% (no difference)
  • Age estimation: MAE 5.9 vs 6.4 years (eGeMAPS better)

When to use eGeMAPS:

  • Default choice: Best starting point for 90% of voice analysis tasks
  • Sufficient data: >500 samples to avoid overfitting
  • Need SOTA performance: Extra features capture subtle variations

3. ComParE (Computational Paralinguistics Challenge)

Who should use it: Researchers, applications with >5,000 samples, when you need maximum information

Features: 6,373 features (functionals applied to 65 LLDs and their delta coefficients)

LLDs extracted (a broader set than eGeMAPS):

  • MFCC 1-14
  • Delta coefficients (temporal derivatives) of the LLDs
  • RASTA-filtered auditory spectrum bands
  • Loudness, RMS energy, zero-crossing rate
  • F0, voicing probability, jitter, shimmer, log HNR
  • Spectral harmonicity, psychoacoustic sharpness, spectral variance

Functionals (98 statistical summaries):

  • All GeMAPS functionals (mean, std, min, max, range, quartiles, slope)
  • Additional moments: skewness, kurtosis
  • Position of extrema (argmin, argmax)
  • Linear/quadratic regression coefficients
  • Percentile ranges (95th-5th, 75th-25th)
  • Up-level time (% time above mean)
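Several of these functionals are easier to grasp in code than in prose. The sketch below computes a handful of them for one LLD contour in pure Python. The dictionary keys are illustrative names chosen for this example, not openSMILE's output column names, and openSMILE's own definitions may differ in detail (for instance in how percentiles are interpolated).

```python
import statistics


def linear_slope(values):
    """Least-squares slope of the contour against the frame index."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = statistics.fmean(values)
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def extra_functionals(values):
    """A few ComParE-style summaries of a single per-frame contour."""
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    mean = statistics.fmean(values)
    last = len(values) - 1
    return {
        # Relative position (0..1) of the first maximum and minimum
        "argmax_rel": values.index(max(values)) / last,
        "argmin_rel": values.index(min(values)) / last,
        # Percentile range between the 75th and 25th percentile
        "quartile_range": q3 - q1,
        # Fraction of frames above the mean
        "up_level_time": sum(v > mean for v in values) / len(values),
        # Linear regression coefficient over time
        "slope": linear_slope(values),
    }


contour = [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0]
stats = extra_functionals(contour)
print(stats)
```

For this rise-then-fall contour the peak sits a little before the middle, half of the frames lie above the mean, and the regression slope is slightly negative because the contour ends lower than it starts.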

Research validation: Official feature set for the INTERSPEECH ComParE challenges (introduced in 2013; the 2009-2012 challenges used smaller openSMILE sets). Winning systems have used ComParE for emotion (85% accuracy), speaker traits (83%), and medical conditions (89%).

Performance gains vs eGeMAPS:

  • Complex tasks: 3-8% accuracy improvement (emotion, depression, personality)
  • Simple tasks: No improvement (gender, age)
  • Requires feature selection: Use Random Forest importance or Lasso to reduce to 100-500 features

When to use ComParE:

  • Large datasets: >5,000 samples to avoid overfitting
  • Complex tasks: Subtle conditions (depression, cognitive load)
  • Feature selection: Plan to use dimensionality reduction
  • Research context: Benchmarking against prior studies
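The rules of thumb from the three "When to use" lists can be collapsed into a small helper. This is only a sketch of the guidance in this article (the 500 and 5,000 sample thresholds come from the lists above); the function name and its arguments are made up for illustration, and real projects should validate the choice empirically.

```python
def suggest_feature_set(n_samples, needs_interpretability=False):
    """Map dataset size and interpretability needs to a starting feature set."""
    if needs_interpretability or n_samples < 500:
        return "GeMAPS"    # small, interpretable set
    if n_samples > 5000:
        return "ComParE"   # large set, plan for feature selection
    return "eGeMAPS"       # default middle ground


for n in (300, 2000, 20000):
    print(n, suggest_feature_set(n))
```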

Python Integration: The Easy Way

The opensmile Python package (by audEERING, openSMILE creators) provides zero-configuration feature extraction for standard sets.

Installation

pip install opensmile

Basic Usage: Extract eGeMAPS Features

import opensmile
import numpy as np

# Initialize feature extractor
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# Extract features from audio file
features = smile.process_file('audio.wav')

# Returns pandas DataFrame with 88 columns (eGeMAPS features)
print(features.shape)  # (1, 88)
print(features.columns[:5])
# ['F0semitoneFrom27.5Hz_sma3nz_amean',
#  'F0semitoneFrom27.5Hz_sma3nz_stddevNorm',
#  'F0semitoneFrom27.5Hz_sma3nz_percentile20.0', ...]
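The F0semitoneFrom27.5Hz prefix in these column names indicates that pitch is reported in semitones relative to 27.5 Hz rather than in Hz. Assuming the usual 12-semitones-per-octave definition, the conversion can be sketched as follows (the helper names are hypothetical, not part of the opensmile package):

```python
import math


def hz_to_semitones(freq_hz, ref_hz=27.5):
    """Semitones above the reference frequency (12 per octave)."""
    return 12.0 * math.log2(freq_hz / ref_hz)


def semitones_to_hz(semitones, ref_hz=27.5):
    """Inverse conversion, back to Hz."""
    return ref_hz * 2.0 ** (semitones / 12.0)


for hz in (110.0, 220.0):
    print(hz, hz_to_semitones(hz))
```

Since 110 Hz is two octaves above 27.5 Hz it maps to 24 semitones, and 220 Hz maps to 36. A difference of 12 in this column therefore always means a doubling of pitch, whichever speaker it comes from.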

Processing Multiple Files

import os
import pandas as pd

audio_files = ['speaker1.wav', 'speaker2.wav', 'speaker3.wav']

all_features = []
for audio_path in audio_files:
    features = smile.process_file(audio_path)
    all_features.append(features)

# Concatenate into single DataFrame
df = pd.concat(all_features, ignore_index=True)
print(df.shape)  # (3, 88)

Switching Feature Sets

# GeMAPS (minimal)
smile_gemaps = opensmile.Smile(
    feature_set=opensmile.FeatureSet.GeMAPSv01b,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# ComParE (comprehensive)
smile_compare = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# eGeMAPS with LLDs (frame-by-frame features)
smile_llds = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)

features_llds = smile_llds.process_file('audio.wav')
print(features_llds.shape)  # (time_frames, 25) - 25 LLDs per frame

Real-Time Processing with Live Audio

import sounddevice as sd

# Initialize extractor
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# Record 5 seconds
sample_rate = 16000
audio = sd.rec(int(5 * sample_rate), samplerate=sample_rate, channels=1)
sd.wait()

# Process numpy array directly
features = smile.process_signal(audio.flatten(), sample_rate)
print(features.shape)  # (1, 88)

Custom Configuration Files

For advanced use cases—custom feature combinations, real-time processing, or integrating with C++ applications—you'll work with openSMILE configuration files (.conf).

Configuration files define the component graph: which components run, in what order, and how data flows between them.

Configuration File Structure

openSMILE configs use INI-style syntax:

[component_name:component_type]
reader.dmLevel = wave
writer.dmLevel = llds
parameter1 = value1
parameter2 = value2

Key concepts:

  • Data Memory (DM): Internal buffer where components read/write data. Each "level" is a named buffer (e.g., "wave", "pitch", "llds").
  • Components: Processing units that read from one DM level and write to another.
  • Component types: Pre-defined classes (cWaveSource, cFramer, cPitchShs, cFunctionals).

Example: Minimal Pitch Extraction Config

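;; pitch_extraction.conf
;; Extract F0 (pitch) from audio.
;; cPitchShs works on a spectrum, so the frames first pass through a
;; window, an FFT, and an octave-scaled spectrum (the same chain the
;; prosody configs shipped with openSMILE use).

[componentInstances:cComponentManager]
instance[dataMemory].type = cDataMemory
instance[waveSource].type = cWaveSource
instance[framer].type = cFramer
instance[windower].type = cWindower
instance[fft].type = cTransformFFT
instance[fftMag].type = cFFTmagphase
instance[specScale].type = cSpecScale
instance[pitchExtractor].type = cPitchShs
instance[csvSink].type = cCsvSink

;;;;;;;;;; 1. Audio Input ;;;;;;;;;;
[waveSource:cWaveSource]
filename = input.wav
writer.dmLevel = wave

;;;;;;;;;; 2. Framing ;;;;;;;;;;
;; 25ms windows with a 10ms shift (values are in seconds)
[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames
frameSize = 0.025
frameStep = 0.010
frameCenterSpecial = left

;;;;;;;;;; 3. Spectrum ;;;;;;;;;;
[windower:cWindower]
reader.dmLevel = frames
writer.dmLevel = windowed
winFunc = ham

[fft:cTransformFFT]
reader.dmLevel = windowed
writer.dmLevel = fftc

[fftMag:cFFTmagphase]
reader.dmLevel = fftc
writer.dmLevel = fftmag

[specScale:cSpecScale]
reader.dmLevel = fftmag
writer.dmLevel = hps
scale = octave
sourceScale = lin

;;;;;;;;;; 4. Pitch Extraction ;;;;;;;;;;
;; minPitch / maxPitch bound the F0 search range in Hz
[pitchExtractor:cPitchShs]
reader.dmLevel = hps
writer.dmLevel = pitch
minPitch = 50
maxPitch = 500
F0raw = 1

;;;;;;;;;; 5. Output to CSV ;;;;;;;;;;
[csvSink:cCsvSink]
reader.dmLevel = pitch
filename = pitch_output.csv
append = 0
timestamp = 1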

Running the config:

SMILExtract -C pitch_extraction.conf

Output: CSV file with time-stamped pitch values (one row per 10ms frame).
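Once the CSV exists, a few lines of Python are enough to sanity-check it. The sketch below assumes a semicolon-delimited file with a header row and columns named frameTime and F0; the actual column names depend on the config and openSMILE version, so treat both names as placeholders and check the header of your own file. Frames with F0 equal to 0 are treated as unvoiced.

```python
import csv
import io
import statistics

# Stand-in for the contents of pitch_output.csv (assumed layout)
SAMPLE = """frameTime;F0
0.00;0.0
0.01;0.0
0.02;118.5
0.03;120.0
0.04;121.5
0.05;0.0
"""


def voiced_f0_stats(csv_text, f0_column="F0", delimiter=";"):
    """Mean and standard deviation of F0 over voiced frames only."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    values = [float(row[f0_column]) for row in reader]
    voiced = [v for v in values if v > 0.0]
    if not voiced:
        return {"voiced_frames": 0, "mean_f0": None, "std_f0": None}
    return {
        "voiced_frames": len(voiced),
        "mean_f0": statistics.fmean(voiced),
        "std_f0": statistics.pstdev(voiced),
    }


stats = voiced_f0_stats(SAMPLE)
print(stats)
```

Averaging over all frames instead would pull the mean toward zero, which is a common source of implausibly low pitch statistics.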

Example: Gender Classification Feature Set

Let's build a custom config that extracts 6 features optimal for gender classification:

  • F0 (pitch) mean, std
  • Formant 1 (F1) mean
  • Formant 2 (F2) mean
  • Spectral centroid mean
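  • HNR (harmonics-to-noise ratio) mean

;; gender_features.conf
;; Each data memory level has exactly one writer, so every extractor
;; writes to its own level and cFunctionals reads all of them.

[componentInstances:cComponentManager]
instance[dataMemory].type = cDataMemory
instance[waveSource].type = cWaveSource
instance[framer].type = cFramer
instance[windower].type = cWindower
instance[fft].type = cTransformFFT
instance[fftMag].type = cFFTmagphase
instance[specScale].type = cSpecScale
instance[pitchExtractor].type = cPitchShs
instance[pitchSmoother].type = cPitchSmoother
instance[lpc].type = cLpc
instance[formantExtractor].type = cFormantLpc
instance[spectralExtractor].type = cSpectral
instance[hnrExtractor].type = cPitchJitter
instance[functionals].type = cFunctionals
instance[csvSink].type = cCsvSink

;;;;;;;;;; Audio Input ;;;;;;;;;;
[waveSource:cWaveSource]
filename = \cm[inputfile(I){input.wav}:name of input file]
writer.dmLevel = wave
monoMixdown = 1

;;;;;;;;;; Framing and Spectrum ;;;;;;;;;;
[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames
frameSize = 0.025
frameStep = 0.010

[windower:cWindower]
reader.dmLevel = frames
writer.dmLevel = windowed
winFunc = ham

[fft:cTransformFFT]
reader.dmLevel = windowed
writer.dmLevel = fftc

[fftMag:cFFTmagphase]
reader.dmLevel = fftc
writer.dmLevel = fftmag

;;;;;;;;;; F0 Extraction ;;;;;;;;;;
[specScale:cSpecScale]
reader.dmLevel = fftmag
writer.dmLevel = hps
scale = octave
sourceScale = lin

[pitchExtractor:cPitchShs]
reader.dmLevel = hps
writer.dmLevel = pitchCandidates
minPitch = 50
maxPitch = 500
nCandidates = 4
scores = 1
voicing = 1

[pitchSmoother:cPitchSmoother]
reader.dmLevel = pitchCandidates
writer.dmLevel = pitch
F0final = 1

;;;;;;;;;; Formant Extraction ;;;;;;;;;;
;; cFormantLpc reads LPC coefficients, so cLpc runs first
[lpc:cLpc]
reader.dmLevel = windowed
writer.dmLevel = lpc
p = 8
method = acf
saveLPCoeff = 1

[formantExtractor:cFormantLpc]
reader.dmLevel = lpc
writer.dmLevel = formants
nFormants = 3
saveIntensity = 0

;;;;;;;;;; Spectral Features ;;;;;;;;;;
[spectralExtractor:cSpectral]
reader.dmLevel = fftmag
writer.dmLevel = spectral
centroid = 1
flux = 0

;;;;;;;;;; HNR Extraction ;;;;;;;;;;
;; cPitchJitter reads the waveform plus the smoothed F0 contour
[hnrExtractor:cPitchJitter]
reader.dmLevel = wave
writer.dmLevel = hnr
F0reader.dmLevel = pitch
F0field = F0final
jitterLocal = 0
jitterDDP = 0
shimmerLocal = 0
logHNR = 1

;;;;;;;;;; Functionals (Statistical Summaries) ;;;;;;;;;;
;; functionalsEnabled takes functional groups; the Moments group is
;; restricted here to the mean and standard deviation
[functionals:cFunctionals]
reader.dmLevel = pitch;formants;spectral;hnr
writer.dmLevel = functionals
frameMode = full
functionalsEnabled = Moments
Moments.amean = 1
Moments.stddev = 1
Moments.variance = 0
Moments.skewness = 0
Moments.kurtosis = 0

;;;;;;;;;; Output ;;;;;;;;;;
[csvSink:cCsvSink]
reader.dmLevel = functionals
filename = \cm[outputfile(O){output.csv}:output CSV file]
append = 0
timestamp = 0
instanceName = 1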

Running with custom input/output:

SMILExtract -C gender_features.conf -I speaker1.wav -O speaker1_features.csv

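Output: CSV with one row per file holding the mean and standard deviation of every extracted LLD. The six target features (F0 mean/std, F1 mean, F2 mean, spectral centroid mean, HNR mean) are a subset of these columns; select them downstream.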

Key Components Reference

Here are the most useful openSMILE components for voice analysis:

Audio Input/Output

Component | Purpose | Key Parameters
cWaveSource | Read audio file | filename, monoMixdown=1 (convert stereo to mono)
cPortaudioSource | Live microphone input | audioBuffersize, sampleRate
cCsvSink | Write features to CSV | filename, append, timestamp

Preprocessing

Component | Purpose | Key Parameters
cFramer | Split audio into frames | frameSize=0.025 (25ms), frameStep=0.010 (10ms shift)
cWindower | Apply window function | winFunc=ham (Hamming), winFunc=han (Hanning)
cPreemphasis | High-pass filter | k=0.97 (pre-emphasis coefficient)

Feature Extraction

Component | Purpose | Key Parameters
cPitchShs | F0 (pitch) via subharmonic summation (reads a spectrum from cSpecScale) | minPitch, maxPitch, voicingCutoff
cIntensity | Loudness (RMS energy) | intensity=1, loudness=1
cMfcc | MFCCs (spectral shape) | nMfcc=13, htkcompatible=1
cLpc | LPC coefficients (input to cFormantLpc) | p=8 (LPC order), saveLPCoeff=1
cFormantLpc | Formants (F1, F2, F3) from LPC coefficients | nFormants=3, saveBandwidths=1
cSpectral | Spectral features (centroid, flux, rolloff) | centroid=1, flux=1, rollOff[0]=0.9
cHarmonics | Harmonic differences and ACF-based HNR (requires pitch and spectrum) | harmonicDifferences, computeAcfHnrLogdB=1
cPitchJitter | Jitter (pitch perturbation), shimmer (amplitude perturbation), log HNR | jitterLocal=1, jitterDDP=1, shimmerLocal=1, logHNR=1, F0reader.dmLevel (requires pitch)

Functionals (Statistical Summaries)

Component | Purpose | Key Parameters
cFunctionals | Apply statistical functions to LLDs | functionalsEnabled=Means;Moments;Extremes;Percentiles;Regression (functional groups, each with its own sub-options)
cDeltaRegression | Temporal derivatives (delta, delta-delta) | nameAppend=de (suffix for delta features)

Advanced Configuration Patterns

1. Real-Time Feature Extraction

For live applications (e.g., voice assistants, real-time emotion detection), you need incremental functionals that update every N frames instead of waiting for full audio.

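[componentInstances:cComponentManager]
instance[dataMemory].type = cDataMemory
instance[portaudioInput].type = cPortaudioSource
instance[framer].type = cFramer
instance[windower].type = cWindower
instance[fft].type = cTransformFFT
instance[fftMag].type = cFFTmagphase
instance[specScale].type = cSpecScale
instance[pitchExtractor].type = cPitchShs
instance[functionals].type = cFunctionals
instance[realtimeOutput].type = cCsvSink

[portaudioInput:cPortaudioSource]
writer.dmLevel = wave
audioBuffersize = 4000
sampleRate = 16000

[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames
frameSize = 0.025
frameStep = 0.010

; Spectrum chain required by cPitchShs
[windower:cWindower]
reader.dmLevel = frames
writer.dmLevel = windowed
winFunc = ham

[fft:cTransformFFT]
reader.dmLevel = windowed
writer.dmLevel = fftc

[fftMag:cFFTmagphase]
reader.dmLevel = fftc
writer.dmLevel = fftmag

[specScale:cSpecScale]
reader.dmLevel = fftmag
writer.dmLevel = hps
scale = octave
sourceScale = lin

[pitchExtractor:cPitchShs]
reader.dmLevel = hps
writer.dmLevel = llds
F0raw = 1

[functionals:cFunctionals]
reader.dmLevel = llds
writer.dmLevel = functionals
; Key setting: fixed 1-second windows, advanced every 100ms
frameMode = fixed
frameSize = 1.0
frameStep = 0.1
; Moments group restricted to mean and standard deviation
functionalsEnabled = Moments
Moments.amean = 1
Moments.stddev = 1
Moments.variance = 0
Moments.skewness = 0
Moments.kurtosis = 0

[realtimeOutput:cCsvSink]
reader.dmLevel = functionals
filename = realtime_features.csv
append = 1
timestamp = 1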

Result: Features update every 100ms (frameStep=0.1), computing statistics over sliding 1-second windows.
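The windowing arithmetic is worth checking by hand: with 10ms LLD frames, a 1-second window covers 100 frames and a 100ms step advances by 10 frames. The pure-Python sketch below mimics that behavior on a synthetic contour. It illustrates the sliding-window idea only and is not openSMILE's implementation; the function name is an assumption for this example.

```python
import statistics


def sliding_functionals(contour, frame_period=0.010, window=1.0, step=0.1):
    """Mean and stddev over overlapping windows of a per-frame contour."""
    win = int(round(window / frame_period))   # frames per window
    hop = int(round(step / frame_period))     # frames between window starts
    results = []
    for start in range(0, len(contour) - win + 1, hop):
        chunk = contour[start:start + win]
        results.append({
            "start_s": start * frame_period,
            "mean": statistics.fmean(chunk),
            "stddev": statistics.pstdev(chunk),
        })
    return results


# Three seconds of a synthetic LLD sampled every 10ms
contour = [float(i % 50) for i in range(300)]
windows = sliding_functionals(contour)
print(len(windows), windows[0])
```

Three seconds of input produce 21 windows, the first available only once a full second of audio has arrived. That one-second warm-up is the minimum latency of any application built on this setting.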

2. Multi-Level Feature Extraction

Extract features at multiple time scales (frame-level, utterance-level, turn-level):

[componentInstances:cComponentManager]
; ... (framer, pitch extractor, etc.)
instance[frameLevelFunctionals].type = cFunctionals
instance[utteranceLevelFunctionals].type = cFunctionals

; Frame-level: 1-second windows
[frameLevelFunctionals:cFunctionals]
reader.dmLevel = llds
writer.dmLevel = frame_features
frameMode = fixed
frameSize = 1.0
frameStep = 0.5
; Means covers the arithmetic mean, Moments the standard deviation
functionalsEnabled = Means;Moments

; Utterance-level: Full audio
[utteranceLevelFunctionals:cFunctionals]
reader.dmLevel = llds
writer.dmLevel = utterance_features
frameMode = full
; Extremes: min, max; Percentiles: quartiles; Regression: linregc1, linregc2
functionalsEnabled = Means;Moments;Extremes;Percentiles;Regression

3. Feature Normalization

Normalize features for speaker independence:

[componentInstances:cComponentManager]
; ... (feature extraction)
instance[normalizer].type = cVectorMVN

[normalizer:cVectorMVN]
reader.dmLevel = llds
writer.dmLevel = llds_norm
; Z-score normalization: (x - mean) / std, using running statistics
mode = incremental
meanEnable = 1
stdEnable = 1
; Min-max scaling, (x - min) / (max - min), uses other options;
; list them with: SMILExtract -H cVectorMVN
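If features are already in Python, the same two normalizations are easy to apply per speaker. The sketch below shows why z-scoring helps speaker independence: two speakers with very different pitch ranges but the same relative contour end up with identical normalized values. The speaker labels and F0 values are invented for illustration.

```python
import statistics


def zscore(values):
    """(x - mean) / std; constant input maps to zeros."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def minmax(values):
    """(x - min) / (max - min); constant input maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical per-utterance mean F0 (Hz) for two speakers
f0_by_speaker = {
    "speaker_a": [100.0, 110.0, 120.0],
    "speaker_b": [200.0, 220.0, 240.0],
}

normalized = {spk: zscore(vals) for spk, vals in f0_by_speaker.items()}
scaled = {spk: minmax(vals) for spk, vals in f0_by_speaker.items()}
print(normalized)
print(scaled)
```

Normalize with statistics computed per speaker (or per recording session), and compute them on training data only when the normalization is part of a model pipeline.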

Performance Optimization

1. Reduce Feature Count

Problem: 6,373 ComParE features cause overfitting and slow inference.

Solution: Extract minimal LLDs, then add only essential functionals.

import opensmile

# Extract only LLDs (no functionals)
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)

llds = smile.process_file('audio.wav')  # (time_frames, 25)

# Compute custom functionals in Python.
# F0 is 0 in unvoiced frames, so summarize voiced frames only.
f0 = llds['F0semitoneFrom27.5Hz_sma3nz']
voiced_f0 = f0[f0 > 0]

features = {
    'f0_mean': voiced_f0.mean(),
    'f0_std': voiced_f0.std(),
    'f0_range': voiced_f0.max() - voiced_f0.min(),
    # Note the capital L in the LLD-level column name
    'loudness_mean': llds['Loudness_sma3'].mean(),
    # ... (only features you need)
}

Result: 10-20 custom features instead of 6,373.

2. Batch Processing

Problem: Processing files one-by-one is slow (repeated model initialization).

Solution: Process multiple files in single session.

import opensmile
import glob
import pandas as pd

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

audio_files = glob.glob('dataset/**/*.wav', recursive=True)

all_features = []
for i, audio_path in enumerate(audio_files):
    if i % 100 == 0:
        print(f"Processed {i}/{len(audio_files)}")

    features = smile.process_file(audio_path)
    features['filename'] = audio_path
    all_features.append(features)

df = pd.concat(all_features, ignore_index=True)
df.to_csv('all_features.csv', index=False)

Performance: Processes 10-minute audio in ~1 second on modern CPU (single-threaded). For large datasets, parallelize with multiprocessing:

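from multiprocessing import Pool

def extract_features(audio_path):
    # Each worker process builds its own extractor
    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.eGeMAPSv02,
        feature_level=opensmile.FeatureLevel.Functionals,
    )
    return smile.process_file(audio_path)

# The guard is required on platforms that spawn worker processes
if __name__ == '__main__':
    with Pool(processes=8) as pool:
        all_features = pool.map(extract_features, audio_files)
    df = pd.concat(all_features)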

3. Caching for Real-Time Applications

Problem: Re-extracting features for every prediction is wasteful.

Solution: Cache features in Redis or in-memory.

import opensmile
import redis
import pickle

redis_client = redis.Redis(host='localhost', port=6379)

# Create the extractor once and reuse it across calls
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def get_features_cached(audio_path):
    cache_key = f"features:{audio_path}"

    # Check cache (only unpickle data from a Redis instance you trust)
    cached = redis_client.get(cache_key)
    if cached is not None:
        return pickle.loads(cached)

    # Extract features on a cache miss
    features = smile.process_file(audio_path)

    # Store in cache (24-hour expiry)
    redis_client.setex(cache_key, 86400, pickle.dumps(features))

    return features
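One weakness of keying the cache on the file path is that a file overwritten in place keeps returning stale features, while identical audio under two names is extracted twice. A possible alternative, sketched below with an in-memory dictionary standing in for Redis, is to key on a hash of the file contents. The FeatureCache class, the content_key helper, and the fake extractor are all hypothetical names for this illustration.

```python
import hashlib
import os
import tempfile


def content_key(path, prefix="features", chunk_size=1 << 20):
    """Cache key derived from the file's bytes rather than its name."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"{prefix}:{digest.hexdigest()}"


class FeatureCache:
    """In-memory stand-in for Redis; extractor is any callable taking a path."""

    def __init__(self, extractor):
        self._extractor = extractor
        self._store = {}
        self.misses = 0

    def get(self, path):
        key = content_key(path)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._extractor(path)
        return self._store[key]


def fake_extractor(path):
    # Placeholder for smile.process_file(path)
    return {"n_bytes": os.path.getsize(path)}


with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a.wav")
    duplicate = os.path.join(tmp, "b.wav")
    for path in (first, duplicate):
        with open(path, "wb") as handle:
            handle.write(b"same audio bytes")

    cache = FeatureCache(fake_extractor)
    cache.get(first)
    cache.get(first)
    result = cache.get(duplicate)
    miss_count = cache.misses

print(miss_count, result)
```

Hashing reads the whole file, so it only pays off when extraction is clearly more expensive than a sequential read, which is normally true for audio feature extraction.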

Common Configuration Mistakes

1. Incorrect Frame Size/Shift

Mistake:

frameSize = 25  ; Interpreted as 25 SECONDS (way too large)
frameStep = 10

Fix:

frameSize = 0.025  ; 25 milliseconds
frameStep = 0.010  ; 10 milliseconds

2. Missing Data Memory Levels

Mistake:

[pitchExtractor:cPitchShs]
reader.dmLevel = frames
writer.dmLevel = pitch

[functionals:cFunctionals]
reader.dmLevel = llds  ; ERROR: No component writes to 'llds'

Fix: Ensure writer of one component matches reader of next.

[pitchExtractor:cPitchShs]
reader.dmLevel = frames
writer.dmLevel = llds  ; Changed to 'llds'

[functionals:cFunctionals]
reader.dmLevel = llds  ; Now matches
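This class of mistake can be caught before running SMILExtract. The sketch below is a small, hypothetical lint script (not part of openSMILE) that scans a config for reader.dmLevel values no component writes. It assumes comments sit on their own lines, because a semicolon inside a value is treated as a separator between multiple levels.

```python
import re

SECTION = re.compile(r"^\[(\w+):(\w+)\]$")


def check_levels(config_text):
    """Return (component, level) pairs whose reader level has no writer."""
    written = set()
    reads = []
    component = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # blank line or full-line comment
        match = SECTION.match(line)
        if match:
            component = match.group(1)
            continue
        if component is None or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        levels = [lvl.strip() for lvl in value.split(";") if lvl.strip()]
        if key == "writer.dmLevel":
            written.update(levels)
        elif key == "reader.dmLevel":
            reads.extend((component, lvl) for lvl in levels)
    return [(comp, lvl) for comp, lvl in reads if lvl not in written]


BROKEN = """
[waveSource:cWaveSource]
writer.dmLevel = wave

[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames

[pitchExtractor:cPitchShs]
reader.dmLevel = frames
writer.dmLevel = pitch

[functionals:cFunctionals]
reader.dmLevel = llds
writer.dmLevel = functionals
"""

problems = check_levels(BROKEN)
print(problems)
```

For the broken example the script reports that the functionals component reads llds, which nothing writes. The same check passes once the pitch extractor's writer level is renamed to llds.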

3. Pitch Range Too Narrow

Mistake:

[pitchExtractor:cPitchShs]
minPitch = 80   ; Cuts off low male voices (male F0 often 85-180 Hz, sometimes lower)
maxPitch = 250  ; Clips high female and child voices (female F0 often 180-250 Hz, sometimes higher)

Fix: Use wide range for mixed-gender datasets.

minPitch = 50   ; Leaves margin below typical male F0 (85 Hz and up)
maxPitch = 500  ; Leaves margin above typical female F0 (up to 250 Hz)

4. Forgetting Mono Conversion

Mistake: Processing stereo audio without mixing to mono causes feature duplication.

Fix:

[waveSource:cWaveSource]
monoMixdown = 1  ; Convert stereo to mono

openSMILE vs Custom Feature Extraction

When to use openSMILE:

  • Research comparison: Need to benchmark against prior studies
  • Interpretability: Clinical applications requiring explainable features
  • Low-latency: Real-time processing on CPU
  • Production C++ integration: Embed in mobile/edge devices

When to use custom deep learning features (Wav2vec 2.0, etc.):

  • SOTA performance: Need maximum accuracy (2-5% improvement over openSMILE)
  • End-to-end training: Fine-tune feature extraction with task-specific loss
  • Large datasets: >10,000 samples to train deep models
  • GPU availability: Can afford 10-100ms inference latency

Hybrid approach (recommended):

  • openSMILE for classical ML: Fast baseline (Random Forest on eGeMAPS: 70-85% accuracy)
  • Wav2vec 2.0 for deep learning: SOTA ensemble model (75-95% accuracy)
  • Combine predictions: Weighted average or stacking (additional 1-3% boost)
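The weighted-average option in the last bullet amounts to a few lines of arithmetic. The sketch below combines class probabilities from two models; the probabilities and weights are invented for illustration, and in practice the weights would be tuned on a validation set.

```python
def weighted_average(prob_sets, weights):
    """Combine per-model class probabilities with a weighted mean."""
    if len(prob_sets) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    n_classes = len(prob_sets[0])
    return [
        sum(w * probs[c] for probs, w in zip(prob_sets, weights)) / total
        for c in range(n_classes)
    ]


# Hypothetical outputs for one utterance: [P(class 0), P(class 1)]
classical = [0.70, 0.30]   # e.g. Random Forest on eGeMAPS
deep = [0.40, 0.60]        # e.g. a fine-tuned Wav2vec 2.0 model

combined = weighted_average([classical, deep], weights=[1.0, 2.0])
print(combined)
```

With the deep model weighted twice as heavily, the two disagreeing predictions balance out to roughly 0.5 each, and the combined probabilities still sum to 1.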

Production Deployment

Option 1: Python Package (Easiest)

# Dockerfile
FROM python:3.10-slim

RUN pip install opensmile scikit-learn

COPY extract_features.py /app/
CMD ["python", "/app/extract_features.py"]

Pros: Zero configuration, easy debugging, integrates with Python ML stack

Cons: Slower startup (~100ms to initialize), larger Docker image (~500MB)

Option 2: C++ Binary (Fastest)

# Dockerfile
FROM ubuntu:22.04

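# Assumption: openSMILE is not installed from apt here. Copy in a
# SMILExtract binary that you built or downloaded from the official releases.
COPY SMILExtract /usr/local/bin/SMILExtract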

COPY custom_config.conf /app/
CMD ["SMILExtract", "-C", "/app/custom_config.conf", "-I", "/input/audio.wav", "-O", "/output/features.csv"]

Pros: Fastest startup (<10ms), smallest image (~50MB), lowest memory

Cons: Harder to debug, requires config file maintenance

Option 3: FastAPI Microservice

# feature_service.py
from fastapi import FastAPI, UploadFile
import opensmile
import tempfile
import os

app = FastAPI()

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

@app.post("/extract_features")
async def extract_features(audio: UploadFile):
    # Save uploaded audio to temp file
    with tempfile.NamedTemporaryFile(delete=False, suffix='.wav') as temp_audio:
        temp_audio.write(await audio.read())
        temp_path = temp_audio.name

    # Extract features
    features = smile.process_file(temp_path)
    os.unlink(temp_path)

    # Return as JSON
    return features.to_dict(orient='records')[0]

Usage:

curl -X POST "http://localhost:8000/extract_features" \
  -F "audio=@audio.wav"

Pros: Language-agnostic API, scales horizontally, easy monitoring

Debugging openSMILE

1. Enable Verbose Logging

SMILExtract -C config.conf -l 3  # Log level 3 (verbose)

Output shows: Component initialization, data flow between levels, timing per component.

2. Inspect Intermediate Data

Add multiple cCsvSink components to dump intermediate features:

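; Register both sinks under [componentInstances:cComponentManager] as well:
; instance[lld_dump].type = cCsvSink
; instance[functionals_dump].type = cCsvSink

; Dump LLDs before functionals
[lld_dump:cCsvSink]
reader.dmLevel = llds
filename = debug_llds.csv

; Dump final features
[functionals_dump:cCsvSink]
reader.dmLevel = functionals
filename = debug_functionals.csv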

3. Python Debugging with LLDs

import opensmile
import matplotlib.pyplot as plt

# Extract frame-by-frame features
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)

llds = smile.process_file('audio.wav')

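# Plot pitch over time (the index is a MultiIndex of file, start, end)
time_s = llds.index.get_level_values('start').total_seconds()
plt.figure(figsize=(12, 4))
plt.plot(time_s, llds['F0semitoneFrom27.5Hz_sma3nz'])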
plt.xlabel('Time (s)')
plt.ylabel('F0 (semitones from 27.5 Hz)')
plt.title('Pitch Contour')
plt.savefig('pitch_contour.png')

Resources & Further Learning

Key Research Papers:

  • openSMILE architecture: Eyben et al. (2010) - "openSMILE - The Munich Versatile and Fast Open-Source Audio Feature Extractor" (ACM MM)
  • GeMAPS feature set: Eyben et al. (2016) - "The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing" (IEEE Transactions on Affective Computing)
  • ComParE feature set: Schuller et al. (2013) - "The INTERSPEECH 2013 Computational Paralinguistics Challenge" (INTERSPEECH)

Example Configs:

  • /path/to/opensmile/config/ - Standard feature set configs (installed with openSMILE)
  • gemaps/v01b/eGeMAPSv01b.conf - eGeMAPS standard config
  • compare/ComParE_2016.conf - ComParE challenge config

The Bottom Line

openSMILE is your feature extraction Swiss Army knife—but you don't need to master every component to be effective.

For 90% of voice analysis tasks:

  1. Use the Python package (pip install opensmile)
  2. Start with eGeMAPS (88 features, best balance)
  3. Extract functionals (not LLDs) for utterance-level classification
  4. Train classical ML model (Random Forest) as baseline
  5. If accuracy insufficient, try ComParE (6,373 features) with feature selection

For custom pipelines (real-time, edge devices, C++ integration):

  1. Start with minimal example config (pitch extraction)
  2. Add components incrementally (formants, spectral, voice quality)
  3. Test intermediate outputs with cCsvSink at each stage
  4. Optimize: reduce LLDs to essential features, use sliding functionals for real-time

The goal isn't to extract 6,000 features—it's to extract the right features for your specific task. openSMILE gives you the tools; your domain knowledge determines which ones to use.

Voice Mirror uses openSMILE eGeMAPS (88 features) for interpretable baseline models and ComParE (6,373 features with Random Forest feature selection) for SOTA ensemble models. Our hybrid approach combines classical ML interpretability with deep learning accuracy across 20+ voice analysis tasks.
