Analyzers API Reference

The audiomancer.analyzers module provides audio analysis functionality.

Overview

analyzers

Audio analysis for audiomancer.

Provides interfaces for SynthDef parsing and audio feature extraction, including basic metadata, spectral, rhythm, tonal analysis, ML classification, and audio embeddings.

__all__ = ['ControlSpec', 'SynthControl', 'SynthDefMetadata', 'SynthDefParser', 'SynthDefStore', 'parse_synthdef', 'categorize_synthdef', 'SynthDefInfo', 'SynthDefControl', 'get_basic_metadata', 'BasicMetadata', 'extract_spectral_features', 'SpectralFeatures', 'extract_rhythm_features', 'extract_tonal_features', 'classify_instrument', 'extract_mood_tags', 'extract_genre_tags', 'extract_audio_embedding', 'cosine_similarity', 'euclidean_distance', 'load_model', 'download_model', 'list_models', 'clear_cache'] module-attribute

BasicMetadata

Bases: TypedDict

Basic audio file metadata.

Attributes:

Name Type Description
duration_ms float

Audio duration in milliseconds

sample_rate int

Sample rate in Hz

channels int

Number of audio channels

bit_depth int

Bit depth (16 assumed for librosa float32 conversion)

file_size_bytes int

File size in bytes

file_hash str

SHA256 hex digest of file contents

ControlSpec

Bases: TypedDict

Specification for a SynthDef control parameter.

Describes the valid range and characteristics of a synth control.

Example

>>> spec = ControlSpec(
...     min=200.0,
...     max=4000.0,
...     default=1200.0,
...     warp="exp",  # Exponential scaling for frequency
...     step=1.0,
... )
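The warp field controls how a normalized knob position maps onto the min/max range. A minimal sketch of the usual interpretation of "linear" and "exp" warps; the map_warp helper below is illustrative, not part of the audiomancer API:

```python
import math

def map_warp(position: float, lo: float, hi: float, warp: str = "linear") -> float:
    """Map a normalized position (0-1) into [lo, hi] using the given warp."""
    if warp == "exp":
        # Exponential warp: equal knob fractions cover equal frequency
        # ratios, which suits pitch/cutoff parameters (requires lo > 0).
        return lo * math.exp(position * math.log(hi / lo))
    # Linear warp: equal knob fractions cover equal differences.
    return lo + position * (hi - lo)

print(map_warp(0.5, 200.0, 4000.0, "linear"))  # midpoint by difference: 2100.0
print(map_warp(0.5, 200.0, 4000.0, "exp"))     # midpoint by ratio: ~894.4
```

With an exponential warp, the half-way knob position lands near 894 Hz rather than 2100 Hz, matching how we hear pitch.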

SpectralFeatures

Bases: TypedDict

Spectral audio features.

All frequency values are in Hz, energy values are linear (0-1 range).

Attributes:

Name Type Description
spectral_centroid float

Mean brightness/center of mass of spectrum (Hz)

spectral_bandwidth float

Frequency spread around centroid (Hz proxy)

spectral_rolloff float

High-frequency content cutoff point (Hz)

zero_crossing_rate float

Measure of noisiness/percussiveness (0-1)

rms_energy float

Root-mean-square energy level (0-1 linear)

dynamic_range float

Peak-to-average ratio (dB)

SynthControl

Bases: TypedDict

A control parameter extracted from a SynthDef.

Represents a parameter that can be modified during synthesis.

Example

>>> control = SynthControl(
...     name="cutoff",
...     default_value=1200.0,
...     spec=ControlSpec(
...         min=200.0,
...         max=4000.0,
...         default=1200.0,
...         warp="exp",
...         step=1.0,
...     ),
... )

SynthDefControl dataclass

A SynthDef control parameter.

Attributes:

Name Type Description
name str

Parameter name (e.g., "freq", "cutoff")

default_value float

Default value for the parameter

spec Optional[str]

ControlSpec if specified (e.g., "\freq.asSpec")

description Optional[str]

Human-readable description (if available)

Example

>>> ctrl = SynthDefControl(name="freq", default_value=440.0)
>>> ctrl.name
'freq'
>>> ctrl.default_value
440.0

SynthDefInfo dataclass

Parsed SynthDef metadata.

Attributes:

Name Type Description
name str

SynthDef name (e.g., "tb303", "simple_sine")

file_path str

Absolute path to .scd file

file_hash str

SHA256 hash of source code

num_channels int

Output channel count

has_gate bool

Whether synth has gate parameter for note-off

has_envelope bool

Whether synth uses EnvGen

ugens_used list[str]

List of UGen class names used

controls list[SynthControl]

List of control parameters

source_code str

Raw SuperCollider source code

category Optional[str]

Inferred category (bass, lead, pad, drum, fx)

tags list[str]

Additional tags for categorization

Example

>>> info = SynthDefInfo(
...     name="simple_sine",
...     file_path="/path/to/simple_sine.scd",
...     file_hash="abc123",
...     num_channels=2,
...     has_gate=True,
...     has_envelope=True,
...     ugens_used=["SinOsc", "EnvGen", "Out"],
...     controls=[SynthControl("freq", 440.0)],
...     source_code="SynthDef(...)",
...     category="lead",
... )

SynthDefMetadata

Bases: TypedDict

Metadata extracted from a SynthDef file.

Contains all information parsed from a .scd file including controls, UGens used, and categorization.

Example

>>> synthdef = SynthDefMetadata(
...     id="synt_tb303",
...     name="tb303",
...     file_path="/synths/tb303.scd",
...     file_hash="abc123",
...     num_channels=1,
...     has_gate=True,
...     has_envelope=True,
...     ugens_used=["SinOsc", "Resonz", "EnvGen", "Out"],
...     category="bass",
...     tags=["acid", "303", "classic"],
...     source_code="SynthDef(\tb303, { ... })",
...     controls=[
...         SynthControl(
...             name="cutoff",
...             default_value=1200.0,
...             spec=ControlSpec(
...                 min=200.0,
...                 max=4000.0,
...                 default=1200.0,
...                 warp="exp",
...                 step=1.0,
...             ),
...         ),
...         SynthControl(
...             name="resonance",
...             default_value=0.7,
...             spec=ControlSpec(
...                 min=0.0,
...                 max=1.0,
...                 default=0.7,
...                 warp="linear",
...                 step=0.01,
...             ),
...         ),
...     ],
... )

SynthDefParser

Bases: Protocol

Interface for parsing SuperCollider SynthDef files.

Extracts controls, UGens, and metadata from .scd files using sclang.

extract_controls(source_code: str) -> list[SynthControl]

Extract control parameters from SynthDef source code.

Parses arg declarations and infers specs from usage.

Parameters:

Name Type Description Default
source_code str

SuperCollider source code

required

Returns:

Type Description
list[SynthControl]

List of extracted controls

Example

>>> code = '''
... SynthDef(\tb303, { |cutoff=1200, resonance=0.7|
...     var sig = Saw.ar(freq);
...     sig = Resonz.ar(sig, cutoff, 1/resonance);
...     Out.ar(0, sig);
... })
... '''
>>> controls = parser.extract_controls(code)
>>> controls
[
    SynthControl(name="cutoff", default_value=1200.0, ...),
    SynthControl(name="resonance", default_value=0.7, ...),
]
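The real parser runs the source through sclang, but the arg-declaration part can be approximated with a regex over the `|name=default, ...|` block. A toy sketch of that fallback idea (extract_controls_sketch is illustrative, not the library's implementation):

```python
import re

def extract_controls_sketch(source_code: str) -> list[dict]:
    """Pull |name=default, ...| declarations out of SynthDef source (rough sketch)."""
    controls: list[dict] = []
    # Match the argument block between the first pair of pipes:
    # |cutoff=1200, resonance=0.7|
    block = re.search(r"\|([^|]*)\|", source_code)
    if not block:
        return controls
    for part in block.group(1).split(","):
        m = re.match(r"\s*(\w+)\s*(?:=\s*([-\d.]+))?\s*$", part)
        if m:
            controls.append({
                "name": m.group(1),
                "default_value": float(m.group(2)) if m.group(2) else 0.0,
            })
    return controls

code = r"SynthDef(\tb303, { |cutoff=1200, resonance=0.7| ... })"
print(extract_controls_sketch(code))
# [{'name': 'cutoff', 'default_value': 1200.0}, {'name': 'resonance', 'default_value': 0.7}]
```

A regex cannot handle every SuperCollider declaration style (e.g. `arg` keyword syntax or expression defaults), which is why sclang is preferred when available.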

extract_ugens(source_code: str) -> list[str]

Extract UGen class names from source code.

Finds all UGen.ar() and UGen.kr() calls.

Parameters:

Name Type Description Default
source_code str

SuperCollider source code

required

Returns:

Type Description
list[str]

List of unique UGen class names

Example

>>> code = '''
... SynthDef(\tb303, {
...     var sig = Saw.ar(440);
...     sig = Resonz.ar(sig, 1200);
...     Out.ar(0, sig);
... })
... '''
>>> ugens = parser.extract_ugens(code)
>>> ugens
['Saw', 'Resonz', 'Out']
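Since UGens are invoked as capitalized class names followed by `.ar` or `.kr`, the scan reduces to a simple pattern match. A minimal sketch of that idea (extract_ugens_sketch is illustrative, not the library's implementation):

```python
import re

def extract_ugens_sketch(source_code: str) -> list[str]:
    """Collect unique UGen class names from .ar()/.kr() calls, in order of first use."""
    seen: list[str] = []
    # UGen classes are capitalized identifiers followed by .ar or .kr
    for name in re.findall(r"\b([A-Z]\w*)\.(?:ar|kr)\b", source_code):
        if name not in seen:
            seen.append(name)
    return seen

code = """
SynthDef(\\tb303, {
    var sig = Saw.ar(440);
    sig = Resonz.ar(sig, 1200);
    Out.ar(0, sig);
})
"""
print(extract_ugens_sketch(code))  # ['Saw', 'Resonz', 'Out']
```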

infer_category(metadata: SynthDefMetadata) -> str

Infer synth category from UGens and controls.

Categorization rules:

- bass: Low-pass filter + low frequency range
- lead: High resonance + envelope
- pad: Long envelope + multiple oscillators
- drum: Noise + short envelope
- fx: No oscillators, effect UGens only

Parameters:

Name Type Description Default
metadata SynthDefMetadata

Parsed SynthDef metadata

required

Returns:

Type Description
str

Category string

Example

>>> metadata = SynthDefMetadata(
...     ugens_used=["Saw", "Resonz", "EnvGen"],
...     controls=[
...         SynthControl(name="cutoff", default_value=1200, ...),
...     ],
...     ...
... )
>>> category = parser.infer_category(metadata)
>>> category
'bass'
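The rule table above amounts to an ordered cascade of UGen/control checks. A toy version of that cascade (the UGen sets and rule order here are illustrative assumptions; the real heuristics are more involved):

```python
def infer_category_sketch(ugens: list[str], control_names: list[str]) -> str:
    """Toy rule cascade over UGen names and control names."""
    oscillators = {"SinOsc", "Saw", "Pulse", "LFSaw", "LFTri"}
    noise = {"WhiteNoise", "PinkNoise", "BrownNoise"}
    lowpass = {"MoogFF", "RLPF", "LPF", "Resonz"}

    has_osc = any(u in oscillators for u in ugens)
    if not has_osc:
        return "fx"        # no oscillators: effect/noise processor
    if any(u in noise for u in ugens):
        return "drum"      # noise component suggests percussion
    if any(u in lowpass for u in ugens) and "cutoff" in control_names:
        return "bass"      # filtered synth with a cutoff control
    if "EnvGen" in ugens:
        return "lead"      # pitched synth with an envelope
    return "pad"

print(infer_category_sketch(["Saw", "Resonz", "EnvGen"], ["cutoff", "resonance"]))  # bass
```

Rule order matters: checks are applied most-specific first, and "pad" acts as the fallthrough.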

parse(file_path: str, timeout: int = 5) -> SynthDefMetadata

Parse a SynthDef file and extract metadata.

Uses subprocess to run sclang with shell=False and timeout for safety. Falls back to binary parser if sclang fails.

Parameters:

Name Type Description Default
file_path str

Absolute path to .scd file

required
timeout int

Maximum time to wait for sclang (seconds)

5

Returns:

Type Description
SynthDefMetadata

Complete SynthDef metadata

Raises:

Type Description
FileNotFoundError

If file_path does not exist

ParseError

If file cannot be parsed (invalid syntax)

SubprocessTimeoutError

If sclang exceeds timeout

Example

>>> parser = SynthDefParser()
>>> metadata = parser.parse("/synths/tb303.scd", timeout=5)
>>> metadata['name']
'tb303'
>>> len(metadata['controls'])
7
>>> metadata['controls'][0]['name']
'cutoff'

parse_batch(file_paths: list[str], timeout: int = 5) -> list[SynthDefMetadata]

Parse multiple SynthDef files in batch.

More efficient than individual parse() calls for many files.

Parameters:

Name Type Description Default
file_paths list[str]

List of absolute paths to .scd files

required
timeout int

Maximum time per file (seconds)

5

Returns:

Type Description
list[SynthDefMetadata]

List of SynthDef metadata in same order as input

Raises:

Type Description
FileNotFoundError

If any file does not exist (fails fast)

ParseError

On first parse failure (no partial results)

Example

>>> parser = SynthDefParser()
>>> files = ["/synths/tb303.scd", "/synths/juno.scd"]
>>> results = parser.parse_batch(files, timeout=5)
>>> len(results)
2

validate_path(file_path: str) -> bool

Validate that file path is safe and exists.

Checks for:

- File existence
- .scd extension
- No path traversal (../)
- Readable permissions

Parameters:

Name Type Description Default
file_path str

Path to validate

required

Returns:

Type Description
bool

True if valid, False otherwise

Example

>>> parser = SynthDefParser()
>>> parser.validate_path("/synths/tb303.scd")
True
>>> parser.validate_path("/etc/passwd")  # Wrong extension
False
>>> parser.validate_path("../../etc/passwd.scd")  # Traversal
False
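The four checks above translate directly into pathlib/os calls. A minimal sketch of that validation order (validate_path_sketch is illustrative, not the library's implementation):

```python
import os
from pathlib import Path

def validate_path_sketch(file_path: str) -> bool:
    """Illustrative path check: traversal, extension, existence, readability."""
    path = Path(file_path)
    if ".." in path.parts:           # reject explicit traversal segments
        return False
    if path.suffix != ".scd":        # only SuperCollider source files
        return False
    if not path.is_file():           # must exist and be a regular file
        return False
    return os.access(path, os.R_OK)  # must be readable

print(validate_path_sketch("../../etc/passwd.scd"))  # False (traversal)
print(validate_path_sketch("/etc/passwd"))           # False (wrong extension)
```

A hardened implementation would also resolve symlinks (`Path.resolve()`) and confirm the result stays inside an allowed base directory.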

SynthDefStore

Bases: Protocol

Interface for SynthDef storage operations.

Similar to SampleStore but for synthesizer definitions.

add(synthdef: SynthDefMetadata) -> str

Add SynthDef to database.

Parameters:

Name Type Description Default
synthdef SynthDefMetadata

Complete SynthDef metadata

required

Returns:

Type Description
str

Synth ID (format: "synt_{hash[:8]}")

Raises:

Type Description
DuplicateSynthError

If SynthDef with same name already exists

Example

>>> synthdef = SynthDefMetadata(
...     name="tb303",
...     file_path="/synths/tb303.scd",
...     file_hash="abc123",
...     num_channels=1,
...     has_gate=True,
...     has_envelope=True,
...     ugens_used=["Saw", "Resonz", "EnvGen", "Out"],
...     source_code="SynthDef(...)",
...     controls=[...],
... )
>>> synth_id = store.add(synthdef)
>>> synth_id
'synt_abc123'
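The documented ID format is "synt_{hash[:8]}". A minimal sketch of deriving such an ID, assuming the hash is the SHA256 of the SynthDef source (as used for file_hash elsewhere in this module); make_synth_id is illustrative, not part of the API:

```python
import hashlib

def make_synth_id(source_code: str) -> str:
    """Derive a stable 'synt_{hash[:8]}' ID from SynthDef source text."""
    digest = hashlib.sha256(source_code.encode("utf-8")).hexdigest()
    return f"synt_{digest[:8]}"

sid = make_synth_id("SynthDef(\\tb303, { ... })")
print(sid)  # e.g. 'synt_' followed by 8 hex characters
```

Content-derived IDs make add() naturally idempotent for identical sources, which is why a name collision raises DuplicateSynthError rather than silently overwriting.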

delete(synth_id: str) -> bool

Delete SynthDef from database.

Parameters:

Name Type Description Default
synth_id str

Synth ID to delete

required

Returns:

Type Description
bool

True if deleted, False if not found

Example

>>> success = store.delete("synt_abc123")
>>> success
True

get(synth_id: str) -> Optional[SynthDefMetadata]

Retrieve SynthDef by ID.

Parameters:

Name Type Description Default
synth_id str

Synth ID (format: "synt_{hash[:8]}")

required

Returns:

Type Description
Optional[SynthDefMetadata]

SynthDef metadata if found, None otherwise

Example

>>> synthdef = store.get("synt_abc123")
>>> synthdef['name']
'tb303'
>>> store.get("synt_nonexistent")
None

get_by_name(name: str) -> Optional[SynthDefMetadata]

Retrieve SynthDef by name.

Parameters:

Name Type Description Default
name str

SynthDef name

required

Returns:

Type Description
Optional[SynthDefMetadata]

SynthDef metadata if found, None otherwise

Example

>>> synthdef = store.get_by_name("tb303")
>>> synthdef['id']
'synt_abc123'

search(category: Optional[str] = None, has_gate: Optional[bool] = None, tags: Optional[list[str]] = None, limit: int = 50, offset: int = 0) -> list[SynthDefMetadata]

Search SynthDefs with filters.

Parameters:

Name Type Description Default
category Optional[str]

Filter by category (bass, lead, pad, drum, fx)

None
has_gate Optional[bool]

Filter by gate presence

None
tags Optional[list[str]]

Filter by tags (matches if ANY tag present)

None
limit int

Maximum results to return

50
offset int

Number of results to skip (for pagination)

0

Returns:

Type Description
list[SynthDefMetadata]

List of matching SynthDefs

Example
>>> # Find bass synths with gate
>>> results = store.search(
...     category="bass",
...     has_gate=True,
...     limit=10,
... )
>>> len(results) <= 10
True
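The filter semantics (exact category/gate match, ANY-tag match, then offset/limit pagination) can be sketched over an in-memory list; search_sketch below is an illustrative model of those semantics, not the store's actual query implementation:

```python
from typing import Any, Optional

def search_sketch(
    synths: list[dict[str, Any]],
    category: Optional[str] = None,
    has_gate: Optional[bool] = None,
    tags: Optional[list[str]] = None,
    limit: int = 50,
    offset: int = 0,
) -> list[dict[str, Any]]:
    """In-memory version of the documented filters: None means 'no filter'."""
    results = []
    for s in synths:
        if category is not None and s.get("category") != category:
            continue
        if has_gate is not None and s.get("has_gate") != has_gate:
            continue
        # Tag filter matches if ANY requested tag is present
        if tags is not None and not set(tags) & set(s.get("tags", [])):
            continue
        results.append(s)
    return results[offset:offset + limit]

synths = [
    {"name": "tb303", "category": "bass", "has_gate": True, "tags": ["acid"]},
    {"name": "kick", "category": "drum", "has_gate": False, "tags": []},
]
print([s["name"] for s in search_sketch(synths, category="bass", has_gate=True)])  # ['tb303']
```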

update(synth_id: str, updates: dict[str, Any]) -> bool

Update SynthDef fields.

Parameters:

Name Type Description Default
synth_id str

Synth ID to update

required
updates dict[str, Any]

Dictionary of field names and new values

required

Returns:

Type Description
bool

True if updated, False if not found

Example

>>> success = store.update(
...     "synt_abc123",
...     {"category": "lead", "tags": ["acid", "303"]},
... )
>>> success
True

categorize_synthdef(info: SynthDefInfo) -> str

Infer category from UGens and controls.

Categories:

- bass: Low-frequency synths with filters (MoogFF, RLPF)
- lead: Pitched synths with envelopes and gate
- pad: Long sustained synths with gate
- drum: Percussive synths without gate or noise-based
- fx: Effect processors, noise generators

Parameters:

Name Type Description Default
info SynthDefInfo

SynthDefInfo to categorize

required

Returns:

Type Description
str

Category string (bass, lead, pad, drum, fx)

Example

>>> info = SynthDefInfo(...)
>>> categorize_synthdef(info)
'bass'

classify_instrument(audio: np.ndarray, sr: int, model_path: Optional[str] = None, top_k: int = 3) -> dict[str, Any]

Classify audio into instrument categories using Essentia's pre-trained models.

Uses the MTG-Jamendo instrument classification model to detect instrument presence. This requires a two-stage pipeline:

1. Extract embeddings using the discogs-effnet base model
2. Pass embeddings to the classification head

Parameters:

Name Type Description Default
audio ndarray

Audio samples as numpy array (mono or stereo)

required
sr int

Sample rate in Hz

required
model_path Optional[str]

Path to custom model file, or None to use default

None
top_k int

Number of top predictions to return

3

Returns:

Type Description
dict[str, Any]

Dictionary with keys:
  • instrument_type: Most likely instrument (str)
  • instrument_confidence: Confidence score 0-1 (float)
  • top_predictions: List of (instrument, confidence) tuples

Raises:

Type Description
ModelLoadError

If model file not found

AnalysisFailedError

If classification fails

Example

>>> result = classify_instrument(audio, 44100)
>>> result['instrument_type']
'drums'
>>> result['instrument_confidence']
0.92
>>> result['top_predictions']
[('drums', 0.92), ('percussion', 0.78), ('beat', 0.45)]

clear_cache(model_type: Optional[ModelType] = None) -> None

Clear model cache.

Parameters:

Name Type Description Default
model_type Optional[ModelType]

Specific model to clear, or None to clear all

None
Example

>>> clear_cache("musicnn")  # Clear specific model
>>> clear_cache()  # Clear all models

cosine_similarity(embedding1: list[float], embedding2: list[float]) -> float

Compute cosine similarity between two embeddings.

Since embeddings are L2-normalized, cosine similarity is just the dot product.

Parameters:

Name Type Description Default
embedding1 list[float]

First embedding (128-dim)

required
embedding2 list[float]

Second embedding (128-dim)

required

Returns:

Type Description
float

Cosine similarity in range [-1, 1] (1 = identical, 0 = orthogonal, -1 = opposite)

Example

>>> emb1 = extract_audio_embedding(audio1, 44100)
>>> emb2 = extract_audio_embedding(audio2, 44100)
>>> similarity = cosine_similarity(emb1, emb2)
>>> 0 <= similarity <= 1  # For typical audio
True
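Because the embeddings are already unit-length, the similarity really is a single dot product. A self-contained sketch of that identity (the helper names are illustrative, not the audiomancer functions):

```python
import math

def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity_sketch(a: list[float], b: list[float]) -> float:
    """For L2-normalized vectors, cosine similarity reduces to the dot product."""
    return sum(x * y for x, y in zip(a, b))

a = l2_normalize([1.0, 2.0, 3.0])
b = l2_normalize([3.0, 2.0, 1.0])
print(cosine_similarity_sketch(a, a))  # ~1.0 (identical direction)
print(cosine_similarity_sketch(a, b))  # < 1.0 (different direction)
```

Skipping the per-call norm computation matters when comparing one query embedding against thousands of stored vectors.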

download_model(model_type: ModelType, force: bool = False, verify_checksum: bool = True) -> Path

Download an Essentia model from the model zoo.

Downloads to ~/.local/share/audiomancer/models/ and verifies checksum.

Parameters:

Name Type Description Default
model_type ModelType

Model type to download

required
force bool

Force re-download even if file exists

False
verify_checksum bool

Verify SHA256 checksum after download

True

Returns:

Type Description
Path

Path to downloaded model file

Raises:

Type Description
ModelLoadError

If download fails or checksum mismatch

Example

>>> path = download_model("musicnn")
>>> path.exists()
True
>>> path.stat().st_size > 0
True

euclidean_distance(embedding1: list[float], embedding2: list[float]) -> float

Compute Euclidean distance between two embeddings.

Parameters:

Name Type Description Default
embedding1 list[float]

First embedding (128-dim)

required
embedding2 list[float]

Second embedding (128-dim)

required

Returns:

Type Description
float

Euclidean distance (lower = more similar)

Example

>>> emb1 = extract_audio_embedding(audio1, 44100)
>>> emb2 = extract_audio_embedding(audio2, 44100)
>>> distance = euclidean_distance(emb1, emb2)
>>> distance >= 0
True
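For unit-norm embeddings, Euclidean distance and cosine similarity are tied by d² = 2 − 2·cos, so ranking by distance simply inverts ranking by similarity. A small sketch of both the metric and the identity (euclidean_distance_sketch is illustrative, not the library function):

```python
import math

def euclidean_distance_sketch(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two orthogonal unit vectors: cosine similarity is 0,
# so the identity predicts d = sqrt(2 - 0) = sqrt(2).
a = [1.0, 0.0]
b = [0.0, 1.0]
cos = sum(x * y for x, y in zip(a, b))
print(euclidean_distance_sketch(a, b))  # 1.4142...
print(math.sqrt(2 - 2 * cos))           # same value via the identity
```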

extract_audio_embedding(audio: np.ndarray, sr: int, model: ModelType = 'musicnn') -> list[float]

Extract 128-dimensional audio embedding for similarity search.

Embeddings are L2-normalized fixed-size vectors that encode audio content. Similar-sounding audio will have similar embeddings (high cosine similarity).

Parameters:

Name Type Description Default
audio ndarray

Audio samples as numpy array (mono or stereo)

required
sr int

Sample rate in Hz

required
model ModelType

Embedding model to use:

- "musicnn": MusiCNN embeddings (recommended for music)
- "vggish": VGGish embeddings (general audio)
- "openl3": OpenL3 embeddings (environmental sounds)

'musicnn'

Returns:

Type Description
list[float]

List of 128 floats (L2-normalized embedding vector)

Raises:

Type Description
ModelLoadError

If model file not found

AnalysisFailedError

If embedding extraction fails

Example

>>> embedding = extract_audio_embedding(audio, 44100)
>>> len(embedding)
128
>>> # Verify L2 normalization
>>> import math
>>> math.isclose(sum(x**2 for x in embedding), 1.0, abs_tol=1e-6)
True

extract_genre_tags(audio: np.ndarray, sr: int, top_k: int = 3, threshold: float = 0.1) -> list[str]

Extract genre tags using Essentia's genre classifiers.

Uses the MTG-Jamendo genre classification model. This requires a two-stage pipeline:

1. Extract embeddings using the discogs-effnet base model
2. Pass embeddings to the classification head

Parameters:

Name Type Description Default
audio ndarray

Audio samples as numpy array (mono or stereo)

required
sr int

Sample rate in Hz

required
top_k int

Maximum number of genre tags to return

3
threshold float

Minimum confidence threshold (0-1)

0.1

Returns:

Type Description
list[str]

List of genre tags sorted by confidence

Raises:

Type Description
ModelLoadError

If model file not found

AnalysisFailedError

If classification fails

Example

>>> genres = extract_genre_tags(audio, 44100)
>>> genres
['techno', 'electronic', 'house']

extract_mood_tags(audio: np.ndarray, sr: int, top_k: int = 3, threshold: float = 0.1) -> list[str]

Extract mood/theme tags using Essentia's mood classifiers.

Uses the MTG-Jamendo mood/theme classification model. This requires a two-stage pipeline:

1. Extract embeddings using the discogs-effnet base model
2. Pass embeddings to the classification head

Parameters:

Name Type Description Default
audio ndarray

Audio samples as numpy array (mono or stereo)

required
sr int

Sample rate in Hz

required
top_k int

Maximum number of mood tags to return

3
threshold float

Minimum confidence threshold (0-1)

0.1

Returns:

Type Description
list[str]

List of mood tags sorted by confidence

Raises:

Type Description
ModelLoadError

If model file not found

AnalysisFailedError

If classification fails

Example

>>> moods = extract_mood_tags(audio, 44100)
>>> moods
['dark', 'electronic', 'energetic']

extract_rhythm_features(audio: np.ndarray, sr: int) -> dict[str, Any]

Extract rhythm/tempo features using Essentia.

Algorithms:

- RhythmExtractor2013: BPM detection with confidence
- BeatTrackerDegara: Beat positions
- OnsetDetection: Transient detection

Parameters:

Name Type Description Default
audio ndarray

Audio samples (mono or stereo)

required
sr int

Sample rate in Hz

required

Returns:

Type Description
dict[str, Any]

dict with keys:
  • bpm: float or None (tempo in beats per minute, None for non-rhythmic)
  • bpm_confidence: float (0-1)
  • beat_positions: list[float] (beat times in seconds)
  • is_loop: bool (True if audio appears to be a rhythmic loop)

Raises:

Type Description
AnalysisFailedError

If extraction fails due to invalid audio data

Example

>>> y, sr = librosa.load("loop_125bpm.wav", sr=None)
>>> features = extract_rhythm_features(y, sr)
>>> features['bpm']
125.0
>>> features['bpm_confidence']
0.95
>>> features['is_loop']
True

extract_spectral_features(audio: np.ndarray, sr: int) -> SpectralFeatures

Extract spectral features using Essentia.

Performs frame-based spectral analysis to extract features that describe the frequency content and energy distribution of the audio signal.

Algorithms used:

- Centroid: Spectral center of mass, indicates brightness (Hz)
- Bandwidth: Frequency spread via 2nd central moment (Hz proxy)
- Rolloff: Frequency below which 85% of energy is contained (Hz)
- ZeroCrossingRate: Time-domain noisiness measure (0-1)
- RMS: Root-mean-square energy level (0-1 linear)
- DynamicRange: Peak-to-RMS ratio in dB

Parameters:

Name Type Description Default
audio ndarray

Audio data as numpy array (mono or stereo)

required
sr int

Sample rate in Hz

required

Returns:

Type Description
SpectralFeatures

Dictionary containing spectral features with units documented

Raises:

Type Description
AnalysisFailedError

If audio is too short, empty, or analysis fails

Example

>>> import librosa
>>> y, sr = librosa.load("kick.wav", sr=None)
>>> features = extract_spectral_features(y, sr)
>>> features['spectral_centroid']  # Hz - indicates a bright sound
1523.5
>>> features['rms_energy']  # Linear energy level
0.45
>>> features['dynamic_range']  # dB peak-to-average ratio
18.2
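The spectral centroid in particular is just the magnitude-weighted mean frequency of the spectrum. A single-frame numpy sketch to build intuition (Essentia's implementation is frame-based with windowing; spectral_centroid_sketch below is illustrative only):

```python
import numpy as np

def spectral_centroid_sketch(audio: np.ndarray, sr: int) -> float:
    """Magnitude-weighted mean frequency over one whole-signal FFT."""
    spectrum = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    return float(np.sum(freqs * spectrum) / np.sum(spectrum))

sr = 44100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)  # pure 1 kHz sine, one second
print(spectral_centroid_sketch(tone, sr))  # close to 1000 Hz
```

For a pure tone the centroid sits at the tone's frequency; broadband or bright material pulls it upward, which is why it works as a brightness proxy.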

extract_tonal_features(audio: np.ndarray, sr: int) -> dict[str, Any]

Extract tonal/pitch features using Essentia.

Algorithms:

- KeyExtractor: Key and scale detection
- PitchYin: Pitch tracking for salience
- TuningFrequency: Reference tuning detection

Parameters:

Name Type Description Default
audio ndarray

Audio samples (mono or stereo)

required
sr int

Sample rate in Hz

required

Returns:

Type Description
dict[str, Any]

dict with keys:
  • key: str or None (e.g., "C", "Dm", "F#m", None for percussion/noise)
  • key_confidence: float (0-1)
  • tuning_frequency: float (Hz, typically ~440)
  • pitch_salience: float (0-1, how tonal vs percussive)

Raises:

Type Description
AnalysisFailedError

If extraction fails due to invalid audio data

Example

>>> y, sr = librosa.load("bass_c.wav", sr=None)
>>> features = extract_tonal_features(y, sr)
>>> features['key']
'C'
>>> features['pitch_salience']
0.85

get_basic_metadata(path: Path) -> BasicMetadata

Extract basic audio metadata.

Loads audio file using librosa and computes fundamental properties. The file is loaded with its native sample rate to preserve metadata accuracy.

Parameters:

Name Type Description Default
path Path

Path to audio file

required

Returns:

Type Description
BasicMetadata

Dictionary containing basic metadata:
  • duration_ms: Audio duration in milliseconds (float)
  • sample_rate: Native sample rate in Hz (int)
  • channels: Number of audio channels (int)
  • bit_depth: Bit depth, 16 assumed for librosa (int)
  • file_size_bytes: File size on disk in bytes (int)
  • file_hash: SHA256 hex digest for deduplication (str)

Raises:

Type Description
UnsupportedFormatError

If file cannot be loaded by librosa

AnalysisFailedError

If file is too short or contains invalid audio

Example

>>> meta = get_basic_metadata(Path("kick.wav"))
>>> meta['duration_ms']
250.5
>>> meta['sample_rate']
44100
>>> meta['channels']
1

list_models(include_cached_only: bool = False) -> dict[str, dict[str, Any]]

List available models and their status.

Parameters:

Name Type Description Default
include_cached_only bool

Only show cached models

False

Returns:

Type Description
dict[str, dict[str, Any]]

Dictionary mapping model type to info dict with keys:
  • cached: Whether model is cached locally (bool)
  • path: Path to cached model if exists (str | None)
  • description: Model description (str)
Example

>>> models = list_models()
>>> "musicnn" in models
True
>>> models["musicnn"]["cached"]
True

load_model(model_type: ModelType, auto_download: bool = True) -> Path

Load an Essentia model, downloading if necessary.

Parameters:

Name Type Description Default
model_type ModelType

Model type to load

required
auto_download bool

Automatically download if not cached

True

Returns:

Type Description
Path

Path to model file

Raises:

Type Description
ModelLoadError

If model not found and auto_download=False

Example

>>> model_path = load_model("musicnn")
>>> model_path.exists()
True

parse_synthdef(path: Path, timeout: float = 10.0) -> SynthDefInfo

Parse a SuperCollider SynthDef file using sclang.

Uses sclang subprocess to extract SynthDesc metadata. Falls back to regex parsing if sclang is unavailable or fails.

Parameters:

Name Type Description Default
path Path

Path to .scd file containing SynthDef

required
timeout float

Maximum time to wait for sclang (seconds)

10.0

Returns:

Type Description
SynthDefInfo

SynthDefInfo with extracted metadata

Raises:

Type Description
SynthDefError

If SynthDef is invalid or cannot be parsed

SubprocessTimeoutError

If sclang takes too long

Example

>>> info = parse_synthdef(Path("synths/tb303.scd"))
>>> info.name
'tb303'
>>> info.controls[0].name
'out'
>>> info.ugens_used
['Saw', 'Pulse', 'Select', 'MoogFF', 'EnvGen', 'Out', 'Lag']

Basic Metadata

basic

Basic audio metadata extraction for audiomancer.

This module provides functions for extracting fundamental audio file metadata such as duration, sample rate, channel count, and file hash.

BasicMetadata

Bases: TypedDict

Basic audio file metadata.

Attributes:

Name Type Description
duration_ms float

Audio duration in milliseconds

sample_rate int

Sample rate in Hz

channels int

Number of audio channels

bit_depth int

Bit depth (16 assumed for librosa float32 conversion)

file_size_bytes int

File size in bytes

file_hash str

SHA256 hex digest of file contents

Source code in src/audiomancer/analyzers/basic.py
class BasicMetadata(TypedDict):
    """Basic audio file metadata.

    Attributes:
        duration_ms: Audio duration in milliseconds
        sample_rate: Sample rate in Hz
        channels: Number of audio channels
        bit_depth: Bit depth (16 assumed for librosa float32 conversion)
        file_size_bytes: File size in bytes
        file_hash: SHA256 hex digest of file contents
    """
    duration_ms: float
    sample_rate: int
    channels: int
    bit_depth: int
    file_size_bytes: int
    file_hash: str

get_basic_metadata(path: Path) -> BasicMetadata

Extract basic audio metadata.

Loads audio file using librosa and computes fundamental properties. The file is loaded with its native sample rate to preserve metadata accuracy.

Parameters:

Name Type Description Default
path Path

Path to audio file

required

Returns:

Type Description
BasicMetadata

Dictionary containing basic metadata:
  • duration_ms: Audio duration in milliseconds (float)
  • sample_rate: Native sample rate in Hz (int)
  • channels: Number of audio channels (int)
  • bit_depth: Bit depth, 16 assumed for librosa (int)
  • file_size_bytes: File size on disk in bytes (int)
  • file_hash: SHA256 hex digest for deduplication (str)

Raises:

Type Description
UnsupportedFormatError

If file cannot be loaded by librosa

AnalysisFailedError

If file is too short or contains invalid audio

Example

>>> meta = get_basic_metadata(Path("kick.wav"))
>>> meta['duration_ms']
250.5
>>> meta['sample_rate']
44100
>>> meta['channels']
1

Source code in src/audiomancer/analyzers/basic.py
def get_basic_metadata(path: Path) -> BasicMetadata:
    """Extract basic audio metadata.

    Loads audio file using librosa and computes fundamental properties.
    The file is loaded with its native sample rate to preserve metadata accuracy.

    Args:
        path: Path to audio file

    Returns:
        Dictionary containing basic metadata:
        - duration_ms: Audio duration in milliseconds (float)
        - sample_rate: Native sample rate in Hz (int)
        - channels: Number of audio channels (int)
        - bit_depth: Bit depth, 16 assumed for librosa (int)
        - file_size_bytes: File size on disk in bytes (int)
        - file_hash: SHA256 hex digest for deduplication (str)

    Raises:
        UnsupportedFormatError: If file cannot be loaded by librosa
        AnalysisFailedError: If file is too short or contains invalid audio

    Example:
        >>> meta = get_basic_metadata(Path("kick.wav"))
        >>> meta['duration_ms']
        250.5
        >>> meta['sample_rate']
        44100
        >>> meta['channels']
        1
    """
    # Validate path exists
    if not path.exists():
        raise UnsupportedFormatError(
            f"File does not exist: {path}",
            details={"path": str(path), "error": "file not found"}
        )

    # Load audio with native sample rate
    try:
        y, sr = librosa.load(str(path), sr=None, mono=False)
    except Exception as e:
        raise UnsupportedFormatError(
            "Cannot load audio file",
            details={"path": str(path), "error": str(e)}
        )

    # Determine channel count and duration
    if y.ndim > 1:
        channels = y.shape[0]
        duration_samples = y.shape[1]
    else:
        channels = 1
        duration_samples = len(y)

    # Validate audio has content
    if duration_samples == 0:
        raise AnalysisFailedError(
            "Audio file is empty",
            details={
                "path": str(path),
                "reason": "zero samples",
                "stage": "metadata extraction"
            }
        )

    # Calculate duration in milliseconds
    duration_ms = (duration_samples / sr) * 1000.0

    # Compute SHA256 hash for deduplication
    try:
        with open(path, 'rb') as f:
            file_hash = hashlib.sha256(f.read()).hexdigest()
    except Exception as e:
        raise AnalysisFailedError(
            "Failed to compute file hash",
            details={
                "path": str(path),
                "error": str(e),
                "stage": "hash computation"
            }
        )

    # Get file size
    file_size_bytes = path.stat().st_size

    return BasicMetadata(
        duration_ms=float(duration_ms),
        sample_rate=int(sr),
        channels=int(channels),
        bit_depth=16,  # librosa loads as float32, assume 16-bit source
        file_size_bytes=int(file_size_bytes),
        file_hash=file_hash,
    )

Spectral Features

spectral

Spectral feature extraction for audiomancer.

This module provides spectral analysis functions using Essentia for extracting features like spectral centroid, bandwidth, rolloff, and energy.

SpectralFeatures

Bases: TypedDict

Spectral audio features.

All frequency values are in Hz, energy values are linear (0-1 range).

Attributes:

Name Type Description
spectral_centroid float

Mean brightness/center of mass of spectrum (Hz)

spectral_bandwidth float

Frequency spread around centroid (Hz proxy)

spectral_rolloff float

High-frequency content cutoff point (Hz)

zero_crossing_rate float

Measure of noisiness/percussiveness (0-1)

rms_energy float

Root-mean-square energy level (0-1 linear)

dynamic_range float

Peak-to-average ratio (dB)

Source code in src/audiomancer/analyzers/spectral.py
class SpectralFeatures(TypedDict):
    """Spectral audio features.

    All frequency values are in Hz, energy values are linear (0-1 range).

    Attributes:
        spectral_centroid: Mean brightness/center of mass of spectrum (Hz)
        spectral_bandwidth: Frequency spread around centroid (Hz proxy)
        spectral_rolloff: High-frequency content cutoff point (Hz)
        zero_crossing_rate: Measure of noisiness/percussiveness (0-1)
        rms_energy: Root-mean-square energy level (0-1 linear)
        dynamic_range: Peak-to-average ratio (dB)
    """
    spectral_centroid: float
    spectral_bandwidth: float
    spectral_rolloff: float
    zero_crossing_rate: float
    rms_energy: float
    dynamic_range: float

extract_spectral_features(audio: np.ndarray, sr: int) -> SpectralFeatures

Extract spectral features using Essentia.

Performs frame-based spectral analysis to extract features that describe the frequency content and energy distribution of the audio signal.

Algorithms used:

- Centroid: Spectral center of mass, indicates brightness (Hz)
- Bandwidth: Frequency spread via 2nd central moment (Hz proxy)
- Rolloff: Frequency below which 85% of energy is contained (Hz)
- ZeroCrossingRate: Time-domain noisiness measure (0-1)
- RMS: Root-mean-square energy level (0-1 linear)
- DynamicRange: Peak-to-RMS ratio in dB

Parameters:

Name Type Description Default
audio ndarray

Audio data as numpy array (mono or stereo)

required
sr int

Sample rate in Hz

required

Returns:

Type Description
SpectralFeatures

Dictionary containing spectral features with units documented

Raises:

Type Description
AnalysisFailedError

If audio is too short, empty, or analysis fails

Example

>>> import librosa
>>> y, sr = librosa.load("kick.wav", sr=None)
>>> features = extract_spectral_features(y, sr)
>>> features['spectral_centroid']
1523.5  # Hz - indicates a bright sound
>>> features['rms_energy']
0.45  # Linear energy level
>>> features['dynamic_range']
18.2  # dB peak-to-average ratio

Source code in src/audiomancer/analyzers/spectral.py
def extract_spectral_features(
    audio: np.ndarray,
    sr: int
) -> SpectralFeatures:
    """Extract spectral features using Essentia.

    Performs frame-based spectral analysis to extract features that describe
    the frequency content and energy distribution of the audio signal.

    Algorithms used:
    - Centroid: Spectral center of mass, indicates brightness (Hz)
    - Bandwidth: Frequency spread via 2nd central moment (Hz proxy)
    - Rolloff: Frequency below which 85% of energy is contained (Hz)
    - ZeroCrossingRate: Time-domain noisiness measure (0-1)
    - RMS: Root-mean-square energy level (0-1 linear)
    - DynamicRange: Peak-to-RMS ratio in dB

    Args:
        audio: Audio data as numpy array (mono or stereo)
        sr: Sample rate in Hz

    Returns:
        Dictionary containing spectral features with units documented

    Raises:
        AnalysisFailedError: If audio is too short, empty, or analysis fails

    Example:
        >>> import librosa
        >>> y, sr = librosa.load("kick.wav", sr=None)
        >>> features = extract_spectral_features(y, sr)
        >>> features['spectral_centroid']
        1523.5  # Hz - indicates a bright sound
        >>> features['rms_energy']
        0.45  # Linear energy level
        >>> features['dynamic_range']
        18.2  # dB peak-to-average ratio
    """
    # Ensure mono audio for analysis
    if audio.ndim > 1:
        audio = np.mean(audio, axis=0)

    # Ensure float32 for Essentia compatibility
    audio = audio.astype(np.float32)

    # Validate audio has content
    if len(audio) == 0:
        raise AnalysisFailedError(
            "Cannot analyze empty audio",
            details={
                "reason": "zero samples",
                "stage": "spectral analysis preparation"
            }
        )

    # Check for minimum length (need at least one frame)
    frame_size = 2048
    if len(audio) < frame_size:
        raise AnalysisFailedError(
            "Audio too short for spectral analysis",
            details={
                "reason": f"audio length {len(audio)} < frame size {frame_size}",
                "stage": "spectral analysis preparation",
                "min_samples": frame_size,
                "actual_samples": len(audio)
            }
        )

    # Initialize Essentia algorithms
    try:
        spectrum_extractor = es.Spectrum()
        windowing = es.Windowing(type='hann', size=frame_size)
        centroid_algo = es.Centroid(range=sr/2)
        central_moments = es.CentralMoments()
        rolloff_algo = es.RollOff()
        zcr_algo = es.ZeroCrossingRate()
        rms_algo = es.RMS()
    except Exception as e:
        raise AnalysisFailedError(
            "Failed to initialize Essentia algorithms",
            details={
                "error": str(e),
                "stage": "algorithm initialization"
            }
        ) from e

    # Frame-based analysis
    hop_size = 512
    centroids = []
    bandwidths = []
    rolloffs = []
    zcrs = []
    rms_values = []

    try:
        for i in range(0, len(audio) - frame_size, hop_size):
            frame = audio[i:i+frame_size]

            # Apply Hann window
            windowed = windowing(frame)

            # Compute spectrum
            spectrum = spectrum_extractor(windowed)

            # Extract features
            centroids.append(centroid_algo(spectrum))

            # Use 2nd central moment as bandwidth proxy
            moments = central_moments(spectrum)
            if len(moments) > 1:
                bandwidths.append(abs(moments[1]))
            else:
                bandwidths.append(0.0)

            rolloffs.append(rolloff_algo(spectrum))
            zcrs.append(zcr_algo(frame))
            rms_values.append(rms_algo(frame))

    except Exception as e:
        raise AnalysisFailedError(
            "Spectral feature extraction failed",
            details={
                "error": str(e),
                "stage": "frame-based analysis"
            }
        ) from e

    # Validate we got features
    if len(centroids) == 0:
        raise AnalysisFailedError(
            "No features extracted",
            details={
                "reason": "no valid frames",
                "stage": "spectral analysis"
            }
        )

    # Compute dynamic range
    peak = float(np.max(np.abs(audio)))
    rms = float(np.sqrt(np.mean(audio**2)))

    # Avoid log of zero
    if rms > 0 and peak > 0:
        dynamic_range = 20 * np.log10(peak / rms)
    else:
        dynamic_range = 0.0

    # Convert lists to arrays for mean calculation
    centroids_arr = np.array(centroids)
    bandwidths_arr = np.array(bandwidths)
    rolloffs_arr = np.array(rolloffs)
    zcrs_arr = np.array(zcrs)
    rms_arr = np.array(rms_values)

    # Validate no NaN or inf values
    def validate_value(value: float, name: str) -> float:
        if np.isnan(value) or np.isinf(value):
            raise AnalysisFailedError(
                f"Invalid {name} value",
                details={
                    "reason": f"{name} is NaN or inf",
                    "stage": "feature validation",
                    "value": str(value)
                }
            )
        return float(value)

    # Return mean values across all frames
    return SpectralFeatures(
        spectral_centroid=validate_value(float(np.mean(centroids_arr)), "spectral_centroid"),
        spectral_bandwidth=validate_value(float(np.mean(bandwidths_arr)), "spectral_bandwidth"),
        spectral_rolloff=validate_value(float(np.mean(rolloffs_arr)), "spectral_rolloff"),
        zero_crossing_rate=validate_value(float(np.mean(zcrs_arr)), "zero_crossing_rate"),
        rms_energy=validate_value(float(np.mean(rms_arr)), "rms_energy"),
        dynamic_range=validate_value(dynamic_range, "dynamic_range"),
    )
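The dynamic-range computation at the end of the function stands alone and can be sanity-checked without Essentia. A numpy-only sketch, assuming the same peak-to-RMS definition and silence guard (`dynamic_range_db` is an illustrative name):

```python
import numpy as np

def dynamic_range_db(audio: np.ndarray) -> float:
    """Peak-to-RMS ratio in dB, with the same guard against silent input."""
    peak = float(np.max(np.abs(audio)))
    rms = float(np.sqrt(np.mean(audio ** 2)))
    if rms > 0 and peak > 0:
        return float(20 * np.log10(peak / rms))
    return 0.0

# One second of a 440 Hz sine: peak/RMS = sqrt(2), i.e. roughly 3.01 dB
t = np.linspace(0, 1, 48000, endpoint=False)
sine = np.sin(2 * np.pi * 440 * t).astype(np.float32)
```

A pure sine gives ~3 dB; heavily compressed material approaches 0 dB, while percussive material with sharp transients reads much higher.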

Rhythm Features

rhythm

Rhythm analysis for audiomancer.

Extracts tempo, beat positions, and loop detection using Essentia.

extract_rhythm_features(audio: np.ndarray, sr: int) -> dict[str, Any]

Extract rhythm/tempo features using Essentia.

Algorithms:

- RhythmExtractor2013: BPM detection with confidence
- BeatTrackerDegara: Beat positions
- OnsetDetection: Transient detection

Parameters:

Name Type Description Default
audio ndarray

Audio samples (mono or stereo)

required
sr int

Sample rate in Hz

required

Returns:

Type Description
dict[str, Any]

dict with keys:

- bpm: float or None (tempo in beats per minute, None for non-rhythmic)
- bpm_confidence: float (0-1)
- beat_positions: list[float] (beat times in seconds)
- is_loop: bool (True if audio appears to be a rhythmic loop)

Raises:

Type Description
AnalysisFailedError

If extraction fails due to invalid audio data

Example

>>> y, sr = librosa.load("loop_125bpm.wav", sr=None)
>>> features = extract_rhythm_features(y, sr)
>>> features['bpm']
125.0
>>> features['bpm_confidence']
0.95
>>> features['is_loop']
True

Source code in src/audiomancer/analyzers/rhythm.py
def extract_rhythm_features(
    audio: np.ndarray,
    sr: int
) -> dict[str, Any]:
    """
    Extract rhythm/tempo features using Essentia.

    Algorithms:
    - RhythmExtractor2013: BPM detection with confidence
    - BeatTrackerDegara: Beat positions
    - OnsetDetection: Transient detection

    Args:
        audio: Audio samples (mono or stereo)
        sr: Sample rate in Hz

    Returns:
        dict with keys:
        - bpm: float or None (tempo in beats per minute, None for non-rhythmic)
        - bpm_confidence: float (0-1)
        - beat_positions: list[float] (beat times in seconds)
        - is_loop: bool (True if audio appears to be a rhythmic loop)

    Raises:
        AnalysisFailedError: If extraction fails due to invalid audio data

    Example:
        >>> y, sr = librosa.load("loop_125bpm.wav", sr=None)
        >>> features = extract_rhythm_features(y, sr)
        >>> features['bpm']
        125.0
        >>> features['bpm_confidence']
        0.95
        >>> features['is_loop']
        True
    """
    try:
        # Ensure mono float32
        if audio.ndim > 1:
            audio = np.mean(audio, axis=0)
        audio = audio.astype(np.float32)

        # Validate audio is not all zeros or NaN
        if len(audio) == 0:
            raise AnalysisFailedError(
                "Cannot analyze empty audio",
                details={"stage": "rhythm_extraction", "reason": "empty input"}
            )

        if np.all(audio == 0) or np.any(np.isnan(audio)) or np.any(np.isinf(audio)):
            # Silence or invalid data - return None for BPM
            return {
                'bpm': None,
                'bpm_confidence': 0.0,
                'beat_positions': [],
                'is_loop': False,
            }

        # BPM extraction
        rhythm_extractor = es.RhythmExtractor2013(method="multifeature")
        bpm, beats, beats_confidence, _, _ = rhythm_extractor(audio)

        # Beat positions from RhythmExtractor2013 are already in seconds
        beat_positions = [float(b) for b in beats]

        # Calculate average confidence (beats_confidence is a float, not array)
        avg_confidence = float(beats_confidence) if not np.isnan(beats_confidence) else 0.0

        # Only report BPM if:
        # 1. BPM > 0 (Essentia returns 0 for non-rhythmic content)
        # 2. Confidence is reasonable (> 0.2)
        # 3. We have some beats detected
        bpm_value = None
        if bpm > 0 and avg_confidence > 0.2 and len(beats) > 0:
            bpm_value = float(bpm)

        # Determine if loop (check if duration matches bar boundaries)
        duration_sec = len(audio) / sr
        is_loop = False

        if bpm_value is not None and bpm_value > 0:
            beat_duration = 60.0 / bpm_value
            bar_duration = beat_duration * 4  # 4/4 time signature
            bars = duration_sec / bar_duration

            # Is loop if:
            # 1. Close to whole number of bars (within 10%)
            # 2. Duration is less than 30 seconds (typical loop length)
            # 3. We have at least 1 bar
            if bars >= 1 and abs(bars - round(bars)) < 0.1 and duration_sec < 30:
                is_loop = True

        return {
            'bpm': bpm_value,
            'bpm_confidence': avg_confidence,
            'beat_positions': beat_positions,
            'is_loop': is_loop,
        }

    except Exception as e:
        if isinstance(e, AnalysisFailedError):
            raise
        raise AnalysisFailedError(
            "Rhythm extraction failed",
            details={
                "stage": "rhythm_extraction",
                "error": str(e),
                "audio_shape": audio.shape if hasattr(audio, 'shape') else None,
                "sample_rate": sr,
            }
        ) from e
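The `is_loop` heuristic depends only on BPM and duration, so it can be sketched and tested without Essentia. A standalone version, assuming the same 4/4 bar assumption, 10% tolerance, and 30-second cap (`looks_like_loop` is an illustrative name):

```python
def looks_like_loop(bpm: float, duration_sec: float) -> bool:
    """Bar-boundary heuristic: whole number of 4/4 bars, under 30 s, at least 1 bar."""
    bar_duration = (60.0 / bpm) * 4  # one 4/4 bar in seconds
    bars = duration_sec / bar_duration
    return bars >= 1 and abs(bars - round(bars)) < 0.1 and duration_sec < 30

# Four bars at 125 BPM last 4 * 4 * (60/125) = 7.68 s, so that file qualifies
```

The 30-second cap filters out full tracks that happen to end on a bar boundary; the 10% tolerance absorbs small BPM estimation errors.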

Audio Embeddings

embeddings

Audio embedding extraction for audiomancer.

This module provides functions for extracting fixed-dimension audio embeddings for similarity search and clustering using pre-trained Essentia models.

All embeddings are L2-normalized 128-dimensional vectors.

Includes SimilarityIndex for fast nearest-neighbor search using FAISS.

SimilarityIndex

Fast similarity search index using FAISS.

Enables efficient nearest-neighbor search across thousands of audio embeddings. Uses IndexFlatIP (inner product) which equals cosine similarity for L2-normalized vectors.

Example

>>> # Build index from embeddings
>>> embeddings = [extract_audio_embedding(audio, sr) for audio in samples]
>>> index = SimilarityIndex()
>>> index.add(embeddings)

>>> # Search for similar samples
>>> query = extract_audio_embedding(query_audio, sr)
>>> similarities, indices = index.search(query, k=5)

>>> # Save/load for persistence
>>> index.save("samples.index")
>>> loaded = SimilarityIndex.load("samples.index")

Source code in src/audiomancer/analyzers/embeddings.py
class SimilarityIndex:
    """Fast similarity search index using FAISS.

    Enables efficient nearest-neighbor search across thousands of audio embeddings.
    Uses IndexFlatIP (inner product) which equals cosine similarity for L2-normalized vectors.

    Example:
        >>> # Build index from embeddings
        >>> embeddings = [extract_audio_embedding(audio, sr) for audio in samples]
        >>> index = SimilarityIndex()
        >>> index.add(embeddings)
        >>>
        >>> # Search for similar samples
        >>> query = extract_audio_embedding(query_audio, sr)
        >>> similarities, indices = index.search(query, k=5)
        >>>
        >>> # Save/load for persistence
        >>> index.save("samples.index")
        >>> loaded = SimilarityIndex.load("samples.index")
    """

    def __init__(self, dimension: int = 128):
        """Create a new similarity index.

        Args:
            dimension: Embedding dimension (default 128 for audiomancer)

        Raises:
            ImportError: If faiss-cpu is not installed
        """
        if not FAISS_AVAILABLE:
            raise ImportError(
                "faiss-cpu is required for SimilarityIndex. "
                "Install with: pip install faiss-cpu"
            )

        self._dimension = dimension
        # IndexFlatIP uses inner product (dot product)
        # For L2-normalized vectors, this equals cosine similarity
        self._index = faiss.IndexFlatIP(dimension)

    @property
    def dimension(self) -> int:
        """Get the embedding dimension."""
        return self._dimension

    @property
    def ntotal(self) -> int:
        """Get the number of embeddings in the index."""
        return self._index.ntotal

    def add(self, embeddings: list[list[float]]) -> None:
        """Add embeddings to the index.

        Args:
            embeddings: List of embedding vectors (each 128-dim, L2-normalized)

        Raises:
            ValueError: If embeddings have wrong dimension
        """
        if not embeddings:
            return

        # Convert to numpy array with correct dtype
        arr = np.array(embeddings, dtype=np.float32)

        if arr.ndim == 1:
            arr = arr.reshape(1, -1)

        if arr.shape[1] != self._dimension:
            raise ValueError(
                f"Expected {self._dimension}-dimensional embeddings, "
                f"got {arr.shape[1]}"
            )

        self._index.add(arr)

    def search(
        self,
        query: list[float],
        k: int = 5,
    ) -> tuple[list[float], list[int]]:
        """Search for k most similar embeddings.

        Args:
            query: Query embedding (128-dim, L2-normalized)
            k: Number of nearest neighbors to return

        Returns:
            Tuple of (similarities, indices):
            - similarities: Cosine similarity scores (1.0 = identical)
            - indices: Indices of matching embeddings in add() order

        Raises:
            ValueError: If query has wrong dimension or index is empty
        """
        if self._index.ntotal == 0:
            raise ValueError("Index is empty. Add embeddings first.")

        # Convert to numpy array
        query_arr = np.array([query], dtype=np.float32)

        if query_arr.shape[1] != self._dimension:
            raise ValueError(
                f"Expected {self._dimension}-dimensional query, "
                f"got {query_arr.shape[1]}"
            )

        # Limit k to number of indexed vectors
        k = min(k, self._index.ntotal)

        # Search returns (distances, indices)
        # For IndexFlatIP, "distances" are actually similarity scores
        similarities, indices = self._index.search(query_arr, k)

        return similarities[0].tolist(), indices[0].tolist()

    def save(self, path: str | Path) -> None:
        """Save the index to disk.

        Args:
            path: Path to save the index file
        """
        faiss.write_index(self._index, str(path))

    @classmethod
    def load(cls, path: str | Path) -> "SimilarityIndex":
        """Load an index from disk.

        Args:
            path: Path to the index file

        Returns:
            Loaded SimilarityIndex

        Raises:
            ImportError: If faiss-cpu is not installed
            FileNotFoundError: If index file doesn't exist
        """
        if not FAISS_AVAILABLE:
            raise ImportError(
                "faiss-cpu is required for SimilarityIndex. "
                "Install with: pip install faiss-cpu"
            )

        path = Path(path)
        if not path.exists():
            raise FileNotFoundError(f"Index file not found: {path}")

        index = faiss.read_index(str(path))

        # Create instance and set the loaded index
        instance = cls.__new__(cls)
        instance._dimension = index.d
        instance._index = index

        return instance
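For small collections, the `IndexFlatIP` behaviour can be reproduced with a plain numpy matrix product, which is a useful fallback sketch when faiss-cpu is unavailable (the function name and shapes here are illustrative, not part of the module):

```python
import numpy as np

def brute_force_search(embeddings: np.ndarray, query: np.ndarray, k: int = 5):
    """Exact inner-product search; equals cosine similarity for L2-normalized rows."""
    sims = embeddings @ query              # (n,) inner products against every row
    k = min(k, len(embeddings))
    order = np.argsort(-sims)[:k]          # highest similarity first
    return sims[order].tolist(), order.tolist()

# Three orthonormal vectors; the query matches the first row exactly
vecs = np.eye(3, dtype=np.float32)
sims, idx = brute_force_search(vecs, vecs[0], k=2)
```

FAISS does the same computation but with SIMD-optimized kernels and without materializing Python lists, which is what makes it viable at thousands of vectors.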

dimension: int property

Get the embedding dimension.

ntotal: int property

Get the number of embeddings in the index.

__init__(dimension: int = 128)

Create a new similarity index.

Parameters:

Name Type Description Default
dimension int

Embedding dimension (default 128 for audiomancer)

128

Raises:

Type Description
ImportError

If faiss-cpu is not installed

Source code in src/audiomancer/analyzers/embeddings.py
def __init__(self, dimension: int = 128):
    """Create a new similarity index.

    Args:
        dimension: Embedding dimension (default 128 for audiomancer)

    Raises:
        ImportError: If faiss-cpu is not installed
    """
    if not FAISS_AVAILABLE:
        raise ImportError(
            "faiss-cpu is required for SimilarityIndex. "
            "Install with: pip install faiss-cpu"
        )

    self._dimension = dimension
    # IndexFlatIP uses inner product (dot product)
    # For L2-normalized vectors, this equals cosine similarity
    self._index = faiss.IndexFlatIP(dimension)

add(embeddings: list[list[float]]) -> None

Add embeddings to the index.

Parameters:

Name Type Description Default
embeddings list[list[float]]

List of embedding vectors (each 128-dim, L2-normalized)

required

Raises:

Type Description
ValueError

If embeddings have wrong dimension

Source code in src/audiomancer/analyzers/embeddings.py
def add(self, embeddings: list[list[float]]) -> None:
    """Add embeddings to the index.

    Args:
        embeddings: List of embedding vectors (each 128-dim, L2-normalized)

    Raises:
        ValueError: If embeddings have wrong dimension
    """
    if not embeddings:
        return

    # Convert to numpy array with correct dtype
    arr = np.array(embeddings, dtype=np.float32)

    if arr.ndim == 1:
        arr = arr.reshape(1, -1)

    if arr.shape[1] != self._dimension:
        raise ValueError(
            f"Expected {self._dimension}-dimensional embeddings, "
            f"got {arr.shape[1]}"
        )

    self._index.add(arr)

load(path: str | Path) -> SimilarityIndex classmethod

Load an index from disk.

Parameters:

Name Type Description Default
path str | Path

Path to the index file

required

Returns:

Type Description
SimilarityIndex

Loaded SimilarityIndex

Raises:

Type Description
ImportError

If faiss-cpu is not installed

FileNotFoundError

If index file doesn't exist

Source code in src/audiomancer/analyzers/embeddings.py
@classmethod
def load(cls, path: str | Path) -> "SimilarityIndex":
    """Load an index from disk.

    Args:
        path: Path to the index file

    Returns:
        Loaded SimilarityIndex

    Raises:
        ImportError: If faiss-cpu is not installed
        FileNotFoundError: If index file doesn't exist
    """
    if not FAISS_AVAILABLE:
        raise ImportError(
            "faiss-cpu is required for SimilarityIndex. "
            "Install with: pip install faiss-cpu"
        )

    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"Index file not found: {path}")

    index = faiss.read_index(str(path))

    # Create instance and set the loaded index
    instance = cls.__new__(cls)
    instance._dimension = index.d
    instance._index = index

    return instance

save(path: str | Path) -> None

Save the index to disk.

Parameters:

Name Type Description Default
path str | Path

Path to save the index file

required
Source code in src/audiomancer/analyzers/embeddings.py
def save(self, path: str | Path) -> None:
    """Save the index to disk.

    Args:
        path: Path to save the index file
    """
    faiss.write_index(self._index, str(path))

search(query: list[float], k: int = 5) -> tuple[list[float], list[int]]

Search for k most similar embeddings.

Parameters:

Name Type Description Default
query list[float]

Query embedding (128-dim, L2-normalized)

required
k int

Number of nearest neighbors to return

5

Returns:

Type Description
tuple[list[float], list[int]]

Tuple of (similarities, indices):

- similarities: Cosine similarity scores (1.0 = identical)
- indices: Indices of matching embeddings in add() order

Raises:

Type Description
ValueError

If query has wrong dimension or index is empty

Source code in src/audiomancer/analyzers/embeddings.py
def search(
    self,
    query: list[float],
    k: int = 5,
) -> tuple[list[float], list[int]]:
    """Search for k most similar embeddings.

    Args:
        query: Query embedding (128-dim, L2-normalized)
        k: Number of nearest neighbors to return

    Returns:
        Tuple of (similarities, indices):
        - similarities: Cosine similarity scores (1.0 = identical)
        - indices: Indices of matching embeddings in add() order

    Raises:
        ValueError: If query has wrong dimension or index is empty
    """
    if self._index.ntotal == 0:
        raise ValueError("Index is empty. Add embeddings first.")

    # Convert to numpy array
    query_arr = np.array([query], dtype=np.float32)

    if query_arr.shape[1] != self._dimension:
        raise ValueError(
            f"Expected {self._dimension}-dimensional query, "
            f"got {query_arr.shape[1]}"
        )

    # Limit k to number of indexed vectors
    k = min(k, self._index.ntotal)

    # Search returns (distances, indices)
    # For IndexFlatIP, "distances" are actually similarity scores
    similarities, indices = self._index.search(query_arr, k)

    return similarities[0].tolist(), indices[0].tolist()

cosine_similarity(embedding1: list[float], embedding2: list[float]) -> float

Compute cosine similarity between two embeddings.

Since embeddings are L2-normalized, cosine similarity is just the dot product.

Parameters:

Name Type Description Default
embedding1 list[float]

First embedding (128-dim)

required
embedding2 list[float]

Second embedding (128-dim)

required

Returns:

Type Description
float

Cosine similarity in range [-1, 1]

float

(1 = identical, 0 = orthogonal, -1 = opposite)

Example

>>> emb1 = extract_audio_embedding(audio1, 44100)
>>> emb2 = extract_audio_embedding(audio2, 44100)
>>> similarity = cosine_similarity(emb1, emb2)
>>> 0 <= similarity <= 1  # For typical audio
True

Source code in src/audiomancer/analyzers/embeddings.py
def cosine_similarity(embedding1: list[float], embedding2: list[float]) -> float:
    """Compute cosine similarity between two embeddings.

    Since embeddings are L2-normalized, cosine similarity is just the dot product.

    Args:
        embedding1: First embedding (128-dim)
        embedding2: Second embedding (128-dim)

    Returns:
        Cosine similarity in range [-1, 1]
        (1 = identical, 0 = orthogonal, -1 = opposite)

    Example:
        >>> emb1 = extract_audio_embedding(audio1, 44100)
        >>> emb2 = extract_audio_embedding(audio2, 44100)
        >>> similarity = cosine_similarity(emb1, emb2)
        >>> 0 <= similarity <= 1  # For typical audio
        True
    """
    arr1 = np.array(embedding1, dtype=np.float32)
    arr2 = np.array(embedding2, dtype=np.float32)

    # Dot product (since both are L2-normalized)
    return float(np.dot(arr1, arr2))

euclidean_distance(embedding1: list[float], embedding2: list[float]) -> float

Compute Euclidean distance between two embeddings.

Parameters:

Name Type Description Default
embedding1 list[float]

First embedding (128-dim)

required
embedding2 list[float]

Second embedding (128-dim)

required

Returns:

Type Description
float

Euclidean distance (lower = more similar)

Example

>>> emb1 = extract_audio_embedding(audio1, 44100)
>>> emb2 = extract_audio_embedding(audio2, 44100)
>>> distance = euclidean_distance(emb1, emb2)
>>> distance >= 0
True

Source code in src/audiomancer/analyzers/embeddings.py
def euclidean_distance(embedding1: list[float], embedding2: list[float]) -> float:
    """Compute Euclidean distance between two embeddings.

    Args:
        embedding1: First embedding (128-dim)
        embedding2: Second embedding (128-dim)

    Returns:
        Euclidean distance (lower = more similar)

    Example:
        >>> emb1 = extract_audio_embedding(audio1, 44100)
        >>> emb2 = extract_audio_embedding(audio2, 44100)
        >>> distance = euclidean_distance(emb1, emb2)
        >>> distance >= 0
        True
    """
    arr1 = np.array(embedding1, dtype=np.float32)
    arr2 = np.array(embedding2, dtype=np.float32)

    return float(np.linalg.norm(arr1 - arr2))
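For L2-normalized embeddings the two metrics are interchangeable: squared distance satisfies d² = 2 − 2·cos_sim, so ranking by Euclidean distance and ranking by cosine similarity always agree. A quick numpy check of the identity (the random vectors are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=128).astype(np.float32)
b = rng.normal(size=128).astype(np.float32)
a /= np.linalg.norm(a)  # unit length, as all audiomancer embeddings are
b /= np.linalg.norm(b)

cos_sim = float(np.dot(a, b))
dist = float(np.linalg.norm(a - b))

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * <a, b>
assert abs(dist ** 2 - (2 - 2 * cos_sim)) < 1e-5
```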

extract_audio_embedding(audio: np.ndarray, sr: int, model: ModelType = 'musicnn') -> list[float]

Extract 128-dimensional audio embedding for similarity search.

Embeddings are L2-normalized fixed-size vectors that encode audio content. Similar-sounding audio will have similar embeddings (high cosine similarity).

Parameters:

Name Type Description Default
audio ndarray

Audio samples as numpy array (mono or stereo)

required
sr int

Sample rate in Hz

required
model ModelType

Embedding model to use:

- "musicnn": MusiCNN embeddings (recommended for music)
- "vggish": VGGish embeddings (general audio)
- "openl3": OpenL3 embeddings (environmental sounds)

'musicnn'

Returns:

Type Description
list[float]

List of 128 floats (L2-normalized embedding vector)

Raises:

Type Description
ModelLoadError

If model file not found

AnalysisFailedError

If embedding extraction fails

Example

>>> embedding = extract_audio_embedding(audio, 44100)
>>> len(embedding)
128
>>> # Verify L2 normalization
>>> import math
>>> math.isclose(sum(x**2 for x in embedding), 1.0, abs_tol=1e-6)
True

Source code in src/audiomancer/analyzers/embeddings.py
def extract_audio_embedding(
    audio: np.ndarray,
    sr: int,
    model: ModelType = "musicnn",
) -> list[float]:
    """Extract 128-dimensional audio embedding for similarity search.

    Embeddings are L2-normalized fixed-size vectors that encode audio content.
    Similar-sounding audio will have similar embeddings (high cosine similarity).

    Args:
        audio: Audio samples as numpy array (mono or stereo)
        sr: Sample rate in Hz
        model: Embedding model to use:
            - "musicnn": MusiCNN embeddings (recommended for music)
            - "vggish": VGGish embeddings (general audio)
            - "openl3": OpenL3 embeddings (environmental sounds)

    Returns:
        List of 128 floats (L2-normalized embedding vector)

    Raises:
        ModelLoadError: If model file not found
        AnalysisFailedError: If embedding extraction fails

    Example:
        >>> embedding = extract_audio_embedding(audio, 44100)
        >>> len(embedding)
        128
        >>> # Verify L2 normalization
        >>> import math
        >>> math.isclose(sum(x**2 for x in embedding), 1.0, abs_tol=1e-6)
        True
    """
    try:
        # Ensure mono audio
        if audio.ndim > 1:
            audio = np.mean(audio, axis=0)

        # Resample to model's expected sample rate
        # Most Essentia models expect 16kHz
        target_sr = 16000
        if sr != target_sr:
            import librosa
            audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)

        # Get cached model and extract embedding
        embedding_extractor = get_embedding_model(model)

        # Extract embedding based on model type
        if model == "musicnn":
            embedding = _extract_musicnn_embedding_cached(audio, embedding_extractor)
        elif model == "vggish":
            embedding = _extract_vggish_embedding_cached(audio, embedding_extractor)
        elif model == "openl3":
            embedding = _extract_openl3_embedding_cached(audio, embedding_extractor)
        else:
            raise ModelLoadError(
                f"Unknown embedding model: {model}",
                details={"model": model}
            )

        # Ensure embedding is 128-dimensional
        if len(embedding) != 128:
            raise AnalysisFailedError(
                f"Expected 128-dimensional embedding, got {len(embedding)}",
                details={
                    "model": model,
                    "embedding_dim": len(embedding),
                }
            )

        # L2 normalize
        embedding_array = np.array(embedding, dtype=np.float32)
        norm = np.linalg.norm(embedding_array)

        if norm == 0:
            raise AnalysisFailedError(
                "Zero-norm embedding (silent audio?)",
                details={"model": model}
            )

        normalized_embedding = embedding_array / norm

        # Verify normalization
        verification_norm = np.linalg.norm(normalized_embedding)
        if not np.isclose(verification_norm, 1.0, atol=1e-6):
            raise AnalysisFailedError(
                f"L2 normalization failed: norm={verification_norm}",
                details={"model": model, "norm": float(verification_norm)}
            )

        return normalized_embedding.tolist()

    except ModelLoadError:
        raise
    except AnalysisFailedError:
        raise
    except Exception as e:
        raise AnalysisFailedError(
            "Embedding extraction failed",
            details={
                "error": str(e),
                "stage": "embedding extraction",
                "model": model,
            }
        ) from e
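The post-processing step (L2 normalization plus a zero-norm guard) is independent of the embedding model and easy to isolate. A minimal sketch of just that step, using `ValueError` in place of the module's `AnalysisFailedError` (`l2_normalize` is an illustrative name):

```python
import numpy as np

def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; reject zero vectors as the function above does."""
    arr = np.asarray(vec, dtype=np.float32)
    norm = float(np.linalg.norm(arr))
    if norm == 0:
        # Silent audio can yield an all-zero embedding, which has no direction
        raise ValueError("Zero-norm embedding (silent audio?)")
    return (arr / norm).tolist()
```

Normalizing up front is what lets the rest of the module treat dot products as cosine similarities, both in `cosine_similarity` and in the FAISS `IndexFlatIP` index.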