Auralis — 6D Audio Embedding Visualizer

Published: April 24, 2026

A browser-based audio visualization platform that maps any sound onto seven different machine-learning embedding spaces and renders it as a luminous, navigable 3D trajectory. Auralis exists to answer a question that is normally invisible: how does a model actually “hear” a sound, and how do different representations disagree about it?

Business impact

Audio embeddings quietly power search, recommendation, classification, and generative audio — but they are abstract high-dimensional vectors that no stakeholder can inspect. Auralis turns that black box into something you can see, compare, and reason about. By projecting seven representations of the same clip into one shared 6D space, it becomes both an analytical instrument (which representation separates these sounds? where does a model confuse them?) and a communication tool (showing a non-specialist what “semantic audio similarity” means, live, in a browser tab).

Metric	Result
Embedding tracks	7 (Features, PCA, t-SNE, UMAP, Tonnetz, YAMNet, CLAP)
Render modes	10 (Comet, Tube, Galaxy, Aurora, Light-painting, …)
Curated library	102 licensed clips (space, nature, music, human-made)
Deployment	Live, browser-only; heavy CLAP runtime precomputed offline

Strategic context

Most audio tools stop at the spectrogram — useful to an engineer, opaque to everyone else — and the embedding spaces inside modern audio models are never shown at all. Auralis closes that gap. It is the difference between telling someone that CLAP groups “the sound of rain” near actual rain and showing them the two clips landing in the same region of space while their raw spectra look nothing alike. As a portfolio piece it demonstrates fluency across the full stack of audio intelligence — classical DSP, deep audio models, dimensionality reduction, and real-time 3D rendering — fused into one coherent product rather than a notebook.

6D embedding architecture

Key Performance Indicators — what it surfaces

The value is in making representation differences legible and comparable, not in a single accuracy number.

KPI	Baseline (typical audio tooling)	With Auralis	Impact
Representation visibility	One spectrogram view	7 embedding tracks, switchable	Compare what each model “hears” of the same sound
Exploration	Flat 2D waveform / spectrogram	Navigable 6D trajectory (XYZ + color + size + time)	Sound becomes a space, not a signal
Semantic grouping	Not exposed	CLAP audio-text track clusters by meaning	Semantically similar sounds group even when spectra differ
Accessibility of ML	Notebook + NumPy arrays	Interactive browser app with shareable URLs	A non-specialist can explore embeddings directly
Reproducibility	Ad-hoc per analysis	Persisted per-track projection models	Every clip maps into the same, comparable space

The challenge

A sound is a time series; an embedding is a high-dimensional vector per frame; a screen is two-dimensional. Bridging all three in real time is the core problem. Each clip is analyzed into seven embedding tracks — interpretable spectral Features, PCA/t-SNE/UMAP projections of MFCC frames (one linear, two manifold methods), the Tonnetz harmonic space, YAMNet’s 1024-D AudioSet event embeddings, and CLAP’s 512-D contrastive audio-text embeddings — and every track is min-max normalized to a common 6D so that any feature can drive any visual axis. The CLAP model is heavy (a full torch/transformers stack), so its embeddings are computed offline and the production app ships only the precomputed vectors, keeping the deployed surface light enough to run entirely in the browser.

System architecture

Offline analysis pipeline: librosa extracts spectral features and MFCCs per frame; scikit-learn and UMAP fit the corpus-wide PCA / t-SNE / UMAP projections; TensorFlow/YAMNet and CLAP produce the deep embeddings. Per-track projection models are persisted so every clip maps into a consistent, comparable space.
6D → visual mapping: spatial position (X, Y, Z) plus color (4th dimension), point size (5th), and time as the implicit 6th axis — past frames fade into a trail, so the shape of the trajectory itself encodes the structure of the sound.
Real-time renderer: a React + Three.js (@react-three/fiber) frontend draws the trajectory in ten render modes, synchronized to Web Audio API playback; full panel state (clip, track, axis mapping, render mode) round-trips through the URL hash for shareable views.
Serving layer: a FastAPI backend serves the precomputed per-clip JSON and the audio assets; the SPA is static and CDN-friendly.

Technology stack

Backend / pipeline: Python, FastAPI, librosa, scikit-learn, UMAP, TensorFlow (YAMNet), CLAP (offline)
Frontend: TypeScript, React, Vite, Three.js / @react-three/fiber, Zustand, Web Audio API
Representations: spectral DSP, MFCC, Tonnetz, PCA / t-SNE / UMAP, deep audio embeddings (YAMNet, CLAP)
Deployment: systemd + nginx on a Hetzner VPS; CLAP embeddings precomputed so production stays light

Live application

▶ Live demo — auralis.fasl-work.com — pick a clip, choose an embedding track, and fly through its 6D trajectory.

View on GitHub

Share on

Bluesky Facebook LinkedIn Mastodon X (formerly Twitter)

Felipe Santibañez-Leal