SynSFX: Multi-Model Sound Effects Synthesis Dataset for Deepfake Detection

April 27, 2026

Introduction

SynSFX (Synthetic Sound Effects) is a large-scale benchmark for non-speech audio deepfake detection. While speech anti-spoofing has advanced rapidly, detectors trained on speech, such as AASIST and RawNet2, often collapse to near-random performance on synthetic sound effects.

SynSFX addresses this gap with a transparent, reproducible corpus of 43,374 audio clips totaling approximately 180 hours, spanning authentic environmental recordings and outputs from seven state-of-the-art text-to-audio (TTA) models.

Key design features:

Large-scale coverage of isolated sound effects for audio forensics (43,374 clips, about 180 hours)
Seven diverse generators spanning diffusion, transformer, latent diffusion, multimodal, and flow-matching architectures
Shared Prompt Subset with 1,890 identical prompts across all generators for controlled cross-model comparison
Predefined train / validation / test splits for standardized benchmarking

Dataset access: Download the SynSFX dataset
Paper: SynSFX: Multi-Model Sound Effects Synthesis Dataset for Deepfake Detection and Evaluation (IEEE submission)
License: Academic research only

Data Sources

SynSFX comprises two primary partitions.

Authentic Audio Subset (16,922 clips · ~120 h)

The authentic subset is curated from five established open-source repositories:

Source	Clips	Duration	Role
AudioCaps	4,000	11.0 h	Environmental recordings with captions
Clotho	3,839	24.0 h	Natural environmental audio
ESC-50	2,000	2.8 h	50 everyday sound categories
TACoS	5,000	31.1 h	Activity-related audio events
WavCaps	2,083	50.9 h	Large-scale web-sourced ambient sounds

Synthetic Audio Subset (26,460 clips · ~58 h)

The synthetic subset is generated using seven TTA architectures, anonymized as A1-A7 in the released corpus:

Model	Architecture family	Sample rate
A1	Diffusion (AudioLDM v1)	16 kHz
A2	Diffusion (AudioLDM v2)	16 kHz
A3	Transformer (AudioCraft / AudioGen)	16 kHz
A4	Latent diffusion (Stable Audio)	44.1 kHz
A5	Diffusion (Make-An-Audio)	16 kHz
A6	Multimodal (MMAudio)	44.1 kHz
A7	Flow matching (TangoFlux)	44.1 kHz

Prompts were expanded from concise baselines, such as "footsteps on gravel", into rich scene descriptions using LLMs including ChatGPT and Gemini, then filtered by human review.

Corpus Architecture

The corpus uses 28,350 unique textual prompts, structured for both diversity and controlled comparison:

Subset	Prompts	Description
Shared Prompt Subset	1,890	Identical prompts sent to all seven generators, isolating generator-specific artifacts
Exclusive Prompt Subsets	~1,890 per model	Unique prompts per architecture, ensuring statistical balance
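
As a minimal sketch of how the Shared Prompt Subset could support cross-model comparison, the snippet below groups synthetic clips by prompt and keeps only prompts rendered by all seven generators. It assumes rows shaped like the five fields listed in the Metadata section; the function name `group_shared_prompts` and the `Row` tuple layout are hypothetical and are not part of the release.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Generator identifiers as anonymized in the corpus.
GENERATORS = ("A1", "A2", "A3", "A4", "A5", "A6", "A7")

# Assumed row layout: (audio_path, generator, prompt_id, class_label, split_tag)
Row = Tuple[str, str, str, str, str]


def group_shared_prompts(rows: Iterable[Row]) -> Dict[str, Dict[str, str]]:
    """Map prompt_id -> {generator: audio_path} for shared-prompt clips.

    Only prompts covered by every generator in GENERATORS are returned,
    so each entry is a like-for-like set of seven clips.
    """
    by_prompt: Dict[str, Dict[str, str]] = defaultdict(dict)
    for audio_path, generator, prompt_id, class_label, split_tag in rows:
        if class_label == "synthetic" and split_tag == "shared_prompt":
            by_prompt[prompt_id][generator] = audio_path
    required = set(GENERATORS)
    return {
        pid: clips for pid, clips in by_prompt.items() if set(clips) == required
    }
```

Because every retained prompt has one clip per generator, differences between the clips in a group can be attributed to the generator rather than to the prompt.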

All clips are stored as uncompressed WAV at each model's native sample rate to preserve generation artifacts.
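
Since clips keep their native sample rate, a loader may want to confirm that a file matches the rate listed for its generator before any processing. The sketch below uses Python's standard `wave` module; the helper names `wav_info` and `matches_native_rate` are hypothetical, and the rate table simply restates the generator table above. The `wave` module only reads PCM-encoded WAV files, so this assumes the released clips are PCM; other encodings may raise `wave.Error` and would need a different reader.

```python
import wave
from typing import Tuple

# Native sample rates per generator, restated from the table above.
NATIVE_RATE_HZ = {
    "A1": 16000, "A2": 16000, "A3": 16000, "A5": 16000,
    "A4": 44100, "A6": 44100, "A7": 44100,
}


def wav_info(path: str) -> Tuple[int, int, float]:
    """Return (sample_rate_hz, n_channels, duration_seconds) from the WAV header."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnchannels(), wf.getnframes() / rate


def matches_native_rate(path: str, generator: str) -> bool:
    """Check that a clip's sample rate equals the rate listed for its generator."""
    rate, _, _ = wav_info(path)
    return rate == NATIVE_RATE_HZ[generator]
```

Checking the header rather than resampling on load leaves the stored waveform untouched, in line with the goal of preserving generation artifacts.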

Dataset Split

SynSFX is released with predefined splits for reproducible evaluation:

Partition	Clips (approx.)	Purpose
Train	~30,400	Model fine-tuning
Validation	~4,300	Hyperparameter tuning
Test (in-domain)	~4,300	Evaluation on seen generators (A1-A7)
Test (out-of-domain)	1,113	Zero-shot evaluation on an unseen commercial generator and UrbanSound8K real audio

Metadata

Each split ships with a metadata file: synsfx_train.txt, synsfx_dev.txt, and synsfx_test.txt. Each line contains the following five fields, with "-" in the generator and prompt_id fields for authentic clips:

audio_path	generator	prompt_id	class_label	split_tag
clips/A3/000142.wav	A3	SP-0042	synthetic	shared_prompt
clips/real/clotho_0183.wav	-	-	authentic	exclusive
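
The layout above could be read with a few lines of Python. This is a minimal sketch, not an official loader: it assumes the fields are tab-separated, that a header line may or may not be present, and that "-" marks a missing value. The `ClipRecord` class and `parse_metadata` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

FIELDS = ("audio_path", "generator", "prompt_id", "class_label", "split_tag")


@dataclass(frozen=True)
class ClipRecord:
    audio_path: str
    generator: Optional[str]   # None for authentic clips ("-" in the file)
    prompt_id: Optional[str]   # None for authentic clips ("-" in the file)
    class_label: str           # "synthetic" or "authentic"
    split_tag: str             # e.g. "shared_prompt" or "exclusive"


def parse_metadata(lines: Iterable[str]) -> List[ClipRecord]:
    """Parse metadata lines, assuming tab-separated fields (an assumption)."""
    records: List[ClipRecord] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if tuple(parts) == FIELDS:
            continue  # skip a header row if the file has one
        if len(parts) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}: {line!r}")
        path, gen, pid, label, tag = parts
        records.append(
            ClipRecord(
                audio_path=path,
                generator=None if gen == "-" else gen,
                prompt_id=None if pid == "-" else pid,
                class_label=label,
                split_tag=tag,
            )
        )
    return records
```

A file can be passed directly, for example `parse_metadata(open("synsfx_train.txt", encoding="utf-8"))`, since iterating a text file yields its lines.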

Access and Citation

The dataset is available for download through OfSpectrum's private storage:

Download the SynSFX dataset

Please cite the paper when using SynSFX in academic work:

@article{synsfx2026,
  title={SynSFX: Multi-Model Sound Effects Synthesis Dataset for Deepfake Detection and Evaluation},
  author={OfSpectrum Research Team},
  year={2026}
}