SynSFX: Multi-Model Sound Effects Synthesis Dataset for Deepfake Detection
April 27, 2026

Introduction
SynSFX (Synthetic Sound Effects) is a large-scale benchmark for non-speech audio deepfake detection. While speech anti-spoofing has advanced rapidly, detectors trained on speech, such as AASIST and RawNet2, often collapse to near-random performance on synthetic sound effects.
SynSFX addresses this gap with a transparent, reproducible corpus of 43,374 audio clips totaling approximately 180 hours, spanning authentic environmental recordings and outputs from seven state-of-the-art text-to-audio (TTA) models.
Key design features:
- Large scale for isolated sound-effect forensics: 43,374 clips, roughly 180 hours
- Seven diverse generators spanning diffusion, transformer, latent diffusion, multimodal, and flow-matching architectures
- Shared Prompt Subset with 1,890 identical prompts across all generators for controlled cross-model comparison
- Predefined train / validation / test splits for standardized benchmarking
- Dataset access: Download the SynSFX dataset
- Paper: SynSFX: Multi-Model Sound Effects Synthesis Dataset for Deepfake Detection and Evaluation (IEEE submission)
- License: Academic research only
Data Sources
SynSFX comprises two primary partitions.
Authentic Audio Subset (16,922 clips · ~120 h)
The authentic subset is curated from five established open-source repositories:
| Source | Clips | Duration | Role |
|---|---|---|---|
| AudioCaps | 4,000 | 11.0 h | Environmental recordings with captions |
| Clotho | 3,839 | 24.0 h | Natural environmental audio |
| ESC-50 | 2,000 | 2.8 h | 50 everyday sound categories |
| TACoS | 5,000 | 31.1 h | Activity-related audio events |
| WavCaps | 2,083 | 50.9 h | Large-scale web-sourced ambient sounds |
Synthetic Audio Subset (26,460 clips · ~58 h)
The synthetic subset is generated using seven TTA architectures, anonymized as A1-A7 in the released corpus:
| Model | Architecture family | Sample rate |
|---|---|---|
| A1 | Diffusion (AudioLDM v1) | 16 kHz |
| A2 | Diffusion (AudioLDM v2) | 16 kHz |
| A3 | Transformer (AudioCraft / AudioGen) | 16 kHz |
| A4 | Latent diffusion (Stable Audio) | 44.1 kHz |
| A5 | Diffusion (Make-An-Audio) | 16 kHz |
| A6 | Multimodal (MMAudio) | 44.1 kHz |
| A7 | Flow matching (TangoFlux) | 44.1 kHz |
Prompts were expanded from concise baselines, such as "footsteps on gravel", into rich scene descriptions using LLMs including ChatGPT and Gemini, then filtered by human review.
Corpus Architecture
The corpus uses 28,350 unique textual prompts, structured for both diversity and controlled comparison:
| Subset | Prompts | Description |
|---|---|---|
| Shared Prompt Subset | 1,890 | Identical prompts sent to all seven generators, isolating generator-specific artifacts |
| Exclusive Prompt Subsets | ~1,890 per model | Unique prompts per architecture, ensuring statistical balance |
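As a quick sanity check, the prompt counts in the table can be related to the size of the synthetic subset. The sketch below assumes one clip per prompt per generator, which the text does not state outright, and treats the approximate per-model exclusive count (~1,890) as exact; the function name is illustrative only.

```python
# Counts taken from the tables in this document. The exclusive count is
# listed as approximate (~1,890), so this is a sanity check, not a spec.
NUM_GENERATORS = 7
SHARED_PROMPTS = 1_890
EXCLUSIVE_PROMPTS_PER_MODEL = 1_890  # approximate in the source


def synthetic_clip_count(generators, shared, exclusive_per_model):
    """Clips produced if each generator renders every shared prompt once
    plus its own exclusive prompts once (one clip per prompt is assumed)."""
    return generators * (shared + exclusive_per_model)


total = synthetic_clip_count(
    NUM_GENERATORS, SHARED_PROMPTS, EXCLUSIVE_PROMPTS_PER_MODEL
)
print(total)  # 26460
```

Under these assumptions the result matches the 26,460 clips reported for the synthetic subset.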
All clips are stored as uncompressed WAV at each model's native sample rate to preserve generation artifacts.
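Because clips keep their native sample rate, a loader will encounter mixed rates (16 kHz and 44.1 kHz per the generator table). The following is a minimal sketch for inspecting them with Python's standard `wave` module. It assumes the files are integer PCM WAV, since `wave` cannot read floating-point WAV data, and the `clips` directory name is borrowed from the metadata example rather than from a documented layout.

```python
import wave
from collections import Counter
from pathlib import Path


def wav_sample_rate(path):
    """Return the sample rate in Hz stored in a PCM WAV header."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate()


def sample_rate_histogram(root):
    """Count the WAV files under `root` by their header sample rate."""
    wav_paths = sorted(Path(root).rglob("*.wav"))
    return Counter(wav_sample_rate(p) for p in wav_paths)


# Example usage (hypothetical directory):
# print(sample_rate_histogram("clips"))
```

A histogram like this helps confirm whether sample rate alone separates classes or generators, a shortcut a detector could otherwise learn instead of generation artifacts.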
Dataset Split
SynSFX is released with predefined splits for reproducible evaluation:
| Partition | Clips (approx.) | Purpose |
|---|---|---|
| Train | ~30,400 | Model fine-tuning |
| Validation | ~4,300 | Hyperparameter tuning |
| Test (in-domain) | ~4,300 | Evaluation on seen generators (A1-A7) |
| Test (out-of-domain) | 1,113 | Zero-shot evaluation on an unseen commercial generator and real audio from UrbanSound8K |
Metadata
Each split ships with a metadata file: synsfx_train.txt (train), synsfx_dev.txt (validation), and synsfx_test.txt (test). Each line contains the following fields:
| audio_path | generator | prompt_id | class_label | split_tag |
|---|---|---|---|---|
| clips/A3/000142.wav | A3 | SP-0042 | synthetic | shared_prompt |
| clips/real/clotho_0183.wav | - | - | authentic | exclusive |
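The metadata fields above can be loaded and grouped for cross-model comparison. The sketch below is illustrative only: it assumes each line holds the five fields in the order shown, separated by whitespace, with `-` marking an empty field. The actual delimiter is not specified here, so adjust the `split()` call to match the released files. The names `ClipRecord`, `parse_metadata`, and `shared_prompt_groups` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class ClipRecord(NamedTuple):
    audio_path: str
    generator: str
    prompt_id: str
    class_label: str
    split_tag: str


def parse_metadata(lines):
    """Parse metadata lines, assuming five whitespace-separated fields."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank lines and lines with an unexpected shape
        records.append(ClipRecord(*fields))
    return records


def shared_prompt_groups(records):
    """Map each shared prompt_id to {generator: audio_path}."""
    groups = defaultdict(dict)
    for rec in records:
        if rec.class_label == "synthetic" and rec.split_tag == "shared_prompt":
            groups[rec.prompt_id][rec.generator] = rec.audio_path
    return dict(groups)


sample = [
    "clips/A3/000142.wav A3 SP-0042 synthetic shared_prompt",
    "clips/real/clotho_0183.wav - - authentic exclusive",
]
records = parse_metadata(sample)
groups = shared_prompt_groups(records)
print(groups)  # {'SP-0042': {'A3': 'clips/A3/000142.wav'}}
```

Grouping by `prompt_id` puts every generator's rendering of the same shared prompt side by side, which is the controlled comparison the Shared Prompt Subset is designed for.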
Access and Citation
The dataset can be downloaded through OfSpectrum's private storage route:
Download the SynSFX dataset
Please cite the paper when using SynSFX in academic work:
@article{synsfx2026,
title={SynSFX: Multi-Model Sound Effects Synthesis Dataset for Deepfake Detection and Evaluation},
author={OfSpectrum Research Team},
year={2026}
}
