Official PyTorch implementation of "Joint Object Detection and Sound Source Separation" (ISMIR 2025).
See2Hear (S2H) is a unified framework that jointly learns object detection and sound source separation from videos. Unlike previous methods that treat these tasks independently, S2H leverages the synergy between visual localization and audio separation through end-to-end training.
- 🎯 Joint Learning: Unified training of object detection and sound separation
- 🔄 Dynamic Filtering: Selects relevant object queries based on a confidence threshold (see the sketch after this list)
- 🎵 State-of-the-art Performance: Best separation quality on MUSIC and MUSIC-21
- 🚀 End-to-end Training: No need for pre-extracted features or separate stages
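The dynamic filtering idea can be summarized in a few lines of PyTorch. This is a minimal sketch with hypothetical names, shapes, and threshold, not the repository's actual module:

```python
# Minimal sketch of confidence-based query filtering (illustrative only;
# the names, shapes, and threshold are assumptions, not S2H's code).
import torch

def filter_queries(queries, class_logits, conf_threshold=0.5):
    """Keep object queries whose maximum class confidence exceeds the threshold.

    queries:      (N, D) per-query embeddings from the detection decoder
    class_logits: (N, C) per-query classification logits
    """
    scores = class_logits.softmax(dim=-1).max(dim=-1).values  # (N,)
    keep = scores > conf_threshold                            # boolean mask over queries
    return queries[keep], scores[keep]

# Example: 100 queries, 256-dim embeddings, 21 instrument classes
selected, confidences = filter_queries(torch.randn(100, 256), torch.randn(100, 21))
```

The retained queries then condition the separation branch; see the paper for the exact formulation.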
- Python >= 3.11
- PyTorch >= 2.5.1
- CUDA >= 12.1
- FFmpeg (for audio/video processing)
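Once the environment below is set up, a quick check like the following (standard PyTorch attributes only; expected versions are the ones listed above) confirms the toolchain:

```python
# Quick environment check for the requirements listed above.
import shutil
import torch

print("PyTorch:", torch.__version__)                 # expect >= 2.5.1
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)           # expect >= 12.1
print("FFmpeg on PATH:", shutil.which("ffmpeg") is not None)
```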
# Clone repository
git clone https://github.com/snuviplab/S2H.git
cd S2H
# Create conda environment
conda create -n s2h python=3.11
conda activate s2h
# Install PyTorch (adjust CUDA version as needed)
# For CUDA 12.1
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# Install required packages
pip install -r requirements.txt
# Install package in development mode
pip install -e .
# Install FFmpeg (required for audio/video processing)
# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg
# macOS
brew install ffmpeg
# Check installation
ffmpeg -version

# Install Detectron2 (required for Detic)
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
# Clone Detic repository
git clone https://github.com/facebookresearch/Detic.git --recurse-submodules third_party/Detic
cd third_party/Detic
pip install -r requirements.txt
cd ../..
# Create symbolic link for datasets (required by Detic)
ln -s third_party/Detic/datasets datasets
# Download Detic pretrained model
mkdir -p pretrained/detic
wget https://dl.fbaipublicfiles.com/detic/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth \
-O pretrained/detic/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth

First, create the data directory structure:
export DATA_ROOT=/path/to/your/data # Change this to your data path
mkdir -p $DATA_ROOT/MUSIC
mkdir -p $DATA_ROOT/MUSIC21

cd S2H
# Download MUSIC dataset JSON files
wget https://github.com/roudimit/MUSIC_dataset/raw/master/MUSIC_solo_videos.json
wget https://github.com/roudimit/MUSIC_dataset/raw/master/MUSIC_duet_videos.json
# Convert JSON to CSV format
python tools/convert_music_json.py \
--solo_json MUSIC_solo_videos.json \
--duet_json MUSIC_duet_videos.json \
--output_dir data/
# This creates:
# - data/train_music.csv
# - data/val_music.csv
# - data/test_music.csv

# Prepare dataset
python tools/prepare_music.py --data_root $DATA_ROOT/MUSIC
# This script will:
# 1. Download videos from YouTube
# 2. Extract frames at 1 FPS
# 3. Extract audio at 11025 Hz
# 4. Update CSV files with frame counts
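If you need to reproduce the extraction settings for a single video (e.g., one the script skipped), the following sketch shows roughly equivalent ffmpeg calls; the exact options used by tools/prepare_music.py may differ:

```python
# Rough equivalent of the frame/audio extraction step described above:
# frames at 1 FPS and 11025 Hz mono WAV audio. Paths are placeholders.
import subprocess
from pathlib import Path

def extract_frames_and_audio(video_path, frame_dir, audio_path):
    Path(frame_dir).mkdir(parents=True, exist_ok=True)
    # Frames at 1 FPS, numbered 000001.jpg, 000002.jpg, ...
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vf", "fps=1", f"{frame_dir}/%06d.jpg"],
        check=True,
    )
    # Audio resampled to 11025 Hz, mono
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ar", "11025", "-ac", "1", audio_path],
        check=True,
    )
```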
# Generate object detection pseudo-labels
python tools/generate_pseudolabels.py \
--music_dir $DATA_ROOT/MUSIC \
--dataset music
# This creates detection files in:
# $DATA_ROOT/MUSIC/detections_detic/
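To spot-check a generated file, you can load it with NumPy; the internal layout is whatever tools/generate_pseudolabels.py writes, so treat this as a generic inspection snippet:

```python
# Inspect one pseudo-label file (path follows the directory layout shown below;
# the structure inside is produced by generate_pseudolabels.py).
import os
import numpy as np

path = os.path.join(
    os.environ["DATA_ROOT"],
    "MUSIC/detections_detic/solo/accordion/xyvq9X6wgvs.mp4.npy",
)
det = np.load(path, allow_pickle=True)
print(type(det), getattr(det, "shape", None))
```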
# Download MUSIC-21 dataset JSON
wget https://github.com/roudimit/MUSIC_dataset/raw/master/MUSIC21_solo_videos.json
# Convert to CSV
python tools/convert_music21_json.py \
--solo_json MUSIC21_solo_videos.json \
--output_dir data/
# This creates:
# - data/train_music21.csv
# - data/val_music21.csv
# - data/test_music21.csv

# Download videos and extract frames/audio
python tools/prepare_music21.py --data_root $DATA_ROOT/MUSIC21
# Generate pseudo-labels
python tools/generate_pseudolabels.py \
--music_dir $DATA_ROOT/MUSIC21 \
--dataset music21

After preparation, your dataset should look like:
$DATA_ROOT/
├── MUSIC/
│   ├── frames/
│   │   ├── solo/
│   │   │   ├── accordion/
│   │   │   │   ├── xyvq9X6wgvs.mp4/
│   │   │   │   │   ├── 000001.jpg
│   │   │   │   │   ├── 000002.jpg
│   │   │   │   │   └── ...
│   │   │   │   └── ...
│   │   │   ├── acoustic_guitar/
│   │   │   └── ...
│   │   └── duet/
│   │       └── ...
│   ├── audio/
│   │   ├── solo/
│   │   │   ├── accordion/
│   │   │   │   ├── xyvq9X6wgvs.wav
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── duet/
│   │       └── ...
│   ├── detections_detic/
│   │   ├── solo/
│   │   │   ├── accordion/
│   │   │   │   ├── xyvq9X6wgvs.mp4.npy
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── duet/
│   │       └── ...
│   └── videos/               # Optional, can be deleted after processing
└── MUSIC21/
    └── ... (similar structure)
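A quick way to confirm the layout above (a convenience sketch, not part of the repository; adjust DATA_ROOT as needed):

```python
# Check that the prepared MUSIC dataset matches the expected layout.
import os
from pathlib import Path

root = Path(os.environ.get("DATA_ROOT", "/path/to/your/data")) / "MUSIC"
for sub in ["frames/solo", "frames/duet", "audio/solo", "audio/duet", "detections_detic/solo"]:
    print(f"{root / sub}: {'OK' if (root / sub).is_dir() else 'MISSING'}")
```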
Edit config.yaml to set your paths:
data:
  data_dir: /path/to/your/data/MUSIC  # Change this
  csv_dir: data/                      # CSV files location
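A small check that the configured paths resolve (the key names follow the snippet above; this is not part of the training scripts):

```python
# Verify that config.yaml points at existing directories before training.
import yaml
from pathlib import Path

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

print("data_dir exists:", Path(cfg["data"]["data_dir"]).is_dir())
print("csv_dir exists: ", Path(cfg["data"]["csv_dir"]).is_dir())
```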
# Train on MUSIC dataset
python scripts/train.py \
--config config.yaml \
--dataset MUSIC \
--data_dir $DATA_ROOT/MUSIC \
--exp_name music_experiment
# Train on MUSIC-21 dataset
python scripts/train.py \
--config config.yaml \
--dataset MUSIC21 \
--data_dir $DATA_ROOT/MUSIC21 \
--exp_name music21_experiment

# Resume from checkpoint
python scripts/train.py \
--config config.yaml \
--dataset MUSIC \
--data_dir $DATA_ROOT/MUSIC \
--exp_name music_experiment \
--resume

# Evaluate MUSIC model
python scripts/evaluate.py \
--config config.yaml \
--dataset MUSIC \
--data_dir $DATA_ROOT/MUSIC \
--checkpoint experiments/music_experiment/checkpoints/model_best.pth \
--output_dir results/music_results \
--visualize
# Evaluate MUSIC-21 model
python scripts/evaluate.py \
--config config.yaml \
--dataset MUSIC21 \
--data_dir $DATA_ROOT/MUSIC21 \
--checkpoint experiments/music21_experiment/checkpoints/model_best.pth \
--output_dir results/music21_results \
--visualize

# Single video inference
python scripts/inference.py \
--video_path /path/to/your/video.mp4 \
--config config.yaml \
--checkpoint experiments/music_experiment/checkpoints/model_best.pth \
--output_dir outputs/custom_video \
--dataset MUSIC

Results on MUSIC:

| Method | SDR ↑ | SIR ↑ | SAR ↑ |
|---|---|---|---|
| Sound-of-Pixels | 5.63 | 6.85 | 9.80 |
| Co-Separation | 5.72 | 8.00 | 8.13 |
| iQuery | 8.04 | 11.63 | 11.92 |
| S2H (Ours) | 9.03 | 12.85 | 13.99 |
Results on MUSIC-21:

| Method | SDR ↑ | SIR ↑ | SAR ↑ |
|---|---|---|---|
| Sound-of-Pixels | 5.77 | 9.95 | 10.33 |
| Co-Separation | 6.17 | 8.73 | 10.18 |
| iQuery | 7.51 | 11.16 | 11.64 |
| S2H (Ours) | 9.20 | 12.54 | 14.79 |
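SDR, SIR, and SAR are the standard BSS-Eval source-separation metrics; higher is better for all three. They can be computed with mir_eval, for example (a generic sketch on random data, not necessarily how scripts/evaluate.py computes them):

```python
# Generic BSS-Eval example with mir_eval (random signals, for illustration only).
import numpy as np
from mir_eval.separation import bss_eval_sources

reference = np.random.randn(2, 11025 * 6)                        # (n_sources, n_samples)
estimated = reference + 0.1 * np.random.randn(*reference.shape)  # noisy estimates
sdr, sir, sar, perm = bss_eval_sources(reference, estimated)
print("SDR:", sdr, "SIR:", sir, "SAR:", sar)
```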
If you find this work useful, please cite:
@inproceedings{kim2025see2hear,
title={Joint Object Detection and Sound Source Separation},
author={Kim, Sunyoo and Choi, Yunjeong and Lee, Doyeon and Lee, Seoyoung and
Lyou, Eunyi and Kim, Seungju and Noh, Junhyug and Lee, Joonseok},
booktitle={Proceedings of the International Society for Music Information Retrieval Conference (ISMIR)},
year={2025}
}

This work was supported by:
- Youlchon Foundation
- NRF grants (RS-2021-NR05515, RS-2024-00336576, RS-2023-0022663)
- IITP grants (RS-2022-II220264, RS-2024-00353131, RS-2022-00155966)
We thank the authors of DETR, AST, and Detic for their excellent work.
This project is licensed under the MIT License - see the LICENSE file for details.
For questions and feedback:
- Sunyoo Kim: meoignis@snu.ac.kr
- Junhyug Noh: junhyug@ewha.ac.kr
- Joonseok Lee: joonseok@snu.ac.kr
