Official PyTorch implementation of "Joint Object Detection and Sound Source Separation" (ISMIR 2025).
See2Hear (S2H) is a unified framework that jointly learns object detection and sound source separation from videos. Unlike previous methods that treat these tasks independently, S2H leverages the synergy between visual localization and audio separation through end-to-end training.
- 🎯 Joint Learning: Unified training of object detection and sound separation
- 🔄 Dynamic Filtering: Selects relevant object queries based on a confidence threshold (see the sketch after this list)
- 🎵 State-of-the-art Performance: Best separation quality on MUSIC and MUSIC-21
- 🚀 End-to-end Training: No need for pre-extracted features or separate stages
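The dynamic filtering idea can be summarized in a few lines of PyTorch. This is a minimal sketch with hypothetical names, shapes, and threshold, not the repository's actual module:

```python
# Minimal sketch of confidence-based query filtering (illustrative only;
# the names, shapes, and threshold are assumptions, not S2H's code).
import torch

def filter_queries(queries, class_logits, conf_threshold=0.5):
    """Keep object queries whose maximum class confidence exceeds the threshold.

    queries:      (N, D) per-query embeddings from the detection decoder
    class_logits: (N, C) per-query classification logits
    """
    scores = class_logits.softmax(dim=-1).max(dim=-1).values  # (N,)
    keep = scores > conf_threshold                            # boolean mask over queries
    return queries[keep], scores[keep]

# Example: 100 queries, 256-dim embeddings, 21 instrument classes
selected, confidences = filter_queries(torch.randn(100, 256), torch.randn(100, 21))
```

The retained queries then condition the separation branch; see the paper for the exact formulation.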
- Python >= 3.11
- PyTorch >= 2.5.1
- CUDA >= 12.1
- FFmpeg (for audio/video processing)
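Once the environment below is set up, a quick check like the following (standard PyTorch attributes only; expected versions are the ones listed above) confirms the toolchain:

```python
# Quick environment check for the requirements listed above.
import shutil
import torch

print("PyTorch:", torch.__version__)                 # expect >= 2.5.1
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)           # expect >= 12.1
print("FFmpeg on PATH:", shutil.which("ffmpeg") is not None)
```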
# Clone repository
git clone https://github.com/snuviplab/S2H.git
cd S2H
# Create conda environment
conda create -n s2h python=3.11
conda activate s2h
# Install PyTorch (adjust CUDA version as needed)
# For CUDA 12.1
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# Install required packages
pip install -r requirements.txt
# Install package in development mode
pip install -e .
# Install FFmpeg (required for audio/video processing)
# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg
# macOS
brew install ffmpeg
# Check installation
ffmpeg -version

# Install Detectron2 (required for Detic)
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
# Clone Detic repository
git clone https://github.com/facebookresearch/Detic.git --recurse-submodules third_party/Detic
cd third_party/Detic
pip install -r requirements.txt
cd ../..
# Create symbolic link for datasets (required by Detic)
ln -s third_party/Detic/datasets datasets
# Download Detic pretrained model
mkdir -p pretrained/detic
wget https://dl.fbaipublicfiles.com/detic/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth \
-O pretrained/detic/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth

First, create the data directory structure:
export DATA_ROOT=/path/to/your/data # Change this to your data path
mkdir -p $DATA_ROOT/MUSIC
mkdir -p $DATA_ROOT/MUSIC21

cd S2H
# Download MUSIC dataset JSON files
wget https://github.com/roudimit/MUSIC_dataset/raw/master/MUSIC_solo_videos.json
wget https://github.com/roudimit/MUSIC_dataset/raw/master/MUSIC_duet_videos.json
# Convert JSON to CSV format
python tools/convert_music_json.py \
--solo_json MUSIC_solo_videos.json \
--duet_json MUSIC_duet_videos.json \
--output_dir data/
# This creates:
# - data/train_music.csv
# - data/val_music.csv
# - data/test_music.csv

# Prepare dataset
python tools/prepare_music.py --data_root $DATA_ROOT/MUSIC
# This script will:
# 1. Download videos from YouTube
# 2. Extract frames at 1 FPS
# 3. Extract audio at 11025 Hz
# 4. Update CSV files with frame counts
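If you need to reproduce the extraction settings for a single video (e.g., one the script skipped), the following sketch shows roughly equivalent ffmpeg calls; the exact options used by tools/prepare_music.py may differ:

```python
# Rough equivalent of the frame/audio extraction step described above:
# frames at 1 FPS and 11025 Hz mono WAV audio. Paths are placeholders.
import subprocess
from pathlib import Path

def extract_frames_and_audio(video_path, frame_dir, audio_path):
    Path(frame_dir).mkdir(parents=True, exist_ok=True)
    # Frames at 1 FPS, numbered 000001.jpg, 000002.jpg, ...
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vf", "fps=1", f"{frame_dir}/%06d.jpg"],
        check=True,
    )
    # Audio resampled to 11025 Hz, mono
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ar", "11025", "-ac", "1", audio_path],
        check=True,
    )
```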
# Generate object detection pseudo-labels
python tools/generate_pseudolabels.py \
--music_dir $DATA_ROOT/MUSIC \
--dataset music
# This creates detection files in:
# $DATA_ROOT/MUSIC/detections_detic/
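To spot-check a generated file, you can load it with NumPy; the internal layout is whatever tools/generate_pseudolabels.py writes, so treat this as a generic inspection snippet:

```python
# Inspect one pseudo-label file (path follows the directory layout shown below;
# the structure inside is produced by generate_pseudolabels.py).
import os
import numpy as np

path = os.path.join(
    os.environ["DATA_ROOT"],
    "MUSIC/detections_detic/solo/accordion/xyvq9X6wgvs.mp4.npy",
)
det = np.load(path, allow_pickle=True)
print(type(det), getattr(det, "shape", None))
```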
# Download MUSIC-21 dataset JSON
wget https://github.com/roudimit/MUSIC_dataset/raw/master/MUSIC21_solo_videos.json
# Convert to CSV
python tools/convert_music21_json.py \
--solo_json MUSIC21_solo_videos.json \
--output_dir data/
# This creates:
# - data/train_music21.csv
# - data/val_music21.csv
# - data/test_music21.csv

# Download videos and extract frames/audio
python tools/prepare_music21.py --data_root $DATA_ROOT/MUSIC21
# Generate pseudo-labels
python tools/generate_pseudolabels.py \
--music_dir $DATA_ROOT/MUSIC21 \
--dataset music21

After preparation, your dataset should look like:
$DATA_ROOT/
├── MUSIC/
│   ├── frames/
│   │   ├── solo/
│   │   │   ├── accordion/
│   │   │   │   ├── xyvq9X6wgvs.mp4/
│   │   │   │   │   ├── 000001.jpg
│   │   │   │   │   ├── 000002.jpg
│   │   │   │   │   └── ...
│   │   │   │   └── ...
│   │   │   ├── acoustic_guitar/
│   │   │   └── ...
│   │   └── duet/
│   │       └── ...
│   ├── audio/
│   │   ├── solo/
│   │   │   ├── accordion/
│   │   │   │   ├── xyvq9X6wgvs.wav
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── duet/
│   │       └── ...
│   ├── detections_detic/
│   │   ├── solo/
│   │   │   ├── accordion/
│   │   │   │   ├── xyvq9X6wgvs.mp4.npy
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── duet/
│   │       └── ...
│   └── videos/               # Optional, can be deleted after processing
└── MUSIC21/
    └── ... (similar structure)
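A quick way to confirm the layout above (a convenience sketch, not part of the repository; adjust DATA_ROOT as needed):

```python
# Check that the prepared MUSIC dataset matches the expected layout.
import os
from pathlib import Path

root = Path(os.environ.get("DATA_ROOT", "/path/to/your/data")) / "MUSIC"
for sub in ["frames/solo", "frames/duet", "audio/solo", "audio/duet", "detections_detic/solo"]:
    print(f"{root / sub}: {'OK' if (root / sub).is_dir() else 'MISSING'}")
```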
Edit config.yaml to set your paths:
data:
  data_dir: /path/to/your/data/MUSIC  # Change this
  csv_dir: data/                      # CSV files location
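A small check that the configured paths resolve (the key names follow the snippet above; this is not part of the training scripts):

```python
# Verify that config.yaml points at existing directories before training.
import yaml
from pathlib import Path

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

print("data_dir exists:", Path(cfg["data"]["data_dir"]).is_dir())
print("csv_dir exists: ", Path(cfg["data"]["csv_dir"]).is_dir())
```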
# Train on MUSIC dataset
python scripts/train.py \
--config config.yaml \
--dataset MUSIC \
--data_dir $DATA_ROOT/MUSIC \
--exp_name music_experiment
# Train on MUSIC-21 dataset
python scripts/train.py \
--config config.yaml \
--dataset MUSIC21 \
--data_dir $DATA_ROOT/MUSIC21 \
--exp_name music21_experiment

# Resume from checkpoint
python scripts/train.py \
--config config.yaml \
--dataset MUSIC \
--data_dir $DATA_ROOT/MUSIC \
--exp_name music_experiment \
--resume

# Evaluate MUSIC model
python scripts/evaluate.py \
--config config.yaml \
--dataset MUSIC \
--data_dir $DATA_ROOT/MUSIC \
--checkpoint experiments/music_experiment/checkpoints/model_best.pth \
--output_dir results/music_results \
--visualize
# Evaluate MUSIC-21 model
python scripts/evaluate.py \
--config config.yaml \
--dataset MUSIC21 \
--data_dir $DATA_ROOT/MUSIC21 \
--checkpoint experiments/music21_experiment/checkpoints/model_best.pth \
--output_dir results/music21_results \
--visualize

# Single video inference
python scripts/inference.py \
--video_path /path/to/your/video.mp4 \
--config config.yaml \
--checkpoint experiments/music_experiment/checkpoints/model_best.pth \
--output_dir outputs/custom_video \
--dataset MUSIC

Results on MUSIC:

| Method | SDR ↑ | SIR ↑ | SAR ↑ |
|---|---|---|---|
| Sound-of-Pixels | 5.63 | 6.85 | 9.80 |
| Co-Separation | 5.72 | 8.00 | 8.13 |
| iQuery | 8.04 | 11.63 | 11.92 |
| S2H (Ours) | 9.03 | 12.85 | 13.99 |
Results on MUSIC-21:

| Method | SDR ↑ | SIR ↑ | SAR ↑ |
|---|---|---|---|
| Sound-of-Pixels | 5.77 | 9.95 | 10.33 |
| Co-Separation | 6.17 | 8.73 | 10.18 |
| iQuery | 7.51 | 11.16 | 11.64 |
| S2H (Ours) | 9.20 | 12.54 | 14.79 |
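SDR, SIR, and SAR are the standard BSS-Eval source-separation metrics; higher is better for all three. They can be computed with mir_eval, for example (a generic sketch on random data, not necessarily how scripts/evaluate.py computes them):

```python
# Generic BSS-Eval example with mir_eval (random signals, for illustration only).
import numpy as np
from mir_eval.separation import bss_eval_sources

reference = np.random.randn(2, 11025 * 6)                        # (n_sources, n_samples)
estimated = reference + 0.1 * np.random.randn(*reference.shape)  # noisy estimates
sdr, sir, sar, perm = bss_eval_sources(reference, estimated)
print("SDR:", sdr, "SIR:", sir, "SAR:", sar)
```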
If you find this work useful, please cite:
@inproceedings{kim2025see2hear,
title={Joint Object Detection and Sound Source Separation},
author={Kim, Sunyoo and Choi, Yunjeong and Lee, Doyeon and Lee, Seoyoung and
Lyou, Eunyi and Kim, Seungju and Noh, Junhyug and Lee, Joonseok},
booktitle={Proceedings of the International Society for Music Information Retrieval Conference (ISMIR)},
year={2025}
}

This work was supported by:
- Youlchon Foundation
- NRF grants (RS-2021-NR05515, RS-2024-00336576, RS-2023-0022663)
- IITP grants (RS-2022-II220264, RS-2024-00353131, RS-2022-00155966)
We thank the authors of DETR, AST, and Detic for their excellent work.
This project is licensed under the MIT License - see the LICENSE file for details.
For questions and feedback:
- Sunyoo Kim: meoignis@snu.ac.kr
- Junhyug Noh: junhyug@ewha.ac.kr
- Joonseok Lee: joonseok@snu.ac.kr
