S2H: See2Hear - Joint Object Detection and Sound Source Separation

Official PyTorch implementation of "Joint Object Detection and Sound Source Separation" (ISMIR 2025).

Overview

See2Hear (S2H) is a unified framework that jointly learns object detection and sound source separation from videos. Unlike previous methods that treat these tasks independently, S2H leverages the synergy between visual localization and audio separation through end-to-end training.

Key Features

  • 🎯 Joint Learning: Unified training of object detection and sound separation
  • 🔄 Dynamic Filtering: Selects relevant object queries via a confidence threshold (see the sketch after this list)
  • 🎵 State-of-the-art Performance: Best separation quality on MUSIC and MUSIC-21
  • 🚀 End-to-end Training: No need for pre-extracted features or separate stages
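
The dynamic filtering step can be pictured as follows. This is a minimal sketch, not the repository's implementation: the tensor shapes, the names queries/scores, and the 0.7 threshold are illustrative assumptions.

import torch

def filter_queries(queries: torch.Tensor, scores: torch.Tensor, threshold: float = 0.7):
    """Keep only the object queries whose detection confidence exceeds `threshold`.

    queries: (num_queries, dim) query embeddings from the detector
    scores:  (num_queries,) per-query confidence scores in [0, 1]
    """
    keep = scores > threshold  # boolean mask over queries
    return queries[keep], keep.nonzero(as_tuple=True)[0]

# Example with 100 DETR-style queries of dimension 256
queries, scores = torch.randn(100, 256), torch.rand(100)
selected, kept_idx = filter_queries(queries, scores)
print(f"kept {selected.shape[0]} of {queries.shape[0]} queries")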

Requirements

  • Python >= 3.11
  • PyTorch >= 2.5.1
  • CUDA >= 12.1
  • FFmpeg (for audio/video processing)

Installation

1. Clone Repository and Create Environment

# Clone repository
git clone https://github.com/snuviplab/S2H.git
cd S2H

# Create conda environment
conda create -n s2h python=3.11
conda activate s2h

# Install PyTorch (adjust CUDA version as needed)
# For CUDA 12.1
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

2. Install Dependencies

# Install required packages
pip install -r requirements.txt

# Install package in development mode
pip install -e .

# Install FFmpeg (required for audio/video processing)
# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Check installation
ffmpeg -version

3. Install Detic for Pseudo-label Generation

# Install Detectron2 (required for Detic)
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'

# Clone Detic repository
git clone https://github.com/facebookresearch/Detic.git --recurse-submodules third_party/Detic
cd third_party/Detic
pip install -r requirements.txt
cd ../..

# Create symbolic link for datasets (required by Detic)
ln -s third_party/Detic/datasets datasets

# Download Detic pretrained model
mkdir -p pretrained/detic
wget https://dl.fbaipublicfiles.com/detic/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth \
     -O pretrained/detic/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth

Dataset Preparation

Directory Structure

First, create the data directory structure:

export DATA_ROOT=/path/to/your/data  # Change this to your data path
mkdir -p $DATA_ROOT/MUSIC
mkdir -p $DATA_ROOT/MUSIC21

MUSIC Dataset

1. Download Video Lists

cd S2H  

# Download MUSIC dataset JSON files
wget https://github.com/roudimit/MUSIC_dataset/raw/master/MUSIC_solo_videos.json
wget https://github.com/roudimit/MUSIC_dataset/raw/master/MUSIC_duet_videos.json

# Convert JSON to CSV format
python tools/convert_music_json.py \
    --solo_json MUSIC_solo_videos.json \
    --duet_json MUSIC_duet_videos.json \
    --output_dir data/

# This creates:
# - data/train_music.csv
# - data/val_music.csv
# - data/test_music.csv
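
If you want to inspect the downloaded lists before converting, the MUSIC JSON files map each instrument category to a list of YouTube video IDs. A small sketch; the top-level "videos" key reflects the public MUSIC_dataset release and is worth verifying against your download:

import json

with open("MUSIC_solo_videos.json") as f:
    data = json.load(f)

# Expected layout: {"videos": {"accordion": ["<youtube_id>", ...], ...}}
for category, video_ids in data["videos"].items():
    print(f"{category}: {len(video_ids)} videos")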

2. Download Videos and Extract Frames/Audio

# Prepare dataset 
python tools/prepare_music.py --data_root $DATA_ROOT/MUSIC

# This script will:
# 1. Download videos from YouTube
# 2. Extract frames at 1 FPS
# 3. Extract audio at 11025 Hz
# 4. Update CSV files with frame counts
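
The frame and audio extraction can be reproduced with two FFmpeg calls per video. A standalone sketch of that step (the actual prepare_music.py may differ in details such as frame resizing or channel count; mono output here is an assumption):

import subprocess
from pathlib import Path

def extract(video_path: str, frame_dir: str, audio_path: str) -> None:
    """Extract 1 FPS frames and 11025 Hz mono audio from one video via FFmpeg."""
    Path(frame_dir).mkdir(parents=True, exist_ok=True)
    # Frames at 1 FPS, numbered 000001.jpg, 000002.jpg, ...
    subprocess.run(["ffmpeg", "-y", "-i", video_path,
                    "-vf", "fps=1", f"{frame_dir}/%06d.jpg"], check=True)
    # Audio resampled to 11025 Hz, single channel
    subprocess.run(["ffmpeg", "-y", "-i", video_path,
                    "-vn", "-ar", "11025", "-ac", "1", audio_path], check=True)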

3. Generate Pseudo-labels with Detic

# Generate object detection pseudo-labels
python tools/generate_pseudolabels.py \
    --music_dir $DATA_ROOT/MUSIC \
    --dataset music

# This creates detection files in:
# $DATA_ROOT/MUSIC/detections_detic/
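
Each .npy file holds the Detic detections for one video. The exact record layout is defined by generate_pseudolabels.py; the quick inspection below assumes nothing beyond the file being a NumPy array, possibly of pickled dicts:

import numpy as np

dets = np.load(
    "detections_detic/solo/accordion/xyvq9X6wgvs.mp4.npy",
    allow_pickle=True,  # detection records are often stored as Python objects
)
# Print the first record to confirm the stored fields (boxes, scores, classes, ...)
print(dets[0] if dets.ndim > 0 else dets.item())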

MUSIC-21 Dataset

1. Download and Convert Video List

# Download MUSIC-21 dataset JSON
wget https://github.com/roudimit/MUSIC_dataset/raw/master/MUSIC21_solo_videos.json

# Convert to CSV
python tools/convert_music21_json.py \
    --solo_json MUSIC21_solo_videos.json \
    --output_dir data/

# This creates:
# - data/train_music21.csv
# - data/val_music21.csv
# - data/test_music21.csv

2. Prepare MUSIC-21 Dataset

# Download videos and extract frames/audio
python tools/prepare_music21.py --data_root $DATA_ROOT/MUSIC21

# Generate pseudo-labels
python tools/generate_pseudolabels.py \
    --music_dir $DATA_ROOT/MUSIC21 \
    --dataset music21

Expected Dataset Structure

After preparation, your dataset should look like:

$DATA_ROOT/
├── MUSIC/
│   ├── frames/
│   │   ├── solo/
│   │   │   ├── accordion/
│   │   │   │   ├── xyvq9X6wgvs.mp4/
│   │   │   │   │   ├── 000001.jpg
│   │   │   │   │   ├── 000002.jpg
│   │   │   │   │   └── ...
│   │   │   │   └── ...
│   │   │   ├── acoustic_guitar/
│   │   │   └── ...
│   │   └── duet/
│   │       └── ...
│   ├── audio/
│   │   ├── solo/
│   │   │   ├── accordion/
│   │   │   │   ├── xyvq9X6wgvs.wav
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── duet/
│   │       └── ...
│   ├── detections_detic/
│   │   ├── solo/
│   │   │   ├── accordion/
│   │   │   │   ├── xyvq9X6wgvs.mp4.npy
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── duet/
│   │       └── ...
│   └── videos/  # Optional, can be deleted after processing
└── MUSIC21/
    └── ... (similar structure)
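
A quick sanity check that frames, audio, and detections line up for every solo video, assuming exactly the layout above (DATA_ROOT as exported earlier):

import os
from pathlib import Path

root = Path(os.environ["DATA_ROOT"]) / "MUSIC"
for frame_dir in sorted((root / "frames" / "solo").glob("*/*.mp4")):
    category, stem = frame_dir.parent.name, frame_dir.stem  # stem drops ".mp4"
    wav = root / "audio" / "solo" / category / f"{stem}.wav"
    det = root / "detections_detic" / "solo" / category / f"{frame_dir.name}.npy"
    if not (wav.exists() and det.exists()):
        print(f"incomplete: {category}/{frame_dir.name}")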

Configuration

Edit config.yaml to set your paths:

data:
  data_dir: /path/to/your/data/MUSIC  # Change this
  csv_dir: data/  # CSV files location
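
Assuming the scripts read this file with a standard YAML load (PyYAML), you can verify that your paths resolve before launching a run:

import yaml
from pathlib import Path

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

for key in ("data_dir", "csv_dir"):
    path = Path(cfg["data"][key])
    print(f"{key}: {path} ({'ok' if path.exists() else 'MISSING'})")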

Training

Train from Scratch

# Train on MUSIC dataset
python scripts/train.py \
    --config config.yaml \
    --dataset MUSIC \
    --data_dir $DATA_ROOT/MUSIC \
    --exp_name music_experiment 

# Train on MUSIC-21 dataset
python scripts/train.py \
    --config config.yaml \
    --dataset MUSIC21 \
    --data_dir $DATA_ROOT/MUSIC21 \
    --exp_name music21_experiment 

Resume Training

# Resume from checkpoint
python scripts/train.py \
    --config config.yaml \
    --dataset MUSIC \
    --data_dir $DATA_ROOT/MUSIC \
    --exp_name music_experiment \
    --resume 

Evaluation

Evaluate on Test Set

# Evaluate MUSIC model
python scripts/evaluate.py \
    --config config.yaml \
    --dataset MUSIC \
    --data_dir $DATA_ROOT/MUSIC \
    --checkpoint experiments/music_experiment/checkpoints/model_best.pth \
    --output_dir results/music_results \
    --visualize 

# Evaluate MUSIC-21 model
python scripts/evaluate.py \
    --config config.yaml \
    --dataset MUSIC21 \
    --data_dir $DATA_ROOT/MUSIC21 \
    --checkpoint experiments/music21_experiment/checkpoints/model_best.pth \
    --output_dir results/music21_results \
    --visualize 

Inference

Run on Custom Video

# Single video inference
python scripts/inference.py \
    --video_path /path/to/your/video.mp4 \
    --config config.yaml \
    --checkpoint experiments/music_experiment/checkpoints/model_best.pth \
    --output_dir outputs/custom_video \
    --dataset MUSIC 

Results

Quantitative Results

MUSIC Dataset

| Method          | SDR ↑ | SIR ↑ | SAR ↑ |
|-----------------|-------|-------|-------|
| Sound-of-Pixels | 5.63  | 6.85  | 9.80  |
| Co-Separation   | 5.72  | 8.00  | 8.13  |
| iQuery          | 8.04  | 11.63 | 11.92 |
| S2H (Ours)      | 9.03  | 12.85 | 13.99 |

MUSIC-21 Dataset

| Method          | SDR ↑ | SIR ↑ | SAR ↑ |
|-----------------|-------|-------|-------|
| Sound-of-Pixels | 5.77  | 9.95  | 10.33 |
| Co-Separation   | 6.17  | 8.73  | 10.18 |
| iQuery          | 7.51  | 11.16 | 11.64 |
| S2H (Ours)      | 9.20  | 12.54 | 14.79 |
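
SDR, SIR, and SAR are the standard BSS-Eval source-separation metrics (higher is better). They can be reproduced with mir_eval; a sketch rather than the repository's evaluation code, with waveforms assumed to have shape (num_sources, num_samples):

import numpy as np
import mir_eval.separation

# Reference and estimated waveforms: (num_sources, num_samples)
reference = np.random.randn(2, 11025 * 6)  # e.g. two 6-second sources at 11025 Hz
estimated = reference + 0.1 * np.random.randn(2, 11025 * 6)

# perm is the best matching between estimated and reference sources
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference, estimated)
print(f"SDR {sdr.mean():.2f}  SIR {sir.mean():.2f}  SAR {sar.mean():.2f}")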

Citation

If you find this work useful, please cite:

@inproceedings{kim2025see2hear,
  title={Joint Object Detection and Sound Source Separation},
  author={Kim, Sunyoo and Choi, Yunjeong and Lee, Doyeon and Lee, Seoyoung and 
          Lyou, Eunyi and Kim, Seungju and Noh, Junhyug and Lee, Joonseok},
  booktitle={Proceedings of the International Society for Music Information Retrieval Conference (ISMIR)},
  year={2025}
}

Acknowledgments

This work was supported by:

  • Youlchon Foundation
  • NRF grants (RS-2021-NR05515, RS-2024-00336576, RS-2023-0022663)
  • IITP grants (RS-2022-II220264, RS-2024-00353131, RS-2022-00155966)

We thank the authors of DETR, AST, and Detic for their excellent work.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For questions and feedback, please open an issue on this repository.
