
VFM-UDA++: Improving Network Architectures and Data Strategies for Unsupervised Domain Adaptive Semantic Segmentation

by Brunó B. Englert, Gijs Dubbelman

[arXiv] [Paper]

🔔 News:

Abstract

Unsupervised Domain Adaptation (UDA) has shown remarkably strong generalization from a labeled source domain to an unlabeled target domain while requiring relatively little data. At the same time, large-scale pretraining without labels of so-called Vision Foundation Models (VFMs) has also significantly improved downstream generalization. This motivates us to research how UDA can best utilize the benefits of VFMs. The earlier work VFM-UDA showed that beyond state-of-the-art (SotA) results can be obtained by replacing non-VFM with VFM encoders in SotA UDA methods. In this work, we take it one step further and improve the UDA architecture and data strategy themselves. We observe that VFM-UDA, the current SotA UDA method, does not use multi-scale inductive biases or feature distillation losses, while it is known that these can improve generalization. We address both limitations in VFM-UDA++ and obtain beyond-SotA generalization on standard UDA benchmarks of up to +5.3 mIoU. Inspired by work on VFM fine-tuning, such as Rein, we also explore the benefits of adding more easy-to-generate synthetic source data together with easy-to-obtain unlabeled target data and realize a +6.6 mIoU improvement over the current SotA. The improvements of VFM-UDA++ are most significant for smaller models; however, we show that for larger models, the obtained generalization is only 2.8 mIoU from that of fully supervised learning with all target labels. Based on these strong results, we provide essential insights to help researchers and practitioners advance UDA.

Installation

  1. Create a Weights & Biases (W&B) account.

  2. Environment setup.

    conda create -n vfmudapp python=3.10 && conda activate vfmudapp
  3. Install required packages.

    python3 -m pip install --index-url https://download.pytorch.org/whl/cu124 torch==2.4.1
    python3 -m pip install -r requirements.txt
    conda install nvidia/label/cuda-12.4.0::cuda
  4. Compile deformable attention.

    cd ops
    python3 setup.py build install
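
As an optional sanity check, you can confirm that PyTorch 2.4.1 was installed with CUDA available before moving on (the command below only prints the installed version and CUDA status):

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"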

Data preparation

All the zipped data should be placed under one directory. No unzipping is required.
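
For example, with /data as the root directory (the archive names below are only illustrative; keep whatever filenames your dataset downloads have):

    /data/
        cityscapes.zip
        gta5.zip
        synthia.zip
        dark_zurich.zip
        wilddash2.zip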

Usage

Training

We recommend using 4 GPUs with a batch size of 2 per GPU. On H100 GPUs, training a VFM-UDA++ Large model takes around 30 hours.

To train the VFM-UDA++ large model from scratch, run:

python3 main.py fit -c cfgs/vfmudaplusplus_large_gta2city.yaml --root /data  --trainer.devices "[0, 1, 2, 3]"

(replace /data with the folder where you stored the datasets)
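
To train on a different number of GPUs, adjust the --trainer.devices list accordingly; note that using fewer devices lowers the effective batch size below the recommended 4 GPUs × 2 per GPU, which may affect results. For example, on two GPUs:

python3 main.py fit -c cfgs/vfmudaplusplus_large_gta2city.yaml --root /data  --trainer.devices "[0, 1]"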

Note: due to the stochasticity of the training process, particularly for UDA techniques, there are small variations in performance between training runs; results may therefore differ slightly depending on the random seed.

Evaluating

To evaluate a pre-trained VFM-UDA++ model, run:

python3 main.py validate -c cfgs/vfmudaplusplus_large_gta2city.yaml --root /data  --trainer.devices "[0]" --model.network.ckpt_path "/path/to/checkpoint.ckpt"

or use a Hugging Face URL directly:

python3 main.py validate -c cfgs/vfmudaplusplus_large_gta2city.yaml --root /data  --trainer.devices "[0]" --model.network.ckpt_path "https://huggingface.co/tue-mps/vfmuda_plusplus_large_gta2city/resolve/main/vfmuda_plusplus_large_gta2city_trimmed_epoch%3D0-step%3D40000.ckpt"

(replace /data with the folder where you stored the datasets)
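
If you prefer working with a local file, one option is to download the checkpoint first (for example with wget, shown here for the GTA5-to-Cityscapes Large model) and then pass the local path:

wget -O vfmuda_plusplus_large_gta2city.ckpt "https://huggingface.co/tue-mps/vfmuda_plusplus_large_gta2city/resolve/main/vfmuda_plusplus_large_gta2city_trimmed_epoch%3D0-step%3D40000.ckpt"
python3 main.py validate -c cfgs/vfmudaplusplus_large_gta2city.yaml --root /data  --trainer.devices "[0]" --model.network.ckpt_path "vfmuda_plusplus_large_gta2city.ckpt"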

Model Zoo

Main results

| Config | Dataset Scenario | Pre-training | Cityscapes (mIoU) | WildDash2 (mIoU) | Download |
| --- | --- | --- | --- | --- | --- |
| VFM-UDA++, Large | GTA5 to City | DINOv2 | 79.8 | 69.0 | Model Weights |
| VFM-UDA++, Large | All Synth to All Real | DINOv2 | 82.2 | 71.3 | Model Weights |
| VFM-UDA++, Base | Synthia to Cityscapes | DINOv2 | 69.7 | 56.1 | Model Weights |
| VFM-UDA++, Large | Cityscapes to Darkzurich | DINOv2 | 68.7 | 70.3 | Model Weights |

Note: these models are re-trained, so the results differ slightly from those reported in the paper.

ImageNet1k pretrained ViT-Adapters

For the ViT-Adapter, we use ImageNet1k-pretrained weights. During this ImageNet1k pretraining, the ViT is initialized with DINOv2 weights and kept frozen; only the ViT-Adapter is trained. The classification head is discarded after the ImageNet1k pretraining.

These pretrained weights are loaded automatically during VFM-UDA++ training, so downloading them by hand is not necessary!

| Config | Dataset Scenario | ViT | ViT-Adapter | Pre-training | Download |
| --- | --- | --- | --- | --- | --- |
| VFM-UDA++, Small | IN1k | Frozen | Training | DINOv2 | Model Weights |
| VFM-UDA++, Base | IN1k | Frozen | Training | DINOv2 | Model Weights |
| VFM-UDA++, Large | IN1k | Frozen | Training | DINOv2 | Model Weights |
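
The freeze/train split described above can be sketched as follows. This is a minimal illustration with assumed module names (vit, adapter), not the repository's actual training code:

    import torch.nn as nn

    def configure_adapter_pretraining(vit: nn.Module, adapter: nn.Module) -> list[nn.Parameter]:
        # The DINOv2 ViT backbone stays frozen during the ImageNet1k pretraining...
        for p in vit.parameters():
            p.requires_grad = False
        # ...while the ViT-Adapter parameters are the only ones being optimized
        # (the temporary classification head would be discarded afterwards).
        for p in adapter.parameters():
            p.requires_grad = True
        return list(adapter.parameters())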

Citation

@inproceedings{englert2025vfmudaplusplus,
  author    = {{Englert, Brunó B.} and {Dubbelman, Gijs}},
  title     = {{VFM-UDA++: Improving Network Architectures and Data Strategies for Unsupervised Domain Adaptive Semantic Segmentation}},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)},
  year      = {2025},
}

Acknowledgement

We use some code from:
