VFM-UDA++: Improving Network Architectures and Data Strategies for Unsupervised Domain Adaptive Semantic Segmentation
by Brunó B. Englert, Gijs Dubbelman
🔔 News:
- [2025-08-25] We are happy to announce that our follow-up work What is the Added Value of UDA in the VFM Era? was accepted at CVPRW25.
Unsupervised Domain Adaptation (UDA) has shown remarkably strong generalization from a labeled source domain to an unlabeled target domain while requiring relatively little data. At the same time, the large-scale, label-free pretraining of so-called Vision Foundation Models (VFMs) has also significantly improved downstream generalization. This motivates us to research how UDA can best utilize the benefits of VFMs. The earlier work VFM-UDA showed that beyond state-of-the-art (SotA) results can be obtained by replacing non-VFM with VFM encoders in SotA UDA methods. In this work, we take it one step further and improve on the UDA architecture and data strategy themselves. We observe that VFM-UDA, the current SotA UDA method, does not use multi-scale inductive biases or feature distillation losses, while it is known that these can improve generalization. We address both limitations in VFM-UDA++ and obtain beyond-SotA generalization on standard UDA benchmarks of up to +5.3 mIoU. Inspired by work on VFM fine-tuning, such as Rein, we also explore the benefits of combining more easy-to-generate synthetic source data with easy-to-obtain unlabeled target data and realize a +6.6 mIoU improvement over the current SotA. The improvements of VFM-UDA++ are most significant for smaller models; however, we show that for larger models, the obtained generalization is only 2.8 mIoU below that of fully-supervised learning with all target labels. Based on these strong results, we provide essential insights to help researchers and practitioners advance UDA.
- Create a Weights & Biases (W&B) account.
  - The metrics during training are visualized with W&B: https://wandb.ai (a login sketch follows after this setup list)
- Environment setup.
  ```bash
  conda create -n vfmudapp python=3.10 && conda activate vfmudapp
  ```
- Install required packages (optional sanity checks are sketched after this list).
  ```bash
  python3 -m pip install --index-url https://download.pytorch.org/whl/cu124 torch==2.4.1
  python3 -m pip install -r requirements.txt
  conda install nvidia/label/cuda-12.4.0::cuda
  ```
- Compile deformable attention (an import check is sketched after this list).
  ```bash
  cd ops
  python3 setup.py build install
  ```
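A few optional sanity checks for the steps above. These are suggestions rather than part of the official setup, and any names not shown above are assumptions.

To authenticate this machine with your W&B account (the `wandb` package is assumed to be installed via `requirements.txt`):

```bash
# Log in to W&B once; prompts for the API key from your account settings.
wandb login
```

To verify that PyTorch sees a GPU and the CUDA 12.4 compiler is on the path before compiling the custom ops:

```bash
# Print the PyTorch version, the CUDA version it was built against, and GPU visibility.
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# Confirm the CUDA compiler used to build the deformable-attention ops is available.
nvcc --version
```

To confirm the deformable-attention build succeeded, try importing the compiled extension. The module name depends on `ops/setup.py`; `MultiScaleDeformableAttention` is assumed here, following common deformable-attention packages, and may differ in this repository:

```bash
# Import check for the compiled CUDA extension (the module name is an assumption).
python3 -c "import MultiScaleDeformableAttention; print('deformable attention ops OK')"
```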
- GTA V: Download 1 | Download 2 | Download 3 | Download 4 | Download 5 | Download 6 | Download 7 | Download 8 | Download 9 | Download 10 | Download 11 | Download 12 | Download 13 | Download 14 | Download 15 | Download 16 | Download 17 | Download 18 | Download 19 | Download 20
- Synthia: Download 1
- Synscapes: Download 1
- Note: this step requires 700GB of free storage space
- Repack the Synscapes tar into a zip:
  ```bash
  tar -xf synscapes.tar
  zip -r -0 synscapes.zip synscapes/
  rm -rf synscapes.tar
  rm -rf synscapes
  ```
- Cityscapes: Download 1 | Download 2
- Mapillary: Download 1
- ACDC: Download 1 | Download 2
- DarkZurich: Download 1 | Download 2
- BDD100K: Download 1 | Download 2
- WildDash: Download 1 (Download the "old WD2 beta", not the new "Public GT Package")
- For WildDash, an extra step is needed to create the train/val split. After "wd_public_02.zip" is downloaded, place the files from the "wilddash_trainval_split" directory next to the zip file. After that, run:
  ```bash
  chmod +x create_wilddash_ds.sh
  ./create_wilddash_ds.sh
  ```
  This creates a new zip file, which should be used during training.
All the zipped data should be placed under one directory. No unzipping is required.
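For reference, the end state is a single directory containing every archive side by side. The sketch below is purely illustrative; the archive names are hypothetical placeholders, and the actual names follow from the downloads above.

```bash
# Hypothetical layout; archive names are placeholders, not the exact download names.
ls /data
# gta_01.zip ... gta_20.zip   synthia.zip   synscapes.zip
# cityscapes_leftImg8bit.zip   cityscapes_gtFine.zip   mapillary.zip
# acdc.zip   darkzurich.zip   bdd100k.zip   <WildDash zip created by create_wilddash_ds.sh>
```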
We recommend using 4 GPUs with a batch size of 2 per GPU. On an H100, training a VFM-UDA++ Large model takes around 30 hours.
To train the VFM-UDA++ large model from scratch, run:
```bash
python3 main.py fit -c cfgs/vfmudaplusplus_large_gta2city.yaml --root /data --trainer.devices "[0, 1, 2, 3]"
```
(replace /data with the folder where you stored the datasets)
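If fewer GPUs are available, the device list can be adjusted as in the sketch below; note that this changes the effective batch size compared to the recommended 4 GPU x 2 setup, so results may deviate.

```bash
# Hypothetical single-GPU run; the effective batch size is smaller than the recommended 4x2 setup.
python3 main.py fit -c cfgs/vfmudaplusplus_large_gta2city.yaml --root /data --trainer.devices "[0]"
```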
Note: there are small variations in performance between training runs due to the stochasticity of the training process, which is particularly pronounced for UDA methods. Results may therefore differ slightly depending on the random seed.
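For reproducible comparisons between runs, a fixed seed can be passed on the command line, assuming main.py is a PyTorch Lightning LightningCLI entry point (suggested by the fit/validate subcommands and the --trainer.* flags, but not confirmed here):

```bash
# Hypothetical: pin the global random seed for a training run (flag availability is an assumption).
python3 main.py fit -c cfgs/vfmudaplusplus_large_gta2city.yaml --root /data --trainer.devices "[0, 1, 2, 3]" --seed_everything 42
```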
To evaluate a pre-trained VFM-UDA++ model, run:
```bash
python3 main.py validate -c cfgs/vfmudaplusplus_large_gta2city.yaml --root /data --trainer.devices "[0]" --model.network.ckpt_path "/path/to/checkpoint.ckpt"
```
or use Hugging Face URLs directly:
```bash
python3 main.py validate -c cfgs/vfmudaplusplus_large_gta2city.yaml --root /data --trainer.devices "[0]" --model.network.ckpt_path "https://huggingface.co/tue-mps/vfmuda_plusplus_large_gta2city/resolve/main/vfmuda_plusplus_large_gta2city_trimmed_epoch%3D0-step%3D40000.ckpt"
```
(replace /data with the folder where you stored the datasets)
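Alternatively, you can download the checkpoint once and evaluate from the local copy; a sketch assuming wget is available (the local filename is arbitrary):

```bash
# Download the released checkpoint, then point --model.network.ckpt_path at the local file.
wget -O vfmuda_plusplus_large_gta2city.ckpt \
  "https://huggingface.co/tue-mps/vfmuda_plusplus_large_gta2city/resolve/main/vfmuda_plusplus_large_gta2city_trimmed_epoch%3D0-step%3D40000.ckpt"
python3 main.py validate -c cfgs/vfmudaplusplus_large_gta2city.yaml --root /data --trainer.devices "[0]" \
  --model.network.ckpt_path "vfmuda_plusplus_large_gta2city.ckpt"
```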
| Config | Dataset Scenario | Pre-training | Cityscapes (mIoU) | WildDash2 (mIoU) | Download |
|---|---|---|---|---|---|
| VFM-UDA++, Large | GTA5 to Cityscapes | DINOv2 | 79.8 | 69.0 | Model Weights |
| VFM-UDA++, Large | All Synth to All Real | DINOv2 | 82.2 | 71.3 | Model Weights |
| VFM-UDA++, Base | Synthia to Cityscapes | DINOv2 | 69.7 | 56.1 | Model Weights |
| VFM-UDA++, Large | Cityscapes to DarkZurich | DINOv2 | 68.7 | 70.3 | Model Weights |
Note: these models are re-trained, so the results differ slightly from those reported in the paper.
For the ViT-Adapter, we use ImageNet1k pretrained weights. During this ImageNet1k pretraining, the ViT is initialized with DINOv2 weights and kept frozen; only the ViT-Adapter is trained. The classification head is discarded after the ImageNet1k pretraining.
These pretrained weights are loaded automatically during VFM-UDA++ training, so downloading them by hand is not necessary!
| Config | Dataset Scenario | ViT | ViT-Adapter | Pre-training | Download |
|---|---|---|---|---|---|
| VFM-UDA++, Small | IN1k | Frozen | Training | DINOv2 | Model Weights |
| VFM-UDA++, Base | IN1k | Frozen | Training | DINOv2 | Model Weights |
| VFM-UDA++, Large | IN1k | Frozen | Training | DINOv2 | Model Weights |
@inproceedings{englert2025vfmudaplusplus,
author = {{Englert, Brunó B.} and {Dubbelman, Gijs}},
title = {{VFM-UDA++: Improving Network Architectures and Data Strategies for Unsupervised Domain Adaptive Semantic Segmentation}},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)},
year = {2025},
}
We use some code from:
- DINOv2 (https://github.com/facebookresearch/dinov2): Apache-2.0 License
- Masked Image Consistency for Context-Enhanced Domain Adaptation (https://github.com/lhoyer/MIC): Copyright (c) 2022 ETH Zurich, Lukas Hoyer, Apache-2.0 License
- SegFormer (https://github.com/NVlabs/SegFormer): Copyright (c) 2021, NVIDIA Corporation, NVIDIA Source Code License
- DACS (https://github.com/vikolss/DACS): Copyright (c) 2020, vikolss, MIT License
- MMCV (https://github.com/open-mmlab/mmcv): Copyright (c) OpenMMLab, Apache-2.0 License