VFM-UDA++: Improving Network Architectures and Data Strategies for Unsupervised Domain Adaptive Semantic Segmentation
by Brunó B. Englert, Gijs Dubbelman
🔔 News:
- [2025-08-25] We are happy to announce that our follow-up work What is the Added Value of UDA in the VFM Era? was accepted at CVPRW25.
Unsupervised Domain Adaptation (UDA) has shown remarkably strong generalization from a labeled source domain to an unlabeled target domain while requiring relatively little data. At the same time, the large-scale, label-free pretraining of so-called Vision Foundation Models (VFMs) has also significantly improved downstream generalization. This motivates us to research how UDA can best utilize the benefits of VFMs. The earlier work VFM-UDA showed that beyond state-of-the-art (SotA) results can be obtained by replacing non-VFM with VFM encoders in SotA UDA methods. In this work, we take it one step further and improve on the UDA architecture and data strategy themselves. We observe that VFM-UDA, the current SotA UDA method, does not use multi-scale inductive biases or feature distillation losses, while it is known that these can improve generalization. We address both limitations in VFM-UDA++ and obtain beyond-SotA generalization on standard UDA benchmarks of up to +5.3 mIoU. Inspired by work on VFM fine-tuning, such as Rein, we also explore the benefits of combining more easy-to-generate synthetic source data with easy-to-obtain unlabeled target data and realize a +6.6 mIoU improvement over the current SotA. The improvements of VFM-UDA++ are most significant for smaller models; however, we show that for larger models, the obtained generalization is only 2.8 mIoU below that of fully-supervised learning with all target labels. Based on these strong results, we provide essential insights to help researchers and practitioners advance UDA.
- Create a Weights & Biases (W&B) account.
  - The metrics during training are visualized with W&B: https://wandb.ai (a login sketch follows after this setup list)
- Environment setup.
  ```bash
  conda create -n vfmudapp python=3.10 && conda activate vfmudapp
  ```
- Install required packages (optional sanity checks are sketched after this list).
  ```bash
  python3 -m pip install --index-url https://download.pytorch.org/whl/cu124 torch==2.4.1
  python3 -m pip install -r requirements.txt
  conda install nvidia/label/cuda-12.4.0::cuda
  ```
- Compile deformable attention (an import check is sketched after this list).
  ```bash
  cd ops
  python3 setup.py build install
  ```
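A few optional sanity checks for the steps above. These are suggestions rather than part of the official setup, and any names not shown above are assumptions.

To authenticate this machine with your W&B account (the `wandb` package is assumed to be installed via `requirements.txt`):

```bash
# Log in to W&B once; prompts for the API key from your account settings.
wandb login
```

To verify that PyTorch sees a GPU and the CUDA 12.4 compiler is on the path before compiling the custom ops:

```bash
# Print the PyTorch version, the CUDA version it was built against, and GPU visibility.
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# Confirm the CUDA compiler used to build the deformable-attention ops is available.
nvcc --version
```

To confirm the deformable-attention build succeeded, try importing the compiled extension. The module name depends on `ops/setup.py`; `MultiScaleDeformableAttention` is assumed here, following common deformable-attention packages, and may differ in this repository:

```bash
# Import check for the compiled CUDA extension (the module name is an assumption).
python3 -c "import MultiScaleDeformableAttention; print('deformable attention ops OK')"
```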
- GTA V: Download 1 | Download 2 | Download 3 | Download 4 | Download 5 | Download 6 | Download 7 | Download 8 | Download 9 | Download 10 | Download 11 | Download 12 | Download 13 | Download 14 | Download 15 | Download 16 | Download 17 | Download 18 | Download 19 | Download 20
- Synthia: Download 1
- Synscapes: Download 1
- Note: this step requires 700GB of free storage space
- Repack the Synscapes tar into a zip:
  ```bash
  tar -xf synscapes.tar
  zip -r -0 synscapes.zip synscapes/
  rm -rf synscapes.tar
  rm -rf synscapes
  ```
- Cityscapes: Download 1 | Download 2
- Mapillary: Download 1
- ACDC: Download 1 | Download 2
- DarkZurich: Download 1 | Download 2
- BDD100K: Download 1 | Download 2
- WildDash: Download 1 (Download the "old WD2 beta", not the new "Public GT Package")
- For WildDash, an extra step is needed to create the train/val split. After "wd_public_02.zip" is downloaded, place the files from the "wilddash_trainval_split" directory next to the zip file. After that, run:
  ```bash
  chmod +x create_wilddash_ds.sh
  ./create_wilddash_ds.sh
  ```
  This creates a new zip file, which should be used during training.
All the zipped data should be placed under one directory. No unzipping is required.
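For reference, the end state is a single directory containing every archive side by side. The sketch below is purely illustrative; the archive names are hypothetical placeholders, and the actual names follow from the downloads above.

```bash
# Hypothetical layout; archive names are placeholders, not the exact download names.
ls /data
# gta_01.zip ... gta_20.zip   synthia.zip   synscapes.zip
# cityscapes_leftImg8bit.zip   cityscapes_gtFine.zip   mapillary.zip
# acdc.zip   darkzurich.zip   bdd100k.zip   <WildDash zip created by create_wilddash_ds.sh>
```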
We recommend using 4 GPUs with a batch size of 2 per GPU. On an H100, training a VFM-UDA++ Large model takes around 30 hours.
To train the VFM-UDA++ large model from scratch, run:
```bash
python3 main.py fit -c cfgs/vfmudaplusplus_large_gta2city.yaml --root /data --trainer.devices "[0, 1, 2, 3]"
```
(replace /data with the folder where you stored the datasets)
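If fewer GPUs are available, the device list can be adjusted as in the sketch below; note that this changes the effective batch size compared to the recommended 4 GPU x 2 setup, so results may deviate.

```bash
# Hypothetical single-GPU run; the effective batch size is smaller than the recommended 4x2 setup.
python3 main.py fit -c cfgs/vfmudaplusplus_large_gta2city.yaml --root /data --trainer.devices "[0]"
```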
Note: there are small variations in performance between training runs due to the stochasticity of the training process, which is particularly pronounced for UDA methods. Results may therefore differ slightly depending on the random seed.
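For reproducible comparisons between runs, a fixed seed can be passed on the command line, assuming main.py is a PyTorch Lightning LightningCLI entry point (suggested by the fit/validate subcommands and the --trainer.* flags, but not confirmed here):

```bash
# Hypothetical: pin the global random seed for a training run (flag availability is an assumption).
python3 main.py fit -c cfgs/vfmudaplusplus_large_gta2city.yaml --root /data --trainer.devices "[0, 1, 2, 3]" --seed_everything 42
```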
To evaluate a pre-trained VFM-UDA++ model, run:
```bash
python3 main.py validate -c cfgs/vfmudaplusplus_large_gta2city.yaml --root /data --trainer.devices "[0]" --model.network.ckpt_path "/path/to/checkpoint.ckpt"
```
or use Hugging Face URLs directly:
```bash
python3 main.py validate -c cfgs/vfmudaplusplus_large_gta2city.yaml --root /data --trainer.devices "[0]" --model.network.ckpt_path "https://huggingface.co/tue-mps/vfmuda_plusplus_large_gta2city/resolve/main/vfmuda_plusplus_large_gta2city_trimmed_epoch%3D0-step%3D40000.ckpt"
```
(replace /data with the folder where you stored the datasets)
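Alternatively, you can download the checkpoint once and evaluate from the local copy; a sketch assuming wget is available (the local filename is arbitrary):

```bash
# Download the released checkpoint, then point --model.network.ckpt_path at the local file.
wget -O vfmuda_plusplus_large_gta2city.ckpt \
  "https://huggingface.co/tue-mps/vfmuda_plusplus_large_gta2city/resolve/main/vfmuda_plusplus_large_gta2city_trimmed_epoch%3D0-step%3D40000.ckpt"
python3 main.py validate -c cfgs/vfmudaplusplus_large_gta2city.yaml --root /data --trainer.devices "[0]" \
  --model.network.ckpt_path "vfmuda_plusplus_large_gta2city.ckpt"
```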
| Config | Dataset Scenario | Pre-training | Cityscapes (mIoU) | WildDash2 (mIoU) | Download |
|---|---|---|---|---|---|
| VFM-UDA++, Large | GTA5 to Cityscapes | DINOv2 | 79.8 | 69.0 | Model Weights |
| VFM-UDA++, Large | All Synth to All Real | DINOv2 | 82.2 | 71.3 | Model Weights |
| VFM-UDA++, Base | Synthia to Cityscapes | DINOv2 | 69.7 | 56.1 | Model Weights |
| VFM-UDA++, Large | Cityscapes to DarkZurich | DINOv2 | 68.7 | 70.3 | Model Weights |
Note: these models are re-trained, so the results differ slightly from those reported in the paper.
For the ViT-Adapter, we use ImageNet1k pretrained weights. During this ImageNet1k pretraining, the ViT is initialized with DINOv2 weights and kept frozen; only the ViT-Adapter is trained. The classification head is discarded after the ImageNet1k pretraining.
These pretrained weights are loaded automatically during VFM-UDA++ training, so downloading them by hand is not necessary!
| Config | Dataset Scenario | ViT | ViT-Adapter | Pre-training | Download |
|---|---|---|---|---|---|
| VFM-UDA++, Small | IN1k | Frozen | Training | DINOv2 | Model Weights |
| VFM-UDA++, Base | IN1k | Frozen | Training | DINOv2 | Model Weights |
| VFM-UDA++, Large | IN1k | Frozen | Training | DINOv2 | Model Weights |
@inproceedings{englert2025vfmudaplusplus,
author = {{Englert, Brunó B.} and {Dubbelman, Gijs}},
title = {{VFM-UDA++: Improving Network Architectures and Data Strategies for Unsupervised Domain Adaptive Semantic Segmentation}},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)},
year = {2025},
}
We use some code from:
- DINOv2 (https://github.com/facebookresearch/dinov2): Apache-2.0 License
- Masked Image Consistency for Context-Enhanced Domain Adaptation (https://github.com/lhoyer/MIC): Copyright (c) 2022 ETH Zurich, Lukas Hoyer, Apache-2.0 License
- SegFormer (https://github.com/NVlabs/SegFormer): Copyright (c) 2021, NVIDIA Corporation, NVIDIA Source Code License
- DACS (https://github.com/vikolss/DACS): Copyright (c) 2020, vikolss, MIT License
- MMCV (https://github.com/open-mmlab/mmcv): Copyright (c) OpenMMLab, Apache-2.0 License