This project shows how to track objects in videos using the powerful DINOv3 model. Let's dive in! 🏊♂️
DINOv3 is a self-supervised vision transformer (ViT) model created by Meta.
It can:
- Understand images without needing labeled data
- Produce super-robust feature embeddings for image patches
- Be used for image segmentation, object tracking, zero-shot classification, and more
- Match objects even when they rotate, scale, or change appearance
🤓 In short: DINOv3 just knows. Everything. Period.
This project is a fun demo of object tracking on videos using DINOv3.
How it works:
- Take the first frame of your video.
- Click on the object you want to track using your mouse.
- Pass the frame through DINOv3, which splits the image into patches. Each patch gets its own feature vector. We are interested in the feature vector of the user's selected patch.
- Compute the cosine similarity between the selected patch's feature vector and the feature vectors of all patches in every other frame.
- Use these similarities to draw a similarity heatmap. The more 🟠 orange, the more similar!
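The similarity step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's code: the function name is my own, and it assumes a 14×14 patch grid (a 224×224 input with 16-pixel patches, as in ViT-S/16) and row-major patch order.

```python
import numpy as np

def similarity_heatmap(ref_vec, patch_feats, grid=14):
    """Cosine similarity between the selected patch's feature vector
    and every patch feature of another frame.

    ref_vec:     (D,)   feature of the user-selected patch
    patch_feats: (N, D) features of all N patches in a frame (N = grid*grid)
    Returns a (grid, grid) heatmap with values in [-1, 1].
    """
    ref = ref_vec / np.linalg.norm(ref_vec)
    feats = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    sims = feats @ ref                     # (N,) cosine similarities
    return sims.reshape(grid, grid)
```

The heatmap can then be colored (e.g. with matplotlib's `hot` colormap) and blended over each frame.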
```shell
python -m venv dino-venv
source dino-venv/Scripts/activate
pip install onnxruntime
pip install opencv-python
pip install tqdm matplotlib
```
- Go to the HuggingFace ONNX community and download a DINOv3 model.
- Place it in the model/ folder.
- We used the fp16 ViT-S, but you can try any other variant.
- Open `config.py` and set `path_to_input_video` to your video file
- The video will be cropped to a square and resized to 224×224 for convenience
```shell
python run.py
```
- The first frame will appear
- Click on the object you want to track
- The script will process all frames and save a tracked video 🎥
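One detail glossed over above is how the mouse click becomes a patch index. A minimal sketch, assuming a 224×224 frame and 16-pixel ViT patches in row-major order (the helper name is hypothetical):

```python
def click_to_patch(x, y, image_size=224, patch_size=16):
    """Map a pixel click (x, y) on the resized frame to the index of the
    ViT patch containing it, counting patches row by row."""
    grid = image_size // patch_size        # e.g. 14 patches per side
    row, col = y // patch_size, x // patch_size
    return row * grid + col
```

That index selects one row of the model's patch-feature matrix, which becomes the reference vector for every subsequent frame.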
- Code in this repository is licensed under the MIT License.
- The DINOv3 model weights are licensed under the DINOv3 License by Meta.
- Weights were downloaded from HuggingFace ONNX community.
- By using the model weights, you agree to the terms of the DINOv3 License.
- Images/Videos used in this project are sourced from Pixabay and Unsplash under their respective licenses.