A unified Python package for foundation model-based 6DoF object pose estimation from video using language prompts.
Estimate 6DoF object poses by simply describing the object in natural language:
| Mode | Description |
|---|---|
| Without Mesh | Automatically generates 3D mesh from the first frame using SAM3D, then tracks pose |
| With Mesh | Uses provided mesh file directly for pose estimation and tracking |
Just provide a text prompt like "red cup" or "cardboard box" - no CAD models or manual annotation required!
- SAM3: Text-prompted image segmentation (Segment Anything Model 3)
- SAM3D: Single-image 3D mesh generation from segmentation mask
- FoundationStereo: High-quality depth estimation from stereo infrared pairs
- FoundationPose: 6DoF pose estimation and tracking from RGB-D
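These components run as one pipeline: segment the prompted object, obtain a mesh, estimate depth, then register and track the object pose. The pseudocode below sketches that flow; `segment_with_prompt`, `generate_mesh`, `estimate_depth`, `register_pose`, and `track_pose` are illustrative placeholders standing in for the four models, not functions exported by this package.

```python
# Conceptual sketch of the pipeline; the called functions are placeholders
# standing in for SAM3, SAM3D, FoundationStereo, and FoundationPose.
def estimate_poses(rgb_frames, prompt, stereo_pairs=None, depth_frames=None, mesh=None):
    # 1. SAM3: text-prompted segmentation of the object in the first frame
    mask = segment_with_prompt(rgb_frames[0], prompt)

    # 2. SAM3D: generate a mesh from the first frame if none was provided
    if mesh is None:
        mesh = generate_mesh(rgb_frames[0], mask)

    # 3. FoundationStereo: estimate depth from infrared stereo pairs if needed
    if depth_frames is None:
        depth_frames = [estimate_depth(left, right) for left, right in stereo_pairs]

    # 4. FoundationPose: register the pose on the first frame, then track it
    pose = register_pose(rgb_frames[0], depth_frames[0], mask, mesh)
    poses = [pose]
    for rgb, depth in zip(rgb_frames[1:], depth_frames[1:]):
        pose = track_pose(rgb, depth, mesh, pose)
        poses.append(pose)
    return poses  # one 4x4 object-to-camera transform per frame
```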
- CUDA 12.x compatible GPU (tested on RTX 4090)
- Conda environment with Python 3.9+
git clone --recursive -b foundationperception https://github.com/MMintLab/FoundationPerception.git
cd FoundationPerception
# Or if already cloned:
git checkout foundationperception
git submodule update --init --recursive

conda create -n foundationperception python=3.9
conda activate foundationperception
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install requirements
pip install -r requirements.txt
# Install core package
pip install -e .

cd FoundationStereo
pip install -e .
# Download pretrained model to pretrained_models/
cd ..

cd FoundationPose
pip install -r requirements.txt
bash build_all_conda.sh
cd ..

cd sam3
pip install -e .
cd ..

cd sam-3d-objects
pip install -e .
cd ..

Extract 6DoF object poses from images using a text prompt:
# With FoundationStereo depth estimation
python scripts/video_to_objectpose.py \
--image_dir path/to/rgb_images \
--prompt "red cup" \
--foundationstereo \
--infra1_dir path/to/infra1 \
--infra2_dir path/to/infra2 \
--baseline 0.05 \
--output output_dir
# With pre-computed depth
python scripts/video_to_objectpose.py \
--image_dir path/to/rgb_images \
--prompt "cardboard box" \
--depth_dir path/to/depth \
--output output_dir

# With provided mesh
python scripts/video_to_objectpose.py \
--image_dir path/to/rgb_images \
--prompt "object" \
--depth_dir path/to/depth \
--mesh path/to/mesh.obj \
--output output_dir

The script generates:
- `poses.npy` - Array of 4x4 pose matrices for each frame
- `poses_overlay.gif` - Visualization with mesh overlay
- `first_mask.png` - Initial segmentation mask
- `depth/` - Computed depth maps (if using FoundationStereo)
- `mesh/` - Generated mesh (if not provided)
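For downstream use, the saved poses can be loaded with NumPy. A minimal sketch, assuming the usual FoundationPose convention that each 4x4 matrix maps object coordinates into the camera frame:

```python
import numpy as np

poses = np.load("output_dir/poses.npy")  # shape: (num_frames, 4, 4)

for i, T in enumerate(poses):
    R = T[:3, :3]  # object orientation in the camera frame
    t = T[:3, 3]   # object position in the camera frame (typically meters)
    print(f"frame {i}: position = {t}")
```

The stereo depth component can also be used directly from Python: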
from foundationperception import StereoDepthProcessor
# Initialize stereo depth processor
processor = StereoDepthProcessor(
color_intrinsic=K_color,
depth_intrinsic=K_depth,
extrinsics_vec=extrinsics,
baseline=0.05
)
# Process stereo images
depth, pointcloud = processor.process_images(left_ir, right_ir, color_image)

foundationperception/
├── foundationperception/ # Core Python package
│ ├── __init__.py
│ ├── stereo/ # Stereo depth estimation
│ │ └── processor.py
│ └── utils.py
├── scripts/
│ └── video_to_objectpose.py # Main CLI script
├── assets/ # Example camera configs
├── FoundationStereo/ # Submodule
├── FoundationPose/ # Submodule
├── sam3/ # Submodule
├── sam-3d-objects/ # Submodule
├── requirements.txt
├── setup.py
└── README.md
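Returning to the `StereoDepthProcessor` example above, the returned arrays can be written to disk for later use. A minimal sketch, assuming `depth` is an HxW float32 depth map in meters and `pointcloud` is an array of 3D points (verify the shapes and units returned by your setup):

```python
import cv2
import numpy as np

# Assumption: depth is HxW float32 in meters; pointcloud is an array of 3D points.
np.save("depth_frame0.npy", depth)
np.save("pointcloud_frame0.npy", pointcloud)

# Optionally store the depth map as a 16-bit PNG in millimeters,
# a common exchange format for RGB-D pipelines.
cv2.imwrite("depth_frame0.png", (depth * 1000.0).astype(np.uint16))
```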
For ROS1 integration with real-time depth estimation, see the ros1 branch:
git checkout ros1

MIT License. Submodules have their own licenses (NVIDIA, Meta).