A unified Python package for foundation model-based 6DoF object pose estimation from video using language prompts.
Estimate 6DoF object poses by simply describing the object in natural language:
| Mode | Description |
|---|---|
| Without Mesh | Automatically generates 3D mesh from the first frame using SAM3D, then tracks pose |
| With Mesh | Uses provided mesh file directly for pose estimation and tracking |
Just provide a text prompt like "red cup" or "cardboard box" - no CAD models or manual annotation required!
- SAM3: Text-prompted image segmentation (Segment Anything Model 3)
- SAM3D: Single-image 3D mesh generation from segmentation mask
- FoundationStereo: High-quality depth estimation from stereo infrared pairs
- FoundationPose: 6DoF pose estimation and tracking from RGB-D
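These components run as one pipeline: segment the prompted object, obtain a mesh, estimate depth, then register and track the object pose. The pseudocode below sketches that flow; `segment_with_prompt`, `generate_mesh`, `estimate_depth`, `register_pose`, and `track_pose` are illustrative placeholders standing in for the four models, not functions exported by this package.

```python
# Conceptual sketch of the pipeline; the called functions are placeholders
# standing in for SAM3, SAM3D, FoundationStereo, and FoundationPose.
def estimate_poses(rgb_frames, prompt, stereo_pairs=None, depth_frames=None, mesh=None):
    # 1. SAM3: text-prompted segmentation of the object in the first frame
    mask = segment_with_prompt(rgb_frames[0], prompt)

    # 2. SAM3D: generate a mesh from the first frame if none was provided
    if mesh is None:
        mesh = generate_mesh(rgb_frames[0], mask)

    # 3. FoundationStereo: estimate depth from infrared stereo pairs if needed
    if depth_frames is None:
        depth_frames = [estimate_depth(left, right) for left, right in stereo_pairs]

    # 4. FoundationPose: register the pose on the first frame, then track it
    pose = register_pose(rgb_frames[0], depth_frames[0], mask, mesh)
    poses = [pose]
    for rgb, depth in zip(rgb_frames[1:], depth_frames[1:]):
        pose = track_pose(rgb, depth, mesh, pose)
        poses.append(pose)
    return poses  # one 4x4 object-to-camera transform per frame
```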
- CUDA 12.x compatible GPU (tested on RTX 4090)
- Conda environment with Python 3.9+
git clone --recursive -b foundationperception https://github.com/MMintLab/FoundationPerception.git
cd FoundationPerception
# Or if already cloned:
git checkout foundationperception
git submodule update --init --recursive

conda create -n foundationperception python=3.9
conda activate foundationperception
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install requirements
pip install -r requirements.txt
# Install core package
pip install -e .

cd FoundationStereo
pip install -e .
# Download pretrained model to pretrained_models/
cd ..

cd FoundationPose
pip install -r requirements.txt
bash build_all_conda.sh
cd ..

cd sam3
pip install -e .
cd ..

cd sam-3d-objects
pip install -e .
cd ..

Extract 6DoF object poses from images using a text prompt:
# With FoundationStereo depth estimation
python scripts/video_to_objectpose.py \
--image_dir path/to/rgb_images \
--prompt "red cup" \
--foundationstereo \
--infra1_dir path/to/infra1 \
--infra2_dir path/to/infra2 \
--baseline 0.05 \
--output output_dir
# With pre-computed depth
python scripts/video_to_objectpose.py \
--image_dir path/to/rgb_images \
--prompt "cardboard box" \
--depth_dir path/to/depth \
--output output_dir

# With provided mesh
python scripts/video_to_objectpose.py \
--image_dir path/to/rgb_images \
--prompt "object" \
--depth_dir path/to/depth \
--mesh path/to/mesh.obj \
--output output_dir

The script generates:
- `poses.npy` - Array of 4x4 pose matrices for each frame
- `poses_overlay.gif` - Visualization with mesh overlay
- `first_mask.png` - Initial segmentation mask
- `depth/` - Computed depth maps (if using FoundationStereo)
- `mesh/` - Generated mesh (if not provided)
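For downstream use, the saved poses can be loaded with NumPy. A minimal sketch, assuming the usual FoundationPose convention that each 4x4 matrix maps object coordinates into the camera frame:

```python
import numpy as np

poses = np.load("output_dir/poses.npy")  # shape: (num_frames, 4, 4)

for i, T in enumerate(poses):
    R = T[:3, :3]  # object orientation in the camera frame
    t = T[:3, 3]   # object position in the camera frame (typically meters)
    print(f"frame {i}: position = {t}")
```

The stereo depth component can also be used directly from Python: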
from foundationperception import StereoDepthProcessor
# Initialize stereo depth processor
processor = StereoDepthProcessor(
color_intrinsic=K_color,
depth_intrinsic=K_depth,
extrinsics_vec=extrinsics,
baseline=0.05
)
# Process stereo images
depth, pointcloud = processor.process_images(left_ir, right_ir, color_image)

foundationperception/
├── foundationperception/ # Core Python package
│ ├── __init__.py
│ ├── stereo/ # Stereo depth estimation
│ │ └── processor.py
│ └── utils.py
├── scripts/
│ └── video_to_objectpose.py # Main CLI script
├── assets/ # Example camera configs
├── FoundationStereo/ # Submodule
├── FoundationPose/ # Submodule
├── sam3/ # Submodule
├── sam-3d-objects/ # Submodule
├── requirements.txt
├── setup.py
└── README.md
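Returning to the `StereoDepthProcessor` example above, the returned arrays can be written to disk for later use. A minimal sketch, assuming `depth` is an HxW float32 depth map in meters and `pointcloud` is an array of 3D points (verify the shapes and units returned by your setup):

```python
import cv2
import numpy as np

# Assumption: depth is HxW float32 in meters; pointcloud is an array of 3D points.
np.save("depth_frame0.npy", depth)
np.save("pointcloud_frame0.npy", pointcloud)

# Optionally store the depth map as a 16-bit PNG in millimeters,
# a common exchange format for RGB-D pipelines.
cv2.imwrite("depth_frame0.png", (depth * 1000.0).astype(np.uint16))
```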
For ROS1 integration with real-time depth estimation, see the ros1 branch:
git checkout ros1

MIT License. Submodules have their own licenses (NVIDIA, Meta).