⚡ NVIDIA Warp-accelerated Lie group operations for PyPose
Features • Installation • Quick Start • Operations • Benchmarks • Development
warp-pypose provides a high-performance NVIDIA Warp-based backend for PyPose LieTensor operations. It offers significant speedups for Lie group computations on both CPU and CUDA, with full support for automatic differentiation.
- 🚀 Drop-in acceleration — Seamlessly swap PyPose backends with a single function call
- ⚡ Warp-powered kernels — Optimized parallel implementations for CPU and CUDA
- 🔄 Full autodiff support — Analytical gradients for all operations with PyTorch integration
- 📐 Comprehensive Lie group coverage — SE(3), SO(3), se(3), so(3) algebras and groups
- 🎯 FP16/FP32/FP64 precision — Multi-precision support with numerically stable implementations
- 📊 Arbitrary batch dimensions — Full broadcasting support up to 4D batches

Requirements:

- Python 3.10+
- PyTorch 2.0+
- CUDA 12.0+ (for GPU acceleration)
git clone https://github.com/MAC-VO/warp-pypose.git
cd warp-pypose
pip install -e .

pip install torch pypose warp-lang

import torch
import pypose as pp
import pypose_warp
# Create a standard PyPose SE3 LieTensor
poses = pp.randn_SE3(1000, device="cuda", dtype=torch.float32)
points = torch.randn(1000, 3, device="cuda", dtype=torch.float32)
# Convert to Warp backend for accelerated computation
poses_warp = pypose_warp.to_warp_backend(poses)
# Use exactly like PyPose — all operations are accelerated
transformed = poses_warp.Act(points) # Apply SE3 to points
matrices = poses_warp.matrix() # Convert to 4x4 matrices
logs = poses_warp.Log() # Logarithm map to se3
composed = poses_warp @ poses_warp.Inv()  # Compose transformations

import torch
import pypose as pp
from pypose_warp import to_warp_backend
# Enable gradients
poses = pp.randn_SE3(100, device="cuda", requires_grad=True)
poses_warp = to_warp_backend(poses)
points = torch.randn(100, 3, device="cuda", requires_grad=True)
# Forward pass with Warp backend
result = poses_warp.Act(points)
# Backward pass — analytical gradients computed via Warp kernels
loss = result.sum()
loss.backward()
# Gradients available on original tensors
print(poses.grad.shape) # (100, 7)
print(points.grad.shape)  # (100, 3)

import pypose as pp
from pypose_warp import to_warp_backend, to_pypose_backend, is_warp_backend
# Check and convert backends
poses = pp.randn_SE3(100)
if not is_warp_backend(poses):
    poses = to_warp_backend(poses)  # Convert to Warp for speed
# Convert back to PyPose if needed (e.g., for unsupported operations)
poses = to_pypose_backend(poses)

SE(3) group operations:

| Operation | Description | Method |
|---|---|---|
| Act | Apply transform to 3D points | X.Act(p) |
| Act4 | Apply transform to homogeneous points | X.Act(p) (4D) |
| Mul | Compose two SE3 transforms | X @ Y |
| Inv | Invert transformation | X.Inv() |
| Log | Logarithm map to se(3) | X.Log() |
| Adj | Adjoint action on se(3) | X.Adj(a) |
| AdjT | Transpose adjoint action | X.AdjT(a) |
| Jinvp | Inverse left Jacobian action | X.Jinvp(p) |
| matrix | Convert to 4×4 matrix | X.matrix() |
| add_ | In-place update via Exp | X.add_(delta) |
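
As a rough reference for what `Act` and `Inv` compute on a single pose, here is a minimal pure-Python sketch. It assumes the PyPose SE3 layout `[tx, ty, tz, qx, qy, qz, qw]` with a unit quaternion. The helper names (`quat_rotate`, `se3_act`, `se3_inv`) are hypothetical and are not part of `pypose_warp`, whose actual kernels are batched Warp implementations.

```python
def quat_rotate(q, v):
    """Rotate 3-vector v by unit quaternion q = (qx, qy, qz, qw)."""
    qx, qy, qz, qw = q
    # t = 2 * (q_vec x v)
    tx = 2.0 * (qy * v[2] - qz * v[1])
    ty = 2.0 * (qz * v[0] - qx * v[2])
    tz = 2.0 * (qx * v[1] - qy * v[0])
    # v' = v + qw * t + q_vec x t
    return (
        v[0] + qw * tx + (qy * tz - qz * ty),
        v[1] + qw * ty + (qz * tx - qx * tz),
        v[2] + qw * tz + (qx * ty - qy * tx),
    )


def se3_act(X, p):
    """Apply pose X (assumed layout: tx, ty, tz, qx, qy, qz, qw) to point p: R p + t."""
    t, q = X[:3], X[3:]
    r = quat_rotate(q, p)
    return (r[0] + t[0], r[1] + t[1], r[2] + t[2])


def se3_inv(X):
    """Invert pose X: rotation becomes the conjugate, translation becomes -R^T t."""
    t, q = X[:3], X[3:]
    q_inv = (-q[0], -q[1], -q[2], q[3])
    r = quat_rotate(q_inv, t)
    return (-r[0], -r[1], -r[2]) + q_inv
```

Applying `se3_inv(X)` to `se3_act(X, p)` should return `p` up to floating-point error, which makes a convenient sanity check when comparing backends.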

SO(3) group operations:

| Operation | Description | Method |
|---|---|---|
| Act | Rotate 3D points | R.Act(p) |
| Act4 | Rotate homogeneous points | R.Act(p) (4D) |
| Mul | Compose rotations | R @ S |
| Log | Logarithm map to so(3) | R.Log() |
| Adj | Adjoint action on so(3) | R.Adj(a) |
| AdjT | Transpose adjoint action | R.AdjT(a) |
| Jinvp | Inverse left Jacobian action | R.Jinvp(p) |
| matrix | Convert to 3×3 matrix | R.matrix() |
| add_ | In-place update via Exp | R.add_(delta) |

se(3) algebra operations:

| Operation | Description | Method |
|---|---|---|
| Exp | Exponential map to SE(3) | xi.Exp() |
| Mat | Twist to 4×4 matrix | xi.matrix() |

so(3) algebra operations:

| Operation | Description | Method |
|---|---|---|
| Exp | Exponential map to SO(3) | w.Exp() |
| Mat | Angular velocity to 3×3 matrix | w.matrix() |
| Jr | Right Jacobian | w.Jr() |
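
To illustrate the so(3) exponential map and why a numerically stable version needs a small-angle branch, here is a minimal pure-Python sketch of `Exp` and its inverse `Log` on a single rotation vector. The function names, the `(qx, qy, qz, qw)` quaternion order, and the `eps` threshold are assumptions made for this example; they are not taken from the `pypose_warp` kernels.

```python
import math


def so3_exp(w, eps=1e-8):
    """Map a rotation vector w to a unit quaternion (qx, qy, qz, qw)."""
    theta2 = w[0] * w[0] + w[1] * w[1] + w[2] * w[2]
    theta = math.sqrt(theta2)
    if theta < eps:
        # Taylor expansion avoids 0/0: sin(theta/2)/theta ~ 1/2 - theta^2/48
        k = 0.5 - theta2 / 48.0
        qw = 1.0 - theta2 / 8.0
    else:
        k = math.sin(0.5 * theta) / theta
        qw = math.cos(0.5 * theta)
    return (k * w[0], k * w[1], k * w[2], qw)


def so3_log(q, eps=1e-8):
    """Map a unit quaternion (qx, qy, qz, qw) back to a rotation vector."""
    qx, qy, qz, qw = q
    n = math.sqrt(qx * qx + qy * qy + qz * qz)
    if n < eps:
        # First-order limit of 2 * atan2(n, qw) / n as n -> 0
        k = 2.0 / qw
    else:
        k = 2.0 * math.atan2(n, qw) / n
    return (k * qx, k * qy, k * qz)
```

For rotation angles below 2π, `so3_log(so3_exp(w))` should reproduce `w` up to rounding.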
Run the benchmark suite to compare Warp vs PyPose performance:
# Run all benchmarks (generates PNG charts)
python -m bench
# Run specific operator benchmarks
python -m bench.SE3_group
python -m bench.SO3_group
python -m bench.SE3_algebra
python -m bench.SO3_algebra
# Run individual operator with custom settings
python -m bench.SE3_group.Act --device cuda --dtype fp32 --size 10000

Benchmarks test across:
- Devices: CPU, CUDA
- Data types: FP16, FP32, FP64
- Batch sizes: 128 to 32,768
- Modes: Forward and backward passes
Results are saved as PNG charts in the respective benchmark directories.
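
For quick ad-hoc comparisons outside the bundled suite, a timing helper along the following lines may be useful. This is only a sketch and not the harness in `bench/`; the name `time_op` and its parameters are hypothetical. The optional `sync` callback matters on CUDA, where kernel launches are asynchronous and timings are meaningless without a synchronization point.

```python
import statistics
import time


def time_op(fn, *, warmup=3, repeats=10, sync=None):
    """Return the median wall-clock time (seconds) of calling fn()."""
    # Warm-up runs absorb one-time costs such as kernel compilation and caching.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for asynchronous device work before stopping the clock
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Example usage (assumes torch and the Quick Start tensors are available):
# t = time_op(lambda: poses_warp.Act(points), sync=torch.cuda.synchronize)
```

The median is used rather than the mean so that an occasional slow run does not skew the result.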
The recommended development environment uses Docker with NVIDIA GPU support:
# Auto-detect CUDA version and start container
./launch.sh
# Force specific CUDA version
FORCE_CUDA=12 ./launch.sh
# Mount additional paths
./launch.sh /path/to/dataset /path/to/models

Supported configurations:
- Linux x86_64: CUDA 12.x, CUDA 13.x
- Jetson Orin: CUDA 12.x (aarch64)
- Jetson Thor: CUDA 13.x (aarch64)
# Run full test suite
pytest tests/ -v
# Run specific test file
pytest tests/test_SE3_group_Act.py -v
# Run with specific device/dtype
pytest tests/ -v -k "cuda and fp32"
# Run with coverage
pytest tests/ --cov=pypose_warp --cov-report=html

warp-pypose/
├── pypose_warp/
│ ├── __init__.py # Backend conversion utilities
│ ├── ltype/
│ │ ├── SE3_group/ # SE(3) Lie group operations
│ │ ├── SO3_group/ # SO(3) Lie group operations
│ │ ├── SE3_algebra/ # se(3) Lie algebra operations
│ │ ├── SO3_algebra/ # so(3) Lie algebra operations
│ │ └── common/ # Shared kernel utilities
│ └── utils/
├── bench/ # Benchmark suite
│ ├── SE3_group/
│ ├── SO3_group/
│ ├── SE3_algebra/
│ └── SO3_algebra/
├── tests/ # Comprehensive test suite
├── docker/ # Docker development environment
└── launch.sh # Container launch script
Each operator follows a consistent pattern:
- Forward kernel (`fwd.py`): Warp kernel implementing the operation
- Backward kernel (`bwd.py`): Warp kernel for analytical gradients
- Autograd wrapper (`__init__.py`): PyTorch Function connecting both
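
A hand-written backward pass like the one in this pattern can be validated against finite differences. The sketch below does this in pure Python for a single-pose `Act`, covering only the gradients with respect to the translation and the point; the quaternion gradient is omitted for brevity. All names here (`act_fwd`, `act_bwd`, `numeric_grad_p`) are hypothetical and only mirror the forward/backward split; they are not the repository's functions.

```python
def quat_rotate(q, v):
    """Rotate 3-vector v by unit quaternion q = (qx, qy, qz, qw)."""
    qx, qy, qz, qw = q
    tx = 2.0 * (qy * v[2] - qz * v[1])
    ty = 2.0 * (qz * v[0] - qx * v[2])
    tz = 2.0 * (qx * v[1] - qy * v[0])
    return (
        v[0] + qw * tx + (qy * tz - qz * ty),
        v[1] + qw * ty + (qz * tx - qx * tz),
        v[2] + qw * tz + (qx * ty - qy * tx),
    )


def act_fwd(t, q, p):
    """Forward pass: y = R(q) p + t."""
    r = quat_rotate(q, p)
    return tuple(r[i] + t[i] for i in range(3))


def act_bwd(q, grad_out):
    """Analytical backward pass: grad_t = g, grad_p = R(q)^T g."""
    q_conj = (-q[0], -q[1], -q[2], q[3])
    return tuple(grad_out), quat_rotate(q_conj, grad_out)


def numeric_grad_p(t, q, p, grad_out, h=1e-6):
    """Central-difference gradient of sum(grad_out * y) with respect to p."""
    grads = []
    for i in range(3):
        hi, lo = list(p), list(p)
        hi[i] += h
        lo[i] -= h
        f_hi = sum(g * y for g, y in zip(grad_out, act_fwd(t, q, hi)))
        f_lo = sum(g * y for g, y in zip(grad_out, act_fwd(t, q, lo)))
        grads.append((f_hi - f_lo) / (2.0 * h))
    return tuple(grads)
```

Comparing `act_bwd(q, g)[1]` with `numeric_grad_p(t, q, p, g)` should show agreement to within the finite-difference error.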
Example structure for SE3_Act:
# fwd.py - Forward pass
@wp.kernel
def se3_act_kernel(...):
    # Warp kernel implementation
    ...

def SE3_Act_fwd(X, p):
    # Prepare tensors, launch kernel, return result
    ...

# bwd.py - Backward pass
@wp.kernel
def se3_act_bwd_kernel(...):
    # Gradient computation kernel
    ...

def SE3_Act_bwd(X, out, grad_output):
    # Compute gradients
    ...

# __init__.py - PyTorch integration
class SE3_Act(torch.autograd.Function):
    @staticmethod
    def forward(ctx, X, p):
        return SE3_Act_fwd(X, p)

    @staticmethod
    def backward(ctx, grad_output):
        return SE3_Act_bwd(...)

This project is licensed under the MIT License; see the LICENSE file for details.
- PyPose — Differentiable Lie groups for robotics
- NVIDIA Warp — High-performance simulation and graphics programming