[DTensor] SFT with LoRA is slower than without LoRA #1688

# Describe the bug

Enabling LoRA on the DTensor backend reduces end-to-end SFT training throughput compared with the same setup without LoRA. The slowdown appears even with a small LoRA rank (`dim=8`) and all other settings unchanged.

# Reproduce

The two commands below are identical except for `policy.dtensor_cfg.lora_cfg.enabled`.

## Disable LoRA
```bash
NRL_FORCE_REBUILD_VENVS=true uv run examples/run_sft.py \
logger.wandb_enabled=True \
logger.wandb.project=lora \
logger.wandb.name=nano_v3_lora \
policy.model_name=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
checkpointing.enabled=False \
policy.dtensor_cfg.enabled=true \
policy.dtensor_cfg._v2=true \
policy.dtensor_cfg.lora_cfg.enabled=False \
policy.dtensor_cfg.lora_cfg.use_triton=False \
policy.dtensor_cfg.lora_cfg.dim=8 \
policy.max_total_sequence_length=2048 \
policy.train_global_batch_size=16 \
policy.train_micro_batch_size=1 \
policy.optimizer.name="torch.optim.Adam" \
~policy.tokenizer.chat_template  \
sft.max_num_steps=10 \
cluster.num_nodes=2 \
cluster.gpus_per_node=8
```

## Enable LoRA
```bash
NRL_FORCE_REBUILD_VENVS=true uv run examples/run_sft.py \
logger.wandb_enabled=True \
logger.wandb.project=lora \
logger.wandb.name=nano_v3_lora \
policy.model_name=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
checkpointing.enabled=False \
policy.dtensor_cfg.enabled=true \
policy.dtensor_cfg._v2=true \
policy.dtensor_cfg.lora_cfg.enabled=True \
policy.dtensor_cfg.lora_cfg.use_triton=False \
policy.dtensor_cfg.lora_cfg.dim=8 \
policy.max_total_sequence_length=2048 \
policy.train_global_batch_size=16 \
policy.train_micro_batch_size=1 \
policy.optimizer.name="torch.optim.Adam" \
~policy.tokenizer.chat_template  \
sft.max_num_steps=10 \
cluster.num_nodes=2 \
cluster.gpus_per_node=8
```
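
Since the two runs differ in a single override, generating both commands from one shared list rules out accidental drift between them. The following is only a sketch of that idea: `build_sft_cmd` is a hypothetical helper name, not part of the repository, and the script only prints the commands (a dry run) rather than launching training.

```bash
#!/usr/bin/env bash
# Sketch: build both reproduction commands from one shared override list so
# that policy.dtensor_cfg.lora_cfg.enabled is the only value that changes.
# build_sft_cmd is a hypothetical helper, not part of the repository.

build_sft_cmd() {
  local lora_enabled="$1"   # "True" or "False"
  local overrides=(
    logger.wandb_enabled=True
    logger.wandb.project=lora
    logger.wandb.name=nano_v3_lora
    policy.model_name=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
    checkpointing.enabled=False
    policy.dtensor_cfg.enabled=true
    policy.dtensor_cfg._v2=true
    "policy.dtensor_cfg.lora_cfg.enabled=${lora_enabled}"
    policy.dtensor_cfg.lora_cfg.use_triton=False
    policy.dtensor_cfg.lora_cfg.dim=8
    policy.max_total_sequence_length=2048
    policy.train_global_batch_size=16
    policy.train_micro_batch_size=1
    policy.optimizer.name=torch.optim.Adam
    '~policy.tokenizer.chat_template'
    sft.max_num_steps=10
    cluster.num_nodes=2
    cluster.gpus_per_node=8
  )
  # Print the full command on a single line without executing it.
  printf '%s ' NRL_FORCE_REBUILD_VENVS=true uv run examples/run_sft.py "${overrides[@]}"
  printf '\n'
}

# Dry run: print the baseline command first, then the LoRA command.
for flag in False True; do
  build_sft_cmd "$flag"
done
```

Each printed line can be pasted into a shell as-is to launch the corresponding run.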

# Observed Behavior

<img width="868" height="292" alt="Image" src="https://github.com/user-attachments/assets/5d4c776c-5c5b-4503-a1a3-d675829c766b" />

