feat(training): add multi-GPU and mixed-precision support to OSMO LeRobot training workflow #785

@rezatnoMsirhC

Summary

The training/il/workflows/osmo/lerobot-train.yaml OSMO workflow hardcodes gpu: 1 in its resources block and has no MIXED_PRECISION environment variable. This creates a feature parity gap with the AzureML equivalent (training/il/workflows/azureml/lerobot-train.yaml), which was updated in #778 to support configurable GPU count and mixed-precision mode.

A user submitting to an OSMO pool node with a multi-GPU SKU will silently get single-GPU training because gpu: 1 caps the container's device allocation regardless of the pool SKU. train.py also defaults MIXED_PRECISION to "no" when the variable is absent, so bf16/fp16 training is not accessible from OSMO submissions.
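The default-to-"no" behaviour described above can be sketched roughly as follows. This is an illustration only, not the actual code in train.py: the function name `resolve_mixed_precision` and the exact set of accepted values are assumptions (the issue mentions only "no", "fp16", and "bf16").

```python
import os

# Assumed set of accepted values; the issue only names "no", "fp16", and "bf16".
ALLOWED_MIXED_PRECISION = {"no", "fp16", "bf16"}


def resolve_mixed_precision(env=None):
    """Return the mixed-precision mode, falling back to "no" when unset.

    `env` is any mapping of environment variables; it defaults to os.environ.
    The function name is hypothetical and used here for illustration only.
    """
    env = os.environ if env is None else env
    value = env.get("MIXED_PRECISION", "no")
    if value not in ALLOWED_MIXED_PRECISION:
        raise ValueError(
            f"MIXED_PRECISION must be one of {sorted(ALLOWED_MIXED_PRECISION)}, got {value!r}"
        )
    return value
```

Because the OSMO workflow never sets the variable, a lookup like this always falls through to "no", which is why bf16/fp16 cannot currently be selected from an OSMO submission.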

What needs to be done

Mirror the pattern already present in training/il/workflows/azureml/lerobot-train.yaml:

# resources block
resources:
  default:
    gpu: "{{ num_gpus }}"

# environment block
environment:
  MIXED_PRECISION: "{{ mixed_precision }}"

# default-values
default-values:
  num_gpus: 1
  mixed_precision: "no"

With num_gpus: 1 and mixed_precision: "no" as defaults, existing single-GPU submissions are unaffected.
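To make the backward-compatibility argument concrete, here is a minimal sketch of how default values combine with submit-time overrides. The `render` helper and its regex-based substitution are hypothetical stand-ins for illustration; OSMO's real template engine is not shown in this issue and may behave differently.

```python
import re

# Defaults proposed in this issue.
DEFAULTS = {"num_gpus": 1, "mixed_precision": "no"}

# The two templated fields from the proposed workflow change.
TEMPLATE = {
    "gpu": "{{ num_gpus }}",
    "MIXED_PRECISION": "{{ mixed_precision }}",
}

# Matches "{{ name }}" placeholders with optional inner whitespace.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, overrides=None):
    """Fill placeholders using defaults, with overrides taking precedence."""
    values = {**DEFAULTS, **(overrides or {})}
    return {
        key: _PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), text)
        for key, text in template.items()
    }


# No overrides: same single-GPU, full-precision settings as today.
print(render(TEMPLATE))
# Overrides: a multi-GPU bf16 submission.
print(render(TEMPLATE, {"num_gpus": 4, "mixed_precision": "bf16"}))
```

With no overrides the rendered values match the current hardcoded behaviour, so only submissions that explicitly set `num_gpus` or `mixed_precision` see a change.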

Relevant files

  • training/il/workflows/osmo/lerobot-train.yaml: change target
  • training/il/workflows/azureml/lerobot-train.yaml: reference implementation
  • training/il/scripts/lerobot/train.py: MIXED_PRECISION env var already read and validated here

Context

Identified during review of #778. The only change to the OSMO workflow in that PR was adding --no-cache-dir --no-deps to the uv pip install line (a lockfile correctness fix). Multi-GPU OSMO support was explicitly deferred as out of scope for that PR.

Labels: enhancement (New feature or improvement request)
