Summary
The training/il/workflows/osmo/lerobot-train.yaml OSMO workflow hardcodes gpu: 1 in its resources block and has no MIXED_PRECISION environment variable. This creates a feature parity gap with the AzureML equivalent (training/il/workflows/azureml/lerobot-train.yaml), which was updated in #778 to support configurable GPU count and mixed-precision mode.
A user submitting to an OSMO pool node with a multi-GPU SKU will silently get single-GPU training because gpu: 1 caps the container's device allocation regardless of the pool SKU. train.py also defaults MIXED_PRECISION to "no" when the variable is absent, so bf16/fp16 training is not accessible from OSMO submissions.
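The fallback described above can be illustrated with a minimal sketch. This is not the code in train.py: the function name `resolve_mixed_precision` and the accepted set of values are assumptions made for illustration. Only the `MIXED_PRECISION` variable name and its "no" default come from this issue.

```python
import os

# Assumed set of accepted modes; the real validation in train.py may differ.
ALLOWED_MIXED_PRECISION = {"no", "fp16", "bf16"}


def resolve_mixed_precision(env=None):
    """Return the mixed-precision mode, falling back to "no" when unset.

    `env` is any mapping of environment variables (defaults to os.environ).
    """
    env = os.environ if env is None else env
    value = env.get("MIXED_PRECISION", "no").strip().lower()
    if value not in ALLOWED_MIXED_PRECISION:
        raise ValueError(
            f"MIXED_PRECISION must be one of {sorted(ALLOWED_MIXED_PRECISION)}, "
            f"got {value!r}"
        )
    return value
```

Because the OSMO workflow never sets the variable, a lookup like this always takes the "no" branch, which is why bf16/fp16 cannot currently be selected from an OSMO submission.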
What needs to be done
Mirror the pattern already present in training/il/workflows/azureml/lerobot-train.yaml:
    # resources block
    resources:
      default:
        gpu: "{{ num_gpus }}"

    # environment block
    environment:
      MIXED_PRECISION: "{{ mixed_precision }}"

    # default-values
    default-values:
      num_gpus: 1
      mixed_precision: "no"
With num_gpus: 1 and mixed_precision: "no" as defaults, existing single-GPU submissions are unaffected.
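The backward-compatibility claim can be checked with a small stand-in for placeholder substitution. This is a rough sketch only: it does not use OSMO's actual templating engine, and the `render` helper and the `TEMPLATE` string are hypothetical. It shows that, with the proposed defaults, the rendered values match today's hardcoded behaviour unless a submission overrides them.

```python
import re

# Proposed defaults from this issue.
DEFAULTS = {"num_gpus": 1, "mixed_precision": "no"}

# Hypothetical two-line excerpt of the templated workflow fields.
TEMPLATE = 'gpu: "{{ num_gpus }}"\nMIXED_PRECISION: "{{ mixed_precision }}"'


def render(template, overrides=None):
    """Substitute {{ name }} placeholders, letting overrides win over defaults."""
    values = {**DEFAULTS, **(overrides or {})}

    def substitute(match):
        return str(values[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# No overrides: same values as the current hardcoded workflow.
print(render(TEMPLATE))
# Overrides: a multi-GPU bf16 submission.
print(render(TEMPLATE, {"num_gpus": 4, "mixed_precision": "bf16"}))
```

How overrides are actually passed at submission time depends on the OSMO CLI and is not covered by this sketch.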
Relevant files
training/il/workflows/osmo/lerobot-train.yaml — change target
training/il/workflows/azureml/lerobot-train.yaml — reference implementation
training/il/scripts/lerobot/train.py — MIXED_PRECISION env var already read and validated here
Context
Identified during review of #778. The only change to the OSMO workflow in that PR was adding --no-cache-dir --no-deps to the uv pip install line (a lockfile correctness fix). Multi-GPU OSMO support was explicitly deferred as out of scope for that PR.