Component
Training Workflows (OSMO)
Problem Statement
The OSMO LeRobot submission path still uses an exploded Azure Blob input contract (--from-blob, --storage-account, --storage-container, and --blob-prefix) and only handles a single Blob source. This differs from the newer AzureML LeRobot path, which accepts concrete dataset sources without requiring users to model every storage coordinate as a separate argument.
OSMO workflows consume concrete storage locations, not AzureML workspace control-plane assets. Treating AzureML data assets as first-class OSMO inputs would force the submit script to shell out to AzureML, resolve assets to Blob paths, validate asset type and workspace scope, and then submit Blob URLs anyway. That coupling is out of scope for this OSMO parity fix.
Proposed Solution
Update submit-osmo-lerobot-training.sh and the OSMO LeRobot workflow to support one or more direct Azure Blob dataset URLs.
The OSMO submission script should:
- Replace the canonical --from-blob, --storage-account, --storage-container, and --blob-prefix contract with repeatable --blob-url URL arguments.
- Accept one or more HTTPS Azure Blob URLs that include account, container, and non-empty prefix information.
- Reject SAS/query-string URLs, fragments, unsupported schemes, AzureML asset identifiers, and AzureML datastore URIs.
- Default --dataset-repo-id to dataset for Blob submissions so Hugging Face dataset IDs are not required for non-HF sources.
- Preserve Blob URL ordering and pass the source list to the workflow through BLOB_URLS JSON.
- Use the existing OSMO workload identity path for Blob downloads and fail early when the OSMO identity lacks Blob data-plane access.
- Document the new arguments, validation rules, workload identity requirement, and unsupported input forms.
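The validation rules above could be sketched roughly as follows. This is an illustration in Python, not the submit script itself (which is a shell script). The names parse_blob_url, build_blob_urls_json, and BLOB_HOST_SUFFIX are hypothetical, and the sketch assumes the public-cloud endpoint suffix .blob.core.windows.net. It does not cover the workload identity access check.

```python
import json
import re
from urllib.parse import urlsplit

# Assumption: public-cloud Blob endpoint; sovereign clouds use other suffixes.
BLOB_HOST_SUFFIX = ".blob.core.windows.net"
# Storage account names are 3-24 lowercase letters and digits.
ACCOUNT_RE = re.compile(r"[a-z0-9]{3,24}")


def parse_blob_url(url: str) -> dict:
    """Validate one Blob URL and split it into account, container, and prefix."""
    if "?" in url:
        raise ValueError(f"query strings (including SAS tokens) are rejected: {url!r}")
    if "#" in url:
        raise ValueError(f"fragments are rejected: {url!r}")
    parts = urlsplit(url)
    # Also rejects azureml: asset identifiers and azureml:// datastore URIs.
    if parts.scheme != "https":
        raise ValueError(f"only https Blob URLs are accepted: {url!r}")
    host = (parts.hostname or "").lower()
    if not host.endswith(BLOB_HOST_SUFFIX):
        raise ValueError(f"not an Azure Blob endpoint: {url!r}")
    account = host[: -len(BLOB_HOST_SUFFIX)]
    if not ACCOUNT_RE.fullmatch(account):
        raise ValueError(f"invalid storage account in {url!r}")
    container, _, prefix = parts.path.lstrip("/").partition("/")
    prefix = prefix.strip("/")
    if not container or not prefix:
        raise ValueError(f"container and non-empty prefix are required: {url!r}")
    return {"account": account, "container": container, "prefix": prefix}


def build_blob_urls_json(urls: list[str]) -> str:
    """Validate every URL, then serialize the list in its original order."""
    if not urls:
        raise ValueError("at least one --blob-url is required")
    for url in urls:
        parse_blob_url(url)
    return json.dumps(urls)
```

Under these assumptions, the string returned by build_blob_urls_json is what the submit step would export as BLOB_URLS, and the workflow would decode it with json.loads to recover the sources in submission order.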
Out of Scope
OSMO should not accept AzureML data asset identifiers in this issue. A future enhancement can add a separate resolver that converts pinned AzureML uri_folder assets to Blob URLs before OSMO submission.
Also out of scope: SAS/public Blob downloads, ADLS Gen2, Azure Files, OneLake, AzureML datastore URIs, cross-subscription Blob discovery, and multi-node distributed LeRobot training.
Alternatives Considered
Add --dataset-asset to the OSMO submit script and resolve AzureML assets during submission. That would superficially match AzureML input ergonomics, but it would add AzureML control-plane coupling to an OSMO path and still reduce to Blob URL submission at runtime. Requiring concrete Blob URLs keeps the OSMO interface simpler and honest about what the workflow actually consumes.
Additional Context
Related work: