feat(training): support repeatable Azure Blob URLs for OSMO LeRobot submissions #791

@algattik

Component

Training Workflows (OSMO)

Problem Statement

The OSMO LeRobot submission path still uses an exploded Azure Blob input contract (--from-blob, --storage-account, --storage-container, and --blob-prefix) and only handles a single Blob source. This differs from the newer AzureML LeRobot path, which accepts concrete dataset sources without requiring users to model every storage coordinate as a separate argument.

OSMO workflows consume concrete storage locations, not AzureML workspace control-plane assets. Treating AzureML data assets as first-class OSMO inputs would force the submit script to shell out to AzureML, resolve assets to Blob paths, validate asset type and workspace scope, and then submit Blob URLs anyway. That coupling is out of scope for this OSMO parity fix.

Proposed Solution

Update submit-osmo-lerobot-training.sh and the OSMO LeRobot workflow to support one or more direct Azure Blob dataset URLs.

The OSMO submission script should:

  • Replace the canonical --from-blob, --storage-account, --storage-container, and --blob-prefix contract with repeatable --blob-url URL arguments.
  • Accept one or more HTTPS Azure Blob URLs that include account, container, and non-empty prefix information.
  • Reject SAS/query-string URLs, fragments, unsupported schemes, AzureML asset identifiers, and AzureML datastore URIs.
  • Default --dataset-repo-id to the literal value dataset for Blob submissions, so a Hugging Face dataset ID is not required for non-HF sources.
  • Preserve Blob URL ordering and pass the source list to the workflow through BLOB_URLS JSON.
  • Use the existing OSMO workload identity path for Blob downloads and fail early when the OSMO identity lacks Blob data-plane access.
  • Document the new arguments, validation rules, workload identity requirement, and unsupported input forms.
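
The validation and ordering rules above can be sketched as follows. This is an illustrative Python sketch only, not the implementation in submit-osmo-lerobot-training.sh (which is a shell script). The names BlobSource, parse_blob_url, and encode_blob_urls are hypothetical, and the sketch assumes that only the public-cloud host suffix .blob.core.windows.net is accepted, which the issue does not state.

```python
import json
from typing import Iterable, NamedTuple
from urllib.parse import urlsplit

# Assumption: only the public-cloud Blob endpoint is accepted.
BLOB_HOST_SUFFIX = ".blob.core.windows.net"


class BlobSource(NamedTuple):
    """Hypothetical container for the coordinates encoded in one Blob URL."""
    account: str
    container: str
    prefix: str


def parse_blob_url(url: str) -> BlobSource:
    """Validate one --blob-url value and split it into account, container, prefix."""
    parts = urlsplit(url)
    # Rejects http://, azureml:// datastore URIs, and azureml: asset identifiers.
    if parts.scheme != "https":
        raise ValueError(f"unsupported scheme (expected https): {url}")
    # Check the raw string so that a bare trailing '?' or '#' is rejected too.
    if "?" in url or "#" in url:
        raise ValueError(f"SAS/query strings and fragments are not allowed: {url}")
    host = parts.hostname or ""
    # Reject ports and userinfo by requiring netloc to be exactly the host.
    if parts.netloc.lower() != host or not host.endswith(BLOB_HOST_SUFFIX):
        raise ValueError(f"not an Azure Blob endpoint: {url}")
    account = host[: -len(BLOB_HOST_SUFFIX)]
    if not account.isalnum():
        raise ValueError(f"invalid storage account in URL: {url}")
    container, _, prefix = parts.path.lstrip("/").partition("/")
    prefix = prefix.strip("/")
    if not container or not prefix:
        raise ValueError(f"URL needs a container and a non-empty prefix: {url}")
    return BlobSource(account, container, prefix)


def encode_blob_urls(urls: Iterable[str]) -> str:
    """Validate every URL, then serialize the list in its original order."""
    ordered = list(urls)
    for url in ordered:
        parse_blob_url(url)  # raises ValueError on the first invalid URL
    return json.dumps(ordered)
```

Under these assumptions, the value returned by encode_blob_urls is what would be passed to the workflow as BLOB_URLS, and the workflow side could recover the ordered list with json.loads and call parse_blob_url on each entry to obtain the storage coordinates.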

Out of Scope

OSMO should not accept AzureML data asset identifiers in this issue. A future enhancement can add a separate resolver that converts pinned AzureML uri_folder assets to Blob URLs before OSMO submission.

Also out of scope: SAS/public Blob downloads, ADLS Gen2, Azure Files, OneLake, AzureML datastore URIs, cross-subscription Blob discovery, and multi-node distributed LeRobot training.

Alternatives Considered

One alternative is to add --dataset-asset to the OSMO submit script and resolve AzureML assets during submission. That would superficially match the AzureML input ergonomics, but it would couple an OSMO path to the AzureML control plane and still reduce to Blob URL submission at runtime. Requiring concrete Blob URLs keeps the OSMO interface simpler and honest about what the workflow actually consumes.

Additional Context

Related work:
