test(e2e): run gpu workloads from manifest #1709

Open
elezar wants to merge 10 commits into main from feat/1472-gpu-validation-tests/elezar

@elezar (Member) commented Jun 3, 2026

Summary

This PR adds manifest-driven GPU workload execution tests on top of the workload image artifacts from #1484. It keeps the existing GPU device-selection coverage, adds workload execution coverage under the umbrella gpu target, and documents how to build workload images locally before running the GPU e2e suite.

This branch is now rebased on the local e2e stabilization fixes from #1935, so the Docker GPU test path also includes the supervisor-image and host SSH linker-environment fixes needed for local Nix/devenv runs.

Related Issue

Closes #1472

Changes

  • Switch GPU workload execution tests from a single image env var to a YAML workload manifest consumed by the Rust e2e harness.
  • Run the manifest-defined workloads through openshell sandbox create --gpu --from <image> -- <command> and enforce declared pass or fail expectations.
  • Load the local manifest from e2e/gpu/images/.build/workloads.yaml by default, with OPENSHELL_E2E_WORKLOAD_MANIFEST available for external manifests.
  • Update the Docker GPU e2e wrapper to point users at the workload manifest flow when no local manifest exists.
  • Add serde_yaml to the e2e crate for manifest parsing.
  • Include the local e2e fixes from #1935 (fix(e2e): stabilize local Docker smoke test): configured Docker supervisor image handling and host SSH linker-environment isolation.
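
The manifest-driven flow described above can be sketched roughly as follows. This is a minimal illustration and not the harness code from this PR: the `Workload` struct, its field names, the `Expectation` enum, and both helper functions are hypothetical, and the real manifest schema may differ. Only the `openshell sandbox create --gpu --from <image> -- <command>` argument shape and the pass or fail expectation come from the description.

```rust
/// Hypothetical expectation that a manifest entry might declare.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Expectation {
    Pass,
    Fail,
}

/// Hypothetical in-memory form of one manifest entry.
/// The field names are assumptions, not the schema used by this PR.
#[derive(Debug, Clone)]
pub struct Workload {
    pub name: String,
    pub image: String,
    pub command: Vec<String>,
    pub expect: Expectation,
}

/// Build the arguments for
/// `openshell sandbox create --gpu --from <image> -- <command>`.
/// The binary name itself is left to the caller.
pub fn sandbox_args(workload: &Workload) -> Vec<String> {
    let mut args: Vec<String> = ["sandbox", "create", "--gpu", "--from"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    args.push(workload.image.clone());
    // Everything after `--` is the workload command and its arguments.
    args.push("--".to_string());
    args.extend(workload.command.iter().cloned());
    args
}

/// Compare the observed result with the declared expectation.
/// `succeeded` is true when the sandbox command exited successfully.
pub fn check_outcome(workload: &Workload, succeeded: bool) -> Result<(), String> {
    match (workload.expect, succeeded) {
        (Expectation::Pass, true) | (Expectation::Fail, false) => Ok(()),
        (Expectation::Pass, false) => Err(format!(
            "workload `{}` was expected to pass but failed",
            workload.name
        )),
        (Expectation::Fail, true) => Err(format!(
            "workload `{}` was expected to fail but passed",
            workload.name
        )),
    }
}
```

A test loop under these assumptions would pass `sandbox_args(&workload)` to `std::process::Command::new("openshell").args(...)`, then feed `status.success()` into `check_outcome` so that an expected failure that unexpectedly succeeds is also reported as a test failure.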

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Validation status:

  • mise run e2e:docker:gpu
  • mise run pre-commit was run after rebasing onto main. Rust format/check/clippy, markdown lint, Python format, license checks, and docs checks completed successfully.
  • mise run pre-commit as a whole currently fails in helm:lint because the local chart dependency directory is missing the postgresql dependency. This failure is unrelated to the GPU workload changes.

GPU validation commands for future runs:

  • mise run e2e:workloads:build
  • mise run e2e:docker:gpu

Notes:

  • Build workload images and generate the local manifest with mise run e2e:workloads:build before running mise run e2e:docker:gpu locally.
  • External catalogs can be exercised by setting OPENSHELL_E2E_WORKLOAD_MANIFEST=/abs/path/to/workloads.yaml.
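
The manifest lookup order in these notes can be sketched as below. The function and constant names are hypothetical, and treating an empty variable as unset is an assumption rather than documented behavior of the harness. Only the default path and the `OPENSHELL_E2E_WORKLOAD_MANIFEST` variable name come from this description.

```rust
use std::env;
use std::path::PathBuf;

/// Environment variable that points at an external manifest.
pub const MANIFEST_ENV: &str = "OPENSHELL_E2E_WORKLOAD_MANIFEST";

/// Default location of the locally generated manifest (relative path,
/// assumed here to be resolved against the repository root by the caller).
pub const DEFAULT_MANIFEST: &str = "e2e/gpu/images/.build/workloads.yaml";

/// Pick the manifest path: a non-empty override wins, otherwise the default.
/// Ignoring empty or whitespace-only values is an assumption of this sketch.
pub fn resolve_manifest_path(override_value: Option<&str>) -> PathBuf {
    match override_value {
        Some(path) if !path.trim().is_empty() => PathBuf::from(path),
        _ => PathBuf::from(DEFAULT_MANIFEST),
    }
}

/// Convenience wrapper that reads the override from the process environment.
pub fn manifest_path_from_env() -> PathBuf {
    let value = env::var(MANIFEST_ENV).ok();
    resolve_manifest_path(value.as_deref())
}
```

Keeping the environment read in a thin wrapper means the selection logic can be unit tested without mutating process-wide environment variables.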

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

copy-pr-bot Bot commented Jun 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@elezar force-pushed the feat/1476-gpu-workload-images/elezar branch from 5cc2d92 to efe4d25 on June 4, 2026 12:56
@elezar force-pushed the feat/1472-gpu-validation-tests/elezar branch from 5a84bca to 1c8f7b7 on June 4, 2026 14:13
copy-pr-bot Bot commented Jun 4, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@elezar force-pushed the feat/1476-gpu-workload-images/elezar branch 2 times, most recently from de40d64 to 8426fac on June 10, 2026 20:54
Base automatically changed from feat/1476-gpu-workload-images/elezar to main June 15, 2026 18:26
@elezar force-pushed the feat/1472-gpu-validation-tests/elezar branch from 1c8f7b7 to c5182b1 on June 16, 2026 08:33
@elezar force-pushed the feat/1472-gpu-validation-tests/elezar branch from 032f133 to 55ed9ce on June 17, 2026 07:03
@elezar marked this pull request as ready for review June 17, 2026 09:58
@elezar added the test:e2e (Requires end-to-end coverage) and test:e2e-gpu (Requires GPU end-to-end coverage) labels Jun 17, 2026
@github-actions

Label test:e2e-gpu applied for 2f36b22. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute GPU E2E after building the required supervisor image once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

@github-actions

Label test:e2e applied for 2f36b22. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

elezar added 9 commits June 17, 2026 14:28
@elezar force-pushed the feat/1472-gpu-validation-tests/elezar branch from 2f36b22 to 386d638 on June 17, 2026 12:35
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Labels

test:e2e (Requires end-to-end coverage), test:e2e-gpu (Requires GPU end-to-end coverage)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Define GPU validation tests for GPU-enabled drivers

1 participant