Skip to content

fix(infrastructure): KAI Scheduler v0.5.5 binder webhook caBundle uses non-CA cert, blocks all OSMO pod admission #794

@algattik

Description

@algattik

Component

OSMO Control Plane (Helm/Kubernetes) — KAI Scheduler dependency installed by infrastructure/setup/01-deploy-robotics-charts.sh.

Summary

A fresh infrastructure/setup deployment installs KAI Scheduler v0.5.5, which generates its own self-signed serving certificate for binder.kai-scheduler.svc and patches that same leaf certificate into the caBundle of its MutatingWebhookConfiguration and ValidatingWebhookConfiguration. The Kubernetes API server treats caBundle as a CA trust anchor, but the certificate KAI installs lacks CA:TRUE / keyCertSign. The API server therefore rejects the chain with parent certificate cannot sign this kind of certificate, every admission call to binder.kai-scheduler.svc returns HTTP 500, and all pod creates in osmo-workflows are blocked, including OSMO CreateGroup. The failure is silent until a workflow is submitted.

Symptoms

OSMO workflow submission fails immediately with backend exit code 3002:

CreateGroup job failed: Got exception when running backend execute:
InternalServerError: 500
Internal error occurred: failed calling webhook "binder.run.ai":
  failed to call webhook: Post "https://binder.kai-scheduler.svc:443/mutate--v1-pod?timeout=10s":
  tls: failed to verify certificate:
  x509: certificate signed by unknown authority
  (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate"
   while trying to verify candidate authority certificate "binder.kai-scheduler.svc")

Equivalent reproduction outside OSMO with a server-side dry-run:

kubectl run probe --image=busybox --restart=Never -n osmo-workflows --dry-run=server -- /bin/true
# Error from server (InternalError): Internal error occurred: failed calling webhook "binder.run.ai" ...

Steps to reproduce

  1. Run the documented deploy sequence on a fresh AKS cluster:
    ./infrastructure/setup/01-deploy-robotics-charts.sh
    ./infrastructure/setup/02-deploy-azureml-extension.sh
    ./infrastructure/setup/03-deploy-osmo-backend.sh
    kubectl port-forward svc/osmo-service -n osmo-control-plane 8080:80 &
    infrastructure/setup/04-deploy-osmo-backend.sh  --service-url http://localhost:8080
  2. Submit an OSMO workflow that lands a pod in osmo-workflows, e.g.:
   ./training/il/scripts/submit-osmo-lerobot-training.sh \
  -d my-org/my-dataset \
  --from-blob \
  --storage-account <your-storage-account> \
  --storage-container datasets \
  --blob-prefix my-dataset/v1
  1. The workflow fails immediately with the binder.run.ai TLS error above.

Impact

  • Severity: blocker for every OSMO workflow on a freshly deployed cluster.
  • Surface: every code path that creates a pod under osmo-workflows, which includes all training, inference, and dataset workflows submitted through OSMO.
  • Detection: the failure happens at admission time, before any container starts, so user-facing job logs are empty and the only signal is OSMO backend 3002 errors. Easy to misdiagnose as an OSMO bug.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions