Component
OSMO Control Plane (Helm/Kubernetes) — KAI Scheduler dependency installed by infrastructure/setup/01-deploy-robotics-charts.sh.
Summary
A fresh infrastructure/setup deployment installs KAI Scheduler v0.5.5, which generates its own self-signed serving certificate for binder.kai-scheduler.svc and patches that same leaf certificate into the caBundle of its MutatingWebhookConfiguration and ValidatingWebhookConfiguration. The Kubernetes API server treats caBundle as a CA trust anchor, but the certificate KAI installs lacks CA:TRUE / keyCertSign. The API server therefore rejects the chain with parent certificate cannot sign this kind of certificate, every admission call to binder.kai-scheduler.svc returns HTTP 500, and all pod creates in osmo-workflows are blocked, including OSMO CreateGroup. The failure is silent until a workflow is submitted.
Symptoms
OSMO workflow submission fails immediately with backend exit code 3002:
CreateGroup job failed: Got exception when running backend execute:
InternalServerError: 500
Internal error occurred: failed calling webhook "binder.run.ai":
failed to call webhook: Post "https://binder.kai-scheduler.svc:443/mutate--v1-pod?timeout=10s":
tls: failed to verify certificate:
x509: certificate signed by unknown authority
(possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate"
while trying to verify candidate authority certificate "binder.kai-scheduler.svc")
Equivalent reproduction outside OSMO with a server-side dry-run:
kubectl run probe --image=busybox --restart=Never -n osmo-workflows --dry-run=server -- /bin/true
# Error from server (InternalError): Internal error occurred: failed calling webhook "binder.run.ai" ...
Steps to reproduce
- Run the documented deploy sequence on a fresh AKS cluster:
./infrastructure/setup/01-deploy-robotics-charts.sh
./infrastructure/setup/02-deploy-azureml-extension.sh
./infrastructure/setup/03-deploy-osmo-backend.sh
kubectl port-forward svc/osmo-service -n osmo-control-plane 8080:80 &
infrastructure/setup/04-deploy-osmo-backend.sh --service-url http://localhost:8080
- Submit an OSMO workflow that lands a pod in
osmo-workflows, e.g.:
./training/il/scripts/submit-osmo-lerobot-training.sh \
-d my-org/my-dataset \
--from-blob \
--storage-account <your-storage-account> \
--storage-container datasets \
--blob-prefix my-dataset/v1
- The workflow fails immediately with the
binder.run.ai TLS error above.
Impact
- Severity: blocker for every OSMO workflow on a freshly deployed cluster.
- Surface: every code path that creates a pod under
osmo-workflows, which includes all training, inference, and dataset workflows submitted through OSMO.
- Detection: the failure happens at admission time, before any container starts, so user-facing job logs are empty and the only signal is OSMO backend
3002 errors. Easy to misdiagnose as an OSMO bug.
Component
OSMO Control Plane (Helm/Kubernetes) — KAI Scheduler dependency installed by
infrastructure/setup/01-deploy-robotics-charts.sh.Summary
A fresh
infrastructure/setupdeployment installs KAI Schedulerv0.5.5, which generates its own self-signed serving certificate forbinder.kai-scheduler.svcand patches that same leaf certificate into thecaBundleof itsMutatingWebhookConfigurationandValidatingWebhookConfiguration. The Kubernetes API server treatscaBundleas a CA trust anchor, but the certificate KAI installs lacksCA:TRUE/keyCertSign. The API server therefore rejects the chain withparent certificate cannot sign this kind of certificate, every admission call tobinder.kai-scheduler.svcreturns HTTP 500, and all pod creates inosmo-workflowsare blocked, including OSMOCreateGroup. The failure is silent until a workflow is submitted.Symptoms
OSMO workflow submission fails immediately with backend exit code
3002:Equivalent reproduction outside OSMO with a server-side dry-run:
kubectl run probe --image=busybox --restart=Never -n osmo-workflows --dry-run=server -- /bin/true # Error from server (InternalError): Internal error occurred: failed calling webhook "binder.run.ai" ...Steps to reproduce
./infrastructure/setup/01-deploy-robotics-charts.sh ./infrastructure/setup/02-deploy-azureml-extension.sh ./infrastructure/setup/03-deploy-osmo-backend.sh kubectl port-forward svc/osmo-service -n osmo-control-plane 8080:80 & infrastructure/setup/04-deploy-osmo-backend.sh --service-url http://localhost:8080osmo-workflows, e.g.:binder.run.aiTLS error above.Impact
osmo-workflows, which includes all training, inference, and dataset workflows submitted through OSMO.3002errors. Easy to misdiagnose as an OSMO bug.