What happened?
Agent.Listener hangs after DELETE /sessions in --once mode when worker FinishJob races the listener's renewal cadence (Linux, 4.272.0)
Environment
| | |
| --- | --- |
| Agent version | 4.272.0 (linux-x64, published 2026-04-27) |
| Mode | ./run.sh --once (one-time-use) |
| OS | Ubuntu Noble (24.04) inside container |
| Kubernetes | AKS 1.34.6 (control plane and node pools) |
| Container runtime | sysbox-runc with Docker-in-Docker (dockerd inside the agent container) |
| Scaling | KEDA ScaledJob (one job per pod, parallelism: 1) |
| Observed frequency | ~17% of pods (26 of 152 over a 48-hour window) |
Symptom
After completing a job, Agent.Listener writes its normal shutdown trace to the diagnostic log ending in DELETE /sessions/{id} → HTTP 204, then stops
writing. The dotnet process does not exit; ./run.sh --once does not return; the container stays alive indefinitely until Kubernetes
activeDeadlineSeconds or manual kubectl delete reaps it.
The user-visible job result is Succeeded (the Worker correctly reports FinishJob to the server before this happens). The post-job hang affects pod
lifecycle only — it does not corrupt the job's recorded result.
In one captured incident the pod was alive for ≥8 hours 11 minutes after the last line in the agent's own diag log. During that entire 8-hour window,
Agent.Listener wrote zero bytes to its diag log. No further exception, warning, or thread-state output — just silence.
Reproduction (intermittent — ~1 in 6 jobs)
- Spin up a one-time-use agent pod (./run.sh --once).
- Run any non-trivial job. Both pipeline-step success and pipeline-step failure can trigger the post-job hang.
- Worker completes all user steps and JobRunner.FinalizeJob.
- Worker calls FinishJob to the server. Server marks the job complete. Worker exits with ExitCode: 100.
- Listener's next renewal (every 60s) hits /jobrequests/{id} and receives HTTP 400 BadRequest with TaskAgentJobTokenExpiredException because the job is already complete server-side.
- Listener treats this as Job: Abandoned, tears down its dispatcher, and ends its diag log normally with DELETE /sessions/{id} → 204.
- From that point Agent.Listener stops logging and the process does not exit.
Diagnostic log excerpt
Last 12 lines of an affected Agent_*.log. The file ends here; there is no further content for the next 8+ hours even though the pod remains alive:
[09:05:53Z INFO Agent] [RunAsync] Beginning agent shutdown sequence - cleaning up resources
[09:05:53Z INFO Agent] [RunAsync] Shutting down job dispatcher - terminating active jobs
[09:05:53Z INFO JobDispatcher] [ShutdownAsync] JobDispatcher shutdown initiated [ActiveJobs:1, QueuedJobs:1]
[09:05:53Z INFO JobDispatcher] [ShutdownAsync] Waiting for worker completion and cancelling any running jobs ...
[09:05:53Z ERR JobDispatcher] [EnsureDispatchFinished] Worker Dispatch failed with an exception ...
[09:05:53Z ERR JobDispatcher] [EnsureDispatchFinished] System.IO.IOException: Broken pipe
... (full stack trace, caught — Worker had already exited normally) ...
[09:05:53Z INFO JobDispatcher] [EnsureDispatchFinished] Worker dispatcher cleanup completed [..., RemovedFromDictionary:True, Disposed:True, ActiveJobs:0]
[09:05:53Z INFO JobDispatcher] [ShutdownAsync] Worker process shutdown completed successfully ...
[09:05:53Z INFO JobDispatcher] [ShutdownAsync] JobDispatcher shutdown completed - all worker processes terminated
[09:05:53Z INFO Agent] [RunAsync] Cleaning up agent listener session - disconnecting from Azure DevOps
[09:05:53Z VERB VisualStudioServices] [OnEventWritten] Started DELETE request to https://dev.azure.com/{org}/_apis/distributedtask/pools/{poolId}/sessions/{sessionGuid}
[09:05:53Z VERB VisualStudioServices] [OnEventWritten] Finished DELETE request ... with status code 204
After this, container stdout produces only the wrapper run.sh message "Error reported in diagnostic logs" referencing this file, then total silence
(no agent output, no dockerd heartbeat, nothing) until the pod is reaped.
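A sidecar or wrapper script could flag this state from the diag log alone. The following is a minimal sketch, not part of the agent: it assumes the marker strings match the excerpt above, and `min_idle_seconds=600` is an arbitrary threshold chosen for illustration.

```python
import os
import time


def idle_seconds_of(path):
    """Seconds since the file at `path` was last modified."""
    return time.time() - os.path.getmtime(path)


def looks_like_post_delete_hang(log_text, idle_seconds, min_idle_seconds=600):
    """Heuristic: the log ends with the session DELETE returning 204
    and nothing has been written for at least `min_idle_seconds`."""
    lines = [ln.strip() for ln in log_text.splitlines() if ln.strip()]
    if not lines:
        return False
    last = lines[-1]
    finished_delete = "Finished DELETE request" in last and "status code 204" in last
    # The "Started DELETE" line may be wrapped, so search a short joined tail.
    tail = " ".join(lines[-4:])
    session_delete = "Started DELETE request" in tail and "/sessions/" in tail
    return finished_delete and session_delete and idle_seconds >= min_idle_seconds
```

In a pod, `idle_seconds_of(path_to_agent_log)` would supply the second argument. A positive result only says the log matches this signature; it does not prove the Listener process is still alive.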
Hypothesized cause (one or more)
- Background thread leak: a non-background Task or thread (renewal task, MessageListener polling loop, JobServerQueue workers) is still alive when RunAsync returns, blocking process exit. The orderly shutdown disposed the trace writers before the offending thread, so the hang itself is invisible to file-based logging.
- HttpClient / SslStream / VssConnection disposal blocked on a half-closed TLS connection to dev.azure.com after the DELETE.
- JobDispatcher or MessageListener disposal awaiting a CancellationToken-registered callback that doesn't fire.
The TaskAgentJobTokenExpiredException shutdown branch appears more prone to this than a "normal" shutdown; most jobs that take the same branch exit cleanly, but ~17% don't.
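The first hypothesis can be illustrated with a Python analog. This is not the agent's .NET code: Python's non-daemon threads play the role of .NET foreground threads here, and the 30-second sleep stands in for a leaked loop that never observes cancellation.

```python
import subprocess
import sys
import textwrap

# Child program: main returns immediately, but one thread is left running.
CHILD = textwrap.dedent("""
    import sys, threading, time
    daemon = sys.argv[1] == "daemon"
    # Stand-in for a leaked worker loop that never observes cancellation.
    threading.Thread(target=lambda: time.sleep(30), daemon=daemon).start()
    print("main returned", flush=True)
""")


def exits_within(mode, timeout=3.0):
    """Run the child in 'daemon' or 'foreground' mode; True if it exits in time."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, mode], stdout=subprocess.DEVNULL
    )
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        return False
```

In both modes the child prints "main returned" and logs nothing afterwards. Only the foreground variant keeps the process alive, which matches the silent-but-alive symptom described above.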
What's needed
- Code review of the Agent.cs:RunAsync cleanup path and the disposal chain it triggers, focused on threads/tasks that could outlive Main().
- An option (env var or config) to force Environment.Exit(0) after DELETE /sessions completes on --once mode. By definition there is no further work for a one-time-use agent; a deliberate hard-exit would close off this entire class of disposal hang.
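The requested opt-in hard exit could take roughly the following shape. This is a Python sketch under stated assumptions, not a proposed patch: `os._exit` stands in for `Environment.Exit(0)`, `HYPOTHETICAL_FORCE_EXIT` is an invented switch name, and the 0.5-second grace period is arbitrary.

```python
import os
import subprocess
import sys
import textwrap

# Child program: a Python stand-in, NOT the agent's code.
CHILD = textwrap.dedent("""
    import os, threading, time

    def arm_hard_exit(grace_seconds, code=0):
        # Daemon timer: never blocks exit itself; fires only if the process
        # is still alive once the grace period has elapsed.
        timer = threading.Timer(grace_seconds, os._exit, args=(code,))
        timer.daemon = True
        timer.start()

    # Simulated leak: a foreground thread that would block normal exit.
    threading.Thread(target=time.sleep, args=(60,)).start()

    # HYPOTHETICAL_FORCE_EXIT is an invented opt-in switch for this sketch.
    if os.environ.get("HYPOTHETICAL_FORCE_EXIT") == "1":
        arm_hard_exit(0.5)
""")


def run_child(force_exit, timeout=3.0):
    """Return the child's exit code, or None if it had to be killed."""
    env = dict(os.environ, HYPOTHETICAL_FORCE_EXIT="1" if force_exit else "0")
    proc = subprocess.Popen([sys.executable, "-c", CHILD], env=env)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        return None
```

Arming the timer only after session cleanup has finished, with a short grace period, would let a healthy shutdown complete on its own and reserve the hard exit for the stuck case.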
Workaround currently in place (downstream)
- kill -9 on Agent.PluginHost log streamer at end of user steps, to resolve the separate AgentLogPlugin.WaitAsync hang at JobExtension.FinalizeJob:644 (worker-side, FinalizeJob hang, ~22-57 min). That workaround is verified working: the Worker catches Exit code 137 from the SIGKILLed log host, logs it, and proceeds to a clean ExitCode: 100. The Worker-side issue is fully mitigated. The disposal hang described here is downstream of that: it manifests in Agent.Listener after the Worker has already exited.
- Kubernetes activeDeadlineSeconds on the agent ScaledJob, so a hung pod is force-reaped at a fixed ceiling.
Attachments
I can attach these once they are anonymized, but is there a way to share them with you privately instead?
- Full Agent_*.log (~95 KB)
- Full Worker_*.log (~459 KB)
Both show the complete sequence: pkill of log streamer → Exit code 137 caught → Worker ExitCode: 100 → Listener BadRequest renewal → Listener session
DELETE 204 → silence.
What's still unknown
We have not yet captured ps -ef / /proc/1/status from a live zombie pod, which would distinguish whether dotnet Agent.Listener is still alive in the
process tree (hypotheses 1-3 above) or whether the dotnet process has actually exited and something else inside the container (dockerd, a containerd
shim, signal handling in the entrypoint script) is keeping PID 1 alive. We'll attach that next time we catch a fresh zombie.
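If ps is not available in the image, the same information can be read straight from /proc. The helper below is a rough sketch, assuming a Linux /proc layout and a Python interpreter in the container; the function names are made up for this example.

```python
import os


def parse_proc_status(text):
    """Parse the 'Key:<tab>value' lines of a /proc/<pid>/status file."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def snapshot_processes(proc_root="/proc"):
    """Return sorted (pid, ppid, state, name) tuples for every visible process."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                fields = parse_proc_status(fh.read())
        except OSError:
            continue  # process exited between listdir() and open()
        rows.append((
            int(entry),
            int(fields.get("PPid", 0)),
            fields.get("State", "?"),
            fields.get("Name", "?"),
        ))
    return sorted(rows)
```

Printing `snapshot_processes()` inside a hung pod would show whether a dotnet process is still present and which process holds PID 1, which is the distinction this section is after.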
Versions
ADO Self-hosted Agent version 4.272.0
Environment type (Please select at least one environment where you face this issue)
Azure DevOps Server type
dev.azure.com (formerly visualstudio.com)
Azure DevOps Server Version (if applicable)
No response
Operating system
Ubuntu Noble (24.04) inside container
Version control system
git
Relevant log output