
[BUG]: Agent.Listener hangs after DELETE /sessions on one-time-use mode when worker FinishJob races the listener's renewal cadence (Linux, agent 4.272.0) #5563

@bbrandt

What happened?

Agent.Listener hangs after DELETE /sessions in --once mode when worker FinishJob races the listener's renewal cadence (Linux, 4.272.0)

Environment

Agent version: 4.272.0 (linux-x64, published 2026-04-27)
Mode: ./run.sh --once (one-time-use)
OS: Ubuntu Noble (24.04) inside container
Kubernetes: AKS 1.34.6 (control plane and node pools)
Container runtime: sysbox-runc with Docker-in-Docker (dockerd inside the agent container)
Scaling: KEDA ScaledJob (one job per pod, parallelism: 1)
Observed frequency: ~17% of pods (26 of 152 over a 48-hour window)

Symptom

After completing a job, Agent.Listener writes its normal shutdown trace to the diagnostic log ending in DELETE /sessions/{id} → HTTP 204, then stops
writing. The dotnet process does not exit; ./run.sh --once does not return; the container stays alive indefinitely until Kubernetes
activeDeadlineSeconds or manual kubectl delete reaps it.

The user-visible job result is Succeeded (the Worker correctly reports FinishJob to the server before this happens). The post-job hang affects pod
lifecycle only — it does not corrupt the job's recorded result.

In one captured incident the pod was alive for ≥8 hours 11 minutes after the last line in the agent's own diag log. During that entire 8-hour window,
Agent.Listener wrote zero bytes to its diag log. No further exception, warning, or thread-state output — just silence.

Reproduction (intermittent — ~1 in 6 jobs)

  1. Spin up a one-time-use agent pod (./run.sh --once).
  2. Run any non-trivial job. Both pipeline-step success and pipeline-step failure can trigger the post-job hang.
  3. Worker completes all user steps and JobRunner.FinalizeJob.
  4. Worker calls FinishJob to the server. Server marks the job complete. Worker exits with ExitCode: 100.
  5. Listener's next renewal (every 60s) hits /jobrequests/{id} and receives HTTP 400 BadRequest with TaskAgentJobTokenExpiredException because the
    job is already complete server-side.
  6. Listener treats this as Job: Abandoned, tears down its dispatcher, and ends its diag log normally with DELETE /sessions/{id} → 204.
  7. From that point Agent.Listener stops logging and the process does not exit.

Diagnostic log excerpt

Last 12 lines of an affected Agent_*.log. The file ends here, no further content for the next 8+ hours despite the pod remaining alive:

[09:05:53Z INFO Agent] [RunAsync] Beginning agent shutdown sequence - cleaning up resources
[09:05:53Z INFO Agent] [RunAsync] Shutting down job dispatcher - terminating active jobs
[09:05:53Z INFO JobDispatcher] [ShutdownAsync] JobDispatcher shutdown initiated [ActiveJobs:1, QueuedJobs:1]
[09:05:53Z INFO JobDispatcher] [ShutdownAsync] Waiting for worker completion and cancelling any running jobs ...
[09:05:53Z ERR  JobDispatcher] [EnsureDispatchFinished] Worker Dispatch failed with an exception ...
[09:05:53Z ERR  JobDispatcher] [EnsureDispatchFinished] System.IO.IOException: Broken pipe
   ... (full stack trace, caught — Worker had already exited normally) ...
[09:05:53Z INFO JobDispatcher] [EnsureDispatchFinished] Worker dispatcher cleanup completed [..., RemovedFromDictionary:True, Disposed:True, ActiveJobs:0]
[09:05:53Z INFO JobDispatcher] [ShutdownAsync] Worker process shutdown completed successfully ...
[09:05:53Z INFO JobDispatcher] [ShutdownAsync] JobDispatcher shutdown completed - all worker processes terminated
[09:05:53Z INFO Agent] [RunAsync] Cleaning up agent listener session - disconnecting from Azure DevOps
[09:05:53Z VERB VisualStudioServices] [OnEventWritten] Started DELETE request to https://dev.azure.com/{org}/_apis/distributedtask/pools/{poolId}/sessions/{sessionGuid}
[09:05:53Z VERB VisualStudioServices] [OnEventWritten] Finished DELETE request ... with status code 204

After this, container stdout produces only the wrapper run.sh message "Error reported in diagnostic logs" referencing this file, then total silence
(no agent output, no dockerd heartbeat, nothing) until the pod is reaped.
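The signature described above (diag log ends on the session DELETE 204 line, then goes quiet) can be checked mechanically. The following is a minimal sketch, not part of the agent: the function name `looks_hung`, the 300-second default threshold, and the exact shape of the final trace line are assumptions based only on the excerpt in this report, where the URL is elided as "...".

```python
import re

# Assumed shape of the last trace line, copied from the excerpt in this report.
# The real line presumably carries the request URL where the excerpt shows "...".
FINISHED_DELETE_204 = re.compile(
    r"\[OnEventWritten\] Finished DELETE request .* with status code 204\s*$"
)


def looks_hung(log_text: str, seconds_since_last_write: float,
               quiet_threshold_s: float = 300.0) -> bool:
    """Return True if the log ends on a DELETE 204 line and has been quiet too long.

    quiet_threshold_s is a hypothetical tuning knob, not an agent setting.
    """
    lines = [ln for ln in log_text.splitlines() if ln.strip()]
    if not lines:
        return False  # empty log: nothing to conclude
    ends_on_delete = FINISHED_DELETE_204.search(lines[-1]) is not None
    return ends_on_delete and seconds_since_last_write >= quiet_threshold_s
```

Inside a pod, `seconds_since_last_write` could be derived as `time.time() - os.path.getmtime(path)` for the newest Agent_*.log. The check only flags the pattern; it does not tell which of the hypothesized causes is responsible.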

Hypothesized cause (one or more)

  1. Background thread leak — a non-background Task or thread (renewal task, MessageListener polling loop, JobServerQueue workers) is still alive
    when RunAsync returns, blocking process exit. The orderly shutdown disposed the trace writers before the offending thread, so the hang itself is
    invisible to file-based logging.
  2. HttpClient / SslStream / VssConnection disposal blocked on a half-closed TLS connection to dev.azure.com after the DELETE.
  3. JobDispatcher or MessageListener disposal awaiting a CancellationToken-registered callback that doesn't fire.

The TaskAgentJobTokenExpiredException shutdown branch appears more prone to this than a "normal" shutdown — most jobs that take the same branch exit
cleanly, but ~17% don't.

What's needed

  • Code review of the Agent.cs:RunAsync cleanup path and the disposal chain it triggers, focused on threads/tasks that could outlive Main().
  • An option (env var or config) to force Environment.Exit(0) after DELETE /sessions completes on --once mode. By definition there is no further work
    for a one-time-use agent; a deliberate hard-exit would close off this entire class of disposal hang.
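Until such an option exists, the same effect could be approximated from outside the agent. The sketch below is a hypothetical downstream wrapper, not agent code and not something we currently run: it starts the listener command, watches the diag log for the session cleanup line followed by a DELETE 204 line (both strings taken from the excerpt above), and kills the whole process group after a grace period. The names `run_once_with_hard_exit`, `session_deleted`, `log_path`, `grace_s` and `poll_s` are illustrative assumptions, and the caller is assumed to know which Agent_*.log file to watch.

```python
import os
import signal
import subprocess
import time
from pathlib import Path


def session_deleted(log_text: str) -> bool:
    """True once the session cleanup line is followed by a DELETE 204 line."""
    cleanup_at = log_text.find("Cleaning up agent listener session")
    if cleanup_at < 0:
        return False
    # Only look after the cleanup line, so earlier DELETE requests are ignored.
    return any(
        "Finished DELETE request" in ln and "status code 204" in ln
        for ln in log_text[cleanup_at:].splitlines()
    )


def run_once_with_hard_exit(cmd, log_path, grace_s=30.0, poll_s=1.0) -> int:
    """Run cmd; if it outlives the session DELETE by grace_s, kill its process group."""
    # start_new_session=True gives the child its own process group (POSIX only),
    # so run.sh and the dotnet child can be killed together.
    proc = subprocess.Popen(cmd, start_new_session=True)
    deleted_at = None
    while True:
        code = proc.poll()
        if code is not None:
            return code  # the listener exited on its own
        if deleted_at is None:
            path = Path(log_path)
            if path.exists() and session_deleted(path.read_text(errors="replace")):
                deleted_at = time.monotonic()
        elif time.monotonic() - deleted_at >= grace_s:
            try:
                os.killpg(proc.pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the group exited between poll() and killpg()
            proc.wait()
            # Assumption: report success, since the job result was already
            # recorded server-side before the hang.
            return 0
        time.sleep(poll_s)
```

Returning 0 after a forced kill is a design choice for one-time-use pods, where the job result has already been reported; a wrapper that needs to surface the hang should return a distinct code instead. An in-agent option would still be preferable, because the wrapper depends on log wording that may change between agent versions.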

Workaround currently in place (downstream)

  • kill -9 on Agent.PluginHost log streamer at end of user steps, to resolve the separate AgentLogPlugin.WaitAsync hang at
    JobExtension.FinalizeJob:644 (worker-side, FinalizeJob hang, ~22-57 min). That workaround is verified working: the Worker catches Exit code 137 from
    the SIGKILLed log host, logs it, and proceeds to a clean ExitCode: 100. The Worker-side issue is fully mitigated. The disposal hang described here is
    downstream of that — it manifests in Agent.Listener after the Worker has already exited.
  • Kubernetes activeDeadlineSeconds on the agent ScaledJob, so a hung pod is force-reaped at a fixed ceiling.

Attachments

I can attach these once they are anonymized. Is there a way to share them with you privately instead?

  • Full Agent_*.log (~95 KB)
  • Full Worker_*.log (~459 KB)

Both show the complete sequence: pkill of log streamer → Exit code 137 caught → Worker ExitCode: 100 → Listener BadRequest renewal → Listener session
DELETE 204 → silence.

What's still unknown

We have not yet captured ps -ef / /proc/1/status from a live zombie pod, which would distinguish whether dotnet Agent.Listener is still alive in the
process tree (hypotheses 1-3 above) or whether the dotnet process has actually exited and something else inside the container (dockerd, a containerd
shim, signal handling in the entrypoint script) is keeping PID 1 alive. We'll attach that next time we catch a fresh zombie.
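For that capture, a minimal sketch of the snapshot we intend to take is below. It reads /proc directly, so it works in containers without ps. The function names are illustrative, and matching on the string "Agent.Listener" in the process name or command line is an assumption about how the listener appears in the process tree.

```python
from pathlib import Path


def parse_status(text: str) -> dict:
    """Parse the 'Key:<tab>value' lines of a /proc/<pid>/status file."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def snapshot(proc_root: str = "/proc") -> list:
    """Return one dict per process: pid, ppid, name, state, threads, cmdline."""
    rows = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/uptime
        try:
            status = parse_status((entry / "status").read_text())
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            continue  # the process exited while we were reading it
        rows.append({
            "pid": int(entry.name),
            "ppid": int(status.get("PPid", "0")),
            "name": status.get("Name", ""),
            "state": status.get("State", ""),
            "threads": int(status.get("Threads", "0")),
            # cmdline arguments are NUL-separated; empty for zombie processes
            "cmdline": raw.replace(b"\0", b" ").decode(errors="replace").strip(),
        })
    return sorted(rows, key=lambda r: r["pid"])


def listener_alive(rows) -> bool:
    """True if a non-zombie process looks like Agent.Listener (assumed match)."""
    return any(
        ("Agent.Listener" in r["name"] or "Agent.Listener" in r["cmdline"])
        and not r["state"].startswith("Z")
        for r in rows
    )
```

If `listener_alive(snapshot())` is true in a hung pod, the thread count of that row is the next thing to look at. If it is false, attention shifts to whatever else is still running under PID 1. A listener row in state Z would point to an unreaped child rather than a live hang.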

Versions

ADO Self-hosted Agent version 4.272.0

Environment type (Please select at least one environment where you face this issue)

  • Self-Hosted
  • Microsoft Hosted
  • VMSS Pool
  • Container

Azure DevOps Server type

dev.azure.com (formerly visualstudio.com)

Azure DevOps Server Version (if applicable)

No response

Operating system

Ubuntu Noble (24.04) inside container

Version control system

git

Relevant log output
