[BUG]: Agent.Listener hangs after DELETE /sessions on one-time-use mode when worker FinishJob races the listener's renewal cadence (Linux, agent 4.272.0) #5563

### What happened?

 # Agent.Listener hangs after `DELETE /sessions` in `--once` mode when worker FinishJob races the listener's renewal cadence (Linux, 4.272.0)

  ## Environment

  | Item | Value |
  |---|---|
  | Agent version | `4.272.0` (linux-x64, published 2026-04-27) |
  | Mode | `./run.sh --once` (one-time-use) |
  | OS | Ubuntu Noble (24.04) inside container |
  | Kubernetes | AKS 1.34.6 (control plane and node pools) |
  | Container runtime | sysbox-runc with Docker-in-Docker (dockerd inside the agent container) |
  | Scaling | KEDA `ScaledJob` (one job per pod, `parallelism: 1`) |
  | Observed frequency | ~17% of pods (26 of 152 over a 48-hour window) |

  ## Symptom

  After completing a job, `Agent.Listener` writes its normal shutdown trace to the diagnostic log ending in `DELETE /sessions/{id} → HTTP 204`, then
  **stops writing**. The dotnet process does not exit; `./run.sh --once` does not return; the container stays alive indefinitely until Kubernetes
  `activeDeadlineSeconds` or manual `kubectl delete` reaps it.

  The user-visible job result is `Succeeded` (the Worker correctly reports `FinishJob` to the server before this happens). The post-job hang affects pod
  lifecycle only — it does not corrupt the job's recorded result.

  In one captured incident the pod was alive for **≥8 hours 11 minutes after the last line in the agent's own diag log**. During that entire 8-hour window,
  Agent.Listener wrote zero bytes to its diag log. No further exception, warning, or thread-state output — just silence.

  ## Reproduction (intermittent — ~1 in 6 jobs)

  1. Spin up a one-time-use agent pod (`./run.sh --once`).
  2. Run any non-trivial job. Both pipeline-step success and pipeline-step failure can trigger the post-job hang.
  3. Worker completes all user steps and `JobRunner.FinalizeJob`.
  4. Worker calls `FinishJob` to the server. Server marks the job complete. Worker exits with `ExitCode: 100`.
  5. Listener's next renewal (every 60s) hits `/jobrequests/{id}` and receives `HTTP 400 BadRequest` with `TaskAgentJobTokenExpiredException` because the
  job is already complete server-side.
  6. Listener treats this as `Job: Abandoned`, tears down its dispatcher, and ends its diag log normally with `DELETE /sessions/{id} → 204`.
  7. **From that point Agent.Listener stops logging and the process does not exit.**

  ## Diagnostic log excerpt

  Last 12 lines of an affected `Agent_*.log`. **The file ends here.** Nothing further is written for the next 8+ hours, even though the pod stays alive:

  ```
  [09:05:53Z INFO Agent] [RunAsync] Beginning agent shutdown sequence - cleaning up resources
  [09:05:53Z INFO Agent] [RunAsync] Shutting down job dispatcher - terminating active jobs
  [09:05:53Z INFO JobDispatcher] [ShutdownAsync] JobDispatcher shutdown initiated [ActiveJobs:1, QueuedJobs:1]
  [09:05:53Z INFO JobDispatcher] [ShutdownAsync] Waiting for worker completion and cancelling any running jobs ...
  [09:05:53Z ERR  JobDispatcher] [EnsureDispatchFinished] Worker Dispatch failed with an exception ...
  [09:05:53Z ERR  JobDispatcher] [EnsureDispatchFinished] System.IO.IOException: Broken pipe
     ... (full stack trace, caught — Worker had already exited normally) ...
  [09:05:53Z INFO JobDispatcher] [EnsureDispatchFinished] Worker dispatcher cleanup completed [..., RemovedFromDictionary:True, Disposed:True, ActiveJobs:0]
  [09:05:53Z INFO JobDispatcher] [ShutdownAsync] Worker process shutdown completed successfully ...
  [09:05:53Z INFO JobDispatcher] [ShutdownAsync] JobDispatcher shutdown completed - all worker processes terminated
  [09:05:53Z INFO Agent] [RunAsync] Cleaning up agent listener session - disconnecting from Azure DevOps
  [09:05:53Z VERB VisualStudioServices] [OnEventWritten] Started DELETE request to https://dev.azure.com/{org}/_apis/distributedtask/pools/{poolId}/sessions/{sessionGuid}
  [09:05:53Z VERB VisualStudioServices] [OnEventWritten] Finished DELETE request ... with status code 204
  ```

  After this, container stdout produces only the wrapper `run.sh` message `"Error reported in diagnostic logs"` referencing this file, then total silence
  (no agent output, no `dockerd` heartbeat, nothing) until the pod is reaped.

  ## Hypothesized cause (one or more)

  1. **Background thread leak**: a foreground (non-background) thread, or a `Task` that shutdown still blocks on (renewal task, `MessageListener` polling
  loop, `JobServerQueue` workers), is still alive when `RunAsync` returns, blocking process exit. The orderly shutdown disposed the trace writers *before*
  the offending thread finished, so the hang itself is invisible to file-based logging.
  2. **`HttpClient` / `SslStream` / `VssConnection` disposal blocked** on a half-closed TLS connection to `dev.azure.com` after the `DELETE`.
  3. **`JobDispatcher` or `MessageListener` disposal** awaiting a `CancellationToken`-registered callback that doesn't fire.

  The `TaskAgentJobTokenExpiredException` shutdown branch appears more prone to this than a "normal" shutdown — most jobs that take the same branch exit
  cleanly, but ~17% don't.

  ## What's needed

  - Code review of the `Agent.cs:RunAsync` cleanup path and the disposal chain it triggers, focused on threads/tasks that could outlive `Main()`.
  - An option (env var or config) to force `Environment.Exit(0)` after `DELETE /sessions` completes in `--once` mode. By definition there is no further work
   for a one-time-use agent, so a deliberate hard exit would close off this entire class of disposal hang.
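
Until such an option exists, the same effect could probably be approximated downstream by a wrapper around `./run.sh --once`. The sketch below is illustrative only and is not something we run today. It watches the diag log for the session `DELETE` finishing with 204 and, if the agent has not exited after a grace period, kills the whole process group. The function names, the log glob, and the grace period are assumptions made for this example. The marker strings are copied from the log excerpt above. It also assumes a fresh one-time-use pod, where no earlier session `DELETE` appears in the same log.

```python
import glob
import os
import re
import signal
import subprocess
import time

# Marker copied from the diag log excerpt in this report.
FINISHED_204 = re.compile(r"Finished DELETE request .* with status code 204")


def session_delete_finished(lines):
    """Return True once a DELETE on a /sessions/ URL has started and then finished with 204."""
    pending_delete = False
    saw_session_delete = False
    for line in lines:
        if "Started DELETE request" in line:
            pending_delete = True
        # The URL may sit on the same line or on the following one.
        if pending_delete and "/sessions/" in line:
            saw_session_delete = True
        if saw_session_delete and FINISHED_204.search(line):
            return True
    return False


def run_with_exit_watchdog(cmd, log_glob, grace_seconds=60.0, poll_seconds=1.0):
    """Run cmd; if the session DELETE is logged but cmd is still alive after
    grace_seconds, SIGKILL its process group. Returns (returncode, killed)."""
    # start_new_session puts cmd in its own process group (pgid == pid),
    # so child processes that keep that group are killed along with it.
    proc = subprocess.Popen(cmd, start_new_session=True)
    deadline = None
    while proc.poll() is None:
        if deadline is None:
            for path in glob.glob(log_glob):
                with open(path, errors="replace") as fh:
                    if session_delete_finished(fh):
                        deadline = time.monotonic() + grace_seconds
                        break
        elif time.monotonic() >= deadline:
            try:
                os.killpg(proc.pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the process exited on its own just before the kill
            proc.wait()
            return proc.returncode, True
        time.sleep(poll_seconds)
    return proc.returncode, False
```

A call might look like `run_with_exit_watchdog(["./run.sh", "--once"], "<agent root>/_diag/Agent_*.log")`, with the path adjusted to wherever the diag logs are written in the image. This only hides the hang from the pod lifecycle; it does not replace a fix in the listener's disposal path.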

  ## Workaround currently in place (downstream)

  - `kill -9` on `Agent.PluginHost` log streamer at end of user steps, to resolve the *separate* `AgentLogPlugin.WaitAsync` hang at
  `JobExtension.FinalizeJob:644` (worker-side, FinalizeJob hang, ~22-57 min). That workaround is verified working: the Worker catches `Exit code 137` from
  the SIGKILLed log host, logs it, and proceeds to a clean `ExitCode: 100`. The Worker-side issue is fully mitigated. The disposal hang described here is
  downstream of that — it manifests in `Agent.Listener` after the Worker has already exited.
  - Kubernetes `activeDeadlineSeconds` on the agent ScaledJob, so a hung pod is force-reaped at a fixed ceiling.

  ## Attachments
I can attach the following once they are anonymized. Is there a way to share them with you privately instead?
  - Full `Agent_*.log` (~95 KB)
  - Full `Worker_*.log` (~459 KB)

  Both show the complete sequence: pkill of log streamer → `Exit code 137` caught → Worker `ExitCode: 100` → Listener BadRequest renewal → Listener session
  DELETE 204 → silence.

  ## What's still unknown

  We have not yet captured `ps -ef` / `/proc/1/status` from a live zombie pod. That would show whether `dotnet Agent.Listener` is still alive in the
  process tree (hypotheses 1-3 above) or whether the dotnet process has actually exited and something else inside the container (`dockerd`, a containerd
  shim, signal handling in the entrypoint script) is keeping PID 1 alive. We'll attach that next time we catch a fresh zombie.
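
For that capture, something like the following could stand in for `ps -ef` if the image lacks `ps`. It is a rough sketch that reads `/proc` directly and lists each process with its parent PID, state, thread count, and command line, plus the thread names of any process whose command line contains a filter string. The helper names and the `Agent.Listener` filter are choices made for this example. It assumes a Linux `/proc` and a Python 3 interpreter inside the container.

```python
import os


def read_status(pid):
    """Parse /proc/<pid>/status into a dict; return None if it cannot be read."""
    try:
        with open(f"/proc/{pid}/status") as fh:
            return {
                key: value.strip()
                for key, _, value in (line.partition(":") for line in fh)
            }
    except OSError:
        return None


def thread_names(pid):
    """Return the comm name of every thread of pid (empty list if it vanished)."""
    names = []
    try:
        for tid in os.listdir(f"/proc/{pid}/task"):
            with open(f"/proc/{pid}/task/{tid}/comm") as fh:
                names.append(fh.read().strip())
    except OSError:
        pass
    return names


def snapshot():
    """Return one row per visible process, sorted by PID."""
    rows = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        status = read_status(entry)
        if status is None:
            continue  # process exited while we were scanning
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as fh:
                cmd = fh.read().replace(b"\0", b" ").decode(errors="replace").strip()
        except OSError:
            cmd = ""
        rows.append({
            "pid": int(entry),
            "ppid": int(status.get("PPid", "0")),
            "state": status.get("State", "?"),
            "threads": int(status.get("Threads", "0")),
            # Kernel threads and zombies have an empty cmdline; fall back to Name.
            "cmd": cmd or status.get("Name", ""),
        })
    return sorted(rows, key=lambda row: row["pid"])


if __name__ == "__main__":
    for row in snapshot():
        print(f'{row["pid"]:>7} {row["ppid"]:>7} {row["state"]:<14} '
              f'{row["threads"]:>4} {row["cmd"]}')
        if "Agent.Listener" in row["cmd"]:
            print("        threads:", ", ".join(thread_names(row["pid"])))
```

If `Agent.Listener` shows up with live threads, that points toward hypotheses 1-3. If it is absent or in state `Z`, the thing keeping PID 1 alive is more likely elsewhere in the container.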


### Versions

ADO Self-hosted Agent version 4.272.0

### Environment type (Please select at least one environment where you face this issue)

- [x] Self-Hosted
- [ ] Microsoft Hosted
- [ ] VMSS Pool
- [x] Container

### Azure DevOps Server type

dev.azure.com (formerly visualstudio.com)

### Azure DevOps Server Version (if applicable)

_No response_

### Operating system

Ubuntu Noble (24.04) inside container

### Version control system

git

### Relevant log output

```shell

```
