feat: add per-job stop capability to serverless worker #510
Promptless prepared a documentation update related to this change. Triggered by PR #510. Added documentation for the per-job stop capability to the main docs site. The update explains that workers handling multiple jobs concurrently can now stop individual jobs without affecting siblings, and includes guidance on catching
Review: Document per-job stop capability for concurrent workers
Pull request overview
Adds per-job cancellation support to the serverless worker so a single in-flight job can be stopped without terminating the whole worker (and other concurrent jobs). This introduces a new stop-signal long-poll channel and tracks running job tasks by job id so they can be cancelled individually.
Changes:
- Track in-progress jobs as `asyncio.Task`s keyed by job id and add `stop_job()` + `monitor_stop_signals()` to cancel individual jobs.
- Add a new `get_stop_signals()` job module helper that long-polls a derived `job-stop` endpoint.
- Add tests covering per-job stop behavior and stop-signal polling, plus documentation describing cancellation behavior for handlers.
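The task-tracking idea in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `JobRegistry` class, its `start()` method, and the `demo()` driver are hypothetical names, and only `stop_job()` echoes a name from the change list.

```python
import asyncio


class JobRegistry:
    """Hypothetical stand-in for the worker's per-job task tracking."""

    def __init__(self):
        self.tasks = {}  # job id -> asyncio.Task

    def start(self, job_id, coro):
        # Wrap the job in a task and remember it under its job id.
        task = asyncio.create_task(coro)
        self.tasks[job_id] = task
        # Forget the task once it finishes, is cancelled, or fails.
        task.add_done_callback(lambda _t: self.tasks.pop(job_id, None))
        return task

    def stop_job(self, job_id):
        # Cancel only the named job; sibling tasks are left alone.
        task = self.tasks.get(job_id)
        if task is None or task.done():
            return False
        task.cancel()
        return True


async def demo():
    registry = JobRegistry()
    a = registry.start("job-a", asyncio.sleep(0.05, result="a done"))
    b = registry.start("job-b", asyncio.sleep(10, result="b done"))
    await asyncio.sleep(0)  # let both tasks start running
    stopped = registry.stop_job("job-b")
    results = await asyncio.gather(a, b, return_exceptions=True)
    return stopped, results
```

Running `asyncio.run(demo())` cancels only `job-b`; `job-a` still completes normally, which is the sibling-isolation property the PR describes.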
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `runpod/serverless/modules/rp_scale.py` | Track per-job tasks, long-poll stop signals, and cancel individual jobs. |
| `runpod/serverless/modules/rp_job.py` | Add stop-channel URL derivation and `get_stop_signals()` long-poll helper. |
| `tests/test_serverless/test_rp_scale.py` | Add integration-style async tests for stopping jobs and stop-signal monitoring. |
| `tests/test_serverless/test_modules/test_job.py` | Add unit tests for stop URL derivation and stop-signal parsing/behavior. |
| `docs/serverless/worker.md` | Document “Stopping Individual Jobs” and handler cleanup via `asyncio.CancelledError`. |
deanq left a comment
Automated review pass (per-job stop capability). 9 inline comments below — 2 critical (silent error logging on the stop loop; no backpressure on the poll's success path), 4 important (silent swallows in get_stop_signals, silent feature-disable on unset JOB_GET_URL, except-clause ordering, docs overpromising long-poll), and 3 suggestions/nits. Nothing blocking the design, which is sound.
A serverless worker that takes more than one job concurrently had no way to stop processing an individual request once it started. The only available lever was killing the entire worker, which also terminates the other healthy in-progress jobs on that worker. This is the root cause behind cancelled requests continuing to run and incur charges when a worker is handling several jobs at once.
This change gives the worker a notion of stopping a single request. The worker now tracks each in-progress job by id and can cancel just that job's task, leaving its siblings untouched. The stop signal arrives via a new job-stop long-polling channel, similar to the job-take long-polling endpoint.
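As a rough sketch of how a stop-signal loop of this kind could be structured: the signature below, the injected `poll`, `stop_job`, and `is_alive` callables, and the `idle_delay` parameter are all assumptions made for illustration and are not taken from the PR's `monitor_stop_signals()`. The poll is faked in `demo()` so the example runs without a network.

```python
import asyncio


async def monitor_stop_signals(poll, stop_job, is_alive, idle_delay=0.01):
    """Poll for job ids to stop and cancel each one (illustrative only).

    poll:     hypothetical async callable returning a list of job ids
    stop_job: callable that cancels one job by id
    is_alive: callable that says whether the loop should keep running
    """
    while is_alive():
        try:
            job_ids = await poll()
        except asyncio.CancelledError:
            raise  # let worker shutdown propagate
        except Exception as err:
            # Report the failure rather than swallowing it silently.
            print(f"stop-signal poll failed: {err!r}")
            job_ids = []
        for job_id in job_ids:
            stop_job(job_id)
        if not job_ids:
            # Pause briefly after an empty or failed poll to avoid a hot loop.
            await asyncio.sleep(idle_delay)


async def demo():
    # Scripted poll results: one signal, one failure, then two signals.
    responses = [["job-1"], RuntimeError("network"), ["job-2", "job-3"]]
    stopped = []

    async def fake_poll():
        item = responses.pop(0)
        if isinstance(item, Exception):
            raise item
        return item

    await monitor_stop_signals(fake_poll, stopped.append, lambda: bool(responses))
    return stopped
```

Passing the poll and cancel functions in as arguments keeps the loop testable in isolation; the real worker may wire these together differently.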
Handlers need no changes; async handlers holding resources can clean up by catching `asyncio.CancelledError`.
Relies on https://github.com/runpod/ai-api/pull/881
Closes SLS-41.
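For the handler cleanup mentioned in the description, a minimal sketch might look like the following. The resource name, the `released` list, and the `demo()` driver are hypothetical, and the handler is cancelled directly here rather than through the worker's stop channel.

```python
import asyncio

released = []  # records which resources were cleaned up (for the demo only)


async def handler(job):
    # Stand-in for something the job holds: a GPU buffer, temp file, etc.
    resource = f"resource-for-{job['id']}"
    try:
        await asyncio.sleep(10)  # stand-in for long-running work
        return {"output": "finished"}
    except asyncio.CancelledError:
        released.append(resource)  # release whatever the job was holding
        raise  # re-raise so the task is still marked as cancelled


async def demo():
    task = asyncio.create_task(handler({"id": "job-1"}))
    await asyncio.sleep(0)  # let the handler reach its first await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return task.cancelled(), released
```

Re-raising after cleanup matters: if the handler swallows `asyncio.CancelledError`, the task finishes normally and the caller cannot tell that the job was stopped.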