feat: add per-job stop capability to serverless worker #510

Open
KAJdev wants to merge 6 commits into main from zeke/sls-41-add-stop-job-capability-to-runpod-python-sdk

@KAJdev KAJdev commented Jun 5, 2026

Contributor

A serverless worker that handles more than one job concurrently had no way to stop an individual request once it had started. The only available lever was killing the entire worker, which also terminates the other healthy in-progress jobs on that worker. This is the root cause of cancelled requests continuing to run and incur charges when a worker is handling several jobs at once.

This PR gives the worker a notion of stopping a single request. The worker now tracks each in-progress job by id and can cancel just that job's task, leaving its siblings untouched. The stop signal arrives via a new job-stop long-polling channel, similar to the job-take long-polling endpoint.
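To illustrate the idea of tracking jobs by id and cancelling only one of them, here is a minimal sketch. It is not the code from this PR: `JobTracker`, `start`, and `fake_job` are hypothetical names, and the only assumption carried over from the description is that each in-progress job runs as its own `asyncio` task keyed by job id.

```python
import asyncio


class JobTracker:
    """Hypothetical helper: maps job ids to their running asyncio tasks."""

    def __init__(self):
        self._tasks = {}

    def start(self, job_id, coro):
        # Must be called from inside a running event loop.
        task = asyncio.create_task(coro)
        self._tasks[job_id] = task
        # Forget the task once it finishes, fails, or is cancelled.
        task.add_done_callback(lambda _t, jid=job_id: self._tasks.pop(jid, None))
        return task

    def stop_job(self, job_id):
        """Cancel one job's task; return False if it is unknown or already done."""
        task = self._tasks.get(job_id)
        if task is None or task.done():
            return False
        task.cancel()
        return True


async def fake_job(job_id, seconds):
    # Stand-in for a real handler invocation.
    await asyncio.sleep(seconds)
    return f"{job_id} finished"


async def demo():
    tracker = JobTracker()
    a = tracker.start("job-a", fake_job("job-a", 0.05))
    b = tracker.start("job-b", fake_job("job-b", 0.05))
    await asyncio.sleep(0)  # let both jobs start running
    stopped = tracker.stop_job("job-a")
    # return_exceptions=True lets us inspect the cancelled job next to its sibling.
    results = await asyncio.gather(a, b, return_exceptions=True)
    return stopped, results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch, cancelling `job-a` leaves `job-b` free to run to completion, which is the behavior the description refers to as leaving siblings untouched.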

Handlers need no changes; async handlers holding resources can clean up by catching asyncio.CancelledError.
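As a hedged example of the cleanup pattern mentioned above, an async handler might look like the following. The job dict shape, the `released` list, and the string standing in for a resource are illustrative assumptions, not part of this PR.

```python
import asyncio

released = []  # illustrative record of which resources were cleaned up


async def handler(job):
    # Stand-in for something that needs explicit release (a file, a buffer, a lock).
    resource = f"resource-for-{job['id']}"
    try:
        await asyncio.sleep(10)  # simulated long-running work
        return {"output": "done"}
    except asyncio.CancelledError:
        released.append(resource)  # release whatever the handler was holding
        raise  # re-raise so the task is actually recorded as cancelled


async def demo():
    task = asyncio.create_task(handler({"id": "job-1"}))
    await asyncio.sleep(0)  # let the handler reach its first await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return task.cancelled()


if __name__ == "__main__":
    print(asyncio.run(demo()), released)
```

Re-raising after cleanup matters: if the handler swallows `asyncio.CancelledError`, the task finishes normally and the caller cannot tell that the job was stopped.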

Relies on https://github.com/runpod/ai-api/pull/881

Closes SLS-41.

promptless Bot commented Jun 5, 2026

Promptless prepared a documentation update related to this change.

Triggered by PR #510

Added documentation for the per-job stop capability to the main docs site. The update explains that workers handling multiple jobs concurrently can now stop individual jobs without affecting siblings, and includes guidance on catching asyncio.CancelledError for resource cleanup in async handlers.

Review: Document per-job stop capability for concurrent workers

Copilot AI left a comment

Pull request overview

Adds per-job cancellation support to the serverless worker so a single in-flight job can be stopped without terminating the whole worker (and other concurrent jobs). This introduces a new stop-signal long-poll channel and tracks running job tasks by job id so they can be cancelled individually.

Changes:

  • Track in-progress jobs as asyncio.Tasks keyed by job id and add stop_job() + monitor_stop_signals() to cancel individual jobs.
  • Add a new get_stop_signals() job module helper that long-polls a derived job-stop endpoint.
  • Add tests covering per-job stop behavior and stop-signal polling, plus documentation describing cancellation behavior for handlers.
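The changes listed above can be pictured with a rough sketch of a stop-signal monitor loop. This is an assumption-laden illustration, not the implementation in `rp_scale.py`: the signature of `monitor_stop_signals` here, the `fetch_signals` callable (standing in for a long-poll helper that returns a list of job ids), the `alive` callback, and `idle_delay` are all hypothetical.

```python
import asyncio
import logging

log = logging.getLogger("stop-monitor-sketch")


async def monitor_stop_signals(fetch_signals, tasks, alive, idle_delay=0.01):
    """Poll for stop signals and cancel matching tasks while alive() is True."""
    while alive():
        try:
            job_ids = await fetch_signals()
        except asyncio.CancelledError:
            raise  # never swallow cancellation of the monitor itself
        except Exception:
            log.exception("stop-signal poll failed")  # log, then keep polling
            job_ids = []
        for job_id in job_ids:
            task = tasks.get(job_id)
            if task is not None and not task.done():
                task.cancel()
        # Small pause on every iteration so a fast-returning poll cannot spin.
        await asyncio.sleep(idle_delay)


async def demo():
    tasks = {
        "job-a": asyncio.create_task(asyncio.sleep(5)),
        "job-b": asyncio.create_task(asyncio.sleep(0.05)),
    }
    responses = [["job-a"], []]  # first poll asks to stop job-a, then nothing

    async def fake_fetch():
        return responses.pop(0) if responses else []

    polls_left = [3]

    def alive():
        polls_left[0] -= 1
        return polls_left[0] >= 0

    await monitor_stop_signals(fake_fetch, tasks, alive)
    results = await asyncio.gather(*tasks.values(), return_exceptions=True)
    return [type(r).__name__ if isinstance(r, BaseException) else r for r in results]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The sketch logs poll failures instead of dropping them and sleeps briefly on every pass. Both are design choices for this illustration and may differ from what the PR does.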

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| runpod/serverless/modules/rp_scale.py | Track per-job tasks, long-poll stop signals, and cancel individual jobs. |
| runpod/serverless/modules/rp_job.py | Add stop-channel URL derivation and get_stop_signals() long-poll helper. |
| tests/test_serverless/test_rp_scale.py | Add integration-style async tests for stopping jobs and stop-signal monitoring. |
| tests/test_serverless/test_modules/test_job.py | Add unit tests for stop URL derivation and stop-signal parsing/behavior. |
| docs/serverless/worker.md | Document “Stopping Individual Jobs” and handler cleanup via asyncio.CancelledError. |

Comment thread runpod/serverless/modules/rp_job.py Outdated
Comment thread tests/test_serverless/test_modules/test_job.py
Comment thread runpod/serverless/modules/rp_scale.py

@deanq deanq left a comment

Member

Automated review pass (per-job stop capability). 9 inline comments below: 2 critical (silent error logging on the stop loop; no backpressure on the poll's success path), 4 important (silent swallows in get_stop_signals, silent feature-disable on unset JOB_GET_URL, except-clause ordering, docs overpromising long-poll), and 3 suggestions/nits. None of them block the design, which is sound.

Comment thread runpod/serverless/modules/rp_scale.py Outdated
Comment thread runpod/serverless/modules/rp_scale.py Outdated
Comment thread runpod/serverless/modules/rp_job.py
Comment thread runpod/serverless/modules/rp_job.py
Comment thread runpod/serverless/modules/rp_scale.py
Comment thread docs/serverless/worker.md Outdated
Comment thread runpod/serverless/modules/rp_scale.py
Comment thread runpod/serverless/modules/rp_scale.py Outdated
Comment thread tests/test_serverless/test_rp_scale.py

@capy-ai capy-ai Bot left a comment

Added 3 comments

Comment thread runpod/serverless/modules/rp_job.py Outdated
Comment thread runpod/serverless/modules/rp_scale.py Outdated
Comment thread runpod/serverless/modules/rp_scale.py
Comment thread tests/test_serverless/test_rp_scale.py Dismissed
@KAJdev KAJdev requested a review from deanq June 22, 2026 20:00

4 participants