stfsender: fix segfault on termination (R3C-1147) #193
Open
ktf wants to merge 1 commit into
At the end of every run all TfBuilders disconnect from each StfSender. Each UCX disconnect (StfSenderOutputUCX::disconnectTfBuilder) spawned a detached thread that kept progressing a UCX worker and touching object state, but stop() never waited for it: it went straight to ucp_worker_destroy()/ucp_cleanup(), and the object was then destroyed. The still-running detached threads then used freed UCX workers or the destroyed object, causing a SIGSEGV (core dumped) and a burst of errors as connections dropped.

- StfSenderOutputUCX: track the async endpoint-close threads instead of detaching them, and join them in stop() while the workers/context are still valid; add a destructor as a safety net in case stop() is skipped.
- StfSenderDevice::ResetTask: stop the gRPC RPC server before the output handler, so that no late connect/disconnect/data request can reference the output handler mid-teardown (keeps the unconditional output stop from ce899b9 for flp-only runs).

Ref: https://its.cern.ch/jira/browse/R3C-1147
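The track-and-join pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FakeWorker`, `OutputSketch`, `disconnectAsync()` and the member names are hypothetical stand-ins, and no real UCX calls are made. The `alive` flag stands in for a worker that has not yet been passed to ucp_worker_destroy().

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a UCX worker; not the real ucp_worker_h.
struct FakeWorker {
  std::atomic<bool> alive{true};
  std::atomic<int> progressCalls{0};
  void progress() { progressCalls.fetch_add(1); }
};

class OutputSketch {
 public:
  // Safety net: join outstanding close threads even if stop() was skipped.
  ~OutputSketch() { stop(); }

  // Start an asynchronous endpoint close. The thread is stored, not detached,
  // so that stop() can wait for it.
  void disconnectAsync() {
    std::lock_guard<std::mutex> lock(mThreadsLock);
    if (mStopped) {
      return;  // teardown already started; do not spawn new work
    }
    mCloseThreads.emplace_back([this] {
      for (int i = 0; i < 5; ++i) {
        // The worker must still be valid while this thread progresses it.
        assert(mWorker.alive.load());
        mWorker.progress();
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
      }
      mClosed.fetch_add(1);
    });
  }

  // Idempotent: join every close thread first, release the worker last.
  void stop() {
    std::vector<std::thread> threads;
    {
      std::lock_guard<std::mutex> lock(mThreadsLock);
      if (mStopped) {
        return;
      }
      mStopped = true;
      threads.swap(mCloseThreads);
    }
    // Join outside the lock so a close thread is never blocked on it.
    for (auto& t : threads) {
      if (t.joinable()) {
        t.join();
      }
    }
    // Only now is it safe to release the worker (stand-in for destroying it).
    mWorker.alive = false;
  }

  int closedCount() const { return mClosed.load(); }
  bool workerAlive() const { return mWorker.alive.load(); }

 private:
  FakeWorker mWorker;
  std::mutex mThreadsLock;
  std::vector<std::thread> mCloseThreads;
  std::atomic<int> mClosed{0};
  bool mStopped = false;
};
```

With detached threads the final step in stop() could run while a close thread was still calling progress(), which is the use-after-free described above. Joining first makes the ordering explicit, and the early return on `mStopped` keeps the destructor harmless when stop() has already run.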