traffic_crashlog: fix false-positive crash logs on clean shutdown#13296
Merged
Conversation
The crash log helper is parked with SIGSTOP and armed with PR_SET_PDEATHSIG=SIGCONT, so the kernel resumes it whenever traffic_server exits, not only on a crash. It used getppid() to tell a crash from a plain exit, but that is racy: PR_SET_PDEATHSIG fires on the forking thread's death, which can occur while the parent pid is still alive, so a normal SIGTERM shutdown could emit an empty crash log. Replace the getppid() guess with a socket handshake: a real crash writes the signal number to the helper, while any other exit closes the socket so the helper reads EOF and exits quietly. FD_CLOEXEC keeps that EOF reliable against later fork+exec children.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR prevents traffic_crashlog from emitting false-positive (empty) crash logs during a normal traffic_server shutdown by replacing a racy getppid()-based heuristic with an explicit socket handshake between traffic_server and the crashlog helper.
Changes:
- Add an explicit “real crash” notification:
traffic_serverwrites the crashing signal number to the helper over the existing socket before resuming it. - Make the helper treat EOF/short read on that socket as “no crash” and exit quietly, avoiding bogus logs on clean shutdown.
- Add a gold test to assert SIGTERM shutdown produces no crashlog output.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
tests/gold_tests/shutdown/crashlog_no_false_positive.test.py |
New regression test ensuring clean SIGTERM shutdown does not trigger crashlog output. |
src/traffic_server/Crash.cc |
Sets FD_CLOEXEC on the parent socket end and writes the signal number to the helper on real crashes. |
src/traffic_crashlog/traffic_crashlog.cc |
Removes getppid() heuristic and blocks on a socket read to distinguish crash vs. parent exit (EOF). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
On a normal
traffic_servershutdown (e.g. SIGTERM),traffic_crashlogcanspuriously emit an empty crash log against the already-exiting process.
Background
At startup,
traffic_serverforkstraffic_crashlog --wait, which parksitself:
The helper is therefore resumed by
SIGCONTin two indistinguishablesituations:
crash_logger_invoke()explicitly sendsSIGCONT.(
PR_SET_PDEATHSIG, set toSIGCONTin traffic_crashlog: use SIGCONT for parent death #11614) resumes it.To tell these apart the helper used
getppid() != parent. That check isracy:
PR_SET_PDEATHSIGfires when the thread that forked the helperterminates, which during a graceful shutdown happens while the rest of
traffic_server— and thus its PID — is still alive. Sogetppid()stillequals the original parent, the guard doesn't trip, and the helper proceeds
to write a crash log for a process that merely exited.
Fix
Replace the
getppid()guess with an explicit handshake over the socketthat already connects the two processes:
crash_logger_invoke()writes the crashing signal number to the helperbefore the existing (Linux-only) thread payload. This is the only path
that writes it.
read():quietly.
FD_CLOEXECis set on the parent's end of the socket so the write end issolely owned by
traffic_server; that guarantees the helper sees EOF onparent death rather than blocking on an end inherited by a
popen()/system()child.SIGCONT(rather thanSIGKILL) is retained deliberately: it lets thehelper survive a spurious/early wake,
read()for the answer, and stay readyfor a later real crash — whereas
SIGKILLwould prematurely kill the helperon the same forking-thread race and silently drop crash logging (and was
already moved away from in #11614).
Testing
tests/gold_tests/shutdown/crashlog_no_false_positive.test.py: arms thehelper, lets AuTest SIGTERM
traffic_server, and asserts thegraceful-shutdown path ran (
received exit signal, shutting down) while nocrash log was produced (
crashlog started/wrote crash logexcluded).crash_test(real crash → crash log produced) still passes on theparts this code governs.
Notes
PR_SET_PDEATHSIGis Linux-only, so the false positive itself only occurs onLinux; on macOS/FreeBSD the helper is simply never woken on a clean exit. The
regression test still confirms the "armed but quiet" behavior there.