Skip to content

traffic_crashlog: fix false-positive crash logs on clean shutdown#13296

Merged
masaori335 merged 2 commits into
apache:masterfrom
masaori335:asf-master-crash-log
Jun 18, 2026
Merged

traffic_crashlog: fix false-positive crash logs on clean shutdown#13296
masaori335 merged 2 commits into
apache:masterfrom
masaori335:asf-master-crash-log

Conversation

@masaori335

Copy link
Copy Markdown
Contributor

Problem

On a normal traffic_server shutdown (e.g. SIGTERM), traffic_crashlog can
spuriously emit an empty crash log against the already-exiting process.

Background

At startup, traffic_server forks traffic_crashlog --wait, which parks
itself:

EnableDeathSignal(SIGCONT);   // prctl(PR_SET_PDEATHSIG, SIGCONT)
kill(getpid(), SIGSTOP);

The helper is therefore resumed by SIGCONT in two indistinguishable
situations:

  1. A real crash — the crash signal handler crash_logger_invoke() explicitly sends SIGCONT.
  2. Any other parent exit — the kernel's parent-death signal
    (PR_SET_PDEATHSIG, set to SIGCONT in traffic_crashlog: use SIGCONT for parent death #11614) resumes it.

To tell these apart the helper used getppid() != parent. That check is
racy: PR_SET_PDEATHSIG fires when the thread that forked the helper
terminates, which during a graceful shutdown happens while the rest of
traffic_server — and thus its PID — is still alive. So getppid() still
equals the original parent, the guard doesn't trip, and the helper proceeds
to write a crash log for a process that merely exited.

Fix

Replace the getppid() guess with an explicit handshake over the socket
that already connects the two processes:

  • crash_logger_invoke() writes the crashing signal number to the helper
    before the existing (Linux-only) thread payload. This is the only path
    that writes it.
  • The helper, after being resumed, does a blocking read():
    • data ⇒ a real crash ⇒ proceed and write the log;
    • EOF / short read ⇒ the parent is gone with no notification ⇒ exit
      quietly.
  • FD_CLOEXEC is set on the parent's end of the socket so the write end is
    solely owned by traffic_server; that guarantees the helper sees EOF on
    parent death rather than blocking on an end inherited by a
    popen()/system() child.

SIGCONT (rather than SIGKILL) is retained deliberately: it lets the
helper survive a spurious/early wake, read() for the answer, and stay ready
for a later real crash — whereas SIGKILL would prematurely kill the helper
on the same forking-thread race and silently drop crash logging (and was
already moved away from in #11614).

Testing

  • New regression test
    tests/gold_tests/shutdown/crashlog_no_false_positive.test.py: arms the
    helper, lets AuTest SIGTERM traffic_server, and asserts the
    graceful-shutdown path ran (received exit signal, shutting down) while no
    crash log was produced (crashlog started / wrote crash log excluded).
  • Existing crash_test (real crash → crash log produced) still passes on the
    parts this code governs.

Notes

PR_SET_PDEATHSIG is Linux-only, so the false positive itself only occurs on
Linux; on macOS/FreeBSD the helper is simply never woken on a clean exit. The
regression test still confirms the "armed but quiet" behavior there.

The crash log helper is parked with SIGSTOP and armed with
PR_SET_PDEATHSIG=SIGCONT, so the kernel resumes it whenever
traffic_server exits, not only on a crash. It used getppid() to tell a
crash from a plain exit, but that is racy: PR_SET_PDEATHSIG fires on the
forking thread's death, which can occur while the parent pid is still
alive, so a normal SIGTERM shutdown could emit an empty crash log.

Replace the getppid() guess with a socket handshake: a real crash writes
the signal number to the helper, while any other exit closes the socket
so the helper reads EOF and exits quietly. FD_CLOEXEC keeps that EOF
reliable against later fork+exec children.
@masaori335 masaori335 added this to the 11.0.0 milestone Jun 18, 2026
@masaori335 masaori335 self-assigned this Jun 18, 2026
Copilot AI review requested due to automatic review settings June 18, 2026 04:17

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR prevents traffic_crashlog from emitting false-positive (empty) crash logs during a normal traffic_server shutdown by replacing a racy getppid()-based heuristic with an explicit socket handshake between traffic_server and the crashlog helper.

Changes:

  • Add an explicit “real crash” notification: traffic_server writes the crashing signal number to the helper over the existing socket before resuming it.
  • Make the helper treat EOF/short read on that socket as “no crash” and exit quietly, avoiding bogus logs on clean shutdown.
  • Add a gold test to assert SIGTERM shutdown produces no crashlog output.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
tests/gold_tests/shutdown/crashlog_no_false_positive.test.py New regression test ensuring clean SIGTERM shutdown does not trigger crashlog output.
src/traffic_server/Crash.cc Sets FD_CLOEXEC on the parent socket end and writes the signal number to the helper on real crashes.
src/traffic_crashlog/traffic_crashlog.cc Removes getppid() heuristic and blocks on a socket read to distinguish crash vs. parent exit (EOF).

Comment thread src/traffic_crashlog/traffic_crashlog.cc Outdated
Comment thread src/traffic_server/Crash.cc

@JosiahWI JosiahWI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good fix!

@masaori335 masaori335 merged commit 3c000b3 into apache:master Jun 18, 2026
15 checks passed
@github-project-automation github-project-automation Bot moved this to For v10.2.0 in ATS v10.2.x Jun 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: For v10.2.0

Development

Successfully merging this pull request may close these issues.

3 participants