[MINOR][CI] Fix leaked threads hanging Java test forks#2488
Conversation
The component.c** suite cannot finish within the 30-minute GitHub Actions job cap. The cap cancels the whole job before surefire's per-fork timeout (test-forkedProcessTimeout, 600s) can kill hung forks and run the remaining tests in the suite. Raise the job cap to 60 minutes so surefire's kill-and-continue has room to complete the suite instead of GitHub Actions cancelling the job mid-run.
In-JVM federated workers (used by the lineage federated tests) ran on non-daemon threads, and their Netty boss/worker event loops were also non-daemon. When a test failed before calling shutdownThreads(...), the leaked worker kept the surefire fork JVM alive until the CI job timeout cancelled the whole job. - FederatedWorker: create the Netty event-loop groups with daemon threads. A standalone worker process keeps the JVM alive via its main thread, so this only affects in-JVM (test) workers, where it lets a leaked worker no longer block JVM exit. - AutomatedTestBase: mark the in-JVM worker wrapper threads as daemon. - TestUtils.shutdownThread: bound the join() so cleanup cannot block indefinitely if a worker ignores the interrupt. - LineageFedReuseAlg, FedFullReuseTest, FedUDFReuseTest: shut down workers in a finally block so they are reaped on the failure path.
CommonThreadPool.get(k) returns a ForkJoinPool (daemon threads) for main/PARFOR/FedExec threads, but falls back to a plain Executors.newFixedThreadPool(k) for any other caller thread, and getDynamicPool() uses Executors.newCachedThreadPool(); both default to non-daemon threads. Under parallel test execution the test thread is not the master thread, so component tests that compress directly hit the fallback path. A pool left unshut (e.g. stored as a field or skipped on an exception path) then keeps its non-daemon threads alive and blocks the surefire fork JVM from exiting, stalling the suite until the CI job timeout. Make both fallback pools use daemon threads, matching the ForkJoinPool paths. Daemon-ness only affects JVM shutdown: pending work keeps the submitting thread alive, so this cannot terminate in-flight tasks early.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2488 +/- ##
============================================
+ Coverage 71.41% 71.42% +0.01%
- Complexity 48817 48838 +21
============================================
Files 1572 1572
Lines 189089 189117 +28
Branches 37101 37106 +5
============================================
+ Hits 135030 135085 +55
+ Misses 43596 43580 -16
+ Partials 10463 10452 -11 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Restore the 30-minute job cap now that the underlying thread-leak hangs are fixed (daemon Netty event loops, daemon CommonThreadPool fallback pools, and bounded test worker cleanup). The 600s per-fork surefire timeout still operates well within this cap, so a genuinely hung fork is killed by surefire before GitHub Actions cancels the job, restoring fast feedback on real hangs.
|
@janniklinde thanks to your points in the previous commit, I found the source of the issue for timeouts in many tests. It basically is because our parallel threads were not deamon threads. Would you like to review? |
janniklinde
left a comment
There was a problem hiding this comment.
Looks good to me. Nice to finally fix those hanging actions.
One minor note: shutdownThread(process) still prints the stack trace on interrupt, while shutdownThread(thread) restores the interrupt status with Thread.currentThread().interrupt().
Re-assert the interrupt flag (Thread.currentThread().interrupt()) instead of swallowing InterruptedException or only printing a stack trace, so the cancellation signal is not lost for callers up the stack. - TestUtils.shutdownThread(Process): log and restore interrupt instead of printStackTrace, matching the Thread overload - OOCMatrixIOHandler: restore interrupt in shutdown and scheduleEviction instead of silently ignoring - FederatedBackendPerformanceTest: split the InterruptedException/ ExecutionException multi-catch so the test fails instead of swallowing, and restore interrupt on interruption - Base, FederatedMatrixScalarOperationsTest, GenMatrices, dev release Utility: restore interrupt instead of printStackTrace/swallowing
Prevent leaked threads from hanging Java test forks in CI
Several Java test suites (most visibly
**.component.c**anddata.misc/lineage) intermittently ran until the GitHub Actions job timeout even though the tests themselves had completed. The cause was leaked non-daemon threads keeping the surefire fork JVM alive, so the fork never exited and the job stalled until cancelled. There were two sources: in-JVM federated workers (FederatedWorker's Netty event loops and the test-side worker wrapper threads were non-daemon), andCommonThreadPool's fallback pools — when called off the main thread, it returnedExecutors.newFixedThreadPool/newCachedThreadPool, which default to non-daemon threads, while only the ForkJoinPool-backed variants were already daemon.This PR makes those threads daemon at the source:
FederatedWorkernow creates its Netty event-loop groups with a daemon thread factory, andCommonThreadPoolroutes its fixed/cached fallbacks through one too, so daemon behavior is uniform across all pool variants. On the test side,AutomatedTestBasemarks spawned worker threads as daemon,TestUtils.shutdownThreadbounds itsjoin(30s, warns on stragglers, restores the interrupt flag), and the lineage tests (LineageFedReuseAlg,FedFullReuseTest,FedUDFReuseTest) now shut workers down in afinallyblock so failures no longer leak workers (the large line counts there are just reindentation from thetry/finallywrap). ThejavaTests.ymljob cap stays at 30 minutes, with a comment documenting why it sits above the 600s per-fork surefire timeout, which remains the backstop for genuine hangs.