[MINOR][CI] Fix leaked threads hanging Java test forks by Baunsgaard · Pull Request #2488 · apache/systemds

Baunsgaard · 2026-06-11T22:58:29Z

Prevent leaked threads from hanging Java test forks in CI

Several Java test suites (most visibly **.component.c** and data.misc/lineage) intermittently ran until the GitHub Actions job timeout even though the tests themselves had completed. The cause was leaked non-daemon threads keeping the surefire fork JVM alive, so the fork never exited and the job stalled until cancelled. There were two sources: in-JVM federated workers (FederatedWorker's Netty event loops and the test-side worker wrapper threads were non-daemon), and CommonThreadPool's fallback pools — when called off the main thread, it returned Executors.newFixedThreadPool/newCachedThreadPool, which default to non-daemon threads, while only the ForkJoinPool-backed variants were already daemon.

This PR makes those threads daemon at the source: FederatedWorker now creates its Netty event-loop groups with a daemon thread factory, and CommonThreadPool routes its fixed/cached fallbacks through one too, so daemon behavior is uniform across all pool variants. On the test side, AutomatedTestBase marks spawned worker threads as daemon, TestUtils.shutdownThread bounds its join (30s, warns on stragglers, restores the interrupt flag), and the lineage tests (LineageFedReuseAlg, FedFullReuseTest, FedUDFReuseTest) now shut workers down in a finally block so failures no longer leak workers (the large line counts there are just reindentation from the try/finally wrap). The javaTests.yml job cap stays at 30 minutes, with a comment documenting why it sits above the 600s per-fork surefire timeout, which remains the backstop for genuine hangs.

The component.c** suite cannot finish within the 30-minute GitHub Actions job cap. The cap cancels the whole job before surefire's per-fork timeout (test-forkedProcessTimeout, 600s) can kill hung forks and run the remaining tests in the suite. Raise the job cap to 60 minutes so surefire's kill-and-continue has room to complete the suite instead of GitHub Actions cancelling the job mid-run.

In-JVM federated workers (used by the lineage federated tests) ran on non-daemon threads, and their Netty boss/worker event loops were also non-daemon. When a test failed before calling shutdownThreads(...), the leaked worker kept the surefire fork JVM alive until the CI job timeout cancelled the whole job. - FederatedWorker: create the Netty event-loop groups with daemon threads. A standalone worker process keeps the JVM alive via its main thread, so this only affects in-JVM (test) workers, where it lets a leaked worker no longer block JVM exit. - AutomatedTestBase: mark the in-JVM worker wrapper threads as daemon. - TestUtils.shutdownThread: bound the join() so cleanup cannot block indefinitely if a worker ignores the interrupt. - LineageFedReuseAlg, FedFullReuseTest, FedUDFReuseTest: shut down workers in a finally block so they are reaped on the failure path.

CommonThreadPool.get(k) returns a ForkJoinPool (daemon threads) for main/PARFOR/FedExec threads, but falls back to a plain Executors.newFixedThreadPool(k) for any other caller thread, and getDynamicPool() uses Executors.newCachedThreadPool(); both default to non-daemon threads. Under parallel test execution the test thread is not the master thread, so component tests that compress directly hit the fallback path. A pool left unshut (e.g. stored as a field or skipped on an exception path) then keeps its non-daemon threads alive and blocks the surefire fork JVM from exiting, stalling the suite until the CI job timeout. Make both fallback pools use daemon threads, matching the ForkJoinPool paths. Daemon-ness only affects JVM shutdown: pending work keeps the submitting thread alive, so this cannot terminate in-flight tasks early.

codecov · 2026-06-15T12:17:25Z

Codecov Report

❌ Patch coverage is 66.66667% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.42%. Comparing base (17803b1) to head (a8fc528).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
...he/sysds/runtime/ooc/cache/OOCMatrixIOHandler.java	0.00%	4 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2488      +/-   ##
============================================
+ Coverage     71.41%   71.42%   +0.01%     
- Complexity    48817    48838      +21     
============================================
  Files          1572     1572              
  Lines        189089   189117      +28     
  Branches      37101    37106       +5     
============================================
+ Hits         135030   135085      +55     
+ Misses        43596    43580      -16     
+ Partials      10463    10452      -11

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Restore the 30-minute job cap now that the underlying thread-leak hangs are fixed (daemon Netty event loops, daemon CommonThreadPool fallback pools, and bounded test worker cleanup). The 600s per-fork surefire timeout still operates well within this cap, so a genuinely hung fork is killed by surefire before GitHub Actions cancels the job, restoring fast feedback on real hangs.

Baunsgaard · 2026-06-15T14:53:21Z

@janniklinde thanks to your points in the previous commit, I found the source of the issue for timeouts in many tests. It basically is because our parallel threads were not deamon threads.

Would you like to review?

janniklinde

Looks good to me. Nice to finally fix those hanging actions.

One minor note: shutdownThread(process) still prints the stack trace on interrupt, while shutdownThread(thread) restores the interrupt status with Thread.currentThread().interrupt().

Re-assert the interrupt flag (Thread.currentThread().interrupt()) instead of swallowing InterruptedException or only printing a stack trace, so the cancellation signal is not lost for callers up the stack. - TestUtils.shutdownThread(Process): log and restore interrupt instead of printStackTrace, matching the Thread overload - OOCMatrixIOHandler: restore interrupt in shutdown and scheduleEviction instead of silently ignoring - FederatedBackendPerformanceTest: split the InterruptedException/ ExecutionException multi-catch so the test fails instead of swallowing, and restore interrupt on interruption - Base, FederatedMatrixScalarOperationsTest, GenMatrices, dev release Utility: restore interrupt instead of printStackTrace/swallowing

github-project-automation Bot added this to SystemDS PR Queue Jun 11, 2026

github-project-automation Bot moved this to In Progress in SystemDS PR Queue Jun 11, 2026

Baunsgaard changed the title ~~Raise Java test job timeout to 60 minutes~~ [MINOR][CI] Raise Java test job timeout to 60 minutes Jun 11, 2026

Baunsgaard added 3 commits June 12, 2026 16:36

Shorten daemon event-loop comment in FederatedWorker

d82adbe

Baunsgaard changed the title ~~[MINOR][CI] Raise Java test job timeout to 60 minutes~~ [MINOR][CI] Fix leaked threads hanging Java test forks Jun 15, 2026

Baunsgaard requested a review from janniklinde June 15, 2026 14:52

janniklinde approved these changes Jun 16, 2026

View reviewed changes

Baunsgaard merged commit 65e734e into apache:main Jun 16, 2026
85 of 87 checks passed

github-project-automation Bot moved this from In Progress to Done in SystemDS PR Queue Jun 16, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MINOR][CI] Fix leaked threads hanging Java test forks#2488

[MINOR][CI] Fix leaked threads hanging Java test forks#2488
Baunsgaard merged 6 commits into
apache:mainfrom
Baunsgaard:ci/increase-java-test-timeout

Baunsgaard commented Jun 11, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 15, 2026 •

edited

Loading

Uh oh!

Baunsgaard commented Jun 15, 2026

Uh oh!

janniklinde left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Baunsgaard commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Prevent leaked threads from hanging Java test forks in CI

Uh oh!

codecov Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Baunsgaard commented Jun 15, 2026

Uh oh!

janniklinde left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Baunsgaard commented Jun 11, 2026 •

edited

Loading

codecov Bot commented Jun 15, 2026 •

edited

Loading