Skip to content

ARTEMIS-6008 Prevent CriticalAnalyzer from killing the broker while failing over#6533

Merged
jbertram merged 3 commits into
apache:mainfrom
sijonelis:ARTEMIS-6008
Jun 24, 2026
Merged

ARTEMIS-6008 Prevent CriticalAnalyzer from killing the broker while failing over#6533
jbertram merged 3 commits into
apache:mainfrom
sijonelis:ARTEMIS-6008

Conversation

@sijonelis

@sijonelis sijonelis commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

issues.apache.org/jira/browse/ARTEMIS-3664 fixed an issue with Critical Analyzer killing the broker during slow startup.

However, during live/backup pair failover when the backup node is in the STARTED state, the Critical Analyzer would still evaluate the node as unhealthy under large journal/slow network disk conditions and kill it.

This change proposes to switch the Critical Analyzer to LOG mode while the backup node is activating.

I have debated between adding a new ServerState and adding a boolean flag (as done in this PR). Finally, I have decided to go with the flag as this change is a minor bug fix and should not warrant a significantly wider change of adding a new ServerState, however I would gladly do that if asked by the maintainers.

@sijonelis sijonelis changed the title Artemis 6008 ARTEMIS-6008 Prevent CriticalAnalyzer from killing the broker while failing over Jun 22, 2026
@jbertram

Copy link
Copy Markdown
Contributor

I think instead of checking the server state you could just check ActiveMQServer#isActive and that would work for both the primary and the backup use-cases and you wouldn't need to add another field. What do you think?

@clebertsuconic

Copy link
Copy Markdown
Contributor

can you rebase your commit, without a merge commit

basicall:

git pull --rebase upstream main

And squash your commits as a single one?

@sijonelis

sijonelis commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

I think instead of checking the server state you could just check ActiveMQServer#isActive and that would work for both the primary and the backup use-cases and you wouldn't need to add another field.

This is a great suggestion @jbertram. I have tried playing around with uglier stuff (such as adding a new ServerState) before settling on the new variable as the "least invasive" approach, but your suggestion is exactly what I was looking for all along.

Adopted it and also rebased the commit.

Note that i had to use reflection to access the latch it in the test, but I saw this approach used elsewhere in the project's tests so it shouldnt be an issue to use it here as well

final CriticalAnalyzerPolicy criticalAnalyzerPolicy = configuration.getCriticalAnalyzerPolicy();
CriticalAction criticalAction = switch (criticalAnalyzerPolicy) {
case HALT -> criticalComponent -> {
if (ActiveMQServerImpl.this.state == SERVER_STATE.STARTING) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more straight-forward to just change all the checks of state to use isActive() instead of adding a new method that does the same.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, done!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically speaking reflection will work here. However, plumbing already exists to test this in a more robust way, e.g.:

   private void testTooLongToStart(CriticalAnalyzerPolicy policy) throws Exception {

      try (AssertionLoggerHandler loggerHandler = new AssertionLoggerHandler()) {

         ConfigurationImpl configuration = new ConfigurationImpl();
         configuration.setCriticalAnalyzerPolicy(policy);
         configuration.setCriticalAnalyzer(true);
         configuration.setPersistenceEnabled(false);
         ActiveMQServerImpl server = new ActiveMQServerImpl(configuration);
         addServer(server);
         CountDownLatch latch = new CountDownLatch(1);
         server.registerActivateCallback(new ActivateCallback() {
            @Override
            public void preActivate() {
               try {
                  latch.await();
               } catch (InterruptedException e) {
                  throw new RuntimeException(e);
               }
            }
         });
         CompletableFuture.runAsync(() -> {
            try {
               server.start();
            } catch (Exception e) {
               e.printStackTrace();
            }
         });
         Wait.waitFor(() -> server.getCriticalAnalyzer() != null);
         CriticalAnalyzerAccessor.fireActions(server.getCriticalAnalyzer(), new CriticalComponentImpl(server.getCriticalAnalyzer(), 2));
         assertTrue(loggerHandler.findText("AMQ224116"));
         assertFalse(server.isActive()); // should not be changed
         latch.countDown();
         server.stop();
      }
   }

By starting the broker async and forcing it to stall using a custom activation callback we don't need any reflection and we test the actual condition rather than an approximation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Updated the test

@sijonelis sijonelis requested a review from jbertram June 24, 2026 05:23
@jbertram jbertram merged commit 6d91acd into apache:main Jun 24, 2026
6 checks passed
@jbertram

Copy link
Copy Markdown
Contributor

Nice work @sijonelis. Thanks for the PR!

@ViliusS

ViliusS commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

We talked with @sijonelis about this PR offline. We wanted to include one more change before merging. The following log message is now technically not 100% correct:

@Message(id = 224116, value = "The component {0} is not responsive during start up. The Server may be taking too long to start", format = Message.Format.MESSAGE_FORMAT)

maybe change it to

@Message(id = 224116, value = "The component {0} is not responsive. The Server may be taking too long to load", format = Message.Format.MESSAGE_FORMAT)

@jbertram do you think it's worth opening additional PR for that?

@jbertram

Copy link
Copy Markdown
Contributor

@jbertram do you think it's worth opening additional PR for that?

Sure. Do you plan on sending a PR?

@ViliusS

ViliusS commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Yes, @sijonelis or me will do that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants