ARTEMIS-6008 Prevent CriticalAnalyzer from killing the broker while failing over#6533
Conversation
|
I think instead of checking the server state you could just check |
|
can you rebase your commit, without a merge commit basicall: git pull --rebase upstream main And squash your commits as a single one? |
This is a great suggestion @jbertram. I have tried playing around with uglier stuff (such as adding a new ServerState) before settling on the new variable as the "least invasive" approach, but your suggestion is exactly what I was looking for all along. Adopted it and also rebased the commit. Note that i had to use reflection to access the latch it in the test, but I saw this approach used elsewhere in the project's tests so it shouldnt be an issue to use it here as well |
…ctivating after cluster primary is killed
| final CriticalAnalyzerPolicy criticalAnalyzerPolicy = configuration.getCriticalAnalyzerPolicy(); | ||
| CriticalAction criticalAction = switch (criticalAnalyzerPolicy) { | ||
| case HALT -> criticalComponent -> { | ||
| if (ActiveMQServerImpl.this.state == SERVER_STATE.STARTING) { |
There was a problem hiding this comment.
I think it would be more straight-forward to just change all the checks of state to use isActive() instead of adding a new method that does the same.
There was a problem hiding this comment.
Fair enough, done!
There was a problem hiding this comment.
Technically speaking reflection will work here. However, plumbing already exists to test this in a more robust way, e.g.:
private void testTooLongToStart(CriticalAnalyzerPolicy policy) throws Exception {
try (AssertionLoggerHandler loggerHandler = new AssertionLoggerHandler()) {
ConfigurationImpl configuration = new ConfigurationImpl();
configuration.setCriticalAnalyzerPolicy(policy);
configuration.setCriticalAnalyzer(true);
configuration.setPersistenceEnabled(false);
ActiveMQServerImpl server = new ActiveMQServerImpl(configuration);
addServer(server);
CountDownLatch latch = new CountDownLatch(1);
server.registerActivateCallback(new ActivateCallback() {
@Override
public void preActivate() {
try {
latch.await();
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
});
CompletableFuture.runAsync(() -> {
try {
server.start();
} catch (Exception e) {
e.printStackTrace();
}
});
Wait.waitFor(() -> server.getCriticalAnalyzer() != null);
CriticalAnalyzerAccessor.fireActions(server.getCriticalAnalyzer(), new CriticalComponentImpl(server.getCriticalAnalyzer(), 2));
assertTrue(loggerHandler.findText("AMQ224116"));
assertFalse(server.isActive()); // should not be changed
latch.countDown();
server.stop();
}
}By starting the broker async and forcing it to stall using a custom activation callback we don't need any reflection and we test the actual condition rather than an approximation.
There was a problem hiding this comment.
Agree. Updated the test
|
Nice work @sijonelis. Thanks for the PR! |
|
We talked with @sijonelis about this PR offline. We wanted to include one more change before merging. The following log message is now technically not 100% correct:
maybe change it to
@jbertram do you think it's worth opening additional PR for that? |
Sure. Do you plan on sending a PR? |
|
Yes, @sijonelis or me will do that. |
issues.apache.org/jira/browse/ARTEMIS-3664 fixed an issue with Critical Analyzer killing the broker during slow startup.
However, during live/backup pair failover when the backup node is in the STARTED state, the Critical Analyzer would still evaluate the node as unhealthy under large journal/slow network disk conditions and kill it.
This change proposes to switch the Critical Analyzer to LOG mode while the backup node is activating.
I have debated between adding a new ServerState and adding a boolean flag (as done in this PR). Finally, I have decided to go with the flag as this change is a minor bug fix and should not warrant a significantly wider change of adding a new ServerState, however I would gladly do that if asked by the maintainers.