Bug 1348404
| Summary: | Coordinator failover is taking longer because VERIFY_SUSPECT runs twice | ||
|---|---|---|---|
| Product: | [JBoss] JBoss Data Grid 6 | Reporter: | Osamu Nagano <onagano> |
| Component: | JGroups | Assignee: | Tristan Tarrant <ttarrant> |
| Status: | CLOSED UPSTREAM | QA Contact: | Martin Gencur <mgencur> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 6.6.0 | CC: | bban, pslavice, pzapataf, wfink |
| Target Milestone: | ER1 | ||
| Target Release: | 6.6.1 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2025-02-10 03:49:04 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1309749 | ||

Upstream has implemented a fix: https://github.com/belaban/JGroups/commit/c9102cb235aaf9b114a353f7c99ffa5739bd11be

It would be good to have this in JDG 6.6.1, ideally together with BZ-1351016, although that one is not resolved yet.

This product has been discontinued or is no longer tracked in Red Hat Bugzilla.

Description of problem:
A customer is testing the latency of the Hot Rod Java client when the coordinator is killed. The client's turnaround time, including retries, is about twice as long when the killed nodes include the coordinator as when a normal node is killed. The TRACE log files show that VERIFY_SUSPECT runs twice in that case, which causes the added latency.

Configuration:
JDG 6.6.0 server with the BZ-1328307 patch (7 nodes * 4 machines; clustered.xml is in log-and-config.zip).

Java client configuration:
~~~
cacheManager = new RemoteCacheManager(new ConfigurationBuilder()
    .addServers(urlList)
    .connectionTimeout(1000)
    .socketTimeout(1000)
    .maxRetries(8)
    .marshaller("org.mk300.infinispan.hotrod.marshaller.HotrodMinimumMarshaller")
    .forceReturnValues(true)
    .build());
~~~

How reproducible:
Reproducible in the customer's environment.

Actual results:
Extract of the TRACE log from the failed-over coordinator: 0501_server.log in log-and-config.zip. (The 7 nodes on one machine, including the coordinator, are killed at the same time. The log shows VERIFY_SUSPECT running twice.)
Extract of the TRACE log from a suspected node: 0607_server.log

Expected results:
The old coordinator should be verified as dead on the first attempt, so that the failover time is comparable to when a normal node is killed.

Additional info:
The full set of log files is available as '20160617_タイマ変更後_TRACE版_.zip' in the linked case.
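
To illustrate why a second VERIFY_SUSPECT round matters, here is a rough back-of-the-envelope sketch. It is not JGroups code. It assumes the rounds run back to back and that each one waits for its full timeout. The class name, method name and timeout values are hypothetical and are not taken from the customer's clustered.xml.

```java
// Hypothetical model for illustration only; none of these names exist in JGroups.
final class FailoverEstimate {

    /**
     * Estimates the time until a dead member is confirmed as suspect:
     * the failure-detection delay plus one verification timeout per round.
     */
    static long estimateMillis(long detectionMillis, long verifyTimeoutMillis, int verifyRounds) {
        if (detectionMillis < 0 || verifyTimeoutMillis < 0 || verifyRounds < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        return detectionMillis + verifyTimeoutMillis * verifyRounds;
    }

    public static void main(String[] args) {
        long detection = 3000; // assumed failure-detection delay in ms
        long verify = 1500;    // assumed VERIFY_SUSPECT timeout in ms

        System.out.println("one round:  " + estimateMillis(detection, verify, 1) + " ms");
        System.out.println("two rounds: " + estimateMillis(detection, verify, 2) + " ms");
    }
}
```

Under these assumptions each extra verification round adds one full timeout before the new view can be installed, which matches the kind of delay reported here. The real figures depend on the timeouts configured in the JGroups stack.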