Description of problem:
A customer is testing the latency of the Hot Rod Java client when the coordinator is killed. The client's turnaround time, including retries, roughly doubles when the killed nodes include the coordinator. The TRACE log files show that VERIFY_SUSPECT runs twice in that case, which led to the higher latency.

Configuration:
JDG 6.6.0 server with the BZ-1328307 patch (7 nodes * 4 machines; clustered.xml is in log-and-config.zip).

Java client configuration:
~~~
cacheManager = new RemoteCacheManager(new ConfigurationBuilder()
    .addServers(urlList)
    .connectionTimeout(1000)
    .socketTimeout(1000)
    .maxRetries(8)
    .marshaller("org.mk300.infinispan.hotrod.marshaller.HotrodMinimumMarshaller")
    .forceReturnValues(true)
    .build());
~~~

How reproducible:
Reproducible in the customer's environment.

Actual results:
Extract of the TRACE log from the failed-over coordinator: 0501_server.log in log-and-config.zip. (The 7 nodes on one machine, including the coordinator, are killed at the same time. VERIFY_SUSPECT can be seen running twice.)
Extract of the TRACE log from the suspected node: 0607_server.log

Expected results:
The old coordinator should be verified as dead on the first attempt, so that the failover time is comparable to the case where a normal node is killed.

Additional info:
The full set of log files is available as '20160617_タイマ変更後_TRACE版_.zip' in the linked case.
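For anyone trying to reproduce the measurement, the following is a minimal sketch of how the client-side turnaround time of a single call could be captured. It is not the customer's test harness: the class name `TurnaroundTimer`, the `Sample` type, and the `time` method are all hypothetical, and the timed operation is a stand-in for a real cache call such as a `get` on the remote cache. Because the Hot Rod client retries internally, timing around the whole call should include the retries.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of JDG or the Hot Rod client.
public class TurnaroundTimer {

    /** One timed call: elapsed wall-clock time and whether it threw. */
    public static final class Sample {
        public final long elapsedMillis;
        public final boolean failed;

        Sample(long elapsedMillis, boolean failed) {
            this.elapsedMillis = elapsedMillis;
            this.failed = failed;
        }
    }

    /**
     * Runs the operation once and measures how long the caller waited,
     * whether it returned normally or threw (for example after the
     * client exhausted its retries).
     */
    public static Sample time(Callable<?> operation) {
        long start = System.nanoTime();
        boolean failed = false;
        try {
            operation.call();
        } catch (Exception e) {
            failed = true;
        }
        long elapsedNanos = System.nanoTime() - start;
        return new Sample(TimeUnit.NANOSECONDS.toMillis(elapsedNanos), failed);
    }

    public static void main(String[] args) {
        // Stand-in for a cache operation; in a real test this would be
        // a call on the RemoteCache obtained from cacheManager.
        Sample s = time(() -> {
            Thread.sleep(50);
            return null;
        });
        System.out.println("elapsed=" + s.elapsedMillis + "ms failed=" + s.failed);
    }
}
```

Comparing samples taken while a normal node is killed against samples taken while the coordinator is killed would show the difference described above.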
Upstream has implemented a fix: https://github.com/belaban/JGroups/commit/c9102cb235aaf9b114a353f7c99ffa5739bd11be It would be good to have this in JDG 6.6.1, ideally together with BZ-1351016, although that one is not resolved yet.
Fixed: https://github.com/infinispan/jdg/commit/6c5b9c1e196f17d102c198870124a31f1c30b3ca
This product has been discontinued or is no longer tracked in Red Hat Bugzilla.