Bug 1348404

Summary: Coordinator failover is taking longer because VERIFY_SUSPECT runs twice
Product: [JBoss] JBoss Data Grid 6 Reporter: Osamu Nagano <onagano>
Component: JGroups Assignee: Tristan Tarrant <ttarrant>
Status: CLOSED UPSTREAM QA Contact: Martin Gencur <mgencur>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.6.0 CC: bban, pslavice, pzapataf, wfink
Target Milestone: ER1   
Target Release: 6.6.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2025-02-10 03:49:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1309749    

Description Osamu Nagano 2016-06-21 06:31:48 UTC
Description of problem:
A customer is testing the latency of the Hot Rod Java client when the coordinator is killed. The client's turn-around time, including retries, is about twice as long when the killed nodes include the coordinator. From the TRACE log files we found that VERIFY_SUSPECT runs twice in that case, which causes the extra latency.


Configuration:
  JDG 6.6.0 server with the BZ-1328307 patch (7 nodes on each of 4 machines; clustered.xml is in log-and-config.zip)
  Java client configuration:
~~~
cacheManager =
    new RemoteCacheManager(new ConfigurationBuilder()
        .addServers(urlList)
        .connectionTimeout(1000)  // milliseconds
        .socketTimeout(1000)      // milliseconds
        .maxRetries(8)
        .marshaller("org.mk300.infinispan.hotrod.marshaller.HotrodMinimumMarshaller")
        .forceReturnValues(true)
        .build());
~~~
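
For reference, the turn-around time of a single client operation, including any retries the client performs internally, could be measured with a small harness like the one below. This is only a sketch and not the customer's test code: the `TurnaroundTimer` class and its `time` helper are hypothetical names, and the sleeping lambda stands in for a real cache call such as a put or get.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

public class TurnaroundTimer {

    /** Result of one timed call: the returned value and the elapsed wall-clock time. */
    static final class Timed<T> {
        final T value;
        final long elapsedMillis;

        Timed(T value, long elapsedMillis) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /**
     * Runs the operation once and reports how long it took.
     * Retries done inside the operation are included in the elapsed time.
     * If the operation throws, the exception propagates and no timing is returned.
     */
    static <T> Timed<T> time(Callable<T> operation) throws Exception {
        long start = System.nanoTime();
        T value = operation.call();
        long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        return new Timed<>(value, elapsed);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for a Hot Rod cache operation.
        Timed<String> result = time(() -> {
            Thread.sleep(50);
            return "ok";
        });
        System.out.println("value=" + result.value + " elapsed=" + result.elapsedMillis + " ms");
    }
}
```

Comparing the elapsed time when an ordinary node is killed with the elapsed time when the coordinator is killed would show the difference described in this report.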


How reproducible:
Reproducible in the customer's environment.


Actual results:
Extract of the TRACE log from the node that took over as coordinator: 0501_server.log in log-and-config.zip. (The 7 nodes on one machine, including the coordinator, were killed at the same time. The log shows VERIFY_SUSPECT running twice.)
Extract of the TRACE log from a suspected node: 0607_server.log


Expected results:
The old coordinator should be verified as dead on the first attempt, so that the failover time is comparable to the case where a normal node is killed.
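
To illustrate why a second VERIFY_SUSPECT round matters, here is a rough back-of-the-envelope model. It assumes, as a simplification, that the delay before a suspicion is acted on is the failure-detection time plus one VERIFY_SUSPECT timeout per verification round. The class name, method name, and the millisecond values are hypothetical and are not taken from the customer's clustered.xml.

```java
public class FailoverEstimate {

    /**
     * Simplified model (an assumption, not the JGroups implementation):
     * delay = failure detection time + verify timeout * number of verify rounds.
     */
    static long suspicionDelayMillis(long detectionMillis, long verifyTimeoutMillis, int verifyRounds) {
        if (detectionMillis < 0 || verifyTimeoutMillis < 0 || verifyRounds < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        return detectionMillis + verifyTimeoutMillis * verifyRounds;
    }

    public static void main(String[] args) {
        long detection = 3000; // hypothetical failure-detection time in ms
        long verify = 1500;    // hypothetical VERIFY_SUSPECT timeout in ms

        long oneRound = suspicionDelayMillis(detection, verify, 1);
        long twoRounds = suspicionDelayMillis(detection, verify, 2);

        System.out.println("one verify round:  " + oneRound + " ms");
        System.out.println("two verify rounds: " + twoRounds + " ms");
        System.out.println("extra delay:       " + (twoRounds - oneRound) + " ms");
    }
}
```

Under this model, each additional verification round adds one full verify timeout to the failover, which is consistent with the longer client turn-around time seen when the coordinator is among the killed nodes.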


Additional info:
The full set of log files is available as '20160617_タイマ変更後_TRACE版_.zip' (the name means "after timer change, TRACE version") in the linked case.

Comment 5 Osamu Nagano 2016-07-12 07:36:25 UTC
Upstream has implemented a fix:
https://github.com/belaban/JGroups/commit/c9102cb235aaf9b114a353f7c99ffa5739bd11be

It would be good to have this in JDG 6.6.1, ideally together with the fix for BZ-1351016, which is not yet resolved.

Comment 10 Red Hat Bugzilla 2025-02-10 03:49:04 UTC
This product has been discontinued or is no longer tracked in Red Hat Bugzilla.