Bug 1348404

Summary:	Coordinator failover is taking longer because VERIFY_SUSPECT runs twice
Product:	[JBoss] JBoss Data Grid 6	Reporter:	Osamu Nagano <onagano>
Component:	JGroups	Assignee:	Tristan Tarrant <ttarrant>
Status:	CLOSED UPSTREAM	QA Contact:	Martin Gencur <mgencur>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	6.6.0	CC:	bban, pslavice, pzapataf, wfink
Target Milestone:	ER1
Target Release:	6.6.1
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2025-02-10 03:49:04 UTC	Type:	Bug
Bug Blocks:	1309749

Description Osamu Nagano 2016-06-21 06:31:48 UTC

Description of problem:
A customer is testing the latency of the Hot Rod Java client when the coordinator is killed. The client's turnaround time, including retries, roughly doubles when the killed nodes include the coordinator. The TRACE logs show that VERIFY_SUSPECT runs twice in that case, which accounts for the extra latency.


Configuration:
  JDG 6.6.0 server with the BZ-1328307 patch (7 nodes on each of 4 machines; clustered.xml is in log-and-config.zip)
  Java client configuration:
~~~
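import org.infinispan.client.hotrod.RemoteCacheManager;
import org.infinispan.client.hotrod.configuration.ConfigurationBuilder;

// urlList is a semicolon-separated "host:port" server list; timeouts are in milliseconds.
cacheManager = new RemoteCacheManager(new ConfigurationBuilder()
    .addServers(urlList)
    .connectionTimeout(1000)
    .socketTimeout(1000)
    .maxRetries(8)
    .marshaller("org.mk300.infinispan.hotrod.marshaller.HotrodMinimumMarshaller")
    .forceReturnValues(true)
    .build());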
~~~
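
For a rough sense of how long a single client call can take with these settings, the sketch below computes an upper bound on turnaround time. It assumes each attempt blocks for at most one socketTimeout and that the client makes one initial attempt plus up to maxRetries retries. The class and method names are hypothetical, and the real Hot Rod retry logic may behave differently (for example, connection attempts are bounded separately by connectionTimeout).

~~~java
// Hypothetical helper, not part of the Hot Rod client API.
final class RetryBudget {
    // Upper bound in ms, assuming every attempt waits a full socket timeout
    // and the client performs 1 initial attempt plus maxRetries retries.
    static long worstCaseMillis(long socketTimeoutMillis, int maxRetries) {
        return socketTimeoutMillis * (maxRetries + 1L);
    }

    public static void main(String[] args) {
        // Values taken from the client configuration above.
        System.out.println(worstCaseMillis(1000, 8)); // prints 9000
    }
}
~~~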


How reproducible:
Reproducible in the customer's environment.


Actual results:
Extract of the TRACE log from the node that took over as coordinator: 0501_server.log in log-and-config.zip. (All 7 nodes on one machine, including the coordinator, were killed at the same time. The log shows VERIFY_SUSPECT running twice.)
Extract of the TRACE log from a suspected node: 0607_server.log
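
To check a log extract for the repeated verification, one option is to count the lines that mention VERIFY_SUSPECT. The sketch below does a plain substring match. The class name is hypothetical, and it makes no assumption about the TRACE line layout, so the count includes every message from that protocol, not only the start of a verification round.

~~~java
// Hypothetical helper for scanning a server log; plain substring matching only.
final class SuspectScan {
    // Counts the lines that contain the given marker, e.g. "VERIFY_SUSPECT".
    static int countMatches(Iterable<String> lines, String marker) {
        int count = 0;
        for (String line : lines) {
            if (line != null && line.contains(marker)) {
                count++;
            }
        }
        return count;
    }
}
~~~

For a real file, the lines could be read with java.nio.file.Files.readAllLines(java.nio.file.Paths.get("0501_server.log")) and passed to countMatches together with "VERIFY_SUSPECT".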


Expected results:
The old coordinator should be confirmed dead in the first VERIFY_SUSPECT round, so that failover takes about as long as when a non-coordinator node is killed.
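
The expectation can be put as simple arithmetic. In the toy model below, failover time is the failure-detection delay plus one VERIFY_SUSPECT timeout per verification round. The class name, the model itself, and the millisecond values are assumptions for illustration, not figures from clustered.xml or the attached logs.

~~~java
// Toy model with hypothetical names and values; not JGroups code.
final class FailoverModel {
    // Time until a dead member is excluded: detection delay plus one
    // verification timeout for each VERIFY_SUSPECT round.
    static long failoverMillis(long detectMillis, long verifyMillis, int rounds) {
        return detectMillis + verifyMillis * rounds;
    }

    public static void main(String[] args) {
        long detect = 3000; // assumed failure-detection delay
        long verify = 2000; // assumed VERIFY_SUSPECT timeout
        System.out.println(failoverMillis(detect, verify, 1)); // prints 5000
        System.out.println(failoverMillis(detect, verify, 2)); // prints 7000
    }
}
~~~

Under this model, a second round adds one full verification timeout to every coordinator failover.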


Additional info:
The full set of log files is available as '20160617_タイマ変更後_TRACE版_.zip' in the linked case.

Comment 5 Osamu Nagano 2016-07-12 07:36:25 UTC

Upstream has implemented a fix:
https://github.com/belaban/JGroups/commit/c9102cb235aaf9b114a353f7c99ffa5739bd11be

It would be good to include this in JDG 6.6.1, ideally together with BZ-1351016, although that one is not yet resolved.

Comment 6 Vaclav Dedik 2016-07-28 15:51:13 UTC

Fixed:
https://github.com/infinispan/jdg/commit/6c5b9c1e196f17d102c198870124a31f1c30b3ca

Comment 10 Red Hat Bugzilla 2025-02-10 03:49:04 UTC

This product has been discontinued or is no longer tracked in Red Hat Bugzilla.