1348404 – Coordinator failover is taking longer because VERIFY_SUSPECT runs twice

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	JBoss Data Grid 6
Classification:	JBoss
Component:	JGroups
Sub Component:
Version:	6.6.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	ER1
Target Release:	6.6.1
Assignee:	Tristan Tarrant
QA Contact:	Martin Gencur
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1309749

Reported:	2016-06-21 06:31 UTC by Osamu Nagano
Modified:	2025-02-10 03:49 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2025-02-10 03:49:04 UTC
Type:	Bug
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	JGRP-2082	0	Major	Resolved	Coordinator failover is taking longer because VERIFY_SUSPECT runs twice	2017-05-16 08:09:38 UTC

Description Osamu Nagano 2016-06-21 06:31:48 UTC

Description of problem:
A customer is testing the latency of the Hot Rod Java client when the coordinator is killed. The client's turnaround time, including retries, roughly doubles when the killed nodes include the coordinator, compared with killing only non-coordinator nodes. The TRACE log files show that VERIFY_SUSPECT runs twice in that case, which accounts for the added latency.


Configuration:
  JDG 6.6.0 server with the BZ-1328307 patch (7 nodes on each of 4 machines; clustered.xml is in log-and-config.zip)
  Java client configuration:
~~~
cacheManager =
    new RemoteCacheManager(new ConfigurationBuilder()
        .addServers(urlList)
        .connectionTimeout(1000)  // milliseconds
        .socketTimeout(1000)      // milliseconds
        .maxRetries(8)
        .marshaller("org.mk300.infinispan.hotrod.marshaller.HotrodMinimumMarshaller")
        .forceReturnValues(true)
        .build());
~~~
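
For anyone trying to reproduce the measurement, the turnaround time of a single client call (including whatever retries the client performs internally) can be captured with a small wrapper like the one below. This is only a sketch: `TurnaroundTimer` and `Timed` are hypothetical names, not part of the Hot Rod client API, and it is not the customer's actual test harness.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper for timing one client call; not part of the Hot Rod API. */
public final class TurnaroundTimer {

    /** Result of one timed call: the returned value and the elapsed wall-clock time. */
    public static final class Timed<T> {
        public final T value;
        public final long elapsedMillis;

        Timed(T value, long elapsedMillis) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
        }
    }

    private TurnaroundTimer() {
    }

    /**
     * Runs the operation once and measures how long it took end to end.
     * Any retries done inside the operation are included in the elapsed time.
     */
    public static <T> Timed<T> measure(Supplier<T> operation) {
        long start = System.nanoTime(); // monotonic clock, unaffected by wall-clock changes
        T value = operation.get();
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        return new Timed<>(value, elapsedMillis);
    }
}
```

Assuming `cache` is a `RemoteCache` obtained from the `cacheManager` above, a call such as `TurnaroundTimer.measure(() -> cache.get(key))` would give the per-operation latency to compare between the coordinator and non-coordinator kill scenarios. If the operation throws once the retries are exhausted, the exception propagates and no timing is recorded.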


How reproducible:
Reproducible in the customer's environment.


Actual results:
Extract of TRACE log from the failed-over coordinator: 0501_server.log in log-and-config.zip. All 7 nodes on one machine, including the coordinator, were killed at the same time; the log shows VERIFY_SUSPECT running twice.
Extract of TRACE log from the suspected node: 0607_server.log


Expected results:
The old coordinator should be verified as dead in the first VERIFY_SUSPECT round, so that the failover time is comparable to the case where a non-coordinator node is killed.
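
To make the cost of the second round concrete, here is a rough back-of-the-envelope model. It assumes that each VERIFY_SUSPECT round waits for its full timeout before declaring the member dead, and that the rounds run one after another. `FailoverDelayModel` and its parameters are hypothetical, and the numbers in `main` are illustrative only; they are not taken from the customer's clustered.xml.

```java
/** Hypothetical model of failover delay; a simplification, not JGroups code. */
public final class FailoverDelayModel {

    private FailoverDelayModel() {
    }

    /**
     * Estimates the time until a dead member is confirmed as dead.
     *
     * @param detectionMillis     time for failure detection to raise the suspicion (assumed)
     * @param verifyTimeoutMillis time one verification round waits for a reply (assumed)
     * @param verifyRounds        number of sequential verification rounds, at least 1
     */
    public static long estimateMillis(long detectionMillis, long verifyTimeoutMillis, int verifyRounds) {
        if (verifyRounds < 1) {
            throw new IllegalArgumentException("verifyRounds must be >= 1");
        }
        return detectionMillis + verifyTimeoutMillis * verifyRounds;
    }

    public static void main(String[] args) {
        // Illustrative values only (assumptions, not measured data).
        long detection = 3000;
        long verifyTimeout = 2000;
        System.out.println("1 round:  " + estimateMillis(detection, verifyTimeout, 1) + " ms");
        System.out.println("2 rounds: " + estimateMillis(detection, verifyTimeout, 2) + " ms");
    }
}
```

Under these assumptions, every extra verification round adds one full verification timeout to the failover time, which is the kind of extra delay reported above.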


Additional info:
The full set of log files is available as '20160617_タイマ変更後_TRACE版_.zip' in the linked case.

Comment 5 Osamu Nagano 2016-07-12 07:36:25 UTC

Upstream has implemented a fix:
https://github.com/belaban/JGroups/commit/c9102cb235aaf9b114a353f7c99ffa5739bd11be

It would be good to have this in JDG 6.6.1, ideally together with BZ-1351016, although that one is not resolved yet.

Comment 6 Vaclav Dedik 2016-07-28 15:51:13 UTC

Fixed:
https://github.com/infinispan/jdg/commit/6c5b9c1e196f17d102c198870124a31f1c30b3ca

Comment 10 Red Hat Bugzilla 2025-02-10 03:49:04 UTC

This product has been discontinued or is no longer tracked in Red Hat Bugzilla.