Bug 1348404 - Coordinator failover is taking longer because VERIFY_SUSPECT runs twice
Summary: Coordinator failover is taking longer because VERIFY_SUSPECT runs twice
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: JBoss Data Grid 6
Classification: JBoss
Component: JGroups
Version: 6.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ER1
Target Release: 6.6.1
Assignee: Tristan Tarrant
QA Contact: Martin Gencur
URL:
Whiteboard:
Depends On:
Blocks: 1309749
Reported: 2016-06-21 06:31 UTC by Osamu Nagano
Modified: 2025-02-10 03:49 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2025-02-10 03:49:04 UTC
Type: Bug
Embargoed:

Links
System: Red Hat Issue Tracker
ID: JGRP-2082
Private: 0
Priority: Major
Status: Resolved
Summary: Coordinator failover is taking longer because VERIFY_SUSPECT runs twice
Last Updated: 2017-05-16 08:09:38 UTC

Description Osamu Nagano 2016-06-21 06:31:48 UTC
Description of problem:
A customer is testing the latency of the Hot Rod Java client when the coordinator is killed. The client's turnaround time, including retries, is about twice as long when the killed nodes include the coordinator. The TRACE log files show that VERIFY_SUSPECT runs twice in that case, which causes the extra latency.
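
To illustrate why a second verification round matters, here is a minimal timing sketch. It assumes that each VERIFY_SUSPECT round waits for its full timeout and that rounds run one after another. The class name, method name, and all numbers are hypothetical; they are not taken from JGroups or from the attached logs.

```java
public class VerifySuspectDelay {
    // Hypothetical model: time from the failure until the suspected member is
    // excluded, assuming every verification round waits its full timeout.
    static long exclusionDelayMillis(long detectionMillis, long verifyTimeoutMillis, int rounds) {
        return detectionMillis + rounds * verifyTimeoutMillis;
    }

    public static void main(String[] args) {
        long detection = 3000;     // illustrative value, not measured
        long verifyTimeout = 2000; // illustrative value, not measured
        System.out.println("one round:  " + exclusionDelayMillis(detection, verifyTimeout, 1) + " ms");
        System.out.println("two rounds: " + exclusionDelayMillis(detection, verifyTimeout, 2) + " ms");
    }
}
```

Under these assumptions, every extra round adds one full verification timeout to the failover time.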


Configuration:
  JDG 6.6.0 server with the BZ-1328307 patch (7 nodes on each of 4 machines; clustered.xml is in log-and-config.zip)
  Java client configuration:
~~~
            cacheManager = new RemoteCacheManager(
                new ConfigurationBuilder()
                    .addServers(urlList)
                    // Both timeouts are in milliseconds.
                    .connectionTimeout(1000)
                    .socketTimeout(1000)
                    // Maximum number of retries per request.
                    .maxRetries(8)
                    .marshaller("org.mk300.infinispan.hotrod.marshaller.HotrodMinimumMarshaller")
                    .forceReturnValues(true)
                    .build());
~~~
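
For reference, a rough upper bound on the client's turnaround time with these settings can be sketched as below. It assumes that each attempt can block for up to the socket timeout before failing and that the first attempt is followed by at most maxRetries retries. Connection setup time is ignored. The class and method names are illustrative and are not part of the Hot Rod API.

```java
public class HotRodWorstCase {
    // Hypothetical helper: upper bound on the time one operation can take,
    // assuming every attempt blocks for the full socket timeout before failing.
    static long worstCaseMillis(int maxRetries, long socketTimeoutMillis) {
        int attempts = 1 + maxRetries; // first attempt plus retries (assumption)
        return attempts * socketTimeoutMillis;
    }

    public static void main(String[] args) {
        // maxRetries(8) and socketTimeout(1000) from the client configuration.
        System.out.println(worstCaseMillis(8, 1000) + " ms"); // prints "9000 ms"
    }
}
```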


How reproducible:
Reproducible in the customer's environment.


Actual results:
Extract of the TRACE log from the failed-over coordinator: 0501_server.log in log-and-config.zip. (Seven nodes on one machine, including the coordinator, were killed at the same time. The log shows VERIFY_SUSPECT running twice.)
Extract of the TRACE log from the suspected node: 0607_server.log


Expected results:
The old coordinator should be verified as dead on the first VERIFY_SUSPECT run, so that failover takes about as long as it does when a non-coordinator node is killed.


Additional info:
The full set of log files is available as '20160617_タイマ変更後_TRACE版_.zip' in the linked case.

Comment 5 Osamu Nagano 2016-07-12 07:36:25 UTC
Upstream has implemented a fix:
https://github.com/belaban/JGroups/commit/c9102cb235aaf9b114a353f7c99ffa5739bd11be

It would be good to have this in JDG 6.6.1, ideally together with BZ-1351016, although that bug is not resolved yet.

Comment 10 Red Hat Bugzilla 2025-02-10 03:49:04 UTC
This product has been discontinued or is no longer tracked in Red Hat Bugzilla.