Bug 1159162

Summary:	Possible false positive suspect of FD_HOST when the number of hosts is large
Product:	[JBoss] JBoss Data Grid 6	Reporter:	Osamu Nagano <onagano>
Component:	JGroups	Assignee:	Tristan Tarrant <ttarrant>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Martin Gencur <mgencur>
Severity:	high	Docs Contact:
Priority:	high
Version:	6.3.1	CC:	bban, dstahl, gsheldon, ksuzumur, mhusnain, pslavice, rvansa, sjacobs, slaskawi, tkimura, wfink
Target Milestone:	CR1	Flags:	ksuzumur: needinfo+
Target Release:	6.3.2
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Previously in Red Hat JBoss Data Grid, when using the FD_HOST protocol in JGroups for node failure detection (whether the node was alive was checked using ICMP pings), a node was suspected to be dead even if it was responsive. This issue was more likely to occur in larger clusters. This issue is now fixed in JBoss Data Grid 6.3.2.	Story Points:	---
Clone Of:
Clones:	1161529		Environment:
Last Closed:	2015-01-26 14:03:42 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1161529

Description Osamu Nagano 2014-10-31 06:00:38 UTC

FD_HOST's ping task reports how long a host has been unresponsive as "age=X secs", as in the following log line.

~~~
2014-10-29 16:22:43.045 TRACE [org.jgroups.protocols.FD_HOST] (Timer-2,shared=udp) AAAAAAAA/clustered(s1): /10.9.64.67 is not alive (age=5 secs)
~~~

Normally the age counts up to "age=5 secs" before the host is suspected, given the following configuration (timeout/interval = 5).

~~~
        <protocol type="FD_HOST">
             <property name="timeout">5000</property>
             <property name="interval">1000</property>
             <property name="check_timeout">1000</property>
        </protocol>
~~~

But we found a node that went straight to being treated as dead, without any such age count-up, as follows.

~~~
2014-10-29 16:23:17.371 TRACE [org.jgroups.protocols.VERIFY_SUSPECT] (Timer-2,shared=udp) verifying that BBBBBBBB/clustered(s1) is dead
~~~

Of course, a busy server and such a tight interval both contribute to a node being declared dead without the age count-up.  But the current implementation of FD_HOST$PingTask seems more prone to this behavior when the number of hosts is large.  See this comment [1] for details.


[1] https://github.com/belaban/JGroups/pull/156#issuecomment-47315801

> We have 60 hosts, each ping takes 0.5 sec, timeout=30 sec. Suspect cond is: if(current_time - timestamp >= timeout) suspect(host);
> 
> The current impl is:
> 
>     start_time = 0
>     for (hosts) update timestamps, first is 0.5 and last is 30
>     current_time = 30
>     in suspect check, 30 - 0.5 = 29.5
> 
> The first member is almost timed out although it responds to ping in 0.5 sec. It'll be suspected if there is a GC pause.
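
The arithmetic in the quoted comment can be reproduced with a small model. This is only a sketch under stated assumptions, not the FD_HOST source: the class and method names are hypothetical, hosts are pinged back to back, every ping takes the same time, and the suspect check runs once after the whole ping loop.

```java
// Hypothetical model of "ping every host first, check timeouts afterwards".
// Not JGroups code; all names here are made up for illustration.
final class TwoPhasePingModel {

    /**
     * Age in ms of one host's timestamp as seen by a suspect check that runs
     * only after all hosts have been pinged. hostIndex is zero-based.
     */
    static long ageAtCheckMillis(int hosts, long pingMillis, int hostIndex) {
        long clock = 0;
        long timestamp = 0;
        for (int i = 0; i < hosts; i++) {
            clock += pingMillis;            // ping i returns
            if (i == hostIndex) {
                timestamp = clock;          // host answered, timestamp refreshed
            }
        }
        return clock - timestamp;           // age when the check finally runs
    }

    /** Suspect condition from the quote: current_time - timestamp >= timeout. */
    static boolean suspected(long ageMillis, long extraDelayMillis, long timeoutMillis) {
        return ageMillis + extraDelayMillis >= timeoutMillis;
    }

    public static void main(String[] args) {
        long age = ageAtCheckMillis(60, 500, 0);
        System.out.println("first host age at check: " + age + " ms");
        System.out.println("suspected, no pause:     " + suspected(age, 0, 30_000));
        System.out.println("suspected, 500 ms pause: " + suspected(age, 500, 30_000));
    }
}
```

With the quoted numbers (60 hosts, 0.5 sec per ping, timeout=30 sec) the first host's age at check time is 29500 ms, so in this model any extra delay of 500 ms or more before the check pushes it over the timeout even though it answered its ping.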

(The last comment in the above conversation says that the old timestamp would be overwritten by a new task thread.  But FD_HOST$PingTask is scheduled by scheduleAtFixedRate(), and according to the Javadoc of java.util.concurrent.ScheduledExecutorService, executions of a task scheduled that way never run concurrently.)
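
The Javadoc statement referred to above can be observed with the plain JDK scheduler. The sketch below uses only java.util.concurrent, not the JGroups timer, so it says nothing about how JGroups schedules FD_HOST$PingTask internally; the class and method names are hypothetical. It schedules a task that runs longer than its period on a multi-threaded pool and records the highest number of simultaneous executions.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustration with the JDK scheduler only; this is not the JGroups timer.
final class FixedRateNoOverlap {

    /** Returns the maximum number of executions of the task seen running at once. */
    static int maxConcurrentRuns(long periodMs, long taskMs, long observeMs) {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(4);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger max = new AtomicInteger();
        pool.scheduleAtFixedRate(() -> {
            int now = running.incrementAndGet();
            max.accumulateAndGet(now, Math::max);
            try {
                Thread.sleep(taskMs);               // task deliberately outlasts its period
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                running.decrementAndGet();
            }
        }, 0, periodMs, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(observeMs);
            pool.shutdownNow();
            pool.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return max.get();
    }

    public static void main(String[] args) {
        // Period 10 ms, task 50 ms: late runs are delayed rather than overlapped.
        System.out.println("max concurrent runs: " + maxConcurrentRuns(10, 50, 300));
    }
}
```

Even with four worker threads available, a late execution is delayed rather than started alongside the previous one, which is the behavior the Javadoc describes.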

Comment 3 Takayoshi Kimura 2014-11-05 03:36:20 UTC

Currently we have 2 loops: a ping loop and a timeout-checking loop. A Full GC during the ping loop affects multiple hosts and produces unnecessarily many false suspects.

To reduce the impact of a Full GC, we can combine the 2 loops into 1 loop that pings each host and checks its timeout in the same iteration, so a Full GC delay affects only a single host and never the others.
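
The difference between the two loop structures can be sketched with a simulated clock. This is a hypothetical model and not the patch sent to JGroups: the names are made up, all hosts are assumed alive and last seen at time 0, and the pause is assumed to hit during the ping of one host so that this host's timestamp is not refreshed in that round.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical comparison of the two loop structures; not the JGroups patch.
final class PingLoopComparison {

    /** Ping all hosts first, then check every timestamp in a second loop. */
    static List<Integer> twoLoops(int hosts, long pingMs, long timeoutMs,
                                  int pausedHost, long pauseMs) {
        long clock = 0;
        long[] ts = new long[hosts];            // all hosts last seen at time 0
        for (int i = 0; i < hosts; i++) {
            clock += pingMs;
            if (i == pausedHost) {
                clock += pauseMs;               // assumed: pause hits here, ping is lost
            } else {
                ts[i] = clock;                  // host answered
            }
        }
        List<Integer> suspects = new ArrayList<>();
        for (int i = 0; i < hosts; i++) {
            if (clock - ts[i] >= timeoutMs) {
                suspects.add(i);
            }
        }
        return suspects;
    }

    /** Ping one host and check its timestamp in the same iteration. */
    static List<Integer> oneLoop(int hosts, long pingMs, long timeoutMs,
                                 int pausedHost, long pauseMs) {
        long clock = 0;
        long[] ts = new long[hosts];
        List<Integer> suspects = new ArrayList<>();
        for (int i = 0; i < hosts; i++) {
            clock += pingMs;
            if (i == pausedHost) {
                clock += pauseMs;               // same assumed pause
            } else {
                ts[i] = clock;
            }
            if (clock - ts[i] >= timeoutMs) {   // check right after this host's ping
                suspects.add(i);
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        // 10 hosts, 500 ms per ping, 5 s timeout, 6 s pause during host 5.
        System.out.println("two loops suspect: " + twoLoops(10, 500, 5000, 5, 6000));
        System.out.println("one loop suspects: " + oneLoop(10, 500, 5000, 5, 6000));
    }
}
```

With these example values the two-loop variant suspects hosts 0 through 5, because every timestamp taken before the pause looks stale by the time the check runs, while the combined loop suspects only host 5.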

Will send a pull request later.

Comment 4 JBoss JIRA Server 2014-11-05 07:59:51 UTC

Takayoshi Kimura <tkimura> updated the status of jira JGRP-1898 to Coding In Progress

Comment 5 JBoss JIRA Server 2014-11-05 08:00:01 UTC

Takayoshi Kimura <tkimura> updated the status of jira JGRP-1898 to Open

Comment 6 Bela Ban 2014-11-05 11:44:32 UTC

Thanks for the PR!
I applied it (changing the logic slightly, see comments on the PR) and backported it to the 3.5 and 3.4 branches.
I suggest creating a JGroups JAR from a snapshot of the 3.4 branch and testing this change. Once you tell me it works, I can release a 3.4.7.Final.

Comment 7 JBoss JIRA Server 2014-11-05 11:45:04 UTC

Bela Ban <bela> updated the status of jira JGRP-1898 to Resolved

Comment 8 Takayoshi Kimura 2014-11-07 07:58:58 UTC

Tested the 3.4 branch, but it doesn't work. Looking at the source code, the change was not applied correctly. Will send a PR shortly.

Comment 9 Takayoshi Kimura 2014-11-07 08:20:16 UTC

Fixed, tested, verified and sent PRs for 3.4 and 3.5.

https://github.com/belaban/JGroups/pull/181
https://github.com/belaban/JGroups/pull/182

Comment 10 Radim Vansa 2014-11-10 10:04:18 UTC

fixed typo

Comment 11 Dave Stahl 2014-11-24 15:54:19 UTC

jdg-6.3.x PR: https://github.com/infinispan/jdg/pull/366

Comment 12 Radim Vansa 2014-11-27 14:47:52 UTC

Unit test attached to JGroups JIRA.