1703423 – Multiple disconnect events being propagated for the same child

Bug 1703423 - Multiple disconnect events being propagated for the same child

Summary: Multiple disconnect events being propagated for the same child

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	rpc
Sub Component:
Version:	rhgs-3.5
Hardware:	Unspecified
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	RHGS 3.5.0
Assignee:	Raghavendra G
QA Contact:	Mugdha Soni
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1696809 1716979 1739334 1739335 1739336
TreeView+	depends on / blocked

Reported:	2019-04-26 11:32 UTC by Ravishankar N
Modified:	2019-10-30 12:21 UTC (History)
CC List:	6 users (show)
Fixed In Version:	glusterfs-6.0-7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1716979 (view as bug list)
Environment:
Last Closed:	2019-10-30 12:20:55 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2019:3249	0	None	None	None	2019-10-30 12:21:18 UTC

Comment 6 Raghavendra G 2019-06-04 14:09:56 UTC

The multiple disconnect events are due to reconnect/disconnect to glusterd (port 24007). rpc/clnt has a reconnect feature which tries to reconnect to a disconnected brick and client connection to brick is a two step process:
1. connect to glusterd, get brick port then disconnect
2. connect to brick

In this case step 1 would be successful and step 2 won't happen as glusterd wouldn't send back the brick port (as brick is dead). Nevertheless there is a chain of connect/disconnect (to glusterd) at rpc layer and these are valid steps as we need reconnection logic. However subsequent disconnect events were prevented from reaching parents of protocol/client as it remembered which was the last sent and if current event is the same as last event, it would skip notification. Before Halo replication feature - https://review.gluster.org/16177, last_sent_event for this test case would be GF_EVENT_DISCONNECT and hence subsequent disconnects were skipped notification to parent xlators. But Halo replication introduced another event GF_EVENT_CHILD_PING which gets notified to parents of protocol/client whenever there is a successful ping response. In this case, the successful ping response would be from glusterd and would change conf->last_sent_event to GF_EVENT_CHILD_PING. This made subsequent disconnect events are not skipped.

A patch to propagate GF_EVENT_CHILD_PING only after a successful handshake prevents spurious CHILD_DOWN events to afr. However, I am not sure whether this breaks Halo replication. Would request afr team members comment on the patch (I'll post shortly).

Comment 7 Raghavendra G 2019-06-04 14:10:24 UTC

upstream patch: https://review.gluster.org/#/c/glusterfs/+/22821/

Comment 13 Mugdha Soni 2019-08-20 10:54:11 UTC

Tested with the following:

glusterfs-6.0-12.el7rhgs.x86_64
glusterfs-server-6.0-12.el7rhgs.x86_64
glusterfs-libs-6.0-12.el7rhgs.x86_64
glusterfs-fuse-6.0-12.el7rhgs.x86_64
glusterfs-api-6.0-12.el7rhgs.x86_64
glusterfs-rdma-6.0-12.el7rhgs.x86_64
glusterfs-cli-6.0-12.el7rhgs.x86_64
glusterfs-geo-replication-6.0-12.el7rhgs.x86_64
glusterfs-events-6.0-12.el7rhgs.x86_64
glusterfs-debuginfo-6.0-12.el7rhgs.x86_64
glusterfs-client-xlators-6.0-12.el7rhgs.x86_64 

The steps performed :

1.Created a replica 3 volume and started it.
2.Installed the GDB and glusterfs-debuginfo packages on the node.
3.Enter into the gdb session (gdb -p `pidof glusterfs`) and  set a break point (b __afr_handle_child_down_event)
4.Went to one of the node and killed a brick on the node (kill -9 pid)
5.The following output was seen.There was only 1 event per one disconnect and didnot see any issue that of events generating every 3 sec for one disconnect.

   (gdb) b __afr_handle_child_down_event
Breakpoint 1 at 0x7f675b62b940: file afr-common.c, line 5249.
(gdb) c
Continuing.
[Switching to Thread 0x7f675cafe700 (LWP 16524)]

Breakpoint 1, __afr_handle_child_down_event (this=this@entry=0x7f675400ffe0, child_xlator=child_xlator@entry=0x7f675400d5c0, idx=idx@entry=2, child_latency_msec=child_latency_msec@entry=-1, 
    event=event@entry=0x7f675cafcc5c, call_psh=call_psh@entry=0x7f675cafcc68, up_child=up_child@entry=0x7f675cafcc6c) at afr-common.c:5249
5249	{
.



Based on the above output moving the bug to verified state.

Comment 15 errata-xmlrpc 2019-10-30 12:20:55 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:3249

Note You need to log in before you can comment on or make changes to this bug.