Bug 1716979

Summary: Multiple disconnect events being propagated for the same child
Product: [Community] GlusterFS
Component: rpc
Version: mainline
Status: CLOSED NEXTRELEASE
Severity: high
Priority: high
Keywords: Regression
Hardware: Unspecified
OS: Linux
Reporter: Raghavendra G <rgowdapp>
Assignee: bugs <bugs>
CC: amgad.saleh, amukherj, bugs, hgowtham, ravishankar, rgowdapp, rhinduja, rhs-bugs, sankarshan, sheggodu
Clone Of: 1703423
Bug Depends On: 1703423
Bug Blocks: 1739334, 1739335, 1739336
Last Closed: 2019-06-27 14:11:38 UTC

Comment 1 Raghavendra G 2019-06-04 13:39:06 UTC
The issue was reported upstream by a user via https://github.com/gluster/glusterfs/issues/648.

I'm seeing that if I kill a brick in a replica 3 system, AFR keeps getting the child_down event repeatedly for the same child.

Version-Release number of selected component (if applicable):
master (source install)

How reproducible:
Always.

Steps to Reproduce:
1. Create a replica 3 volume and start it.
2. Put a breakpoint in __afr_handle_child_down_event() in the glustershd process.
3. Kill any one brick.

Actual results:
The breakpoint keeps getting hit repeatedly, about once every 3 seconds.

Expected results:
Only one event per disconnect.

Additional info:
I haven't checked whether the same happens for GF_EVENT_CHILD_UP as well. I think this is a regression that needs to be fixed. If this is not a bug, please feel free to close it, stating why.
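
For context, here is a minimal sketch of what a child-down handler of this kind does. It is illustrative only, not the actual __afr_handle_child_down_event() code, and the type and field names are invented for the example. The point is that every CHILD_DOWN redoes the same bookkeeping (and whatever quorum logic is driven by it), so receiving the event every few seconds for a brick that is already marked down is pure noise.

#include <stdbool.h>

/* Illustrative only; not the real AFR xlator code. The type and
 * field names below are invented for this sketch. */
typedef struct {
    int   child_count;   /* number of bricks in the replica set */
    bool *child_up;      /* per-child "is this brick up?" flag  */
} afr_private_sketch_t;

/* Conceptual handling of one CHILD_DOWN for child 'idx': mark it down
 * and recount how many children are still up. If the same CHILD_DOWN
 * arrives again and again, this bookkeeping (and any quorum checks
 * driven by the returned count) runs again and again for no reason. */
static int
handle_child_down (afr_private_sketch_t *priv, int idx)
{
    int up = 0;

    priv->child_up[idx] = false;

    for (int i = 0; i < priv->child_count; i++)
        if (priv->child_up[i])
            up++;

    return up;
}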

Comment 2 Raghavendra G 2019-06-04 13:52:16 UTC
The multiple disconnect events are due to repeated reconnects/disconnects to glusterd (port 24007). rpc/clnt has a reconnect feature that tries to reconnect to a disconnected brick, and the client's connection to a brick is a two-step process (sketched just after the list below):
1. connect to glusterd, get brick port then disconnect
2. connect to brick
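
As a rough illustration of that two-step process, here is a simplified, self-contained sketch. The glusterd_* and brick_connect helpers are placeholder stubs simulating a dead brick, not the real rpc-clnt API.

#include <stdio.h>
#include <stdbool.h>

/* Placeholder stubs simulating a dead brick: glusterd accepts the
 * connection but has no brick port to hand back. These are not the
 * real rpc-clnt APIs. */
static bool glusterd_connect (void)          { return true; }
static int  glusterd_query_brick_port (void) { return -1;   }
static void glusterd_disconnect (void)       { }
static bool brick_connect (int port)         { (void) port; return false; }

/* One pass of the two-step connection. While the brick is down, every
 * pass performs a real connect and disconnect against glusterd, so the
 * rpc layer keeps generating connect/disconnect events even though the
 * brick itself is never reached. */
static bool
try_connect_to_brick (void)
{
    if (!glusterd_connect ())
        return false;

    int port = glusterd_query_brick_port ();  /* step 1: ask for the port */
    glusterd_disconnect ();

    if (port < 0)
        return false;                         /* brick dead: retry later  */

    return brick_connect (port);              /* step 2: connect to brick */
}

int
main (void)
{
    /* The reconnect timer would call this repeatedly; a few iterations
     * are enough to show the repeating connect/disconnect cycle. */
    for (int attempt = 1; attempt <= 3; attempt++)
        printf ("attempt %d: %s\n", attempt,
                try_connect_to_brick () ? "connected" : "disconnect, retry");
    return 0;
}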

In this case step 1 succeeds but step 2 never happens, because glusterd won't send back the brick port (the brick is dead). Nevertheless there is a repeating chain of connects/disconnects (to glusterd) at the rpc layer, and these are valid steps since we need the reconnection logic. However, subsequent disconnect events used to be prevented from reaching the parents of protocol/client: it remembered which event was sent last, and if the current event matched the last one it skipped the notification.

Before the Halo replication feature (https://review.gluster.org/16177), last_sent_event for this test case would be GF_EVENT_DISCONNECT, so subsequent disconnects were not notified to the parent xlators. But Halo replication introduced another event, GF_EVENT_CHILD_PING, which is notified to the parents of protocol/client whenever there is a successful ping response. In this case the successful ping response comes from glusterd and changes conf->last_sent_event to GF_EVENT_CHILD_PING, so subsequent disconnect events are no longer skipped.
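
A pared-down sketch of that deduplication, assuming a structure with only the last_sent_event field mentioned above; the enum, struct, and function here are simplified for illustration and are not the real protocol/client code.

/* Simplified sketch of the last-sent-event check described above; not
 * the real protocol/client code. Only last_sent_event corresponds to a
 * real field; everything else is pared down for illustration. */
typedef enum {
    GF_EVENT_CHILD_UP,
    GF_EVENT_CHILD_DOWN,
    GF_EVENT_CHILD_PING,
} gf_event_sketch_t;

typedef struct {
    gf_event_sketch_t last_sent_event;
} clnt_conf_sketch_t;

/* Forward an event to the parent xlators (AFR, etc.) only if it differs
 * from the last event sent up. Two CHILD_DOWNs in a row collapse into
 * one, unless a CHILD_PING slips in between and overwrites
 * last_sent_event, which is exactly what the ping responses from
 * glusterd were doing here. */
static void
notify_parents (clnt_conf_sketch_t *conf, gf_event_sketch_t event)
{
    if (event == conf->last_sent_event)
        return;                       /* duplicate of the last event: skip */

    conf->last_sent_event = event;
    /* ... pass the event up to the parent xlators ... */
}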

A patch to propagate GF_EVENT_CHILD_PING only after a successful handshake prevents these spurious CHILD_DOWN events from reaching AFR. However, I am not sure whether this breaks Halo replication, so I would request that the AFR team members comment on the patch (I'll post it shortly).
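
Reusing the types from the previous sketch, the idea could look roughly like this; the connected_to_brick flag is a stand-in for whatever handshake state the real xlator consults, and this is not the merged change itself.

#include <stdbool.h>

/* Sketch of the fix idea, building on the notify_parents() sketch
 * above; not the merged patch. 'connected_to_brick' stands in for the
 * handshake state that the real protocol/client xlator tracks. */
static void
handle_ping_response (clnt_conf_sketch_t *conf, bool connected_to_brick)
{
    if (!connected_to_brick)
        return;  /* ping arrived over the short-lived glusterd
                  * connection: do not let it overwrite last_sent_event */

    notify_parents (conf, GF_EVENT_CHILD_PING);
}

With a guard like that in place, the glusterd-only reconnect cycles leave last_sent_event untouched, so consecutive disconnects collapse into a single CHILD_DOWN again, as in the pre-Halo behaviour described above.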

Comment 3 Worker Ant 2019-06-04 14:10:13 UTC
REVIEW: https://review.gluster.org/22821 (protocol/client: propagte GF_EVENT_CHILD_PING only after a successful handshake) posted (#1) for review on master by Raghavendra G

Comment 4 Worker Ant 2019-06-27 14:11:38 UTC
REVIEW: https://review.gluster.org/22821 (protocol/client: propagte GF_EVENT_CHILD_PING only for connections to brick) merged (#9) on master by Raghavendra G

Comment 5 Amgad 2019-08-09 03:01:22 UTC
This is a serious bug and it is blocking deployments. I don't see it in the 6.x stream! Which release will include it, and when will that be released?

Comment 6 Ravishankar N 2019-08-09 05:05:14 UTC
I've sent the backport to the current release branches: https://review.gluster.org/#/q/topic:ref-1716979+(status:open+OR+status:merged)

Comment 7 Amgad 2019-08-11 04:31:46 UTC
Does that mean it's not yet in 6.x or 5.x? When is the release with the fix due?

Comment 8 Ravishankar N 2019-08-11 05:29:16 UTC
Yes, that is correct. The release schedule is at https://www.gluster.org/release-schedule/. I'm not sure of the dates; Hari should be able to tell you whether that schedule is still valid. I'm adding a needinfo on him.

That said, Amgad, could you explain why this bug is blocking your deployments? I do not see this as a blocker.

Comment 9 Amgad 2019-08-11 19:21:14 UTC
Thanks, Ravi. The link shows the initial release date and the maintenance date (the 30th). Does that mean the 6.5-1 coming on August 30th will include the fix?

The bug is blocking because the impact we saw during testing is very serious!

Ravi:
Would you kindly provide feedback on Bug 1739320 as well?

Comment 10 hari gowtham 2019-08-12 06:37:23 UTC
Hi Amgad,

The 5.x and 6.x branches have reached their slowed-down phase (releases after .3 or .4), so the next release is due in two months.
The dates have been changed; more info about the change can be found here:
https://lists.gluster.org/pipermail/gluster-devel/2019-August/056521.html

This month's releases of 5 and 6 are in the testing phase;
the announcement is expected by the end of the day.

Since we didn't learn of these blocker issues during the bug-gathering phase
(https://lists.gluster.org/pipermail/gluster-devel/2019-August/056500.html),
we missed them for this particular release.

Please do come forward with such bugs by then so we can plan things better,
make them easier, and thus get a better product.
We might miss backporting a few bugs we have already fixed,
so it would be great if you could come forward and help us.

@Ravi, thanks for the backports.

I will check the backports and take them in.
The backports of this bug will be part of the next release.
As to when that will be, it is expected about two months from now;
we will look into whether we can plan another release for next month.

Comment 11 Amgad 2019-08-14 03:56:27 UTC
Hari

Unfortunately, this bug made our platform unstable and we can't wait two months. Could it at least be backported to the 6.x release this month as a cherry-pick?
Appreciate the support!

Regards,
Amgad

Comment 12 hari gowtham 2019-08-20 06:12:55 UTC
We have backported the patches for this bug to every active release branch.
As for doing a release of 6 alone, we need to check with the others involved in the process and get back to you.

Comment 13 hari gowtham 2019-08-20 10:06:07 UTC
In the meantime, can you use the nightly RPMs available at http://artifacts.ci.centos.org/gluster/nightly/?
The latest RPM has the fix in it.

Comment 14 Amgad 2019-08-28 16:01:07 UTC
Thanks Hari:

>we have backported the patches for this bug to every active release branch.

What exactly does that mean? Does it mean the fix is in 6.3-1 now, for instance? Or 5.5-1?

Regards,
Amgad

Comment 15 hari gowtham 2019-08-30 10:57:58 UTC
(In reply to Amgad from comment #14)
> Thanks Hari:
> 
> >we have backported the patches for this bug to every active release branch.
> 
> What exactly does that mean? Does it mean the fix is in 6.3-1 now, for instance?
> Or 5.5-1?
> 
> Regards,
> Amgad

The bug was root-caused and found to exist on master as well as the other branches,
so the fixes have to be sent to those branches as well.

Comment 16 Amgad 2019-09-01 17:52:32 UTC
Hi Hari:

>The bug was root-caused and found to exist on master as well as the other branches,
>so the fixes have to be sent to those branches as well.

Based on your comment above, I understand that the fix is not yet in any official release (5.5-1, 6.3-1, or 6.5-1, for instance) and has to be built from source, whether from master or from the 6.x/5.x branches. Please confirm!

In other words, I have to build it from 6.x, and it will be code on top of the already released 6.5-1.

Comment 17 hari gowtham 2019-09-03 12:14:11 UTC
Hi Amgad,

Yes, it is not there in any of the official releases.
But if an RPM with the above fix is what you are looking for,
you can look into the nightly builds and make use of those.

If that workaround isn't what you need, you can build RPMs from the release-6 branch.
That will have all the fixes up to 6.5 plus the ones after it (this bug's fix is one of them).

Comment 18 Amgad 2019-09-03 14:57:11 UTC
Thanks, Hari, for the confirmation. I'll have to build from 6.x since we have two more IPv6 fixes to pick up (Bug 1739320).

Comment 19 hari gowtham 2019-09-04 06:55:03 UTC
Amgad, just a heads-up: once the IPv6 fix makes its way to release 6,
that fix (along with this bug's fix) will also be made available
in the nightly builds once they are built. So if the nightlies work for you, you can make use of the nightly build.