Issue was reported upstream by a user via https://github.com/gluster/glusterfs/issues/648
I'm seeing that if I kill a brick in a replica 3 system, AFR keeps getting child_down event repeatedly for the same child.
Version-Release number of selected component (if applicable):
master (source install)
Steps to Reproduce:
1. Create a replica 3 volume and start it.
2. Put a breakpoint in __afr_handle_child_down_event() in the glustershd process.
3. Kill any one brick.
Actual results:
The breakpoint keeps getting hit repeatedly, roughly once every 3 seconds.
Expected results:
Only one event per disconnect.
I haven't checked whether the same happens for GF_EVENT_CHILD_UP as well. I think this is a regression that needs to be fixed. If this is not a bug, please feel free to close it, stating why.
The multiple disconnect events are due to reconnects/disconnects to glusterd (port 24007). rpc/clnt has a reconnect feature that tries to reconnect to a disconnected brick, and the client's connection to a brick is a two-step process:
1. connect to glusterd, get brick port then disconnect
2. connect to brick
In this case step 1 succeeds but step 2 never happens, as glusterd will not send back the brick port (the brick is dead). Nevertheless there is a chain of connect/disconnect (to glusterd) at the rpc layer, and these are valid steps since we need the reconnection logic. However, subsequent disconnect events used to be prevented from reaching the parents of protocol/client: it remembered the last event it sent up, and if the current event matched the last one, it skipped the notification. Before the Halo replication feature (https://review.gluster.org/16177), last_sent_event for this test case would be GF_EVENT_DISCONNECT, so subsequent disconnects were not notified to the parent xlators. But Halo replication introduced another event, GF_EVENT_CHILD_PING, which is notified to the parents of protocol/client whenever there is a successful ping response. In this case the successful ping response comes from glusterd and changes conf->last_sent_event to GF_EVENT_CHILD_PING, so subsequent disconnect events are no longer skipped.
A patch that propagates GF_EVENT_CHILD_PING only after a successful handshake prevents the spurious CHILD_DOWN events from reaching afr. However, I am not sure whether this breaks Halo replication. I would request the afr team members to comment on the patch (I'll post it shortly).
REVIEW: https://review.gluster.org/22821 (protocol/client: propagte GF_EVENT_CHILD_PING only after a successful handshake) posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/22821 (protocol/client: propagte GF_EVENT_CHILD_PING only for connections to brick) merged (#9) on master by Raghavendra G
This is a serious bug and is blocking deployments -- I don't see it in the 6.x stream! Which release will it land in, and when will that be released?
I've sent the backport to the current release branches: https://review.gluster.org/#/q/topic:ref-1716979+(status:open+OR+status:merged)
Does that mean it's not yet in 6.x or 5.x? When is the release with the fix due?
Yes, that is correct. The release schedule is at https://www.gluster.org/release-schedule/. I'm not sure of the dates; Hari should be able to tell you whether the schedule is still valid. I'm adding a needinfo on him.
That said, Amgad, could you explain why this bug is blocking your deployments? I do not see this as a blocker.
Thanks Ravi. The link shows the initial release date and the maintenance date (the 30th). Does that mean 6.5-1, coming on August 30th, will include the fix?
The bug is blocking because the impact we saw during testing is very serious!
Would you kindly provide feedback on Bug 1739320 as well?
The 5.x and 6.x streams have reached their slowed-down phase (releases after .3 or .4),
so the next release is supposed to be in two months.
The dates have been changed and more info about the change can be found here:
This month's releases of 5 and 6 are in the testing phase;
the announcement is expected by the end of the day.
As we didn't learn of these blocker issues during the bug-gathering phase,
we missed these bugs for this particular release.
Please do come forward with such bugs in the future so we can plan things better,
make them easier, and thus get a better product.
We might also miss backporting a few fixes,
so it would be great if you could come forward and help us.
@Ravi, thanks for the backports.
I will check the backports and take them in.
The backports of this bug will be a part of the next release.
As to when the next release will be, it is supposed to be two months from now.
We will look into whether we can plan another release for next month.
Unfortunately, this bug made our platform unstable and we can't wait two months. Could it be backported, at least to the 6.x release, this month as a cherry-pick?
Appreciate the support!
We have backported the patches for this bug to every active release branch.
As for having a release 6 alone, we need to check with the others involved in the process and get back to you.
Meanwhile, you can use the nightly rpms available at http://artifacts.ci.centos.org/gluster/nightly/
The latest rpm has the fix in it.
>we have backported the patches for this bug to every active release branch.
What exactly does that mean? Does it mean the fix is in 6.3-1 now, for instance, or in 5.5-1?
(In reply to Amgad from comment #14)
> Thanks Hari:
> >we have backported the patches for this bug to every active release branch.
> What does exactly mean? does it mean the bug is in 6.3-1 now for instance?
> or 5.5-1?
The bug was root-caused on master and found to affect the other branches as well,
so the fixes had to be sent to those branches too.
>The bug was root-caused on master and found to affect the other branches as well,
>so the fixes had to be sent to those branches too.
Based on your comment above, I understand that the fix is not in any official release (5.5-1, 6.3-1, or 6.5-1, for instance) and has to be built from source, whether from master or the 6.x/5.x branches -- please confirm!
In other words, I would have to build from 6.x, and it would be code on top of the already-released 6.5-1.
Yes, it is not there in any of the official releases.
If an rpm with the above fix is what you are looking for,
you can look into the nightly build and make use of that.
If that workaround isn't what you need, you can build rpms from the release-6 branch.
That will have all the fixes up to 6.5 and the ones after it (this bug's fix is one of them).
Thanks Hari for confirmation. I'll have to build from 6.x since we had two more IPv6 fixes to pick (Bug 1739320).
Amgad, just a heads up: once the IPv6 fix makes its way to release 6,
that fix (along with this bug's fix) will also be available
in the nightly once it gets built. So if the nightlies are fine for you, you can make use of the nightly build.