+++ This bug was initially created as a clone of Bug #1478710 +++
+++ This bug was initially created as a clone of Bug #1477024 +++
+++ This bug was initially created as a clone of Bug #1477020 +++

Description of problem:
When one gluster pod is restarted on a CNS deployment with 3 gluster pods and around 100 volumes mounted to 100 app pods, the brick from the restarted pod fails to connect to the fuse mounts and the self-heal daemons. As a result, any new write to the mount fails to get written on the restarted brick. This issue is seen on all 100 volumes in the Trusted Storage Pool.

The following error messages are seen in the brick logs:

[2017-08-01 02:59:35.247187] E [server-helpers.c:388:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7effacfdb8c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x289cb) [0x7eff8dc659cb] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe064) [0x7eff8dc4b064] ) 0-server: invalid argument: client [Invalid argument]
[2017-08-01 02:59:35.247334] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-08-01 04:39:29.200776] E [server-helpers.c:388:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7effacfdb8c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x289cb) [0x7eff8dc659cb] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe064) [0x7eff8dc4b064] ) 0-server: invalid argument: client [Invalid argument]
[2017-08-01 04:39:29.200829] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully

Meanwhile, 'gluster vol status' shows that all bricks are up:
# gluster v status vol_fe3995a5e9b186486e7d01a326b296d4
Status of volume: vol_fe3995a5e9b186486e7d01a326b296d4
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.201:/var/lib/heketi/mounts/v
g_57416b0c6c42778c9fcc913f3e1aa6a0/brick_53
ae1a82ee4d7858018d1c53f3c61865/brick        49152     0          Y       810
Brick 10.70.46.203:/var/lib/heketi/mounts/v
g_e57848f756a6fd3b559c7ab5d0f026ed/brick_49
9e2845415a6d0337871206664c55b3/brick        49152     0          Y       1017
Brick 10.70.46.197:/var/lib/heketi/mounts/v
g_6fb7232af84e00b7c23ffdf9a825e355/brick_f7
7473532b0f3f483fbe7f5ac5c67811/brick        49152     0          Y       1041
Self-heal Daemon on localhost               N/A       N/A        Y       819
Self-heal Daemon on 10.70.46.203            N/A       N/A        Y       57006
Self-heal Daemon on 10.70.46.197            N/A       N/A        Y       57409

Task Status of Volume vol_fe3995a5e9b186486e7d01a326b296d4
------------------------------------------------------------------------------
There are no active volume tasks

In the above test, the gluster pod running on node 10.70.46.201 was restarted.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-35.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. Create a CNS setup with 100 app pods consuming 100 PVCs.
2. Restart one of the three gluster pods.

Actual results:
The brick process fails to connect to the fuse mounts or the self-heal daemons.

Expected results:
The brick process should connect to the fuse mounts, and self-heal should get triggered automatically.

Additional info:
Logs will be attached shortly.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-08-01 01:37:46 EDT ---

This bug is automatically being proposed for the current release of Container-Native Storage under active development, by setting the release flag 'cns-3.6.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.
--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-08-01 01:42:02 EDT ---

This bug is automatically being proposed for the current release of Red Hat Gluster Storage 3 under active development, by setting the release flag 'rhgs-3.3.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from krishnaram Karthick on 2017-08-01 01:50:44 EDT ---

The setup has been left in the same state for dev to have a look. Please ping me offline for the setup details.

This bug could also be related to https://bugzilla.redhat.com/show_bug.cgi?id=1476828. I'll leave it to dev to confirm whether the issues seen in the two cases are the same.

Proposing this as a blocker, since this is a common use case and a gluster pod failure in CNS should not cause all the volumes to get into this state.

--- Additional comment from Atin Mukherjee on 2017-08-01 03:58:02 EDT ---

Karthick - please do not attach this bug to the in-flight tracker. I know the process is a bit different in CNS; in RHGS, the tracker is attached when we ack the bug.
--- Additional comment from Rejy M Cyriac on 2017-08-02 03:46:44 EDT ---

At the 'RHGS 3.3.0 - Release Blocker Bug Triage and Status Check' meeting on 02 August, it was decided to ACCEPT this BZ for fix at the RHGS 3.3.0 release.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-08-02 03:46:56 EDT ---

Since this bug has been approved for the RHGS 3.3.0 release of Red Hat Gluster Storage 3, through release flag 'rhgs-3.3.0+', and through the Internal Whiteboard entry of '3.3.0', the Target Release is being automatically set to 'RHGS 3.3.0'.

--- Additional comment from Worker Ant on 2017-08-06 08:36:06 EDT ---

REVIEW: https://review.gluster.org/17984 (glusterd: Sometime on cns after pod is restarted client is getting Transport endpoint error while brick mux is on) posted (#1) for review on master by MOHIT AGRAWAL (moagrawa)

--- Additional comment from Worker Ant on 2017-08-06 08:42:44 EDT ---

REVIEW: https://review.gluster.org/17984 (glusterd: Sometime on cns after pod is restarted client is getting Transport endpoint error while brick mux is on) posted (#2) for review on master by MOHIT AGRAWAL (moagrawa)

--- Additional comment from Worker Ant on 2017-08-06 10:04:20 EDT ---

REVIEW: https://review.gluster.org/17984 (glusterd: Sometime on cns after pod is restarted client is getting Transport endpoint error while brick mux is on) posted (#3) for review on master by MOHIT AGRAWAL (moagrawa)

--- Additional comment from Worker Ant on 2017-08-06 11:40:55 EDT ---

REVIEW: https://review.gluster.org/17984 (glusterd: Sometime on cns after pod is restarted client is getting Transport endpoint error while brick mux is on) posted (#4) for review on master by MOHIT AGRAWAL (moagrawa)

--- Additional comment from Worker Ant on 2017-08-06 13:29:52 EDT ---

REVIEW: https://review.gluster.org/17984 (glusterd: Block brick attach request till the brick's ctx is set) posted (#5) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Worker Ant on 2017-08-06 13:57:52 EDT ---

REVIEW: https://review.gluster.org/17984 (glusterd: Block brick attach request till the brick's ctx is set) posted (#6) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Worker Ant on 2017-08-07 01:16:57 EDT ---

REVIEW: https://review.gluster.org/17984 (glusterd: Block brick attach request till the brick's ctx is set) posted (#7) for review on master by MOHIT AGRAWAL (moagrawa)

--- Additional comment from Worker Ant on 2017-08-07 01:31:19 EDT ---

REVIEW: https://review.gluster.org/17984 (glusterd: Block brick attach request till the brick's ctx is set) posted (#8) for review on master by MOHIT AGRAWAL (moagrawa)

--- Additional comment from Worker Ant on 2017-08-08 05:26:18 EDT ---

REVIEW: https://review.gluster.org/17984 (glusterd: Block brick attach request till the brick's ctx is set) posted (#9) for review on master by MOHIT AGRAWAL (moagrawa)

--- Additional comment from Worker Ant on 2017-08-08 06:03:48 EDT ---

REVIEW: https://review.gluster.org/17984 (glusterd: Block brick attach request till the brick's ctx is set) posted (#10) for review on master by MOHIT AGRAWAL (moagrawa)

--- Additional comment from Worker Ant on 2017-08-08 08:16:44 EDT ---

REVIEW: https://review.gluster.org/17984 (glusterd: Block brick attach request till the brick's ctx is set) posted (#11) for review on master by MOHIT AGRAWAL (moagrawa)

--- Additional comment from Worker Ant on 2017-08-08 11:29:14 EDT ---

REVIEW: https://review.gluster.org/17984 (glusterd: Block brick attach request till the brick's ctx is set) posted (#12) for review on master by MOHIT AGRAWAL (moagrawa)

--- Additional comment from Worker Ant on 2017-08-08 18:28:34 EDT ---

COMMIT: https://review.gluster.org/17984 committed in master by Jeff Darcy (jeff.us)

------

commit c13d69babc228a2932994962d6ea8afe2cdd620a
Author: Mohit Agrawal <moagrawa>
Date:   Tue Aug 8 14:36:17 2017 +0530

    glusterd: Block brick attach request till the brick's ctx is set

    Problem: In a brick-multiplexing setup in a container environment we
    hit a race where, before the first brick finishes its handshake with
    glusterd, the subsequent attach requests go through; they actually
    fail, and glusterd has no mechanism to realize it. As a result all
    such bricks stay inactive and clients cannot connect.

    Solution: Introduce a new flag port_registered in glusterd_brickinfo
    to make sure pmap_signin has finished before the subsequent attach
    brick requests are processed.

    Test: To reproduce the issue, follow the steps below:
    1) Create 100 volumes on 3 nodes (1x3) in a CNS environment
    2) Enable brick multiplexing
    3) Reboot one container
    4) Run the command below:
       for v in `gluster v list`; do glfsheal $v | grep -i "transport"; done
    After applying the patch, the command should not fail.

    Note: A big thanks to Atin for suggesting the fix.

    BUG: 1478710
    Change-Id: I8e1bd6132122b3a5b0dd49606cea564122f2609b
    Signed-off-by: Mohit Agrawal <moagrawa>
    Reviewed-on: https://review.gluster.org/17984
    Reviewed-by: Atin Mukherjee <amukherj>
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Jeff Darcy <jeff.us>
REVIEW: https://review.gluster.org/18004 (glusterd: Block brick attach request till the brick's ctx is set) posted (#1) for review on release-3.12 by Atin Mukherjee (amukherj)
COMMIT: https://review.gluster.org/18004 committed in release-3.12 by Shyamsundar Ranganathan (srangana)

------

commit d66af9ac76f84faa33ecb2eb390656f5637e6fee
Author: Mohit Agrawal <moagrawa>
Date:   Tue Aug 8 14:36:17 2017 +0530

    glusterd: Block brick attach request till the brick's ctx is set

    Problem: In a brick-multiplexing setup in a container environment we
    hit a race where, before the first brick finishes its handshake with
    glusterd, the subsequent attach requests go through; they actually
    fail, and glusterd has no mechanism to realize it. As a result all
    such bricks stay inactive and clients cannot connect.

    Solution: Introduce a new flag port_registered in glusterd_brickinfo
    to make sure pmap_signin has finished before the subsequent attach
    brick requests are processed.

    Test: To reproduce the issue, follow the steps below:
    1) Create 100 volumes on 3 nodes (1x3) in a CNS environment
    2) Enable brick multiplexing
    3) Reboot one container
    4) Run the command below:
       for v in `gluster v list`; do glfsheal $v | grep -i "transport"; done
    After applying the patch, the command should not fail.

    Note: A big thanks to Atin for suggesting the fix.

    >Reviewed-on: https://review.gluster.org/17984
    >Reviewed-by: Atin Mukherjee <amukherj>
    >Smoke: Gluster Build System <jenkins.org>
    >CentOS-regression: Gluster Build System <jenkins.org>
    >Reviewed-by: Jeff Darcy <jeff.us>
    >(cherry picked from commit c13d69babc228a2932994962d6ea8afe2cdd620a)

    BUG: 1479662
    Change-Id: I8e1bd6132122b3a5b0dd49606cea564122f2609b
    Signed-off-by: Mohit Agrawal <moagrawa>
    Reviewed-on: https://review.gluster.org/18004
    Tested-by: Atin Mukherjee <amukherj>
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Shyamsundar Ranganathan <srangana>
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and on the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/