Description of problem:
When one gluster pod is restarted on a CNS deployment with 3 gluster pods and around 100 volumes mounted by 100 app pods, the brick on the restarted pod fails to connect to the fuse mounts and self-heal daemons. As a result, any new write to the mount is not written to the restarted brick. This issue is seen on all 100 volumes in the Trusted Storage Pool.

The following error messages are seen in the brick logs:

[2017-08-01 02:59:35.247187] E [server-helpers.c:388:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7effacfdb8c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x289cb) [0x7eff8dc659cb] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe064) [0x7eff8dc4b064] ) 0-server: invalid argument: client [Invalid argument]
[2017-08-01 02:59:35.247334] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-08-01 04:39:29.200776] E [server-helpers.c:388:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7effacfdb8c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x289cb) [0x7eff8dc659cb] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe064) [0x7eff8dc4b064] ) 0-server: invalid argument: client [Invalid argument]
[2017-08-01 04:39:29.200829] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully

gluster vol status nevertheless shows that all bricks are up:
# gluster v status vol_fe3995a5e9b186486e7d01a326b296d4
Status of volume: vol_fe3995a5e9b186486e7d01a326b296d4
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.201:/var/lib/heketi/mounts/v
g_57416b0c6c42778c9fcc913f3e1aa6a0/brick_53
ae1a82ee4d7858018d1c53f3c61865/brick        49152     0          Y       810
Brick 10.70.46.203:/var/lib/heketi/mounts/v
g_e57848f756a6fd3b559c7ab5d0f026ed/brick_49
9e2845415a6d0337871206664c55b3/brick        49152     0          Y       1017
Brick 10.70.46.197:/var/lib/heketi/mounts/v
g_6fb7232af84e00b7c23ffdf9a825e355/brick_f7
7473532b0f3f483fbe7f5ac5c67811/brick        49152     0          Y       1041
Self-heal Daemon on localhost               N/A       N/A        Y       819
Self-heal Daemon on 10.70.46.203            N/A       N/A        Y       57006
Self-heal Daemon on 10.70.46.197            N/A       N/A        Y       57409

Task Status of Volume vol_fe3995a5e9b186486e7d01a326b296d4
------------------------------------------------------------------------------
There are no active volume tasks

In the above test, the gluster pod running on node 10.70.46.201 was restarted.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-35.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. Create a CNS setup with 100 app pods consuming 100 PVCs.
2. Restart one of the three gluster pods.

Actual results:
The brick process fails to connect to the fuse mounts or the self-heal daemons.

Expected results:
The brick process should connect to the fuse mounts, and self-heal should be triggered automatically.

Additional info:
Logs shall be attached shortly.
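The reproduce/verify procedure above can be sketched as a command transcript. This is illustrative only: the pod names (glusterfs-xxxx, glusterfs-yyyy) and the pod selector label are placeholders that depend on the particular CNS deployment, and only the volume name is taken from this report.

```shell
# Hypothetical transcript of the reproduce/verify steps; pod names and the
# label selector are placeholders, not values from this bug report.

# 1. List the gluster pods in the storage project (label varies by install).
oc get pods -o wide -l glusterfs=pod

# 2. Restart one gluster pod by deleting it; its controller recreates it.
oc delete pod glusterfs-xxxx        # placeholder pod name

# 3. From any gluster pod, confirm all bricks report Online = Y ...
oc rsh glusterfs-yyyy gluster volume status vol_fe3995a5e9b186486e7d01a326b296d4

# 4. ... and check that self-heal ran (pending-heal entries should reach 0).
oc rsh glusterfs-yyyy gluster volume heal vol_fe3995a5e9b186486e7d01a326b296d4 info
```

In the failure described here, step 3 reports every brick as online even though the restarted brick never reconnects to the clients, so step 4 (heal info) is the check that actually exposes the problem.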
Container image rhgs3/rhgs-server-rhel7:3.3.0-11 has the fix.
Verified this bug in the CNS build cns-deploy-5.0.0-15.el7rhgs.x86_64 (glusterfs-3.8.4-40.el7rhgs.x86_64). After deleting a gluster pod with 100 volumes, the pod comes back up, all bricks are online, and self-heal completes. Moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:2877