Description krishnaram Karthick
2017-08-01 05:37:37 UTC
Description of problem:
When one gluster pod is restarted on a CNS deployment that has 3 gluster pods and around 100 volumes mounted on 100 app pods, the bricks from the restarted pod fail to connect to the mounts and to the self-heal daemons.
As a result, new writes to the mounts are not written to the bricks on the restarted pod.
This issue is seen on all 100 volumes in the Trusted Storage Pool.
The following error messages are seen in the brick logs:
[2017-08-01 02:59:35.247187] E [server-helpers.c:388:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7effacfdb8c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x289cb) [0x7eff8dc659cb] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe064) [0x7eff8dc4b064] ) 0-server: invalid argument: client [Invalid argument]
[2017-08-01 02:59:35.247334] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-08-01 04:39:29.200776] E [server-helpers.c:388:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7effacfdb8c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x289cb) [0x7eff8dc659cb] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe064) [0x7eff8dc4b064] ) 0-server: invalid argument: client [Invalid argument]
[2017-08-01 04:39:29.200829] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
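To check how often this signature appears across many brick logs, a small script can help. The following is a minimal sketch, assuming the log line layout shown above (timestamp, severity letter, then `[file:line:function]`); the function name `find_alloc_frame_errors` is hypothetical and not part of GlusterFS.

```python
import re

# Assumed layout of a brick log line, taken from the entries quoted above:
# "[2017-08-01 02:59:35.247187] E [server-helpers.c:388:server_alloc_frame] ..."
LOG_LINE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] "
    r"(?P<level>[A-Z]) "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]"
)


def find_alloc_frame_errors(log_text):
    """Return the timestamps of error-level server_alloc_frame entries."""
    hits = []
    for line in log_text.splitlines():
        m = LOG_LINE.match(line)
        if m and m.group("level") == "E" and m.group("func") == "server_alloc_frame":
            hits.append(m.group("ts"))
    return hits
```

Passing the text of a brick log to `find_alloc_frame_errors` returns one timestamp per matching entry, so an empty list means the signature was not found in that log.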
gluster vol status shows that all bricks are up.
gluster v status vol_fe3995a5e9b186486e7d01a326b296d4
Status of volume: vol_fe3995a5e9b186486e7d01a326b296d4
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.201:/var/lib/heketi/mounts/v
g_57416b0c6c42778c9fcc913f3e1aa6a0/brick_53
ae1a82ee4d7858018d1c53f3c61865/brick        49152     0          Y       810
Brick 10.70.46.203:/var/lib/heketi/mounts/v
g_e57848f756a6fd3b559c7ab5d0f026ed/brick_49
9e2845415a6d0337871206664c55b3/brick        49152     0          Y       1017
Brick 10.70.46.197:/var/lib/heketi/mounts/v
g_6fb7232af84e00b7c23ffdf9a825e355/brick_f7
7473532b0f3f483fbe7f5ac5c67811/brick        49152     0          Y       1041
Self-heal Daemon on localhost               N/A       N/A        Y       819
Self-heal Daemon on 10.70.46.203            N/A       N/A        Y       57006
Self-heal Daemon on 10.70.46.197            N/A       N/A        Y       57409

Task Status of Volume vol_fe3995a5e9b186486e7d01a326b296d4
------------------------------------------------------------------------------
There are no active volume tasks
In the above test, gluster pod running on node 10.70.46.201 was restarted.
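With 100 volumes, reading the Online column by hand is tedious. The following is a minimal sketch of a parser for the brick rows of the status output shown above, assuming brick paths may wrap across lines and that each row ends with the TCP port, RDMA port, Online flag, and PID. The name `parse_bricks` is hypothetical. Note that, as this report shows, an Online value of Y does not by itself prove that clients and self-heal daemons are connected to the brick.

```python
import re

# A completed brick row ends with: <tcp port> <rdma port> <Y|N> <pid>
TAIL = re.compile(
    r"^(?P<name>.*?)\s+(?P<tcp>\S+)\s+(?P<rdma>\S+)\s+(?P<online>[YN])\s+(?P<pid>\S+)$"
)


def parse_bricks(status_text):
    """Return (brick, is_online) pairs from 'gluster volume status' text."""
    bricks = []
    pending = None  # brick row being accumulated across wrapped lines
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            pending = line[len("Brick "):]
        elif pending is not None:
            pending += line  # wrapped path fragments join without a space
        else:
            continue  # header, separator, or non-brick row
        m = TAIL.match(pending)
        if m:
            bricks.append((m.group("name"), m.group("online") == "Y"))
            pending = None
    return bricks
```

Any pair whose second element is `False` identifies a brick that the status command reports as offline.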
Version-Release number of selected component (if applicable):
glusterfs-3.8.4-35.el7rhgs.x86_64
How reproducible:
1/1
Steps to Reproduce:
1. Create a CNS setup with 100 app pods consuming 100 PVCs.
2. Restart one of the three gluster pods.
Actual results:
The brick processes on the restarted pod fail to connect to the fuse mounts or the self-heal daemons.
Expected results:
The brick processes should connect to the fuse mounts, and self-heal should be triggered automatically.
Additional info:
Logs will be attached shortly.
Container image rhgs3/rhgs-server-rhel7:3.3.0-11 has the fix
Comment 7 krishnaram Karthick
2017-08-22 05:38:38 UTC
Verified this bug in cns build - cns-deploy-5.0.0-15.el7rhgs.x86_64 (glusterfs-3.8.4-40.el7rhgs.x86_64)
After deleting a gluster pod that hosts 100 volumes, the pod comes back up, all the bricks are up, and self-heal completes.
Moving the bug to verified.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2017:2877