Bug 1477020

Summary: [Tracker Bug (RHGS)] when gluster pod is restarted, bricks from the restarted pod fails to connect to fuse, self-heal etc
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: krishnaram Karthick <kramdoss>
Component: CNS-deployment
Assignee: Michael Adam <madam>
Status: CLOSED ERRATA
QA Contact: krishnaram Karthick <kramdoss>
Severity: high
Docs Contact:
Priority: unspecified
Version: cns-3.6
CC: akhakhar, annair, hchiramm, jarrpa, madam, mliyazud, mzywusko, pprakash, rhs-bugs, rreddy, rtalur
Target Milestone: ---   
Target Release: CNS 3.6   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1477024
Environment:
Last Closed: 2017-10-11 06:58:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1477024    
Bug Blocks: 1445448    

Description krishnaram Karthick 2017-08-01 05:37:37 UTC
Description of problem:
When one gluster pod is restarted on a CNS deployment with 3 gluster pods and around 100 volumes mounted by 100 app pods, the bricks from the restarted pod fail to connect to the FUSE mounts and the self-heal daemons.

As a result, any new write to the mount fails to get written to the brick on the restarted pod.

This issue is seen on all the 100 volumes in the Trusted Storage Pool.

The following error messages are seen in the brick logs:

[2017-08-01 02:59:35.247187] E [server-helpers.c:388:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7effacfdb8c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x289cb) [0x7eff8dc659cb] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe064) [0x7eff8dc4b064] ) 0-server: invalid argument: client [Invalid argument]
[2017-08-01 02:59:35.247334] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-08-01 04:39:29.200776] E [server-helpers.c:388:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7effacfdb8c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x289cb) [0x7eff8dc659cb] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe064) [0x7eff8dc4b064] ) 0-server: invalid argument: client [Invalid argument]
[2017-08-01 04:39:29.200829] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
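
When many bricks are involved, it can help to count how often each error shows up per log file. The following is only a rough sketch: it assumes the log line layout seen in the excerpt above ("[timestamp] LEVEL [file.c:line:function] ..."), and the function name summarize_errors is made up for illustration, not part of any gluster tooling.

```python
import re
from collections import Counter

# Leading part of a brick log line as it appears in the excerpt above:
#   [2017-08-01 02:59:35.247334] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] ...
# Assumption: every log line of interest starts with this layout.
LOG_HEAD = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] "
    r"(?P<level>[A-Z]) "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]"
)


def summarize_errors(lines):
    """Count error-level ("E") log entries per emitting function.

    Lines that do not match the expected layout are ignored.
    """
    counts = Counter()
    for line in lines:
        match = LOG_HEAD.match(line)
        if match and match.group("level") == "E":
            counts[match.group("func")] += 1
    return counts
```

Passing the lines of a brick log file to summarize_errors returns a Counter keyed by function name, so a large count for server_alloc_frame would point at the failure reported here.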

gluster vol status shows that all bricks are up.

gluster v status vol_fe3995a5e9b186486e7d01a326b296d4 
Status of volume: vol_fe3995a5e9b186486e7d01a326b296d4
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.201:/var/lib/heketi/mounts/v
g_57416b0c6c42778c9fcc913f3e1aa6a0/brick_53
ae1a82ee4d7858018d1c53f3c61865/brick        49152     0          Y       810  
Brick 10.70.46.203:/var/lib/heketi/mounts/v
g_e57848f756a6fd3b559c7ab5d0f026ed/brick_49
9e2845415a6d0337871206664c55b3/brick        49152     0          Y       1017 
Brick 10.70.46.197:/var/lib/heketi/mounts/v
g_6fb7232af84e00b7c23ffdf9a825e355/brick_f7
7473532b0f3f483fbe7f5ac5c67811/brick        49152     0          Y       1041 
Self-heal Daemon on localhost               N/A       N/A        Y       819  
Self-heal Daemon on 10.70.46.203            N/A       N/A        Y       57006
Self-heal Daemon on 10.70.46.197            N/A       N/A        Y       57409
 
Task Status of Volume vol_fe3995a5e9b186486e7d01a326b296d4
------------------------------------------------------------------------------
There are no active volume tasks

In the above test, gluster pod running on node 10.70.46.201 was restarted.
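
The status output wraps long brick paths over several lines, which makes it awkward to check many volumes by script. Below is a minimal sketch of a parser for that layout, assuming the column order shown above (TCP Port, RDMA Port, Online, Pid); parse_bricks is a hypothetical helper name. Note that Online = Y only reports the state shown in this table; as this bug shows, it does not prove that clients or the self-heal daemon are actually connected to the brick.

```python
import re
from typing import Dict, List

# Trailing columns of a completed row: TCP port, RDMA port, Online flag, PID.
# Assumption: the column order matches the status output quoted above.
ROW_TAIL = re.compile(
    r"^(?P<name>.*?)\s+(?P<tcp>\S+)\s+(?P<rdma>\S+)"
    r"\s+(?P<online>[YN])\s+(?P<pid>\S+)$"
)


def parse_bricks(status_output: str) -> List[Dict[str, str]]:
    """Return one dict per brick row, joining paths wrapped across lines."""
    bricks = []
    pending = None  # brick path fragments collected so far
    for raw in status_output.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            pending = line[len("Brick "):]
        elif pending is not None:
            pending += line  # continuation of a wrapped brick path
        else:
            continue  # header, separator, or non-brick row
        match = ROW_TAIL.match(pending)
        if match:
            bricks.append(match.groupdict())
            pending = None
    return bricks
```

Each returned dict has the keys name, tcp, rdma, online and pid. To see which clients are really connected to a brick, the output of "gluster volume status VOLNAME clients" is likely the more useful thing to inspect.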

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-35.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. Create a CNS setup with 100 app pods consuming 100 PVCs.
2. Restart one of the three gluster pods.

Actual results:
The brick process fails to connect to the FUSE mount or the self-heal daemon.

Expected results:
The brick process should connect to the FUSE mount, and self-heal should be triggered automatically.

Additional info:
Logs will be attached shortly.

Comment 6 Humble Chirammal 2017-08-16 07:17:10 UTC
Container image rhgs3/rhgs-server-rhel7:3.3.0-11 has the fix

Comment 7 krishnaram Karthick 2017-08-22 05:38:38 UTC
Verified this bug in cns build - cns-deploy-5.0.0-15.el7rhgs.x86_64 (glusterfs-3.8.4-40.el7rhgs.x86_64)

After deleting a gluster pod with 100 volumes, the gluster pod comes back up, all the bricks are up, and self-heal completes.

Moving the bug to verified.

Comment 8 errata-xmlrpc 2017-10-11 06:58:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2877