Bug 1477020 - [Tracker Bug (RHGS)] when gluster pod is restarted, bricks from the restarted pod fails to connect to fuse, self-heal etc
[Tracker Bug (RHGS)] when gluster pod is restarted, bricks from the restarted...
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: CNS-deployment (Show other bugs)
cns-3.6
Unspecified Unspecified
unspecified Severity high
: ---
: CNS 3.6
Assigned To: Michael Adam
krishnaram Karthick
:
Depends On: 1477024
Blocks: 1445448
  Show dependency treegraph
 
Reported: 2017-08-01 01:37 EDT by krishnaram Karthick
Modified: 2017-10-11 02:58 EDT (History)
11 users (show)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1477024 (view as bug list)
Environment:
Last Closed: 2017-10-11 02:58:29 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description krishnaram Karthick 2017-08-01 01:37:37 EDT
Description of problem:
when one gluster pod is restarted on a CNS deployment with 3 gluster pods with around 100 volumes mounted to 100 app pods, brick from the restarted pod fails to connect to mount, self-healing daemons.

As a result, Any new write to the mount fails to get written on the new brick. 

This issue is seen on all the 100 volumes in the Trusted Storage Pool.

Following error messages are seen in the brick logs.

[2017-08-01 02:59:35.247187] E [server-helpers.c:388:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7effacfdb8c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x289cb) [0x7eff8dc659cb] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe064) [0x7eff8dc4b064] ) 0-server: invalid argument: client [Invalid argument]
[2017-08-01 02:59:35.247334] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-08-01 04:39:29.200776] E [server-helpers.c:388:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7effacfdb8c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x289cb) [0x7eff8dc659cb] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe064) [0x7eff8dc4b064] ) 0-server: invalid argument: client [Invalid argument]
[2017-08-01 04:39:29.200829] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully

gluster vol status shows that all bricks are up.

gluster v status vol_fe3995a5e9b186486e7d01a326b296d4 
Status of volume: vol_fe3995a5e9b186486e7d01a326b296d4
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.201:/var/lib/heketi/mounts/v
g_57416b0c6c42778c9fcc913f3e1aa6a0/brick_53
ae1a82ee4d7858018d1c53f3c61865/brick        49152     0          Y       810  
Brick 10.70.46.203:/var/lib/heketi/mounts/v
g_e57848f756a6fd3b559c7ab5d0f026ed/brick_49
9e2845415a6d0337871206664c55b3/brick        49152     0          Y       1017 
Brick 10.70.46.197:/var/lib/heketi/mounts/v
g_6fb7232af84e00b7c23ffdf9a825e355/brick_f7
7473532b0f3f483fbe7f5ac5c67811/brick        49152     0          Y       1041 
Self-heal Daemon on localhost               N/A       N/A        Y       819  
Self-heal Daemon on 10.70.46.203            N/A       N/A        Y       57006
Self-heal Daemon on 10.70.46.197            N/A       N/A        Y       57409
 
Task Status of Volume vol_fe3995a5e9b186486e7d01a326b296d4
------------------------------------------------------------------------------
There are no active volume tasks

In the above test, gluster pod running on node 10.70.46.201 was restarted.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-35.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. create a cns setup with 100 app pods consuming 100 pvc
2. restart one of the three gluster pod

Actual results:
brick process fails to connect to fuse mount or self-heal 

Expected results:
brick process should connect to fuse mount, self-heal should get triggered automatically

Additional info:
Logs shall be attached shortly
Comment 6 Humble Chirammal 2017-08-16 03:17:10 EDT
Container image rhgs3/rhgs-server-rhel7:3.3.0-11 has the fix
Comment 7 krishnaram Karthick 2017-08-22 01:38:38 EDT
Verified this bug in cns build - cns-deploy-5.0.0-15.el7rhgs.x86_64 (glusterfs-3.8.4-40.el7rhgs.x86_64)

Deleting gluster pod with 100 volumes brings up the gluster pod and all the bricks are up, self-heal completes.

Moving the bug to verified.
Comment 8 errata-xmlrpc 2017-10-11 02:58:29 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2877

Note You need to log in before you can comment on or make changes to this bug.