Description of problem:
When one gluster pod is restarted on a CNS deployment with 3 gluster pods and around 100 volumes mounted by 100 app pods, the brick on the restarted pod fails to connect to the fuse mounts and self-heal daemons. As a result, any new write to the mount is not written to the restarted brick. This issue is seen on all 100 volumes in the Trusted Storage Pool.

The following error messages are seen in the brick logs:

[2017-08-01 02:59:35.247187] E [server-helpers.c:388:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7effacfdb8c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x289cb) [0x7eff8dc659cb] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe064) [0x7eff8dc4b064] ) 0-server: invalid argument: client [Invalid argument]
[2017-08-01 02:59:35.247334] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-08-01 04:39:29.200776] E [server-helpers.c:388:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7effacfdb8c5] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x289cb) [0x7eff8dc659cb] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xe064) [0x7eff8dc4b064] ) 0-server: invalid argument: client [Invalid argument]
[2017-08-01 04:39:29.200829] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully

gluster vol status nevertheless shows that all bricks are up:
# gluster v status vol_fe3995a5e9b186486e7d01a326b296d4
Status of volume: vol_fe3995a5e9b186486e7d01a326b296d4
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.201:/var/lib/heketi/mounts/v
g_57416b0c6c42778c9fcc913f3e1aa6a0/brick_53
ae1a82ee4d7858018d1c53f3c61865/brick        49152     0          Y       810
Brick 10.70.46.203:/var/lib/heketi/mounts/v
g_e57848f756a6fd3b559c7ab5d0f026ed/brick_49
9e2845415a6d0337871206664c55b3/brick        49152     0          Y       1017
Brick 10.70.46.197:/var/lib/heketi/mounts/v
g_6fb7232af84e00b7c23ffdf9a825e355/brick_f7
7473532b0f3f483fbe7f5ac5c67811/brick        49152     0          Y       1041
Self-heal Daemon on localhost               N/A       N/A        Y       819
Self-heal Daemon on 10.70.46.203            N/A       N/A        Y       57006
Self-heal Daemon on 10.70.46.197            N/A       N/A        Y       57409

Task Status of Volume vol_fe3995a5e9b186486e7d01a326b296d4
------------------------------------------------------------------------------
There are no active volume tasks

In the above test, the gluster pod running on node 10.70.46.201 was restarted.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-35.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. Create a CNS setup with 100 app pods consuming 100 PVCs.
2. Restart one of the three gluster pods.

Actual results:
The brick process fails to connect to the fuse mounts or the self-heal daemons.

Expected results:
The brick process should connect to the fuse mounts, and self-heal should be triggered automatically.

Additional info:
Logs shall be attached shortly.
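The reproduce/verify procedure above can be sketched as a command transcript. This is illustrative only: the pod names (glusterfs-xxxx, glusterfs-yyyy) and the pod selector label are placeholders that depend on the particular CNS deployment, and only the volume name is taken from this report.

```shell
# Hypothetical transcript of the reproduce/verify steps; pod names and the
# label selector are placeholders, not values from this bug report.

# 1. List the gluster pods in the storage project (label varies by install).
oc get pods -o wide -l glusterfs=pod

# 2. Restart one gluster pod by deleting it; its controller recreates it.
oc delete pod glusterfs-xxxx        # placeholder pod name

# 3. From any gluster pod, confirm all bricks report Online = Y ...
oc rsh glusterfs-yyyy gluster volume status vol_fe3995a5e9b186486e7d01a326b296d4

# 4. ... and check that self-heal ran (pending-heal entries should reach 0).
oc rsh glusterfs-yyyy gluster volume heal vol_fe3995a5e9b186486e7d01a326b296d4 info
```

In the failure described here, step 3 reports every brick as online even though the restarted brick never reconnects to the clients, so step 4 (heal info) is the check that actually exposes the problem.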
Container image rhgs3/rhgs-server-rhel7:3.3.0-11 has the fix.
Verified this bug in the CNS build cns-deploy-5.0.0-15.el7rhgs.x86_64 (glusterfs-3.8.4-40.el7rhgs.x86_64). After deleting a gluster pod with 100 volumes, the pod comes back up, all bricks are online, and self-heal completes. Moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:2877