Bug 1160233

Summary:	[USS] : Rebalance process tries to connect to snapd and in case when snapd crashes it might affect rebalance process
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	senaik
Component:	snapshot	Assignee:	Avra Sengupta <asengupt>
Status:	CLOSED ERRATA	QA Contact:	senaik
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	rhgs-3.0	CC:	asengupt, rhinduja, rhs-bugs, rjoseph, storage-qa-internal, surs
Target Milestone:	---	Keywords:	ZStream
Target Release:	RHGS 3.0.3
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	USS
Fixed In Version:	glusterfs-3.6.0.35-1	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:
Clones:	1164711		Environment:
Last Closed:	2015-01-15 13:41:49 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1162694, 1164711

Description senaik 2014-11-04 11:57:25 UTC

Description of problem:
======================
If snapd is down or has crashed, the rebalance process hangs because it keeps trying to connect to snapd. The rebalance status then shows "in progress" for a long time, as the process is still attempting to reach the crashed snapd.


Version-Release number of selected component (if applicable):
==============================================================
glusterfs 3.6.0.30

How reproducible:
================
1/1


Steps to Reproduce:
==================
1. Create a 2x2 distributed-replicate volume and start it.

2. FUSE and NFS mount the volume and generate some I/O.

3. While I/O is in progress, create some snapshots.

4. After the snapshots are created, cd to .snaps and access the snapshots. This resulted in a snapd crash (tracked by bz 1160138).

5. Check the rebalance status:
gluster v rebalance vol2 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                2      294Bytes          1508             0             0          in progress            6332.00
       snapshot14.lab.eng.blr.redhat.com                0        0Bytes         12838             0             0            completed             157.00
       snapshot15.lab.eng.blr.redhat.com               14         6.6KB          1540             0             0          in progress            6332.00
       snapshot16.lab.eng.blr.redhat.com                0        0Bytes         12828             0             0            completed             120.00
volume rebalance: vol2: success: 

The rebalance process is in progress on two nodes and remains in that state because it keeps trying to connect to snapd, which has crashed.
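
To spot this condition without reading the table by eye, the status output can be parsed for nodes that are still "in progress" after a long run time. The following is only a minimal sketch: the function name stuck_nodes and the 3600-second threshold are assumptions chosen for illustration, and the regular expression is written against the column layout shown in this report, which may differ in other glusterfs versions.

```python
import re

# One data row of "gluster v rebalance <vol> status", as laid out in this report:
# node, rebalanced files, size, scanned, failures, skipped, status, run time.
ROW = re.compile(
    r"^\s*(?P<node>\S+)\s+(?P<files>\d+)\s+(?P<size>\S+)\s+(?P<scanned>\d+)\s+"
    r"(?P<failures>\d+)\s+(?P<skipped>\d+)\s+(?P<status>.+?)\s+(?P<secs>\d+\.\d+)\s*$"
)


def stuck_nodes(status_output, min_secs=3600.0):
    """Return (node, run_time) pairs still 'in progress' after min_secs.

    min_secs is a hypothetical threshold, not a value defined by gluster.
    """
    stuck = []
    for line in status_output.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, separator, or summary line
        secs = float(m.group("secs"))
        if m.group("status") == "in progress" and secs >= min_secs:
            stuck.append((m.group("node"), secs))
    return stuck


# Rows copied from the status output in this report.
SAMPLE = """\
                               localhost                2      294Bytes          1508             0             0          in progress            6332.00
       snapshot14.lab.eng.blr.redhat.com                0        0Bytes         12838             0             0            completed             157.00
       snapshot15.lab.eng.blr.redhat.com               14         6.6KB          1540             0             0          in progress            6332.00
       snapshot16.lab.eng.blr.redhat.com                0        0Bytes         12828             0             0            completed             120.00
"""

if __name__ == "__main__":
    for node, secs in stuck_nodes(SAMPLE):
        print(f"{node}: in progress for {secs:.0f}s")
```

A long run time alone does not prove a hang, so a result from this check is best read together with the rebalance log.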

------------Part of rebalance log-------------

[2014-11-04 10:48:33.215941] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 2-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:33.222052] E [socket.c:2169:socket_connect_finish] 2-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:34.228429] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 6-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:34.234436] E [socket.c:2169:socket_connect_finish] 6-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:36.241848] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 2-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:36.248168] E [socket.c:2169:socket_connect_finish] 2-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:37.253538] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 6-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:37.259399] E [socket.c:2169:socket_connect_finish] 6-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:39.266618] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 2-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:39.272841] E [socket.c:2169:socket_connect_finish] 2-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:40.278413] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 6-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:40.284449] E [socket.c:2169:socket_connect_finish] 6-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:42.290661] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 2-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:42.297403] E [socket.c:2169:socket_connect_finish] 2-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:43.302585] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 6-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:43.309132] E [socket.c:2169:socket_connect_finish] 6-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
-----------------------------------------------------------------
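
The repeated reconnect attempts in the log excerpt above can be summarized by counting "Connection refused" errors per client. This is a rough sketch only: the helper name count_refused is hypothetical, and the pattern is written against the exact message format seen in this log, so it is an assumption that other glusterfs versions log the same text.

```python
import re
from collections import Counter

# Matches the socket_connect_finish error lines as they appear in this report.
REFUSED = re.compile(
    r"^\[(?P<ts>[^\]]+)\] E \[socket\.c:\d+:socket_connect_finish\] "
    r"(?P<client>\S+): connection to (?P<addr>\S+) failed \(Connection refused\)"
)


def count_refused(log_text):
    """Count 'Connection refused' errors per (client, address) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        m = REFUSED.match(line)
        if m:
            counts[(m.group("client"), m.group("addr"))] += 1
    return counts


# First six lines of the log excerpt in this report.
SAMPLE_LOG = """\
[2014-11-04 10:48:33.215941] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 2-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:33.222052] E [socket.c:2169:socket_connect_finish] 2-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:34.228429] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 6-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:34.234436] E [socket.c:2169:socket_connect_finish] 6-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:36.241848] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 2-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:36.248168] E [socket.c:2169:socket_connect_finish] 2-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
"""

if __name__ == "__main__":
    for (client, addr), n in sorted(count_refused(SAMPLE_LOG).items()):
        print(f"{client} -> {addr}: {n} refused")
```

A count that keeps growing for a snapd client while the rebalance status stays "in progress" matches the behavior described in this bug.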

Actual results:
===============
The rebalance process hangs if snapd crashes.


Expected results:
================
If snapd crashes, the rebalance process should not be affected. The rebalance process should not access snapshots at all; otherwise a snapd crash can leave it hanging while it tries to connect to the crashed snapd.


Additional info:

Comment 4 Avra Sengupta 2014-12-01 07:18:20 UTC

Fixed with https://code.engineering.redhat.com/gerrit/37556

Comment 5 senaik 2014-12-03 12:27:15 UTC

Version: glusterfs 3.6.0.35
=======
While the rebalance process was in progress, I stopped snapd on different servers, and the rebalance process completed successfully.

Marking the bug as 'Verified'

Comment 7 errata-xmlrpc 2015-01-15 13:41:49 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0038.html