Bug 1160233

Summary: [USS] : Rebalance process tries to connect to snapd and in case when snapd crashes it might affect rebalance process
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: senaik
Component: snapshot Assignee: Avra Sengupta <asengupt>
Status: CLOSED ERRATA QA Contact: senaik
Severity: high Docs Contact:
Priority: unspecified    
Version: rhgs-3.0 CC: asengupt, rhinduja, rhs-bugs, rjoseph, storage-qa-internal, surs
Target Milestone: --- Keywords: ZStream
Target Release: RHGS 3.0.3   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: USS
Fixed In Version: glusterfs-3.6.0.35-1 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1164711 Environment:
Last Closed: 2015-01-15 13:41:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1162694, 1164711    

Description senaik 2014-11-04 11:57:25 UTC
Description of problem:
======================
If snapd is down or has crashed, the rebalance process hangs because it keeps trying to connect to snapd. The rebalance status then shows "in progress" for a long time while the process keeps retrying the connection to the crashed snapd.


Version-Release number of selected component (if applicable):
==============================================================
glusterfs 3.6.0.30

How reproducible:
================
1/1


Steps to Reproduce:
==================
1. Create a 2x2 distributed-replicate volume and start it

2. FUSE and NFS mount the volume and generate some I/O

3. While I/O is in progress, create some snapshots

4. After the snapshots are complete, cd to .snaps and access the snapshots; this resulted in a snapd crash (tracked by bz 1160138)

5. Check rebalance status
gluster v rebalance vol2 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                2      294Bytes          1508             0             0          in progress            6332.00
       snapshot14.lab.eng.blr.redhat.com                0        0Bytes         12838             0             0            completed             157.00
       snapshot15.lab.eng.blr.redhat.com               14         6.6KB          1540             0             0          in progress            6332.00
       snapshot16.lab.eng.blr.redhat.com                0        0Bytes         12828             0             0            completed             120.00
volume rebalance: vol2: success: 

Rebalance is still in progress on 2 nodes (localhost and snapshot15) and remains in this state because it keeps trying to connect to the crashed snapd.
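
For anyone triaging a similar hang, here is a minimal sketch of how the status output above could be checked for nodes stuck "in progress". It is not part of gluster: the function names parse_status and stalled_nodes and the 3600-second threshold are assumptions chosen for illustration, and the sample rows are copied from the output above with the column padding condensed.

```python
# Sample rows copied from the rebalance status output in this report
# (column padding condensed; header and separator lines kept on purpose).
STATUS_OUTPUT = """\
Node Rebalanced-files size scanned failures skipped status run time in secs
--------- ----------- ----------- ----------- ----------- ----------- ------------ --------------
localhost 2 294Bytes 1508 0 0 in progress 6332.00
snapshot14.lab.eng.blr.redhat.com 0 0Bytes 12838 0 0 completed 157.00
snapshot15.lab.eng.blr.redhat.com 14 6.6KB 1540 0 0 in progress 6332.00
snapshot16.lab.eng.blr.redhat.com 0 0Bytes 12828 0 0 completed 120.00
volume rebalance: vol2: success:
"""


def parse_status(text):
    """Parse data rows of the status table into dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # A data row has: node, files, size, scanned, failures, skipped,
        # a status of one or more words, and the run time as the last field.
        if len(parts) < 8 or not parts[1].isdigit():
            continue  # header, separator, or trailing summary line
        try:
            run_time = float(parts[-1])
        except ValueError:
            continue
        rows.append({
            "node": parts[0],
            "status": " ".join(parts[6:-1]),
            "run_time": run_time,
        })
    return rows


def stalled_nodes(rows, max_secs=3600.0):
    """Return nodes still 'in progress' after max_secs (assumed threshold)."""
    return [r["node"] for r in rows
            if r["status"] == "in progress" and r["run_time"] > max_secs]


if __name__ == "__main__":
    for node in stalled_nodes(parse_status(STATUS_OUTPUT)):
        print("possibly stalled:", node)
```

With the sample above, the two nodes that have been running for 6332 seconds are reported, and the two completed nodes are not.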

------------Part of rebalance log-------------

[2014-11-04 10:48:33.215941] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 2-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:33.222052] E [socket.c:2169:socket_connect_finish] 2-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:34.228429] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 6-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:34.234436] E [socket.c:2169:socket_connect_finish] 6-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:36.241848] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 2-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:36.248168] E [socket.c:2169:socket_connect_finish] 2-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:37.253538] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 6-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:37.259399] E [socket.c:2169:socket_connect_finish] 6-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:39.266618] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 2-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:39.272841] E [socket.c:2169:socket_connect_finish] 2-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:40.278413] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 6-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:40.284449] E [socket.c:2169:socket_connect_finish] 6-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:42.290661] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 2-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:42.297403] E [socket.c:2169:socket_connect_finish] 2-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:43.302585] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 6-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:43.309132] E [socket.c:2169:socket_connect_finish] 6-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
-----------------------------------------------------------------
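
The retry pattern in the log excerpt above can be summarised with a short script. This is only a sketch under assumptions: the regular expression is written against the exact line format shown in this report (other glusterfs versions may log differently), and count_refused is a hypothetical helper name. The sample lines are copied verbatim from the excerpt.

```python
import re
from collections import Counter

# Four lines copied verbatim from the rebalance log excerpt in this report.
LOG_SAMPLE = """\
[2014-11-04 10:48:33.215941] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 2-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:33.222052] E [socket.c:2169:socket_connect_finish] 2-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:34.228429] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 6-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:34.234436] E [socket.c:2169:socket_connect_finish] 6-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
"""

# Matches only the error lines where a snapd client connection was refused.
REFUSED = re.compile(
    r"^\[[^\]]+\] E \[socket\.c:\d+:socket_connect_finish\] "
    r"(?P<client>\S+-snapd-client): connection to (?P<addr>\S+) "
    r"failed \(Connection refused\)"
)


def count_refused(text):
    """Count refused snapd connections per (client, address) pair."""
    counts = Counter()
    for line in text.splitlines():
        match = REFUSED.match(line)
        if match:
            counts[(match.group("client"), match.group("addr"))] += 1
    return counts


if __name__ == "__main__":
    for (client, addr), n in count_refused(LOG_SAMPLE).items():
        print(f"{client} -> {addr}: {n} refused")
```

A steadily growing count for the same client and port, as in the excerpt, suggests the rebalance process is retrying against a snapd that is no longer listening.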

Actual results:
===============
Rebalance process hangs if snapd crashes


Expected results:
================
If snapd crashes, the rebalance process should not be affected. Rebalance should not access snapshots at all; otherwise, a snapd crash can leave the rebalance process hanging while it tries to connect to the crashed snapd.


Additional info:

Comment 4 Avra Sengupta 2014-12-01 07:18:20 UTC
Fixed with https://code.engineering.redhat.com/gerrit/37556

Comment 5 senaik 2014-12-03 12:27:15 UTC
Version: glusterfs 3.6.0.35
=======
While the rebalance process was in progress, I stopped snapd on different servers, and the rebalance process completed successfully.

Marking the bug as 'Verified'

Comment 7 errata-xmlrpc 2015-01-15 13:41:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0038.html