Description of problem:
======================
If snapd is down or has crashed, the rebalance process hangs because it keeps trying to connect to snapd. The rebalance status shows the process as "in progress" for a long time, because it is still trying to connect to the crashed snapd.

Version-Release number of selected component (if applicable):
==============================================================
glusterfs 3.6.0.30

How reproducible:
================
1/1

Steps to Reproduce:
==================
1. Create a 2x2 distributed-replicate volume and start it.
2. Mount the volume over FUSE and NFS and generate some I/O.
3. While the I/O is in progress, create some snapshots.
4. After the snapshots are created, cd into .snaps and access the snapshots. This crashes snapd (tracked by bz 1160138).
5. Check the rebalance status:

gluster v rebalance vol2 status
Node                                Rebalanced-files  size      scanned  failures  skipped  status       run time in secs
---------                           ----------------  --------  -------  --------  -------  -----------  ----------------
localhost                           2                 294Bytes  1508     0         0        in progress  6332.00
snapshot14.lab.eng.blr.redhat.com   0                 0Bytes    12838    0         0        completed    157.00
snapshot15.lab.eng.blr.redhat.com   14                6.6KB     1540     0         0        in progress  6332.00
snapshot16.lab.eng.blr.redhat.com   0                 0Bytes    12828    0         0        completed    120.00
volume rebalance: vol2: success:

The rebalance process is in progress on 2 nodes and remains in this state because it is trying to connect to snapd, which has crashed.
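The hang is only visible by polling the status output and noticing that the run time on some nodes keeps growing while the status never leaves "in progress". As a rough monitoring aid (not part of glusterfs), the hypothetical Python helper below parses status output in the column layout shown above and flags nodes that are still "in progress" beyond a run-time threshold. The helper name `stuck_nodes` and the threshold value are assumptions chosen for illustration.

```python
import re

# Hypothetical helper (not a glusterfs tool): given the text printed by
# "gluster v rebalance <vol> status", return the nodes whose status is
# "in progress" and whose run time exceeds max_runtime seconds.
def stuck_nodes(status_output: str, max_runtime: float) -> list[str]:
    stuck = []
    for line in status_output.splitlines():
        fields = line.split()
        # Data rows have at least 8 whitespace-separated fields and end in a
        # numeric run time; header, separator and summary lines are skipped.
        if len(fields) < 8 or not re.fullmatch(r"\d+(\.\d+)?", fields[-1]):
            continue
        node = fields[0]
        status = " ".join(fields[6:-1])  # "in progress" spans two tokens
        runtime = float(fields[-1])
        if status == "in progress" and runtime > max_runtime:
            stuck.append(node)
    return stuck


# Status output copied from this report.
SAMPLE = """\
Node Rebalanced-files size scanned failures skipped status run time in secs
--------- ----------- ----------- ----------- ----------- ----------- ------------ --------------
localhost 2 294Bytes 1508 0 0 in progress 6332.00
snapshot14.lab.eng.blr.redhat.com 0 0Bytes 12838 0 0 completed 157.00
snapshot15.lab.eng.blr.redhat.com 14 6.6KB 1540 0 0 in progress 6332.00
snapshot16.lab.eng.blr.redhat.com 0 0Bytes 12828 0 0 completed 120.00
volume rebalance: vol2: success:
"""

if __name__ == "__main__":
    print(stuck_nodes(SAMPLE, max_runtime=3600))
```

With a one-hour threshold, this flags the two nodes that have been "in progress" for 6332 seconds and ignores the two that completed.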
------------ Part of rebalance log ------------
[2014-11-04 10:48:33.215941] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 2-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:33.222052] E [socket.c:2169:socket_connect_finish] 2-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:34.228429] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 6-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:34.234436] E [socket.c:2169:socket_connect_finish] 6-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:36.241848] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 2-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:36.248168] E [socket.c:2169:socket_connect_finish] 2-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:37.253538] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 6-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:37.259399] E [socket.c:2169:socket_connect_finish] 6-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:39.266618] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 2-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:39.272841] E [socket.c:2169:socket_connect_finish] 2-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:40.278413] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 6-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:40.284449] E [socket.c:2169:socket_connect_finish] 6-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:42.290661] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 2-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:42.297403] E [socket.c:2169:socket_connect_finish] 2-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
[2014-11-04 10:48:43.302585] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 6-vol2-snapd-client: changing port to 49179 (from 0)
[2014-11-04 10:48:43.309132] E [socket.c:2169:socket_connect_finish] 6-vol2-snapd-client: connection to 127.0.0.1:49179 failed (Connection refused)
-----------------------------------------------

Actual results:
===============
The rebalance process hangs if snapd crashes.

Expected results:
================
A snapd crash should not affect the rebalance process. The rebalance process should not access snapshots at all; otherwise, if snapd crashes, rebalance may hang while it keeps trying to connect to the crashed snapd.

Additional info:
Fixed with https://code.engineering.redhat.com/gerrit/37556
Version: glusterfs 3.6.0.35
=======
While the rebalance process was in progress, snapd was stopped on different servers, and the rebalance process still completed successfully. Marking the bug as 'Verified'.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0038.html