Bug 1244527
| Summary: | DHT-rebalance: Rebalance hangs on distribute volume when glusterd is stopped on peer node | | |
| --- | --- | --- | --- |
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Triveni Rao <trao> |
| Component: | distribute | Assignee: | Anand Nekkunti <anekkunt> |
| Status: | CLOSED ERRATA | QA Contact: | Byreddy <bsrirama> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | rhgs-3.1 | CC: | amukherj, anekkunt, asriram, asrivast, divya, nsathyan, rcyriac, sashinde, sasundar |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | RHGS 3.1.1 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | glusterd | | |
| Fixed In Version: | glusterfs-3.7.1-12 | Doc Type: | Bug Fix |
| Doc Text: | Previously, the "gluster vol rebalance <vol_name> start" command might hang if any node in the cluster went down at the same time. With this fix, this issue is resolved. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1245142 (view as bug list) | Environment: | |
| Last Closed: | 2015-10-05 07:12:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1216951, 1245142, 1249925, 1251815 | | |
Description
Triveni Rao
2015-07-19 17:25:47 UTC
```
[root@casino-vm1 ~]# rpm -qa | grep gluster
gluster-nagios-addons-0.2.3-1.el6rhs.x86_64
glusterfs-api-3.7.1-10.el6rhs.x86_64
glusterfs-geo-replication-3.7.1-10.el6rhs.x86_64
gluster-nagios-common-0.2.0-1.el6rhs.noarch
glusterfs-libs-3.7.1-10.el6rhs.x86_64
glusterfs-client-xlators-3.7.1-10.el6rhs.x86_64
glusterfs-fuse-3.7.1-10.el6rhs.x86_64
glusterfs-server-3.7.1-10.el6rhs.x86_64
glusterfs-rdma-3.7.1-10.el6rhs.x86_64
vdsm-gluster-4.16.20-1.1.el6rhs.noarch
glusterfs-3.7.1-10.el6rhs.x86_64
glusterfs-cli-3.7.1-10.el6rhs.x86_64
```

sosreport uploaded:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1244527/sosreport-casino-vm1.lab.eng.blr.redhat.com.003-20150719063653.tar

glusterd log excerpt (the first line is truncated in the original report):

```
…replication options
[2015-07-19 10:35:55.178461] E [MSGID: 106119] [glusterd-syncop.c:1819:gd_sync_task_begin] 0-management: Unable to acquire lock for great
The message "I [MSGID: 106488] [glusterd-handler.c:1463:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req" repeated 2 times between [2015-07-19 10:36:55.325248] and [2015-07-19 10:36:55.328233]
[2015-07-19 10:37:45.643073] I [MSGID: 106487] [glusterd-handler.c:1402:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2015-07-19 10:37:45.766638] I [MSGID: 106062] [glusterd-geo-rep.c:309:__glusterd_handle_gsync_set] 0-management: slave not found, while handling geo-replication options
[2015-07-19 10:37:45.767872] W [glusterd-locks.c:575:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f6736153580] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1d4)[0x7f672abf5d04] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x915)[0x7f672abf1ec5] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f672abf1f8b] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(__glusterd_handle_gsync_set+0x16f)[0x7f672abd083f] ))))) 0-management: Lock for great held by abbb8155-7e01-4fea-983e-2e2c929ebd7c
[2015-07-19 10:37:45.767895] E [MSGID: 106119] [glusterd-syncop.c:1819:gd_sync_task_begin] 0-management: Unable to acquire lock for great
[2015-07-19 10:37:46.020277] I [MSGID: 106499] [glusterd-handler.c:4258:__glusterd_handle_status_volume] 0-management: Received status volume req for volume great
[2015-07-19 10:37:46.020553] W [glusterd-locks.c:575:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f6736153580] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1d4)[0x7f672abf5d04] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x915)[0x7f672abf1ec5] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b)[0x7f672abf1f8b] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(__glusterd_handle_status_volume+0x1b2)[0x7f672ab3dd42] ))))) 0-management: Lock for great held by abbb8155-7e01-4fea-983e-2e2c929ebd7c
[2015-07-19 10:37:46.022241] I [MSGID: 106499] [glusterd-handler.c:4258:__glusterd_handle_status_volume] 0-management: Received status volume req for volume test
[2015-07-19 10:37:45.881620] I [MSGID: 106062] [glusterd-geo-rep.c:309:__glusterd_handle_gsync_set] 0-management: slave not found, while handling geo-replication options
[2015-07-19 10:37:46.020575] E [MSGID: 106119] [glusterd-syncop.c:1819:gd_sync_task_begin] 0-management: Unable to acquire lock for great
[2015-07-19 10:48:38.245354] W [glusterd-locks.c:575:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f6736153580] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1d4)[0x7f672abf5d04] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_op_txn_begin+0x51a)[0x7f672ab5792a] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(__glusterd_handle_defrag_volume+0x27e)[0x7f672abbb47e] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7f672ab3e81f] ))))) 0-management: Lock for great held by abbb8155-7e01-4fea-983e-2e2c929ebd7c
[2015-07-19 10:48:38.245561] E [MSGID: 106119] [glusterd-handler.c:719:glusterd_op_txn_begin] 0-management: Unable to acquire lock for great
```

RCA: When one of the glusterd instances goes down during rebalance start, the callback function (_glusterd_commit_op_cbk) is invoked with rpc_status set to -1. On RPC success the txn_id is taken from the response, but on RPC failure the code falls back to global_txn_id, which is always zero; this drives the op state machine (op_sm) into an inconsistent state.

upstream patch: http://review.gluster.org/#/c/11728/
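For context, the scenario described in the RCA can be reproduced with ordinary gluster CLI commands. Below is a minimal sketch; the volume name, hostnames, and brick paths are illustrative assumptions, not values taken from this report. The same flow appears in the verification steps further down.

```sh
# On node1: create and start a plain distribute volume across two peers.
gluster volume create distvol node1:/bricks/b1 node2:/bricks/b2
gluster volume start distvol

# Mount the volume via FUSE and create some files so rebalance has work to do.
mkdir -p /mnt/distvol
mount -t glusterfs node1:/distvol /mnt/distvol
for i in $(seq 1 100); do echo data > /mnt/distvol/file$i; done

# Kick off rebalance on node1 ...
gluster volume rebalance distvol start

# ... and, on node2, stop glusterd while the rebalance transaction is in flight.
service glusterd stop

# On the unfixed build, the stale transaction keeps holding the volume lock,
# so subsequent volume operations fail with "Unable to acquire lock".
gluster volume rebalance distvol status
```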
The downstream patch https://code.engineering.redhat.com/gerrit/#/c/55080/ has been merged now, moving the state to 'Modified'.

Verified this bug with the version "glusterfs-3.7.1-12".

Issue verified steps:
=====================
1. Created a distributed volume using a two-node cluster.
2. Mounted the volume as FUSE.
3. Created some files at the mount point.
4. Started rebalance on one node and killed glusterd on the other node.
5. Rebalance started successfully, and after that other volume-related commands could be issued.

The fix is working well; moving this bug to the verified state.

Please review and sign off on the edited doc text.

Divya, glusterd was not always hanging; we might end up with glusterd hung. I would like to change the text to: Previously, the "gluster vol rebalance <vol_name> start" command might hang if any node in the cluster went down at the same time. With this fix, this issue is resolved.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1845.html