Description of problem:
After BZ 981653, I am finding that "gluster volume status" fails on the other nodes of the cluster as well.

Version-Release number of selected component (if applicable):
[root@quota2 ~]# rpm -qa | grep glusterfs
glusterfs-3.4.0.12rhs.beta2-1.el6rhs.x86_64
glusterfs-fuse-3.4.0.12rhs.beta2-1.el6rhs.x86_64
glusterfs-server-3.4.0.12rhs.beta2-1.el6rhs.x86_64

How reproducible:
Found after BZ 981653.

Steps to Reproduce:
1. After BZ 981653, execute "gluster volume status" on any of the nodes of the cluster.

Actual results:
[root@quota2 ~]# gluster volume status
Another transaction is in progress. Please try again after sometime.

glusterd logs say:
[2013-07-05 05:13:59.931692] E [glusterd-utils.c:333:glusterd_lock] 0-management: Unable to get lock for uuid: cc7bc8ba-fa3a-43d9-a899-114e34d27eb4, lock held by: 236e161a-fc82-4964-8e6d-bb0d9160990d
[2013-07-05 05:14:02.238117] E [socket.c:2158:socket_connect_finish] 0-management: connection to 10.70.37.98:24007 failed (Connection refused)
[2013-07-05 05:22:13.211163] E [glusterd-utils.c:333:glusterd_lock] 0-management: Unable to get lock for uuid: cc7bc8ba-fa3a-43d9-a899-114e34d27eb4, lock held by: 236e161a-fc82-4964-8e6d-bb0d9160990d
[2013-07-05 05:22:13.211243] E [glusterd-syncop.c:1128:gd_sync_task_begin] 0-management: Unable to acquire lock
[2013-07-05 05:22:13.211373] E [glusterd-utils.c:375:glusterd_unlock] 0-management: Cluster lock held by 236e161a-fc82-4964-8e6d-bb0d9160990d ,unlock req from cc7bc8ba-fa3a-43d9-a899-114e34d27eb4!
[2013-07-05 05:22:13.211404] E [glusterd-utils.c:333:glusterd_lock] 0-management: Unable to get lock for uuid: cc7bc8ba-fa3a-43d9-a899-114e34d27eb4, lock held by: 236e161a-fc82-4964-8e6d-bb0d9160990d
[2013-07-05 05:31:08.545951] E [glusterd-utils.c:333:glusterd_lock] 0-management: Unable to get lock for uuid: cc7bc8ba-fa3a-43d9-a899-114e34d27eb4, lock held by: 236e161a-fc82-4964-8e6d-bb0d9160990d
[2013-07-05 05:31:08.546016] E [glusterd-syncop.c:1128:gd_sync_task_begin] 0-management: Unable to acquire lock
[2013-07-05 05:31:08.546121] E [glusterd-utils.c:375:glusterd_unlock] 0-management: Cluster lock held by 236e161a-fc82-4964-8e6d-bb0d9160990d ,unlock req from cc7bc8ba-fa3a-43d9-a899-114e34d27eb4!
[2013-07-05 05:31:08.546142] E [glusterd-utils.c:333:glusterd_lock] 0-management: Unable to get lock for uuid: cc7bc8ba-fa3a-43d9-a899-114e34d27eb4, lock held by: 236e161a-fc82-4964-8e6d-bb0d9160990d
[2013-07-05 05:31:13.491554] I [glusterd-handler.c:966:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2013-07-05 05:38:01.968142] I [glusterd-handler.c:966:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2013-07-05 05:38:02.187306] E [glusterd-utils.c:333:glusterd_lock] 0-management: Unable to get lock for uuid: cc7bc8ba-fa3a-43d9-a899-114e34d27eb4, lock held by: 236e161a-fc82-4964-8e6d-bb0d9160990d
[2013-07-05 05:38:02.187355] E [glusterd-syncop.c:1128:gd_sync_task_begin] 0-management: Unable to acquire lock
[2013-07-05 05:38:02.187453] E [glusterd-utils.c:375:glusterd_unlock] 0-management: Cluster lock held by 236e161a-fc82-4964-8e6d-bb0d9160990d ,unlock req from cc7bc8ba-fa3a-43d9-a899-114e34d27eb4!
[2013-07-05 05:38:02.187490] E [glusterd-utils.c:333:glusterd_lock] 0-management: Unable to get lock for uuid: cc7bc8ba-fa3a-43d9-a899-114e34d27eb4, lock held by: 236e161a-fc82-4964-8e6d-bb0d9160990d

Expected results:
If a node crashed for some reason, some other node should provide the information without fail.
Otherwise the whole cluster becomes unusable without some "workaround".

Additional info:
https://code.engineering.redhat.com/gerrit/#/c/10364/ <-- Posted for review.

PROBLEM:
When the originator of a volume transaction goes down while it is still holding the lock, volume ops issued from the other nodes also fail with the message that the lock is still held by the node that went down.

FIX:
Upon receiving DISCONNECT from the originator of a transaction, the rest of the nodes perform the following actions:
a. release the lock; and
b. reset the state of the node to GD_OP_STATE_DEFAULT.

Note: This bug is not confined to the 'volume quota' command. This state may be reached for any volume command when the originator goes down while in possession of the lock.
The change has been merged downstream. Hence moving the state of the bug to MODIFIED.
https://code.engineering.redhat.com/gerrit/#/c/10364/ <-- Same as in comment #4
Tested this with glusterfs-3.4.0.17rhs-1.

Steps
=====
1. Created a trusted storage pool of 3 nodes.
2. Created a replica volume with 2 bricks (1 brick on node1 and another on node2).
3. Started the volume.
4. Abruptly powered down node1.
5. Issued "gluster volume heal <vol-name>" from node2.
6. The 'heal' command waits [BZ 866758] for frame-timeout, which is 600 secs.
7. Issued "gluster volume status" from node3. You will get the following error:

[Thu Aug 8 10:50:50 UTC 2013 root.37.61:~ ] # gluster volume status
Another transaction is in progress. Please try again after sometime.

NOTE: the above command was executed on node3, which doesn't actually have any bricks on it.

8. Abruptly powered down node2 also.
9. Checked "gluster volume status".

"gluster volume status" succeeded, and thus moving this to VERIFIED state.
Correction to comment #8: verified with glusterfs-3.4.0.18rhs-1.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1262.html