Description of problem:
**********************************************
Rebooting 3 nodes out of a 4-node CTDB cluster leaves the three rebooted nodes in UNHEALTHY state, and on one of the nodes /gluster/lock does not get mounted.

Version-Release number of selected component (if applicable):
glusterfs-3.6.0.51-1.el6rhs.x86_64
ctdb2.5-2.5.4-1.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a 4-node CTDB cluster.
2. Reboot 3 of the 4 nodes.
3. Verify ctdb status and ctdb ip (a verification sketch follows under Additional info).

Actual results:
**************************
Three nodes remain in UNHEALTHY state, and on one of the nodes /gluster/lock is not getting mounted.

Expected results:
**************************
All nodes should come back to OK state.

Additional info:
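For reference, a minimal shell sketch for the verification in step 3 (run on any cluster node; the mount point /gluster/lock is taken from this report):

    # Check node health and public IP assignment
    ctdb status
    ctdb ip

    # Confirm the shared lock volume is mounted on this node
    grep /gluster/lock /proc/mounts || echo "/gluster/lock is not mounted"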
Reproducing the issue again with the log level set to DEBUG; will update with the sosreports afterwards.
When rebooting one node in a 4-node CTDB cluster, that node remains in UNHEALTHY state because the /gluster/lock mount does not happen on it.

Errors from the CTDB logs:

2015/05/04 02:53:09.889335 [set_recmode: 9120]: ERROR: recovery lock file /gluster/lock/lockfile not locked when recovering!
2015/05/04 02:53:09.900173 [ 3109]: Freeze priority 1
2015/05/04 02:53:09.901113 [ 3109]: Freeze priority 2
2015/05/04 02:53:09.901862 [ 3109]: Freeze priority 3
2015/05/04 02:53:11.034595 [ 3109]: Thawing priority 1
2015/05/04 02:53:11.034631 [ 3109]: Release freeze handler for prio 1
2015/05/04 02:53:11.034667 [ 3109]: Thawing priority 2
2015/05/04 02:53:11.034678 [ 3109]: Release freeze handler for prio 2
2015/05/04 02:53:11.034697 [ 3109]: Thawing priority 3
2015/05/04 02:53:11.034708 [ 3109]: Release freeze handler for prio 3
2015/05/04 02:53:11.035309 [set_recmode: 9182]: ERROR: recovery lock file /gluster/lock/lockfile not locked when recovering!

A snippet from the glusterd logs:

[2015-05-04 06:52:55.385969] E [glusterd-op-sm.c:207:glusterd_get_txn_opinfo] 0-: Unable to get transaction opinfo for transaction ID : 5ceff865-23a8-48fb-b13e-d2252ee5d0f4
[2015-05-04 06:52:55.387695] E [glusterd-op-sm.c:207:glusterd_get_txn_opinfo] 0-: Unable to get transaction opinfo for transaction ID : 486078dd-0eab-4223-b80e-59099da74ec2
[2015-05-04 06:52:55.420389] E [glusterd-op-sm.c:207:glusterd_get_txn_opinfo] 0-: Unable to get transaction opinfo for transaction ID : 5f495b59-54f2-4949-b1ee-5fd0a324d582
[2015-05-04 06:52:55.423631] W [glusterd-op-sm.c:3975:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed
[2015-05-04 06:52:55.425004] I [glusterd-handler.c:3841:__glusterd_handle_status_volume] 0-management: Received status volume req for volume ctdb
[2015-05-04 06:52:55.427132] W [glusterd-locks.c:547:glusterd_mgmt_v3_lock] 0-management: Lock for ctdb held by 0e6d1fb9-e34e-4733-bb87-734dd920080b
[2015-05-04 06:52:55.427151] E [glusterd-op-sm.c:3054:glusterd_op_ac_lock] 0-management: Unable to acquire lock for ctdb
[2015-05-04 06:52:55.427178] E [glusterd-op-sm.c:6539:glusterd_op_sm] 0-management: handler returned: -1
[2015-05-04 06:52:55.429000] E [glusterd-syncop.c:86:gd_mgmt_v3_collate_errors] 0-: Locking failed on 10.16.157.78. Please check log file for details.
[2015-05-04 06:52:55.429042] W [glusterd-locks.c:641:glusterd_mgmt_v3_unlock] 0-management: Lock owner mismatch. Lock for vol ctdb held by 0e6d1fb9-e34e-4733-bb87-734dd920080b
[2015-05-04 06:52:55.429056] E [glusterd-op-sm.c:3102:glusterd_op_ac_unlock] 0-management: Unable to release lock for ctdb
[2015-05-04 06:52:55.429084] E [glusterd-op-sm.c:6539:glusterd_op_sm] 0-management: handler returned: 1
[2015-05-04 06:52:55.429104] E [glusterd-syncop.c:1724:gd_sync_task_begin] 0-management: Locking Peers Failed.
[2015-05-04 06:52:55.430345] E [glusterd-syncop.c:86:gd_mgmt_v3_collate_errors] 0-: Unlocking failed on 10.16.157.78. Please check log file for details.

Version-Release number of selected component (if applicable):
rpm -qa | grep ctdb
ctdb2.5-2.5.4-1.el6rhs.x86_64
glusterfs-3.6.0.53-1.el6rhs.x86_64

How reproducible:
Inconsistent; roughly 1 out of 5 attempts.

Steps to Reproduce:
1. Create a 4-node CTDB cluster.
2. Reboot one node to verify failover.
3. Check ctdb status and ctdb ip.

Actual results:
The rebooted node remains in UNHEALTHY state because /gluster/lock is not mounted once the node comes back up.

Expected results:
Once the node comes up, it should return to healthy state, and /gluster/lock should be mounted on all nodes.
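The glusterd errors above point to a stale mgmt_v3 lock on the ctdb volume, still held by a peer UUID after the reboot. A quick way to inspect this from a surviving node (sketch; the volume name ctdb is taken from the log):

    # Status request for the lock volume; this is the operation that fails
    # in the log with "Unable to acquire lock for ctdb"
    gluster volume status ctdb

    # Peer state after the reboot; the UUID from the log should map to one
    # of the peers listed here
    gluster peer status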
Additional info: After a force start of the ctdb volume, the mount happened and the node became healthy.
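The workaround above, expressed as commands (a sketch, not a fix; assumes the lock volume is named ctdb as elsewhere in this report, and that the CTDB event script remounts it once the volume is started):

    # Force-start the lock volume on the affected node
    gluster volume start ctdb force

    # If the event script does not remount it, mount manually
    # (the localhost:/ctdb source is an assumption about this setup)
    mount -t glusterfs localhost:/ctdb /gluster/lock

    # The node should now return to OK
    ctdb status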
Setting needinfo on Michael to get a reply to comment #6.
Please see https://bugzilla.redhat.com/show_bug.cgi?id=1177603#c10 for an explanation that applies to this bug as well.
*** This bug has been marked as a duplicate of bug 1177603 ***