Bug 1202328

Summary: CTDB: Three nodes out of a 4-node CTDB cluster remain in UNHEALTHY state when these three nodes are rebooted
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: surabhi <sbhaloth>
Component: ctdb
Assignee: Anoop C S <anoopcs>
Status: CLOSED DUPLICATE
QA Contact: Vivek Das <vdas>
Severity: high
Priority: unspecified
Version: rhgs-3.0
CC: anoopcs, bkunal, gdeschner, madam, nlevinki, rhs-smb, sheggodu, wenshi
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: gluster
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2018-04-10 10:05:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Blocks: 1408949

Description surabhi 2015-03-16 12:11:59 UTC
Description of problem:
**********************************************
Rebooting 3 nodes of a 4-node CTDB cluster leaves those three nodes in UNHEALTHY state, and on one of the nodes /gluster/lock does not get mounted.


Version-Release number of selected component (if applicable):
glusterfs-3.6.0.51-1.el6rhs.x86_64
ctdb2.5-2.5.4-1.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a 4-node CTDB cluster
2. Reboot 3 nodes out of the 4
3. Verify ctdb status and ctdb ip (see the sketch below)
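As a reference for step 3, a minimal sketch of the checks, assuming the standard CTDB CLI is available on a surviving node (in a healthy cluster every node should report OK and every public address should be hosted by some node):

# Overall node health; rebooted nodes show UNHEALTHY/DISCONNECTED until they recover
ctdb status

# Public (virtual) IP assignment; each address should be taken over by a healthy node
ctdb ip

# Optionally run the status check on all reachable nodes in parallel
onnode -p all ctdb status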

Actual results:
**************************
Three nodes remain in UNHEALTHY state, and on one of the nodes /gluster/lock does not get mounted.

Expected results:
**************************
All nodes should return to the OK state.

Additional info:

Comment 2 surabhi 2015-03-16 12:13:34 UTC
Reproducing the issue again with log level DEBUG; will then attach the sosreports.

Comment 3 surabhi 2015-05-04 10:03:28 UTC
When rebooting one node in a 4-node CTDB cluster, one of the nodes remains in UNHEALTHY state because the /gluster/lock mount does not happen on that node.
The errors from the logs:


2015/05/04 02:53:09.889335 [set_recmode: 9120]: ERROR: recovery lock file /gluster/lock/lockfile not locked when recovering!
2015/05/04 02:53:09.900173 [ 3109]: Freeze priority 1
2015/05/04 02:53:09.901113 [ 3109]: Freeze priority 2
2015/05/04 02:53:09.901862 [ 3109]: Freeze priority 3
2015/05/04 02:53:11.034595 [ 3109]: Thawing priority 1
2015/05/04 02:53:11.034631 [ 3109]: Release freeze handler for prio 1
2015/05/04 02:53:11.034667 [ 3109]: Thawing priority 2
2015/05/04 02:53:11.034678 [ 3109]: Release freeze handler for prio 2
2015/05/04 02:53:11.034697 [ 3109]: Thawing priority 3
2015/05/04 02:53:11.034708 [ 3109]: Release freeze handler for prio 3
2015/05/04 02:53:11.035309 [set_recmode: 9182]: ERROR: recovery lock file /gluster/lock/lockfile not locked when recovering!
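These messages mean ctdbd could not take the recovery lock under /gluster/lock while recovering. A minimal sketch for checking the lock setup on the affected node, assuming the usual RHGS layout where the shared lock volume is mounted at /gluster/lock and CTDB_RECOVERY_LOCK in /etc/sysconfig/ctdb points into it:

# Is the shared lock volume actually mounted on this node?
findmnt /gluster/lock

# Which recovery lock file is configured / in use?
grep CTDB_RECOVERY_LOCK /etc/sysconfig/ctdb
ctdb getreclock

# If the mount is missing, ctdbd cannot lock /gluster/lock/lockfile and the node stays UNHEALTHY.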


The snippet from gluster logs:

[2015-05-04 06:52:55.385969] E [glusterd-op-sm.c:207:glusterd_get_txn_opinfo] 0-: Unable to get transaction opinfo for transaction ID : 5ceff865-23a8-48fb-b13e-d2252ee5d0f4
[2015-05-04 06:52:55.387695] E [glusterd-op-sm.c:207:glusterd_get_txn_opinfo] 0-: Unable to get transaction opinfo for transaction ID : 486078dd-0eab-4223-b80e-59099da74ec2
[2015-05-04 06:52:55.420389] E [glusterd-op-sm.c:207:glusterd_get_txn_opinfo] 0-: Unable to get transaction opinfo for transaction ID : 5f495b59-54f2-4949-b1ee-5fd0a324d582
[2015-05-04 06:52:55.423631] W [glusterd-op-sm.c:3975:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed
[2015-05-04 06:52:55.425004] I [glusterd-handler.c:3841:__glusterd_handle_status_volume] 0-management: Received status volume req for volume ctdb
[2015-05-04 06:52:55.427132] W [glusterd-locks.c:547:glusterd_mgmt_v3_lock] 0-management: Lock for ctdb held by 0e6d1fb9-e34e-4733-bb87-734dd920080b
[2015-05-04 06:52:55.427151] E [glusterd-op-sm.c:3054:glusterd_op_ac_lock] 0-management: Unable to acquire lock for ctdb
[2015-05-04 06:52:55.427178] E [glusterd-op-sm.c:6539:glusterd_op_sm] 0-management: handler returned: -1
[2015-05-04 06:52:55.429000] E [glusterd-syncop.c:86:gd_mgmt_v3_collate_errors] 0-: Locking failed on 10.16.157.78. Please check log file for details.
[2015-05-04 06:52:55.429042] W [glusterd-locks.c:641:glusterd_mgmt_v3_unlock] 0-management: Lock owner mismatch. Lock for vol ctdb held by 0e6d1fb9-e34e-4733-bb87-734dd920080b
[2015-05-04 06:52:55.429056] E [glusterd-op-sm.c:3102:glusterd_op_ac_unlock] 0-management: Unable to release lock for ctdb
[2015-05-04 06:52:55.429084] E [glusterd-op-sm.c:6539:glusterd_op_sm] 0-management: handler returned: 1
[2015-05-04 06:52:55.429104] E [glusterd-syncop.c:1724:gd_sync_task_begin] 0-management: Locking Peers Failed.
[2015-05-04 06:52:55.430345] E [glusterd-syncop.c:86:gd_mgmt_v3_collate_errors] 0-: Unlocking failed on 10.16.157.78. Please check log file for details.
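The glusterd messages above show mgmt_v3 lock contention on the lock volume (named ctdb here) while the nodes rejoin. A quick sketch for confirming the state of that volume after the reboot, using the standard gluster CLI:

# The lock volume and all of its bricks should be online and the volume in 'Started' state
gluster volume status ctdb
gluster volume info ctdb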

Version-Release number of selected component (if applicable):
 rpm -qa | grep ctdb
ctdb2.5-2.5.4-1.el6rhs.x86_64
glusterfs-3.6.0.53-1.el6rhs.x86_64


How reproducible:
Inconsistent; roughly 1 out of 5 attempts.


Steps to Reproduce:
1. Create a 4-node CTDB cluster
2. Reboot one node to verify failover
3. Check ctdb status and ctdb ip

Actual results:
The node that was rebooted remains in UNHEALTHY state because /gluster/lock is not mounted once the node comes back up.


Expected results:
Once the node comes up, it should return to HEALTHY state and /gluster/lock should be mounted.

Additional info:

After a force start of the ctdb volume, the mount happened and the node became healthy.
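For the record, the workaround described above amounts to something like the following (volume name ctdb as in the logs; presumably the force start re-runs the start hook that mounts the lock volume):

gluster volume start ctdb force   # force start of the lock volume
findmnt /gluster/lock             # verify the lock volume is mounted again
ctdb status                       # the node should return to OK after the next monitor interval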

Comment 8 Bipin Kunal 2017-08-01 14:00:18 UTC
Setting needinfo on Michael to get a reply to comment #6.

Comment 9 Anoop C S 2017-09-13 10:52:01 UTC
Please see https://bugzilla.redhat.com/show_bug.cgi?id=1177603#c10 for an explanation relevant to this bug.

Comment 10 Michael Adam 2018-04-10 10:05:13 UTC

*** This bug has been marked as a duplicate of bug 1177603 ***