Bug 999970

Summary: CTDB: After a network fluctuation, the ctdb node goes to the banned state and does not recover within 300 secs
Product: Red Hat Gluster Storage
Reporter: surabhi <sbhaloth>
Component: samba
Assignee: Ira Cooper <ira>
Status: CLOSED EOL
QA Contact: surabhi <sbhaloth>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 2.1
CC: lmohanty, rjoseph, sdharane, surs
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: ctdb
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-12-03 17:22:32 UTC
Type: Bug
Regression: --- Mount Type: ---

Description surabhi 2013-08-22 13:24:59 UTC
Description of problem:
In a ctdb setup, bringing down the network interface for a node causes the node to go to the banned state.

First, bring down the network interface for one node.
The node goes to DISCONNECTED|UNHEALTHY|INACTIVE, which is fine.
Bring the network up again.
The node goes to the banned state:
=> sometimes it goes to the banned state and comes back to the healthy state, but not within the recovery ban period (RecoveryBanPeriod), which is 300 secs by default.
=> sometimes it goes to the banned state and stays there forever.

I also made sure not to bring down the network interface on the node where the ctdb volume resides.

Version-Release number of selected component (if applicable):
samba-glusterfs-3.6.9-159.1.el6rhs.x86_64
glusterfs-3.4.0.19rhs-2.el6rhs.x86_64

How reproducible:
Not consistent; roughly once in 4 attempts.

Steps to Reproduce:
1. Create a ctdb setup.
2. Bring down the network interface for a node (ip link set dev eth0 down).
3. Bring the network interface back up.

Actual results:
The node goes to the banned state:
=> sometimes it goes to the banned state and comes back to the healthy state, but not within the recovery ban period (RecoveryBanPeriod), which is 300 secs by default.
=> sometimes it goes to the banned state and stays there forever.

Expected results:
A node should go to the banned state only if it has failed many recovery attempts; otherwise it should go to DISCONNECTED|UNHEALTHY|INACTIVE and then come back to the healthy state. Even if it has gone to the banned state for any reason, it should come back to the healthy state after 300 secs.

Additional info:
After bringing down the network, the output of the ctdb status command is as follows:
# ctdb status
Number of nodes:4
pnn:0 10.16.159.152    OK
pnn:1 10.16.159.153    OK (THIS NODE)
2013/08/22 09:13:11.716108 [30084]: client/ctdb_client.c:759 control timed out. reqid:6 opcode:124 dstnode:2
2013/08/22 09:13:11.716240 [30084]: client/ctdb_client.c:870 ctdb_control_recv failed
2013/08/22 09:13:11.716260 [30084]: client/ctdb_client.c:2523 ctdb_control for get ifaces failed ret:-1 res:-1
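
The status output above stops after pnn 1 even though four nodes are reported. A minimal sketch of how that could be checked automatically is below. It assumes the line layout shown in this report ("Number of nodes:N" followed by "pnn:N address FLAGS") and that PNNs run from 0 to N-1; the function name parse_ctdb_status is hypothetical and not part of ctdb.

```python
import re

# Status text copied from this report (first error line only).
STATUS_OUTPUT = """\
Number of nodes:4
pnn:0 10.16.159.152    OK
pnn:1 10.16.159.153    OK (THIS NODE)
2013/08/22 09:13:11.716108 [30084]: client/ctdb_client.c:759 control timed out. reqid:6 opcode:124 dstnode:2
"""

COUNT_LINE = re.compile(r"^Number of nodes:\s*(\d+)\s*$")
NODE_LINE = re.compile(r"^pnn:(\d+)\s+(\S+)\s+(\S+)(\s+\(THIS NODE\))?\s*$")


def parse_ctdb_status(text):
    """Return (reported_node_count, {pnn: node_info}) for status-style text.

    Hypothetical helper: it only understands the layout seen in this report.
    """
    reported = None
    nodes = {}
    for raw in text.splitlines():
        line = raw.strip()
        count = COUNT_LINE.match(line)
        if count:
            reported = int(count.group(1))
            continue
        node = NODE_LINE.match(line)
        if node:
            pnn, address, flags, this_node = node.groups()
            nodes[int(pnn)] = {
                "address": address,
                # Flags such as DISCONNECTED|UNHEALTHY|INACTIVE are split on "|".
                "flags": set(flags.split("|")),
                "this_node": this_node is not None,
            }
    return reported, nodes


reported, nodes = parse_ctdb_status(STATUS_OUTPUT)
# Assumption: PNNs are numbered 0..reported-1 with no gaps.
missing = sorted(set(range(reported)) - set(nodes))
print("nodes not listed:", missing)  # nodes not listed: [2, 3]
```

Here pnn 2 and pnn 3 are not listed, which is consistent with the "control timed out ... dstnode:2" message.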

From the ctdb log: the time it took to go from banned to unhealthy and then to healthy is more than 300 secs (the default).

2013/08/22 07:56:25.940939 [ 2644]: Banning this node for 300 seconds
2013/08/22 07:56:25.976470 [ 2644]: No public addresses file found. Nothing to do for 10.interfaces
2013/08/22 07:57:08.588746 [ 2700]: We are still serving a public address '10.16.159.175' that we should not be serving.
2013/08/22 07:57:08.588811 [ 2700]: Trigger takeoverrun
2013/08/22 07:57:10.028444 [ 2700]: We are still serving a public address '10.16.159.175' that we should not be serving.
2013/08/22 08:03:43.709320 [ 2644]: Banning timedout
2013/08/22 08:03:44.023738 [ 2644]: Freeze priority 1
2013/08/22 08:03:44.025728 [ 2644]: Freeze priority 2
2013/08/22 08:03:45.911088 [ 2644]: Release freeze handler for prio 3
2013/08/22 08:03:48.496343 [ 2644]: Node became HEALTHY. Ask recovery master 1 to perform ip reallocation
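
To put a number on the delay, the timestamps in the log above can be subtracted directly. The sketch below uses three lines copied from that log; the helper name find_timestamp is hypothetical, and the 300 second limit is the default quoted in this report.

```python
from datetime import datetime

# Three lines copied from the ctdb log in this report.
LOG = """\
2013/08/22 07:56:25.940939 [ 2644]: Banning this node for 300 seconds
2013/08/22 08:03:43.709320 [ 2644]: Banning timedout
2013/08/22 08:03:48.496343 [ 2644]: Node became HEALTHY. Ask recovery master 1 to perform ip reallocation
"""

TS_FORMAT = "%Y/%m/%d %H:%M:%S.%f"
BAN_PERIOD = 300  # seconds, the default stated in this report


def find_timestamp(log_text, needle):
    """Return the timestamp of the first log line containing `needle`."""
    for line in log_text.splitlines():
        if needle in line:
            # The first two whitespace-separated fields are the date and time.
            date_part, time_part = line.split()[:2]
            return datetime.strptime(f"{date_part} {time_part}", TS_FORMAT)
    raise ValueError(f"no log line contains {needle!r}")


banned = find_timestamp(LOG, "Banning this node")
unbanned = find_timestamp(LOG, "Banning timedout")
healthy = find_timestamp(LOG, "Node became HEALTHY")

ban_seconds = (unbanned - banned).total_seconds()
healthy_seconds = (healthy - banned).total_seconds()

print(f"ban lasted {ban_seconds:.1f} s")                  # ban lasted 437.8 s
print(f"overrun {ban_seconds - BAN_PERIOD:.1f} s")        # overrun 137.8 s
print(f"healthy again after {healthy_seconds:.1f} s")     # healthy again after 442.6 s
```

In this run the ban was lifted about 438 seconds after it started, roughly 138 seconds later than the 300 second period announced in the log.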

Comment 2 Christopher R. Hertel 2013-08-26 13:45:54 UTC
> In a ctdb setup, with bringing down the network interface for a node, the
> node is going to banned state.

This is normal behavior for CTDB.  Regarding nodes in the Banned state, the manual page for CTDB states:

"This node does not provide any services. All banned nodes should be investigated and require an administrative action to rectify."

This is a limitation of CTDB.

Comment 3 surabhi 2013-08-27 09:40:52 UTC
It is understandable that a banned node does not provide any services, and that a node may go to the banned state after many network reboots. In this case the node going to the banned state is fine, but it should recover within the recovery ban period (RecoveryBanPeriod), which is 300 seconds by default, and it does not. While testing, I also saw that it sometimes stays in the banned state forever.
So should it not recover within the default recovery ban period?

Comment 4 surabhi 2013-08-27 10:02:21 UTC
Also, on a fresh setup, even a single instance of the network going down sends the node to the banned state.
As per the man pages:
BANNED - This node failed too many recovery attempts and has been banned from participating in the cluster for a period of RecoveryBanPeriod seconds.
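
Since the man page ties the ban length to RecoveryBanPeriod, it may be worth confirming the value the cluster is actually using. A minimal sketch follows: it wraps the ctdb getvar command, and it assumes the command prints a single "Name = value" line (the exact padding may differ between ctdb versions). Both function names are hypothetical, and the sample string is an assumed illustration, not output captured from this setup.

```python
import subprocess


def parse_getvar_output(output):
    """Parse an assumed 'Name = value' line into (name, int(value))."""
    name, sep, value = output.strip().partition("=")
    if not sep:
        raise ValueError(f"unexpected getvar output: {output!r}")
    return name.strip(), int(value.strip())


def read_recovery_ban_period():
    """Ask the local ctdb daemon for RecoveryBanPeriod.

    Needs a running ctdbd on this node; it is not called in this sketch.
    """
    result = subprocess.run(
        ["ctdb", "getvar", "RecoveryBanPeriod"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_getvar_output(result.stdout)[1]


# Assumed sample of the command's output, so the parser can be tried without a cluster.
sample = "RecoveryBanPeriod    = 300\n"
name, seconds = parse_getvar_output(sample)
print(name, seconds)  # RecoveryBanPeriod 300
```

On a cluster node, read_recovery_ban_period() would return the configured value, which can then be compared with the ban duration seen in the log.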

Comment 5 Vivek Agarwal 2015-12-03 17:22:32 UTC
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release for which you requested a review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/

If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.