Description of problem:
In a CTDB setup, bringing down the network interface for a node and then bringing it back up sends the node into the banned state. First, bring down the network interface on one node. The node goes to DISCONNECTED|UNHEALTHY|INACTIVE, which is fine. Bring the network back up. The node then goes to the banned state:
=> sometimes it goes to the banned state and comes back to healthy, but not within the recovery ban period, which is 300 seconds by default.
=> sometimes it goes to the banned state and stays there forever.
I also made sure not to bring down the network interface on the node hosting the CTDB volume.

Version-Release number of selected component (if applicable):
samba-glusterfs-3.6.9-159.1.el6rhs.x86_64
glusterfs-3.4.0.19rhs-2.el6rhs.x86_64

How reproducible:
Not consistent; about once in 4 attempts.

Steps to Reproduce:
1. Create a CTDB setup.
2. Bring down the network interface on a node (ip link set dev eth0 down).
3. Bring the network interface back up.

Actual results:
The node goes to the banned state:
=> sometimes it goes to the banned state and comes back to healthy, but not within the recovery ban period, which is 300 seconds by default.
=> sometimes it goes to the banned state and stays there forever.

Expected results:
A node should only be banned after it has failed many recovery attempts; otherwise it should go to DISCONNECTED|UNHEALTHY|INACTIVE and then come back to healthy. Even if it has been banned for some reason, it should return to healthy after 300 seconds.

Additional info:
After bringing down the network, ctdb status reports the following:

# ctdb status
Number of nodes:4
pnn:0 10.16.159.152 OK
pnn:1 10.16.159.153 OK (THIS NODE)
2013/08/22 09:13:11.716108 [30084]: client/ctdb_client.c:759 control timed out. reqid:6 opcode:124 dstnode:2
2013/08/22 09:13:11.716240 [30084]: client/ctdb_client.c:870 ctdb_control_recv failed
2013/08/22 09:13:11.716260 [30084]: client/ctdb_client.c:2523 ctdb_control for get ifaces failed ret:-1 res:-1

From the ctdb log: the time it took to go from banned to unhealthy and then to healthy is more than 300 seconds (the default) — here roughly 438 seconds, from 07:56:25 to 08:03:43:

2013/08/22 07:56:25.940939 [ 2644]: Banning this node for 300 seconds
2013/08/22 07:56:25.976470 [ 2644]: No public addresses file found. Nothing to do for 10.interfaces
2013/08/22 07:57:08.588746 [ 2700]: We are still serving a public address '10.16.159.175' that we should not be serving.
2013/08/22 07:57:08.588811 [ 2700]: Trigger takeoverrun
2013/08/22 07:57:10.028444 [ 2700]: We are still serving a public address '10.16.159.175' that we should not be serving.
2013/08/22 08:03:43.709320 [ 2644]: Banning timedout
2013/08/22 08:03:44.023738 [ 2644]: Freeze priority 1
2013/08/22 08:03:44.025728 [ 2644]: Freeze priority 2
2013/08/22 08:03:45.911088 [ 2644]: Release freeze handler for prio 3
2013/08/22 08:03:48.496343 [ 2644]: Node became HEALTHY. Ask recovery master 1 to perform ip reallocation
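As an illustration of how a stuck ban could be detected, a small shell helper (hypothetical, not part of ctdb) can parse `ctdb status`-style output and print the pnn of every banned node; the sample input below mirrors the status format shown above:

```shell
#!/bin/sh
# Hypothetical helper: print the pnn of every node that `ctdb status`
# reports as BANNED. In production the input would come from running
# `ctdb status`; here a captured sample keeps the sketch self-contained.
banned_nodes() {
    awk '/^pnn:/ && /BANNED/ { sub(/^pnn:/, "", $1); print $1 }'
}

sample='Number of nodes:4
pnn:0 10.16.159.152 OK
pnn:1 10.16.159.153 OK (THIS NODE)
pnn:2 10.16.159.154 BANNED
pnn:3 10.16.159.155 DISCONNECTED|UNHEALTHY|INACTIVE'

printf '%s\n' "$sample" | banned_nodes   # prints: 2
```

Run periodically against live `ctdb status` output, a non-empty result lasting longer than the ban period would indicate the behaviour reported here.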
> In a ctdb setup, with bringing down the network interface for a node, the > node is going to banned state. This is normal behavior for CTDB. Regarding nodes in the Banned state, the manual page for CTDB states: "This node does not provide any services. All banned nodes should be investigated and require an administrative action to rectify." This is a limitation of CTDB.
It is understandable that a banned node does not provide any services, and with many network restarts a node may well go to the banned state. In this case the node going to the banned state is fine, but it should recover within the recovery ban period, which is 300 seconds by default. It is not recovering. While testing I also saw that the node sometimes stays in the banned state forever. Should it not recover within the default recovery ban period?
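For reference, the ban period in question is the RecoveryBanPeriod tunable, which can be inspected and adjusted with the standard ctdb getvar/setvar commands. The ban_expired helper below is purely illustrative (not a ctdb command): it applies the same elapsed-time check to timestamps taken from the log, assuming epoch seconds as input:

```shell
#!/bin/sh
# Real ctdb commands for the tunable (require a running ctdb daemon):
#   ctdb getvar RecoveryBanPeriod      # show the current ban period
#   ctdb setvar RecoveryBanPeriod 60   # e.g. shorten it while investigating

# Hypothetical check: given the epoch time a ban started, the configured
# period, and the current epoch time, decide whether the ban should
# already have expired.
ban_expired() {
    # args: ban_start_epoch period_secs now_epoch
    if [ $(( $3 - $1 )) -ge "$2" ]; then echo yes; else echo no; fi
}

ban_expired 100 300 350   # prints: no  (only 250s elapsed)
ban_expired 100 300 450   # prints: yes (350s elapsed)
```

In the log above, the ban at 07:56:25 only cleared at 08:03:43, well past the 300-second period this check would predict.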
Also, if I try on a fresh setup, even a single instance of the network going down sends the node to the banned state. Per the man page: BANNED - This node failed too many recovery attempts and has been banned from participating in the cluster for a period of RecoveryBanPeriod seconds.
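For completeness, the administrative action the man page alludes to is the ctdb unban command, run on the affected node (sketch only; these commands need a live cluster, and the expected status is an assumption):

```shell
# On the banned node, as the administrative action to rectify the ban:
#   ctdb unban          # lift the ban on this node
#   ctdb status         # confirm the node leaves the BANNED state
```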
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release you requested us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/ If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.