Bug 1322677 - CTDB: ctdb node remains in banned state until the ctdb service is restarted.
Summary: CTDB: ctdb node remains in banned state until the ctdb service is restarted.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: samba
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.1.3
Assignee: Michael Adam
QA Contact: surabhi
URL:
Whiteboard:
Depends On:
Blocks: 1311817
 
Reported: 2016-03-31 05:56 UTC by surabhi
Modified: 2016-06-23 05:36 UTC (History)
7 users

Fixed In Version: samba-4.4.3-7
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-06-23 05:36:52 UTC
Embargoed:


Attachments (Terms of Use)
Patch proposed to upstream (2.27 KB, patch)
2016-05-31 23:44 UTC, Michael Adam


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:1245 0 normal SHIPPED_LIVE gluster-smb bug fix and enhancement update 2016-06-23 09:13:06 UTC
Samba Project 11945 0 None None None 2016-06-01 09:51:12 UTC

Description surabhi 2016-03-31 05:56:09 UTC
Description of problem:
*****************************
On an RHGS + Samba setup, while running some (multi-channel) tests, a ctdb node goes to the banned state, and even restarting the ctdb service does not bring the nodes back to the OK state.

Bring down the interface and verify the ctdb failover: the node goes to the banned state, and even after the interface is brought back up and the ctdb service is restarted, the node remains banned until the firewall is reloaded or the iptables rules are flushed.

If the firewall is reloaded or the iptables rules are flushed, then restarting ctdb brings the nodes back to the OK state.

All the firewall settings recommended in the admin guide for Samba and CTDB are already in place on the system.

Note: This node has multiple interfaces configured.
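The workaround described above can be sketched as follows. This is a minimal sketch: firewalld is assumed, and the `banned_nodes` helper is illustrative, not part of ctdb; the `ctdb status` line format is assumed from ctdb 4.x output.

```shell
# Illustrative helper: extract the pnn numbers of banned nodes from
# `ctdb status` output (lines assumed to look like
# "pnn:1 10.70.37.2     BANNED").
banned_nodes() {
  grep 'BANNED' | sed -E 's/^pnn:([0-9]+).*/\1/'
}

# Workaround observed in this report (run on the banned node):
#   firewall-cmd --reload       # or: iptables -F (flushes ALL rules!)
#   systemctl restart ctdb
#   ctdb status | banned_nodes  # expect no output once the node is OK
```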


Version-Release number of selected component (if applicable):
*****************************
samba-4.4.0-1.el7rhgs

How reproducible:
2/3

Steps to Reproduce:
1. Bring down an interface to trigger IP failover; the ctdb node goes to the unhealthy state. (With multi-channel the node goes to the banned state; a separate issue will be raised for that.)
2. Bring up the interface to make the node healthy again.
3. Check the ctdb status to make sure both nodes are up and healthy.
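The steps above can be sketched as a shell snippet. The interface name is illustrative, and `wait_ok` is a hypothetical polling helper, not a ctdb command; the status command is passed in as a parameter so the logic can be exercised without a live cluster.

```shell
# Hypothetical helper: poll a status command until no node reports a
# bad state, up to a timeout (in seconds).
wait_ok() {
  status_cmd=$1; tries=${2:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if ! $status_cmd | grep -qE 'BANNED|UNHEALTHY|DISCONNECTED'; then
      return 0                   # all nodes OK
    fi
    sleep 1; i=$((i + 1))
  done
  return 1                       # still not healthy after the timeout
}

# Reproduction flow on a real node (eth1 is illustrative):
#   ifdown eth1             # step 1: trigger IP failover
#   ifup eth1               # step 2: bring the interface back up
#   wait_ok 'ctdb status'   # step 3: expect all nodes OK
```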

Actual results:
One of the nodes remains in the banned state even after bringing up the interface and restarting the ctdb service. It only returns to the OK state once the firewall is reloaded or the iptables rules are flushed.


Expected results:
The CTDB node should return to the OK state once the interface is up or the ctdb service is restarted.

Additional info:

Comment 6 surabhi 2016-05-23 06:41:42 UTC
Tried the test case on the latest build; the issue remains the same: once the ctdb node goes to the banned state, it does not come back to healthy until ctdb is restarted. (This is still the issue.)

In the original problem description it was mentioned that only after reloading the firewall, CTDB was coming to OK state.
Changing the summary accordingly.

Comment 7 Michael Adam 2016-05-23 07:30:37 UTC
(In reply to surabhi from comment #6)
> Tried the test case on latest build, the issue remains same where once the
> ctdb node goes to banned state it doesn't come back to healthy until ctdb is
> restarted.(Which is still the issue)

Now this sounds different from the original description...

> In the original problem description it was mentioned that only after
> reloading the firewall, CTDB was coming to OK state.

Correct! And restarting ctdb did not help.

> Changing the summary accordingly.

Hmm. Not sure this is the appropriate action.
IMHO, we should not change bugs to match different
problem descriptions... 

The problem you have seen this time sounds very much like the description of the phenomenon of bug #1333360. There, the reason for ctdb not getting healthy and unbanned again was smbd / notifyd crashing. Can you see crashes in samba here?

Comment 11 surabhi 2016-05-23 08:01:45 UTC
(In reply to Michael Adam from comment #7)
> (In reply to surabhi from comment #6)
> > Tried the test case on latest build, the issue remains same where once the
> > ctdb node goes to banned state it doesn't come back to healthy until ctdb is
> > restarted.(Which is still the issue)
> 
> Now this sounds different from the original description...
> 
> > In the original problem description it was mentioned that only after
> > reloading the firewall, CTDB was coming to OK state.
> 
> Correct! And restarting ctdb did not help.
> 
> > Changing the summary accordingly.
> 
> Hmm. Not sure this is the appropriate action.
> IMHO, we should not change bugs to match different
> problem descriptions... 

Just did that to filter out the firewall issue (not seen now), but if you think this creates confusion, I will put back the original summary.
> 
> The problem you have seen this time sounds very much like the description of
> the phenomenon of bug #1333360. There, the reason for ctdb not getting
> healthy and unbanned again was smbd / notifyd crashing. Can you see crashes
> in samba here?

Yeah, I always see a crash when the ctdb node goes to the banned state. I purposely raised two BZs to track them separately, one for the crash and one for ctdb not coming back to the OK state, since at the time of execution it was not clear whether the root cause is the same for both.

Comment 12 Michael Adam 2016-05-23 10:30:48 UTC
(In reply to surabhi from comment #11)
> (In reply to Michael Adam from comment #7)
> > (In reply to surabhi from comment #6)
> > > Tried the test case on latest build, the issue remains same where once the
> > > ctdb node goes to banned state it doesn't come back to healthy until ctdb is
> > > restarted.(Which is still the issue)
> > 
> > Now this sounds different from the original description...
> > 
> > > In the original problem description it was mentioned that only after
> > > reloading the firewall, CTDB was coming to OK state.
> > 
> > Correct! And restarting ctdb did not help.
> > 
> > > Changing the summary accordingly.
> > 
> > Hmm. Not sure this is the appropriate action.
> > IMHO, we should not change bugs to match different
> > problem descriptions... 
> 
> Just did that to filter the firewall issue(not seen now), but if you think
> this creates confusion will put the original summary.
> > 
> > The problem you have seen this time sounds very much like the description of
> > the phenomenon of bug #1333360. There, the reason for ctdb not getting
> > healthy and unbanned again was smbd / notifyd crashing. Can you see crashes
> > in samba here?
> 
> Yeah I always see crash with ctdb node going to banned state. Purposely
> raised two BZ's to track seperately , one for crash and one for ctdb not
> coming back to OK state as during the time of execution not sure if the root
> cause is same for both.

I see three different problems:

1. After a network interface goes down, which triggers ctdb to get
   into the banned state, ctdb does not get out of the banned state
   when the interface is brought up again until the firewall is
   reloaded (a ctdbd restart does not fix it).
   (This bug.)

2. After a network interface goes down, which triggers ctdb to get
   into the banned state, ctdb does not get out of the banned state
   when the interface is brought up again until ctdbd is restarted.

3. After a network interface goes down, smbd (notifyd) crashes.
   (This is bug #1333360.)

I don't know whether there is a bug for number 2.
But I think that we have 3 ==> 2, i.e. the
crashes at ifdown are the reason for ctdb not getting
healthy again after ifup. And it also explains why
a ctdb restart fixes it.

I think this bug should be used to track the firewall issue
as originally filed. And I don't even know if we need a
separate bug for number 2.
For number 3 we have a patch, and we need to retest with that
fix once we have it in a build.

Comment 16 surabhi 2016-05-30 07:53:34 UTC
After updating to build samba-4.4.3-6.el7rhgs.x86_64, performed the following steps; ctdb still remains in the banned state until ctdb is restarted.
The crash is fixed and has been verified in another BZ, but the banning issue remains.

Here, if a NIC is brought down and brought back up, the ctdb node remains in the banned state until the ctdb service is restarted.
Moving the BZ to ASSIGNED for further investigation.

Comment 17 Michael Adam 2016-05-31 23:44:05 UTC
Created attachment 1163387 [details]
Patch proposed to upstream

Based on my analysis, I created the attached patch and sent it to upstream (samba-technical) for review and comments.

It would be great to get a scratch build with this for testing it in our environment.

Comment 21 surabhi 2016-06-02 11:14:47 UTC
Took the scratch build https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11118338 and verified the issue on the original setup, which was as follows:

1. There are two CTDB nodes with the following network config:
Node1: eth0, eth1, eth2, eth3
Node2: eth0, eth1

eth0 on both nodes: on the public network
eth1, eth2, and eth3 NICs are on a private network (with static IPs configured) used for internal communication between the nodes.

The steps used:

1. Create a dis-rep volume and mount it on a Windows client (which also has a NIC configured on the private network) using the VIP (corresponding to eth1) of node1.
2. Start copying a large file from a Windows local share to the Samba share.
3. Bring down the interface eth1 with the command "ifdown eth1".
4. Observe the IP failover.
5. Once the failover has happened, bring up eth1 with the command "ifup eth1".
6. Observe the ctdb status.

Result: 
The ctdb node goes to the banned state, and even after the NIC is back up, the node remains banned.


********************************************************************

Tried the same test on another cluster, a 4-node ctdb cluster on the samba 4.4.3-5 build, with the following configuration:

1. All 4 nodes have NIC eth0 configured on the same subnet, which is used for both the public and private networks.

Bring down the NIC on node 0 and observe the ctdb status:
it shows disconnected for that node.
Bring the NIC on node 0 back up and observe the ctdb status again:
node 0 now shows as banned and remains in the banned state.

**************************

After updating to the scratch build provided by Obnox:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11118338

After the NIC is brought back up, all the nodes come out of the banned state and reach the OK state.
********************************************

Comment 22 Michael Adam 2016-06-06 10:54:03 UTC
Update from the continued RCA:

- The original issue is fixed with the provided fix.
- Going to provide official build with this fix.

- The remaining issue is a different one,
  which occurs only in a very special scenario:
  - Two-node cluster.
  - The interface must be brought down on the recmaster
    (no issue if the interface is brought down on a non-recmaster).

  The new issue is that ctdbd does not notice that the
  other node is back up again, and because it is a 2-node
  cluster, there is no other node to correct the banned
  node's wrong self-assessment that it is the recovery
  master... Hence the node coming out of the banning
  timeout keeps trying to perform recoveries, but fails
  because it cannot take the recovery lock, and so bans
  itself again.


Going to split out this very special bug as a new bug and
move this one to ON_QA once the build arrives.
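The self-banning loop described in this comment can be modeled with a toy sketch. This is not ctdb code: the lock flag and attempt count are illustrative stand-ins for the recovery lock being held elsewhere and the node's repeated recovery attempts.

```shell
# Toy model of the 2-node failure mode: the node leaving its ban
# timeout retries recovery, but each attempt fails to take the
# recovery lock, so it bans itself again -- it never reaches OK on
# its own.
recovery_loop() {
  lock_free=$1; attempts=$2; state=BANNED
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ "$lock_free" -eq 1 ]; then
      state=OK; break              # recovery succeeded
    fi
    state=BANNED                   # lock held elsewhere: self-ban again
    i=$((i + 1))
  done
  echo "$state"
}
```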

Comment 23 surabhi 2016-06-07 11:36:55 UTC
As mentioned by Obnox in comment 22, verified this bug on a 4-node setup:

Tried the same test on a 4-node ctdb cluster on the samba 4.4.3-7 build with the following configuration:

1. All 4 nodes have NIC eth0 configured on the same subnet, which is used for both the public and private networks.

Bring down the NIC on node 0 and observe the ctdb status:
it shows disconnected for that node.
Bring the NIC on node 0 back up and observe the ctdb status again:
node 0 shows as banned for a while and then comes back to the OK state, which is as expected.

**************************

On a 2-node setup, as mentioned in comment 21, the issue remains the same; will be opening another BZ for that.
Marking this BZ verified.

Comment 25 errata-xmlrpc 2016-06-23 05:36:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1245

