Description of problem:
*****************************
On an RHGS + Samba setup, while running multi-channel tests, CTDB goes into the banned state, and even restarting the ctdb service does not bring the nodes back to the OK state.

Bring down the interface and verify the CTDB failover: the node goes into the banned state, and even after the interface is brought back up and the ctdb service is restarted, the node remains banned until the firewall is reloaded or the iptables rules are flushed. Once the firewall is reloaded or iptables is flushed, restarting ctdb brings the nodes back to the OK state.

All the firewall settings for Samba and CTDB, as per the admin guide, are already configured on the system.

Note: This node has multiple interfaces configured.

Version-Release number of selected component (if applicable):
*****************************
samba-4.4.0-1.el7rhgs

How reproducible:
2/3

Steps to Reproduce:
1. Bring down an interface to trigger IP failover; the CTDB node goes to the unhealthy state. (With multi-channel the node goes to the banned state; will raise another issue for that.)
2. Bring the interface back up to make the node healthy.
3. Check the ctdb status to make sure both nodes are up and healthy.

Actual results:
One of the nodes remains in the banned state even after bringing the interface back up and restarting the ctdb service. It only comes back to the OK state once the firewall is reloaded or the iptables rules are flushed.

Expected results:
The CTDB node should come back to the OK state once the interface is up or the ctdb service is restarted.

Additional info:
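For reference, a rough sketch of the workaround described above (command names assume firewalld/iptables and a systemd-managed ctdb service; adjust to the actual environment):

    # reload the firewall (or, alternatively, flush the iptables rules)
    firewall-cmd --reload
    # iptables -F
    # then restart ctdb on the affected node and re-check the node state
    systemctl restart ctdb
    ctdb status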
Tried the test case on the latest build; the issue remains the same: once the CTDB node goes to the banned state it doesn't come back to healthy until ctdb is restarted (which is still an issue). In the original problem description it was mentioned that CTDB only came back to the OK state after reloading the firewall. Changing the summary accordingly.
(In reply to surabhi from comment #6)
> Tried the test case on the latest build; the issue remains the same: once
> the CTDB node goes to the banned state it doesn't come back to healthy
> until ctdb is restarted (which is still an issue).

Now this sounds different from the original description...

> In the original problem description it was mentioned that CTDB only came
> back to the OK state after reloading the firewall.

Correct! And restarting ctdb did not help.

> Changing the summary accordingly.

Hmm. Not sure this is the appropriate action. IMHO, we should not change bugs to match different problem descriptions...

The problem you have seen this time sounds very much like the description of the phenomenon of bug #1333360. There, the reason for ctdb not getting healthy and unbanned again was smbd / notifyd crashing. Can you see crashes in samba here?
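For reference, one way to check for smbd / notifyd crashes on the nodes (a sketch; assumes systemd-coredump is in use and the default Samba log location):

    coredumpctl list smbd                      # recent smbd core dumps, if any
    grep -i "panic" /var/log/samba/log.smbd    # look for Samba panic messages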
(In reply to Michael Adam from comment #7)
> (In reply to surabhi from comment #6)
> > Tried the test case on the latest build; the issue remains the same: once
> > the CTDB node goes to the banned state it doesn't come back to healthy
> > until ctdb is restarted (which is still an issue).
>
> Now this sounds different from the original description...
>
> > In the original problem description it was mentioned that CTDB only came
> > back to the OK state after reloading the firewall.
>
> Correct! And restarting ctdb did not help.
>
> > Changing the summary accordingly.
>
> Hmm. Not sure this is the appropriate action. IMHO, we should not change
> bugs to match different problem descriptions...

Just did that to filter out the firewall issue (not seen now), but if you think this creates confusion I will put back the original summary.

> The problem you have seen this time sounds very much like the description of
> the phenomenon of bug #1333360. There, the reason for ctdb not getting
> healthy and unbanned again was smbd / notifyd crashing. Can you see crashes
> in samba here?

Yes, I always see a crash when the CTDB node goes to the banned state. I purposely raised two BZs to track them separately, one for the crash and one for CTDB not coming back to the OK state, since at the time of execution it was not clear whether the root cause is the same for both.
(In reply to surabhi from comment #11)
> (In reply to Michael Adam from comment #7)
> > (In reply to surabhi from comment #6)
> > > Tried the test case on the latest build; the issue remains the same:
> > > once the CTDB node goes to the banned state it doesn't come back to
> > > healthy until ctdb is restarted (which is still an issue).
> >
> > Now this sounds different from the original description...
> >
> > > In the original problem description it was mentioned that CTDB only
> > > came back to the OK state after reloading the firewall.
> >
> > Correct! And restarting ctdb did not help.
> >
> > > Changing the summary accordingly.
> >
> > Hmm. Not sure this is the appropriate action. IMHO, we should not change
> > bugs to match different problem descriptions...
>
> Just did that to filter out the firewall issue (not seen now), but if you
> think this creates confusion I will put back the original summary.
>
> > The problem you have seen this time sounds very much like the description
> > of the phenomenon of bug #1333360. There, the reason for ctdb not getting
> > healthy and unbanned again was smbd / notifyd crashing. Can you see
> > crashes in samba here?
>
> Yes, I always see a crash when the CTDB node goes to the banned state. I
> purposely raised two BZs to track them separately, one for the crash and
> one for CTDB not coming back to the OK state, since at the time of
> execution it was not clear whether the root cause is the same for both.

I see three different problems:

1. After a network interface goes down, which triggers CTDB going into the banned state, CTDB does not get out of the banned state when the interface is brought up again until the firewall is reloaded (a ctdbd restart does not fix it). (This bug.)

2. After a network interface goes down, which triggers CTDB going into the banned state, CTDB does not get out of the banned state when the interface is brought up again until ctdbd is restarted.

3. After a network interface goes down, smbd (notifyd) crashes. (This is bug #1333360.)

I don't know whether there is a bug for number 2. But I think that we have 3 ==> 2, i.e. the crashes at ifdown are the reason for CTDB not getting healthy again after ifup. That also explains why a ctdb restart fixes it.

I think this bug should be used to track the firewall issue as originally filed, and I don't even know whether we need a separate bug for number 2. For number 3 we have a patch, and we need to retest with that fix once we have it in a build.
After updating to build samba-4.4.3-6.el7rhgs.x86_64, performed the following steps and CTDB still remains in the banned state until ctdb is restarted. The crash is fixed and has been verified in another BZ, but the banning issue remains: if a NIC is brought down and then brought back up, the CTDB node remains in the banned state until the ctdb service is restarted. Moving the BZ to ASSIGNED for further investigation.
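For reference, the rough command sequence for the steps above (a sketch, assuming eth1 is the affected NIC and a systemd-managed ctdb service):

    ifdown eth1               # trigger the IP failover; the node gets banned
    ifup eth1                 # bring the interface back up
    ctdb status               # the node still shows as BANNED
    systemctl restart ctdb    # only a ctdb restart brings the node back to OK
    ctdb status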
Created attachment 1163387 [details] Patch proposed to upstream

Based on my analysis, I created the attached patch and sent it upstream (to samba-technical) for review and comments. It would be great to get a scratch build with this for testing in our environment.
Took the scratch build https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11118338 and verified the issue on the original setup, which was as follows:

There are two CTDB nodes with the following network config:
Node1: eth0, eth1, eth2, eth3
Node2: eth0, eth1

eth0 on both nodes is on the public network; the eth1, eth2, and eth3 NICs are on a private network (with static IPs configured) used for internal communication between the nodes.

Steps used:
1. Create a distributed-replicate volume and mount it on a Windows client (which also has a NIC configured in the private network) using the VIP of node1 (corresponding to eth1).
2. Start copying a large file from a local share on the Windows client to the Samba share.
3. Bring down the interface eth1 with the command "ifdown eth1".
4. Observe the IP failover.
5. Once the failover has happened, bring up eth1 with the command "ifup eth1".
6. Observe the ctdb status.

Result: The CTDB node goes to the banned state, and even after the NIC is back up, the node remains banned.

********************************************************************

Tried the same test on another cluster, a 4-node CTDB cluster on the samba 4.4.3-5 build, with the following configuration: all 4 nodes have the NIC eth0 configured on the same subnet, which is used for both the public and private networks.

Bring down the NIC on node 0 and observe the ctdb status: it shows that node as DISCONNECTED. Bring the NIC on node 0 back up and observe the ctdb status again: it now shows node 0 as BANNED, and the node remains in the banned state.

**************************

After updating to the scratch build provided by Obnox (https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11118338), once the NIC is brought back up all the nodes come out of the banned state and return to the OK state.

********************************************
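For the failover observation in the steps above, a sketch of the commands that can be used (assuming the standard ctdb tooling):

    ctdb ip        # show which node currently hosts each public (virtual) IP
    ifdown eth1    # on node1; the VIP should fail over to the other node
    ctdb ip        # confirm the VIP has moved
    ifup eth1      # bring the interface back up
    ctdb status    # check whether the node returns to OK or stays BANNED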
Update from the continued RCA:

- The original issue is fixed with the provided fix.
- Going to provide an official build with this fix.
- The remaining issue is a different one, which occurs only in a very specific scenario:
  - Two-node cluster.
  - The interface must be brought down on the recmaster (there is no issue if the interface is brought down on a non-recmaster node).

The new issue is that ctdbd does not notice that the other node is back up again, and because it is a two-node cluster, there is no other node to correct the banned node's wrong self-assessment that it is the recovery master. Hence the node coming out of the ban timeout keeps trying to perform recoveries, but fails because it cannot obtain the recovery lock, and therefore bans itself again.

Going to split this very specific problem out into a new bug and move this one to ON_QA once the build arrives.
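To inspect the recovery-master and recovery-lock state described above, a sketch of the usual ctdb commands (assuming the standard ctdb CLI):

    ctdb recmaster     # show which node believes it is the recovery master
    ctdb getreclock    # show the recovery lock file in use
    ctdb status        # check whether the node stays BANNED or returns to OK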
As mentioned by Obnox in comment #22, verified this bug on a 4-node setup.

Tried the same test on a 4-node CTDB cluster on the samba 4.4.3-7 build with the following configuration: all 4 nodes have the NIC eth0 configured on the same subnet, which is used for both the public and private networks.

Bring down the NIC on node 0 and observe the ctdb status: it shows that node as DISCONNECTED. Bring the NIC on node 0 back up and observe the ctdb status again: it shows node 0 as banned for a while, after which the node comes back to the OK state, as expected.

**************************

On a 2-node setup, as mentioned in comment #21, the issue remains the same; will be opening another BZ for that.

Marking this BZ verified.
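The "banned for a while" behaviour corresponds to the CTDB ban period; a sketch of how to inspect it (assuming the default tunable name):

    ctdb getvar RecoveryBanPeriod    # seconds a node stays banned before retrying
    ctdb listvars | grep -i ban      # list the ban-related tunables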
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1245