Description of problem:
As part of CNV chaos testing we explored a scenario where we blocked network connectivity between workers and masters using nftables on the workers. The cluster is not able to detect that something happened to the nodes. Everything looks green from both the UI and the CLI, but commands fail. After blocking all worker nodes the cluster becomes unusable without telling the admin that something is wrong, and after a couple of minutes both the CLI and the UI return "unauthorized". After running "nft flush table filter" on all the nodes, both CLI and UI are usable again without any need to log in again.

Version-Release number of selected component (if applicable):
4.6

How reproducible:
100%

Steps to Reproduce:
1. ssh to a worker node
2. systemctl start nftables
3. nft add rule ip filter INPUT ip saddr <master0-2> counter drop
4. nft add rule ip filter OUTPUT ip saddr <master0-2> counter drop

Actual results:
No indication that anything is wrong with the node. It stays Ready even though its status is stale (last reported >30 mins ago).

Expected results:
The admin should be notified that the node(s) are not ready, and the corresponding logic to recover the workload should be triggered.

Additional info:
Only a few alerts fire: KubeClientErrors, AlertmanagerReceiversNotConfigured and AggregatedAPIDown, but node status is not affected.
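A consolidated, minimal sketch of the blocking steps above, for anyone re-running them by hand (MASTER0_IP/MASTER1_IP/MASTER2_IP are placeholders standing in for the real master addresses, which depend on the cluster; the rules mirror the report as written):

    # run on the worker node; MASTER0_IP/MASTER1_IP/MASTER2_IP are placeholder addresses
    systemctl start nftables
    for ip in MASTER0_IP MASTER1_IP MASTER2_IP; do
        # mirror the rules from the steps above for each master address
        nft add rule ip filter INPUT ip saddr "$ip" counter drop
        nft add rule ip filter OUTPUT ip saddr "$ip" counter drop
    done
    # to restore connectivity afterwards, as described above:
    nft flush table filter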
I don't think this is valid since the controller manager has a 5 minute timeout for node connectivity changes.
@Ryan, you haven't asked for any information, so why are you closing this bug as insufficient_data? After the 5-minute timeout the node does not change its status. Please ask for the information you need and reopen the bug.
Piotr: Question 1 asks if you waited more than 5-6 minutes to see the status change. I do not see a response to that question.
Yes, here are the steps:
1. ssh to a worker node and enable the rules as in the description of this bug
2. wait 5 mins and repeat step 1 with a different worker

I repeated the above 3 times (the number of worker nodes in my cluster). No nodes were reported as NotReady. Some time after the last worker was blocked the cluster stopped working, but it reported Ready status for all workers until the very last second.

Here you can find must-gather logs collected after resetting the rules on all the workers: https://drive.google.com/file/d/1nqIiuCu9zVeZZJESE-8SkU0N36XUCoHE/view?usp=sharing
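As a side note on observing the staleness: the Ready condition keeps its last value while its lastHeartbeatTime stops advancing, so (purely illustrative, run from any client that still has API access) something like this shows each node's reported status next to its last heartbeat:

    # illustrative only: print node name, Ready status and last heartbeat time
    oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\t"}{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}{"\n"}{end}'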
There have been fixes in the realm of http2 timeouts. The PR just recently merged, and this use case should be retested with the patches. I believe you will need to use a nightly build if possible to pick these patches up for now. https://github.com/openshift/kubernetes/pull/466
ATM we are working on automating this test case and we still see it failing with 4.6.4. In which version is the fix available?
(In reply to Piotr Kliczewski from comment #10)
> failing with 4.6.4. In which version is the fix available?

Looking at the corresponding bug https://bugzilla.redhat.com/show_bug.cgi?id=1901208, the fix went into 4.6.8. That and any following versions should have the fix.
We checked with 4.6.8 and we do not see any difference.
It looks like the problem here is that the connections are already established. There is probably an accept rule for established connections that comes before the manually created drop rules.
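If so, that should be visible in the ruleset ordering on the worker. A hypothetical illustration (the rule text and position are assumptions, not taken from the must-gather):

    # dump the full ruleset and check the order of rules in the INPUT/OUTPUT chains
    nft list ruleset
    # a conntrack rule like the following, sitting before the manually added drops,
    # would accept packets of already-established connections so they never hit the drops:
    #   ct state established,related accept
    # forcing the drop ahead of such a rule would need an insert at the top of the chain, e.g.:
    nft insert rule ip filter INPUT ip saddr MASTER0_IP counter drop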
Ryan, I would like to make sure that this is indeed not a bug. There were already a couple of potential fixes which did not solve this issue. I do not like closing this bug without checking that it works fine now. What are the steps, and which OCP version should we use to check?
Ryan, when you close the bug please provide the version which should be used to verify it.
Ryan, please see https://bugzilla.redhat.com/show_bug.cgi?id=1887480#c19
(In reply to Ryan Phillips from comment #17)
> It looks like the problem here is that the connections are already
> established. There is probably an accept rule for established connections
> that comes before the manually created drop rules.

(In reply to Piotr Kliczewski from comment #19)
> Ryan, when you close the bug please provide the version which should be
> used to verify it.

He's saying that it's not "fixed" in any version; the problem isn't with OCP, it's with the test case. The test expects traffic to be dropped when there are other rules that prevent some of it from being dropped.

There is no good way to test "what happens when the node loses network connectivity" other than _actually_ making the node lose network connectivity. If you try to simulate losing network connectivity by making networking configuration changes on the node, you are not testing "what happens when the node loses network connectivity", you are testing "what happens when the administrator makes unsupported networking configuration changes", and the answer to that is "it's undefined, and may differ from release to release".
(In reply to Dan Winship from comment #21)
> There is no good way to test "what happens when the node loses network
> connectivity" other than _actually_ making the node lose network
> connectivity. If you try to simulate losing network connectivity by making
> networking configuration changes on the node, you are not testing "what
> happens when the node loses network connectivity", you are testing "what
> happens when the administrator makes unsupported networking configuration
> changes", and the answer to that is "it's undefined, and may differ from
> release to release".

We are not testing what happens when the node loses connectivity; we are testing potential issues with networking gear. I investigated an issue in the past where one of two switches working in round robin failed. In that situation half of the packets were fine and the other half were dropped.
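(As an aside, and not something used in this report: that kind of partial loss can be roughly approximated in a lab with probabilistic drops, e.g. the iptables statistic match; the address below is a placeholder.)

    # illustrative only: drop roughly half of the packets coming from MASTER0_IP
    iptables -I INPUT -s MASTER0_IP -m statistic --mode random --probability 0.5 -j DROP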
Please either provide a solution to the issue or discuss why it should be closed before closing it.
(In reply to Piotr Kliczewski from comment #22)
> We are not testing what happens when the node loses connectivity; we are
> testing potential issues with networking gear.

OK, but still, you're not testing potential issues with networking gear, you're testing what happens when someone attempts to _simulate_ issues with external networking gear by making unsupported networking configuration changes on the node itself. But the answer to "what happens when someone makes unsupported changes?" is "it's undefined".

I assume there are ways that you could make the node behave like it's connected to a broken switch, but there is no _supported_ way to do that (from inside the node), which means that even if you find a way that works in one release, there's no guarantee that the same approach will work in future releases (or when different combinations of features are enabled, etc.), because you're depending on undocumented behavior of the product (e.g., assuming that a particular nft rule, inserted in a particular location, will have a specific effect and will not end up being overridden by another rule created by OCP, as appears to have happened in this case according to comment 17).

If you want to test the behavior of the system when the external network is misbehaving, the only supported way to do that is to actually make the _external_ network misbehave.
Do I understand correctly that you define networking gear issues as unsupported? Due to limitations of our CI it is not possible to trigger hardware issues, so this is a best-effort attempt to see how the software handles such cases. As I mentioned, all the test cases we run are based on real-life customer experience. If you want to wait for this to happen on the customer side, that is fine by me. You can always point to this bug and say that those failures are unsupported :D
> Do I understand correctly that you define networking gear issues as unsupported?

That's not even remotely close to what I said.