Bug 1887480 - [CNV][Chaos] Networking issues between masters and workers
Summary: [CNV][Chaos] Networking issues between masters and workers
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks: 1908661
 
Reported: 2020-10-12 15:21 UTC by Piotr Kliczewski
Modified: 2021-11-09 23:09 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-08 20:00:38 UTC
Target Upstream Version:
Embargoed:



Description Piotr Kliczewski 2020-10-12 15:21:06 UTC
Description of problem:
As part of CNV chaos testing we explored a scenario where we blocked network connectivity between workers and masters using nftables on the workers. The cluster is not able to detect that something has happened to the nodes: everything looks green from both the UI and the CLI, but commands fail.

After blocking all worker nodes, the cluster becomes unusable without telling the admin that something is wrong, and after a couple of minutes both the CLI and the UI return "unauthorized".

After running: 

nft flush table filter

all the nodes are usable again without any need to log in.


Version-Release number of selected component (if applicable):
4.6

How reproducible:
100%

Steps to Reproduce:
1. ssh to a worker node
2. systemctl start nftables 
3. nft add rule ip filter INPUT ip saddr <master0-2> counter drop
4. nft add rule ip filter OUTPUT ip saddr <master0-2> counter drop
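For reference, a consolidated sketch of the steps above as run on one worker. The master IPs are placeholders for <master0-2>, and the rules are written exactly as in the steps (note that the OUTPUT rule matches the source address as reported; a rule intended to drop traffic destined to the masters would normally match daddr instead):

# master IPs below are placeholders
systemctl start nftables
for master in 10.0.0.10 10.0.0.11 10.0.0.12; do
    nft add rule ip filter INPUT ip saddr $master counter drop
    nft add rule ip filter OUTPUT ip saddr $master counter drop   # as reported; blocking outbound traffic would usually use 'ip daddr'
done
# to restore connectivity afterwards, as in the description:
nft flush table filter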

Actual results:
No indication that something is wrong with a node. It stays Ready even though the node status is stale (last reported more than 30 minutes ago).
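One way to observe the staleness (a sketch; the node name is a placeholder) is to compare the Ready condition's last heartbeat time with the current time:

# worker-0 is a placeholder node name
oc get node worker-0 -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}{"\n"}'
date -u +%Y-%m-%dT%H:%M:%SZ   # the heartbeat above lags by more than 30 minutes
oc get nodes                  # yet the blocked workers still show Ready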

Expected results:
The admin should be notified that the node(s) are not ready, and the corresponding logic to recover the workload should be triggered.

Additional info:
Only a few alerts fire: KubeClientErrors, AlertmanagerReceiversNotConfigured, and AggregatedAPIDown, but the node status is not affected.

Comment 3 Ryan Phillips 2020-10-12 16:53:04 UTC
I don't think this is valid since the controller manager has a 5 minute timeout for node connectivity changes.

Comment 5 Piotr Kliczewski 2020-10-25 13:40:32 UTC
@Ryan, you haven't asked for any information, so why are you closing this bug as insufficient_data? After the 5 minute timeout a node does not change its status.
Please ask for the information you need and reopen the bug.

Comment 6 Ryan Phillips 2020-11-10 15:36:58 UTC
Piotr: Question 1 asks if you waited more than 5-6 minutes to see the status change. I do not see a response to that question.

Comment 7 Piotr Kliczewski 2020-11-10 15:55:57 UTC
Yes, here are the steps:
1. ssh to a worker node and enable the rules as in the description of this bug
2. wait 5 minutes and run the first step again on a different worker

I repeated the above 3 times (the number of nodes in my cluster). No nodes were reported as not ready. After some time, when the last worker was blocked, the cluster stopped working but reported Ready status for all workers until the very last second.
Here you can find must-gather logs collected after resetting the rules on all the workers: https://drive.google.com/file/d/1nqIiuCu9zVeZZJESE-8SkU0N36XUCoHE/view?usp=sharing
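In script form, the repetition looks roughly like this (the worker names are placeholders, 'core' is the usual RHCOS ssh user, and block_masters stands in for the nft drop rules from the description):

# block_masters is a hypothetical helper that adds the drop rules from the description
for worker in worker-0 worker-1 worker-2; do
    ssh core@$worker 'sudo systemctl start nftables && sudo block_masters'
    sleep 300          # wait 5 minutes before blocking the next worker
    oc get nodes       # all workers still report Ready
done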

Comment 9 Ryan Phillips 2020-12-17 16:43:35 UTC
There have been fixes in the realm of http2 timeouts. The PR just recently merged, and this use case should be retested with the patches. I believe you will need to use a nightly build if possible to pick these patches up for now.

https://github.com/openshift/kubernetes/pull/466

Comment 10 Piotr Kliczewski 2020-12-18 10:12:58 UTC
ATM we are working on automating this test case, and we still see it failing with 4.6.4. In which version is the fix available?

Comment 11 Neelesh Agrawal 2021-01-04 15:30:17 UTC
(In reply to Piotr Kliczewski from comment #10)
> failing with 4.6.4. In which version is the fix available?

Looking at the corresponding bug https://bugzilla.redhat.com/show_bug.cgi?id=1901208
The fix went into 4.6.8. That and any following versions should have the fix.
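To confirm which version a cluster under test is actually running, for example:

oc get clusterversion      # installed OCP version; should be 4.6.8 or later
oc get nodes -o wide       # kubelet version per node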

Comment 12 Piotr Kliczewski 2021-01-05 08:55:02 UTC
We checked with 4.6.8 and we do not see any difference.

Comment 17 Ryan Phillips 2021-06-01 16:23:20 UTC
Looks like the problem here is that the connections are established. There is probably an accept rule for established connections prior to the manually created drop commands.
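A sketch of how this can be checked on the worker (the exact chains differ between releases, but a conntrack accept rule placed before the appended drops would let the already-established master connections through):

nft list ruleset            # inspect all tables and chains on the node
nft list table ip filter    # look for something like the following ahead of the manual drops:
#   ct state established,related accept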

Comment 18 Piotr Kliczewski 2021-06-02 06:26:08 UTC
Ryan, I would like to make sure that this is indeed not a bug. There have already been a couple of potential fixes which did not solve this issue. I do not like closing this bug without checking that it works fine now.
What are the steps, and which OCP version should be used to check?

Comment 19 Piotr Kliczewski 2021-07-20 12:23:09 UTC
Ryan, when you close the bug, please provide the version which should be used to verify.

Comment 20 Tom Sweeney 2021-07-20 14:29:08 UTC
Ryan, please see https://bugzilla.redhat.com/show_bug.cgi?id=1887480#c19

Comment 21 Dan Winship 2021-11-08 20:00:38 UTC
(In reply to Ryan Phillips from comment #17)
> Looks like the problem here is that the connections are established. There
> is probably an accept rule for established connections prior to the manually
> created drop commands.

(In reply to Piotr Kliczewski from comment #19)
> Ryan, when you close the bug, please provide the version which should be used
> to verify.

He's saying that it's not "fixed" in any version; the problem isn't with OCP, it's with the test case. It is expecting traffic to be dropped when there are other rules that prevent some of it from being dropped.

There is no good way to test "what happens when the node loses network connectivity" other than _actually_ making the node lose network connectivity. If you try to simulate losing network connectivity by making networking configuration changes on the node, you are not testing "what happens when the node loses network connectivity", you are testing "what happens when the administrator makes unsupported networking configuration changes", and the answer to that is "it's undefined, and may differ from release to release".

Comment 22 Piotr Kliczewski 2021-11-09 07:44:32 UTC
(In reply to Dan Winship from comment #21)
> There is no good way to test "what happens when the node loses network
> connectivity" other than _actually_ making the node lose network
> connectivity. If you try to simulate losing network connectivity by making
> networking configuration changes on the node, you are not testing "what
> happens when the node loses network connectivity", you are testing "what
> happens when the administrator makes unsupported networking configuration
> changes", and the answer to that is "it's undefined, and may differ from
> release to release".

We are not testing what happens when the node loses connectivity; we are testing potential issues with networking gear.
I investigated an issue in the past where one of two switches working in round-robin failed. In that situation,
half of the packets were fine and the other half were dropped.

Comment 23 Piotr Kliczewski 2021-11-09 07:45:54 UTC
Please either provide a solution to the issue or discuss why it should be closed before closing it.

Comment 24 Dan Winship 2021-11-09 16:14:01 UTC
(In reply to Piotr Kliczewski from comment #22)
> We are not testing what happens when the node loses connectivity; we are testing
> potential issues with networking gear.

OK, but still, you're not testing potential issues with networking gear, you're testing what happens when someone attempts to _simulate_ issues with external networking gear by making unsupported networking configuration changes on the node itself. But the answer to "what happens when someone makes unsupported changes?" is "it's undefined".

I assume there are ways that you could make the node behave like it's connected to a broken switch, but there is no _supported_ way to do that (from inside the node), which means that even if you find a way that works in one release, there's no guarantee that the same approach will work in future releases (or when different combinations of features are enabled, etc.), because you're depending on undocumented behavior of the product (e.g., assuming that a particular nft rule, inserted in a particular location, will have a specific effect, and will not end up being overridden by another rule created by OCP, as appears to have happened in this case according to comment 17).

If you want to test the behavior of the system when the external network is misbehaving, the only supported way to do that is to actually make the _external_ network misbehave.

Comment 25 Piotr Kliczewski 2021-11-09 18:04:00 UTC
Do I understand correctly that you define networking gear issues as unsupported?

Due to limitations of our CI it is not possible to trigger hardware issues. This is a best-effort attempt to see how the software handles such cases.
As I mentioned, all the test cases we run are based on real-life customer experience.

If you want to wait for this to happen at a customer site, that is fine by me. You can always point to this bug and say that those failures are unsupported :D

Comment 26 Dan Winship 2021-11-09 23:09:20 UTC
> Do I understand correctly that you define networking gear issues as unsupported?

That's not even remotely close to what I said.

