Description:
As part of CNV chaos testing we explored a scenario where we blocked network connectivity between workers and masters using nftables on the workers. After blocking the connectivity for more than 30 minutes and then resetting the nft rules, OCS stopped working (checked 3 hours after the test). It was not possible to bind PVCs, and an endless loop was observed:

    InstallCheckFailed    install timeout
    NeedsReinstall        installing: waiting for deployment ocs-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
    AllRequirementsMet    all requirements found, attempting install

There was also the condition: MinimumReplicasUnavailable - Deployment does not have minimum availability.

Version of all relevant components (if applicable):
4.6

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
5

Is this issue reproducible?
Sometimes. 1 time out of 2 test runs.

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:
It is not a regression.

Steps to Reproduce:
1. ssh to a worker node
2. systemctl start nftables
3. nft add rule ip filter INPUT ip saddr <master0-2> counter drop
4. nft add rule ip filter OUTPUT ip saddr <master0-2> counter drop

Actual results:
OCS not usable - not able to bind new PVCs.

Expected results:
The storage should work fine.

Additional info:
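For reference, a minimal command sketch of the block-and-reset sequence described above, assuming the table and chain names from the steps. The reset commands are an assumption, since the report only says the nft rules were reset:

    # On the worker: block traffic to/from each master (repeat for master0-2)
    systemctl start nftables
    nft add rule ip filter INPUT ip saddr <masterN-ip> counter drop
    nft add rule ip filter OUTPUT ip saddr <masterN-ip> counter drop

    # After >30 minutes, reset the rules (assumed method)
    nft flush chain ip filter INPUT
    nft flush chain ip filter OUTPUT
    # or simply: nft flush ruleset && systemctl stop nftables
    nft list ruleset    # verify no drop rules remain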
Not sure whether this should be an OCS issue or not, but let's start with "installation" rather than leaving it unclassified.
Mudit, so far "installation" has not proven to be really useful. Can we get more logs? If the deployment never rolls out, isn't that an OCP issue? Moving to OCS-op based on "NeedsReinstall installing: waiting for deployment ocs-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...".
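As a hedged starting point for gathering more data on the stuck rollout (the openshift-storage namespace is assumed; adjust to the actual install namespace):

    oc -n openshift-storage get deployment ocs-operator -o wide
    oc -n openshift-storage describe deployment ocs-operator
    oc -n openshift-storage get pods -o wide | grep ocs-operator
    oc -n openshift-storage logs deployment/ocs-operator --tail=200
    oc -n openshift-storage get csv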
Important information: OCS was fully operational before the disruption test, and there was a workload (VM) using it. The issue is that when connectivity between the nodes was restored, the cluster recovered but OCS became unusable.
While interesting, this is not something that should be a blocker for OCS 4.6. Moving to OCS 4.7.
This is still interesting, and still not a blocker. We really need more information before we can proceed, at the very least full OCP and OCS must-gather after the chaos was initiated. It may end up being a general OCP bug. Also, what platform was this on? Setting NEEDINFO and moving out to OCS 4.8.
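For reference, a hedged sketch of the data collection being requested; the OCS must-gather image and tag are assumptions and should match the installed version:

    oc adm must-gather                                                             # default OCP must-gather
    oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6  # OCS must-gather (image/tag assumed)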
Here is a must-gather [1] from one such scenario. Unfortunately OCS remained stable in that run, but it may give you more information about what is happening in the cluster. This issue could be time-dependent and, as such, not easy to reproduce. [1] https://drive.google.com/file/d/1nqIiuCu9zVeZZJESE-8SkU0N36XUCoHE/view?usp=sharing
Jose, what is needed to move forward with this?
Sorry for letting this sit around so long. Since there hasn't been any other follow-up from the chaos testing, we can probably safely move this to ODF 4.9. However, I'll try and set up something with the CNV team to see if this is still relevant and look further into it if desired.
No update for a long time, and it is not clear whether this is still relevant. Closing it; please reopen if required.