Description:
As part of CNV chaos testing we explored a scenario where we blocked network connectivity between workers and masters using nftables on the workers. After blocking the connectivity for more than 30 minutes and then resetting the nft rules, OCS stopped working (checked 3 hours after the test). It was not possible to bind PVCs, and an endless loop was observed:

    InstallCheckFailed    install timeout
    NeedsReinstall        installing: waiting for deployment ocs-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
    AllRequirementsMet    all requirements found, attempting install

There was also the condition: MinimumReplicasUnavailable - Deployment does not have minimum availability.

Version of all relevant components (if applicable):
4.6

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
5

Is this issue reproducible?
Sometimes. 1 time out of 2 test runs.

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:
It is not a regression.

Steps to Reproduce:
1. ssh to a worker node
2. systemctl start nftables
3. nft add rule ip filter INPUT ip saddr <master0-2> counter drop
4. nft add rule ip filter OUTPUT ip saddr <master0-2> counter drop

Actual results:
OCS not usable - not able to bind new PVCs.

Expected results:
The storage should work fine.

Additional info:
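For reference, a minimal command sketch of the block-and-reset sequence described above, assuming the table and chain names from the steps. The reset commands are an assumption, since the report only says the nft rules were reset:

    # On the worker: block traffic to/from each master (repeat for master0-2)
    systemctl start nftables
    nft add rule ip filter INPUT ip saddr <masterN-ip> counter drop
    nft add rule ip filter OUTPUT ip saddr <masterN-ip> counter drop

    # After >30 minutes, reset the rules (assumed method)
    nft flush chain ip filter INPUT
    nft flush chain ip filter OUTPUT
    # or simply: nft flush ruleset && systemctl stop nftables
    nft list ruleset    # verify no drop rules remain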
Not sure whether this should be an OCS issue or not, but let's start with "installation" rather than leaving it unclassified.
Mudit, so far "installation" has not proven to be really useful. Can we get more logs? If the deployment never rolls out, isn't that an OCP issue? Moving to OCS-op based on "NeedsReinstall installing: waiting for deployment ocs-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...".
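As a hedged starting point for gathering more data on the stuck rollout (the openshift-storage namespace is assumed; adjust to the actual install namespace):

    oc -n openshift-storage get deployment ocs-operator -o wide
    oc -n openshift-storage describe deployment ocs-operator
    oc -n openshift-storage get pods -o wide | grep ocs-operator
    oc -n openshift-storage logs deployment/ocs-operator --tail=200
    oc -n openshift-storage get csv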
Important information: OCS was fully operational before the disruption test, and there was a workload (VM) using it. The issue is that when connectivity between the nodes was restored, the cluster recovered but OCS became unusable.
While interesting, this is not something that should be a blocker for OCS 4.6. Moving to OCS 4.7.
This is still interesting, and still not a blocker. We really need more information before we can proceed, at the very least full OCP and OCS must-gather after the chaos was initiated. It may end up being a general OCP bug. Also, what platform was this on? Setting NEEDINFO and moving out to OCS 4.8.
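For reference, a hedged sketch of the data collection being requested; the OCS must-gather image and tag are assumptions and should match the installed version:

    oc adm must-gather                                                             # default OCP must-gather
    oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6  # OCS must-gather (image/tag assumed)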
Here is a must-gather [1] from one such scenario. Unfortunately OCS remained stable in that run, but it may give you more information about what is happening in the cluster. This issue could be time-dependent and, as such, not easy to reproduce. [1] https://drive.google.com/file/d/1nqIiuCu9zVeZZJESE-8SkU0N36XUCoHE/view?usp=sharing
Jose, what is needed to move forward with this?
Sorry for letting this sit around so long. Since there hasn't been any other follow-up from the chaos testing, we can probably safely move this to ODF 4.9. However, I'll try and set up something with the CNV team to see if this is still relevant and look further into it if desired.
No update for a long time, and it is not clear whether this is still relevant. Closing it; please reopen if required.