Bug 2100703

Summary: [Metro-DR] NetworkFence CR is not reconciled by the operator
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Raghavendra Talur <rtalur>
Component: csi-addons Assignee: Niels de Vos <ndevos>
Status: CLOSED CURRENTRELEASE QA Contact: akarsha <akrai>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 4.11 CC: aclewett, akrai, hnallurv, muagarwa, ndevos, ocs-bugs, odf-bz-bot, rar, sheggodu
Target Milestone: --- Keywords: TestBlocker
Target Release: ODF 4.12.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: When a Ceph-CSI Pod starts, it passes the Pod's IP address to the CSI-Addons sidecar. When the Pod restarts, its IP address may change.
Consequence: If the restart does not change the name of the Pod, the CSIAddonsNode CR can still contain the previous IP address. In that case the CSI-Addons Controller cannot detect the new IP address and fails to connect to the sidecar.
Fix: Use the name of the Ceph-CSI Pod and the Namespace where the Pod is running, instead of the IP address.
Result: The CSI-Addons Controller can look up the CSIAddonsNode CR, get the endpoint attribute, and resolve the name of the Pod in the Namespace to an IP address.
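The lookup described in the fix can be illustrated with a minimal Python sketch. This is not the CSI-Addons code (which is written in Go). It assumes the endpoint attribute has the form pod://<pod-name>.<namespace>:<port>. That format, the function names, and the lookup_pod_ip callback standing in for a Kubernetes API query are assumptions made for illustration only.

```python
from typing import Callable, NamedTuple


class PodEndpoint(NamedTuple):
    pod_name: str
    namespace: str
    port: int


def parse_pod_endpoint(endpoint: str) -> PodEndpoint:
    """Split 'pod://<pod-name>.<namespace>:<port>' (assumed format) into parts."""
    scheme = "pod://"
    if not endpoint.startswith(scheme):
        raise ValueError(f"unsupported endpoint: {endpoint!r}")
    host, sep, port = endpoint[len(scheme):].rpartition(":")
    if not sep or not port.isdigit():
        raise ValueError(f"missing or invalid port in endpoint: {endpoint!r}")
    # Namespaces cannot contain dots, so the last dot separates Pod and Namespace.
    pod_name, sep, namespace = host.rpartition(".")
    if not sep or not pod_name or not namespace:
        raise ValueError(f"expected <pod-name>.<namespace> in endpoint: {endpoint!r}")
    return PodEndpoint(pod_name, namespace, int(port))


def resolve_endpoint(endpoint: str, lookup_pod_ip: Callable[[str, str], str]) -> str:
    """Return 'ip:port' using the Pod's current IP, looked up at connect time."""
    ep = parse_pod_endpoint(endpoint)
    # lookup_pod_ip(namespace, pod_name) is a hypothetical stand-in for an API query.
    ip = lookup_pod_ip(ep.namespace, ep.pod_name)
    return f"{ip}:{ep.port}"
```

Because the IP address is looked up on every call instead of being stored in the CR, a Pod that restarts under the same name but with a new address still resolves correctly.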
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-02-08 14:06:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Raghavendra Talur 2022-06-24 03:12:52 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
When a NetworkFence CR is created, the operator does not reconcile it. It looks like the CSI-Addons pod is not able to connect to another CSI pod. More details are in the attached log.

Version of all relevant components (if applicable):
4.11

Comment 3 Raghavendra Talur 2022-06-24 03:15:39 UTC
The workaround is to:

1. Stop the csi-rbdplugin-provisioner pod and the csi-addons-controller-manager pod.
2. Delete the CSIAddonsNode object.
3. Start the csi-rbdplugin-provisioner pod and the csi-addons-controller-manager pod.

Comment 5 Niels de Vos 2022-06-24 11:22:57 UTC
How often does this happen? Does it happen consistently, or was this a one-time occurrence?

I guess we could investigate marking a node unavailable if there is some error, and retry on certain errors (resolve the node, and create a new connection).
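As a rough illustration of that idea, here is a minimal Python sketch, not the controller's actual Go code. On a connection error it drops the cached connection (treating the node as unavailable), resolves the address again, and reconnects. The names call_with_reconnect, resolve, connect and call are hypothetical, and using ConnectionError as the retryable error is an assumption.

```python
from typing import Callable, Optional, TypeVar

C = TypeVar("C")
R = TypeVar("R")


def call_with_reconnect(
    resolve: Callable[[], str],
    connect: Callable[[str], C],
    call: Callable[[C], R],
    max_attempts: int = 3,
) -> R:
    """Run call(conn), re-resolving the address and reconnecting on failure."""
    conn: Optional[C] = None
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            if conn is None:
                # Resolve again on every new connection, so a changed IP is seen.
                conn = connect(resolve())
            return call(conn)
        except ConnectionError as err:
            last_error = err
            conn = None  # mark the node unavailable and force a fresh resolve
    raise ConnectionError(f"node unavailable after {max_attempts} attempts") from last_error
```

The point of the design is that resolve() runs again after each failure, so a stale address is never reused for the next attempt.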

Comment 6 Mudit Agarwal 2022-07-05 14:24:42 UTC
Not a 4.11 blocker

Comment 7 Raghavendra Talur 2022-07-19 20:02:03 UTC
(In reply to Niels de Vos from comment #5)
> How often does this happen? Does it happen consistently, or was this a
> one-time occurrence?
> 
> I guess we could investigate marking a node unavailable if there is some
> error, and retry on certain errors (resolve the node, and create a new
> connection).

This was seen only once when this bug was filed, but in the past week, we seem to have hit this issue twice. Rakshith R had collected the required logs last time but I had not been able to reproduce it after that.


Attaching a must-gather for further debugging.

Comment 8 Annette Clewett 2022-07-19 20:06:34 UTC
rtalur attached must-gather.

Comment 9 Raghavendra Talur 2022-07-19 20:10:19 UTC
*** Bug 2106613 has been marked as a duplicate of this bug. ***

Comment 10 Annette Clewett 2022-07-19 20:12:43 UTC
rtalur ocs-must-gather link - https://drive.google.com/file/d/1dbKCFJdFctlJMFF0Hfnq4VXYfrD1sIXd/view?usp=sharing.

Comment 16 Niels de Vos 2022-07-27 11:27:39 UTC
A workaround has been posted for review in upstream at https://github.com/csi-addons/kubernetes-csi-addons/pull/186

We're still investigating a more appropriate solution.

Comment 17 Niels de Vos 2022-07-28 18:37:44 UTC
https://github.com/csi-addons/kubernetes-csi-addons/pull/190 is an alternative that does not delete and re-create the CSIAddonsNode CR.

I'd like to test the solution, but without steps to reproduce it is rather difficult :-/

Comment 26 Niels de Vos 2022-10-12 12:29:39 UTC
https://github.com/red-hat-storage/kubernetes-csi-addons/commit/a9febe2efde7a9426cb53f27a86efb3535913e34 is the backport that should prevent this issue from happening again. It was included in the release-4.12 branch with https://github.com/red-hat-storage/kubernetes-csi-addons/pull/54

Builds from the beginning of September have the fix already.