Bug 2100703 - [Metro-DR] NetworkFence CR is not reconciled by the operator
Summary: [Metro-DR] NetworkFence CR is not reconciled by the operator
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: csi-addons
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.12.0
Assignee: Niels de Vos
QA Contact: akarsha
URL:
Whiteboard:
Duplicates: 2106613
Depends On:
Blocks:
Reported: 2022-06-24 03:12 UTC by Raghavendra Talur
Modified: 2023-08-09 16:37 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When a Ceph-CSI Pod starts, it passes the IP address of the Pod to the CSI-Addons sidecar. When the Pod restarts, its IP address may change.
Consequence: If the restart does not change the name of the Pod, the CSIAddonsNode CR can still contain the previous IP address. In that case, the CSI-Addons Controller cannot detect the new IP address and fails to connect to the sidecar.
Fix: Use the name of the Ceph-CSI Pod and the Namespace where the Pod is running, instead of the IP address.
Result: The CSI-Addons Controller can look up the CSIAddonsNode CR, get the endpoint attribute, and resolve the name of the Pod in the Namespace to an IP address.
Clone Of:
Environment:
Last Closed: 2023-02-08 14:06:28 UTC
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | csi-addons/kubernetes-csi-addons pull 186 | 0 | None | open | sidecar: delete pre-existing csiaddonsnode object & recreate | 2022-07-27 11:27:38 UTC
Github | csi-addons/kubernetes-csi-addons pull 190 | 0 | None | Merged | Use `pod://` URL formatting for CSIAddonsNode endpoints | 2022-10-12 12:29:32 UTC
Github | red-hat-storage/kubernetes-csi-addons pull 54 | 0 | None | Merged | [release-4.12] sync downstream with upstream | 2022-10-12 12:29:39 UTC

Description Raghavendra Talur 2022-06-24 03:12:52 UTC
Description of problem (please be as detailed as possible and provide log snippets):
When a NetworkFence CR is created, the operator does not reconcile it. It looks like the CSI-Addons pod is not able to connect to another CSI pod. More details are in the attached log.

Version of all relevant components (if applicable):
4.11

Comment 3 Raghavendra Talur 2022-06-24 03:15:39 UTC
The workaround is to:

1. Stop the csi-rbdplugin-provisioner pod and the csi-addons-controller-manager pod.
2. Delete the CSIAddonsNode object.
3. Start the csi-rbdplugin-provisioner pod and the csi-addons-controller-manager pod.

Comment 5 Niels de Vos 2022-06-24 11:22:57 UTC
How often does this happen? Does it happen consistently, or was this a one-time occurrence?

I guess we could investigate marking a node unavailable if there is some error, and retry on certain errors (resolve the node, and create a new connection).
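
As a rough illustration of the retry idea above (resolve the node again, then create a new connection), here is a minimal Python sketch. It is not taken from the operator, which is written in Go. The names `connect_with_retry`, `resolve`, and `connect` are hypothetical, and backoff between attempts is left out for brevity.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def connect_with_retry(
    resolve: Callable[[], str],
    connect: Callable[[str], T],
    attempts: int = 3,
) -> T:
    """Try to connect, resolving the address again before every attempt.

    resolve() is a hypothetical callable returning the node's current
    address; connect(address) is a hypothetical callable that raises
    ConnectionError when the address cannot be reached.
    """
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        # Resolve on every attempt, so a stale address is not reused.
        address = resolve()
        try:
            return connect(address)
        except ConnectionError as err:
            last_error = err
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

The point of the sketch is only that the address is looked up again on each attempt instead of being cached from the first one.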

Comment 6 Mudit Agarwal 2022-07-05 14:24:42 UTC
Not a 4.11 blocker

Comment 7 Raghavendra Talur 2022-07-19 20:02:03 UTC
(In reply to Niels de Vos from comment #5)
> How often does this happen? Does it happen consistently, or was this a
> one-time occurrence?
> 
> I guess we could investigate marking a node unavailable if there is some
> error, and retry on certain errors (resolve the node, and create a new
> connection).

This was seen only once when this bug was filed, but in the past week, we seem to have hit this issue twice. Rakshith R had collected the required logs last time but I had not been able to reproduce it after that.


Attaching a must-gather for further debugging.

Comment 8 Annette Clewett 2022-07-19 20:06:34 UTC
rtalur attached must-gather.

Comment 9 Raghavendra Talur 2022-07-19 20:10:19 UTC
*** Bug 2106613 has been marked as a duplicate of this bug. ***

Comment 10 Annette Clewett 2022-07-19 20:12:43 UTC
rtalur ocs-must-gather link - https://drive.google.com/file/d/1dbKCFJdFctlJMFF0Hfnq4VXYfrD1sIXd/view?usp=sharing.

Comment 16 Niels de Vos 2022-07-27 11:27:39 UTC
A workaround has been posted for review in upstream at https://github.com/csi-addons/kubernetes-csi-addons/pull/186

We're still investigating a more appropriate solution.

Comment 17 Niels de Vos 2022-07-28 18:37:44 UTC
https://github.com/csi-addons/kubernetes-csi-addons/pull/190 is an alternative that does not delete and re-create the CSIAddonsNode CR.

I'd like to test the solution, but without steps to reproduce it is rather difficult :-/

Comment 26 Niels de Vos 2022-10-12 12:29:39 UTC
https://github.com/red-hat-storage/kubernetes-csi-addons/commit/a9febe2efde7a9426cb53f27a86efb3535913e34 is the backport that should prevent this issue from happening again. It was included in the release-4.12 branch with https://github.com/red-hat-storage/kubernetes-csi-addons/pull/54 .

Builds from the beginning of September have the fix already.
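
For readers unfamiliar with the fix, the idea from the Doc Text (store the Pod name and Namespace in the endpoint instead of an IP address, and resolve them when connecting) can be sketched in Python as below. This is an illustration only, not the controller's Go code. It assumes the endpoint looks like `pod://<pod-name>.<namespace>:<port>`, a format this bug does not spell out. The pod name, the port 9070, and the `lookup_pod_ip` callable (a stand-in for a Kubernetes API query) are hypothetical.

```python
from typing import Callable, NamedTuple, Optional
from urllib.parse import urlsplit


class Endpoint(NamedTuple):
    namespace: Optional[str]  # None for a legacy "<ip>:<port>" endpoint
    host: str                 # Pod name, or IP address for a legacy endpoint
    port: int


def parse_endpoint(raw: str) -> Endpoint:
    """Parse an endpoint in the assumed legacy or pod:// format."""
    if "://" not in raw:
        # Legacy form: the IP address may be stale after a Pod restart.
        host, sep, port = raw.rpartition(":")
        if not sep or not host:
            raise ValueError(f"expected <ip>:<port>, got {raw!r}")
        return Endpoint(None, host, int(port))

    url = urlsplit(raw)
    if url.scheme != "pod":
        raise ValueError(f"unsupported scheme in {raw!r}")
    parts = (url.hostname or "").split(".")
    if len(parts) != 2 or not all(parts) or url.port is None:
        raise ValueError(f"expected pod://<pod>.<namespace>:<port>, got {raw!r}")
    pod_name, namespace = parts
    return Endpoint(namespace, pod_name, url.port)


def resolve(ep: Endpoint, lookup_pod_ip: Callable[[str, str], str]) -> str:
    """Return "<ip>:<port>", asking lookup_pod_ip for the Pod's current IP."""
    if ep.namespace is None:
        return f"{ep.host}:{ep.port}"  # legacy endpoint: nothing to resolve
    return f"{lookup_pod_ip(ep.namespace, ep.host)}:{ep.port}"
```

Because the Pod name and Namespace stay the same across a restart, calling `resolve` again after the restart picks up the new IP address, whereas a stored `<ip>:<port>` string would keep pointing at the old one.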

