2100703 – [Metro-DR] NetworkFence CR is not reconciled by the operator

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	csi-addons
Sub Component:
Version:	4.11
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	ODF 4.12.0
Assignee:	Niels de Vos
QA Contact:	akarsha
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	2106613
Depends On:
Blocks:

Reported:	2022-06-24 03:12 UTC by Raghavendra Talur
Modified:	2023-08-09 16:37 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: When a Ceph-CSI Pod starts, it passes the IP address of the Pod to the CSI-Addons sidecar. When the Pod restarts, its IP address may change. Consequence: If the restart does not change the name of the Pod, the CSIAddonsNode CR can still contain the previous IP address. In that case, the CSI-Addons Controller cannot detect the new IP address and fails to connect to the sidecar. Fix: Use the name of the Ceph-CSI Pod and the Namespace where the Pod is running, instead of the IP address. Result: The CSI-Addons Controller can look up the CSIAddonsNode CR, get the endpoint attribute, and resolve the name of the Pod in the Namespace to an IP address.
Clone Of:
Environment:
Last Closed:	2023-02-08 14:06:28 UTC
Embargoed:

Attachments

Links
System	ID	Priority	Status	Summary	Last Updated
Github	csi-addons kubernetes-csi-addons pull 186	None	open	sidecar: delete pre-existing csiaddonsnode object & recreate	2022-07-27 11:27:38 UTC
Github	csi-addons kubernetes-csi-addons pull 190	None	Merged	Use `pod://` URL formatting for CSIAddonsNode endpoints	2022-10-12 12:29:32 UTC
Github	red-hat-storage kubernetes-csi-addons pull 54	None	Merged	[release-4.12] sync downstream with upstream	2022-10-12 12:29:39 UTC

Description Raghavendra Talur 2022-06-24 03:12:52 UTC

Description of problem (please be as detailed as possible and provide log
snippets):
When a NetworkFence CR is created, the operator doesn't reconcile it. It looks like the csi-addons pod is not able to connect to another CSI pod. More details are in the attached log.

Version of all relevant components (if applicable):
4.11

Comment 3 Raghavendra Talur 2022-06-24 03:15:39 UTC

The workaround is to:

1. stop the csi-rbdplugin-provisioner pod and the csi-addons-controller-manager pod
2. delete the CSIAddonsNode object
3. start the csi-rbdplugin-provisioner pod and the csi-addons-controller-manager pod

Comment 5 Niels de Vos 2022-06-24 11:22:57 UTC

How often does this happen? Does it happen consistently, or was this a one-time occurrence?

I guess we could investigate marking a node unavailable if there is some error, and retry on certain errors (resolve the node, and create a new connection).

Comment 6 Mudit Agarwal 2022-07-05 14:24:42 UTC

Not a 4.11 blocker

Comment 7 Raghavendra Talur 2022-07-19 20:02:03 UTC

(In reply to Niels de Vos from comment #5)
> How often does this happen? Does it happen consistently, or was this a
> one-time occurrence?
> 
> I guess we could investigate marking a node unavailable if there is some
> error, and retry on certain errors (resolve the node, and create a new
> connection).

This was seen only once when this bug was filed, but in the past week we seem to have hit this issue twice. Rakshith R collected the required logs last time, but I have not been able to reproduce it since then.


Attaching a must-gather for further debugging.

Comment 8 Annette Clewett 2022-07-19 20:06:34 UTC

rtalur attached must-gather.

Comment 9 Raghavendra Talur 2022-07-19 20:10:19 UTC

*** Bug 2106613 has been marked as a duplicate of this bug. ***

Comment 10 Annette Clewett 2022-07-19 20:12:43 UTC

Link to rtalur's ocs-must-gather: https://drive.google.com/file/d/1dbKCFJdFctlJMFF0Hfnq4VXYfrD1sIXd/view?usp=sharing

Comment 16 Niels de Vos 2022-07-27 11:27:39 UTC

A workaround has been posted for review in upstream at https://github.com/csi-addons/kubernetes-csi-addons/pull/186

We're still investigating a more appropriate solution.

Comment 17 Niels de Vos 2022-07-28 18:37:44 UTC

https://github.com/csi-addons/kubernetes-csi-addons/pull/190 is an alternative that does not delete and re-create the CSIAddonsNode CR.

I'd like to test the solution, but without steps to reproduce it is rather difficult :-/

Comment 26 Niels de Vos 2022-10-12 12:29:39 UTC

https://github.com/red-hat-storage/kubernetes-csi-addons/commit/a9febe2efde7a9426cb53f27a86efb3535913e34 is the backport that should prevent this issue from happening again. It was included in the release-4.12 branch with https://github.com/red-hat-storage/kubernetes-csi-addons/pull/54 .

Builds from the beginning of September have the fix already.
