Description of problem (please be as detailed as possible and provide log snippets):
When a NetworkFence CR is created, the operator does not reconcile it. It looks like the csiaddons pod is not able to connect to another CSI pod. More details are in the attached log.
Version of all relevant components (if applicable): 4.11
The workaround is to:
1. Stop the csi-rbdplugin-provisioner pod and the csi-addons-controller-manager pod.
2. Delete the CSIAddonsNode object.
3. Start the csi-rbdplugin-provisioner pod and the csi-addons-controller-manager pod.
How often does this happen? Does it happen consistently, or was this a one-time occurrence? I guess we could investigate marking a node unavailable if there is some error, and retry on certain errors (resolve the node, and create a new connection).
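A minimal Go sketch of the retry idea above, for discussion only. It is not the kubernetes-csi-addons implementation: the names `node`, `conn`, `dialer`, `errConnection` and `fenceWithRetry` are all hypothetical, and a real controller would classify errors by gRPC status rather than by a sentinel error. On a connection error the node is marked unavailable, resolved again, and the call is retried on a new connection; any other error is returned immediately.

```go
package main

import (
	"errors"
	"fmt"
)

// errConnection is a hypothetical sentinel for transport-level failures,
// for example a cached connection that points at a pod that no longer exists.
var errConnection = errors.New("connection to sidecar failed")

// conn stands in for a client connection to a CSI-Addons sidecar.
type conn interface {
	Fence() error
}

// node is a hypothetical cache entry: a name, a cached connection,
// and an availability flag.
type node struct {
	name      string
	c         conn
	available bool
}

// dialer resolves the node's current endpoint and opens a new connection.
type dialer func(nodeName string) (conn, error)

// fenceWithRetry calls Fence on the cached connection. On a connection
// error it marks the node unavailable, resolves the node again and retries
// on the new connection. It makes at least one and at most maxAttempts
// calls. Errors that are not connection errors are returned at once.
func fenceWithRetry(n *node, dial dialer, maxAttempts int) error {
	for attempt := 1; ; attempt++ {
		err := n.c.Fence()
		if err == nil {
			n.available = true
			return nil
		}
		if !errors.Is(err, errConnection) {
			return err
		}
		n.available = false
		if attempt >= maxAttempts {
			return fmt.Errorf("node %q: giving up after %d attempts: %w", n.name, attempt, err)
		}
		c, dialErr := dial(n.name)
		if dialErr != nil {
			return fmt.Errorf("node %q: reconnect failed: %w", n.name, dialErr)
		}
		n.c = c
	}
}

// staleConn always fails with a connection error; liveConn always succeeds.
type staleConn struct{}

func (staleConn) Fence() error { return errConnection }

type liveConn struct{}

func (liveConn) Fence() error { return nil }

func main() {
	// "worker-0" is an illustrative node name.
	n := &node{name: "worker-0", c: staleConn{}, available: true}
	dial := func(string) (conn, error) { return liveConn{}, nil }
	err := fenceWithRetry(n, dial, 3)
	fmt.Println(err, n.available) // prints "<nil> true"
}
```

The sketch keeps the node entry and only swaps the connection, so nothing has to be deleted and re-created to recover from a stale connection.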
Not a 4.11 blocker
(In reply to Niels de Vos from comment #5)
> How often does this happen? Does it happen consistently, or was this a
> one-time occurrence?
>
> I guess we could investigate marking a node unavailable if there is some
> error, and retry on certain errors (resolve the node, and create a new
> connection).

This was seen only once when this bug was filed, but in the past week, we seem to have hit this issue twice. Rakshith R had collected the required logs last time but I had not been able to reproduce it after that. Attaching a must-gather for further debugging.
rtalur attached must-gather.
*** Bug 2106613 has been marked as a duplicate of this bug. ***
rtalur ocs-must-gather link - https://drive.google.com/file/d/1dbKCFJdFctlJMFF0Hfnq4VXYfrD1sIXd/view?usp=sharing
A workaround has been posted for review upstream at https://github.com/csi-addons/kubernetes-csi-addons/pull/186 . We're still investigating a more appropriate solution.
https://github.com/csi-addons/kubernetes-csi-addons/pull/190 is an alternative that does not delete and re-create the CSIAddonsNode CR. I'd like to test the solution, but without steps to reproduce, that is rather difficult :-/
https://github.com/red-hat-storage/kubernetes-csi-addons/commit/a9febe2efde7a9426cb53f27a86efb3535913e34 is the backport that should prevent this issue from happening again. It was included in the release-4.12 branch with https://github.com/red-hat-storage/kubernetes-csi-addons/pull/54 . Builds from the beginning of September have the fix already.