This bug was initially created as a copy of Bug #2257259. I am copying this bug because:

Description of problem (please be as detailed as possible and provide log snippets):

The csi-addons-controller-manager pod is restarted after running the must-gather command. The csi-addons controller exits with the following messages:

```
2024-01-08T12:50:48.279Z INFO Stopping and waiting for non leader election runnables
2024-01-08T12:50:48.279Z INFO Stopping and waiting for leader election runnables
2024-01-08T12:50:48.279Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim"}
2024-01-08T12:50:48.279Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "volumereplication", "controllerGroup": "replication.storage.openshift.io", "controllerKind": "VolumeReplication"}
2024-01-08T12:50:48.279Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "reclaimspacecronjob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceCronJob"}
2024-01-08T12:50:48.279Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "reclaimspacejob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceJob"}
2024-01-08T12:50:48.279Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode"}
2024-01-08T12:50:48.279Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "networkfence", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "NetworkFence"}
2024-01-08T12:50:48.279Z INFO All workers finished {"controller": "volumereplication", "controllerGroup": "replication.storage.openshift.io", "controllerKind": "VolumeReplication"}
2024-01-08T12:50:48.279Z INFO All workers finished {"controller": "reclaimspacejob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceJob"}
2024-01-08T12:50:48.279Z INFO All workers finished {"controller": "reclaimspacecronjob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceCronJob"}
2024-01-08T12:50:48.279Z INFO All workers finished {"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim"}
2024-01-08T12:50:48.279Z INFO All workers finished {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode"}
2024-01-08T12:50:48.279Z INFO All workers finished {"controller": "networkfence", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "NetworkFence"}
2024-01-08T12:50:48.279Z INFO Stopping and waiting for caches
2024-01-08T12:50:48.279Z INFO Stopping and waiting for webhooks
2024-01-08T12:50:48.279Z INFO Stopping and waiting for HTTP servers
2024-01-08T12:50:48.279Z INFO controller-runtime.metrics Shutting down metrics server with timeout of 1 minute
2024-01-08T12:50:48.279Z INFO shutting down server {"kind": "health probe", "addr": "[::]:8081"}
2024-01-08T12:50:48.279Z INFO Wait completed, proceeding to shutdown the manager
```

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
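The log above is controller-runtime's normal graceful-shutdown sequence: when the manager's context is cancelled (typically by SIGTERM when the kubelet terminates the pod), every controller worker is drained and the "Stopping and waiting for ..." / "All workers finished" lines are emitted. As a minimal illustrative sketch (not the actual csi-addons main.go), a controller-runtime-based operator usually wires this up like so:

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Create the manager; real options (scheme, metrics, leader election) trimmed for brevity.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		ctrl.Log.Error(err, "unable to create manager")
		os.Exit(1)
	}

	// SetupSignalHandler returns a context that is cancelled on SIGTERM/SIGINT.
	// When the pod is deleted, the SIGTERM cancels this context and mgr.Start
	// drains all controllers, producing the shutdown messages shown in the log.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		ctrl.Log.Error(err, "problem running manager")
		os.Exit(1)
	}
}
```

So the log itself does not indicate a crash; the point of this bug is that the pod termination is triggered as a side effect of running must-gather, which should not happen.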
Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible? Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Check oc get pods -n openshift-storage | grep csi-addons
2. Run must-gather
3. Look for the csi-addons pod again (see the verification sketch below)

Actual results:
must-gather gets collected, but it restarts the csi-addons controller

Expected results:
must-gather gets collected without restarting the csi-addons controller

Additional info:
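For step 3, a hypothetical verification helper using client-go could look like the sketch below. The kubeconfig path and the label selector are assumptions (compare with `oc get pods -n openshift-storage --show-labels`); a changed pod name or creation timestamp after must-gather means the pod was recreated, while a non-zero restart count means a container restarted in place.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (default ~/.kube/config; adjust as needed).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// List csi-addons controller pods; the label selector is an assumption
	// and may differ in your cluster.
	pods, err := client.CoreV1().Pods("openshift-storage").List(context.TODO(), metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/name=csi-addons",
	})
	if err != nil {
		panic(err)
	}

	// Print identity and restart information to compare before/after must-gather.
	for _, p := range pods.Items {
		fmt.Printf("%s created=%s\n", p.Name, p.CreationTimestamp)
		for _, cs := range p.Status.ContainerStatuses {
			fmt.Printf("  container=%s restarts=%d\n", cs.Name, cs.RestartCount)
		}
	}
}
```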
Bug in NEW/ASSIGNED state. Moving the bug to 4.15.3 for a decision on RCA/FIX.
(In reply to krishnaram Karthick from comment #2)
> Bug in NEW/ASSIGNED state.
> Moving the bug to 4.15.3 for a decision on RCA/FIX.

This bug is a clone for 4.15. The fix is already merged for the original BZ; once this bug is approved, I will create a backport PR. So I think we should be good to take it in 4.15.2.
With OCP 4.15 and ODF 4.15.2-1, upon running the must-gather command "oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.15", the csi-addons-controller-manager pod goes for a restart.

(venv) [jopinto@jopinto brown416]$ date
Mon Apr 22 21:05:06 IST 2024

(venv) [jopinto@jopinto brown416]$ oc get pods -n openshift-storage -o wide | grep csi-addons
csi-addons-controller-manager-c889b47c9-rk5nf   2/2   Running   0   46m   10.129.2.44   jopinto-c13416-696z2-worker-3-cldl4   <none>   <none>

(venv) [jopinto@jopinto brown416]$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.15 >must.txt
W0422 21:05:35.331132  426437 warnings.go:70] would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (containers "gather", "copy" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers "gather", "copy" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or containers "gather", "copy" must set securityContext.runAsNonRoot=true), seccompProfile (pod or containers "gather", "copy" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

[jopinto@jopinto brown416]$ oc describe pod csi-addons-controller-manager-c889b47c9-rf4b8 -n openshift-storage
Name:         csi-addons-controller-manager-c889b47c9-rf4b8
Namespace:    openshift-storage
Priority:     0
Node:         jopinto-c13416-696z2-worker-3-cldl4/10.241.128.4
Start Time:   Mon, 22 Apr 2024 21:05:34 +0530
...
  ConfigMapName:     openshift-service-ca.crt
  ConfigMapOptional: <nil>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                 node.ocs.openshift.io/storage=true:NoSchedule
Events:
  Type    Reason          Age   From               Message
  ----    ------          ----  ----               -------
  Normal  Scheduled       115s  default-scheduler  Successfully assigned openshift-storage/csi-addons-controller-manager-c889b47c9-rf4b8 to jopinto-c13416-696z2-worker-3-cldl4
  Normal  AddedInterface  114s  multus             Add eth0 [10.129.2.46/23] from ovn-kubernetes
  Normal  Pulled          114s  kubelet            Container image "registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:063f7f9ee19b67e2184b2b461c86e15608951864bb84c7398b2e441f9ec6164f" already present on machine
  Normal  Created         114s  kubelet            Created container kube-rbac-proxy
  Normal  Started         114s  kubelet            Started container kube-rbac-proxy
  Normal  Pulled          114s  kubelet            Container image "registry.redhat.io/odf4/odf-csi-addons-rhel9-operator@sha256:7146ad801388ffd84d71fe1640f77cc9710883f4f9b22e359df75c1f424d99ce" already present on machine
  Normal  Created         114s  kubelet            Created container manager
  Normal  Started         114s  kubelet            Started container manager
[jopinto@jopinto brown416]$

Note that the pod name changed from csi-addons-controller-manager-c889b47c9-rk5nf (46m old before must-gather) to csi-addons-controller-manager-c889b47c9-rf4b8, and the events show a freshly scheduled pod (~115s old), confirming the controller pod was recreated while must-gather ran.
Moving the bug back to ASSIGNED state based on https://bugzilla.redhat.com/show_bug.cgi?id=2267607#c7
The fix should work for this BZ, but for some reason it appears to have missed the latest 4.15 build, and due to release timelines we are not able to produce a new build for 4.15. Hence, moving the bug back to POST state and proposing it for 4.15.3.