Description of problem (please be detailed as possible and provide log snippets):

The csi-addons-controller-manager pod is restarted after running the must-gather command. The csi-addons controller exits with the following messages:

```
2024-01-08T12:50:48.279Z INFO Stopping and waiting for non leader election runnables
2024-01-08T12:50:48.279Z INFO Stopping and waiting for leader election runnables
2024-01-08T12:50:48.279Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim"}
2024-01-08T12:50:48.279Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "volumereplication", "controllerGroup": "replication.storage.openshift.io", "controllerKind": "VolumeReplication"}
2024-01-08T12:50:48.279Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "reclaimspacecronjob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceCronJob"}
2024-01-08T12:50:48.279Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "reclaimspacejob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceJob"}
2024-01-08T12:50:48.279Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode"}
2024-01-08T12:50:48.279Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "networkfence", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "NetworkFence"}
2024-01-08T12:50:48.279Z INFO All workers finished {"controller": "volumereplication", "controllerGroup": "replication.storage.openshift.io", "controllerKind": "VolumeReplication"}
2024-01-08T12:50:48.279Z INFO All workers finished {"controller": "reclaimspacejob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceJob"}
2024-01-08T12:50:48.279Z INFO All workers finished {"controller": "reclaimspacecronjob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceCronJob"}
2024-01-08T12:50:48.279Z INFO All workers finished {"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim"}
2024-01-08T12:50:48.279Z INFO All workers finished {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode"}
2024-01-08T12:50:48.279Z INFO All workers finished {"controller": "networkfence", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "NetworkFence"}
2024-01-08T12:50:48.279Z INFO Stopping and waiting for caches
2024-01-08T12:50:48.279Z INFO Stopping and waiting for webhooks
2024-01-08T12:50:48.279Z INFO Stopping and waiting for HTTP servers
2024-01-08T12:50:48.279Z INFO controller-runtime.metrics Shutting down metrics server with timeout of 1 minute
2024-01-08T12:50:48.279Z INFO shutting down server {"kind": "health probe", "addr": "[::]:8081"}
2024-01-08T12:50:48.279Z INFO Wait completed, proceeding to shutdown the manager
```

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced? yes

Can this issue reproduce from the UI?
If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Check oc get pods -n openshift-storage | grep csi-addons
2. Run must-gather
3. Look for the csi-addons pod again

Actual results:
must-gather gets collected, but it restarts the csi-addons controller

Expected results:
must-gather should get collected without restarting the csi-addons controller

Additional info:
It seems the operator loses its leader election and exits.

logs:
```
2024-01-08T14:11:53.734Z INFO Stopping and waiting for non leader election runnables
2024-01-08T14:11:53.734Z INFO Stopping and waiting for leader election runnables
2024-01-08T14:11:53.734Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim"}
2024-01-08T14:11:53.734Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "networkfence", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "NetworkFence"}
2024-01-08T14:11:53.734Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "volumereplication", "controllerGroup": "replication.storage.openshift.io", "controllerKind": "VolumeReplication"}
2024-01-08T14:11:53.734Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "reclaimspacecronjob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceCronJob"}
2024-01-08T14:11:53.734Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode"}
2024-01-08T14:11:53.734Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "reclaimspacejob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceJob"}
2024-01-08T14:11:53.734Z INFO All workers finished {"controller": "reclaimspacejob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceJob"}
2024-01-08T14:11:53.734Z INFO All workers finished {"controller": "reclaimspacecronjob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceCronJob"}
2024-01-08T14:11:53.734Z INFO All workers finished {"controller": "networkfence", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "NetworkFence"}
2024-01-08T14:11:53.734Z INFO All workers finished {"controller": "volumereplication", "controllerGroup": "replication.storage.openshift.io", "controllerKind": "VolumeReplication"}
2024-01-08T14:11:53.734Z INFO All workers finished {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode"}
2024-01-08T14:11:53.734Z INFO All workers finished {"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim"}
2024-01-08T14:11:53.734Z INFO Stopping and waiting for caches
2024-01-08T14:11:53.735Z INFO Stopping and waiting for webhooks
2024-01-08T14:11:53.735Z INFO Stopping and waiting for HTTP servers
2024-01-08T14:11:53.735Z INFO shutting down server {"kind": "health probe", "addr": "[::]:8081"}
2024-01-08T14:11:53.735Z INFO controller-runtime.metrics Shutting down metrics server with timeout of 1 minute
2024-01-08T14:11:53.735Z INFO Wait completed, proceeding to shutdown the manager
2024-01-08T14:11:53.735Z ERROR error received after stop sequence was engaged {"error": "leader election lost"}
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1
	/workspace/go/src/github.com/csi-addons/kubernetes-csi-addons/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:490
```

events:
```
0s  Normal   Killing                 pod/csi-addons-controller-manager-569cc9df96-b84mn        Stopping container manager
0s  Normal   Killing                 pod/csi-addons-controller-manager-569cc9df96-b84mn        Stopping container kube-rbac-proxy
0s  Warning  FailedToUpdateEndpoint  endpoints/csi-addons-controller-manager-metrics-service   Failed to update endpoint openshift-storage/csi-addons-controller-manager-metrics-service: Operation cannot be fulfilled on endpoints "csi-addons-controller-manager-metrics-service": the object has been modified; please apply your changes to the latest version and try again
```
The problem is not (only?) related to leader election. Even with leader election disabled (--leader-elect=false), running "oc adm must-gather" still restarts the pod.
Going through the logs on a node where the operator is stopped:

```
Jan 09 12:51:12 ip-10-0-12-180 kubenswrapper[2327]: I0109 12:51:12.348925    2327 kubelet.go:2441] "SyncLoop DELETE" source="api" pods=[openshift-storage/csi-addons-controller-manager-9799ddf95-bcgfr]
Jan 09 12:51:12 ip-10-0-12-180 kubenswrapper[2327]: I0109 12:51:12.349177    2327 kuberuntime_container.go:742] "Killing container with a grace period" pod="openshift-storage/csi-addons-controller-manager-9799ddf95-bcgfr" podUID=2edcc1ff-86d6-46d9-9abf-685b1dffd4df containerName="manager" containerID="cri-o://7d0790805ad7e0a4e8e723defd6160179591449c7e861b0c4fc600fd433726c6" gracePeriod=10
Jan 09 12:51:12 ip-10-0-12-180 kubenswrapper[2327]: I0109 12:51:12.349238    2327 kuberuntime_container.go:742] "Killing container with a grace period" pod="openshift-storage/csi-addons-controller-manager-9799ddf95-bcgfr" podUID=2edcc1ff-86d6-46d9-9abf-685b1dffd4df containerName="kube-rbac-proxy" containerID="cri-o://bdcf5de6445d49c88cd6d9b9d7fd93075a94a5f4a1e4bba2bf5326370ada1580" gracePeriod=10
Jan 09 12:51:12 ip-10-0-12-180 crio[2242]: time="2024-01-09 12:51:12.349592526Z" level=info msg="Stopping container: 7d0790805ad7e0a4e8e723defd6160179591449c7e861b0c4fc600fd433726c6 (timeout: 10s)" id=a40d2914-9291-41ee-950c-8d1fd386513d name=/runtime.v1.RuntimeService/StopContainer
Jan 09 12:51:12 ip-10-0-12-180 crio[2242]: time="2024-01-09 12:51:12.349592526Z" level=info msg="Stopping container: bdcf5de6445d49c88cd6d9b9d7fd93075a94a5f4a1e4bba2bf5326370ada1580 (timeout: 10s)" id=a8f3fedd-8b8e-4837-aab0-ed8a74966809 name=/runtime.v1.RuntimeService/StopContainer
Jan 09 12:51:12 ip-10-0-12-180 kubenswrapper[2327]: I0109 12:51:12.356742    2327 kubelet.go:2435] "SyncLoop REMOVE" source="api" pods=[openshift-storage/csi-addons-controller-manager-9799ddf95-bcgfr]
Jan 09 12:51:12 ip-10-0-12-180 systemd[1]: crio-bdcf5de6445d49c88cd6d9b9d7fd93075a94a5f4a1e4bba2bf5326370ada1580.scope: Deactivated successfully.
Jan 09 12:51:12 ip-10-0-12-180 systemd[1]: crio-bdcf5de6445d49c88cd6d9b9d7fd93075a94a5f4a1e4bba2bf5326370ada1580.scope: Consumed 1.143s CPU time.
Jan 09 12:51:12 ip-10-0-12-180 systemd[1]: crio-7d0790805ad7e0a4e8e723defd6160179591449c7e861b0c4fc600fd433726c6.scope: Deactivated successfully.
Jan 09 12:51:12 ip-10-0-12-180 systemd[1]: crio-conmon-bdcf5de6445d49c88cd6d9b9d7fd93075a94a5f4a1e4bba2bf5326370ada1580.scope: Deactivated successfully.
Jan 09 12:51:12 ip-10-0-12-180 systemd[1]: crio-conmon-7d0790805ad7e0a4e8e723defd6160179591449c7e861b0c4fc600fd433726c6.scope: Deactivated successfully.
```

The messages "SyncLoop DELETE" source="api" and "SyncLoop REMOVE" source="api" suggest that the api-server instructs kubelet to stop the pod (with 2 containers). Unfortunately the reason is not mentioned in this log.
I think the problem is with the 'oc adm must-gather' command itself, and not with the collecting of the data. The following also causes a restart of the csi-addons operator, even though it does not collect anything:

```
oc adm must-gather --image=quay.io/centos/centos:stream9-minimal -- /usr/bin/date
```

One of the first actions that the 'oc adm must-gather' command takes is creating a new namespace. This namespace is by default prefixed with "openshift-must-gather-", which starts with "openshift-", and that is a special case for ODF. The following command causes a restart of the operator as well:

```
$ oc create ns openshift-rhbz2257259
namespace/openshift-rhbz2257259 created
```

The new namespace contains a csiaddons annotation:

```
$ oc describe ns/openshift-rhbz2257259
Name:         openshift-rhbz2257259
Labels:       kubernetes.io/metadata.name=openshift-rhbz2257259
Annotations:  openshift.io/sa.scc.mcs: s0:c28,c27
              openshift.io/sa.scc.supplemental-groups: 1000810000/10000
              openshift.io/sa.scc.uid-range: 1000810000/10000
              reclaimspace.csiaddons.openshift.io/schedule: @weekly
Status:       Active

No resource quota.

No LimitRange resource.
```

It seems that ocs-operator adds the annotation, and then explicitly restarts the csi-addons operator!
https://github.com/red-hat-storage/ocs-operator/blob/68b42d973015a38c926a7e165b2f749d8d7091a8/controllers/namespace/namespace_controller.go#L80
introduced in https://github.com/red-hat-storage/ocs-operator/pull/2074

Unfortunately I cannot find why the restart is required. Nikhil, the PR comes from you, do you remember details about this requirement?
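For readers not familiar with the linked code, here is a rough Go sketch of the behaviour described above. This is my own paraphrase under assumptions, not the actual ocs-operator implementation; the helper name and the pod label selector are made up for illustration.

```go
package namespace

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Annotation key taken from the "oc describe ns" output above.
const rsCronJobScheduleAnnotation = "reclaimspace.csiaddons.openshift.io/schedule"

// annotateAndRestartCSIAddons is a hypothetical helper: it sets the
// reclaim-space schedule annotation on the Namespace and then deletes the
// csi-addons controller pods so their Deployment recreates them with the
// annotation taken into account.
func annotateAndRestartCSIAddons(ctx context.Context, c client.Client, ns *corev1.Namespace) error {
	if ns.Annotations == nil {
		ns.Annotations = map[string]string{}
	}
	if _, ok := ns.Annotations[rsCronJobScheduleAnnotation]; !ok {
		ns.Annotations[rsCronJobScheduleAnnotation] = "@weekly"
		if err := c.Update(ctx, ns); err != nil {
			return err
		}
	}
	// "Restart" by deleting the controller pods; the label selector below is
	// an assumption for illustration, not the real one.
	return c.DeleteAllOf(ctx, &corev1.Pod{},
		client.InNamespace("openshift-storage"),
		client.MatchingLabels{"app": "csi-addons"},
	)
}
```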
Hi,

Yes, the restart is intentional. It was added so that existing RBD PVCs (i.e., on upgraded clusters) can make use of the annotation to reclaim space. The must-gather tool and automation need to be updated to take this situation into account.

Ref EPIC: https://issues.redhat.com/browse/RHSTOR-4468
Thanks Nikhil! Is it intentional that the restart is triggered when a new openshift-* Namespace is created? The new Namespace would be empty, so there is no need to take existing PVCs into account. Would that be a bug to address in ocs-operator? I.e. "do not restart the kubernetes-csi-addons operator when a Namespace without PVCs was annotated".

Separate from that, I now think restarting is the wrong approach in general. QE reported this BZ, but customers will notice the unexpected restart of the operator too. We should improve the kubernetes-csi-addons operator in such a way that restarting is not required.
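To make the "no PVCs, no restart" idea above concrete, a hypothetical helper (not existing code) that the namespace controller could consult before triggering a restart:

```go
package namespace

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// namespaceHasPVCs reports whether the given namespace contains at least one
// PersistentVolumeClaim; the csi-addons restart could be skipped when it
// returns false.
func namespaceHasPVCs(ctx context.Context, c client.Client, nsName string) (bool, error) {
	pvcs := &corev1.PersistentVolumeClaimList{}
	if err := c.List(ctx, pvcs, client.InNamespace(nsName), client.Limit(1)); err != nil {
		return false, err
	}
	return len(pvcs.Items) > 0, nil
}
```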
I think the restart is needed for new namespaces as well, because without a restart csi-addons won't consider that the namespace allows use of the space reclaim feature. There might not be any PVCs now, but if new PVCs are created later, the csi-addons pod still has to be restarted somehow for the namespace annotation to be considered for space reclaim. There are some limitations as to why a restart is the only feasible option here for the csi-addons pod; I will let Madhu explain them better here.
OK, in that case should we just skip reconcile for a create event of a namespace? Similar to how we are doing it for delete events, https://github.com/red-hat-storage/ocs-operator/blob/main/controllers/namespace/namespace_controller.go#L193-L204 ?

So, if a PVC is to be created in the new namespace, the user will have to manually add the annotation to the namespace, right?
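For illustration, a sketch of what such a predicate could look like with controller-runtime (assumed wiring, not the actual PR; the reconciler type is a stand-in):

```go
package namespace

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// NamespaceReconciler is a stand-in for the real reconciler.
type NamespaceReconciler struct {
	client.Client
}

func (r *NamespaceReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	// ... annotate the namespace and restart csi-addons as today ...
	return reconcile.Result{}, nil
}

func (r *NamespaceReconciler) SetupWithManager(mgr manager.Manager) error {
	skipCreateAndDelete := predicate.Funcs{
		// A brand-new namespace has no PVCs, so no csi-addons restart is needed.
		CreateFunc: func(e event.CreateEvent) bool { return false },
		// Delete events are already skipped in the linked code.
		DeleteFunc: func(e event.DeleteEvent) bool { return false },
	}
	return builder.ControllerManagedBy(mgr).
		For(&corev1.Namespace{}, builder.WithPredicates(skipCreateAndDelete)).
		Complete(r)
}
```

With only CreateFunc and DeleteFunc set to false, update events (e.g. a user manually adding the annotation later) would still be reconciled.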
Posted a PR for the same; this should take care of it: https://github.com/red-hat-storage/ocs-operator/pull/2479
Changing the bug component to ocs-operator, and updating the assignee to me. If an update to csi-addons is needed, I think we can create a separate bug to track that.
Since the fix PR is merged and the bz is approved for 4.16, moving the bug to ON_QA.
*** Bug 2245246 has been marked as a duplicate of this bug. ***
Bug reproduced on odf-operator.v4.16.0-89.stable [OCP version: 4.16.0-0.nightly-2024-04-26-145258]

Running mg command:
```
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.16
...
```

Checking csi-addons-controller-manager pod in parallel:
```
$ oc get pods -n openshift-storage csi-addons-controller-manager-544669f475-lj8np -w
NAME                                             READY   STATUS        RESTARTS   AGE
csi-addons-controller-manager-544669f475-lj8np   2/2     Running       0          15m
csi-addons-controller-manager-544669f475-lj8np   2/2     Terminating   0          15m
csi-addons-controller-manager-544669f475-lj8np   2/2     Terminating   0          15m

$ oc get pods -n openshift-storage csi-addons-controller-manager-544669f475-xjx62
NAME                                             READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-544669f475-xjx62   2/2     Running   0          94s

$ oc get ns | grep must
openshift-must-gather-42778   Active   22s
```

In addition, the user cannot create a new project with the openshift- prefix:
```
(venv) oviner~/test$ oc new-project openshift-te
Error from server (Forbidden): project.project.openshift.io "openshift-te" is forbidden: cannot request a project starting with "openshift-"
```
Bug fixed.

Test procedure:
1. Deploy OCP 4.16 cluster 4.16.0-0.nightly-2024-05-04-214435 on the vSphere platform
2. Install ODF 4.16 odf-operator.v4.16.0-92.stable
3. Run must-gather 5 times:
```
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.16
```
4. Check the csi-addons-controller-manager pod [not restarted]:
```
csi-addons-controller-manager-564b8b74d7-njh2t   2/2   Running   0   24h
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591