Bug 2257259 - csi-addons-controller-manager pod is reset after running the must-gather command
Summary: csi-addons-controller-manager pod is reset after running the must-gather command
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: Nikhil Ladha
QA Contact: Oded
URL:
Whiteboard:
Duplicates: 2245246 (view as bug list)
Depends On:
Blocks:
 
Reported: 2024-01-08 13:08 UTC by yati padia
Modified: 2024-07-17 13:12 UTC
CC: 6 users

Fixed In Version: 4.16.0-92
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-07-17 13:11:59 UTC
Embargoed:




Links
- Github: red-hat-storage/ocs-operator pull 2479 (Merged) - "skip csi-addons pod restart for new namespaces" (last updated 2024-04-16 06:27:39 UTC)
- Red Hat Bugzilla: bug 2245246 (CLOSED) - "csi-addons-controller-manager pod is reset after running the must-gather command" (last updated 2024-04-29 06:21:00 UTC)
- Red Hat Product Errata: RHSA-2024:4591 (last updated 2024-07-17 13:12:03 UTC)

Description yati padia 2024-01-08 13:08:05 UTC
Description of problem (please be as detailed as possible and provide log snippets):

The csi-addons-controller-manager pod is reset after running the must-gather command.
The csi-addons controller exits with the following message:
```
2024-01-08T12:50:48.279Z	INFO	Stopping and waiting for non leader election runnables
2024-01-08T12:50:48.279Z	INFO	Stopping and waiting for leader election runnables
2024-01-08T12:50:48.279Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim"}
2024-01-08T12:50:48.279Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "volumereplication", "controllerGroup": "replication.storage.openshift.io", "controllerKind": "VolumeReplication"}
2024-01-08T12:50:48.279Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "reclaimspacecronjob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceCronJob"}
2024-01-08T12:50:48.279Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "reclaimspacejob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceJob"}
2024-01-08T12:50:48.279Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode"}
2024-01-08T12:50:48.279Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "networkfence", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "NetworkFence"}
2024-01-08T12:50:48.279Z	INFO	All workers finished	{"controller": "volumereplication", "controllerGroup": "replication.storage.openshift.io", "controllerKind": "VolumeReplication"}
2024-01-08T12:50:48.279Z	INFO	All workers finished	{"controller": "reclaimspacejob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceJob"}
2024-01-08T12:50:48.279Z	INFO	All workers finished	{"controller": "reclaimspacecronjob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceCronJob"}
2024-01-08T12:50:48.279Z	INFO	All workers finished	{"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim"}
2024-01-08T12:50:48.279Z	INFO	All workers finished	{"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode"}
2024-01-08T12:50:48.279Z	INFO	All workers finished	{"controller": "networkfence", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "NetworkFence"}
2024-01-08T12:50:48.279Z	INFO	Stopping and waiting for caches
2024-01-08T12:50:48.279Z	INFO	Stopping and waiting for webhooks
2024-01-08T12:50:48.279Z	INFO	Stopping and waiting for HTTP servers
2024-01-08T12:50:48.279Z	INFO	controller-runtime.metrics	Shutting down metrics server with timeout of 1 minute
2024-01-08T12:50:48.279Z	INFO	shutting down server	{"kind": "health probe", "addr": "[::]:8081"}
2024-01-08T12:50:48.279Z	INFO	Wait completed, proceeding to shutdown the manager
```

Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?



Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
yes


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:

1. Check oc get pods -n openshift-storage | grep csi-addons
2. Run must-gather
3. Look for the csi-addons pod again


Actual results:
must-gather gets collected but it restarts the csi-addons controller

Expected results:
must-gather should get collected without restarting the csi-addons controller


Additional info:

Comment 2 Niels de Vos 2024-01-08 14:21:38 UTC
It seems the operator loses its leader election and exits.

logs:

2024-01-08T14:11:53.734Z	INFO	Stopping and waiting for non leader election runnables
2024-01-08T14:11:53.734Z	INFO	Stopping and waiting for leader election runnables
2024-01-08T14:11:53.734Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim"}
2024-01-08T14:11:53.734Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "networkfence", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "NetworkFence"}
2024-01-08T14:11:53.734Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "volumereplication", "controllerGroup": "replication.storage.openshift.io", "controllerKind": "VolumeReplication"}
2024-01-08T14:11:53.734Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "reclaimspacecronjob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceCronJob"}
2024-01-08T14:11:53.734Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode"}
2024-01-08T14:11:53.734Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "reclaimspacejob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceJob"}
2024-01-08T14:11:53.734Z	INFO	All workers finished	{"controller": "reclaimspacejob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceJob"}
2024-01-08T14:11:53.734Z	INFO	All workers finished	{"controller": "reclaimspacecronjob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceCronJob"}
2024-01-08T14:11:53.734Z	INFO	All workers finished	{"controller": "networkfence", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "NetworkFence"}
2024-01-08T14:11:53.734Z	INFO	All workers finished	{"controller": "volumereplication", "controllerGroup": "replication.storage.openshift.io", "controllerKind": "VolumeReplication"}
2024-01-08T14:11:53.734Z	INFO	All workers finished	{"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode"}
2024-01-08T14:11:53.734Z	INFO	All workers finished	{"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim"}
2024-01-08T14:11:53.734Z	INFO	Stopping and waiting for caches
2024-01-08T14:11:53.735Z	INFO	Stopping and waiting for webhooks
2024-01-08T14:11:53.735Z	INFO	Stopping and waiting for HTTP servers
2024-01-08T14:11:53.735Z	INFO	shutting down server	{"kind": "health probe", "addr": "[::]:8081"}
2024-01-08T14:11:53.735Z	INFO	controller-runtime.metrics	Shutting down metrics server with timeout of 1 minute
2024-01-08T14:11:53.735Z	INFO	Wait completed, proceeding to shutdown the manager
2024-01-08T14:11:53.735Z	ERROR	error received after stop sequence was engaged	{"error": "leader election lost"}
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1
	/workspace/go/src/github.com/csi-addons/kubernetes-csi-addons/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:490


events:

0s          Normal    Killing                        pod/csi-addons-controller-manager-569cc9df96-b84mn                                   Stopping container manager
0s          Normal    Killing                        pod/csi-addons-controller-manager-569cc9df96-b84mn                                   Stopping container kube-rbac-proxy
0s          Warning   FailedToUpdateEndpoint         endpoints/csi-addons-controller-manager-metrics-service                              Failed to update endpoint openshift-storage/csi-addons-controller-manager-metrics-service: Operation cannot be fulfilled on endpoints "csi-addons-controller-manager-metrics-service": the object has been modified; please apply your changes to the latest version and try again

Comment 3 Niels de Vos 2024-01-08 17:46:52 UTC
The problem is not (only?) related to leader election. Even with leader election disabled (--leader-elect=false), running "oc adm must-gather" restarts the pod.

Comment 4 Niels de Vos 2024-01-09 13:37:47 UTC
Going through the logs on a node where the operator is stopped:

Jan 09 12:51:12 ip-10-0-12-180 kubenswrapper[2327]: I0109 12:51:12.348925    2327 kubelet.go:2441] "SyncLoop DELETE" source="api" pods=[openshift-storage/csi-addons-controller-manager-9799ddf95-bcgfr]
Jan 09 12:51:12 ip-10-0-12-180 kubenswrapper[2327]: I0109 12:51:12.349177    2327 kuberuntime_container.go:742] "Killing container with a grace period" pod="openshift-storage/csi-addons-controller-manager-9799ddf95-bcgfr" podUID=2edcc1ff-86d6-46d9-9abf-685b1dffd4df containerName="manager" containerID="cri-o://7d0790805ad7e0a4e8e723defd6160179591449c7e861b0c4fc600fd433726c6" gracePeriod=10
Jan 09 12:51:12 ip-10-0-12-180 kubenswrapper[2327]: I0109 12:51:12.349238    2327 kuberuntime_container.go:742] "Killing container with a grace period" pod="openshift-storage/csi-addons-controller-manager-9799ddf95-bcgfr" podUID=2edcc1ff-86d6-46d9-9abf-685b1dffd4df containerName="kube-rbac-proxy" containerID="cri-o://bdcf5de6445d49c88cd6d9b9d7fd93075a94a5f4a1e4bba2bf5326370ada1580" gracePeriod=10
Jan 09 12:51:12 ip-10-0-12-180 crio[2242]: time="2024-01-09 12:51:12.349592526Z" level=info msg="Stopping container: 7d0790805ad7e0a4e8e723defd6160179591449c7e861b0c4fc600fd433726c6 (timeout: 10s)" id=a40d2914-9291-41ee-950c-8d1fd386513d name=/runtime.v1.RuntimeService/StopContainer
Jan 09 12:51:12 ip-10-0-12-180 crio[2242]: time="2024-01-09 12:51:12.349592526Z" level=info msg="Stopping container: bdcf5de6445d49c88cd6d9b9d7fd93075a94a5f4a1e4bba2bf5326370ada1580 (timeout: 10s)" id=a8f3fedd-8b8e-4837-aab0-ed8a74966809 name=/runtime.v1.RuntimeService/StopContainer
Jan 09 12:51:12 ip-10-0-12-180 kubenswrapper[2327]: I0109 12:51:12.356742    2327 kubelet.go:2435] "SyncLoop REMOVE" source="api" pods=[openshift-storage/csi-addons-controller-manager-9799ddf95-bcgfr]
Jan 09 12:51:12 ip-10-0-12-180 systemd[1]: crio-bdcf5de6445d49c88cd6d9b9d7fd93075a94a5f4a1e4bba2bf5326370ada1580.scope: Deactivated successfully.
Jan 09 12:51:12 ip-10-0-12-180 systemd[1]: crio-bdcf5de6445d49c88cd6d9b9d7fd93075a94a5f4a1e4bba2bf5326370ada1580.scope: Consumed 1.143s CPU time.
Jan 09 12:51:12 ip-10-0-12-180 systemd[1]: crio-7d0790805ad7e0a4e8e723defd6160179591449c7e861b0c4fc600fd433726c6.scope: Deactivated successfully.
Jan 09 12:51:12 ip-10-0-12-180 systemd[1]: crio-conmon-bdcf5de6445d49c88cd6d9b9d7fd93075a94a5f4a1e4bba2bf5326370ada1580.scope: Deactivated successfully.
Jan 09 12:51:12 ip-10-0-12-180 systemd[1]: crio-conmon-7d0790805ad7e0a4e8e723defd6160179591449c7e861b0c4fc600fd433726c6.scope: Deactivated successfully.


The messages

"SyncLoop DELETE" source="api"
"SyncLoop REMOVE" source="api"

suggest that the API server instructs the kubelet to stop the pod (with its 2 containers). Unfortunately, the reason is not mentioned in this log.

Comment 5 Niels de Vos 2024-02-14 18:07:02 UTC
I think the problem is with the 'oc adm must-gather' command itself, and not with the collection of the data.

The following command also causes a restart of the csi-addons operator, even though it does not collect anything:

oc adm must-gather --image=quay.io/centos/centos:stream9-minimal -- /usr/bin/date


One of the first actions of the 'oc adm must-gather' command is to create a new namespace. By default this namespace is prefixed with "openshift-must-gather-", and anything starting with "openshift-" is a special case for ODF.

The following command causes a restart of the operator as well:

$ oc create ns openshift-rhbz2257259
namespace/openshift-rhbz2257259 created

The new namespace contains a csiaddons annotation:

$ oc describe ns/openshift-rhbz2257259
Name:         openshift-rhbz2257259
Labels:       kubernetes.io/metadata.name=openshift-rhbz2257259
Annotations:  openshift.io/sa.scc.mcs: s0:c28,c27
              openshift.io/sa.scc.supplemental-groups: 1000810000/10000
              openshift.io/sa.scc.uid-range: 1000810000/10000
              reclaimspace.csiaddons.openshift.io/schedule: @weekly
Status:       Active

No resource quota.

No LimitRange resource.


It seems that ocs-operator adds the annotation, and then explicitly restarts the csi-addons operator!

https://github.com/red-hat-storage/ocs-operator/blob/68b42d973015a38c926a7e165b2f749d8d7091a8/controllers/namespace/namespace_controller.go#L80
introduced in
https://github.com/red-hat-storage/ocs-operator/pull/2074

Unfortunately I cannot find why the restart is required. Nikhil, the PR comes from you; do you remember the details behind this requirement?
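
For context, the restart mechanism described above can be pictured with the rough Go sketch below. This is only a paraphrase of the linked namespace controller's behaviour, not the actual ocs-operator code; in particular, the pod label selector and the "openshift-storage" target namespace used for the restart are assumptions.

```
package sketch

import (
	"context"
	"strings"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const reclaimSpaceAnnotation = "reclaimspace.csiaddons.openshift.io/schedule"

// annotateAndRestart sketches the behaviour described above (NOT the real
// ocs-operator code): annotate any "openshift-" namespace with a reclaim-space
// schedule, then restart the csi-addons pods so the annotation is picked up.
func annotateAndRestart(ctx context.Context, c client.Client, ns *corev1.Namespace) error {
	if !strings.HasPrefix(ns.Name, "openshift-") {
		return nil
	}
	if ns.Annotations == nil {
		ns.Annotations = map[string]string{}
	}
	if _, ok := ns.Annotations[reclaimSpaceAnnotation]; !ok {
		ns.Annotations[reclaimSpaceAnnotation] = "@weekly"
		if err := c.Update(ctx, ns); err != nil {
			return err
		}
	}
	// Delete the csi-addons controller pods; their Deployment recreates them,
	// which is the "restart" observed in this bug. Namespace and label are assumptions.
	return c.DeleteAllOf(ctx, &corev1.Pod{},
		client.InNamespace("openshift-storage"),
		client.MatchingLabels{"app.kubernetes.io/name": "csi-addons"},
	)
}
```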

Comment 6 Nikhil Ladha 2024-02-15 10:17:06 UTC
Hi,

Yes, the restart is intentional. It was added so that existing RBD PVCs (i.e., on upgraded clusters) can make use of the annotation to reclaim space.
The must-gather tool and automation need to be updated to take this situation into account.

Ref EPIC: https://issues.redhat.com/browse/RHSTOR-4468

Comment 7 Niels de Vos 2024-02-15 12:29:00 UTC
Thanks Nikhil!

Is it intentional that the restart is triggered when a new openshift-* Namespace is created? The new Namespace would be empty, so there is no need to take existing PVCs into account.

Would that be a bug to address in ocs-operator? I.e. "do not restart the kubernetes-csi-addons operator when a Namespace without PVCs was annotated".


Separate from that, I now think restarting is the wrong approach in general. QE reported this BZ, but customers will notice the unexpected restart of the operator too. We should improve the kubernetes-csi-addons operator in such a way that restarting is not required.
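
For illustration, the "only restart when the namespace actually contains PVCs" idea suggested above could look like the small controller-runtime helper below. This is a minimal sketch, not the implemented fix; the helper name is made up.

```
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// namespaceHasPVCs reports whether the namespace contains at least one
// PersistentVolumeClaim; a restart would only be triggered when it does.
func namespaceHasPVCs(ctx context.Context, c client.Client, namespace string) (bool, error) {
	pvcs := &corev1.PersistentVolumeClaimList{}
	if err := c.List(ctx, pvcs, client.InNamespace(namespace), client.Limit(1)); err != nil {
		return false, err
	}
	return len(pvcs.Items) > 0, nil
}
```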

Comment 8 Nikhil Ladha 2024-02-15 13:05:13 UTC
I think the restart is needed for new namespaces because, without a restart, csi-addons won't consider that the namespace allows the space reclaim feature to be used. There might not be any PVCs now, but if new PVCs are created in the future, a restart of the csi-addons pod has to happen somehow for the namespace annotation to be considered for space reclaim.

There are some limitations as to why a restart is the only feasible option here for the csi-addons pod; I will let Madhu explain them better here.

Comment 10 Nikhil Ladha 2024-02-15 13:39:10 UTC
Ok, in that case should we just skip the reconcile for a namespace create event, similar to how we are doing it for delete events (https://github.com/red-hat-storage/ocs-operator/blob/main/controllers/namespace/namespace_controller.go#L193-L204)?


So, if a PVC is to be created in the new namespace, the user will have to manually add the annotation to the namespace, right?
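
For illustration, skipping namespace create (and delete) events in a controller-runtime controller usually looks like the sketch below. This is only a sketch of the idea discussed in this comment, not necessarily how PR 2479 implements it; the NamespaceReconciler type is a placeholder.

```
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// NamespaceReconciler is a placeholder reconciler type for this sketch.
type NamespaceReconciler struct {
	client.Client
}

func (r *NamespaceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Annotation handling / restart logic would live here.
	return ctrl.Result{}, nil
}

// ignoreCreateAndDelete drops Create and Delete events so that a brand-new
// (or disappearing) namespace never triggers a reconcile, and therefore
// never triggers a csi-addons pod restart.
var ignoreCreateAndDelete = predicate.Funcs{
	CreateFunc:  func(event.CreateEvent) bool { return false },
	DeleteFunc:  func(event.DeleteEvent) bool { return false },
	UpdateFunc:  func(event.UpdateEvent) bool { return true },
	GenericFunc: func(event.GenericEvent) bool { return false },
}

// SetupWithManager wires the event filter into the namespace controller.
func (r *NamespaceReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Namespace{}).
		WithEventFilter(ignoreCreateAndDelete).
		Complete(r)
}
```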

Comment 12 Nikhil Ladha 2024-02-27 06:24:49 UTC
Posted a PR for the same; this should take care of it.
https://github.com/red-hat-storage/ocs-operator/pull/2479

Comment 13 Nikhil Ladha 2024-02-27 06:35:46 UTC
Changing the bug component to ocs-operator, and updating the assignee to me.
If an update to csi-addons is needed, I think we can create a separate bug to track that.

Comment 16 Nikhil Ladha 2024-04-16 06:30:06 UTC
Since the fix PR is merged and the bz is approved for 4.16, moving the bug to ON_QA.

Comment 18 yati padia 2024-04-29 06:20:38 UTC
*** Bug 2245246 has been marked as a duplicate of this bug. ***

Comment 19 Oded 2024-04-30 11:39:08 UTC
Bug reproduced on odf-operator.v4.16.0-89.stable [OCP version: 4.16.0-0.nightly-2024-04-26-145258]


Running the must-gather command:
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.16
...

Checking the csi-addons-controller-manager pod in parallel:
$ oc get pods -n openshift-storage csi-addons-controller-manager-544669f475-lj8np  -w
NAME                                             READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-544669f475-lj8np   2/2     Running   0          15m
csi-addons-controller-manager-544669f475-lj8np   2/2     Terminating   0          15m
csi-addons-controller-manager-544669f475-lj8np   2/2     Terminating   0          15m

$ oc get pods -n openshift-storage csi-addons-controller-manager-544669f475-xjx62 
NAME                                             READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-544669f475-xjx62   2/2     Running   0          94s

$ oc get ns | grep must
openshift-must-gather-42778                        Active   22s


In addition, a user cannot create a new project with the "openshift-" prefix:
(venv) oviner~/test$ oc new-project openshift-te
Error from server (Forbidden): project.project.openshift.io "openshift-te" is forbidden: cannot request a project starting with "openshift-"

Comment 22 Oded 2024-05-06 09:31:11 UTC
Bug fixed:

Test procedure: 
1. Deploy an OCP 4.16 cluster (4.16.0-0.nightly-2024-05-04-214435) on the vSphere platform
2. Install ODF 4.16 (odf-operator.v4.16.0-92.stable)
3. Run must-gather 5 times:
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.16
4. Check the csi-addons-controller-manager pod [not restarted]:
csi-addons-controller-manager-564b8b74d7-njh2t                    2/2     Running     0             24h

Comment 24 errata-xmlrpc 2024-07-17 13:11:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

