This bug was initially created as a copy of Bug #2257259. I am copying this bug because:

Description of problem (please be as detailed as possible and provide log snippets):

The csi-addons-controller-manager pod is restarted after running the must-gather command. The csi-addons controller exits with the following messages:

```
2024-01-08T12:50:48.279Z INFO Stopping and waiting for non leader election runnables
2024-01-08T12:50:48.279Z INFO Stopping and waiting for leader election runnables
2024-01-08T12:50:48.279Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim"}
2024-01-08T12:50:48.279Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "volumereplication", "controllerGroup": "replication.storage.openshift.io", "controllerKind": "VolumeReplication"}
2024-01-08T12:50:48.279Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "reclaimspacecronjob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceCronJob"}
2024-01-08T12:50:48.279Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "reclaimspacejob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceJob"}
2024-01-08T12:50:48.279Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode"}
2024-01-08T12:50:48.279Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "networkfence", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "NetworkFence"}
2024-01-08T12:50:48.279Z INFO All workers finished {"controller": "volumereplication", "controllerGroup": "replication.storage.openshift.io", "controllerKind": "VolumeReplication"}
2024-01-08T12:50:48.279Z INFO All workers finished {"controller": "reclaimspacejob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceJob"}
2024-01-08T12:50:48.279Z INFO All workers finished {"controller": "reclaimspacecronjob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceCronJob"}
2024-01-08T12:50:48.279Z INFO All workers finished {"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim"}
2024-01-08T12:50:48.279Z INFO All workers finished {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode"}
2024-01-08T12:50:48.279Z INFO All workers finished {"controller": "networkfence", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "NetworkFence"}
2024-01-08T12:50:48.279Z INFO Stopping and waiting for caches
2024-01-08T12:50:48.279Z INFO Stopping and waiting for webhooks
2024-01-08T12:50:48.279Z INFO Stopping and waiting for HTTP servers
2024-01-08T12:50:48.279Z INFO controller-runtime.metrics Shutting down metrics server with timeout of 1 minute
2024-01-08T12:50:48.279Z INFO shutting down server {"kind": "health probe", "addr": "[::]:8081"}
2024-01-08T12:50:48.279Z INFO Wait completed, proceeding to shutdown the manager
```

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
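The log above is controller-runtime's normal graceful-shutdown sequence: when the manager's context is cancelled (typically by SIGTERM when the kubelet terminates the pod), every controller worker is drained and the "Stopping and waiting for ..." / "All workers finished" lines are emitted. As a minimal illustrative sketch (not the actual csi-addons main.go), a controller-runtime-based operator usually wires this up like so:

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Create the manager; real options (scheme, metrics, leader election) trimmed for brevity.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		ctrl.Log.Error(err, "unable to create manager")
		os.Exit(1)
	}

	// SetupSignalHandler returns a context that is cancelled on SIGTERM/SIGINT.
	// When the pod is deleted, the SIGTERM cancels this context and mgr.Start
	// drains all controllers, producing the shutdown messages shown in the log.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		ctrl.Log.Error(err, "problem running manager")
		os.Exit(1)
	}
}
```

So the log itself does not indicate a crash; the point of this bug is that the pod termination is triggered as a side effect of running must-gather, which should not happen.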
Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible? Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Check oc get pods -n openshift-storage | grep csi-addons
2. Run must-gather
3. Look for the csi-addons pod again (see the verification sketch below)

Actual results:
must-gather gets collected, but it restarts the csi-addons controller

Expected results:
must-gather gets collected without restarting the csi-addons controller

Additional info:
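For step 3, a hypothetical verification helper using client-go could look like the sketch below. The kubeconfig path and the label selector are assumptions (compare with `oc get pods -n openshift-storage --show-labels`); a changed pod name or creation timestamp after must-gather means the pod was recreated, while a non-zero restart count means a container restarted in place.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (default ~/.kube/config; adjust as needed).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// List csi-addons controller pods; the label selector is an assumption
	// and may differ in your cluster.
	pods, err := client.CoreV1().Pods("openshift-storage").List(context.TODO(), metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/name=csi-addons",
	})
	if err != nil {
		panic(err)
	}

	// Print identity and restart information to compare before/after must-gather.
	for _, p := range pods.Items {
		fmt.Printf("%s created=%s\n", p.Name, p.CreationTimestamp)
		for _, cs := range p.Status.ContainerStatuses {
			fmt.Printf("  container=%s restarts=%d\n", cs.Name, cs.RestartCount)
		}
	}
}
```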
Bug in NEW/ASSIGNED state. Moving the bug to 4.15.3 for a decision on RCA/FIX.
(In reply to krishnaram Karthick from comment #2)
> Bug in NEW/ASSIGNED state.
> Moving the bug to 4.15.3 for a decision on RCA/FIX.

This bug is a clone for 4.15. The fix is already merged for the original BZ; once this bug is approved, I will create a backport PR. So I think we should be good to take it in 4.15.2.
With OCP 4.15 and ODF 4.15.2-1, upon running the must-gather command "oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.15", the csi-addons-controller-manager pod goes for a restart.

(venv) [jopinto@jopinto brown416]$ date
Mon Apr 22 21:05:06 IST 2024

(venv) [jopinto@jopinto brown416]$ oc get pods -n openshift-storage -o wide | grep csi-addons
csi-addons-controller-manager-c889b47c9-rk5nf   2/2   Running   0   46m   10.129.2.44   jopinto-c13416-696z2-worker-3-cldl4   <none>   <none>

(venv) [jopinto@jopinto brown416]$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.15 >must.txt
W0422 21:05:35.331132  426437 warnings.go:70] would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (containers "gather", "copy" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers "gather", "copy" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or containers "gather", "copy" must set securityContext.runAsNonRoot=true), seccompProfile (pod or containers "gather", "copy" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

[jopinto@jopinto brown416]$ oc describe pod csi-addons-controller-manager-c889b47c9-rf4b8 -n openshift-storage
Name:         csi-addons-controller-manager-c889b47c9-rf4b8
Namespace:    openshift-storage
Priority:     0
Node:         jopinto-c13416-696z2-worker-3-cldl4/10.241.128.4
Start Time:   Mon, 22 Apr 2024 21:05:34 +0530
...
  ConfigMapName:     openshift-service-ca.crt
  ConfigMapOptional: <nil>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                 node.ocs.openshift.io/storage=true:NoSchedule
Events:
  Type    Reason          Age   From               Message
  ----    ------          ----  ----               -------
  Normal  Scheduled       115s  default-scheduler  Successfully assigned openshift-storage/csi-addons-controller-manager-c889b47c9-rf4b8 to jopinto-c13416-696z2-worker-3-cldl4
  Normal  AddedInterface  114s  multus             Add eth0 [10.129.2.46/23] from ovn-kubernetes
  Normal  Pulled          114s  kubelet            Container image "registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:063f7f9ee19b67e2184b2b461c86e15608951864bb84c7398b2e441f9ec6164f" already present on machine
  Normal  Created         114s  kubelet            Created container kube-rbac-proxy
  Normal  Started         114s  kubelet            Started container kube-rbac-proxy
  Normal  Pulled          114s  kubelet            Container image "registry.redhat.io/odf4/odf-csi-addons-rhel9-operator@sha256:7146ad801388ffd84d71fe1640f77cc9710883f4f9b22e359df75c1f424d99ce" already present on machine
  Normal  Created         114s  kubelet            Created container manager
  Normal  Started         114s  kubelet            Started container manager
[jopinto@jopinto brown416]$

Note that the pod name changed from csi-addons-controller-manager-c889b47c9-rk5nf (46m old before must-gather) to csi-addons-controller-manager-c889b47c9-rf4b8, and the events show a freshly scheduled pod (~115s old), confirming the controller pod was recreated while must-gather ran.
Moving the bug back to ASSIGNED state based on https://bugzilla.redhat.com/show_bug.cgi?id=2267607#c7
The fix should work for this BZ, but for some reason it appears to have missed the latest 4.15 build, and due to release timelines we are not able to produce a new build for 4.15. Hence, moving the bug back to POST state and proposing it for 4.15.3.