Bug 2267607

Summary: [4.15.z clone] csi-addons-controller-manager pod is reset after running the must-gather command
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Nikhil Ladha <nladha>
Component: ocs-operator
Assignee: Nikhil Ladha <nladha>
Status: CLOSED DUPLICATE
QA Contact: Joy John Pinto <jopinto>
Severity: medium
Docs Contact:
Priority: unspecified
Version: unspecified
CC: asriram, kramdoss, nladha, odf-bz-bot
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2024-05-03 12:27:39 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Nikhil Ladha 2024-03-04 06:06:40 UTC
This bug was initially created as a copy of Bug #2257259

I am copying this bug because: 



Description of problem (please be as detailed as possible and provide log
snippets):

The csi-addons-controller-manager pod is restarted after running the must-gather command.
The csi-addons controller shuts down with the following messages:
```
2024-01-08T12:50:48.279Z	INFO	Stopping and waiting for non leader election runnables
2024-01-08T12:50:48.279Z	INFO	Stopping and waiting for leader election runnables
2024-01-08T12:50:48.279Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim"}
2024-01-08T12:50:48.279Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "volumereplication", "controllerGroup": "replication.storage.openshift.io", "controllerKind": "VolumeReplication"}
2024-01-08T12:50:48.279Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "reclaimspacecronjob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceCronJob"}
2024-01-08T12:50:48.279Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "reclaimspacejob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceJob"}
2024-01-08T12:50:48.279Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode"}
2024-01-08T12:50:48.279Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "networkfence", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "NetworkFence"}
2024-01-08T12:50:48.279Z	INFO	All workers finished	{"controller": "volumereplication", "controllerGroup": "replication.storage.openshift.io", "controllerKind": "VolumeReplication"}
2024-01-08T12:50:48.279Z	INFO	All workers finished	{"controller": "reclaimspacejob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceJob"}
2024-01-08T12:50:48.279Z	INFO	All workers finished	{"controller": "reclaimspacecronjob", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "ReclaimSpaceCronJob"}
2024-01-08T12:50:48.279Z	INFO	All workers finished	{"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim"}
2024-01-08T12:50:48.279Z	INFO	All workers finished	{"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode"}
2024-01-08T12:50:48.279Z	INFO	All workers finished	{"controller": "networkfence", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "NetworkFence"}
2024-01-08T12:50:48.279Z	INFO	Stopping and waiting for caches
2024-01-08T12:50:48.279Z	INFO	Stopping and waiting for webhooks
2024-01-08T12:50:48.279Z	INFO	Stopping and waiting for HTTP servers
2024-01-08T12:50:48.279Z	INFO	controller-runtime.metrics	Shutting down metrics server with timeout of 1 minute
2024-01-08T12:50:48.279Z	INFO	shutting down server	{"kind": "health probe", "addr": "[::]:8081"}
2024-01-08T12:50:48.279Z	INFO	Wait completed, proceeding to shutdown the manager
```
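
For reference, a minimal way to capture these shutdown messages live is to follow the manager container's logs while must-gather runs in a second terminal. This is only a sketch, assuming the default openshift-storage namespace and the "manager" container name used by the csi-addons deployment (both as seen later in this report):
```
# Terminal 1: follow the csi-addons manager container logs so the shutdown messages are captured live
POD=$(oc get pods -n openshift-storage -o name | grep csi-addons-controller-manager)
oc logs -n openshift-storage "$POD" -c manager -f

# Terminal 2: trigger the collection that causes the shutdown
oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.15
```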

Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?



Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
yes


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:

1. Check the csi-addons controller pod: oc get pods -n openshift-storage | grep csi-addons
2. Run must-gather
3. Look for the csi-addons pod again and check whether it was restarted (see the sketch below)
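
A minimal sketch of this check (the namespace and must-gather image are taken from this report; the grep pattern assumes the default csi-addons deployment name):
```
#!/usr/bin/env bash
set -euo pipefail

NS=openshift-storage

# Record the csi-addons controller pod name, restart count, and age before the run
before=$(oc get pods -n "$NS" --no-headers | grep csi-addons-controller-manager)
echo "Before: $before"

# Run must-gather (archive is written to the current directory as usual)
oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.15 > must.txt

# A new pod name, a higher restart count, or a reset age means the controller was restarted
after=$(oc get pods -n "$NS" --no-headers | grep csi-addons-controller-manager)
echo "After:  $after"
```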


Actual results:
must-gather gets collected but it restarts the csi-addons controller

Expected results:
must-gather gets collected without restarting the csi-addons controller


Additional info:

Comment 2 krishnaram Karthick 2024-04-01 10:30:48 UTC
Bug in NEW/ASSIGNED state. 
Moving the bug to 4.15.3 for a decision on RCA/FIX.

Comment 3 Nikhil Ladha 2024-04-01 10:34:15 UTC
(In reply to krishnaram Karthick from comment #2)
> Bug in NEW/ASSIGNED state. 
> Moving the bug to 4.15.3 for a decision on RCA/FIX.

The bug is a clone for 4.15.
The fix is already merged for the original BZ; once this bug is approved, I will create a backport PR.
So I think we should be good to take it in 4.15.2.

Comment 7 Joy John Pinto 2024-04-22 15:41:13 UTC
With OCP 4.15 and ODF 4.15.2-1, upon running the must-gather command "oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.15", the csi-addons-controller-manager pod is restarted.

(venv) [jopinto@jopinto brown416]$ date
Mon Apr 22 21:05:06 IST 2024
(venv) [jopinto@jopinto brown416]$ oc get pods -n openshift-storage -o wide | grep csi-addons
csi-addons-controller-manager-c889b47c9-rk5nf                     2/2     Running     0             46m    10.129.2.44    jopinto-c13416-696z2-worker-3-cldl4   <none>           <none>
(venv) [jopinto@jopinto brown416]$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.15 >must.txt
W0422 21:05:35.331132  426437 warnings.go:70] would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (containers "gather", "copy" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers "gather", "copy" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or containers "gather", "copy" must set securityContext.runAsNonRoot=true), seccompProfile (pod or containers "gather", "copy" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

[jopinto@jopinto brown416]$ oc describe pod csi-addons-controller-manager-c889b47c9-rf4b8 -n openshift-storage
Name:         csi-addons-controller-manager-c889b47c9-rf4b8
Namespace:    openshift-storage
Priority:     0
Node:         jopinto-c13416-696z2-worker-3-cldl4/10.241.128.4
Start Time:   Mon, 22 Apr 2024 21:05:34 +0530
...
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             node.ocs.openshift.io/storage=true:NoSchedule
Events:
  Type    Reason          Age   From               Message
  ----    ------          ----  ----               -------
  Normal  Scheduled       115s  default-scheduler  Successfully assigned openshift-storage/csi-addons-controller-manager-c889b47c9-rf4b8 to jopinto-c13416-696z2-worker-3-cldl4
  Normal  AddedInterface  114s  multus             Add eth0 [10.129.2.46/23] from ovn-kubernetes
  Normal  Pulled          114s  kubelet            Container image "registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:063f7f9ee19b67e2184b2b461c86e15608951864bb84c7398b2e441f9ec6164f" already present on machine
  Normal  Created         114s  kubelet            Created container kube-rbac-proxy
  Normal  Started         114s  kubelet            Started container kube-rbac-proxy
  Normal  Pulled          114s  kubelet            Container image "registry.redhat.io/odf4/odf-csi-addons-rhel9-operator@sha256:7146ad801388ffd84d71fe1640f77cc9710883f4f9b22e359df75c1f424d99ce" already present on machine
  Normal  Created         114s  kubelet            Created container manager
  Normal  Started         114s  kubelet            Started container manager
[jopinto@jopinto brown416]$
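
Note that the pod name changed across the run (...-rk5nf before, ...-rf4b8 after) and the new pod's events are only ~115s old, which suggests the pod itself was recreated rather than a container restarting in place. A small sketch to confirm this, using the pod name from the output above:
```
# A changed pod name with a freshly reset age indicates the pod itself was recreated
oc get pods -n openshift-storage -o wide | grep csi-addons-controller-manager

# Per-container restart counts on the new pod (expected to be 0 on a freshly created pod)
oc get pod -n openshift-storage csi-addons-controller-manager-c889b47c9-rf4b8 \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\n"}{end}'
```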

Comment 8 Joy John Pinto 2024-04-24 06:33:45 UTC
Moving the bug back to ASSIGNED state based on https://bugzilla.redhat.com/show_bug.cgi?id=2267607#c7

Comment 9 Nikhil Ladha 2024-04-24 06:38:36 UTC
The fix should work for this BZ, but it appears to have missed the latest 4.15 build, and due to release timelines we are not able to get a new build for 4.15.
Hence, moving the bug back to POST state and proposing it for 4.15.3.