Description of problem (please be as detailed as possible and provide log snippets):

On a Regional DR setup with ODF build 4.16.0-102.stable, we found that the ramen pod (below) restarts frequently, even with the OADP operator installed on the ODF managed cluster (which is needed for RDR setups). I internally upgraded ODF to build 4.16.0-108.stable, but the issue persisted.

(This output was taken after applying the workaround, when the pod had become somewhat stable; earlier it wasn't able to recover.)

oc get pods -n openshift-dr-system
NAME                                       READY   STATUS    RESTARTS         AGE
ramen-dr-cluster-operator-75b5dd74-m4pbb   2/2     Running   25 (4h37m ago)   4d3h

This was probably due to high memory utilisation by the pod. If we edit the CSV and the deployment for this pod and delete the configured resources (memory, cpu), the pod takes whatever resources it needs on its own, and its memory consumption is very high.

oc adm top pods -A|grep ramen
NAMESPACE             NAME                                       CPU(cores)   MEMORY(bytes)
openshift-dr-system   ramen-dr-cluster-operator-75b5dd74-m4pbb   2m           711Mi

On further debugging, @bmekhiss pointed out that there are thousands of secrets created for ocs-metrics-exporter on this cluster (more than 16K, and the count keeps increasing every few minutes).

This issue may not be DR specific. A similar observation was seen with MDS and NooBaa pods on 2 other clusters owned by Pratik and Sidhant.

I am opening this bug after discussion with @uchapaga. Given its severity, it should be considered a release blocker for ODF 4.16.

Version of all relevant components (if applicable):
ODF 4.16.0-108.stable
OCP 4.16.0-0.nightly-2024-05-19-083311

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?
Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. On an ODF cluster, deploy a few workloads and monitor the restart count of ODF pods and the overall secret count on the cluster.

Actual results: Too many ocs-metrics-exporter secrets make the ODF cluster unstable.

Expected results: Unnecessary secrets shouldn't be created on the cluster.

Additional info:
When I checked later, noobaa also went down on this setup.

noobaa-operator-8d9b98f74-8vpf7   0/1   CrashLoopBackOff   1064 (33s ago)   4d5h   10.131.1.47   compute-0   <none>   <none>

oc describe pod/noobaa-operator-8d9b98f74-8vpf7
Name:             noobaa-operator-8d9b98f74-8vpf7
Namespace:        openshift-storage
Priority:         0
Service Account:  noobaa
Node:             compute-0/10.1.114.176
Start Time:       Thu, 23 May 2024 13:22:46 +0530
Labels:           app=noobaa
                  noobaa-operator=deployment
                  pod-template-hash=8d9b98f74
Annotations:      alm-examples: [{"kind":"NooBaa","apiVersion":"noobaa.io/v1alpha1","metadata":{"name":"noobaa","creationTimestamp":null},"spec":{"cleanupPolicy":{},"secu...
                  capabilities: Basic Install
                  categories: Storage,Big Data
                  certified: false
                  containerImage: registry.redhat.io/odf4/mcg-rhel9-operator@sha256:982bb96ae2b0ee0992eef7015dae66beeb6f581e3773bbcc6f3bbb617d5e401f
                  createdAt: 2019-07-08T13:10:20.940Z
                  description: NooBaa is an object data service for hybrid and multi cloud environments.
                  features.operators.openshift.io/disconnected: true
                  features.operators.openshift.io/fips-compliant: true
                  features.operators.openshift.io/proxy-aware: true
                  features.operators.openshift.io/tls-profiles: false
                  features.operators.openshift.io/token-auth-aws: true
                  features.operators.openshift.io/token-auth-azure: false
                  features.operators.openshift.io/token-auth-gcp: false
                  k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["10.131.1.47/23"],"mac_address":"0a:58:0a:83:01:2f","gateway_ips":["10.131.0.1"],"routes":[{"dest":"10.128.0.0...
                  k8s.v1.cni.cncf.io/network-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "10.131.1.47" ], "mac": "0a:58:0a:83:01:2f", "default": true, "dns": {} }]
                  olm.operatorGroup: openshift-storage-7nqsg
                  olm.operatorNamespace: openshift-storage
                  olm.skipRange: >=4.2.0 <4.16.0-108.stable
                  olm.targetNamespaces: openshift-storage
                  olmcahash: eb037a323b0dead2da8d0a68085d6aa4c72002a42c0f46161572fe6c2696b328
                  openshift.io/scc: restricted-v2
                  operatorframework.io/properties: {"properties":[{"type":"olm.gvk","value":{"group":"noobaa.io","kind":"BackingStore","version":"v1alpha1"}},{"type":"olm.gvk","value":{"gro...
                  operators.openshift.io/infrastructure-features: ֿ'["disconnected"]'
                  operators.openshift.io/valid-subscription: ["OpenShift Platform Plus","OpenShift Data Foundation Essentials","OpenShift Data Foundation Advanced"]
                  operators.operatorframework.io/operator-type: non-standalone
                  repository: https://github.com/noobaa/noobaa-operator
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
                  support: Red Hat
Status:           Running
SeccompProfile:   RuntimeDefault
IP:               10.131.1.47
IPs:
  IP:           10.131.1.47
Controlled By:  ReplicaSet/noobaa-operator-8d9b98f74
Containers:
  noobaa-operator:
    Container ID:   cri-o://b7331c5d35e3b921ec38e5b12f7aa3fa93c91d322039b84e7f594528fd3c0f42
    Image:          registry.redhat.io/odf4/mcg-rhel9-operator@sha256:982bb96ae2b0ee0992eef7015dae66beeb6f581e3773bbcc6f3bbb617d5e401f
    Image ID:       registry.redhat.io/odf4/mcg-rhel9-operator@sha256:72eed97bb58ed74813dcf661514e5c6b3141f848531807a821b8e54e8bd2a528
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 27 May 2024 18:50:42 +0530
      Finished:     Mon, 27 May 2024 18:51:22 +0530
    Ready:          False
    Restart Count:  1064
    Limits:
      cpu:     250m
      memory:  512Mi
    Requests:
      cpu:     250m
      memory:  512Mi
    Environment:
      OPERATOR_NAME:            noobaa-operator
      POD_NAME:                 noobaa-operator-8d9b98f74-8vpf7 (v1:metadata.name)
      WATCH_NAMESPACE:          openshift-storage (v1:metadata.namespace)
      NOOBAA_CORE_IMAGE:        registry.redhat.io/odf4/mcg-core-rhel9@sha256:313739c72ce84a82b9ed6064747c4cbb2d069bd0f0479750293b0260eace675b
      NOOBAA_DB_IMAGE:          registry.redhat.io/rhel9/postgresql-15@sha256:1aeac23901c0147e4c6e9a1b8bb5f41dd6f95532b0d96adac55d609d0eed32fe
      ENABLE_NOOBAA_ADMISSION:  true
      OPERATOR_CONDITION_NAME:  mcg-operator.v4.16.0-108.stable
    Mounts:
      /apiserver.local.config/certificates from apiservice-cert (rw)
      /etc/ocp-injected-ca-bundle from ocp-injected-ca-bundle (rw)
      /tmp/k8s-webhook-server/serving-certs from webhook-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-27jzd (ro)
      /var/run/secrets/openshift/serviceaccount from bound-sa-token (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  bound-sa-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3600
  ocp-injected-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ocp-injected-ca-bundle
    Optional:  true
  apiservice-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  noobaa-operator-service-cert
    Optional:    false
  webhook-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  noobaa-operator-service-cert
    Optional:    false
  kube-api-access-27jzd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                 node.ocs.openshift.io/storage=true:NoSchedule
Events:
  Type     Reason   Age                       From     Message
  ----     ------   ----                      ----     -------
  Warning  BackOff  4m38s (x24531 over 4d5h)  kubelet  Back-off restarting failed container noobaa-operator in pod noobaa-operator-8d9b98f74-8vpf7_openshift-storage(89b3433a-36ec-40e6-915c-6e68a4c62ae0)

Unable to collect logs from this pod, and must-gather also failed twice.

oc logs pod/noobaa-operator-8d9b98f74-8vpf7 --tail 1000
Error from server: Get "https://10.1.114.176:10250/containerLogs/openshift-storage/noobaa-operator-8d9b98f74-8vpf7/noobaa-operator?tailLines=1000": write tcp 10.1.114.179:55308->10.1.114.176:10250: use of closed network connection
Operators that watch all secrets across namespaces are impacted by this issue if they do not apply the necessary cache filters. Impacted operators should apply filters so that they cache only relevant data, which significantly reduces memory usage. This applies to all kinds of resources.

On creation of a new serviceaccount, a new `*-dockercfg-*` secret is created and bound to the serviceaccount. In our case, thousands of `ocs-metrics-exporter-dockercfg-*` secrets are created because ocs-operator re-creates the `ocs-metrics-exporter` serviceaccount on every reconcile (nearly every 5 seconds). Each reconcile therefore triggers the creation of another dockercfg secret.

Providing devel_ack+, as we do not want to overwhelm the operators with thousands of unwanted secrets and cause crashes.
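To make the mechanism above concrete, here is a minimal in-memory sketch in Python. It is not ocs-operator code and does not talk to a real cluster: `FakeCluster`, `reconcile_recreate` and `reconcile_ensure` are hypothetical names, and the model assumes (as described above) that every serviceaccount creation yields one new dockercfg secret while earlier secrets stay behind. It compares a reconcile that re-creates the serviceaccount each time with one that creates it only when missing, over one day of reconciles at the 5-second interval mentioned above.

```python
import itertools


class FakeCluster:
    """In-memory stand-in for a cluster; not a real Kubernetes client."""

    def __init__(self):
        self.service_accounts = set()
        self.secrets = []
        self._ids = itertools.count()

    def create_service_account(self, name):
        # Assumption taken from the comment above: each creation of a
        # serviceaccount produces a new `<name>-dockercfg-<suffix>` secret.
        self.service_accounts.add(name)
        self.secrets.append(f"{name}-dockercfg-{next(self._ids):05d}")

    def delete_service_account(self, name):
        # Assumption: previously generated dockercfg secrets are left behind.
        self.service_accounts.discard(name)


def reconcile_recreate(cluster, name):
    """Leaky pattern: drop and re-create the serviceaccount on every pass."""
    cluster.delete_service_account(name)
    cluster.create_service_account(name)


def reconcile_ensure(cluster, name):
    """Idempotent pattern: create the serviceaccount only if it is missing."""
    if name not in cluster.service_accounts:
        cluster.create_service_account(name)


def run(reconcile, rounds):
    """Run `rounds` reconciles on a fresh fake cluster; return the secret count."""
    cluster = FakeCluster()
    for _ in range(rounds):
        reconcile(cluster, "ocs-metrics-exporter")
    return len(cluster.secrets)


SECONDS_PER_DAY = 24 * 60 * 60
rounds_per_day = SECONDS_PER_DAY // 5  # one reconcile every 5 seconds

leaky = run(reconcile_recreate, rounds_per_day)
stable = run(reconcile_ensure, rounds_per_day)
print(leaky, stable)
```

Under these assumptions the re-create pattern accumulates one secret per reconcile (17280 in a day), while the create-if-missing pattern stays at a single secret. The real rate depends on how often the reconcile actually fires.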
Why do we need to create a new serviceaccount on every reconcile? This sounds very wrong. Why not have *one* serviceaccount, recreated only if needed?
The issue was hit again on a fresh RDR setup with ODF 4.16.0-108.stable.

oc get pods -n openshift-operators
NAME                                          READY   STATUS             RESTARTS        AGE
volsync-controller-manager-784448784f-nmd2w   1/2     CrashLoopBackOff   6 (2m55s ago)   26m

oc describe pod volsync-controller-manager-784448784f-nmd2w -n openshift-operators
Name:             volsync-controller-manager-784448784f-nmd2w
Namespace:        openshift-operators
Priority:         0
Service Account:  volsync-controller-manager
Node:             compute-1/10.1.114.90
Start Time:       Wed, 29 May 2024 19:31:34 +0530
Labels:           app.kubernetes.io/part-of=volsync
                  control-plane=controller-manager
                  pod-template-hash=784448784f
Annotations:      alm-examples: [ { "apiVersion": "volsync.backube/v1alpha1", "kind": "ReplicationDestination", "metadata": { "labels": { "app.kubernetes.io/created-by": "volsync", "app.kubernetes.io/instance": "replicationdestination-sample", "app.kubernetes.io/managed-by": "kustomize", "app.kubernetes.io/name": "replicationdestination", "app.kubernetes.io/part-of": "volsync" }, "name": "replicationdestination-sample" }, "spec": { "rsync": { "accessModes": [ "ReadWriteOnce" ], "capacity": "10Gi", "copyMethod": "Snapshot", "serviceType": "ClusterIP" } } }, { "apiVersion": "volsync.backube/v1alpha1", "kind": "ReplicationSource", "metadata": { "labels": { "app.kubernetes.io/created-by": "volsync", "app.kubernetes.io/instance": "replicationsource-sample", "app.kubernetes.io/managed-by": "kustomize", "app.kubernetes.io/name": "replicationsource", "app.kubernetes.io/part-of": "volsync" }, "name": "replicationsource-sample" }, "spec": { "rsync": { "address": "my.host.com", "copyMethod": "Clone", "sshKeys": "secretRef" }, "sourcePVC": "pvcname", "trigger": { "schedule": "0 * * * *" } } } ]
                  capabilities: Basic Install
                  containerImage: registry.redhat.io/rhacm2/volsync-rhel8@sha256:af9b23a39f4cff4dbea7c7adc26c2d7ce4ebc6158a42db2e7e861951cafd89de
                  createdAt: 28 Feb 2024, 21:56
                  features.operators.openshift.io/disconnected: true
                  features.operators.openshift.io/fips-compliant: true
                  features.operators.openshift.io/proxy-aware: true
                  features.operators.openshift.io/tls-profiles: false
                  features.operators.openshift.io/token-auth-aws: false
                  features.operators.openshift.io/token-auth-azure: false
                  features.operators.openshift.io/token-auth-gcp: false
                  k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["10.131.0.255/23"],"mac_address":"0a:58:0a:83:00:ff","gateway_ips":["10.131.0.1"],"routes":[{"dest":"10.128.0....
                  k8s.v1.cni.cncf.io/network-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "10.131.0.255" ], "mac": "0a:58:0a:83:00:ff", "default": true, "dns": {} }]
                  kubectl.kubernetes.io/default-container: manager
                  olm.operatorGroup: global-operators
                  olm.operatorNamespace: openshift-operators
                  olm.skipRange: >=0.4.0 <0.8.1
                  olm.targetNamespaces:
                  openshift.io/scc: restricted-v2
                  operatorframework.io/properties: {"properties":[{"type":"olm.gvk","value":{"group":"volsync.backube","kind":"ReplicationDestination","version":"v1alpha1"}},{"type":"olm.gv...
                  operators.openshift.io/valid-subscription: ["OpenShift Platform Plus", "Red Hat Advanced Cluster Management for Kubernetes"]
                  operators.operatorframework.io/builder: operator-sdk-v1.31.0
                  operators.operatorframework.io/project_layout: go.kubebuilder.io/v3
                  repository: https://github.com/backube/volsync
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
                  support: Red Hat
Status:           Running
SeccompProfile:   RuntimeDefault
IP:               10.131.0.255
IPs:
  IP:           10.131.0.255
Controlled By:  ReplicaSet/volsync-controller-manager-784448784f
Containers:
  kube-rbac-proxy:
    Container ID:  cri-o://b88fefdae6fd7eb865b27db4880c879aaf05686909eb43c71f3a4bfbd2a905b2
    Image:         registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:fcb3b8ab93dfb5ef2b290e39ea5899dbb5e0c6d430370b8d281e59e74d94d749
    Image ID:      registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:695eb0e189565eb8ccf06d32f6f06598908d35cdfea88edc0c75560806a037b7
    Port:          8443/TCP
    Host Port:     0/TCP
    Args:
      --secure-listen-address=0.0.0.0:8443
      --upstream=http://127.0.0.1:8080/
      --logtostderr=true
      --tls-min-version=VersionTLS12
      --v=0
    State:          Running
      Started:      Wed, 29 May 2024 19:31:35 +0530
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  128Mi
    Requests:
      cpu:     5m
      memory:  64Mi
    Environment:
      OPERATOR_CONDITION_NAME:  volsync-product.v0.8.1
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jc666 (ro)
  manager:
    Container ID:  cri-o://530d8e52161a9cabd8185eb61055d81f6912f93ff2127ad44ce9a90c453f61d3
    Image:         registry.redhat.io/rhacm2/volsync-rhel8@sha256:af9b23a39f4cff4dbea7c7adc26c2d7ce4ebc6158a42db2e7e861951cafd89de
    Image ID:      registry.redhat.io/rhacm2/volsync-rhel8@sha256:4ff835154650867c8350e588a390ae123f51f9b2a39fdca8d1cca5a0e1649dcd
    Port:          <none>
    Host Port:     <none>
    Command:
      /manager
    Args:
      --health-probe-bind-address=:8081
      --metrics-bind-address=127.0.0.1:8080
      --leader-elect
      --scc-name=volsync-privileged-mover
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 29 May 2024 19:53:08 +0530
      Finished:     Wed, 29 May 2024 19:55:31 +0530
    Ready:          False
    Restart Count:  6
    Limits:
      cpu:     1
      memory:  300Mi
    Requests:
      cpu:      100m
      memory:   64Mi
    Liveness:   http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      RELATED_IMAGE_RSYNC_CONTAINER:      registry.redhat.io/rhacm2/volsync-rhel8@sha256:af9b23a39f4cff4dbea7c7adc26c2d7ce4ebc6158a42db2e7e861951cafd89de
      RELATED_IMAGE_RSYNC_TLS_CONTAINER:  registry.redhat.io/rhacm2/volsync-rhel8@sha256:af9b23a39f4cff4dbea7c7adc26c2d7ce4ebc6158a42db2e7e861951cafd89de
      RELATED_IMAGE_RCLONE_CONTAINER:     registry.redhat.io/rhacm2/volsync-rhel8@sha256:af9b23a39f4cff4dbea7c7adc26c2d7ce4ebc6158a42db2e7e861951cafd89de
      RELATED_IMAGE_RESTIC_CONTAINER:     registry.redhat.io/rhacm2/volsync-rhel8@sha256:af9b23a39f4cff4dbea7c7adc26c2d7ce4ebc6158a42db2e7e861951cafd89de
      RELATED_IMAGE_SYNCTHING_CONTAINER:  registry.redhat.io/rhacm2/volsync-rhel8@sha256:af9b23a39f4cff4dbea7c7adc26c2d7ce4ebc6158a42db2e7e861951cafd89de
      OPERATOR_CONDITION_NAME:            volsync-product.v0.8.1
    Mounts:
      /tmp from tempdir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jc666 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  tempdir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  kube-api-access-jc666:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason          Age                 From               Message
  ----     ------          ----                ----               -------
  Normal   Scheduled       27m                 default-scheduler  Successfully assigned openshift-operators/volsync-controller-manager-784448784f-nmd2w to compute-1
  Normal   AddedInterface  27m                 multus             Add eth0 [10.131.0.255/23] from ovn-kubernetes
  Normal   Pulled          27m                 kubelet            Container image "registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:fcb3b8ab93dfb5ef2b290e39ea5899dbb5e0c6d430370b8d281e59e74d94d749" already present on machine
  Normal   Created         27m                 kubelet            Created container kube-rbac-proxy
  Normal   Started         27m                 kubelet            Started container kube-rbac-proxy
  Normal   Pulled          15m (x5 over 27m)   kubelet            Container image "registry.redhat.io/rhacm2/volsync-rhel8@sha256:af9b23a39f4cff4dbea7c7adc26c2d7ce4ebc6158a42db2e7e861951cafd89de" already present on machine
  Normal   Created         15m (x5 over 27m)   kubelet            Created container manager
  Normal   Started         15m (x5 over 27m)   kubelet            Started container manager
  Warning  BackOff         111s (x37 over 21m) kubelet            Back-off restarting failed container manager in pod volsync-controller-manager-784448784f-nmd2w_openshift-operators(7b840625-ec71-4236-9428-9697d4368441)

This cluster now has over 16K secrets.

oc get secrets -A|wc -l
16860

The secondary ODF managed cluster has more than 17K secrets.

Editing the CSV and deployment for the volsync pod (removing the default resources) didn't help in this case.

I have collected the odf-must-gather and uploaded it here: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/29may24-2/

I will now delete the secrets, as they have been disrupting testing for the last few days.
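The `oc get secrets -A|wc -l` total above does not show which serviceaccount the secrets belong to. As a rough aid, the following Python sketch groups `*-dockercfg-*` secret names by namespace and owning serviceaccount name. It assumes input shaped like `oc get secrets -A --no-headers` (namespace in the first column, secret name in the second). The function name `count_dockercfg_owners` and the sample rows, including the namespace shown for `ocs-metrics-exporter`, are made up for illustration and are not taken from the affected cluster.

```python
import re
from collections import Counter

# Matches names such as `ocs-metrics-exporter-dockercfg-abc12` and captures
# the part before `-dockercfg-` as the owning serviceaccount name.
DOCKERCFG = re.compile(r"^(?P<owner>.+)-dockercfg-[a-z0-9]+$")


def count_dockercfg_owners(listing):
    """Count dockercfg secrets per (namespace, serviceaccount name)."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        namespace, name = fields[0], fields[1]
        match = DOCKERCFG.match(name)
        if match:
            counts[(namespace, match.group("owner"))] += 1
    return counts


# Illustrative sample only; these rows are not real output from the cluster.
sample = """\
openshift-storage  ocs-metrics-exporter-dockercfg-abc12  kubernetes.io/dockercfg  1  4d
openshift-storage  ocs-metrics-exporter-dockercfg-def34  kubernetes.io/dockercfg  1  4d
openshift-storage  noobaa-dockercfg-xyz99                kubernetes.io/dockercfg  1  4d
openshift-storage  noobaa-operator-service-cert          kubernetes.io/tls        2  4d
"""

counts = count_dockercfg_owners(sample)
for (namespace, owner), total in counts.most_common():
    print(namespace, owner, total)
```

On a live cluster the same function could be fed `sys.stdin.read()` from a piped `oc get secrets -A --no-headers`. A serviceaccount with thousands of entries would then stand out at the top of the list.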
*** Bug 2282927 has been marked as a duplicate of this bug. ***
Please update the RDT flag/text appropriately.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591