Bug 2308667 - [ODF][rook] suspected memory leak
Summary: [ODF][rook] suspected memory leak
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Blaine Gardner
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-08-30 11:00 UTC by tochan
Modified: 2025-03-03 04:25 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-10-29 15:22:36 UTC
Embargoed:


Links
Red Hat Issue Tracker OCSBZM-8897 (Private: 0, Priority: None, Status: None, Summary: None, Last Updated: 2024-08-30 11:03:08 UTC)

Comment 11 kelwhite 2024-09-05 17:02:55 UTC
Travis,

I haven't had a chance to test anything yet; I just got 11k PVCs created :) Will update once I have more info... Setting needinfo on myself.
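
In the meantime, here's a rough sketch of how I plan to sample the operator's memory while those PVCs sit there (assuming oc adm top / the metrics pipeline is available on this cluster; the 5-minute interval is arbitrary):

# sample rook-ceph-operator memory every 5 minutes to see whether it keeps climbing
while true; do
  date -u
  oc adm top pods -n openshift-storage | grep rook-ceph-operator
  sleep 300
done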

Comment 12 kelwhite 2024-09-06 02:32:21 UTC
Travis,

Okay, I got a few moments to do some testing. From the UI, I see these events:

- Pod openshift-storage/rook-ceph-operator-554974456c-4pvn7 has been in a non-ready state for longer than 15 minutes.
- pod/rook-ceph-operator-554974456c-tw7gr in namespace openshift-storage on container rook-ceph-operator has been in waiting state for longer than 1 hour.
- Deployment openshift-storage/rook-ceph-operator has not matched the expected number of replicas for longer than 15 minutes. This indicates that cluster infrastructure is unable to start or restart the necessary components. This most often occurs when one or more nodes are down or partitioned from the cluster, or a fault occurs on the node that prevents the workload from starting. In rare cases this may indicate a new version of a cluster component cannot start due to a bug or configuration error. Assess the pods for this deployment to verify they are running on healthy nodes and then contact support.

These alerts don't seem to be correct at first glance:

kelson@quorra:~$ oc get pods|grep rook-ceph-operator-554974456c-4pvn7 
rook-ceph-operator-554974456c-4pvn7                               1/1     Running   0             50s
kelson@quorra:~$ oc get deployment|grep rook-ceph-operator
rook-ceph-operator                                     1/1     1            1           57d

However, I can't seem to view the logs:

kelson@quorra:~$ oc logs rook-ceph-operator-554974456c-4pvn7
Error from server (NotFound): pods "rook-ceph-operator-554974456c-4pvn7" not found
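
Side note for later: since the pod names keep changing, pulling logs by deployment instead of by pod name should avoid chasing replicas (the --since window here is just an example):

oc logs deployment/rook-ceph-operator -n openshift-storage --since=15m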

Seems the pod has restarted:

kelson@quorra:~$ oc get pods|grep rook-ceph-operator-554974456c-9585s
rook-ceph-operator-554974456c-9585s                               1/1     Running   0             50s

but it's gone again?:

kelson@quorra:~$ oc get pods|grep rook-ceph-operator-554974456c-9585s
kelson@quorra:~$ 

Hmm... it keeps getting redeployed:

rook-ceph-operator-554974456c-zkgg9                               1/1     Terminating       0             60s
rook-ceph-operator-554974456c-tx67x                               0/1     Pending           0             0s
rook-ceph-operator-554974456c-tx67x                               0/1     Pending           0             0s
rook-ceph-operator-554974456c-tx67x                               0/1     Pending           0             0s
rook-ceph-operator-554974456c-tx67x                               0/1     ContainerCreating   0             0s
rook-ceph-operator-554974456c-zkgg9                               0/1     Terminating         0             60s
rook-ceph-operator-554974456c-tx67x                               0/1     ContainerCreating   0             1s
rook-ceph-operator-554974456c-zkgg9                               0/1     Terminating         0             61s
rook-ceph-operator-554974456c-zkgg9                               0/1     Terminating         0             61s
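
A couple of checks that might show why the replicas keep getting replaced (the pod name below is a placeholder for whichever replica is current at the time, and this assumes the relevant events haven't aged out yet):

# did the previous container die (e.g. OOMKilled, which would fit a memory leak) or exit cleanly?
oc get pod <current-operator-pod> -n openshift-storage \
  -o jsonpath='{.status.containerStatuses[*].lastState}{"\n"}'

# recent events touching the operator deployment and its pods
oc get events -n openshift-storage --sort-by=.lastTimestamp | grep rook-ceph-operator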

Logs at the time of the pod restart:

2024-09-06 01:07:27.006485 D | telemetry: set telemetry key: rook/cluster/storage/device-set/count/portable=0
2024-09-06 01:07:27.006524 D | op-config: setting "rook/cluster/storage/device-set/count/non-portable"="3" option in the mon config-key store
2024-09-06 01:07:27.006548 D | exec: Running command: ceph config-key set rook/cluster/storage/device-set/count/non-portable 3 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2024-09-06 01:07:27.101556 I | operator: shutdown signal received, exiting... context canceled
2024-09-06 01:07:27.101811 I | op-mon: stopping monitoring of mons in namespace "openshift-storage"
2024-09-06 01:07:27.102027 I | op-osd: stopping monitoring of OSDs in namespace "openshift-storage"
2024-09-06 01:07:27.102077 I | ceph-cluster-controller: stopping monitoring of ceph status

Sadly, I don't know what this 'operator: shutdown signal received, exiting... context canceled' means. Viewing [1] gives some more detail on where this occurs, but why is it happening? Kubelet logs:

Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.186971    2539 kubelet.go:2419] "SyncLoop DELETE" source="api" pods=["openshift-storage/rook-ceph-operator-554974456c-77mm7"]
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.187904    2539 kuberuntime_container.go:745] "Killing container with a grace period" pod="openshift-storage/rook-ceph-operator-554974456c-77mm7" podUID="09e2cf5d-557a-461b-92ec-6d6559a43796" containerName="rook-ceph-operator" containerID="cri-o://44837cc7ff3875bfc8699ccd15cc513a61ed4d9cdf68a05098358c45b8ebdcf8" gracePeriod=30
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.228353    2539 kubelet.go:2403] "SyncLoop ADD" source="api" pods=["openshift-storage/rook-ceph-operator-554974456c-gtfrc"]
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.228517    2539 topology_manager.go:215] "Topology Admit Handler" podUID="0c8924ab-324c-45fe-86b2-2d21dac5a462" podNamespace="openshift-storage" podName="rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: E0906 01:13:27.228720    2539 cpu_manager.go:396] "RemoveStaleState: removing container" podUID="d66b27d4-1241-49ad-88f5-35c9857ae133" containerName="rook-ceph-operator"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.228738    2539 state_mem.go:107] "Deleted CPUSet assignment" podUID="d66b27d4-1241-49ad-88f5-35c9857ae133" containerName="rook-ceph-operator"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.228871    2539 memory_manager.go:346] "RemoveStaleState removing state" podUID="d66b27d4-1241-49ad-88f5-35c9857ae133" containerName="rook-ceph-operator"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.267752    2539 kubelet.go:2410] "SyncLoop UPDATE" source="api" pods=["openshift-storage/rook-ceph-operator-554974456c-gtfrc"]
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.326164    2539 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"rook-config\" (UniqueName: \"kubernetes.io/empty-dir/0c8924ab-324c-45fe-86b2-2d21dac5a462-rook-config\") pod \"rook-ceph-operator-554974456c-gtfrc\" (UID: \"0c8924ab-324c-45fe-86b2-2d21dac5a462\") " pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.332108    2539 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"default-config-dir\" (UniqueName: \"kubernetes.io/empty-dir/0c8924ab-324c-45fe-86b2-2d21dac5a462-default-config-dir\") pod \"rook-ceph-operator-554974456c-gtfrc\" (UID: \"0c8924ab-324c-45fe-86b2-2d21dac5a462\") " pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.332184    2539 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-nbhfr\" (UniqueName: \"kubernetes.io/projected/0c8924ab-324c-45fe-86b2-2d21dac5a462-kube-api-access-nbhfr\") pod \"rook-ceph-operator-554974456c-gtfrc\" (UID: \"0c8924ab-324c-45fe-86b2-2d21dac5a462\") " pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.433535    2539 reconciler_common.go:231] "operationExecutor.MountVolume started for volume \"rook-config\" (UniqueName: \"kubernetes.io/empty-dir/0c8924ab-324c-45fe-86b2-2d21dac5a462-rook-config\") pod \"rook-ceph-operator-554974456c-gtfrc\" (UID: \"0c8924ab-324c-45fe-86b2-2d21dac5a462\") " pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.433611    2539 reconciler_common.go:231] "operationExecutor.MountVolume started for volume \"default-config-dir\" (UniqueName: \"kubernetes.io/empty-dir/0c8924ab-324c-45fe-86b2-2d21dac5a462-default-config-dir\") pod \"rook-ceph-operator-554974456c-gtfrc\" (UID: \"0c8924ab-324c-45fe-86b2-2d21dac5a462\") " pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.433649    2539 reconciler_common.go:231] "operationExecutor.MountVolume started for volume \"kube-api-access-nbhfr\" (UniqueName: \"kubernetes.io/projected/0c8924ab-324c-45fe-86b2-2d21dac5a462-kube-api-access-nbhfr\") pod \"rook-ceph-operator-554974456c-gtfrc\" (UID: \"0c8924ab-324c-45fe-86b2-2d21dac5a462\") " pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.434043    2539 operation_generator.go:722] "MountVolume.SetUp succeeded for volume \"rook-config\" (UniqueName: \"kubernetes.io/empty-dir/0c8924ab-324c-45fe-86b2-2d21dac5a462-rook-config\") pod \"rook-ceph-operator-554974456c-gtfrc\" (UID: \"0c8924ab-324c-45fe-86b2-2d21dac5a462\") " pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.434252    2539 operation_generator.go:722] "MountVolume.SetUp succeeded for volume \"default-config-dir\" (UniqueName: \"kubernetes.io/empty-dir/0c8924ab-324c-45fe-86b2-2d21dac5a462-default-config-dir\") pod \"rook-ceph-operator-554974456c-gtfrc\" (UID: \"0c8924ab-324c-45fe-86b2-2d21dac5a462\") " pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.457487    2539 operation_generator.go:722] "MountVolume.SetUp succeeded for volume \"kube-api-access-nbhfr\" (UniqueName: \"kubernetes.io/projected/0c8924ab-324c-45fe-86b2-2d21dac5a462-kube-api-access-nbhfr\") pod \"rook-ceph-operator-554974456c-gtfrc\" (UID: \"0c8924ab-324c-45fe-86b2-2d21dac5a462\") " pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.603034    2539 util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc"
Sep 06 01:13:27 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:27.997613    2539 kubelet.go:2410] "SyncLoop UPDATE" source="api" pods=["openshift-storage/rook-ceph-operator-554974456c-gtfrc"]
Sep 06 01:13:28 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:28.071517    2539 kubelet.go:2435] "SyncLoop (PLEG): event for pod" pod="openshift-storage/rook-ceph-operator-554974456c-77mm7" event={"ID":"09e2cf5d-557a-461b-92ec-6d6559a43796","Type":"ContainerDied","Data":"44837cc7ff3875bfc8699ccd15cc513a61ed4d9cdf68a05098358c45b8ebdcf8"}
Sep 06 01:13:28 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:28.071560    2539 kubelet.go:2435] "SyncLoop (PLEG): event for pod" pod="openshift-storage/rook-ceph-operator-554974456c-77mm7" event={"ID":"09e2cf5d-557a-461b-92ec-6d6559a43796","Type":"ContainerDied","Data":"da911b7cae3e737e6bf235750bc7f0b70a2bc9f51dff63f3dda4d0862d79c141"}
Sep 06 01:13:28 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:28.073460    2539 kubelet.go:2435] "SyncLoop (PLEG): event for pod" pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc" event={"ID":"0c8924ab-324c-45fe-86b2-2d21dac5a462","Type":"ContainerStarted","Data":"d31051774bbd3f7627c036ee2163a455ce84f7197af08916518fd0cd59feed4e"}
Sep 06 01:13:28 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:28.143705    2539 kubelet.go:2419] "SyncLoop DELETE" source="api" pods=["openshift-storage/rook-ceph-operator-554974456c-77mm7"]
Sep 06 01:13:28 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:28.146430    2539 kubelet.go:2413] "SyncLoop REMOVE" source="api" pods=["openshift-storage/rook-ceph-operator-554974456c-77mm7"]
Sep 06 01:13:29 worker1.libvirt3.ocpcluster.cc kubenswrapper[2539]: I0906 01:13:29.082163    2539 kubelet.go:2435] "SyncLoop (PLEG): event for pod" pod="openshift-storage/rook-ceph-operator-554974456c-gtfrc" event={"ID":"0c8924ab-324c-45fe-86b2-2d21dac5a462","Type":"ContainerStarted","Data":"0208fd9eea43321abea50652ad50a003ae928f65522044b43860f33be8eb9521"}
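
The kubelet shows the DELETE arriving via the API ("SyncLoop DELETE" source="api"), so something is deleting the operator pods rather than the container crashing on its own. Something to try tomorrow, assuming the default API audit profile is enabled on this cluster (the node name below is a placeholder):

# list the kube-apiserver audit logs on the control-plane nodes
oc adm node-logs --role=master --path=kube-apiserver/

# then search one node's audit log for DELETEs against the operator pods to see which client issued them
oc adm node-logs <master-node> --path=kube-apiserver/audit.log \
  | jq 'select(.verb=="delete" and (.objectRef.name // "" | startswith("rook-ceph-operator")))
        | {requestReceivedTimestamp, user: .user.username, name: .objectRef.name}'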

I have to stop for today; I'll pick this up tomorrow...

[1] https://github.com/rook/rook/blob/master/pkg/operator/ceph/operator.go#L98

Comment 23 kelwhite 2024-09-18 15:33:34 UTC
Sorry, I've been slacking on this. Hopefully I'll have some time to get to it today...

Comment 27 Red Hat Bugzilla 2025-03-03 04:25:03 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

