Description of problem (please be as detailed as possible and provide log snippets):
On the RDR Longevity cluster, which has been running for 2 months, observed ceph-mon restarts on C1 and C2.

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Keep the RDR cluster with replication going on for a longer duration (in this case, 2 months).
2. Observe ceph-mon pod restarts on C1 and C2.

Actual results:
Mons restarted multiple times within 3 days.

Expected results:

Additional info:
1) On C2, csi-addons-controller-manager, csi-rbdplugin-provisioner, ocs-operator and odf-operator-controller-manager also restarted multiple times.

Must-gather logs:
c1 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity/ceph-mon-restart/c1/
c2 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity/ceph-mon-restart/c2/
hub - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity/ceph-mon-restart/hub/

Live clusters are available for debugging:
hub - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25311/
c1 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25313/
c2 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25312/
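For reference, the restart counts and last-termination details can be pulled with something like the following (a minimal sketch, assuming the default openshift-storage namespace and the standard app=rook-ceph-mon pod label; `<mon-pod-name>` is a placeholder):

```
# List mon pods with their restart counts.
oc get pods -n openshift-storage -l app=rook-ceph-mon \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount

# Show details of the last termination (reason, exit code, time) for a given mon pod.
oc describe pod -n openshift-storage <mon-pod-name> | grep -A 7 "Last State"
```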
On c1, a mon pod restarted only once. On c2, 2 mon pods restarted more than once. Ceph status is HEALTH_OK in both clusters:

```
  cluster:
    id:     7725dcb7-f13b-4609-a52b-f781d5fb3cd2
    health: HEALTH_OK

  services:
    mon:        3 daemons, quorum a,b,d (age 33h)
    mgr:        a(active, since 3w)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 3w), 3 in (since 5w)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 200.71k objects, 753 GiB
    usage:   2.2 TiB used, 2.2 TiB / 4.4 TiB avail
    pgs:     169 active+clean

  io:
    client: 17 MiB/s rd, 5.5 MiB/s wr, 3.25k op/s rd, 1.12k op/s wr
```

```
  cluster:
    id:     009a2183-64c2-4101-8c42-ea892a9933c4
    health: HEALTH_OK

  services:
    mon:        3 daemons, quorum b,c,d (age 32h)
    mgr:        a(active, since 3d)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 2w), 3 in (since 5w)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 202.29k objects, 761 GiB
    usage:   2.2 TiB used, 3.8 TiB / 6 TiB avail
    pgs:     169 active+clean

  io:
    client: 76 MiB/s rd, 1.4 MiB/s wr, 1.23k op/s rd, 270 op/s wr
```

Does the mon restart affect the normal functioning of the cluster?
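If it helps, the status above can be re-captured on the live clusters with something like this (a sketch, assuming the rook-ceph-tools toolbox deployment is enabled in openshift-storage):

```
# Re-run ceph status from the toolbox pod and check for any recorded daemon crashes.
oc rsh -n openshift-storage deploy/rook-ceph-tools ceph status
oc rsh -n openshift-storage deploy/rook-ceph-tools ceph crash ls
oc rsh -n openshift-storage deploy/rook-ceph-tools ceph mon stat
```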
Adding needinfo for my question in the comment above.
@sapillai So far it doesn't affect cluster functionality, but I saw one more restart today on c2. What could be the reason for this behavior?
Based on the dmesg logs shared by Pratik, it looks like a memory limit issue is causing the mon containers to restart:

```
[Sun Aug 6 20:29:51 2023] [ 903772]   167  903772   582401   525721  4399104        0   -997 ceph-mon
[Sun Aug 6 20:29:51 2023] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod46a9f45e_de8d_403a_b0e3_51c515610a68.slice/crio-709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod46a9f45e_de8d_403a_b0e3_51c515610a68.slice/crio-709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,task=ceph-mon,pid=903772,uid=167
[Sun Aug 6 20:29:51 2023] Memory cgroup out of memory: Killed process 903772 (ceph-mon) total-vm:2329604kB, anon-rss:2076260kB, file-rss:26624kB, shmem-rss:0kB, UID:167 pgtables:4296kB oom_score_adj:-997
```
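For cross-checking on the other nodes, a sketch of how the same OOM evidence can be pulled from a node's kernel log (the node name is a placeholder; assumes cluster-admin access):

```
# Look for memory-cgroup OOM kills of ceph-mon in the node's kernel ring buffer.
oc debug node/<worker-node> -- chroot /host sh -c 'dmesg -T | grep -i -e oom -e "Killed process" | grep ceph-mon'
```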
(In reply to Santosh Pillai from comment #5)
> Based on the dmesg logs shared by Pratik, it looks like a memory limit issue
> is causing the mon containers to restart:
>
> ```
> [Sun Aug 6 20:29:51 2023] [ 903772]   167  903772   582401   525721  4399104        0   -997 ceph-mon
> [Sun Aug 6 20:29:51 2023] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod46a9f45e_de8d_403a_b0e3_51c515610a68.slice/crio-709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod46a9f45e_de8d_403a_b0e3_51c515610a68.slice/crio-709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,task=ceph-mon,pid=903772,uid=167
> [Sun Aug 6 20:29:51 2023] Memory cgroup out of memory: Killed process 903772 (ceph-mon) total-vm:2329604kB, anon-rss:2076260kB, file-rss:26624kB, shmem-rss:0kB, UID:167 pgtables:4296kB oom_score_adj:-997
> ```

Hi. Can you confirm whether the mon pod restarts you are seeing are all due to the same error and not anything else?
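One quick way to answer this (a sketch, assuming the default namespace and mon pod label) is to check whether each mon container's last termination reason is OOMKilled:

```
# Print pod name, last termination reason and exit code for every mon pod.
oc get pods -n openshift-storage -l app=rook-ceph-mon \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\t"}{.status.containerStatuses[*].lastState.terminated.exitCode}{"\n"}{end}'
```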
One more question: what kind of test/load is being run on this cluster? IMO, we haven't seen these mon restarts before, and ODF hasn't changed/reduced the memory requests/limits for mon pods in a long time. So I'm curious whether QE has introduced any new load tests that might be causing the mon pods to restart.

A possible workaround could be increasing the memory limits/requests for the mon pods. Currently they are:
```
Limits:
  cpu:     1
  memory:  2Gi
Requests:
  cpu:     1
  memory:  2Gi
```
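If we do go that route, a sketch of how the override could be applied via the StorageCluster CR (the CR name ocs-storagecluster, the spec.resources.mon path and the 3Gi value are assumptions to verify against the installed ODF version, not a validated recommendation):

```
# Raise the mon memory request/limit through the StorageCluster resource override.
oc patch storagecluster ocs-storagecluster -n openshift-storage --type merge \
  -p '{"spec":{"resources":{"mon":{"requests":{"cpu":"1","memory":"3Gi"},"limits":{"cpu":"1","memory":"3Gi"}}}}}'
```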
(In reply to Santosh Pillai from comment #6)
> (In reply to Santosh Pillai from comment #5)
> > Based on the dmesg logs shared by Pratik, it looks like a memory limit issue
> > is causing the mon containers to restart:
> >
> > ```
> > [Sun Aug 6 20:29:51 2023] [ 903772]   167  903772   582401   525721  4399104        0   -997 ceph-mon
> > [Sun Aug 6 20:29:51 2023] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod46a9f45e_de8d_403a_b0e3_51c515610a68.slice/crio-709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod46a9f45e_de8d_403a_b0e3_51c515610a68.slice/crio-709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,task=ceph-mon,pid=903772,uid=167
> > [Sun Aug 6 20:29:51 2023] Memory cgroup out of memory: Killed process 903772 (ceph-mon) total-vm:2329604kB, anon-rss:2076260kB, file-rss:26624kB, shmem-rss:0kB, UID:167 pgtables:4296kB oom_score_adj:-997
> > ```
>
> Hi. Can you confirm whether the mon pod restarts you are seeing are all due to the same error and not anything else?

@sapillai We haven't performed any operations on the clusters; only the replication workloads were running. So the mon restarts may well be associated with the memory limit issue.
(In reply to Santosh Pillai from comment #7)
> One more question: what kind of test/load is being run on this cluster?
> IMO, we haven't seen these mon restarts before, and ODF hasn't
> changed/reduced the memory requests/limits for mon pods in a long time.
> So I'm curious whether QE has introduced any new load tests that might be
> causing the mon pods to restart.
>
> A possible workaround could be increasing the memory limits/requests for
> the mon pods. Currently they are:
> ```
> Limits:
>   cpu:     1
>   memory:  2Gi
> Requests:
>   cpu:     1
>   memory:  2Gi
> ```

@sapillai We used the usual busybox workload; no new load tests have been introduced. FYI, no further restarts were seen in the past week.
Thanks for confirming that there are no restarts as of now. I would suggest moving this BZ to 4.15. If more restarts are observed, we can try increasing the resource limits for the mon pods.
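Until then, a simple way to keep an eye on how close the mons get to the 2Gi limit (a sketch; requires working pod metrics on the cluster):

```
# Show current memory usage of the mon pods, per container.
oc adm top pods -n openshift-storage -l app=rook-ceph-mon --containers
```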