Bug 2229110
| Summary: | [RDR-CEPH] Ceph MON restarts on RDR Longevity cluster | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | kmanohar |
| Component: | ceph | Assignee: | Santosh Pillai <sapillai> |
| ceph sub component: | Ceph-MGR | QA Contact: | Elad <ebenahar> |
| Status: | NEW --- | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | amagrawa, bniver, muagarwa, nojha, odf-bz-bot, sapillai, sostapov |
| Version: | 4.13 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (kmanohar, 2023-08-04 08:16:26 UTC)
On c1, a mon pod restarted only once.
On c2, two mon pods restarted more than once.
Ceph status is HEALTH_OK in both clusters:
```
  cluster:
    id:     7725dcb7-f13b-4609-a52b-f781d5fb3cd2
    health: HEALTH_OK

  services:
    mon:        3 daemons, quorum a,b,d (age 33h)
    mgr:        a(active, since 3w)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 3w), 3 in (since 5w)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 200.71k objects, 753 GiB
    usage:   2.2 TiB used, 2.2 TiB / 4.4 TiB avail
    pgs:     169 active+clean

  io:
    client: 17 MiB/s rd, 5.5 MiB/s wr, 3.25k op/s rd, 1.12k op/s wr
```
```
  cluster:
    id:     009a2183-64c2-4101-8c42-ea892a9933c4
    health: HEALTH_OK

  services:
    mon:        3 daemons, quorum b,c,d (age 32h)
    mgr:        a(active, since 3d)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 2w), 3 in (since 5w)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 202.29k objects, 761 GiB
    usage:   2.2 TiB used, 3.8 TiB / 6 TiB avail
    pgs:     169 active+clean

  io:
    client: 76 MiB/s rd, 1.4 MiB/s wr, 1.23k op/s rd, 270 op/s wr
```
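
For context, status output like the above can typically be collected from the ODF toolbox; a minimal sketch, assuming the rook-ceph-tools toolbox deployment is enabled in the openshift-storage namespace:

```
# List the toolbox pod and run "ceph -s" inside it (namespace and label are assumptions).
oc -n openshift-storage get pods -l app=rook-ceph-tools
oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph -s
```
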
Does the mon restart affect the normal functioning of a cluster?

Adding needinfo for my question in the comment above.

@sapillai So far it doesn't affect the cluster's functioning, but I see one more restart today on c2. What would be the possible reason for this behavior?

Santosh Pillai (comment #5):

Based on the dmesg logs shared by Pratik, it looks like a memory limit issue is causing the mon containers to restart.

```
[Sun Aug 6 20:29:51 2023] [ 903772]   167  903772   582401   525721  4399104   0  -997  ceph-mon
[Sun Aug 6 20:29:51 2023] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod46a9f45e_de8d_403a_b0e3_51c515610a68.slice/crio-709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod46a9f45e_de8d_403a_b0e3_51c515610a68.slice/crio-709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,task=ceph-mon,pid=903772,uid=167
[Sun Aug 6 20:29:51 2023] Memory cgroup out of memory: Killed process 903772 (ceph-mon) total-vm:2329604kB, anon-rss:2076260kB, file-rss:26624kB, shmem-rss:0kB, UID:167 pgtables:4296kB oom_score_adj:-997
```

Santosh Pillai (comment #6):

(In reply to Santosh Pillai from comment #5)
> Based on the dmesg logs shared by Pratik, it looks like a memory limit issue
> is causing the mon containers to restart.

Hi. Can you confirm that the mon pod restarts you see are because of the same error and not anything else?

Santosh Pillai (comment #7):

One more question: what kind of test/load is being run on this cluster? IMO, we haven't seen these mon restarts before, and ODF hasn't changed/reduced the memory requests/limits for mon pods in a long time. So I'm curious whether QE has introduced any new load tests that might be causing the mon pods to restart.

A possible workaround would be to increase the memory limits/requests for the mon pods. Currently they are:

```
Limits:
  cpu:     1
  memory:  2Gi
Requests:
  cpu:     1
  memory:  2Gi
```
(In reply to Santosh Pillai from comment #6)
> Hi. Can you confirm that the mon pod restarts you see are because of the
> same error and not anything else?

@sapillai We haven't performed any operations on the clusters; only the replications were going on. So the mon restarts are most likely associated with the memory limit issue.

(In reply to Santosh Pillai from comment #7)
> One more question: what kind of test/load is being run on this cluster?
> So I'm curious whether QE has introduced any new load tests that might be
> causing the mon pods to restart.

@sapillai We used the usual busybox workload. No new load tests have been introduced. FYI, no further restarts were seen in the past week.

Santosh Pillai:

Thanks for confirming that there are no restarts as of now. I would suggest moving this BZ to 4.15. If more restarts are observed, we can try increasing the resource limits for the mon pods.
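
If the restarts do recur and the limits need to be raised, a minimal sketch of one way such an override might be applied, assuming the default StorageCluster name "ocs-storagecluster" in openshift-storage and that the CR accepts per-daemon overrides under spec.resources; the 3Gi value is purely illustrative, not a recommendation:

```
# Raise mon memory requests/limits (example values only).
oc -n openshift-storage patch storagecluster ocs-storagecluster --type merge -p \
  '{"spec":{"resources":{"mon":{"limits":{"cpu":"1","memory":"3Gi"},"requests":{"cpu":"1","memory":"3Gi"}}}}}'

# Confirm the new values propagated to the mon pods.
oc -n openshift-storage get pods -l app=rook-ceph-mon \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'
```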