2229110 – [RDR-CEPH] Ceph MON restarts on RDR Longevity cluster

Bug 2229110 - [RDR-CEPH] Ceph MON restarts on RDR Longevity cluster

Summary: [RDR-CEPH] Ceph MON restarts on RDR Longevity cluster

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	ceph
Sub Component:
Version:	4.13
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Santosh Pillai
QA Contact:	Elad
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2023-08-04 08:16 UTC by kmanohar
Modified:	2024-10-23 04:25 UTC (History)
CC List:	8 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2024-04-22 09:31:34 UTC
Embargoed:
Flags:	kramdoss: needinfo+

Attachments	(Terms of Use)

Description kmanohar 2023-08-04 08:16:26 UTC

Description of problem (please be detailed as possible and provide log
snippests):

On RDR Longevity cluster, running for 2 months, observed ceph-mon restarts on C1 and C2.

Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Can this issue reproducible?


Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:

1. Keep the RDR cluster with replication going on for a longer duration(in this case 2 months)
2. Observed ceph-mon pod restart on C1 and C2. 

Actual results:
Mons restarted multiple times within 3 days.


Expected results:


Additional info:
1) On C2 csi-addons-controller-manager, csi-rbdplugin-provisioner, ocs-operator, odf-operator-controller-manager also restarted multiple times.

Must-gather logs:-

c1 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity/ceph-mon-restart/c1/

c2 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity/ceph-mon-restart/c2/

hub - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity/ceph-mon-restart/hub/

Live cluster is avaiable for debugging

hub - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25311/

c1 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25313/

c2 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25312/

Comment 2 Santosh Pillai 2023-08-04 12:53:22 UTC

On c1 a mon pod restarted only once. 

On c2, 2 mon pods restarted more than once. 


Ceph status is health ok in both the clusters.   

cluster:
    id:     7725dcb7-f13b-4609-a52b-f781d5fb3cd2
    health: HEALTH_OK
 
  services:
    mon:        3 daemons, quorum a,b,d (age 33h)
    mgr:        a(active, since 3w)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 3w), 3 in (since 5w)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 200.71k objects, 753 GiB
    usage:   2.2 TiB used, 2.2 TiB / 4.4 TiB avail
    pgs:     169 active+clean
 
  io:
    client:   17 MiB/s rd, 5.5 MiB/s wr, 3.25k op/s rd, 1.12k op/s wr



  cluster:
    id:     009a2183-64c2-4101-8c42-ea892a9933c4
    health: HEALTH_OK
 
  services:
    mon:        3 daemons, quorum b,c,d (age 32h)
    mgr:        a(active, since 3d)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 2w), 3 in (since 5w)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 202.29k objects, 761 GiB
    usage:   2.2 TiB used, 3.8 TiB / 6 TiB avail
    pgs:     169 active+clean
 
  io:
    client:   76 MiB/s rd, 1.4 MiB/s wr, 1.23k op/s rd, 270 op/s wr


Does the mon restart affect the normal functioning of a cluster?

Comment 3 Santosh Pillai 2023-08-04 12:54:13 UTC

adding need info for my question in above comment.

Comment 4 kmanohar 2023-08-08 05:00:25 UTC

@sapillai So far it doesn't affect the functions but I see one more restart today on c2. What would be the possible reason for this behavior?

Comment 5 Santosh Pillai 2023-08-10 14:00:24 UTC

Based on the dmesg logs shared by Pratik, it looks like memory limit issue that's causing the mon containers to restart. 

``` [Sun Aug  6 20:29:51 2023] [ 903772]   167 903772   582401   525721  4399104        0          -997 ceph-mon
[Sun Aug  6 20:29:51 2023] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod46a9f45e_de8d_403a_b0e3_51c515610a68.slice/crio-709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod46a9f45e_de8d_403a_b0e3_51c515610a68.slice/crio-709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,task=ceph-mon,pid=903772,uid=167
[Sun Aug  6 20:29:51 2023] Memory cgroup out of memory: Killed process 903772 (ceph-mon) total-vm:2329604kB, anon-rss:2076260kB, file-rss:26624kB, shmem-rss:0kB, UID:167 pgtables:4296kB oom_score_adj:-997```

Comment 6 Santosh Pillai 2023-08-10 14:05:25 UTC

(In reply to Santosh Pillai from comment #5)
> Based on the dmesg logs shared by Pratik, it looks like memory limit issue
> that's causing the mon containers to restart. 
> 
> ``` [Sun Aug  6 20:29:51 2023] [ 903772]   167 903772   582401   525721 
> 4399104        0          -997 ceph-mon
> [Sun Aug  6 20:29:51 2023]
> oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-
> 709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,
> mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-
> burstable-pod46a9f45e_de8d_403a_b0e3_51c515610a68.slice/crio-
> 709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,
> task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-
> pod46a9f45e_de8d_403a_b0e3_51c515610a68.slice/crio-
> 709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,
> task=ceph-mon,pid=903772,uid=167
> [Sun Aug  6 20:29:51 2023] Memory cgroup out of memory: Killed process
> 903772 (ceph-mon) total-vm:2329604kB, anon-rss:2076260kB, file-rss:26624kB,
> shmem-rss:0kB, UID:167 pgtables:4296kB oom_score_adj:-997```

Hi. Can you confirm if the mon pod restarts that you see are because of the same error and not any thing else?

Comment 7 Santosh Pillai 2023-08-10 14:28:05 UTC

One more question is what kind of test/load is being run on this cluster? IMO, we haven't seen these mon restarts before and ODF hasn't changed/reduced the memory requests/limits for mon pods in a long time. 
So curious if QE has introduced any new load tests that might be causing the mon pods to restart. 

Solution for the workaround could be increasing the memory limits/requests for mon pods. Currently they are:
```
   Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:        1
      memory:     2Gi
```

Comment 8 kmanohar 2023-08-14 08:45:57 UTC

(In reply to Santosh Pillai from comment #6)
> (In reply to Santosh Pillai from comment #5)
> > Based on the dmesg logs shared by Pratik, it looks like memory limit issue
> > that's causing the mon containers to restart. 
> > 
> > ``` [Sun Aug  6 20:29:51 2023] [ 903772]   167 903772   582401   525721 
> > 4399104        0          -997 ceph-mon
> > [Sun Aug  6 20:29:51 2023]
> > oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-
> > 709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,
> > mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-
> > burstable-pod46a9f45e_de8d_403a_b0e3_51c515610a68.slice/crio-
> > 709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,
> > task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-
> > pod46a9f45e_de8d_403a_b0e3_51c515610a68.slice/crio-
> > 709abf06417c617700babacb304118344abf8b8a568d3c45da9ade62a56e0a23.scope,
> > task=ceph-mon,pid=903772,uid=167
> > [Sun Aug  6 20:29:51 2023] Memory cgroup out of memory: Killed process
> > 903772 (ceph-mon) total-vm:2329604kB, anon-rss:2076260kB, file-rss:26624kB,
> > shmem-rss:0kB, UID:167 pgtables:4296kB oom_score_adj:-997```
> 
> Hi. Can you confirm if the mon pod restarts that you see are because of the
> same error and not any thing else?

@sapillai We haven't performed any operations on the clusters, just replications were going on, so the reason for mon restart may be associated with the memory limit issue.

Comment 9 kmanohar 2023-08-14 09:58:55 UTC

(In reply to Santosh Pillai from comment #7)
> One more question is what kind of test/load is being run on this cluster?
> IMO, we haven't seen these mon restarts before and ODF hasn't
> changed/reduced the memory requests/limits for mon pods in a long time. 
> So curious if QE has introduced any new load tests that might be causing the
> mon pods to restart. 
> 
> Solution for the workaround could be increasing the memory limits/requests
> for mon pods. Currently they are:
> ```
>    Limits:
>       cpu:     1
>       memory:  2Gi
>     Requests:
>       cpu:        1
>       memory:     2Gi
> ```

@

Comment 10 kmanohar 2023-08-14 10:06:26 UTC

(In reply to Santosh Pillai from comment #7)
> One more question is what kind of test/load is being run on this cluster?
> IMO, we haven't seen these mon restarts before and ODF hasn't
> changed/reduced the memory requests/limits for mon pods in a long time. 
> So curious if QE has introduced any new load tests that might be causing the
> mon pods to restart. 
> 
> Solution for the workaround could be increasing the memory limits/requests
> for mon pods. Currently they are:
> ```
>    Limits:
>       cpu:     1
>       memory:  2Gi
>     Requests:
>       cpu:        1
>       memory:     2Gi
> ```

@sapillai We used the usual busybox workload. No new load tests have been introduced. FYI, No further restarts were seen in the past week.

Comment 11 Santosh Pillai 2023-08-14 10:19:48 UTC

Thanks for confirming that there are no restarts as of now. I would suggest to move this BZ for 4.15. If more restarts are observed, then we can try increasing the resource limits for mon pods.

Comment 12 Santosh Pillai 2024-01-12 06:17:40 UTC

Have more restarts been observed lately?

Comment 13 krishnaram Karthick 2024-01-16 06:18:33 UTC

Moving the bug out to 4.16 and we will create a new longevity cluster to retest this issue.

Comment 14 Santosh Pillai 2024-04-22 09:23:58 UTC

Hi, 

Any update on this ?

Comment 15 Santosh Pillai 2024-04-22 09:31:34 UTC

Since the issue was not reproduced later on, I will close this for now.  Please reopen if you still see the issue.

Comment 16 Red Hat Bugzilla 2024-10-23 04:25:02 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

Note You need to log in before you can comment on or make changes to this bug.