2069753 – [DR] OSD are getting OOM killed when running io

Keywords:
Status:	CLOSED DUPLICATE of bug 2021931
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	ceph
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Scott Ostapovicz
QA Contact:	Neha Berry
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-03-29 15:47 UTC by Pratik Surve
Modified:	2023-08-09 16:37 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-03-30 17:47:21 UTC
Embargoed:

Description Pratik Surve 2022-03-29 15:47:20 UTC

Description of problem (please be detailed as possible and provide log
snippets):

[DR] OSDs are being OOM killed while IO is running

Version of all relevant components (if applicable):

OCP version:- 4.10.0-0.nightly-2022-03-23-153617
ODF version:- 4.10.0-208
CEPH version:- ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes. PVCs take a long time to reach the Bound state.

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an RDR cluster
2. Run IO for 2-3 days min (100 PVCs/pods)
3. Check the OSD pod status


Actual results:
rook-ceph-osd-0-6c4cdb77f4-p4f7m                                  2/2     Running     66 (18m ago)   5d4h   10.131.0.36    vmware-dccp-one-4ch4f-worker-m8w42   <none>           <none>
rook-ceph-osd-1-69fb9b6d9f-7xtjt                                  2/2     Running     7 (111m ago)   5d4h   10.128.2.23    vmware-dccp-one-4ch4f-worker-hxbhv   <none>           <none>
rook-ceph-osd-2-dd8fb4c89-xb77w                                   2/2     Running     2 (85m ago)    5d4h   10.129.2.25    vmware-dccp-one-4ch4f-worker-56xcv   <none>           <none>
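
The restart counts in the listing above (fourth column) can be pulled out and ranked to see which OSD is affected most. This is a minimal sketch, assuming the whitespace-separated layout shown above; `POD_RE`, `restart_counts`, and `rank_by_restarts` are illustrative names, not part of any OpenShift or Rook tooling.

```python
import re

# Assumed column layout: NAME READY STATUS RESTARTS [(last restart)] AGE ...
POD_RE = re.compile(
    r"^(?P<name>\S+)\s+(?P<ready>\d+/\d+)\s+(?P<status>\S+)\s+"
    r"(?P<restarts>\d+)(?:\s+\((?P<last>[^)]*)\))?\s+(?P<age>\S+)"
)

# Rows copied from the listing above (trailing columns omitted).
SAMPLE = """\
rook-ceph-osd-0-6c4cdb77f4-p4f7m 2/2 Running 66 (18m ago) 5d4h
rook-ceph-osd-1-69fb9b6d9f-7xtjt 2/2 Running 7 (111m ago) 5d4h
rook-ceph-osd-2-dd8fb4c89-xb77w 2/2 Running 2 (85m ago) 5d4h
"""


def restart_counts(listing):
    """Map pod name -> restart count for every row that matches POD_RE."""
    counts = {}
    for line in listing.splitlines():
        match = POD_RE.match(line.strip())
        if match:
            counts[match.group("name")] = int(match.group("restarts"))
    return counts


def rank_by_restarts(listing):
    """Return (pod, restarts) pairs, most-restarted pod first."""
    return sorted(restart_counts(listing).items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for pod, restarts in rank_by_restarts(SAMPLE):
        print(f"{pod}: {restarts} restarts")
```

On the rows above, osd-0 ranks first with 66 restarts, followed by osd-1 (7) and osd-2 (2).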


Output from dmesg -T

[Tue Mar 29 15:18:30 2022] Memory cgroup out of memory: Killed process 3671834 (ceph-osd) total-vm:6175716kB, anon-rss:5225984kB, file-rss:34368kB, shmem-rss:0kB, UID:167 pgtables:11100kB oom_score_adj:-997
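
To put the dmesg line above in perspective, the per-category RSS figures can be summed and converted to GiB. This is a minimal sketch, assuming the kernel message format shown above; `OOM_RE` and `parse_oom_line` are illustrative names, not part of any Ceph or OpenShift tooling.

```python
import re

# Matches the "Killed process ..." part of a cgroup OOM message (format assumed
# from the dmesg line quoted in this report).
OOM_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB, shmem-rss:(?P<shmem_rss>\d+)kB"
)

KB_PER_GIB = 1024 * 1024


def parse_oom_line(line):
    """Return pid, command, memory fields in kB, and total RSS in GiB, or None."""
    match = OOM_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    result = {"pid": int(fields["pid"]), "comm": fields["comm"]}
    for key in ("total_vm", "anon_rss", "file_rss", "shmem_rss"):
        result[key + "_kb"] = int(fields[key])
    rss_kb = result["anon_rss_kb"] + result["file_rss_kb"] + result["shmem_rss_kb"]
    result["rss_gib"] = rss_kb / KB_PER_GIB
    return result


if __name__ == "__main__":
    line = (
        "[Tue Mar 29 15:18:30 2022] Memory cgroup out of memory: Killed process "
        "3671834 (ceph-osd) total-vm:6175716kB, anon-rss:5225984kB, "
        "file-rss:34368kB, shmem-rss:0kB, UID:167 pgtables:11100kB oom_score_adj:-997"
    )
    info = parse_oom_line(line)
    print(f"{info['comm']} (pid {info['pid']}): {info['rss_gib']:.2f} GiB resident")
```

For the line above this reports about 5.02 GiB of resident memory for ceph-osd at the time of the kill.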


Expected results:


Additional info:




$ oc adm top nodes
NAME                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
vmware-dccp-one-4ch4f-master-0       760m         21%    8649Mi          58%       
vmware-dccp-one-4ch4f-master-1       858m         24%    7500Mi          50%       
vmware-dccp-one-4ch4f-master-2       871m         24%    9735Mi          65%       
vmware-dccp-one-4ch4f-worker-56xcv   3389m        21%    8705Mi          13%       
vmware-dccp-one-4ch4f-worker-hxbhv   3640m        23%    13309Mi         21%       
vmware-dccp-one-4ch4f-worker-m8w42   3618m        23%    12912Mi         20%

Comment 4 Scott Ostapovicz 2022-03-30 17:47:21 UTC

Given the lack of detail on this ticket, I am going to agree that this seems to be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2021931 and close this one.  Please reopen it if you have additional information that conflicts with this assumption.

*** This bug has been marked as a duplicate of bug 2021931 ***
