Bug 2069753

Summary: [DR] OSD are getting OOM killed when running io
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ceph
Version: 4.10
Severity: urgent
Priority: unspecified
Status: CLOSED DUPLICATE
Reporter: Pratik Surve <prsurve>
Assignee: Scott Ostapovicz <sostapov>
QA Contact: Neha Berry <nberry>
CC: bniver, ekuric, idryomov, kseeger, madam, muagarwa, ocs-bugs, odf-bz-bot, srangana
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Regression: ---
Last Closed: 2022-03-30 17:47:21 UTC

Description Pratik Surve 2022-03-29 15:47:20 UTC
Description of problem (please be as detailed as possible and provide log snippets):

[DR] OSD are getting OOM killed when running io

Version of all relevant components (if applicable):

OCP version:- 4.10.0-0.nightly-2022-03-23-153617
ODF version:- 4.10.0-208
CEPH version:- ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes. PVCs take a long time to reach the Bound state.

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an RDR cluster.
2. Run IO for at least 2-3 days (minimum 100 PVCs/pods).
3. Check the OSD pod status (see the example command below).
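
A minimal way to check step 3, assuming the default ODF namespace (openshift-storage) and the standard Rook label on OSD pods (app=rook-ceph-osd):

# list OSD pods with restart counts and the node each one is scheduled on
$ oc -n openshift-storage get pods -l app=rook-ceph-osd -o wide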


Actual results:
rook-ceph-osd-0-6c4cdb77f4-p4f7m                                  2/2     Running     66 (18m ago)   5d4h   10.131.0.36    vmware-dccp-one-4ch4f-worker-m8w42   <none>           <none>
rook-ceph-osd-1-69fb9b6d9f-7xtjt                                  2/2     Running     7 (111m ago)   5d4h   10.128.2.23    vmware-dccp-one-4ch4f-worker-hxbhv   <none>           <none>
rook-ceph-osd-2-dd8fb4c89-xb77w                                   2/2     Running     2 (85m ago)    5d4h   10.129.2.25    vmware-dccp-one-4ch4f-worker-56xcv   <none>           <none>
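
The restart counts above can be cross-checked against the container status to confirm they were OOM kills; a sketch, assuming the standard Kubernetes container status fields and using one of the pod names from the listing:

# print each container's last termination reason for the selected OSD pod
$ oc -n openshift-storage get pod rook-ceph-osd-0-6c4cdb77f4-p4f7m \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{"\n"}{end}'

A reason of OOMKilled for the osd container indicates the kernel killed it for exceeding the pod memory limit.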


Output from dmesg -T

[Tue Mar 29 15:18:30 2022] Memory cgroup out of memory: Killed process 3671834 (ceph-osd) total-vm:6175716kB, anon-rss:5225984kB, file-rss:34368kB, shmem-rss:0kB, UID:167 pgtables:11100kB oom_score_adj:-997
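
For reference, equivalent node-level output can be collected through a debug pod; a sketch, assuming the node names from the pod listing above and that the debug image allows chroot into /host:

# read the node's kernel log and filter for OOM-kill events
$ oc debug node/vmware-dccp-one-4ch4f-worker-m8w42 -- chroot /host dmesg -T | grep -i 'out of memory'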


Expected results:


Additional info:




$oc adm top nodes                                                            
NAME                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
vmware-dccp-one-4ch4f-master-0       760m         21%    8649Mi          58%       
vmware-dccp-one-4ch4f-master-1       858m         24%    7500Mi          50%       
vmware-dccp-one-4ch4f-master-2       871m         24%    9735Mi          65%       
vmware-dccp-one-4ch4f-worker-56xcv   3389m        21%    8705Mi          13%       
vmware-dccp-one-4ch4f-worker-hxbhv   3640m        23%    13309Mi         21%       
vmware-dccp-one-4ch4f-worker-m8w42   3618m        23%    12912Mi         20%
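
The worker nodes themselves show no memory pressure (13-21% used), and the dmesg line reports a memory cgroup OOM, so the OSDs appear to be hitting their per-pod limit rather than exhausting the node. The OSD container memory limit and the corresponding osd_memory_target can be inspected along these lines; a sketch, assuming the default Rook OSD deployment/container naming and that the rook-ceph-tools toolbox deployment is enabled:

# show the resource requests/limits of the osd container in the OSD 0 deployment
$ oc -n openshift-storage get deploy rook-ceph-osd-0 \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="osd")].resources}'

# query the configured osd_memory_target from inside the toolbox pod
$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph config get osd osd_memory_target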

Comment 4 Scott Ostapovicz 2022-03-30 17:47:21 UTC
Given the lack of detail on this ticket, I am going to agree that this seems to be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2021931 and close this one.  Please reopen it if you have additional information that conflicts with this assumption.

*** This bug has been marked as a duplicate of bug 2021931 ***