Bug 2060963

Summary: [GSS] osd processes consuming high cpu on ocs nodes
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: khover
Component: ceph
Assignee: Vikhyat Umrao <vumrao>
Status: CLOSED NOTABUG
QA Contact: Mahesh Shetty <mashetty>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.8
CC: assingh, bhubbard, bkunal, bniver, fsilva, hnallurv, jdelaros, jolee, kelwhite, linuxkidd, madam, mashetty, mhackett, mmuench, muagarwa, nojha, ocs-bugs, odf-bz-bot, pdhange, soakley, sostapov, vumrao
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-03-14 23:07:18 UTC
Type: Bug

Description khover 2022-03-04 17:36:07 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

OSD processes are consuming high CPU on the OCS nodes, which is believed to be causing the OCS nodes to reboot.

This is contributing to an unstable storage platform for the customer.
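
A minimal diagnostic sketch for pinning down which OSD processes are burning CPU; the openshift-storage namespace, the <node-name> placeholder, and use of the rook-ceph toolbox are assumptions, not details from this case:

# Per-pod CPU from the metrics API, sorted by usage
oc adm top pods -n openshift-storage --sort-by=cpu

# Per-process view on an affected node (batch mode, two 5-second samples)
oc debug node/<node-name> -- chroot /host top -b -d 5 -n 2 -o %CPU

# From the rook-ceph toolbox pod, per-OSD internal perf counters for comparison
ceph tell osd.* perf dump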



Version of all relevant components (if applicable):

OCS v4.8

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?

NO

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

4

Is this issue reproducible?



Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 4 khover 2022-03-04 18:26:38 UTC
Supportshell uploads 


/cases/03163193

drwxrwxrwx. 3 yank yank        80 Mar  2 19:44 0010-sosreport-ilscha03-ocp-cnsd-01-03159693-2022-03-02-dmccdfq.tar.xz
drwxrwxrwx. 3 yank yank        59 Mar  2 20:32 0020-cluster-must-gather.tar.gz
drwxrwxrwx. 3 yank yank        59 Mar  2 23:00 0030-ocs-must-gather.tar.gz
drwxrwxrwx. 3 yank yank        80 Mar  2 23:33 0040-sosreport-ilscha03-ocp-cnsd-03-03163193-2022-03-02-rglltqb.tar.xz
drwxrwxrwx. 3 yank yank        80 Mar  2 23:34 0050-sosreport-ilscha03-ocp-cnsd-02-03163193-2022-03-02-ssflbkb.tar.xz
-rw-rw-rw-. 1 yank yank     18365 Mar  2 23:35 0060-ceph-ocs-node-down-alert.png
-rw-rw-rw-. 1 yank yank     52648 Mar  2 23:35 0070-ceph-ocs-node-down-alert-2.png
-rw-rw-rw-. 1 yank yank     57056 Mar  2 23:35 0080-ceph-ocs-node-down-alert-3.png
drwxrwxrwx. 3 yank yank        55 Mar  3 21:15 0090-inspect-openshit-storage.tar.gz
-rw-rw-rw-. 1 yank yank     92404 Mar  4 14:37 0100-ocs-node-03-TOP.png
-rw-rw-rw-. 1 yank yank    131318 Mar  4 14:57 0110-dmesg-ilscha03-ocp-cnsd-03.uscc.com.log
-rw-rw-rw-. 1 yank yank   3286490 Mar  4 14:58 0120-journalctl-ilscha03-ocp-cnsd-03.uscc.com.log
drwxrwxrwx. 2 yank usbmon      51 Mar  2 20:12 sosreport-20220302-175522
drwxrwxrwx. 2 yank usbmon      92 Mar  2 23:48 sosreport-20220302-232021

Comment 7 khover 2022-03-07 14:13:03 UTC
Additional notes from the OpenShift team's initial investigation on 3/3/22:

bare-metal nodes


10.32.161.8 --> ping reachable 


Local disks mounted on the nodes --> SSDs in RAID 0, 4 TB.
No shared volumes.


Firmware is not up to date, but that is not necessarily the issue - it is equivalent on all the nodes.


The issue occurs 4-5 times a day; the processes start in the morning.


Full system CPU is maxed; we also lose networking access, and the console hangs.


Hardware is the same on all OCS nodes - the processors are the same; memory differs slightly.


Utilization goes all the way to 100% according to the iLO ProLiant reporting handler.
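
To cross-check the iLO readings, the CPU history can be pulled from the sosreport's sysstat data (a sketch only; the exact paths and sa file names inside the sosreport archive may differ):

# CPU utilization history from the sysstat sar data (pick the sa file for the incident date)
sar -u -f var/log/sa/sa02

# Kernel messages from the uploaded dmesg log: soft lockups, hung tasks, OOM kills
grep -iE 'soft lockup|hung_task|out of memory' 0110-dmesg-ilscha03-ocp-cnsd-03.uscc.com.log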


Node 03 went down around 10:30 AM Eastern (3:30 PM GMT) (also at 9:15 this morning).
Node 01 went down around the same time.


No OCS alerts - only NODE IS DOWN alerting. (Typically SSH is lost and kubelet fails.) Is networking lost also?
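
Since the only alerting is NODE IS DOWN, a sketch for capturing node, kubelet, and kernel state around an outage window once the node is reachable again (the node name is taken from the uploaded logs and the time window is illustrative):

# Node conditions and recent node-level events
oc get nodes -o wide
oc describe node ilscha03-ocp-cnsd-03.uscc.com | grep -A10 'Conditions:'
oc get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp

# Kubelet and kernel logs around the outage window (run on the node)
journalctl -u kubelet --since "09:00" --until "11:00"
journalctl -k --since "09:00" --until "11:00" | grep -iE 'lockup|oom|hung'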

Comment 8 Scott Ostapovicz 2022-03-08 15:34:37 UTC
Waiting on load info from @khover