Description of problem (please be as detailed as possible and provide log snippets):

osd processes are consuming high CPU on the OCS nodes, and this is believed to be causing the OCS nodes to reboot. This is contributing to an unstable storage platform for the customer.

Version of all relevant components (if applicable):

OCS v4.8

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

4

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
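A minimal sketch of the commands used to confirm which ceph-osd processes are hot, assuming the standard openshift-storage namespace and the rook-ceph-tools pod (label app=rook-ceph-tools); the node name is taken from the attached sosreports and is illustrative:

#!/bin/bash
# Sketch only: confirm which ceph-osd daemons are consuming CPU.

# Per-pod CPU/memory as seen by the metrics API
oc adm top pods -n openshift-storage | grep osd

# Shell into the rook-ceph-tools pod to check cluster health and OSD load
TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
oc -n openshift-storage rsh "$TOOLS_POD" ceph status
oc -n openshift-storage rsh "$TOOLS_POD" ceph osd df tree

# On a suspect node, capture what top sees for the OSD daemons,
# sorted by CPU usage
oc debug node/ilscha03-ocp-cnsd-03.uscc.com -- chroot /host \
    top -b -n 1 -o %CPU | head -n 40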
Supportshell uploads /cases/03163193:

drwxrwxrwx. 3 yank yank        80 Mar 2 19:44 0010-sosreport-ilscha03-ocp-cnsd-01-03159693-2022-03-02-dmccdfq.tar.xz
drwxrwxrwx. 3 yank yank        59 Mar 2 20:32 0020-cluster-must-gather.tar.gz
drwxrwxrwx. 3 yank yank        59 Mar 2 23:00 0030-ocs-must-gather.tar.gz
drwxrwxrwx. 3 yank yank        80 Mar 2 23:33 0040-sosreport-ilscha03-ocp-cnsd-03-03163193-2022-03-02-rglltqb.tar.xz
drwxrwxrwx. 3 yank yank        80 Mar 2 23:34 0050-sosreport-ilscha03-ocp-cnsd-02-03163193-2022-03-02-ssflbkb.tar.xz
-rw-rw-rw-. 1 yank yank     18365 Mar 2 23:35 0060-ceph-ocs-node-down-alert.png
-rw-rw-rw-. 1 yank yank     52648 Mar 2 23:35 0070-ceph-ocs-node-down-alert-2.png
-rw-rw-rw-. 1 yank yank     57056 Mar 2 23:35 0080-ceph-ocs-node-down-alert-3.png
drwxrwxrwx. 3 yank yank        55 Mar 3 21:15 0090-inspect-openshit-storage.tar.gz
-rw-rw-rw-. 1 yank yank     92404 Mar 4 14:37 0100-ocs-node-03-TOP.png
-rw-rw-rw-. 1 yank yank    131318 Mar 4 14:57 0110-dmesg-ilscha03-ocp-cnsd-03.uscc.com.log
-rw-rw-rw-. 1 yank yank   3286490 Mar 4 14:58 0120-journalctl-ilscha03-ocp-cnsd-03.uscc.com.log
drwxrwxrwx. 2 yank usbmon      51 Mar 2 20:12 sosreport-20220302-175522
drwxrwxrwx. 2 yank usbmon      92 Mar 2 23:48 sosreport-20220302-232021
Additional notes from the openshift team's initial investigation on 3/3/22:

- Bare-metal nodes; 10.32.161.8 --> reachable via ping.
- Local disks mounted on the nodes --> SSDs, RAID0, 4 TB; no shared volumes.
- Firmware is not up to date, but that is not necessarily the issue -- it is equivalent on all the nodes.
- Occurs 4-5 times a day; the processes start in the morning. Full system CPU is maxed, and we lose networking access at that point as well; the console hangs.
- Hardware is the same on all OCS nodes -- the processors are the same. Memory climbs all the way to 100% utilization according to the iLO ProLiant reporting handler.
- Node 03 went down around 10:30 AM Eastern (15:30 GMT), and also at 9:15 this morning. Node 01 went down around the same time.
- No OCS alerts -- only NODE IS DOWN alerting. (Typically the node loses SSH and kubelet fails.) Is networking lost as well?
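Since the nodes lose SSH and networking before anything can be gathered, one option is to leave a lightweight collector running on each OCS node so there are CPU/memory samples from just before the node goes unreachable. The following is only a sketch; the node name is from the sosreports, and the output path is an assumption:

#!/bin/bash
# Sketch: periodically append load samples to a file on the node's host
# filesystem so they survive the crash/reboot for later review.
NODE=ilscha03-ocp-cnsd-03.uscc.com

oc debug node/"$NODE" -- chroot /host /bin/sh -c '
  mkdir -p /var/tmp/loadwatch   # assumed scratch location on the host
  while true; do
    {
      date
      uptime
      top -b -n 1 -o %CPU | head -n 25
    } >> /var/tmp/loadwatch/$(hostname).log
    sleep 30
  done'

Note that an oc debug session is torn down when the connection closes, so for anything long-running a systemd unit or a privileged DaemonSet would be more robust; the sketch above only shows the shape of the data we want ahead of the next occurrence.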
Waiting on load info from @khover