Description of problem (please be as detailed as possible and provide log snippets):

osd processes are consuming high CPU on the OCS nodes, and this is believed to be causing the OCS nodes to reboot. This is contributing to an unstable storage platform for the customer.

Version of all relevant components (if applicable):

OCS v4.8

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

4

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
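A minimal sketch of the commands used to confirm which ceph-osd processes are hot, assuming the standard openshift-storage namespace and the rook-ceph-tools pod (label app=rook-ceph-tools); the node name is taken from the attached sosreports and is illustrative:

#!/bin/bash
# Sketch only: confirm which ceph-osd daemons are consuming CPU.

# Per-pod CPU/memory as seen by the metrics API
oc adm top pods -n openshift-storage | grep osd

# Shell into the rook-ceph-tools pod to check cluster health and OSD load
TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
oc -n openshift-storage rsh "$TOOLS_POD" ceph status
oc -n openshift-storage rsh "$TOOLS_POD" ceph osd df tree

# On a suspect node, capture what top sees for the OSD daemons,
# sorted by CPU usage
oc debug node/ilscha03-ocp-cnsd-03.uscc.com -- chroot /host \
    top -b -n 1 -o %CPU | head -n 40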
Supportshell uploads /cases/03163193:

drwxrwxrwx. 3 yank yank        80 Mar 2 19:44 0010-sosreport-ilscha03-ocp-cnsd-01-03159693-2022-03-02-dmccdfq.tar.xz
drwxrwxrwx. 3 yank yank        59 Mar 2 20:32 0020-cluster-must-gather.tar.gz
drwxrwxrwx. 3 yank yank        59 Mar 2 23:00 0030-ocs-must-gather.tar.gz
drwxrwxrwx. 3 yank yank        80 Mar 2 23:33 0040-sosreport-ilscha03-ocp-cnsd-03-03163193-2022-03-02-rglltqb.tar.xz
drwxrwxrwx. 3 yank yank        80 Mar 2 23:34 0050-sosreport-ilscha03-ocp-cnsd-02-03163193-2022-03-02-ssflbkb.tar.xz
-rw-rw-rw-. 1 yank yank     18365 Mar 2 23:35 0060-ceph-ocs-node-down-alert.png
-rw-rw-rw-. 1 yank yank     52648 Mar 2 23:35 0070-ceph-ocs-node-down-alert-2.png
-rw-rw-rw-. 1 yank yank     57056 Mar 2 23:35 0080-ceph-ocs-node-down-alert-3.png
drwxrwxrwx. 3 yank yank        55 Mar 3 21:15 0090-inspect-openshit-storage.tar.gz
-rw-rw-rw-. 1 yank yank     92404 Mar 4 14:37 0100-ocs-node-03-TOP.png
-rw-rw-rw-. 1 yank yank    131318 Mar 4 14:57 0110-dmesg-ilscha03-ocp-cnsd-03.uscc.com.log
-rw-rw-rw-. 1 yank yank   3286490 Mar 4 14:58 0120-journalctl-ilscha03-ocp-cnsd-03.uscc.com.log
drwxrwxrwx. 2 yank usbmon      51 Mar 2 20:12 sosreport-20220302-175522
drwxrwxrwx. 2 yank usbmon      92 Mar 2 23:48 sosreport-20220302-232021
Additional notes from the openshift team's initial investigation on 3/3/22:

- Bare-metal nodes; 10.32.161.8 --> reachable via ping.
- Local disks mounted on the nodes --> SSDs, RAID0, 4 TB; no shared volumes.
- Firmware is not up to date, but that is not necessarily the issue -- it is equivalent on all the nodes.
- Occurs 4-5 times a day; the processes start in the morning. Full system CPU is maxed, and we lose networking access at that point as well; the console hangs.
- Hardware is the same on all OCS nodes -- the processors are the same. Memory climbs all the way to 100% utilization according to the iLO ProLiant reporting handler.
- Node 03 went down around 10:30 AM Eastern (15:30 GMT), and also at 9:15 this morning. Node 01 went down around the same time.
- No OCS alerts -- only NODE IS DOWN alerting. (Typically the node loses SSH and kubelet fails.) Is networking lost as well?
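Since the nodes lose SSH and networking before anything can be gathered, one option is to leave a lightweight collector running on each OCS node so there are CPU/memory samples from just before the node goes unreachable. The following is only a sketch; the node name is from the sosreports, and the output path is an assumption:

#!/bin/bash
# Sketch: periodically append load samples to a file on the node's host
# filesystem so they survive the crash/reboot for later review.
NODE=ilscha03-ocp-cnsd-03.uscc.com

oc debug node/"$NODE" -- chroot /host /bin/sh -c '
  mkdir -p /var/tmp/loadwatch   # assumed scratch location on the host
  while true; do
    {
      date
      uptime
      top -b -n 1 -o %CPU | head -n 25
    } >> /var/tmp/loadwatch/$(hostname).log
    sleep 30
  done'

Note that an oc debug session is torn down when the connection closes, so for anything long-running a systemd unit or a privileged DaemonSet would be more robust; the sketch above only shows the shape of the data we want ahead of the next occurrence.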
Waiting on load info from @khover