Bug 2060963
| Summary: | [GSS] osd processes consuming high cpu on ocs nodes | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | khover |
| Component: | ceph | Assignee: | Vikhyat Umrao <vumrao> |
| Status: | CLOSED NOTABUG | QA Contact: | Mahesh Shetty <mashetty> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.8 | CC: | assingh, bhubbard, bkunal, bniver, fsilva, hnallurv, jdelaros, jolee, kelwhite, linuxkidd, madam, mashetty, mhackett, mmuench, muagarwa, nojha, ocs-bugs, odf-bz-bot, pdhange, soakley, sostapov, vumrao |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-03-14 23:07:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
khover
2022-03-04 17:36:07 UTC
Supportshell uploads
/cases/03163193
drwxrwxrwx. 3 yank yank 80 Mar 2 19:44 0010-sosreport-ilscha03-ocp-cnsd-01-03159693-2022-03-02-dmccdfq.tar.xz
drwxrwxrwx. 3 yank yank 59 Mar 2 20:32 0020-cluster-must-gather.tar.gz
drwxrwxrwx. 3 yank yank 59 Mar 2 23:00 0030-ocs-must-gather.tar.gz
drwxrwxrwx. 3 yank yank 80 Mar 2 23:33 0040-sosreport-ilscha03-ocp-cnsd-03-03163193-2022-03-02-rglltqb.tar.xz
drwxrwxrwx. 3 yank yank 80 Mar 2 23:34 0050-sosreport-ilscha03-ocp-cnsd-02-03163193-2022-03-02-ssflbkb.tar.xz
-rw-rw-rw-. 1 yank yank 18365 Mar 2 23:35 0060-ceph-ocs-node-down-alert.png
-rw-rw-rw-. 1 yank yank 52648 Mar 2 23:35 0070-ceph-ocs-node-down-alert-2.png
-rw-rw-rw-. 1 yank yank 57056 Mar 2 23:35 0080-ceph-ocs-node-down-alert-3.png
drwxrwxrwx. 3 yank yank 55 Mar 3 21:15 0090-inspect-openshit-storage.tar.gz
-rw-rw-rw-. 1 yank yank 92404 Mar 4 14:37 0100-ocs-node-03-TOP.png
-rw-rw-rw-. 1 yank yank 131318 Mar 4 14:57 0110-dmesg-ilscha03-ocp-cnsd-03.uscc.com.log
-rw-rw-rw-. 1 yank yank 3286490 Mar 4 14:58 0120-journalctl-ilscha03-ocp-cnsd-03.uscc.com.log
drwxrwxrwx. 2 yank usbmon 51 Mar 2 20:12 sosreport-20220302-175522
drwxrwxrwx. 2 yank usbmon 92 Mar 2 23:48 sosreport-20220302-232021

Additional notes from the OpenShift team's initial investigation on 3/3/22:

Bare-metal nodes.
10.32.161.8 --> ping reachable.
Local disks mounted on the nodes --> SSDs, RAID 0, 4 TB. No shared volumes.
Firmware is not up to date, but that is not necessarily the issue, since it is equivalent on all the nodes.
4-5 times a day. The processes start in the morning.
Full system CPU is maxed, and we lose networking access here also. The console hangs.
Hardware is the same on all OCS nodes; the processors are the same.
Memory is slightly all the way to 100% utilization according to the iLO ProLiant reporting handler.
Node 03 --> down around 10:30 AM Eastern (15:30 GMT) (9:15 this morning also).
Node 01 --> down around the same time.
No OCS alerts, only NODE IS DOWN alerting. (Typically the node loses SSH and the kubelet fails.)
Networking is lost also?
Waiting on load info from @khover