Description of problem:

My customer is hitting an issue on OCP 4.6.27 (after an upgrade from 4.6.23) where nodes become inaccessible because crio is consuming a large amount of resources. We are not even able to run a sosreport from the node, as that fails as well. None of this is reflected in the UI, which says everything is fine; the outputs of oc get co and oc get clusterversion also report everything as normal.

crio CPU usage sits between 100% and 250% on the node (see the top screenshot in the attachment). The containers do not reflect this usage when running sudo crictl stats:

[core@node~]$ sudo crictl stats
CONTAINER       CPU %   MEM       DISK      INODES
169ef6b0ee5f0   0.02    24.41MB   6B        1
17c5e49bd514b   0.01    12.2MB    49B       3
23ec03d558d03   0.32    29.27MB   44B       3
34823cfa00d91   0.00    18.22MB   61B       4
385593ae449b7   0.20    10.08MB   235B      14
38911cb2e3e3e   0.41    56.43MB   2.233kB   28
3f69da7def7ea   0.00    19.6MB    143B      8
4470bdd44b2c4   0.00    18.9MB    84B       5
606636bf31f18   0.01    15.75MB   49B       3
63d5585bdba2f   4.82    692.3MB   500B      23
9136f0a5c1be9   0.00    21.11MB   23B       2
921734f0451c8   0.00    17.76MB   61B       4
982726b96aa9a   0.05    78.14MB   62B       4
a0302de0e37c0   0.00    11.69MB   104B      6
b308fcc7613ba   0.00    19.66MB   44B       3
d4632b6d78a6e   0.00    24.36MB   517B      19
d5ad732e6bd56   0.00    18MB      368B      4
d7e54c676c16c   0.00    20.69MB   23B       2
d84b9b99f9996   0.00    14.82MB   24B       2
e7a266e47773a   0.02    24.56MB   88B       5
ed8b662fed698   0.00    22.12MB   517B      19
f41e903ad81ba   0.00    74.14MB   299B      17

Version-Release number of selected component (if applicable):
OCP 4.6.27

How reproducible:
Not reproducible; it appears to be random currently.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
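For completeness, the process-level view that shows crio itself (rather than the containers) eating the CPU can be captured with standard tools. A rough example, assuming shell access to the node and the sysstat package for pidstat:

# snapshot of the crio process's CPU usage, independent of crictl stats
top -b -n 1 -p $(pidof crio)

# or a short sample of crio's CPU over time: 1-second intervals, 5 samples
pidstat -p $(pidof crio) 1 5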
There's nothing immediately apparent here: cri-o didn't bump versions between 4.6.23 and 4.6.27. Unfortunately, I'll need the cri-o logs from the affected node to do any investigation.
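For reference, a quick way to double-check the installed cri-o version on a node (assuming a debug shell on the node, e.g. oc debug node/<node> followed by chroot /host):

# version of the installed package
rpm -q cri-o

# version reported by the binary itself
crio version

# runtime version as seen through the CRI
crictl version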
Gah, the crio log is pretty sparse. I'll need the full node journal to investigate; I haven't seen behavior like this before.
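In case it helps, a couple of ways to capture the full journal, assuming either the API server or the node itself is still reachable:

# via the API server, without touching the node directly
oc adm node-logs <node-name> > node-journal.log

# or from a shell on the node itself
journalctl --no-pager -b > /tmp/node-journal.log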
In addition, I think it'd be useful to get the cri-o goroutine stacks to know what cri-o is actively doing: https://github.com/cri-o/cri-o/pull/5033 adds a file that describes how to do so.
A note for posterity: when nodes are in this condition, systemctl is often hosed as well, causing the "connection reset" problem. If anyone runs into this, cri-o goroutine stacks can be grabbed by running kill -USR1 $(pidof crio), which doesn't rely on systemd and is more likely to succeed.
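For anyone following along, a minimal sequence. The dump location is my assumption from the versions I've looked at; the doc added in the PR above is authoritative:

# send SIGUSR1 directly to the crio process (no systemd involved)
sudo kill -USR1 $(pidof crio)

# in my experience the stacks land under /tmp as
# crio-goroutine-stacks-<timestamp>.log; check the crio journal
# output if the file isn't there
ls -l /tmp/crio-goroutine-stacks-*.log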
Initial feedback indicates that the attached PR mitigates the issue on cri-o's end.
4.7 version of the fix
*** Bug 1952798 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759