Bug 1975831

Summary: CRI-O is using large amounts of node resources
Product: OpenShift Container Platform
Reporter: Andy Bartlett <andbartl>
Component: Node
Assignee: Peter Hunt <pehunt>
Sub Component: CRI-O
QA Contact: pmali
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: aos-bugs, mduasope, pehunt
Version: 4.6.z
Target Milestone: ---
Target Release: 4.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-10-18 17:36:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Andy Bartlett 2021-06-24 14:14:31 UTC
Description of problem:

My customer is having an issue with OCP 4.6.27 (after an upgrade from 4.6.23) where nodes become inaccessible due to the large amount of resources being consumed by crio. We are not even able to run a sosreport from the node, as that fails as well.


This is not reflected in the UI, which says everything is fine.

The output of oc get co and oc get clusterversion reports everything as normal.

CRI-O CPU usage is between 100% and 250% (see the top screenshot in the attachment).


The containers do not reflect this usage when running sudo crictl stats:


[core@node~]$ sudo crictl stats
CONTAINER           CPU %               MEM                 DISK                INODES
169ef6b0ee5f0       0.02                24.41MB             6B                  1
17c5e49bd514b       0.01                12.2MB              49B                 3
23ec03d558d03       0.32                29.27MB             44B                 3
34823cfa00d91       0.00                18.22MB             61B                 4
385593ae449b7       0.20                10.08MB             235B                14
38911cb2e3e3e       0.41                56.43MB             2.233kB             28
3f69da7def7ea       0.00                19.6MB              143B                8
4470bdd44b2c4       0.00                18.9MB              84B                 5
606636bf31f18       0.01                15.75MB             49B                 3
63d5585bdba2f       4.82                692.3MB             500B                23
9136f0a5c1be9       0.00                21.11MB             23B                 2
921734f0451c8       0.00                17.76MB             61B                 4
982726b96aa9a       0.05                78.14MB             62B                 4
a0302de0e37c0       0.00                11.69MB             104B                6
b308fcc7613ba       0.00                19.66MB             44B                 3
d4632b6d78a6e       0.00                24.36MB             517B                19
d5ad732e6bd56       0.00                18MB                368B                4
d7e54c676c16c       0.00                20.69MB             23B                 2
d84b9b99f9996       0.00                14.82MB             24B                 2
e7a266e47773a       0.02                24.56MB             88B                 5
ed8b662fed698       0.00                22.12MB             517B                19
f41e903ad81ba       0.00                74.14MB             299B                17

CRI-O CPU usage is between 100% and 250% on the node.
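
(For reference, one way to confirm the CRI-O daemon's own CPU usage directly on the node, since crictl stats only reports per-container usage and not the crio process itself. This is only a sketch and assumes shell access to the node; the node name in the prompt is a placeholder.)

[core@node ~]$ ps -o pid,pcpu,pmem,etime,args -p "$(pidof crio)"   # one-shot snapshot of the crio daemon
[core@node ~]$ top -b -n 1 -p "$(pidof crio)"                      # same information via a single batch run of top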

Version-Release number of selected component (if applicable):

OCP 4.6.27


How reproducible:

Not reproducible; it currently appears to occur at random.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Peter Hunt 2021-06-24 14:27:57 UTC
There's nothing immediately apparent here; cri-o didn't bump versions between 4.6.23 and 4.6.27.

Unfortunately, I'll need the cri-o logs from the affected node to do any investigation.

Comment 3 Peter Hunt 2021-06-24 17:48:42 UTC
Gah, the crio log is pretty sparse. I'll need the full node journal to investigate. I haven't seen behavior like this before.
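
(A sketch of one way to gather that, assuming the node is still reachable through the API or over SSH; <node-name> is a placeholder.)

$ oc adm node-logs <node-name> -u crio > crio-journal.log     # CRI-O unit logs only
$ oc adm node-logs <node-name> > node-journal.log             # full node journal
[core@node ~]$ sudo journalctl --no-pager > node-journal.log  # fallback over SSH if oc cannot reach the node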

Comment 5 Peter Hunt 2021-06-25 14:40:25 UTC
In addition, I think it'd be useful to get the cri-o goroutine stacks to see what cri-o is actively doing:
https://github.com/cri-o/cri-o/pull/5033 adds a file that describes how to do so

Comment 13 Peter Hunt 2021-06-30 13:48:46 UTC
A note for posterity:
When nodes are in this condition, systemctl is often hosed as well, causing the "connection reset" problem. If anyone runs into this, cri-o goroutine stacks can be grabbed by running

kill -USR1 $(pidof crio) 

which doesn't rely on systemd, and is more likely to succeed
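
(A sketch of that workflow when SSH and systemctl are unresponsive, assuming oc debug can still schedule a pod on the node; <node-name> is a placeholder, and the dump location below is an assumption; the doc added by the PR linked in comment 5 describes where the stacks actually end up.)

$ oc debug node/<node-name>
sh-4.4# chroot /host
sh-4.4# kill -USR1 "$(pidof crio)"
sh-4.4# ls -lt /tmp | head    # assumption: the goroutine stack dump is written as a file under /tmp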

Comment 14 Peter Hunt 2021-07-02 18:51:21 UTC
Initial feedback indicates that the attached PR mitigates the issue on cri-o's end.

Comment 19 Peter Hunt 2021-07-09 13:47:05 UTC
4.7 version of the fix

Comment 23 Peter Hunt 2021-09-09 20:13:46 UTC
*** Bug 1952798 has been marked as a duplicate of this bug. ***

Comment 25 errata-xmlrpc 2021-10-18 17:36:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759