Bug 1975831 - Crio is using large amounts of node resources
Summary: Crio is using large amounts of node resources
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Peter Hunt
QA Contact: pmali
URL:
Whiteboard:
Duplicates: 1952798
Depends On:
Blocks:
 
Reported: 2021-06-24 14:14 UTC by Andy Bartlett
Modified: 2024-12-20 20:19 UTC
CC List: 3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:36:51 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github cri-o cri-o pull 5053 0 None closed [1.19] cgmgr: reuse dbus connection 2021-07-15 14:58:07 UTC
Github cri-o cri-o pull 5058 0 None closed [1.20] cgmgr: reuse dbus connection 2021-07-15 14:58:08 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:37:08 UTC

Description Andy Bartlett 2021-06-24 14:14:31 UTC
Description of problem:

My customer is hitting an issue on OCP 4.6.27 (after an upgrade from 4.6.23) where nodes become inaccessible because crio is consuming a large amount of node resources. We are not even able to run a sosreport from the node, as that fails as well.


This is not reflected in the UI, which says everything is fine.

The output of oc get co and oc get clusterversion reports everything as normal.

crio CPU usage is between 100% and 250% (see the top screenshot in the attachment).


The containers do not reflect this usage when running sudo crictl stats:


[core@node~]$ sudo crictl stats
CONTAINER           CPU %               MEM                 DISK                INODES
169ef6b0ee5f0       0.02                24.41MB             6B                  1
17c5e49bd514b       0.01                12.2MB              49B                 3
23ec03d558d03       0.32                29.27MB             44B                 3
34823cfa00d91       0.00                18.22MB             61B                 4
385593ae449b7       0.20                10.08MB             235B                14
38911cb2e3e3e       0.41                56.43MB             2.233kB             28
3f69da7def7ea       0.00                19.6MB              143B                8
4470bdd44b2c4       0.00                18.9MB              84B                 5
606636bf31f18       0.01                15.75MB             49B                 3
63d5585bdba2f       4.82                692.3MB             500B                23
9136f0a5c1be9       0.00                21.11MB             23B                 2
921734f0451c8       0.00                17.76MB             61B                 4
982726b96aa9a       0.05                78.14MB             62B                 4
a0302de0e37c0       0.00                11.69MB             104B                6
b308fcc7613ba       0.00                19.66MB             44B                 3
d4632b6d78a6e       0.00                24.36MB             517B                19
d5ad732e6bd56       0.00                18MB                368B                4
d7e54c676c16c       0.00                20.69MB             23B                 2
d84b9b99f9996       0.00                14.82MB             24B                 2
e7a266e47773a       0.02                24.56MB             88B                 5
ed8b662fed698       0.00                22.12MB             517B                19
f41e903ad81ba       0.00                74.14MB             299B                17

crio CPU usage on the node remains between 100% and 250%.

Version-Release number of selected component (if applicable):

OCP 4.6.27


How reproducible:

Not reliably reproducible; it currently appears to occur at random.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Peter Hunt 2021-06-24 14:27:57 UTC
There's nothing immediately apparent here; cri-o didn't bump versions between 4.6.23 and 4.6.27.

Unfortunately, I'll need the cri-o logs from the affected node to do any investigation.

Comment 3 Peter Hunt 2021-06-24 17:48:42 UTC
Gah, the crio log is pretty sparse. I'll need the full node journal to do any investigation; I haven't seen behavior like this before.

Comment 5 Peter Hunt 2021-06-25 14:40:25 UTC
In addition, I think it'd be useful to get the cri-o goroutine stacks to know where cri-o is actively spending its time:
https://github.com/cri-o/cri-o/pull/5033 adds a file that describes how to do so.

Comment 13 Peter Hunt 2021-06-30 13:48:46 UTC
A note for posterity:
When nodes are in this condition, systemctl is often hosed as well, causing the "connection reset" problem. If anyone runs into this, the cri-o goroutine stacks can be grabbed by running

kill -USR1 $(pidof crio) 

which doesn't rely on systemd and is therefore more likely to succeed.
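
For illustration only (this is not cri-o's actual implementation, and the output file name and location below are hypothetical), a minimal Go sketch of the signal-triggered dump mechanism that a kill -USR1 workflow like this relies on:

// Minimal illustration of a SIGUSR1-triggered goroutine-stack dump.
// NOT cri-o's code; the file path chosen below is hypothetical.
package main

import (
	"fmt"
	"os"
	"os/signal"
	"path/filepath"
	"runtime"
	"syscall"
	"time"
)

func dumpGoroutineStacks() {
	// Grow the buffer until the full dump of all goroutines fits.
	buf := make([]byte, 1<<20)
	for {
		n := runtime.Stack(buf, true) // true = include every goroutine
		if n < len(buf) {
			buf = buf[:n]
			break
		}
		buf = make([]byte, 2*len(buf))
	}
	// Hypothetical destination; real tools pick their own location.
	path := filepath.Join(os.TempDir(),
		fmt.Sprintf("goroutine-stacks-%d.log", time.Now().Unix()))
	if err := os.WriteFile(path, buf, 0o600); err != nil {
		fmt.Fprintln(os.Stderr, "dump failed:", err)
		return
	}
	fmt.Fprintln(os.Stderr, "goroutine stacks written to", path)
}

func main() {
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGUSR1)
	go func() {
		for range sig {
			dumpGoroutineStacks()
		}
	}()
	select {} // block forever; send SIGUSR1 to this process to trigger a dump
}

Sending SIGUSR1 to such a process triggers a dump of every goroutine's stack to a file without going through systemd, which is why this path is more likely to work when systemctl is hosed.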

Comment 14 Peter Hunt 2021-07-02 18:51:21 UTC
Initial feedback indicates that the attached PR mitigates the issue on cri-o's end.
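
For context on the technique named in the linked PRs ("cgmgr: reuse dbus connection"): the idea is to keep a single systemd dbus connection around and reuse it for cgroup operations instead of dialing a new connection for each one. A minimal Go sketch of that general approach (not the actual cri-o cgmgr code; the type and method names here are hypothetical):

// Illustrative sketch of reusing one systemd dbus connection rather than
// dialing a new one per cgroup operation. Not cri-o's actual cgmgr code.
package main

import (
	"context"
	"sync"

	systemddbus "github.com/coreos/go-systemd/v22/dbus"
)

type dbusManager struct {
	mu   sync.Mutex
	conn *systemddbus.Conn
}

// connection returns the cached dbus connection, dialing it only once.
func (m *dbusManager) connection(ctx context.Context) (*systemddbus.Conn, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.conn != nil {
		return m.conn, nil
	}
	conn, err := systemddbus.NewWithContext(ctx)
	if err != nil {
		return nil, err
	}
	m.conn = conn
	return m.conn, nil
}

// reset drops a broken connection so the next call re-dials.
func (m *dbusManager) reset() {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.conn != nil {
		m.conn.Close()
		m.conn = nil
	}
}

func main() {
	ctx := context.Background()
	mgr := &dbusManager{}

	conn, err := mgr.connection(ctx)
	if err != nil {
		panic(err)
	}
	// An example systemd call that previously would have paid the cost of a
	// fresh connection each time; an error here is a reason to call reset().
	if _, err := conn.ListUnitsContext(ctx); err != nil {
		mgr.reset()
	}
}

Reusing the connection avoids re-establishing a dbus session on every call; reset() exists only to recover when the cached connection goes bad.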

Comment 19 Peter Hunt 2021-07-09 13:47:05 UTC
4.7 version of the fix

Comment 23 Peter Hunt 2021-09-09 20:13:46 UTC
*** Bug 1952798 has been marked as a duplicate of this bug. ***

Comment 25 errata-xmlrpc 2021-10-18 17:36:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

