Description of problem:

My customer is hitting an issue on OCP 4.6.27 (after an upgrade from 4.6.23) where nodes become inaccessible because crio is consuming a large amount of resources. We are not even able to run a sosreport from the node, as that fails as well. None of this is reflected in the UI, which says everything is fine; the outputs of oc get co and oc get clusterversion also report everything as normal.

crio CPU usage sits between 100% and 250% on the node (see the top screenshot in the attachment). The containers do not reflect this usage when running sudo crictl stats:

[core@node~]$ sudo crictl stats
CONTAINER       CPU %   MEM       DISK      INODES
169ef6b0ee5f0   0.02    24.41MB   6B        1
17c5e49bd514b   0.01    12.2MB    49B       3
23ec03d558d03   0.32    29.27MB   44B       3
34823cfa00d91   0.00    18.22MB   61B       4
385593ae449b7   0.20    10.08MB   235B      14
38911cb2e3e3e   0.41    56.43MB   2.233kB   28
3f69da7def7ea   0.00    19.6MB    143B      8
4470bdd44b2c4   0.00    18.9MB    84B       5
606636bf31f18   0.01    15.75MB   49B       3
63d5585bdba2f   4.82    692.3MB   500B      23
9136f0a5c1be9   0.00    21.11MB   23B       2
921734f0451c8   0.00    17.76MB   61B       4
982726b96aa9a   0.05    78.14MB   62B       4
a0302de0e37c0   0.00    11.69MB   104B      6
b308fcc7613ba   0.00    19.66MB   44B       3
d4632b6d78a6e   0.00    24.36MB   517B      19
d5ad732e6bd56   0.00    18MB      368B      4
d7e54c676c16c   0.00    20.69MB   23B       2
d84b9b99f9996   0.00    14.82MB   24B       2
e7a266e47773a   0.02    24.56MB   88B       5
ed8b662fed698   0.00    22.12MB   517B      19
f41e903ad81ba   0.00    74.14MB   299B      17

Version-Release number of selected component (if applicable):
OCP 4.6.27

How reproducible:
Not reproducible; it appears to be random currently.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
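For completeness, the process-level view that shows crio itself (rather than the containers) eating the CPU can be captured with standard tools. A rough example, assuming shell access to the node and the sysstat package for pidstat:

# snapshot of the crio process's CPU usage, independent of crictl stats
top -b -n 1 -p $(pidof crio)

# or a short sample of crio's CPU over time: 1-second intervals, 5 samples
pidstat -p $(pidof crio) 1 5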
There's nothing immediately apparent here: cri-o didn't bump versions between 4.6.23 and 4.6.27. Unfortunately, I'll need the cri-o logs from the affected node to do any investigation.
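For reference, a quick way to double-check the installed cri-o version on a node (assuming a debug shell on the node, e.g. oc debug node/<node> followed by chroot /host):

# version of the installed package
rpm -q cri-o

# version reported by the binary itself
crio version

# runtime version as seen through the CRI
crictl version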
Gah, the crio log is pretty sparse. I'll need the full node journal to investigate; I haven't seen behavior like this before.
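In case it helps, a couple of ways to capture the full journal, assuming either the API server or the node itself is still reachable:

# via the API server, without touching the node directly
oc adm node-logs <node-name> > node-journal.log

# or from a shell on the node itself
journalctl --no-pager -b > /tmp/node-journal.log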
In addition, I think it'd be useful to get the cri-o goroutine stacks to know what cri-o is actively doing: https://github.com/cri-o/cri-o/pull/5033 adds a file that describes how to do so.
A note for posterity: when nodes are in this condition, systemctl is often hosed as well, causing the "connection reset" problem. If anyone runs into this, cri-o goroutine stacks can be grabbed by running kill -USR1 $(pidof crio), which doesn't rely on systemd and is more likely to succeed.
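For anyone following along, a minimal sequence. The dump location is my assumption from the versions I've looked at; the doc added in the PR above is authoritative:

# send SIGUSR1 directly to the crio process (no systemd involved)
sudo kill -USR1 $(pidof crio)

# in my experience the stacks land under /tmp as
# crio-goroutine-stacks-<timestamp>.log; check the crio journal
# output if the file isn't there
ls -l /tmp/crio-goroutine-stacks-*.log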
Initial feedback indicates that the attached PR mitigates the issue on cri-o's end.
4.7 version of the fix
*** Bug 1952798 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759