1866702 – Facing warning messages "failed to get imageFs info: non-existent label "crio-images" after upgrading to OCP 4.4.12


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Peter Hunt
QA Contact:	Sunil Choudhary
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1866045
Depends On:
Blocks:	1878264 1878265

Reported:	2020-08-06 07:44 UTC by Asheth
Modified:	2024-10-01 16:45 UTC
CC List:	13 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1878264
Environment:
Last Closed:	2020-10-27 16:25:22 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	cri-o cri-o pull 4131	0	None	closed	config: set internal RootConfig to default storage if not specified	2021-01-27 09:57:24 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:25:45 UTC

Description Asheth 2020-08-06 07:44:21 UTC

Description of problem:

The cluster was recently upgraded to OCP 4.4.12.

After the upgrade, the following message appears repeatedly in the logs:

Aug 04 01:51:54 dx01-worker-xyz-zrty[1461]: E0804 01:51:54.630430    1461 log_metrics.go:66] failed to get pod stats: failed to get imageFs info: non-existent label "crio-images"

The output of "oc describe node" also shows: failed to get imageFs info: non-existent label "crio-images".
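To gauge how often a node logs this, a saved kubelet log can be filtered for the message. This is only a minimal sketch: the helper name `count_imagefs_errors` is hypothetical, and the message text is copied from the log line above.

```python
# Message text copied verbatim from the log line in this report.
IMAGEFS_ERROR = 'failed to get imageFs info: non-existent label "crio-images"'


def count_imagefs_errors(lines):
    """Return how many of the given log lines contain the imageFs error."""
    return sum(1 for line in lines if IMAGEFS_ERROR in line)
```

It takes any iterable of strings, for example the lines of a log file that was saved from the affected node.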


Version-Release number of selected component (if applicable):

OCP 4.4

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
A Bugzilla[1] was opened for a similar issue, but it was filed against OCP 4.1 and was closed when OCP 4.1 reached End of Life (EOL).

[1]https://bugzilla.redhat.com/show_bug.cgi?id=1741608

Comment 6 Peter Hunt 2020-08-06 16:29:54 UTC

What does `systemctl status crio` show? We have seen this problem before when the kubelet comes up before crio does.

Comment 12 Peter Hunt 2020-08-20 13:43:46 UTC

*** Bug 1866045 has been marked as a duplicate of this bug. ***

Comment 13 Peter Hunt 2020-08-20 20:32:27 UTC

I have yet to have a chance to look at this. I am going to work with my team in the coming sprint to get to the bottom of it.

Comment 18 Peter Hunt 2020-08-28 18:39:41 UTC

I got it!

When the MCO applies a ContainerRuntimeConfig, it takes the ignition template and populates it with some defaults and the overridden values (this is the behavior in 4.3 and 4.4; it changed slightly in 4.5).

CRI-O's containers/storage options (root, runroot, storage_driver, storage_option) are all commented out by default. This is because we usually want to inherit these options from `/etc/containers/storage.conf`.

However, due to limitations in ignition, the newly created crio config prints all options, even ones that are empty (it doesn't know if it's supposed to be empty or not). This causes those values to all be empty.
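A rough illustration of the difference between the shipped config and the re-rendered one. This is a sketch only, not the MCO or ignition code: the function names and the flat `key = "value"` format are assumptions made for the example (in the real config, storage_option is a list). Only the four option names come from the comment above.

```python
# Option names from the comment above; everything else here is illustrative.
STORAGE_OPTIONS = ("root", "runroot", "storage_driver", "storage_option")


def render_shipped_default():
    """Shipped config: every storage option is commented out, so it is inherited."""
    return "\n".join(f'# {name} = ""' for name in STORAGE_OPTIONS)


def render_from_struct(values):
    """Naive renderer: prints every option, including ones that were never set."""
    return "\n".join(f'{name} = "{values.get(name, "")}"' for name in STORAGE_OPTIONS)


def parse(text):
    """Collect only the uncommented key = "value" lines."""
    parsed = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        parsed[key.strip()] = value.strip().strip('"')
    return parsed
```

Parsing the shipped default yields no storage keys at all, so nothing overrides the inherited settings. Parsing the output of `render_from_struct({})` yields all four keys set to the empty string, which is the situation described here.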

CRI-O then serves that information directly on its `/info` endpoint, which cadvisor uses to populate its information about where the crio images are.

Thus, if we apply a ctrcfg, crio is lying to cadvisor about where the images are, and cadvisor gets confused and spits out that error.

The solution is to properly inherit the defaults from containers/storage that come from the storage.json. The master version of that PR is attached. Once it's approved, I'll backport it all the way to 4.3.
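The idea behind the fix, sketched in Python. This is not the code from the attached PR (CRI-O is written in Go); the function name `apply_storage_defaults` and the dictionary layout are assumptions for illustration only.

```python
# Storage option names as listed earlier in this comment.
STORAGE_KEYS = ("root", "runroot", "storage_driver", "storage_option")


def apply_storage_defaults(crio_config, storage_defaults):
    """Return a copy of crio_config where empty or missing storage options
    are filled in from the containers/storage defaults."""
    merged = dict(crio_config)
    for key in STORAGE_KEYS:
        if not merged.get(key):  # covers both a missing key and an empty value
            merged[key] = storage_defaults[key]
    return merged
```

With this kind of fallback, an option that was rendered as empty no longer hides the inherited value, while an option that was explicitly set is left alone.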


I've verified that this fixes all cases up through 4.4. I am not sure why a customer is facing it in 4.5.5. I wasn't able to reproduce it there, but this may fix it as well.

Comment 20 Peter Hunt 2020-09-11 17:59:54 UTC

Technically, this is already fixed in 4.6/4.7, so I'm marking it as MODIFIED. I'll clone it back to 4.4, where the issue actually occurs.

Comment 22 MinLi 2020-09-16 10:48:37 UTC

Verified in version 4.6.0-0.nightly-2020-09-12-230035.

Created a ContainerRuntimeConfig[1] that changes the pod PID limit, and found no error messages in the event log, kubelet log, or crio log.

[1]ContainerRuntimeConfig.yaml:
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: set-pids-limit
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-crio: high-pid-limit
  containerRuntimeConfig:
    pidsLimit: 4096

Comment 27 errata-xmlrpc 2020-10-27 16:25:22 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
