Bug 1866702

Summary: Facing warning messages "failed to get imageFs info: non-existent label "crio-images" after upgrading to OCP 4.4.12
Product: OpenShift Container Platform Reporter: Asheth <asheth>
Component: Node Assignee: Peter Hunt <pehunt>
Node sub component: CRI-O QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: alchan, aos-bugs, dshumake, harpatil, jhou, jokerman, mharri, oarribas, obulatov, pehunt, rdomnu, weihuang, wzheng
Version: 4.4   
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1878264 (view as bug list) Environment:
Last Closed: 2020-10-27 16:25:22 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1878264, 1878265    

Description Asheth 2020-08-06 07:44:21 UTC
Description of problem:

Recently the cluster was upgraded to OCP 4.4.12. 

After the upgrade, the following message appears repeatedly in the logs:

Aug 04 01:51:54 dx01-worker-xyz-zrty[1461]: E0804 01:51:54.630430    1461 log_metrics.go:66] failed to get pod stats: failed to get imageFs info: non-existent label "crio-images"

The output of "oc describe node" also shows the same message: failed to get imageFs info: non-existent label "crio-images".


Version-Release number of selected component (if applicable):

OCP 4.4

Additional info:
A Bugzilla [1] was previously opened for a similar issue, but it was filed against OCP 4.1. That Bugzilla was closed because OCP 4.1 reached End of Life (EOL).

[1]https://bugzilla.redhat.com/show_bug.cgi?id=1741608

Comment 6 Peter Hunt 2020-08-06 16:29:54 UTC
What does `systemctl status crio` show? We have seen this problem before when the kubelet comes up before crio does.
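
One way to check the start order is to compare the activation times of the two units. The following is a minimal sketch, assuming the units are named `crio` and `kubelet` and that `systemctl show` reports `ActiveEnterTimestampMonotonic` for them; the helper names are hypothetical. It only reflects the most recent activation of each unit, so a later restart of either service changes the answer.

```python
import subprocess


def parse_monotonic(show_output):
    """Extract the integer from an 'ActiveEnterTimestampMonotonic=<usec>' line."""
    for line in show_output.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "ActiveEnterTimestampMonotonic":
            return int(value)
    raise ValueError("ActiveEnterTimestampMonotonic not found")


def active_since(unit):
    """Ask systemd when `unit` last became active (microseconds, monotonic clock)."""
    out = subprocess.run(
        ["systemctl", "show", unit, "--property=ActiveEnterTimestampMonotonic"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_monotonic(out)


def kubelet_started_first(kubelet_usec, crio_usec):
    """True when the kubelet's last activation precedes crio's."""
    return kubelet_usec < crio_usec
```

On a node, `kubelet_started_first(active_since("kubelet"), active_since("crio"))` would then indicate whether the kubelet was activated before crio.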

Comment 12 Peter Hunt 2020-08-20 13:43:46 UTC
*** Bug 1866045 has been marked as a duplicate of this bug. ***

Comment 13 Peter Hunt 2020-08-20 20:32:27 UTC
I have not yet had a chance to look at this. I am going to work with my team in the coming sprint to get to the bottom of it.

Comment 18 Peter Hunt 2020-08-28 18:39:41 UTC
I got it!

When the MCO applies a ContainerRuntimeConfig, it takes the ignition template and populates it with some defaults and the overridden values (this is the behavior in 4.3 and 4.4; it changed slightly in 4.5).

CRI-O's containers/storage options (root, runroot, storage_driver, storage_option) are all commented out by default, because we usually want to inherit them from `/etc/containers/storage.conf`.

However, due to limitations in ignition, the newly created crio config prints all options, even ones that are empty (it cannot tell whether an option is supposed to be empty or not). As a result, those values are all explicitly set to empty instead of being inherited.
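
To illustrate the difference, here is a minimal sketch of the two rendering behaviors. It is not the MCO's actual template code: the function names are hypothetical, every option is treated as a plain string, and `storage_option` (a list) is left out for simplicity.

```python
# Hypothetical illustration only; not the MCO's real template logic.
STORAGE_KEYS = ("root", "runroot", "storage_driver")


def render_all(options):
    """Render every key, even empty ones (the problematic behavior)."""
    lines = []
    for key in STORAGE_KEYS:
        value = options.get(key, "")
        lines.append(f'{key} = "{value}"')
    return "\n".join(lines)


def render_set_only(options):
    """Leave unset keys commented out so the reader can fall back to defaults."""
    lines = []
    for key in STORAGE_KEYS:
        value = options.get(key, "")
        if value:
            lines.append(f'{key} = "{value}"')
        else:
            lines.append(f'# {key} = ""')
    return "\n".join(lines)
```

With only `storage_driver` overridden, `render_all` emits `root = ""` and `runroot = ""` as real settings, while `render_set_only` keeps them as comments.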

CRI-O then serves that information directly on its `/info` endpoint, which cadvisor uses to populate its information about where the crio images are.

Thus, if we apply a ctrcfg, crio is lying to cadvisor about where the images are, and cadvisor gets confused and spits out that error.
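
A simplified model of that failure, for illustration: it assumes the consumer attaches the "crio-images" label to whichever device backs the reported storage root, and that an empty root matches no device. The class, its methods, and the mount table are hypothetical and are not cadvisor's or CRI-O's real code.

```python
# Simplified model of the failure; not cadvisor's or CRI-O's real code.
class LabelRegistry:
    def __init__(self, mounts):
        # mounts: mapping of mount point -> device name (hypothetical data)
        self.mounts = mounts
        self.labels = {}

    def register(self, label, path):
        """Attach `label` to the device backing `path`, if the path is usable."""
        if not path:
            return  # an empty path cannot be matched to any device
        best = None
        for mount in self.mounts:
            prefix = mount.rstrip("/") + "/"
            if path == mount or path.startswith(prefix):
                # prefer the longest (most specific) matching mount point
                if best is None or len(mount) > len(best):
                    best = mount
        if best is not None:
            self.labels[label] = self.mounts[best]

    def device_for(self, label):
        """Return the device for `label`, or fail if it was never registered."""
        if label not in self.labels:
            raise LookupError(f'non-existent label "{label}"')
        return self.labels[label]
```

Registering the label with a real path such as `/var/lib/containers/storage` makes the lookup succeed; registering it with an empty string leaves the label missing, so every later lookup fails with the same kind of message.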

The solution is to properly inherit the defaults from containers/storage that come from the storage.json. The master version of that PR is attached. Once it's approved, I'll backport it all the way to 4.3.
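
The idea of inheriting defaults can be sketched as a merge in which empty overrides never replace a default. This is only an illustration under that assumption, not the code from the PR; the function name and the example values are hypothetical.

```python
def inherit_defaults(overrides, storage_defaults):
    """Return overrides with empty or missing values filled from storage_defaults."""
    merged = dict(storage_defaults)
    for key, value in overrides.items():
        # treat "", None and [] as "not set" so the default survives
        if value not in ("", None, []):
            merged[key] = value
    return merged
```

For example, with defaults of `root = /var/lib/containers/storage` and `storage_driver = overlay`, an override of `{"root": "", "storage_driver": "vfs"}` keeps the default root and changes only the driver.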


I've verified that this fixes all cases up through 4.4. I am not sure why a customer is facing it in 4.5.5. I wasn't able to reproduce it there, but this may also fix it.

Comment 20 Peter Hunt 2020-09-11 17:59:54 UTC
Technically, this is already fixed in 4.6/4.7, so I'm marking it as modified. I'll clone it back to 4.4, where the issue actually occurs.

Comment 22 MinLi 2020-09-16 10:48:37 UTC
Verified in version: 4.6.0-0.nightly-2020-09-12-230035

Created a ContainerRuntimeConfig [1] that changes the pod PID limit and found no error messages in the event log, kubelet log, or crio log.

[1]ContainerRuntimeConfig.yaml:
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
 name: set-pids-limit
spec:
 machineConfigPoolSelector:
   matchLabels:
     custom-crio: high-pid-limit
 containerRuntimeConfig:
   pidsLimit: 4096

Comment 27 errata-xmlrpc 2020-10-27 16:25:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196