Description of problem:
Recently the cluster was upgraded to OCP 4.4.12.
After the upgrade, the following message appears repeatedly in the logs.
Aug 04 01:51:54 dx01-worker-xyz-zrty: E0804 01:51:54.630430 1461 log_metrics.go:66] failed to get pod stats: failed to get imageFs info: non-existent label "crio-images"
Also, the output of "oc describe node" shows the same message: failed to get imageFs info: non-existent label "crio-images".
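To gauge how often the message occurs on a node, the log lines can be scanned for it. This is a minimal sketch, assuming the log has already been saved as text lines; the function name `count_imagefs_errors` and the second sample line are made up for illustration.

```python
import re

# Matches the kubelet error quoted above and captures the missing label name.
IMAGEFS_ERROR = re.compile(
    r'failed to get imageFs info: non-existent label "([^"]+)"'
)


def count_imagefs_errors(lines):
    """Return a dict mapping each missing label to its number of occurrences."""
    counts = {}
    for line in lines:
        match = IMAGEFS_ERROR.search(line)
        if match:
            label = match.group(1)
            counts[label] = counts.get(label, 0) + 1
    return counts


sample = [
    # Line taken from this report.
    'Aug 04 01:51:54 dx01-worker-xyz-zrty: E0804 01:51:54.630430 1461 '
    'log_metrics.go:66] failed to get pod stats: failed to get imageFs info: '
    'non-existent label "crio-images"',
    # Placeholder for an unrelated line; not real log output.
    'placeholder: some unrelated log line',
]

print(count_imagefs_errors(sample))
```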
Version-Release number of selected component (if applicable):
Steps to Reproduce:
A Bugzilla was previously opened for a similar issue, but it was filed against OCP 4.1 and was closed when OCP 4.1 reached End of Life (EOL).
What does `systemctl status crio` show? We have seen this problem before when the kubelet comes up before CRI-O does.
*** Bug 1866045 has been marked as a duplicate of this bug. ***
I have not yet had a chance to look at this. I am going to work with my team in the coming sprint to get to the bottom of it.
I got it!
In 4.3 and 4.4, when the MCO applies a ContainerRuntimeConfig, it takes the Ignition template and populates it with some defaults and the overridden values (this behavior changed slightly in 4.5).
CRI-O's default containers/storage options (root, runroot, storage_driver, storage_option) are all commented out by default. This is because we usually want to inherit these options from `/etc/containers/storage.conf`.
However, due to limitations in Ignition, the newly created CRI-O config writes out every option, even ones that are empty (it doesn't know whether an option is supposed to be empty or not). As a result, those storage options are all explicitly set to empty values instead of being inherited.
CRI-O then serves that information directly on its `/info` endpoint, which cadvisor uses to populate its information about where the crio images are.
Thus, if we apply a ctrcfg, crio is lying to cadvisor about where the images are, and cadvisor gets confused and spits out that error.
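The failure mode described above can be modeled in a few lines. This is a simplified sketch, not CRI-O or MCO code: the function names and the storage values are assumptions chosen for illustration, and storage_option is left out for brevity.

```python
# Simplified model of the failure; this is not CRI-O or MCO code.
# Illustrative values standing in for /etc/containers/storage.conf.
STORAGE_CONF = {
    "root": "/var/lib/containers/storage",
    "runroot": "/run/containers/storage",
    "storage_driver": "overlay",
}

STORAGE_KEYS = ("root", "runroot", "storage_driver")


def render_crio_conf(overrides):
    """Model the template: every storage key is written, empty if unset."""
    return {key: overrides.get(key, "") for key in STORAGE_KEYS}


def reported_storage(crio_conf):
    """Model the buggy behavior: a key present in the config wins, even if empty."""
    reported = dict(STORAGE_CONF)
    reported.update(crio_conf)
    return reported


# A ContainerRuntimeConfig that only changes the PID limit sets no storage keys.
rendered = render_crio_conf({"pids_limit": "2048"})
info = reported_storage(rendered)
print(info)
```

In this model every storage value reported to a consumer is an empty string, which corresponds to cadvisor being told the wrong location for the images.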
The solution is to properly inherit the defaults from containers/storage that come from the storage.json. The master version of that PR is attached. Once it's approved, I'll backport it all the way to 4.3.
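The idea of inheriting defaults can be sketched as follows. This is not the attached PR (the real components are written in Go); the helper name `inherit_storage_defaults` and the sample values are assumptions for illustration only.

```python
def inherit_storage_defaults(crio_conf, storage_defaults):
    """Return effective options: empty or missing values fall back to defaults."""
    effective = dict(storage_defaults)
    for key, value in crio_conf.items():
        if value:  # only non-empty values override the inherited default
            effective[key] = value
    return effective


# Illustrative defaults; actual values depend on the node's storage configuration.
defaults = {
    "root": "/var/lib/containers/storage",
    "runroot": "/run/containers/storage",
    "storage_driver": "overlay",
}

# A rendered config in which every storage option was written out empty.
rendered = {"root": "", "runroot": "", "storage_driver": ""}

effective = inherit_storage_defaults(rendered, defaults)
print(effective)
```

With this rule an empty value is treated as "not set", so an explicit non-empty override still wins while everything else is inherited.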
I've verified that this fixes all cases up through 4.4. I am not sure why a customer is facing it in 4.5.5; I wasn't able to reproduce it there, but this may also fix it.
Technically, this is already fixed in 4.6/4.7, so I'm marking it as modified. I'll clone it back to 4.4, where the issue actually occurs.
Verified in version: 4.6.0-0.nightly-2020-09-12-230035
Created a ContainerRuntimeConfig changing the pod PID limit and found no error messages in the event log, kubelet log, or crio log.
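For anyone repeating this verification, a ContainerRuntimeConfig that changes the PID limit could look roughly like the manifest built below. This is a sketch: the object name, the pool label, and the limit value are hypothetical and were not taken from the verification run.

```python
import json

# Hypothetical manifest; name, label, and limit are illustrative only.
manifest = {
    "apiVersion": "machineconfiguration.openshift.io/v1",
    "kind": "ContainerRuntimeConfig",
    "metadata": {"name": "set-pids-limit"},  # hypothetical name
    "spec": {
        "machineConfigPoolSelector": {
            # Hypothetical label; the target MachineConfigPool must carry it.
            "matchLabels": {"custom-crio": "set-pids-limit"}
        },
        "containerRuntimeConfig": {"pidsLimit": 2048},  # illustrative value
    },
}

print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `oc apply -f`, after which the logs mentioned above can be checked for the imageFs error.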
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.