+++ This bug was initially created as a clone of Bug #1794493 +++ Description of problem: When applying a ctrcfg to an mcp, the node goes into "NotReady" state. This happens because the crio.conf generated after the crd application populates the config fields to it's empty value (0 for int, "" for string etc). Version-Release number of selected component (if applicable): OCP 4.4 CRI-O 1.17 and 1.16 How reproducible: 100% of the time Steps to Reproduce: 1. Create a "Ctrcfg" 2. Wait for it to roll out onto the nodes Actual results: Node goes into "NotReady" state Expected results: Roll out should be successful and node should be in "Ready" state. Additional info: This change was introduced by https://github.com/openshift/machine-config-operator/commit/69025e8e8c82ed6d188eb0e409e8148da09ac3b2, we are working on reverting this and adding e2e tests for it.
Refrences on how to create a 'Ctrcfg' https://github.com/openshift/machine-config-operator/blob/master/docs/ContainerRuntimeConfigDesign.md https://github.com/openshift/machine-config-operator/blob/master/examples/containerruntimeconfig.crd.yaml
Fixes in PR https://github.com/openshift/machine-config-operator/pull/1447
verified with 4.3.0-0.nightly-2020-02-24-071304 $ oc version Client Version: openshift-clients-4.3.0-201910250623-79-g5d15fd52 Server Version: 4.3.0-0.nightly-2020-02-24-071304 Kubernetes Version: v1.16.2 1. oc edit machineconfigpool worker labels: custom-crio: high-pid-limit <-- add this line machineconfiguration.openshift.io/mco-built-in: "" 2. create a config yaml apiVersion: machineconfiguration.openshift.io/v1 kind: ContainerRuntimeConfig metadata: name: set-log-and-pid spec: machineConfigPoolSelector: matchLabels: custom-crio: high-pid-limit ### <-- this must match the label created in step #1 containerRuntimeConfig: pidsLimit: 2048 logLevel: debug 3. oc create -f config.yaml 4. wait until all worker nodes come up $ oc get no NAME STATUS ROLES AGE VERSION ip-10-0-133-169.us-east-2.compute.internal Ready worker 102m v1.16.2 ip-10-0-136-180.us-east-2.compute.internal Ready master 111m v1.16.2 ip-10-0-152-112.us-east-2.compute.internal Ready master 111m v1.16.2 ip-10-0-156-233.us-east-2.compute.internal Ready worker 102m v1.16.2 ip-10-0-172-75.us-east-2.compute.internal Ready master 111m v1.16.2 5. verify the limits defined in the config.yaml is applied to the nodes. $ oc debug node/ip-10-0-156-233.us-east-2.compute.internal Starting pod/ip-10-0-156-233us-east-2computeinternal-debug ... To use host binaries, run `chroot /host` chroot /host Pod IP: 10.0.156.233 If you don't see a command prompt, try pressing enter. sh-4.2# sh-4.2# chroot /host sh-4.4# cat /etc/crio/crio.conf | grep limit pids_limit = 2048
Recovery steps in case this issue is hit: 1) Delete the ctrcfg 2) Manually replace the `/etc/crio/crio.conf` on the node with a copy from a working node 3) Restart crio --> `systemctl restart crio` 4) Reboot node 5) Upgrade the cluster to pick up the newer version with the fix
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0676