Bug 1794495 - Applying "ctrcfg" causes cri-o to fail to start on node reboot
Summary: Applying "ctrcfg" causes cri-o to fail to start on node reboot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.3.z
Assignee: Urvashi Mohnani
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On: 1794493
Blocks:
 
Reported: 2020-01-23 17:38 UTC by Urvashi Mohnani
Modified: 2020-03-10 23:53 UTC (History)
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1794493
Environment:
Last Closed: 2020-03-10 23:53:17 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1447 None closed Bug 1794495: [release-4.3] fix ctrcfg and add e2e test 2020-05-18 14:51:32 UTC
Red Hat Product Errata RHBA-2020:0676 None None None 2020-03-10 23:53:44 UTC

Description Urvashi Mohnani 2020-01-23 17:38:07 UTC
+++ This bug was initially created as a clone of Bug #1794493 +++

Description of problem:

When a ctrcfg is applied to an MCP, the node goes into the "NotReady" state. This happens because the crio.conf generated after the CRD is applied fills every unset config field with its zero value (0 for ints, "" for strings, etc.), and CRI-O fails to start with that config when the node reboots.
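For illustration, a crio.conf generated this way might contain zeroed-out fields like the following (the specific values shown here are an assumption for illustration, not captured from an affected node); CRI-O then fails to start against such a config:

```toml
# Illustrative fragment of a broken generated crio.conf:
# fields the user never set are rendered at their zero values
# instead of being left at CRI-O's built-in defaults.
[crio.runtime]
pids_limit = 0
log_level = ""
conmon = ""
```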


Version-Release number of selected component (if applicable): 
OCP 4.4
CRI-O 1.17 and 1.16


How reproducible: 100% of the time


Steps to Reproduce:

1. Create a "Ctrcfg"
2. Wait for it to roll out onto the nodes

Actual results:
Node goes into "NotReady" state

Expected results:
Roll out should be successful and node should be in "Ready" state.


Additional info: 
This change was introduced by https://github.com/openshift/machine-config-operator/commit/69025e8e8c82ed6d188eb0e409e8148da09ac3b2, we are working on reverting this and adding e2e tests for it.
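The gist of the fix is to write out only the fields a user actually set in the ContainerRuntimeConfig, leaving everything else at CRI-O's defaults rather than clobbering them with zero values. A rough Python sketch of that merge rule (not the actual MCO code; the default values here are hypothetical):

```python
# Hypothetical subset of crio.conf [crio.runtime] defaults.
CRIO_DEFAULTS = {
    "pids_limit": 1024,
    "log_level": "error",
    "log_size_max": -1,
}

def merge_ctrcfg(defaults, user_cfg):
    """Overlay user-set fields onto the defaults, skipping zero values.

    A zero value (0 for ints, "" for strings) means the field was never
    set by the user, so the default must be preserved.
    """
    merged = dict(defaults)
    for key, value in user_cfg.items():
        if value in (0, ""):
            continue
        merged[key] = value
    return merged

# A user who only set pidsLimit; logLevel deserialized to its zero value.
print(merge_ctrcfg(CRIO_DEFAULTS, {"pids_limit": 2048, "log_level": ""}))
```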

Comment 2 Urvashi Mohnani 2020-02-18 15:11:54 UTC
Fixes in PR https://github.com/openshift/machine-config-operator/pull/1447

Comment 5 Weinan Liu 2020-02-25 12:58:03 UTC
Verified with 4.3.0-0.nightly-2020-02-24-071304

$ oc version
Client Version: openshift-clients-4.3.0-201910250623-79-g5d15fd52
Server Version: 4.3.0-0.nightly-2020-02-24-071304
Kubernetes Version: v1.16.2

1. oc edit machineconfigpool worker 
      labels:                                                                       
        custom-crio: high-pid-limit       <-- add this line                                          
        machineconfiguration.openshift.io/mco-built-in: ""        
2. create a config yaml

apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: set-log-and-pid
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-crio: high-pid-limit      ### <-- this must match the label created in step #1
  containerRuntimeConfig:
    pidsLimit: 2048
    logLevel: debug

3. oc create -f config.yaml
4. wait until all worker nodes come up
 $ oc get no
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-133-169.us-east-2.compute.internal   Ready    worker   102m   v1.16.2
ip-10-0-136-180.us-east-2.compute.internal   Ready    master   111m   v1.16.2
ip-10-0-152-112.us-east-2.compute.internal   Ready    master   111m   v1.16.2
ip-10-0-156-233.us-east-2.compute.internal   Ready    worker   102m   v1.16.2
ip-10-0-172-75.us-east-2.compute.internal    Ready    master   111m   v1.16.2


5. Verify the limits defined in config.yaml are applied to the nodes.
$ oc debug node/ip-10-0-156-233.us-east-2.compute.internal
Starting pod/ip-10-0-156-233us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.156.233
If you don't see a command prompt, try pressing enter.

sh-4.2# chroot /host
sh-4.4# cat /etc/crio/crio.conf | grep limit
    pids_limit = 2048

Comment 7 Urvashi Mohnani 2020-03-09 20:56:40 UTC
Recovery steps in case this issue is hit:

1) Delete the ctrcfg
2) Manually replace the `/etc/crio/crio.conf` on the node with a copy from a working node
3) Restart crio --> `systemctl restart crio`
4) Reboot node
5) Upgrade the cluster to pick up the newer version with the fix
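As a rough shell sketch of the steps above (the ctrcfg name is the one from the verification steps; node names and the copy mechanism are assumptions, adjust for your cluster and access method):

```shell
# 1) Delete the offending ctrcfg (name is an example from this report)
oc delete ctrcfg set-log-and-pid

# 2) Grab a known-good crio.conf from a healthy node via a debug pod,
#    then push it onto the broken node (assumes SSH access as core).
oc debug node/<healthy-node> -- chroot /host cat /etc/crio/crio.conf > crio.conf.good
scp crio.conf.good core@<broken-node>:/tmp/crio.conf

# 3) and 4) Replace the config, restart crio, and reboot the node.
ssh core@<broken-node> \
  'sudo cp /tmp/crio.conf /etc/crio/crio.conf && sudo systemctl restart crio && sudo systemctl reboot'

# 5) Then upgrade the cluster to a version carrying the fix.
```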

Comment 8 errata-xmlrpc 2020-03-10 23:53:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0676

