Description of problem:

Another step of the fallout of https://bugzilla.redhat.com/show_bug.cgi?id=1993385 involves an interaction between rpm-ostree and older versions of the MCO. If a cluster was ever at a version where the MCO configured /etc/crio/crio.conf (4.5 or earlier), then updates to the cri-o rpm won't update the crio.conf file (for example, to update the conmon path). The fix for https://bugzilla.redhat.com/show_bug.cgi?id=1993385 only updated the MCO to *not* specify the conmon path in the drop-in template (on the assumption that this would leave it at the CRI-O default of ""). As a result, the pre-existing value in /etc/crio/crio.conf (left unchanged by the rpm fix) prevails, causing cri-o to expect conmon at /usr/libexec/crio/conmon, which no longer exists. This causes nodes to not come up.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Upgrade a node from 4.5 to the affected versions (going through each minor version)
2. Notice that cri-o does not come up, in ways similar to https://bugzilla.redhat.com/show_bug.cgi?id=1993385

Actual results:
The node does not come up

Expected results:
The node starts

Additional info:
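To make the failure mode above concrete, here is a minimal sketch of how one might check a node for a stale conmon path. The helper names (conmon_path_from_conf, conmon_path_is_stale) are hypothetical and not part of cri-o or the MCO, and the parsing assumes a simple `conmon = "..."` line as in a stock crio.conf; it is not a full TOML parser.

```shell
#!/bin/sh
# Hypothetical helper (not part of cri-o or the MCO): print the value of the
# first uncommented `conmon = "..."` key in a crio.conf-style file.
conmon_path_from_conf() {
    conf_file=$1
    # "conmon" must be followed only by optional whitespace and "=", so keys
    # such as conmon_cgroup or conmon_env are not matched.
    sed -n 's/^[[:space:]]*conmon[[:space:]]*=[[:space:]]*"\(.*\)"[[:space:]]*$/\1/p' \
        "$conf_file" | head -n 1
}

# Succeed when the file sets conmon to a non-empty path that does not exist.
# The optional second argument is a root prefix (assumption: "/host" when
# running inside `oc debug node/...`, empty when logged in directly).
conmon_path_is_stale() {
    conf_file=$1
    root=${2:-}
    path=$(conmon_path_from_conf "$conf_file")
    [ -n "$path" ] && [ ! -e "${root}${path}" ]
}
```

On a node this could be called as `conmon_path_is_stale /etc/crio/crio.conf && echo "stale conmon path"`. An empty or absent conmon value is treated as not stale, matching the CRI-O default of "" mentioned above.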
We've tombstoned 4.7.25 and 4.8.6 on this in https://github.com/openshift/cincinnati-graph-data/pull/995
Working through a reproducer, I sent cluster-bot a 'launch 4.5.41'. Confirming the version after receiving the cluster:

$ oc get clusterversion -o jsonpath='{.status.desired.version}{"\n"}' version
4.5.41

Pulling in a ContainerRuntimeConfig from [1], because I hear that we need some kind of divergence from stock to trigger the bug:

$ cat highpids.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: set-pids-limit
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-crio: high-pid-limit
  containerRuntimeConfig:
    pidsLimit: 2048
$ oc apply -f highpids.yaml
$ oc label -n openshift-machine-api machineconfigpool worker custom-crio=high-pid-limit
$ oc get -n openshift-machine-api machineconfigpool worker -w
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-a0892f651d1c00e2f9456596af125622   True      False      False      3              3                   3                     0                      23m
...
worker   rendered-worker-a0892f651d1c00e2f9456596af125622   False     True       False      3              1                   1                     0                      26m

That's far enough. We only need one node to pick up the new config; any more that pick it up before the MCO gets bumped during the update just help make the problem more obvious later. Trigger the update to 4.6, setting the channel first, because we clear the channel in CI [2] and cluster-bot is using that CI config.

$ oc adm upgrade channel stable-4.6  # requires a 4.9+ 'oc' binary
warning: No channels known to be compatible with the current version "4.5.41"; unable to validate "stable-4.6".
$ oc adm upgrade --to 4.6.42

All three compute nodes ended up catching up before the CVO started updating the MCO to 4.6:

$ oc get -n openshift-machine-api machineconfigpool worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-12a38dd697f5c238e747cdcdecdd98cc   True      False      False      3              3                   3                     0                      32m
$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.6.42: 15% complete
...

The update eventually completes:

$ oc adm upgrade
Cluster version is 4.6.42

And off to the vulnerable 4.7.25, to try to reproduce the "Validating runtime config: conmon validation: invalid conmon path: stat /usr/libexec/crio/conmon: no such file or directory" error:

$ oc adm upgrade channel candidate-4.7  # requires a 4.9+ 'oc' binary
$ oc adm upgrade --to 4.7.25

And then a while later:

$ oc adm upgrade
Cluster version is 4.7.25
$ oc get -n openshift-machine-api machineconfigpools
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-f4a193c15abb46959e590b6254c7bb22   True      False      False      3              3                   3                     0                      174m
worker   rendered-worker-bddea6e49777a4615148d9fd7412a2b7   True      False      False      3              3                   3                     0                      174m

So this failed to reproduce the original bug. I'll try again starting with 4.4.33...

[1]: https://github.com/openshift/machine-config-operator/blob/release-4.5/docs/ContainerRuntimeConfigDesign.md#example
[2]: https://github.com/openshift/release/pull/8631
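The pool checks in this comment were read by eye from the `oc get machineconfigpool` tables. As a rough sketch, the same check can be scripted against that output. The function name pool_fully_updated is hypothetical, and the column positions (UPDATED in column 3, DEGRADED in column 5, MACHINECOUNT in column 6, UPDATEDMACHINECOUNT in column 8) are assumed from the table headers shown above.

```shell
#!/bin/sh
# Hypothetical helper: read header-less `oc get machineconfigpool` rows on
# stdin and succeed only if every pool is updated, not degraded, and has all
# of its machines on the new config.
pool_fully_updated() {
    awk '
        # Column positions are an assumption taken from the table header:
        # $3 UPDATED, $5 DEGRADED, $6 MACHINECOUNT, $8 UPDATEDMACHINECOUNT
        NF < 9 || $3 != "True" || $5 != "False" || $6 != $8 { bad = 1 }
        END { exit (NR > 0 && !bad) ? 0 : 1 }
    '
}
```

For example, `oc get machineconfigpool --no-headers | pool_fully_updated && echo "pools settled"` would only print once no pool is mid-rollout. Empty input is treated as a failure so that a failed `oc` call is not mistaken for a settled cluster.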
Followed upgrade path 4.5.41 -> 4.6.42 -> 4.7.25. Applied the ContainerRuntimeConfig on 4.5.41 before starting the upgrade. Failed to reproduce the bug. An upgrade starting from 4.4.33 is currently in progress.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.25    True        False         6m26s   Cluster version is 4.7.25

$ oc describe clusterversion
Name:         version
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterVersion
Metadata:
  Creation Timestamp:  2021-08-20T08:17:58Z
  Generation:          5
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:clusterID:
        f:upstream:
    Manager:      cluster-bootstrap
    Operation:    Update
    Time:         2021-08-20T08:17:58Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:channel:
        f:desiredUpdate:
          .:
          f:force:
          f:image:
          f:version:
    Manager:      oc
    Operation:    Update
    Time:         2021-08-20T11:02:13Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:availableUpdates:
        f:conditions:
        f:desired:
          .:
          f:channels:
          f:image:
          f:url:
          f:version:
        f:history:
        f:observedGeneration:
        f:versionHash:
    Manager:         cluster-version-operator
    Operation:       Update
    Time:            2021-08-20T11:56:18Z
  Resource Version:  132274
  Self Link:         /apis/config.openshift.io/v1/clusterversions/version
  UID:               5cf3aab5-a992-4524-9bc2-b0ee6d32711c
Spec:
  Channel:     candidate-4.7
  Cluster ID:  f532fd70-41ef-4be7-8847-56f591c189b7
  Desired Update:
    Force:    false
    Image:    quay.io/openshift-release-dev/ocp-release@sha256:d1cb6c18cb7bd7207855101752e05a7c8a7f99c8e339af9c23cec364055169f3
    Version:  4.7.25
  Upstream:   https://api.openshift.com/api/upgrades_info/v1/graph
Status:
  Available Updates:  <nil>
  Conditions:
    Last Transition Time:  2021-08-20T08:54:35Z
    Message:               Done applying 4.7.25
    Status:                True
    Type:                  Available
    Last Transition Time:  2021-08-20T12:09:22Z
    Status:                False
    Type:                  Failing
    Last Transition Time:  2021-08-20T12:09:52Z
    Message:               Cluster version is 4.7.25
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2021-08-20T08:18:12Z
    Status:                True
    Type:                  RetrievedUpdates
  Desired:
    Channels:
      candidate-4.7
      candidate-4.8
    Image:    quay.io/openshift-release-dev/ocp-release@sha256:d1cb6c18cb7bd7207855101752e05a7c8a7f99c8e339af9c23cec364055169f3
    URL:      https://access.redhat.com/errata/RHBA-2021:3188
    Version:  4.7.25
  History:
    Completion Time:  2021-08-20T12:09:52Z
    Image:            quay.io/openshift-release-dev/ocp-release@sha256:d1cb6c18cb7bd7207855101752e05a7c8a7f99c8e339af9c23cec364055169f3
    Started Time:     2021-08-20T11:02:28Z
    State:            Completed
    Verified:         true
    Version:          4.7.25
    Completion Time:  2021-08-20T10:57:28Z
    Image:            quay.io/openshift-release-dev/ocp-release@sha256:59e2e85f5d1bcb4440765c310b6261387ffc3f16ed55ca0a79012367e15b558b
    Started Time:     2021-08-20T09:52:35Z
    State:            Completed
    Verified:         true
    Version:          4.6.42
    Completion Time:  2021-08-20T08:54:35Z
    Image:            quay.io/openshift-release-dev/ocp-release@sha256:c67fe644d1c06e6d7694e648a40199cb06e25e1c3cfd5cd4fdac87fd696d2297
    Started Time:     2021-08-20T08:18:12Z
    State:            Completed
    Verified:         false
    Version:          4.5.41
  Observed Generation:  5
  Version Hash:         N_wDQ8h9xO8=
Events:                 <none>

$ oc describe containerruntimeconfig set-pids-limit
Name:         set-pids-limit
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         ContainerRuntimeConfig
Metadata:
  Creation Timestamp:  2021-08-20T09:35:34Z
  Finalizers:
    99-worker-12fbe9f1-357e-47c4-bf8d-f33e9272bc46-containerruntime
    99-worker-generated-containerruntime
  Generation:  1
  Managed Fields:
    API Version:  machineconfiguration.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:containerRuntimeConfig:
          .:
          f:pidsLimit:
        f:machineConfigPoolSelector:
          .:
          f:matchLabels:
            .:
            f:custom-crio:
    Manager:      kubectl-create
    Operation:    Update
    Time:         2021-08-20T09:35:34Z
    API Version:  machineconfiguration.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"99-worker-12fbe9f1-357e-47c4-bf8d-f33e9272bc46-containerruntime":
          v:"99-worker-generated-containerruntime":
      f:spec:
        f:containerRuntimeConfig:
          f:logSizeMax:
          f:overlaySize:
      f:status:
        .:
        f:conditions:
        f:observedGeneration:
    Manager:         machine-config-controller
    Operation:       Update
    Time:            2021-08-20T10:37:27Z
  Resource Version:  118733
  Self Link:         /apis/machineconfiguration.openshift.io/v1/containerruntimeconfigs/set-pids-limit
  UID:               0691d0c1-7a9b-4c29-88ea-341aaf900ea0
Spec:
  Container Runtime Config:
    Pids Limit:  2048
  Machine Config Pool Selector:
    Match Labels:
      Custom - Crio:  high-pid-limit
Status:
  Conditions:
    Last Transition Time:  2021-08-20T09:35:39Z
    Message:               Error: could not find any MachineConfigPool set for ContainerRuntimeConfig set-pids-limit
    Status:                False
    Type:                  Failure
    Last Transition Time:  2021-08-20T11:49:56Z
    Message:               Success
    Status:                True
    Type:                  Success
  Observed Generation:     1
Events:                    <none>
Followed upgrade path 4.4.33 -> 4.5.41 -> 4.6.42 -> 4.7.25. Applied the ContainerRuntimeConfig on 4.4.33 before starting the upgrade. Could not trigger the bug.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.25    True        False         19m     Cluster version is 4.7.25

$ oc describe clusterversion
Name:         version
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterVersion
Metadata:
  Creation Timestamp:  2021-08-20T10:18:19Z
  Generation:          8
  Resource Version:    158626
  Self Link:           /apis/config.openshift.io/v1/clusterversions/version
  UID:                 f67762f4-e704-4cf8-aa98-efe822557da5
Spec:
  Channel:     candidate-4.7
  Cluster ID:  3d9b11f4-1742-47df-b491-47e96446e8dc
  Desired Update:
    Force:    false
    Image:    quay.io/openshift-release-dev/ocp-release@sha256:d1cb6c18cb7bd7207855101752e05a7c8a7f99c8e339af9c23cec364055169f3
    Version:  4.7.25
  Upstream:   https://api.openshift.com/api/upgrades_info/v1/graph
Status:
  Available Updates:  <nil>
  Conditions:
    Last Transition Time:  2021-08-20T10:40:41Z
    Message:               Done applying 4.7.25
    Status:                True
    Type:                  Available
    Last Transition Time:  2021-08-20T13:27:23Z
    Status:                False
    Type:                  Failing
    Last Transition Time:  2021-08-20T15:00:54Z
    Message:               Cluster version is 4.7.25
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2021-08-20T10:18:25Z
    Status:                True
    Type:                  RetrievedUpdates
  Desired:
    Channels:
      candidate-4.7
      candidate-4.8
    Image:    quay.io/openshift-release-dev/ocp-release@sha256:d1cb6c18cb7bd7207855101752e05a7c8a7f99c8e339af9c23cec364055169f3
    URL:      https://access.redhat.com/errata/RHBA-2021:3188
    Version:  4.7.25
  History:
    Completion Time:  2021-08-20T15:00:54Z
    Image:            quay.io/openshift-release-dev/ocp-release@sha256:d1cb6c18cb7bd7207855101752e05a7c8a7f99c8e339af9c23cec364055169f3
    Started Time:     2021-08-20T13:53:08Z
    State:            Completed
    Verified:         true
    Version:          4.7.25
    Completion Time:  2021-08-20T13:27:53Z
    Image:            quay.io/openshift-release-dev/ocp-release@sha256:59e2e85f5d1bcb4440765c310b6261387ffc3f16ed55ca0a79012367e15b558b
    Started Time:     2021-08-20T12:09:06Z
    State:            Completed
    Verified:         true
    Version:          4.6.42
    Completion Time:  2021-08-20T12:05:21Z
    Image:            quay.io/openshift-release-dev/ocp-release@sha256:c67fe644d1c06e6d7694e648a40199cb06e25e1c3cfd5cd4fdac87fd696d2297
    Started Time:     2021-08-20T11:11:11Z
    State:            Completed
    Verified:         true
    Version:          4.5.41
    Completion Time:  2021-08-20T10:40:41Z
    Image:            quay.io/openshift-release-dev/ocp-release@sha256:a035dddd8a5e5c99484138951ef4aba021799b77eb9046f683a5466c23717738
    Started Time:     2021-08-20T10:18:25Z
    State:            Completed
    Verified:         false
    Version:          4.4.33
  Observed Generation:  8
  Version Hash:         N_wDQ8h9xO8=
Events:                 <none>

$ oc get containerruntimeconfig
NAME             AGE
set-pids-limit   4h22m
We were able to recover our cluster by doing the following (needs SSH access):

1. The cluster gets stuck mid-upgrade, with one master node NotReady.
2. On the two Ready master nodes, create the following file:

   # cat /etc/crio/crio.conf.d/02-conmon
   [crio.runtime]
   conmon = ""

   Note that with Ready masters, you can use `oc debug`; in that case the path will be /host/etc/crio/crio.conf.d/02-conmon.
3. On the NotReady master, create the same file. On this master, `oc debug` will not work, so you'll need to use SSH (without SSH configured, we were able to connect to the EC2 instance through the serial terminal, boot it into single-user mode, and add an SSH key).
4. On the NotReady master, restart the cri-o service first, and then the kubelet service.

This revived the master node, and then the upgrade process proceeded normally.
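For anyone repeating this on several nodes, the file creation in steps 2 and 3 can be sketched as a small function. The name write_conmon_dropin is hypothetical, and the root-prefix argument is an assumption for covering both the SSH case (no prefix) and the `oc debug` case (/host).

```shell
#!/bin/sh
# Hypothetical helper: write the recovery drop-in from the steps above.
# The optional argument is a root prefix: empty over SSH, "/host" when
# running inside `oc debug node/<name>`.
write_conmon_dropin() {
    root=${1:-}
    dropin_dir="${root}/etc/crio/crio.conf.d"
    mkdir -p "$dropin_dir"
    # An empty conmon value overrides the stale path left in
    # /etc/crio/crio.conf and falls back to the CRI-O default.
    printf '[crio.runtime]\nconmon = ""\n' > "${dropin_dir}/02-conmon"
}
```

On the NotReady master this would be run as root with no argument (`write_conmon_dropin`), followed by `systemctl restart crio` and then `systemctl restart kubelet`, in that order, as in step 4.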
Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
* Users who have ever manually changed their /etc/crio/crio.conf and attempt to upgrade to the affected versions (4.7.24 or 4.8.5)
* Potentially, users who applied a ContainerRuntimeConfig before OpenShift 4.4 and who have kept upgrading their clusters all the way to the affected versions.

What is the impact? Is it serious enough to warrant blocking edges?
* Nodes that upgrade go NotReady and require manual intervention to fix.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* An admin must SSH to the node and apply a drop-in cri-o config file. Since cri-o does not start, `oc debug node/` is not sufficient.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* Yes, this is a regression.
Verified on 4.9.0-0.nightly-2021-08-22-070405

1. Install 4.8.5
2. oc debug to a worker, edit /etc/crio/crio.conf, make some changes (I changed the log level and turned metrics on), and save the file
3. Create a ContainerRuntimeConfig with the following contents:

   apiVersion: machineconfiguration.openshift.io/v1
   kind: ContainerRuntimeConfig
   metadata:
     name: set-pids-limit
   spec:
     machineConfigPoolSelector:
       matchLabels:
         custom-crio: high-pid-limit
     containerRuntimeConfig:
       pidsLimit: 2048

4. oc label machineconfigpool worker custom-crio=high-pid-limit
5. oc get mcp worker -w and watch for all workers to be ready
6. oc adm upgrade --force --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-08-22-070405
   - verify upgrade successful
   - oc debug to the node where crio.conf was modified and verify customizations are still in place
   - crio config | grep conmon and verify value is "" and not /usr/libexec/crio/conmon
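The last verification bullet can also be turned into a pass/fail check instead of reading the grep output by eye. This is only a sketch: the function name conmon_is_default is hypothetical, and it assumes `crio config` prints the key as a plain `conmon = "..."` line.

```shell
#!/bin/sh
# Hypothetical helper: read `crio config` output on stdin and succeed only if
# the conmon key is present and set to the empty string.
conmon_is_default() {
    # Anchoring on `conmon` followed by optional whitespace and "=" keeps
    # keys such as conmon_cgroup and conmon_env from matching.
    grep -Eq '^[[:space:]]*conmon[[:space:]]*=[[:space:]]*""[[:space:]]*$'
}
```

On the upgraded node (from `oc debug node/<name>` after `chroot /host`) it could be used as `crio config 2>/dev/null | conmon_is_default && echo "conmon path is unset"`. A non-zero exit means the key is missing or still carries a value such as /usr/libexec/crio/conmon.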
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759