Bug 1995785
| Summary: | long living clusters may fail to upgrade because of an invalid conmon path | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Peter Hunt <pehunt> |
| Component: | Node | Assignee: | Peter Hunt <pehunt> |
| Node sub component: | CRI-O | QA Contact: | Mike Fiedler <mifiedle> |
| Status: | CLOSED ERRATA | Severity: | high |
| Priority: | urgent | CC: | aos-bugs, kgarriso, kuiwang, mharri, mifiedle, minmli, pmuller, skumari, smilner, wking |
| Version: | 4.9 | Keywords: | FastFix, Regression, UpgradeBlocker, Upgrades |
| Target Release: | 4.9.0 | Whiteboard: | UpdateRecommendationsBlocked |
| Hardware: | Unspecified | OS: | Unspecified |
| Last Closed: | 2021-10-18 17:47:27 UTC | Type: | Bug |
| Bug Blocks: | 1995809 | | |
Description
Peter Hunt
2021-08-19 18:32:09 UTC
We've tombstoned 4.7.25 and 4.8.6 on this in https://github.com/openshift/cincinnati-graph-data/pull/995
Working through a reproducer, I sent cluster-bot a 'launch 4.5.41'. Confirming the version after receiving the cluster:
$ oc get clusterversion -o jsonpath='{.status.desired.version}{"\n"}' version
4.5.41
Pulling in a ContainerRuntimeConfig from [1], because I hear that we need some kind of divergence from stock to trigger the bug:
$ cat highpids.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: set-pids-limit
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-crio: high-pid-limit
  containerRuntimeConfig:
    pidsLimit: 2048
$ oc apply -f highpids.yaml
$ oc label -n openshift-machine-api machineconfigpool worker custom-crio=high-pid-limit
$ oc get -n openshift-machine-api machineconfigpool worker -w
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
worker rendered-worker-a0892f651d1c00e2f9456596af125622 True False False 3 3 3 0 23m
...
worker rendered-worker-a0892f651d1c00e2f9456596af125622 False True False 3 1 1 0 26m
That's far enough. We only need one node on the new config; any more that pick it up before the MCO gets bumped during the update just make the problem more obvious later. Trigger the update to 4.6, setting the channel first, because we clear the channel in CI [2] and cluster-bot uses that CI config.
$ oc adm upgrade channel stable-4.6 # requires a 4.9+ 'oc' binary
warning: No channels known to be compatible with the current version "4.5.41"; unable to validate "stable-4.6".
$ oc adm upgrade --to 4.6.42
All three compute nodes ended up catching up before the CVO started updating the MCO to 4.6:
$ oc get -n openshift-machine-api machineconfigpool worker
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
worker rendered-worker-12a38dd697f5c238e747cdcdecdd98cc True False False 3 3 3 0 32m
$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.6.42: 15% complete
...
Update eventually completes:
$ oc adm upgrade
Cluster version is 4.6.42
And off to the vulnerable 4.7.25, to try and reproduce the "Validating runtime config: conmon validation: invalid conmon path: stat /usr/libexec/crio/conmon: no such file or directory":
$ oc adm upgrade channel candidate-4.7 # requires a 4.9+ 'oc' binary
$ oc adm upgrade --to 4.7.25
And then a while later:
$ oc adm upgrade
Cluster version is 4.7.25
$ oc get -n openshift-machine-api machineconfigpools
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-f4a193c15abb46959e590b6254c7bb22 True False False 3 3 3 0 174m
worker rendered-worker-bddea6e49777a4615148d9fd7412a2b7 True False False 3 3 3 0 174m
So I failed to reproduce the original bug. I'll try again starting with 4.4.33...
[1]: https://github.com/openshift/machine-config-operator/blob/release-4.5/docs/ContainerRuntimeConfigDesign.md#example
[2]: https://github.com/openshift/release/pull/8631
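The `-w` watch in the reproducer above is stopped by hand once at least one node has picked up the new config. A minimal sketch of reading the same table in a script, assuming the column order shown in the `oc get machineconfigpool` output in this report (UPDATING is the fourth column, UPDATEDMACHINECOUNT the eighth). `pool_progress` is a hypothetical helper name, not part of `oc`:

```shell
#!/usr/bin/env bash
# pool_progress POOL: read `oc get machineconfigpool` table output on stdin
# and print "<UPDATING> <UPDATEDMACHINECOUNT>" for the named pool.
# Hypothetical helper; it assumes the column order shown in this report.
pool_progress() {
  local pool="$1"
  awk -v pool="$pool" '$1 == pool { print $4, $8 }'
}

# Possible usage on a live cluster (not run here):
#   oc get machineconfigpool worker | pool_progress worker
```

For the second `worker` row captured above this prints `True 1`, which is the "one node is enough" state the reproducer waits for before triggering the update.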
Followed upgrade path 4.5.41 -> 4.6.42 -> 4.7.25. Applied the container runtime config on 4.5.41 before starting the upgrade.
Failed to reproduce the bug. An upgrade starting from 4.4.33 is currently in progress.
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.25 True False 6m26s Cluster version is 4.7.25
$ oc describe clusterversion
Name: version
Namespace:
Labels: <none>
Annotations: <none>
API Version: config.openshift.io/v1
Kind: ClusterVersion
Metadata:
Creation Timestamp: 2021-08-20T08:17:58Z
Generation: 5
Managed Fields:
API Version: config.openshift.io/v1
Fields Type: FieldsV1
fieldsV1:
f:spec:
.:
f:clusterID:
f:upstream:
Manager: cluster-bootstrap
Operation: Update
Time: 2021-08-20T08:17:58Z
API Version: config.openshift.io/v1
Fields Type: FieldsV1
fieldsV1:
f:spec:
f:channel:
f:desiredUpdate:
.:
f:force:
f:image:
f:version:
Manager: oc
Operation: Update
Time: 2021-08-20T11:02:13Z
API Version: config.openshift.io/v1
Fields Type: FieldsV1
fieldsV1:
f:status:
.:
f:availableUpdates:
f:conditions:
f:desired:
.:
f:channels:
f:image:
f:url:
f:version:
f:history:
f:observedGeneration:
f:versionHash:
Manager: cluster-version-operator
Operation: Update
Time: 2021-08-20T11:56:18Z
Resource Version: 132274
Self Link: /apis/config.openshift.io/v1/clusterversions/version
UID: 5cf3aab5-a992-4524-9bc2-b0ee6d32711c
Spec:
Channel: candidate-4.7
Cluster ID: f532fd70-41ef-4be7-8847-56f591c189b7
Desired Update:
Force: false
Image: quay.io/openshift-release-dev/ocp-release@sha256:d1cb6c18cb7bd7207855101752e05a7c8a7f99c8e339af9c23cec364055169f3
Version: 4.7.25
Upstream: https://api.openshift.com/api/upgrades_info/v1/graph
Status:
Available Updates: <nil>
Conditions:
Last Transition Time: 2021-08-20T08:54:35Z
Message: Done applying 4.7.25
Status: True
Type: Available
Last Transition Time: 2021-08-20T12:09:22Z
Status: False
Type: Failing
Last Transition Time: 2021-08-20T12:09:52Z
Message: Cluster version is 4.7.25
Status: False
Type: Progressing
Last Transition Time: 2021-08-20T08:18:12Z
Status: True
Type: RetrievedUpdates
Desired:
Channels:
candidate-4.7
candidate-4.8
Image: quay.io/openshift-release-dev/ocp-release@sha256:d1cb6c18cb7bd7207855101752e05a7c8a7f99c8e339af9c23cec364055169f3
URL: https://access.redhat.com/errata/RHBA-2021:3188
Version: 4.7.25
History:
Completion Time: 2021-08-20T12:09:52Z
Image: quay.io/openshift-release-dev/ocp-release@sha256:d1cb6c18cb7bd7207855101752e05a7c8a7f99c8e339af9c23cec364055169f3
Started Time: 2021-08-20T11:02:28Z
State: Completed
Verified: true
Version: 4.7.25
Completion Time: 2021-08-20T10:57:28Z
Image: quay.io/openshift-release-dev/ocp-release@sha256:59e2e85f5d1bcb4440765c310b6261387ffc3f16ed55ca0a79012367e15b558b
Started Time: 2021-08-20T09:52:35Z
State: Completed
Verified: true
Version: 4.6.42
Completion Time: 2021-08-20T08:54:35Z
Image: quay.io/openshift-release-dev/ocp-release@sha256:c67fe644d1c06e6d7694e648a40199cb06e25e1c3cfd5cd4fdac87fd696d2297
Started Time: 2021-08-20T08:18:12Z
State: Completed
Verified: false
Version: 4.5.41
Observed Generation: 5
Version Hash: N_wDQ8h9xO8=
Events: <none>
$ oc describe containerruntimeconfig set-pids-limit
Name: set-pids-limit
Namespace:
Labels: <none>
Annotations: <none>
API Version: machineconfiguration.openshift.io/v1
Kind: ContainerRuntimeConfig
Metadata:
Creation Timestamp: 2021-08-20T09:35:34Z
Finalizers:
99-worker-12fbe9f1-357e-47c4-bf8d-f33e9272bc46-containerruntime
99-worker-generated-containerruntime
Generation: 1
Managed Fields:
API Version: machineconfiguration.openshift.io/v1
Fields Type: FieldsV1
fieldsV1:
f:spec:
.:
f:containerRuntimeConfig:
.:
f:pidsLimit:
f:machineConfigPoolSelector:
.:
f:matchLabels:
.:
f:custom-crio:
Manager: kubectl-create
Operation: Update
Time: 2021-08-20T09:35:34Z
API Version: machineconfiguration.openshift.io/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.:
v:"99-worker-12fbe9f1-357e-47c4-bf8d-f33e9272bc46-containerruntime":
v:"99-worker-generated-containerruntime":
f:spec:
f:containerRuntimeConfig:
f:logSizeMax:
f:overlaySize:
f:status:
.:
f:conditions:
f:observedGeneration:
Manager: machine-config-controller
Operation: Update
Time: 2021-08-20T10:37:27Z
Resource Version: 118733
Self Link: /apis/machineconfiguration.openshift.io/v1/containerruntimeconfigs/set-pids-limit
UID: 0691d0c1-7a9b-4c29-88ea-341aaf900ea0
Spec:
Container Runtime Config:
Pids Limit: 2048
Machine Config Pool Selector:
Match Labels:
Custom - Crio: high-pid-limit
Status:
Conditions:
Last Transition Time: 2021-08-20T09:35:39Z
Message: Error: could not find any MachineConfigPool set for ContainerRuntimeConfig set-pids-limit
Status: False
Type: Failure
Last Transition Time: 2021-08-20T11:49:56Z
Message: Success
Status: True
Type: Success
Observed Generation: 1
Events: <none>
Followed upgrade path 4.4.33 -> 4.5.41 -> 4.6.42 -> 4.7.25. Applied the container runtime config on 4.4.33 before starting the upgrade.
Could not trigger the bug.
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.25 True False 19m Cluster version is 4.7.25
$ oc describe clusterversion
Name: version
Namespace:
Labels: <none>
Annotations: <none>
API Version: config.openshift.io/v1
Kind: ClusterVersion
Metadata:
Creation Timestamp: 2021-08-20T10:18:19Z
Generation: 8
Resource Version: 158626
Self Link: /apis/config.openshift.io/v1/clusterversions/version
UID: f67762f4-e704-4cf8-aa98-efe822557da5
Spec:
Channel: candidate-4.7
Cluster ID: 3d9b11f4-1742-47df-b491-47e96446e8dc
Desired Update:
Force: false
Image: quay.io/openshift-release-dev/ocp-release@sha256:d1cb6c18cb7bd7207855101752e05a7c8a7f99c8e339af9c23cec364055169f3
Version: 4.7.25
Upstream: https://api.openshift.com/api/upgrades_info/v1/graph
Status:
Available Updates: <nil>
Conditions:
Last Transition Time: 2021-08-20T10:40:41Z
Message: Done applying 4.7.25
Status: True
Type: Available
Last Transition Time: 2021-08-20T13:27:23Z
Status: False
Type: Failing
Last Transition Time: 2021-08-20T15:00:54Z
Message: Cluster version is 4.7.25
Status: False
Type: Progressing
Last Transition Time: 2021-08-20T10:18:25Z
Status: True
Type: RetrievedUpdates
Desired:
Channels:
candidate-4.7
candidate-4.8
Image: quay.io/openshift-release-dev/ocp-release@sha256:d1cb6c18cb7bd7207855101752e05a7c8a7f99c8e339af9c23cec364055169f3
URL: https://access.redhat.com/errata/RHBA-2021:3188
Version: 4.7.25
History:
Completion Time: 2021-08-20T15:00:54Z
Image: quay.io/openshift-release-dev/ocp-release@sha256:d1cb6c18cb7bd7207855101752e05a7c8a7f99c8e339af9c23cec364055169f3
Started Time: 2021-08-20T13:53:08Z
State: Completed
Verified: true
Version: 4.7.25
Completion Time: 2021-08-20T13:27:53Z
Image: quay.io/openshift-release-dev/ocp-release@sha256:59e2e85f5d1bcb4440765c310b6261387ffc3f16ed55ca0a79012367e15b558b
Started Time: 2021-08-20T12:09:06Z
State: Completed
Verified: true
Version: 4.6.42
Completion Time: 2021-08-20T12:05:21Z
Image: quay.io/openshift-release-dev/ocp-release@sha256:c67fe644d1c06e6d7694e648a40199cb06e25e1c3cfd5cd4fdac87fd696d2297
Started Time: 2021-08-20T11:11:11Z
State: Completed
Verified: true
Version: 4.5.41
Completion Time: 2021-08-20T10:40:41Z
Image: quay.io/openshift-release-dev/ocp-release@sha256:a035dddd8a5e5c99484138951ef4aba021799b77eb9046f683a5466c23717738
Started Time: 2021-08-20T10:18:25Z
State: Completed
Verified: false
Version: 4.4.33
Observed Generation: 8
Version Hash: N_wDQ8h9xO8=
Events: <none>
$ oc get containerruntimeconfig
NAME AGE
set-pids-limit 4h22m
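The upgrade path stated at the top of this comment can be read back from the `History:` block of the `oc describe clusterversion` output, which lists the newest update first. A small sketch, assuming that layout (a `History:` heading, one `Version:` line per entry, then `Observed Generation:`). `upgrade_path` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# upgrade_path: read `oc describe clusterversion` output on stdin and print
# the versions from the History block, oldest first, joined by " -> ".
# Hypothetical helper; it relies on the describe layout shown in this report.
upgrade_path() {
  awk '
    $1 == "History:" { in_history = 1; next }   # start of the history block
    $1 == "Observed" { in_history = 0 }         # "Observed Generation:" ends it
    in_history && $1 == "Version:" { versions[++n] = $2 }
    END {
      # History is newest-first, so walk it backwards.
      for (i = n; i >= 1; i--)
        printf "%s%s", versions[i], (i > 1 ? " -> " : "\n")
    }
  '
}

# Possible usage on a live cluster (not run here):
#   oc describe clusterversion | upgrade_path
```

`Version:` lines under `Spec:` and `Desired:` appear before the `History:` heading, so they are not counted.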
We were able to recover our cluster by doing the following (needs SSH access):
1. The cluster gets stuck mid-upgrade, with one master node NotReady
2. On the two ready master nodes, create the following file:
   # cat /etc/crio/crio.conf.d/02-conmon
   [crio.runtime]
   conmon = ""
   Note that with ready masters, you can use `oc debug` - in that case the path will be /host/etc/crio/crio.conf.d/02-conmon
3. On the NotReady master, create the same file. On this master, `oc debug` will not work, so you'll need to use SSH (without SSH configured, we were able to connect to the EC2 instance through the serial terminal, boot it into single-user mode, and add an SSH key)
4. On the NotReady master, restart cri-o (first) and then the kubelet service
This revived the master node, and then the upgrade process proceeded normally.
Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
* Users who have ever manually changed their /etc/crio/crio.conf and attempt to upgrade to the affected versions (4.7.24 or 4.8.5)
* Potentially, users who applied a ContainerRuntimeConfig before OpenShift 4.4 and who have kept upgrading their clusters all the way to the affected versions.
What is the impact? Is it serious enough to warrant blocking edges?
* Nodes that upgrade go NotReady and require manual intervention to fix.
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* Admin must SSH to the node and apply a drop-in cri-o config file. Since cri-o does not start, `oc debug node/` is not sufficient.
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* Yes, this is a regression
Verified on 4.9.0-0.nightly-2021-08-22-070405
1. Install 4.8.5
2. oc debug to a worker and edit /etc/crio/crio.conf and make some changes (I changed loglevel and turned metrics on) and save the file
3. Create a containerruntime config with the following contents
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: set-pids-limit
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-crio: high-pid-limit
  containerRuntimeConfig:
    pidsLimit: 2048
4. oc label machineconfigpool worker custom-crio=high-pid-limit
5. oc get mcp worker -w and watch for all workers to be ready
6. oc adm upgrade --force --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-08-22-070405
- verify upgrade successful
- oc debug to the node where crio.conf was modified and verify customizations are still in place
- crio config | grep conmon and verify value is "" and not /usr/libexec/crio/conmon
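The recovery drop-in described earlier in this report and the `conmon` check in the last verification step could be scripted roughly as follows. This is a sketch, not the procedure QA ran: the file path and contents come from the recovery steps, while the helper names `write_conmon_dropin` and `conmon_value` are hypothetical.

```shell
#!/usr/bin/env bash
# write_conmon_dropin [ROOT]: create the 02-conmon drop-in from the recovery
# steps. ROOT is empty when run over SSH on the node, or "/host" when run
# from an `oc debug` shell on a node that is still Ready.
write_conmon_dropin() {
  local root="${1:-}"
  local dir="${root}/etc/crio/crio.conf.d"
  mkdir -p "$dir"
  printf '[crio.runtime]\nconmon = ""\n' > "${dir}/02-conmon"
}

# conmon_value: read `crio config` output (or a config file) on stdin and
# print the value of the exact `conmon` key with quotes stripped. Keys that
# merely start with "conmon", such as conmon_cgroup, are ignored.
conmon_value() {
  awk -F'=' '
    { key = $1; gsub(/[ \t]/, "", key) }
    key == "conmon" {
      val = $2
      gsub(/[ \t"]/, "", val)
      print val
    }
  '
}

# Possible usage on a node (not run here):
#   write_conmon_dropin            # then restart cri-o, then kubelet
#   crio config | conmon_value     # expected to print an empty line after the fix
```

An empty value corresponds to the `conmon = ""` setting the verification step expects; a printed `/usr/libexec/crio/conmon` would indicate the stale path from the error message is still configured.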
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:3759