Bug 1764001
Summary: | Node does not recover after "unexpected on-disk state validating" error | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | William Caban <william.caban> | |
Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca> | |
Status: | CLOSED ERRATA | QA Contact: | Micah Abbott <miabbott> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | 4.2.0 | CC: | amurdaca, kgarriso, miabbott, rob.fisher, skumari, smilner | |
Target Milestone: | --- | |||
Target Release: | 4.5.0 | |||
Hardware: | x86_64 | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1764116 (view as bug list) | Environment: | ||
Last Closed: | 2020-07-13 17:11:31 UTC | Type: | Bug | |
Bug Depends On: | 1764116 | |||
Description
William Caban
2019-10-22 07:01:18 UTC
The MCO is validating against the original crio unit and failing on that. This is tricky because every time you apply a config that somehow overrides what the MCO ships, you get into this situation. I'll check why removing it doesn't reconcile, but I think the MCO is right to refuse to change a system file that it manages.

Antonio, this happens with any of the /etc/systemd/system/*/10-default-env.conf files. At the very least, we need a way to reset the system to its original state, or to reset it as if it were a fresh install.

(In reply to William Caban from comment #2)
> Antonio, this happens with any of the
> /etc/systemd/system/*/10-default-env.conf files. At least we need a way to
> reset the system to its original state or to reset as if it were a fresh
> install.

Yes, somehow. We're discussing it here: https://github.com/openshift/machine-config-operator/pull/1203#issuecomment-544886806

This still has to be fixed in master, and the 4.4 BZ already exists at https://bugzilla.redhat.com/show_bug.cgi?id=1764116, so I'm moving this one to 4.5.

*** Bug 1764116 has been marked as a duplicate of this bug. ***

There's already a PR for this. It is in review and waiting for review comments to be addressed. It might be approved by the end of the sprint, but it is unlikely to be merged by then, as it will need further review.
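To make the validation discussed above more concrete, here is a minimal shell sketch of an on-disk state check. It assumes the check amounts to comparing a file's bytes with the content the rendered config expects. `check_on_disk_state` is a hypothetical helper written for illustration, and its messages only loosely echo the error in the bug title. It is not the MCO's actual implementation.

```shell
#!/usr/bin/env bash
# Hypothetical helper, for illustration only (not MCO code).
# Usage: check_on_disk_state <path> <expected-content>
# Succeeds only if <path> exists and its bytes match <expected-content> exactly.
check_on_disk_state() {
    local path="$1" expected="$2"

    if [ ! -f "$path" ]; then
        echo "unexpected on-disk state: $path is missing"
        return 1
    fi

    # printf '%s' emits the expected content without adding a trailing newline,
    # so a file written as '#foo' (no newline) compares equal.
    if ! printf '%s' "$expected" | cmp -s - "$path"; then
        echo "unexpected on-disk state: content mismatch for $path"
        return 1
    fi

    echo "ok: $path"
}
```

On a node, this could be called as `check_on_disk_state /etc/systemd/system/crio.service.d/10-default-env.conf '#foo'`. A strict byte-for-byte comparison like this explains why a drop-in that overrides a file the MCO ships would be reported as a mismatch rather than silently accepted.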
Verified with 4.5.0-rc1:

```
$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-rc.1   True        False         85s     Cluster version is 4.5.0-rc.1

$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-135-188.us-west-1.compute.internal   Ready    master   21m   v1.18.3+a637491
ip-10-0-147-1.us-west-1.compute.internal     Ready    master   23m   v1.18.3+a637491
ip-10-0-152-169.us-west-1.compute.internal   Ready    worker   11m   v1.18.3+a637491
ip-10-0-154-177.us-west-1.compute.internal   Ready    worker   11m   v1.18.3+a637491
ip-10-0-225-24.us-west-1.compute.internal    Ready    master   22m   v1.18.3+a637491
ip-10-0-239-116.us-west-1.compute.internal   Ready    worker   13m   v1.18.3+a637491
```

Create an MC from comment #0 and apply it:

```
$ cat bz1764001.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  creationTimestamp: null
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 10-crio-default-env
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 2.2.0
    networkd: {}
    passwd: {}
    systemd: {
      "units": [
        {
          "name": "crio.service",
          "dropins": [{
            "name": "10-default-env.conf",
            "contents": '#foo'
          }]
        }
      ]
    }

$ oc create -f bz1764001.yaml
machineconfig.machineconfiguration.openshift.io/10-crio-default-env created

$ oc get mc
NAME                                                        GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                                   e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             20m
00-worker                                                   e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             20m
01-master-container-runtime                                 e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             20m
01-master-kubelet                                           e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             20m
01-worker-container-runtime                                 e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             20m
01-worker-kubelet                                           e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             20m
10-crio-default-env                                                                                    2.2.0             12s
99-master-e8a44833-ab79-4d75-bf61-fed3a38f434e-registries   e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             20m
99-master-ssh                                                                                          2.2.0             28m
99-worker-f15e132c-35c2-470a-a629-f13da6435e3d-registries   e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             20m
99-worker-ssh                                                                                          2.2.0             28m
rendered-master-759d4538a6a9d8170c625a96ab4805c9            e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             20m
rendered-worker-35b05ccd8394d4341f7692d1179de05b            e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             20m
rendered-worker-818b4712aa4adf225ee4417e6397fd69            e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             7s

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-759d4538a6a9d8170c625a96ab4805c9   True      False      False      3              3                   3                     0                      24m
worker   rendered-worker-35b05ccd8394d4341f7692d1179de05b   False     True       False      3              1                   1                     0                      24m
```

Check the annotations and file state on one of the nodes:

```
$ oc describe node/ip-10-0-239-116.us-west-1.compute.internal
...
Annotations:        machine.openshift.io/machine: openshift-machine-api/ci-ln-tq3mggt-d5d6b-28d4c-worker-us-west-1b-b6m2k
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-818b4712aa4adf225ee4417e6397fd69
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-818b4712aa4adf225ee4417e6397fd69
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true

$ oc debug node/ip-10-0-239-116.us-west-1.compute.internal
Starting pod/ip-10-0-239-116us-west-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.239.116
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# cat /etc/systemd/system/crio.service.d/10-default-env.conf
#foosh-4.4#
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
```

Confirm that the MCP is fully updated:

```
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-759d4538a6a9d8170c625a96ab4805c9   True      False      False      3              3                   3                     0                      28m
worker   rendered-worker-818b4712aa4adf225ee4417e6397fd69   True      False      False      3              3                   3                     0                      28m
```

Delete the newly added MC and confirm the rollout:

```
$ oc delete mc/10-crio-default-env
machineconfig.machineconfiguration.openshift.io "10-crio-default-env" deleted

$ oc get mc
NAME                                                        GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                                   e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             30m
00-worker                                                   e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             30m
01-master-container-runtime                                 e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             30m
01-master-kubelet                                           e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             30m
01-worker-container-runtime                                 e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             30m
01-worker-kubelet                                           e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             30m
99-master-e8a44833-ab79-4d75-bf61-fed3a38f434e-registries   e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             30m
99-master-ssh                                                                                          2.2.0             38m
99-worker-f15e132c-35c2-470a-a629-f13da6435e3d-registries   e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             30m
99-worker-ssh                                                                                          2.2.0             38m
rendered-master-759d4538a6a9d8170c625a96ab4805c9            e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             30m
rendered-worker-35b05ccd8394d4341f7692d1179de05b            e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             30m
rendered-worker-818b4712aa4adf225ee4417e6397fd69            e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             10m

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-759d4538a6a9d8170c625a96ab4805c9   True      False      False      3              3                   3                     0                      32m
worker   rendered-worker-818b4712aa4adf225ee4417e6397fd69   False     True       False      3              0                   0                     0                      32m
```

Inspect the node to confirm that the file has been removed:

```
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-759d4538a6a9d8170c625a96ab4805c9   True      False      False      3              3                   3                     0                      44m
worker   rendered-worker-35b05ccd8394d4341f7692d1179de05b   True      False      False      3              3                   3                     0                      44m

$ oc debug node/ip-10-0-239-116.us-west-1.compute.internal
Starting pod/ip-10-0-239-116us-west-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.239.116
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# cat /etc/systemd/system/crio.service.d/10-default-env.conf
cat: /etc/systemd/system/crio.service.d/10-default-env.conf: No such file or directory
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
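
The repeated `oc get mcp` checks in the verification above could be scripted rather than read by eye. The following is a rough sketch, not part of the original verification: `mcp_all_updated` is a hypothetical helper that reads the table printed by `oc get mcp` on standard input and succeeds only when every pool shows UPDATED=True, UPDATING=False and DEGRADED=False. It assumes whitespace-separated columns with no empty cells, as in the output shown in this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper, for illustration only.
# Reads `oc get mcp` table output on stdin. Exits 0 if every pool has
# converged (UPDATED=True, UPDATING=False, DEGRADED=False), otherwise
# prints the offending pool names and exits 1.
mcp_all_updated() {
    awk '
        BEGIN { bad = 0 }
        # Header row: remember the position of each column by name.
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Skip blank lines.
        NF == 0 { next }
        {
            if ($(col["UPDATED"]) != "True" ||
                $(col["UPDATING"]) != "False" ||
                $(col["DEGRADED"]) != "False") {
                print "pool " $1 " has not converged"
                bad = 1
            }
        }
        END { exit bad }
    '
}
```

For example, `oc get mcp | mcp_all_updated` would fail while the worker pool is still rolling out and succeed once both pools report `True False False`. Looking up column positions from the header avoids hard-coding field numbers, but the sketch would misread a row in which the CONFIG cell is empty.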