# Description of problem:

When creating a MachineConfig to update systemd dropins, after applying to the first Node, the Node enters the error "unexpected on-disk state validating" and cannot recover.

# Version-Release number of selected component (if applicable):

OCP 4.2.0

# How reproducible:

100% of the time.

# Steps to Reproduce:

1. Create and apply a MachineConfig that updates a systemd dropin, like this:

```
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  creationTimestamp: null
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 10-crio-default-env
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 2.2.0
    networkd: {}
    passwd: {}
    systemd:
      units:
        - name: crio.service
          dropins:
            - name: 10-default-env.conf
              contents: '#foo'
```

2. The Node gets into "unexpected on-disk state validating", as shown in the Node object:

```
apiVersion: v1
kind: Node
metadata:
  annotations:
    machineconfiguration.openshift.io/currentConfig: rendered-worker-85b528bcd2d7ff5110197a176e59c43f
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-8d3c5f427a66492fed77a3c491514b90
    machineconfiguration.openshift.io/reason: unexpected on-disk state validating against rendered-worker-8d3c5f427a66492fed77a3c491514b90
    machineconfiguration.openshift.io/state: Degraded
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
```

Validating that the MC applied the configuration:

```
[root@worker-1 /]# cat /etc/systemd/system/crio.service.d/10-default-env.conf
#foo
```

The worker Node status shows as:

```
worker-1.ocp4poc.lab.shift.zone   Ready,SchedulingDisabled   worker   4d7h   v1.14.6+c07e432da
```

3. Removing the MachineConfig does not return the Node to its original state. Changing the desiredConfig to the previous one and rebooting the Node does not reset the state:

```
machineconfiguration.openshift.io/currentConfig: rendered-worker-85b528bcd2d7ff5110197a176e59c43f
machineconfiguration.openshift.io/desiredConfig: rendered-worker-85b528bcd2d7ff5110197a176e59c43f
```

After rebooting, it goes back to the degraded state with:

```
machineconfiguration.openshift.io/currentConfig: rendered-worker-85b528bcd2d7ff5110197a176e59c43f
machineconfiguration.openshift.io/desiredConfig: rendered-worker-8d3c5f427a66492fed77a3c491514b90
```

# Expected results:

The Node should return or reset to a known good state after the MachineConfig is removed.
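For reference, the detailed validation failure is logged by the machine-config-daemon pod running on the affected node. A minimal way to pull the relevant log lines (the pod name below is a placeholder; match it to the degraded node from the `-o wide` listing):

```
# List MCD pods with the nodes they run on, to find the one on the degraded worker:
oc -n openshift-machine-config-operator get pods -o wide | grep machine-config-daemon

# Dump that pod's log and look for the validation error (pod name is illustrative):
oc -n openshift-machine-config-operator logs machine-config-daemon-xxxxx | grep -i "on-disk"
```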
The MCO is validating against the original crio unit and failing on that. This is tricky, because every time you apply a config that overrides something the MCO ships, you get into this situation. I'll check why removing the MachineConfig doesn't reconcile, but I think the MCO is right to refuse changes to a system file that it manages.
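To see concretely what the daemon is comparing, the check can be roughly reproduced by hand: pull the dropin contents the rendered config expects and diff them against the file on disk. A minimal sketch, assuming the single-unit MachineConfig from comment #0 (hence the `[0]` indices) and the rendered-config name from the node annotation:

```
# Expected dropin contents, extracted from the rendered MachineConfig:
oc get mc rendered-worker-8d3c5f427a66492fed77a3c491514b90 \
  -o jsonpath='{.spec.config.systemd.units[0].dropins[0].contents}' > /tmp/expected.conf

# Actual on-disk contents, read through a debug pod:
oc debug node/worker-1.ocp4poc.lab.shift.zone -- \
  chroot /host cat /etc/systemd/system/crio.service.d/10-default-env.conf > /tmp/ondisk.conf

diff /tmp/expected.conf /tmp/ondisk.conf
```

Note that in this bug the dropin itself matches (`#foo` is on disk); per the comment above, the failure comes from the MCO-shipped crio.service unit being overridden, not from the dropin contents.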
Antonio, this happens with any of the /etc/systemd/system/*/10-default-env.conf files. At a minimum, we need a way to reset the system to its original state, or to reset it as if it were a fresh install.
(In reply to William Caban from comment #2) > Antonio, this happens with any of the > /etc/systemd/system/*/10-default-env.conf files. At a minimum, we need a way to > reset the system to its original state, or to reset it as if it were a fresh > install. Yes, we're discussing exactly that here: https://github.com/openshift/machine-config-operator/pull/1203#issuecomment-544886806
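For clusters already stuck in Degraded before a fix lands, one escape hatch worth mentioning is the machine-config-daemon force file, which tells the daemon to apply the desired config without validating the current on-disk state. This mechanism exists in later MCO releases and may not be available on 4.2; treat the following as a sketch, not a verified recovery procedure:

```
# On the degraded node (e.g. via `oc debug node/...` and `chroot /host`):
# creating this file makes the MCD skip on-disk validation on its next sync.
touch /run/machine-config-daemon-force
```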
This still has to be fixed in master, and the 4.4 BZ already exists at https://bugzilla.redhat.com/show_bug.cgi?id=1764116, so moving this to 4.5.
*** Bug 1764116 has been marked as a duplicate of this bug. ***
There's already a PR for this; it's in review and waiting for review comments to be addressed. We might be able to approve it by the end of the sprint, but it's unlikely to be merged by then, as it will need further review.
Verified with 4.5.0-rc.1:

```
$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-rc.1   True        False         85s     Cluster version is 4.5.0-rc.1

$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-135-188.us-west-1.compute.internal   Ready    master   21m   v1.18.3+a637491
ip-10-0-147-1.us-west-1.compute.internal     Ready    master   23m   v1.18.3+a637491
ip-10-0-152-169.us-west-1.compute.internal   Ready    worker   11m   v1.18.3+a637491
ip-10-0-154-177.us-west-1.compute.internal   Ready    worker   11m   v1.18.3+a637491
ip-10-0-225-24.us-west-1.compute.internal    Ready    master   22m   v1.18.3+a637491
ip-10-0-239-116.us-west-1.compute.internal   Ready    worker   13m   v1.18.3+a637491
```

Create an MC from comment #0 and apply it:

```
$ cat bz1764001.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  creationTimestamp: null
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 10-crio-default-env
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 2.2.0
    networkd: {}
    passwd: {}
    systemd:
      units:
        - name: crio.service
          dropins:
            - name: 10-default-env.conf
              contents: '#foo'

$ oc create -f bz1764001.yaml
machineconfig.machineconfiguration.openshift.io/10-crio-default-env created

$ oc get mc
NAME                                                        GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                                   e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             20m
00-worker                                                   e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             20m
01-master-container-runtime                                 e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             20m
01-master-kubelet                                           e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             20m
01-worker-container-runtime                                 e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             20m
01-worker-kubelet                                           e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             20m
10-crio-default-env                                                                                    2.2.0             12s
99-master-e8a44833-ab79-4d75-bf61-fed3a38f434e-registries   e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             20m
99-master-ssh                                                                                          2.2.0             28m
99-worker-f15e132c-35c2-470a-a629-f13da6435e3d-registries   e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             20m
99-worker-ssh                                                                                          2.2.0             28m
rendered-master-759d4538a6a9d8170c625a96ab4805c9            e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             20m
rendered-worker-35b05ccd8394d4341f7692d1179de05b            e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             20m
rendered-worker-818b4712aa4adf225ee4417e6397fd69            e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             7s

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-759d4538a6a9d8170c625a96ab4805c9   True      False      False      3              3                   3                     0                      24m
worker   rendered-worker-35b05ccd8394d4341f7692d1179de05b   False     True       False      3              1                   1                     0                      24m
```

Check one of the nodes' annotations and on-disk file state:

```
$ oc describe node/ip-10-0-239-116.us-west-1.compute.internal
...
Annotations:        machine.openshift.io/machine: openshift-machine-api/ci-ln-tq3mggt-d5d6b-28d4c-worker-us-west-1b-b6m2k
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-818b4712aa4adf225ee4417e6397fd69
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-818b4712aa4adf225ee4417e6397fd69
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true

$ oc debug node/ip-10-0-239-116.us-west-1.compute.internal
Starting pod/ip-10-0-239-116us-west-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.239.116
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# cat /etc/systemd/system/crio.service.d/10-default-env.conf
#foo
sh-4.4# exit
exit
sh-4.2# exit
exit
Removing debug pod ...
```

Confirm the MCP is fully updated:

```
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-759d4538a6a9d8170c625a96ab4805c9   True      False      False      3              3                   3                     0                      28m
worker   rendered-worker-818b4712aa4adf225ee4417e6397fd69   True      False      False      3              3                   3                     0                      28m
```

Delete the newly added MC and confirm the rollout:

```
$ oc delete mc/10-crio-default-env
machineconfig.machineconfiguration.openshift.io "10-crio-default-env" deleted

$ oc get mc
NAME                                                        GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                                   e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             30m
00-worker                                                   e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             30m
01-master-container-runtime                                 e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             30m
01-master-kubelet                                           e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             30m
01-worker-container-runtime                                 e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             30m
01-worker-kubelet                                           e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             30m
99-master-e8a44833-ab79-4d75-bf61-fed3a38f434e-registries   e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             30m
99-master-ssh                                                                                          2.2.0             38m
99-worker-f15e132c-35c2-470a-a629-f13da6435e3d-registries   e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             30m
99-worker-ssh                                                                                          2.2.0             38m
rendered-master-759d4538a6a9d8170c625a96ab4805c9            e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             30m
rendered-worker-35b05ccd8394d4341f7692d1179de05b            e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             30m
rendered-worker-818b4712aa4adf225ee4417e6397fd69            e4d414f494d174826121feeaa0a5160b47099bd7   2.2.0             10m

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-759d4538a6a9d8170c625a96ab4805c9   True      False      False      3              3                   3                     0                      32m
worker   rendered-worker-818b4712aa4adf225ee4417e6397fd69   False     True       False      3              0                   0                     0                      32m
```

Inspect the node to confirm the file has been removed:

```
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-759d4538a6a9d8170c625a96ab4805c9   True      False      False      3              3                   3                     0                      44m
worker   rendered-worker-35b05ccd8394d4341f7692d1179de05b   True      False      False      3              3                   3                     0                      44m

$ oc debug node/ip-10-0-239-116.us-west-1.compute.internal
Starting pod/ip-10-0-239-116us-west-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.239.116
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# cat /etc/systemd/system/crio.service.d/10-default-env.conf
cat: /etc/systemd/system/crio.service.d/10-default-env.conf: No such file or directory
sh-4.4# exit
exit
sh-4.2# exit
exit
Removing debug pod ...
```
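As an extra check beyond the steps above (not part of the original verification), the node's MCD state annotation can be read directly; the backslashes escape the dots inside the annotation key:

```
$ oc get node ip-10-0-239-116.us-west-1.compute.internal \
    -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/state}'
Done
```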
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409