+++ This bug was initially created as a clone of Bug #1764001 +++

# Description of problem:

When a MachineConfig that updates a systemd drop-in is created, the first Node it is applied to goes into the error "unexpected on-disk state validating" and cannot recover.

# Version-Release number of selected component (if applicable):

OCP 4.2.0

# How reproducible:

This is reproducible 100% of the time.

# Steps to Reproduce:

1. Create and apply a MachineConfig that updates a systemd drop-in, like this:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      creationTimestamp: null
      labels:
        machineconfiguration.openshift.io/role: worker
      name: 10-crio-default-env
    spec:
      config:
        ignition:
          config: {}
          security:
            tls: {}
          timeouts: {}
          version: 2.2.0
        networkd: {}
        passwd: {}
        systemd: {
            "units": [
              {
                "name": "crio.service",
                "dropins": [{
                  "name": "10-default-env.conf",
                  "contents": '#foo'
                }]
              }
            ]
          }

2. The Node gets into the "unexpected on-disk state validating" error. The Node shows this message:

    apiVersion: v1
    kind: Node
    metadata:
      annotations:
        machineconfiguration.openshift.io/currentConfig: rendered-worker-85b528bcd2d7ff5110197a176e59c43f
        machineconfiguration.openshift.io/desiredConfig: rendered-worker-8d3c5f427a66492fed77a3c491514b90
        machineconfiguration.openshift.io/reason: unexpected on-disk state validating against rendered-worker-8d3c5f427a66492fed77a3c491514b90
        machineconfiguration.openshift.io/state: Degraded
        volumes.kubernetes.io/controller-managed-attach-detach: "true"

Checking the Node confirms that the MC applied the configuration:

    [root@worker-1 /]# cat /etc/systemd/system/crio.service.d/10-default-env.conf
    #foo

The worker Node status is shown as:

    worker-1.ocp4poc.lab.shift.zone Ready,SchedulingDisabled worker 4d7h v1.14.6+c07e432da

3. Removing the MachineConfig does not return the Node to the original state.
Changing the desiredConfig to the previous one and rebooting the Nodes does not reset the state:

    machineconfiguration.openshift.io/currentConfig: rendered-worker-85b528bcd2d7ff5110197a176e59c43f
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-85b528bcd2d7ff5110197a176e59c43f

After rebooting, the Node goes back to the degraded state, and desiredConfig points at the new rendered config again:

    machineconfiguration.openshift.io/currentConfig: rendered-worker-85b528bcd2d7ff5110197a176e59c43f
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-8d3c5f427a66492fed77a3c491514b90

# Expected results:

The Node should return or reset to a known good state after the MachineConfig is removed.

--- Additional comment from Antonio Murdaca on 2019-10-22 09:09:28 UTC ---

The MCO is validating against the original crio unit and fails because of that. This is tricky because every time you apply a config that somehow overrides what the MCO ships, you get into this situation. I'll check why removing it doesn't reconcile, but I think the MCO is right to refuse to change a system file that it manages.
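
For anyone triaging a similar Node, the annotations quoted in this report can be summarized with a short script. This is only a sketch: the function name `summarize_mc_state` is hypothetical, the annotation keys and values are copied from the Node output above, and it assumes the annotations have already been loaded into a dict (for example from the JSON printed by `oc get node <name> -o json`). It does not reproduce how the MCO itself validates on-disk state.

```python
# Annotation prefix copied from the Node output quoted in this report.
PREFIX = "machineconfiguration.openshift.io/"


def summarize_mc_state(annotations):
    """Summarize a Node's MachineConfig annotations (hypothetical helper)."""
    current = annotations.get(PREFIX + "currentConfig")
    desired = annotations.get(PREFIX + "desiredConfig")
    state = annotations.get(PREFIX + "state")
    reason = annotations.get(PREFIX + "reason", "")
    return {
        # The Node has not (yet) reached the config it was asked to apply.
        "pending_update": current != desired,
        # The daemon reported a failure instead of progressing.
        "degraded": state == "Degraded",
        "reason": reason,
    }


# Values taken from step 2 of the report.
node_annotations = {
    PREFIX + "currentConfig": "rendered-worker-85b528bcd2d7ff5110197a176e59c43f",
    PREFIX + "desiredConfig": "rendered-worker-8d3c5f427a66492fed77a3c491514b90",
    PREFIX + "reason": (
        "unexpected on-disk state validating against "
        "rendered-worker-8d3c5f427a66492fed77a3c491514b90"
    ),
    PREFIX + "state": "Degraded",
}

summary = summarize_mc_state(node_annotations)
print(summary)
```

A Node that is both `pending_update` and `degraded` matches the stuck state described in step 2. A Node with `pending_update` alone would more likely just be mid-rollout.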
We're gonna use https://bugzilla.redhat.com/show_bug.cgi?id=1764001 as the base for this work. I'm working on that one, as 4.2 is where the issue happened and was introduced. I'm closing this and will clone again from that bug, since this is just a dup across 4.2/4.3.

*** This bug has been marked as a duplicate of bug 1764001 ***