Bug 1764116 - Node does not recover after "unexpected on-disk state validating" error
Summary: Node does not recover after "unexpected on-disk state validating" error
Keywords:
Status: CLOSED DUPLICATE of bug 1764001
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.3.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.5.0
Assignee: Antonio Murdaca
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1764001
 
Reported: 2019-10-22 09:35 UTC by Antonio Murdaca
Modified: 2020-04-28 21:22 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1764001
Environment:
Last Closed: 2020-04-28 21:22:51 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift machine-config-operator pull 1203 (closed): "Bug 1764116: templates: rename our dropins to include the mco string", last updated 2021-02-05 16:12:33 UTC

Description Antonio Murdaca 2019-10-22 09:35:03 UTC
+++ This bug was initially created as a clone of Bug #1764001 +++

# Description of problem:

When creating a MachineConfig to update systemd dropins, after it is applied to the first Node, the Node gets into the "unexpected on-disk state validating" error and cannot recover.

# Version-Release number of selected component (if applicable):
OCP 4.2.0 

# How reproducible:

This is reproducible 100% of the time.


# Steps to Reproduce:

1. Create and apply a MachineConfig to update a systemd dropin like this:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  creationTimestamp: null
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 10-crio-default-env
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 2.2.0
    networkd: {}
    passwd: {}
    systemd:
      units:
      - name: crio.service
        dropins:
        - name: 10-default-env.conf
          contents: '#foo'
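
The MachineConfig can be applied with oc, for example (the manifest filename below is only an assumption for illustration):

$ oc create -f 10-crio-default-env.yaml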

2. The Node enters the "unexpected on-disk state validating" error, as shown by the Node annotations:


apiVersion: v1
kind: Node
metadata:
  annotations:
    machineconfiguration.openshift.io/currentConfig: rendered-worker-85b528bcd2d7ff5110197a176e59c43f
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-8d3c5f427a66492fed77a3c491514b90
    machineconfiguration.openshift.io/reason: unexpected on-disk state validating
      against rendered-worker-8d3c5f427a66492fed77a3c491514b90
    machineconfiguration.openshift.io/state: Degraded
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
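
(One way to read these annotations directly, using the node name from this report; this is a generic oc query, not something specific to the MCO:)

$ oc get node worker-1.ocp4poc.lab.shift.zone -o yaml | grep machineconfiguration.openshift.io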


Checking on the node confirms that the MachineConfig applied the configuration:
[root@worker-1 /]# cat /etc/systemd/system/crio.service.d/10-default-env.conf
#foo


# The worker Node status is shown as:

worker-1.ocp4poc.lab.shift.zone   Ready,SchedulingDisabled   worker          4d7h   v1.14.6+c07e432da

3. Removing the MachineConfig does not return the Node to its original state.

# Changing the desiredConfig to the previous one and rebooting the Nodes does not reset the state
machineconfiguration.openshift.io/currentConfig: rendered-worker-85b528bcd2d7ff5110197a176e59c43f
machineconfiguration.openshift.io/desiredConfig: rendered-worker-85b528bcd2d7ff5110197a176e59c43f
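
# For concreteness, a sketch of what these two attempts look like on the CLI, assuming the
# MachineConfig and node names used above (the exact commands are not recorded in this bug):
$ oc delete machineconfig 10-crio-default-env
$ oc patch node worker-1.ocp4poc.lab.shift.zone --type merge -p \
    '{"metadata":{"annotations":{"machineconfiguration.openshift.io/desiredConfig":"rendered-worker-85b528bcd2d7ff5110197a176e59c43f"}}}'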

# After rebooting, the Node returns to the degraded state and the desiredConfig is set back to the new rendered config:
machineconfiguration.openshift.io/currentConfig: rendered-worker-85b528bcd2d7ff5110197a176e59c43f
machineconfiguration.openshift.io/desiredConfig: rendered-worker-8d3c5f427a66492fed77a3c491514b90


# Expected results:
The Node should return or reset to a known good state after the MachineConfig is removed.

--- Additional comment from Antonio Murdaca on 2019-10-22 09:09:28 UTC ---

The MCO is validating against the original crio unit and is failing because of that. This is tricky because every time you apply a config that overrides something the MCO ships, you end up in this situation. I'll check why removing it doesn't reconcile, but I think the MCO is right to refuse changes to a system file that it manages.
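
A minimal sketch of the collision being described, assuming the conflict is on a dropin file that the MCO itself ships for crio.service (which is what the linked PR, renaming MCO dropins to include the "mco" string, suggests): choosing a dropin name the MCO does not manage avoids overriding MCO-owned content. The name 20-custom-env.conf here is hypothetical.

systemd:
  units:
  - name: crio.service
    dropins:
    - name: 20-custom-env.conf   # hypothetical name, assumed not to collide with an MCO-shipped dropin
      contents: '#foo'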

Comment 3 Antonio Murdaca 2020-04-28 21:22:51 UTC
We're going to use https://bugzilla.redhat.com/show_bug.cgi?id=1764001 as the base for this work - I'm working on that one (as it's 4.2, where the issue happened and was introduced). Closing this bug; I'll clone again from that one, as this is just a dup over 4.2/4.3.

*** This bug has been marked as a duplicate of bug 1764001 ***

