Bug 1764116 - Node does not recover after "unexpected on-disk state validating" error
Summary: Node does not recover after "unexpected on-disk state validating" error
Keywords:
Status: CLOSED DUPLICATE of bug 1764001
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.3.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.5.0
Assignee: Antonio Murdaca
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1764001
 
Reported: 2019-10-22 09:35 UTC by Antonio Murdaca
Modified: 2020-04-28 21:22 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1764001
Environment:
Last Closed: 2020-04-28 21:22:51 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift machine-config-operator pull 1203 (closed): "Bug 1764116: templates: rename our dropins to include the mco string", last updated 2021-02-05 16:12:33 UTC

Description Antonio Murdaca 2019-10-22 09:35:03 UTC
+++ This bug was initially created as a clone of Bug #1764001 +++

# Description of problem:

When creating a MachineConfig to update systemd dropins, after it is applied to the first Node, the Node gets into the "unexpected on-disk state validating" error and cannot recover.

# Version-Release number of selected component (if applicable):
OCP 4.2.0 

# How reproducible:

This is reproducible 100% of the time.


# Steps to Reproduce:

1. Create and apply a MachineConfig to update a systemd dropin like this:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  creationTimestamp: null
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 10-crio-default-env
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 2.2.0
    networkd: {}
    passwd: {}
    systemd:
      units:
      - name: crio.service
        dropins:
        - name: 10-default-env.conf
          contents: '#foo'
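
The MachineConfig can be applied with oc, for example (the manifest filename below is only an assumption for illustration):

$ oc create -f 10-crio-default-env.yaml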

2. The Node enters the "unexpected on-disk state validating" error, as shown by the Node annotations:


apiVersion: v1
kind: Node
metadata:
  annotations:
    machineconfiguration.openshift.io/currentConfig: rendered-worker-85b528bcd2d7ff5110197a176e59c43f
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-8d3c5f427a66492fed77a3c491514b90
    machineconfiguration.openshift.io/reason: unexpected on-disk state validating
      against rendered-worker-8d3c5f427a66492fed77a3c491514b90
    machineconfiguration.openshift.io/state: Degraded
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
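
(One way to read these annotations directly, using the node name from this report; this is a generic oc query, not something specific to the MCO:)

$ oc get node worker-1.ocp4poc.lab.shift.zone -o yaml | grep machineconfiguration.openshift.io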


Checking on the node confirms that the MachineConfig applied the configuration:
[root@worker-1 /]# cat /etc/systemd/system/crio.service.d/10-default-env.conf
#foo


# The worker Node status is shown as:

worker-1.ocp4poc.lab.shift.zone   Ready,SchedulingDisabled   worker          4d7h   v1.14.6+c07e432da

3. Removing the MachineConfig does not return the Node to its original state.

# Changing the desiredConfig to the previous one and rebooting the Nodes does not reset the state
machineconfiguration.openshift.io/currentConfig: rendered-worker-85b528bcd2d7ff5110197a176e59c43f
machineconfiguration.openshift.io/desiredConfig: rendered-worker-85b528bcd2d7ff5110197a176e59c43f
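
# For concreteness, a sketch of what these two attempts look like on the CLI, assuming the
# MachineConfig and node names used above (the exact commands are not recorded in this bug):
$ oc delete machineconfig 10-crio-default-env
$ oc patch node worker-1.ocp4poc.lab.shift.zone --type merge -p \
    '{"metadata":{"annotations":{"machineconfiguration.openshift.io/desiredConfig":"rendered-worker-85b528bcd2d7ff5110197a176e59c43f"}}}'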

# After rebooting, the Node returns to the degraded state and the desiredConfig is set back to the new rendered config:
machineconfiguration.openshift.io/currentConfig: rendered-worker-85b528bcd2d7ff5110197a176e59c43f
machineconfiguration.openshift.io/desiredConfig: rendered-worker-8d3c5f427a66492fed77a3c491514b90


# Expected results:
The Node should return or reset to a known good state after the MachineConfig is removed.

--- Additional comment from Antonio Murdaca on 2019-10-22 09:09:28 UTC ---

The MCO is validating against the original crio unit and is failing because of that. This is tricky because every time you apply a config that overrides something the MCO ships, you end up in this situation. I'll check why removing it doesn't reconcile, but I think the MCO is right to refuse changes to a system file that it manages.
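
A minimal sketch of the collision being described, assuming the conflict is on a dropin file that the MCO itself ships for crio.service (which is what the linked PR, renaming MCO dropins to include the "mco" string, suggests): choosing a dropin name the MCO does not manage avoids overriding MCO-owned content. The name 20-custom-env.conf here is hypothetical.

systemd:
  units:
  - name: crio.service
    dropins:
    - name: 20-custom-env.conf   # hypothetical name, assumed not to collide with an MCO-shipped dropin
      contents: '#foo'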

Comment 3 Antonio Murdaca 2020-04-28 21:22:51 UTC
We're going to use https://bugzilla.redhat.com/show_bug.cgi?id=1764001 as the base for this work - I'm working on that one (as it's 4.2, where the issue happened and was introduced). Closing this bug; I'll clone again from that one, as this is just a dup over 4.2/4.3.

*** This bug has been marked as a duplicate of bug 1764001 ***

