Bug 1766513 - MCD could shadow errors since the force validation file isn't deleted once used
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.3.0
Assignee: Antonio Murdaca
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1766514 1766515
 
Reported: 2019-10-29 10:27 UTC by Antonio Murdaca
Modified: 2020-01-23 11:10 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned to: 1766514 (view as bug list)
Environment:
Last Closed: 2020-01-23 11:09:46 UTC
Target Upstream Version:
Embargoed:


Links:
Red Hat Product Errata RHBA-2020:0062 (last updated 2020-01-23 11:10:11 UTC)

Description Antonio Murdaca 2019-10-29 10:27:04 UTC
Description of problem:

The MCO added a "force overwrite" file that lets administrators recover from a degraded state when they know they want the MCD to synchronize to the current on-disk state. However, the file was never intended to persist on disk after use: as long as it exists, it silently masks any subsequent validation errors.

The flag works far better as a "one time" intervention: consumed on first use and then deleted.
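
A minimal sketch of the intervention, assuming a placeholder node name (the force file path is confirmed in the verification transcript in comment 3):

# Touch the force file on the affected node; with the fix, the MCD consumes
# and deletes it after a single forced validation instead of honoring it forever.
$ oc debug node/<node> -- chroot /host touch /run/machine-config-daemon-force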

Version-Release number of selected component (if applicable):

4.1 and later


How reproducible:

always


Steps to Reproduce:
1. Create a skew between the on-disk state and the rendered config (for example, via certificate recovery).
2. Clear the resulting degraded state using the force validation file.
3. Re-create the skew and notice that the error is not bubbled up (see the condensed sketch below).
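
A condensed sketch of the reproduction, assuming <node> is a worker node and /etc/test is a file managed by a MachineConfig (the full verification transcript is in comment 3):

# 1. Skew the on-disk state away from the rendered config, then restart the
#    MCD pod on the node so it revalidates and marks the pool degraded.
$ oc debug node/<node> -- chroot /host sh -c 'echo DRIFT >> /etc/test'
# 2. "Fix" it by touching the force validation file and restarting the MCD again.
$ oc debug node/<node> -- chroot /host touch /run/machine-config-daemon-force
# 3. Re-create the skew; before the fix the force file is still on disk, so the
#    worker pool never reports DEGRADED=True.
$ oc debug node/<node> -- chroot /host sh -c 'echo MORE DRIFT >> /etc/test'
$ oc get mcp worker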

Actual results:

Errors are shadowed: while the force file exists, subsequent on-disk drift is never reported.

Expected results:

Errors are properly bubbled up because the force file is deleted after a single use.

Additional info:

Comment 2 Kirsten Garrison 2019-11-07 19:34:11 UTC
Any updates on this? We have two backports stalled waiting on it.

Comment 3 Michael Nguyen 2019-11-19 17:26:05 UTC
Verified on 4.3.0-0.nightly-2019-11-18-175710

== Apply the following MC so the file is managed by the MCO ==
$ cat file.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"machineconfiguration.openshift.io/v1","kind":"MachineConfig","metadata":{"annotations":{},"labels":{"machineconfiguration.openshift.io/role":"worker"},"name":"test-file"},"spec":{"config":{"ignition":{"version":"2.2.0"},"storage":{"files":[{"contents":{"source":"data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK"},"filesystem":"root","mode":420,"path":"/etc/test"}]}}}}
  creationTimestamp: "2019-11-19T15:42:50Z"
  generation: 1
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
  resourceVersion: "286928"
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigs/test-file
  uid: 9c0fd040-c96e-4fd1-bfa4-122551bf0f8c
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        filesystem: root
        mode: 420
        path: /etc/test

oc apply -f file.yaml
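
The apply produces a new rendered worker config and triggers a rolling update; one way to wait for it to finish (not part of the original transcript) is:

$ oc get mcp worker -w   # wait for UPDATED=True on the new rendered config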

== Edit /etc/test, which is now managed by the MCO ==
$ oc get node
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-132-107.ec2.internal   Ready    worker   17h   v1.16.2
ip-10-0-143-180.ec2.internal   Ready    master   17h   v1.16.2
ip-10-0-155-243.ec2.internal   Ready    worker   17h   v1.16.2
ip-10-0-157-93.ec2.internal    Ready    master   17h   v1.16.2
ip-10-0-166-244.ec2.internal   Ready    master   17h   v1.16.2
ip-10-0-168-238.ec2.internal   Ready    worker   17h   v1.16.2
$ oc debug node/ip-10-0-132-107.ec2.internal
Starting pod/ip-10-0-132-107ec2internal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# 
sh-4.2# 
sh-4.2# chroot /host
sh-4.4# vi /etc/test 
sh-4.4# cat /etc/test 
server foo.example.net maxdelay 0.4 offline
server bar.example.net maxdelay 0.4 offline
MY CHANGES HERE
server baz.example.net maxdelay 0.4 offline
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...

== Restart the MCD pod on that node ==
$ oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon --field-selector spec.nodeName=ip-10-0-132-107.ec2.internal
NAME                          READY   STATUS    RESTARTS   AGE
machine-config-daemon-pffnf   2/2     Running   3          17h
$ oc -n openshift-machine-config-operator log machine-config-daemon-pffnf -c machine-config-daemon
I1119 15:51:41.294634    2141 start.go:74] Version: v4.3.0-201911161914-dirty (1c4436a0952ff43315da8d9a4aa1b5841192aeb1)
I1119 15:51:41.358299    2141 start.go:84] Calling chroot("/rootfs")
I1119 15:51:41.360123    2141 rpm-ostree.go:365] Running captured: rpm-ostree status --json
I1119 15:51:43.252332    2141 daemon.go:209] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eeb7352d393d6aafcb75d1d57f5578c06af9f8ab21d86f00e7cda8388744b320 (43.81.201911181553.0)
I1119 15:51:43.320652    2141 metrics.go:106] Registering Prometheus metrics
I1119 15:51:43.320872    2141 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
I1119 15:51:43.356798    2141 update.go:1050] Starting to manage node: ip-10-0-132-107.ec2.internal
I1119 15:51:43.386440    2141 rpm-ostree.go:365] Running captured: rpm-ostree status
I1119 15:51:43.873395    2141 daemon.go:778] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eeb7352d393d6aafcb75d1d57f5578c06af9f8ab21d86f00e7cda8388744b320
              CustomOrigin: Managed by machine-config-operator
                   Version: 43.81.201911181553.0 (2019-11-18T15:58:44Z)

  ostree://025d77d2680e3a2ce60b3b832106319fefb81cfefbf1bf26fcf4bd1aeae0712d
                   Version: 43.81.201911081536.0 (2019-11-08T15:41:32Z)
I1119 15:51:43.873421    2141 rpm-ostree.go:365] Running captured: journalctl --list-boots
I1119 15:51:43.890836    2141 daemon.go:785] journalctl --list-boots:
-2 9426aac3147844b8b0703f2a7e70fb1e Mon 2019-11-18 23:05:26 UTC—Mon 2019-11-18 23:07:16 UTC
-1 27c0bbf49686428091cefaaf6ee2e0ac Mon 2019-11-18 23:07:49 UTC—Tue 2019-11-19 15:50:46 UTC
 0 56fcd0cea1d3495fa43887d9ac004008 Tue 2019-11-19 15:51:22 UTC—Tue 2019-11-19 15:51:43 UTC
I1119 15:51:43.890857    2141 daemon.go:528] Starting MachineConfigDaemon
I1119 15:51:43.890966    2141 daemon.go:535] Enabling Kubelet Healthz Monitor
E1119 15:51:47.483423    2141 reflector.go:123] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E1119 15:51:47.483436    2141 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
I1119 15:51:49.939436    2141 daemon.go:731] Current config: rendered-worker-a071bd1866b9d81e64e2b56318138074
I1119 15:51:49.939462    2141 daemon.go:732] Desired config: rendered-worker-e63e8ae93003df690179be3266b9621e
I1119 15:51:49.952134    2141 update.go:1050] Disk currentConfig rendered-worker-e63e8ae93003df690179be3266b9621e overrides node annotation rendered-worker-a071bd1866b9d81e64e2b56318138074
I1119 15:51:49.956574    2141 daemon.go:955] Validating against pending config rendered-worker-e63e8ae93003df690179be3266b9621e
I1119 15:51:49.970491    2141 daemon.go:971] Validated on-disk state
I1119 15:51:49.985618    2141 daemon.go:1005] Completing pending config rendered-worker-e63e8ae93003df690179be3266b9621e
I1119 15:51:49.993916    2141 update.go:1050] completed update for config rendered-worker-e63e8ae93003df690179be3266b9621e
I1119 15:51:49.999053    2141 daemon.go:1021] In desired config rendered-worker-e63e8ae93003df690179be3266b9621e

$ oc -n openshift-machine-config-operator delete pod machine-config-daemon-pffnf
pod "machine-config-daemon-pffnf" deleted


== Verify degraded state in MCD logs after it restarts ==
$ oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon --field-selector spec.nodeName=ip-10-0-132-107.ec2.internal
NAME                          READY   STATUS    RESTARTS   AGE
machine-config-daemon-4cfqh   2/2     Running   0          12m
$ oc -n openshift-machine-config-operator log machine-config-daemon-4cfqh -c machine-config-daemon
...snip...
E1119 16:54:09.246433   71013 writer.go:130] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-e63e8ae93003df690179be3266b9621e
I1119 16:55:09.262041   71013 daemon.go:724] Current+desired config: rendered-worker-e63e8ae93003df690179be3266b9621e
I1119 16:55:09.269002   71013 daemon.go:958] Validating against current config rendered-worker-e63e8ae93003df690179be3266b9621e
E1119 16:55:09.270049   71013 daemon.go:1350] content mismatch for file /etc/test: server foo.example.net maxdelay 0.4 offline
server bar.example.net maxdelay 0.4 offline


A: MY CHANGES HERE
server baz.example.net maxdelay 0.4 offline


B: server baz.example.net maxdelay 0.4 offline


E1119 16:55:09.270077   71013 writer.go:130] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-e63e8ae93003df690179be3266b9621e
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT
master   rendered-master-6e7212ec4d57e884f64e32eabff065bb   True      False      False      3              3                   3                     0
worker   rendered-worker-e63e8ae93003df690179be3266b9621e   False     True       True       3              2                   2                     1


== Touch the force file /run/machine-config-daemon-force ==
$ oc get nodes
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-132-107.ec2.internal   Ready    worker   17h   v1.16.2
ip-10-0-143-180.ec2.internal   Ready    master   17h   v1.16.2
ip-10-0-155-243.ec2.internal   Ready    worker   17h   v1.16.2
ip-10-0-157-93.ec2.internal    Ready    master   17h   v1.16.2
ip-10-0-166-244.ec2.internal   Ready    master   17h   v1.16.2
ip-10-0-168-238.ec2.internal   Ready    worker   17h   v1.16.2
[mnguyen@pet30 4.3-nightly]$ ls
openshift-client-linux-4.3.0-0.nightly-2019-11-18-175710.tar.gz   README.md      testcluster
openshift-install                                                 release.txt
openshift-install-linux-4.3.0-0.nightly-2019-11-18-175710.tar.gz  sha256sum.txt
[mnguyen@pet30 4.3-nightly]$ oc debug node/ip-10-0-132-107.ec2.internal
Starting pod/ip-10-0-132-107ec2internal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# 
sh-4.2# chroot /host
sh-4.4# cat /etc/test 
server foo.example.net maxdelay 0.4 offline
server bar.example.net maxdelay 0.4 offline
MY CHANGES HERE
server baz.example.net maxdelay 0.4 offline
sh-4.4# touch /run/machine-config-daemon-force
sh-4.4# ls /run/
NetworkManager		       containers	dmeventd-server  issue.d		      mdadm	     ostree-booted  samba     sudo
agetty.reload		       crio		faillock	 kubernetes		      media	     podman	    secrets   systemd
auditd.pid		       criu		gluster		 lock			      mount	     rpc.statd.pid  sepermit  tmpfiles.d
chrony			       cryptsetup	ignition.env	 log			      multus	     rpcbind	    setrans   udev
console			       dbus		initctl		 lvm			      openshift-sdn  rpcbind.sock   sshd.pid  user
console-login-helper-messages  dmeventd-client	initramfs	 machine-config-daemon-force  openvswitch    runc	    sssd.pid  utmp
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...

== Restart the MCD pod on the node again ==
$ oc -n openshift-machine-config-operator delete pod/machine-config-daemon-4cfqh 
pod "machine-config-daemon-4cfqh" deleted

== Verify that the MCP is no longer degraded and the MCD logs show no errors ==
$ oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon --field-selector spec.nodeName=ip-10-0-132-107.ec2.internal
NAME                          READY   STATUS    RESTARTS   AGE
machine-config-daemon-k5w9h   2/2     Running   0          9s
$ oc -n openshift-machine-config-operator log machine-config-daemon-k5w9h -c machine-config-daemon
I1119 17:03:18.700477   97533 start.go:74] Version: v4.3.0-201911161914-dirty (1c4436a0952ff43315da8d9a4aa1b5841192aeb1)
I1119 17:03:18.706312   97533 start.go:84] Calling chroot("/rootfs")
I1119 17:03:18.706418   97533 rpm-ostree.go:365] Running captured: rpm-ostree status --json
I1119 17:03:18.877078   97533 daemon.go:209] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eeb7352d393d6aafcb75d1d57f5578c06af9f8ab21d86f00e7cda8388744b320 (43.81.201911181553.0)
I1119 17:03:18.879443   97533 metrics.go:106] Registering Prometheus metrics
I1119 17:03:18.879586   97533 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
I1119 17:03:18.881125   97533 update.go:1050] Starting to manage node: ip-10-0-132-107.ec2.internal
I1119 17:03:18.886955   97533 rpm-ostree.go:365] Running captured: rpm-ostree status
I1119 17:03:18.964138   97533 daemon.go:778] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eeb7352d393d6aafcb75d1d57f5578c06af9f8ab21d86f00e7cda8388744b320
              CustomOrigin: Managed by machine-config-operator
                   Version: 43.81.201911181553.0 (2019-11-18T15:58:44Z)

  ostree://025d77d2680e3a2ce60b3b832106319fefb81cfefbf1bf26fcf4bd1aeae0712d
                   Version: 43.81.201911081536.0 (2019-11-08T15:41:32Z)
I1119 17:03:18.964162   97533 rpm-ostree.go:365] Running captured: journalctl --list-boots
I1119 17:03:18.976745   97533 daemon.go:785] journalctl --list-boots:
-2 9426aac3147844b8b0703f2a7e70fb1e Mon 2019-11-18 23:05:26 UTC—Mon 2019-11-18 23:07:16 UTC
-1 27c0bbf49686428091cefaaf6ee2e0ac Mon 2019-11-18 23:07:49 UTC—Tue 2019-11-19 15:50:46 UTC
 0 56fcd0cea1d3495fa43887d9ac004008 Tue 2019-11-19 15:51:22 UTC—Tue 2019-11-19 17:03:18 UTC
I1119 17:03:18.976772   97533 daemon.go:528] Starting MachineConfigDaemon
I1119 17:03:18.976874   97533 daemon.go:535] Enabling Kubelet Healthz Monitor
$ oc get mcp/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT
worker   rendered-worker-e63e8ae93003df690179be3266b9621e   True      False      False      3              3                   3                     0
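
Optionally, confirm that the force file was consumed; this check is not part of the original transcript, and the expected output assumes the fix deletes the file once used, as the bug title implies:

$ oc debug node/ip-10-0-132-107.ec2.internal -- chroot /host ls /run/machine-config-daemon-force
# expected: ls: cannot access '/run/machine-config-daemon-force': No such file or directory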

== Change /etc/test again and verify the MCP is degraded ==
$ oc get node
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-132-107.ec2.internal   Ready    worker   17h   v1.16.2
ip-10-0-143-180.ec2.internal   Ready    master   18h   v1.16.2
ip-10-0-155-243.ec2.internal   Ready    worker   17h   v1.16.2
ip-10-0-157-93.ec2.internal    Ready    master   18h   v1.16.2
ip-10-0-166-244.ec2.internal   Ready    master   18h   v1.16.2
ip-10-0-168-238.ec2.internal   Ready    worker   17h   v1.16.2
$ oc debug node/ip-10-0-132-107.ec2.internal
Starting pod/ip-10-0-132-107ec2internal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# vi /etc/test 
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT
master   rendered-master-6e7212ec4d57e884f64e32eabff065bb   True      False      False      3              3                   3                     0
worker   rendered-worker-e63e8ae93003df690179be3266b9621e   False     True       True       3              2                   2                     1

$ oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon --field-selector spec.nodeName=ip-10-0-132-107.ec2.internal
NAME                          READY   STATUS    RESTARTS   AGE
machine-config-daemon-k5w9h   2/2     Running   0          3m25s
$ oc -n openshift-machine-config-operator delete pod/machine-config-daemon-k5w9h
pod "machine-config-daemon-k5w9h" deleted
$ oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon --field-selector spec.nodeName=ip-10-0-132-107.ec2.internal
NAME                          READY   STATUS    RESTARTS   AGE
machine-config-daemon-jbw99   2/2     Running   0          5s
$ oc -n openshift-machine-config-operator log machine-config-daemon-jbw99 -c machine-config-daemon
...snip...
E1119 17:07:17.742472  102988 writer.go:130] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-e63e8ae93003df690179be3266b9621e
I1119 17:07:25.760929  102988 daemon.go:724] Current+desired config: rendered-worker-e63e8ae93003df690179be3266b9621e
I1119 17:07:25.767167  102988 daemon.go:958] Validating against current config rendered-worker-e63e8ae93003df690179be3266b9621e
E1119 17:07:25.767269  102988 daemon.go:1350] content mismatch for file /etc/test: server foo.example.net maxdelay 0.4 offline
server bar.example.net maxdelay 0.4 offline


A: MY CHANGES HERE
SECOND CHANGES HERE
server baz.example.net maxdelay 0.4 offline


B: server baz.example.net maxdelay 0.4 offline


E1119 17:07:25.767294  102988 writer.go:130] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-e63e8ae93003df690179be3266b9621e

Comment 5 errata-xmlrpc 2020-01-23 11:09:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

