Description of problem:
MCO added a "force overwrite" file to allow people to fix things when they know they want to synchronize to the current state. However, persisting it on disk by default thereafter wasn't explicitly intended - it silently masks errors. The flag works much better as a "one-time" intervention.

Version-Release number of selected component (if applicable):
>= 4.1

How reproducible:
Always

Steps to Reproduce:
1. Create a skew in configs (e.g. certs recovery)
2. Fix the above using the force file (/run/machine-config-daemon-force)
3. Re-create the skew and notice the error is not bubbled up

Actual results:
Errors are shadowed.

Expected results:
Errors are properly bubbled up.

Additional info:
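For illustration only, here is a minimal local sketch (hypothetical temp files, not the MCD code itself) of the kind of on-disk content comparison the daemon performs; a persisted force file suppresses exactly this class of mismatch report:

```shell
# Hedged sketch, not MCD code: compare the rendered config's expected file
# content against what is actually on disk, as the daemon conceptually does.
expected=$(mktemp)
actual=$(mktemp)
printf 'server foo.example.net maxdelay 0.4 offline\n' > "$expected"
printf 'server foo.example.net maxdelay 0.4 offline\nMY CHANGES HERE\n' > "$actual"
if ! diff -u "$expected" "$actual"; then
  echo "content mismatch detected"   # this is what should bubble up as Degraded
fi
rm -f "$expected" "$actual"
```

With a persisted force file, the daemon skips this check on every restart, so the mismatch above would never surface.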
Any updates on this? We have 2 backports stalled waiting on it.
Verified on 4.3.0-0.nightly-2019-11-18-175710

== Apply the following MC to have the file managed by MCO ==

$ cat file.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"machineconfiguration.openshift.io/v1","kind":"MachineConfig","metadata":{"annotations":{},"labels":{"machineconfiguration.openshift.io/role":"worker"},"name":"test-file"},"spec":{"config":{"ignition":{"version":"2.2.0"},"storage":{"files":[{"contents":{"source":"data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK"},"filesystem":"root","mode":420,"path":"/etc/test"}]}}}}
  creationTimestamp: "2019-11-19T15:42:50Z"
  generation: 1
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
  resourceVersion: "286928"
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigs/test-file
  uid: 9c0fd040-c96e-4fd1-bfa4-122551bf0f8c
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        filesystem: root
        mode: 420
        path: /etc/test

$ oc apply -f file.yaml

== Edit /etc/test, which is now managed by MCO ==

$ oc get node
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-132-107.ec2.internal   Ready    worker   17h   v1.16.2
ip-10-0-143-180.ec2.internal   Ready    master   17h   v1.16.2
ip-10-0-155-243.ec2.internal   Ready    worker   17h   v1.16.2
ip-10-0-157-93.ec2.internal    Ready    master   17h   v1.16.2
ip-10-0-166-244.ec2.internal   Ready    master   17h   v1.16.2
ip-10-0-168-238.ec2.internal   Ready    worker   17h   v1.16.2

$ oc debug node/ip-10-0-132-107.ec2.internal
Starting pod/ip-10-0-132-107ec2internal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# vi /etc/test
sh-4.4# cat /etc/test
server foo.example.net maxdelay 0.4 offline
server bar.example.net maxdelay 0.4 offline
MY CHANGES HERE
server baz.example.net maxdelay 0.4 offline
sh-4.4# exit
exit
sh-4.2# exit
exit
Removing debug pod ...

== Restart MCD pod on that node ==

$ oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon --field-selector spec.nodeName=ip-10-0-132-107.ec2.internal
NAME                          READY   STATUS    RESTARTS   AGE
machine-config-daemon-pffnf   2/2     Running   3          17h

$ oc -n openshift-machine-config-operator log machine-config-daemon-pffnf -c machine-config-daemon
I1119 15:51:41.294634 2141 start.go:74] Version: v4.3.0-201911161914-dirty (1c4436a0952ff43315da8d9a4aa1b5841192aeb1)
I1119 15:51:41.358299 2141 start.go:84] Calling chroot("/rootfs")
I1119 15:51:41.360123 2141 rpm-ostree.go:365] Running captured: rpm-ostree status --json
I1119 15:51:43.252332 2141 daemon.go:209] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eeb7352d393d6aafcb75d1d57f5578c06af9f8ab21d86f00e7cda8388744b320 (43.81.201911181553.0)
I1119 15:51:43.320652 2141 metrics.go:106] Registering Prometheus metrics
I1119 15:51:43.320872 2141 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
I1119 15:51:43.356798 2141 update.go:1050] Starting to manage node: ip-10-0-132-107.ec2.internal
I1119 15:51:43.386440 2141 rpm-ostree.go:365] Running captured: rpm-ostree status
I1119 15:51:43.873395 2141 daemon.go:778] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eeb7352d393d6aafcb75d1d57f5578c06af9f8ab21d86f00e7cda8388744b320
    CustomOrigin: Managed by machine-config-operator
    Version: 43.81.201911181553.0 (2019-11-18T15:58:44Z)

  ostree://025d77d2680e3a2ce60b3b832106319fefb81cfefbf1bf26fcf4bd1aeae0712d
    Version: 43.81.201911081536.0 (2019-11-08T15:41:32Z)

I1119 15:51:43.873421 2141 rpm-ostree.go:365] Running captured: journalctl --list-boots
I1119 15:51:43.890836 2141 daemon.go:785] journalctl --list-boots:
-2 9426aac3147844b8b0703f2a7e70fb1e Mon 2019-11-18 23:05:26 UTC—Mon 2019-11-18 23:07:16 UTC
-1 27c0bbf49686428091cefaaf6ee2e0ac Mon 2019-11-18 23:07:49 UTC—Tue 2019-11-19 15:50:46 UTC
 0 56fcd0cea1d3495fa43887d9ac004008 Tue 2019-11-19 15:51:22 UTC—Tue 2019-11-19 15:51:43 UTC
I1119 15:51:43.890857 2141 daemon.go:528] Starting MachineConfigDaemon
I1119 15:51:43.890966 2141 daemon.go:535] Enabling Kubelet Healthz Monitor
E1119 15:51:47.483423 2141 reflector.go:123] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E1119 15:51:47.483436 2141 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
I1119 15:51:49.939436 2141 daemon.go:731] Current config: rendered-worker-a071bd1866b9d81e64e2b56318138074
I1119 15:51:49.939462 2141 daemon.go:732] Desired config: rendered-worker-e63e8ae93003df690179be3266b9621e
I1119 15:51:49.952134 2141 update.go:1050] Disk currentConfig rendered-worker-e63e8ae93003df690179be3266b9621e overrides node annotation rendered-worker-a071bd1866b9d81e64e2b56318138074
I1119 15:51:49.956574 2141 daemon.go:955] Validating against pending config rendered-worker-e63e8ae93003df690179be3266b9621e
I1119 15:51:49.970491 2141 daemon.go:971] Validated on-disk state
I1119 15:51:49.985618 2141 daemon.go:1005] Completing pending config rendered-worker-e63e8ae93003df690179be3266b9621e
I1119 15:51:49.993916 2141 update.go:1050] completed update for config rendered-worker-e63e8ae93003df690179be3266b9621e
I1119 15:51:49.999053 2141 daemon.go:1021] In desired config rendered-worker-e63e8ae93003df690179be3266b9621e

$ oc -n openshift-machine-config-operator delete pod machine-config-daemon-pffnf
pod "machine-config-daemon-pffnf" deleted

== Verify degraded state in MCD logs after it restarts ==

$ oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon --field-selector spec.nodeName=ip-10-0-132-107.ec2.internal
NAME                          READY   STATUS    RESTARTS   AGE
machine-config-daemon-4cfqh   2/2     Running   0          12m

$ oc -n openshift-machine-config-operator log machine-config-daemon-4cfqh -c machine-config-daemon
...snip...
E1119 16:54:09.246433 71013 writer.go:130] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-e63e8ae93003df690179be3266b9621e
I1119 16:55:09.262041 71013 daemon.go:724] Current+desired config: rendered-worker-e63e8ae93003df690179be3266b9621e
I1119 16:55:09.269002 71013 daemon.go:958] Validating against current config rendered-worker-e63e8ae93003df690179be3266b9621e
E1119 16:55:09.270049 71013 daemon.go:1350] content mismatch for file /etc/test:
server foo.example.net maxdelay 0.4 offline
server bar.example.net maxdelay 0.4 offline
A: MY CHANGES HERE
server baz.example.net maxdelay 0.4 offline
B: server baz.example.net maxdelay 0.4 offline
E1119 16:55:09.270077 71013 writer.go:130] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-e63e8ae93003df690179be3266b9621e

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT
master   rendered-master-6e7212ec4d57e884f64e32eabff065bb   True      False      False      3              3                   3                     0
worker   rendered-worker-e63e8ae93003df690179be3266b9621e   False     True       True       3              2                   2                     1

== Touch the force file /run/machine-config-daemon-force ==

$ oc get nodes
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-132-107.ec2.internal   Ready    worker   17h   v1.16.2
ip-10-0-143-180.ec2.internal   Ready    master   17h   v1.16.2
ip-10-0-155-243.ec2.internal   Ready    worker   17h   v1.16.2
ip-10-0-157-93.ec2.internal    Ready    master   17h   v1.16.2
ip-10-0-166-244.ec2.internal   Ready    master   17h   v1.16.2
ip-10-0-168-238.ec2.internal   Ready    worker   17h   v1.16.2

[mnguyen@pet30 4.3-nightly]$ ls
openshift-client-linux-4.3.0-0.nightly-2019-11-18-175710.tar.gz   README.md   testcluster
openshift-install   release.txt
openshift-install-linux-4.3.0-0.nightly-2019-11-18-175710.tar.gz   sha256sum.txt

[mnguyen@pet30 4.3-nightly]$ oc debug node/ip-10-0-132-107.ec2.internal
Starting pod/ip-10-0-132-107ec2internal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# cat /etc/test
server foo.example.net maxdelay 0.4 offline
server bar.example.net maxdelay 0.4 offline
MY CHANGES HERE
server baz.example.net maxdelay 0.4 offline
sh-4.4# touch /run/machine-config-daemon-force
sh-4.4# ls /run/
NetworkManager containers dmeventd-server issue.d mdadm ostree-booted samba sudo
agetty.reload crio faillock kubernetes media podman secrets systemd
auditd.pid criu gluster lock mount rpc.statd.pid sepermit tmpfiles.d
chrony cryptsetup ignition.env log multus rpcbind setrans udev
console dbus initctl lvm openshift-sdn rpcbind.sock sshd.pid user
console-login-helper-messages dmeventd-client initramfs machine-config-daemon-force openvswitch runc sssd.pid utmp
sh-4.4# exit
exit
sh-4.2# exit
exit
Removing debug pod ...
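For reference, the base64 data URL embedded in the test MachineConfig applied earlier can be decoded locally to see the exact on-disk content of /etc/test that the daemon validates against:

```shell
# Decode the data URL payload from the test MachineConfig; this is the
# expected content of /etc/test, so any manual edit produces a mismatch.
echo 'c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK' | base64 -d
# → server foo.example.net maxdelay 0.4 offline
#   server bar.example.net maxdelay 0.4 offline
#   server baz.example.net maxdelay 0.4 offline
```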
== Restart MCD on node again ==

$ oc -n openshift-machine-config-operator delete pod/machine-config-daemon-4cfqh
pod "machine-config-daemon-4cfqh" deleted

== Verify that MCPs are no longer degraded and there are no error messages in MCD logs ==

$ oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon --field-selector spec.nodeName=ip-10-0-132-107.ec2.internal
NAME                          READY   STATUS    RESTARTS   AGE
machine-config-daemon-k5w9h   2/2     Running   0          9s

$ oc -n openshift-machine-config-operator log machine-config-daemon-k5w9h -c machine-config-daemon
I1119 17:03:18.700477 97533 start.go:74] Version: v4.3.0-201911161914-dirty (1c4436a0952ff43315da8d9a4aa1b5841192aeb1)
I1119 17:03:18.706312 97533 start.go:84] Calling chroot("/rootfs")
I1119 17:03:18.706418 97533 rpm-ostree.go:365] Running captured: rpm-ostree status --json
I1119 17:03:18.877078 97533 daemon.go:209] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eeb7352d393d6aafcb75d1d57f5578c06af9f8ab21d86f00e7cda8388744b320 (43.81.201911181553.0)
I1119 17:03:18.879443 97533 metrics.go:106] Registering Prometheus metrics
I1119 17:03:18.879586 97533 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
I1119 17:03:18.881125 97533 update.go:1050] Starting to manage node: ip-10-0-132-107.ec2.internal
I1119 17:03:18.886955 97533 rpm-ostree.go:365] Running captured: rpm-ostree status
I1119 17:03:18.964138 97533 daemon.go:778] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eeb7352d393d6aafcb75d1d57f5578c06af9f8ab21d86f00e7cda8388744b320
    CustomOrigin: Managed by machine-config-operator
    Version: 43.81.201911181553.0 (2019-11-18T15:58:44Z)

  ostree://025d77d2680e3a2ce60b3b832106319fefb81cfefbf1bf26fcf4bd1aeae0712d
    Version: 43.81.201911081536.0 (2019-11-08T15:41:32Z)

I1119 17:03:18.964162 97533 rpm-ostree.go:365] Running captured: journalctl --list-boots
I1119 17:03:18.976745 97533 daemon.go:785] journalctl --list-boots:
-2 9426aac3147844b8b0703f2a7e70fb1e Mon 2019-11-18 23:05:26 UTC—Mon 2019-11-18 23:07:16 UTC
-1 27c0bbf49686428091cefaaf6ee2e0ac Mon 2019-11-18 23:07:49 UTC—Tue 2019-11-19 15:50:46 UTC
 0 56fcd0cea1d3495fa43887d9ac004008 Tue 2019-11-19 15:51:22 UTC—Tue 2019-11-19 17:03:18 UTC
I1119 17:03:18.976772 97533 daemon.go:528] Starting MachineConfigDaemon
I1119 17:03:18.976874 97533 daemon.go:535] Enabling Kubelet Healthz Monitor

$ oc get mcp/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT
worker   rendered-worker-e63e8ae93003df690179be3266b9621e   True      False      False      3              3                   3                     0

== Change /etc/test again and verify the MCP is degraded ==

$ oc get node
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-132-107.ec2.internal   Ready    worker   17h   v1.16.2
ip-10-0-143-180.ec2.internal   Ready    master   18h   v1.16.2
ip-10-0-155-243.ec2.internal   Ready    worker   17h   v1.16.2
ip-10-0-157-93.ec2.internal    Ready    master   18h   v1.16.2
ip-10-0-166-244.ec2.internal   Ready    master   18h   v1.16.2
ip-10-0-168-238.ec2.internal   Ready    worker   17h   v1.16.2

$ oc debug node/ip-10-0-132-107.ec2.internal
Starting pod/ip-10-0-132-107ec2internal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# vi /etc/test
sh-4.4# exit
exit
sh-4.2# exit
exit
Removing debug pod ...
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT
master   rendered-master-6e7212ec4d57e884f64e32eabff065bb   True      False      False      3              3                   3                     0
worker   rendered-worker-e63e8ae93003df690179be3266b9621e   False     True       True       3              2                   2                     1

$ oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon --field-selector spec.nodeName=ip-10-0-132-107.ec2.internal
NAME                          READY   STATUS    RESTARTS   AGE
machine-config-daemon-k5w9h   2/2     Running   0          3m25s

$ oc -n openshift-machine-config-operator delete pod/machine-config-daemon-k5w9h
pod "machine-config-daemon-k5w9h" deleted

$ oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon --field-selector spec.nodeName=ip-10-0-132-107.ec2.internal
NAME                          READY   STATUS    RESTARTS   AGE
machine-config-daemon-jbw99   2/2     Running   0          5s

$ oc -n openshift-machine-config-operator log machine-config-daemon-jbw99 -c machine-config-daemon
...snip...
E1119 17:07:17.742472 102988 writer.go:130] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-e63e8ae93003df690179be3266b9621e
I1119 17:07:25.760929 102988 daemon.go:724] Current+desired config: rendered-worker-e63e8ae93003df690179be3266b9621e
I1119 17:07:25.767167 102988 daemon.go:958] Validating against current config rendered-worker-e63e8ae93003df690179be3266b9621e
E1119 17:07:25.767269 102988 daemon.go:1350] content mismatch for file /etc/test:
server foo.example.net maxdelay 0.4 offline
server bar.example.net maxdelay 0.4 offline
A: MY CHANGES HERE
SECOND CHANGES HERE
server baz.example.net maxdelay 0.4 offline
B: server baz.example.net maxdelay 0.4 offline
E1119 17:07:25.767294 102988 writer.go:130] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-e63e8ae93003df690179be3266b9621e
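The fixed behavior verified above amounts to "consume the force file once, then remove it". A minimal local sketch of that one-time semantic (demo path /tmp/mcd-force-demo is hypothetical; this is not the MCD implementation):

```shell
# Hedged sketch of one-time force-flag semantics, not MCD code: the flag
# skips validation once and is then deleted, so a later skew degrades again.
force=/tmp/mcd-force-demo
touch "$force"
if [ -e "$force" ]; then
  echo "force file present: skipping on-disk validation this run"
  rm -f "$force"   # the fix: consume the flag instead of persisting it
fi
if [ -e "$force" ]; then
  echo "force file still present: the next mismatch would be masked"
else
  echo "force file consumed: the next mismatch will mark the node Degraded"
fi
```

This matches the transcript: the first MCD restart after touching /run/machine-config-daemon-force clears the degraded state, and a second edit to /etc/test degrades the worker pool again instead of being silently masked.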
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062