Description of problem:

Nodes on an OpenShift 4.3 cluster are being continually rebooted. The reboots are caused by the machineconfig being applied to the nodes. Extract from the "machine-config-daemon-2mfbb" log:

~~~
I0515 10:23:54.059595    2287 start.go:74] Version: v4.3.14-202004200457-dirty (f6d1fe753cbcecb3aa1c2d3d3edd4a5d04ffca54)
I0515 10:23:54.068432    2287 start.go:84] Calling chroot("/rootfs")
I0515 10:23:54.069495    2287 rpm-ostree.go:366] Running captured: rpm-ostree status --json
I0515 10:23:54.252483    2287 daemon.go:209] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4cd521fb34c0d362205a1e55ad8c9c8dd6c7365b71a357ef705692ed80f7b112 (43.81.202004280317.0)
...output suppressed...
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4cd521fb34c0d362205a1e55ad8c9c8dd6c7365b71a357ef705692ed80f7b112
              CustomOrigin: Managed by machine-config-operator
                   Version: 43.81.202004280317.0 (2020-04-28T03:22:44Z)

  ostree://624bfc39e8091d69b3c48bb16e85683ff5166dc8b03ac0686753d8e555613b54
                   Version: 43.81.202003191953.0 (2020-03-19T19:59:17Z)
I0515 10:23:54.572041    2287 rpm-ostree.go:366] Running captured: journalctl --list-boots
I0515 10:23:57.110763    2287 daemon.go:785] journalctl --list-boots:
-126 2a4ac46b42b543ab9eef1617a3a4c161 Sat 2020-05-09 23:08:27 UTC—Sat 2020-05-09 23:08:39 UTC
-125 d8315c36fdfe4c688db9bcca4d8f2ee3 Sat 2020-05-09 23:08:56 UTC—Sun 2020-05-10 00:55:50 UTC
-124 09d0d5845b8f4a279a8233684f9d7343 Sun 2020-05-10 00:56:08 UTC—Sun 2020-05-10 01:09:12 UTC
...output suppressed...
  -2 82d07edfd9a84a24878493c5d31993be Fri 2020-05-15 07:29:30 UTC—Fri 2020-05-15 10:11:45 UTC
  -1 42f374f40fc643e2a71e868da52a5d8d Fri 2020-05-15 10:12:01 UTC—Fri 2020-05-15 10:23:24 UTC
   0 b7fae3fd0a3b4e9da1f6a5dabe318bdd Fri 2020-05-15 10:23:41 UTC—Fri 2020-05-15 10:23:57 UTC
...output suppressed...
I0515 10:24:04.196965    2287 daemon.go:731] Current config: rendered-worker-e09847e8fddf3cca3a3cfc89a033eee6
I0515 10:24:04.196986    2287 daemon.go:732] Desired config: rendered-worker-ed86a1b4000a8548a66c6f2ac521aa73
I0515 10:24:04.205520    2287 update.go:1051] Disk currentConfig rendered-worker-ed86a1b4000a8548a66c6f2ac521aa73 overrides node annotation rendered-worker-e09847e8fddf3cca3a3cfc89a033eee6
I0515 10:24:04.208700    2287 daemon.go:955] Validating against pending config rendered-worker-ed86a1b4000a8548a66c6f2ac521aa73
I0515 10:24:04.211216    2287 daemon.go:971] Validated on-disk state
I0515 10:24:04.224263    2287 daemon.go:1005] Completing pending config rendered-worker-ed86a1b4000a8548a66c6f2ac521aa73
I0515 10:24:04.229678    2287 update.go:1051] completed update for config rendered-worker-ed86a1b4000a8548a66c6f2ac521aa73
I0515 10:24:04.232522    2287 daemon.go:1021] In desired config rendered-worker-ed86a1b4000a8548a66c6f2ac521aa73
~~~

As the boot list shows, the node rebooted 126 times in a period of 5 days, and the same thing is happening to all the nodes in the cluster in turn.
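The "Current config" / "Desired config" pair in the log corresponds to the node annotations that the machine-config-daemon reconciles. As a minimal sketch, assuming a node manifest saved locally (the annotation keys are the real MCO ones; the rendered-config names are copied from the log above, and the `awk` helper is my own):

```shell
# On a live cluster the equivalent check would be something like:
#   oc get node <node-name> -o yaml | grep 'machineconfiguration.openshift.io/'
# Offline sketch against a saved node manifest:
cat > /tmp/node.yaml <<'EOF'
metadata:
  annotations:
    machineconfiguration.openshift.io/currentConfig: rendered-worker-e09847e8fddf3cca3a3cfc89a033eee6
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-ed86a1b4000a8548a66c6f2ac521aa73
EOF
current=$(awk -F': ' '/currentConfig:/ {print $2}' /tmp/node.yaml)
desired=$(awk -F': ' '/desiredConfig:/ {print $2}' /tmp/node.yaml)
# A mismatch means the node is mid-update and a reboot is pending or in progress
if [ "$current" != "$desired" ]; then
  echo "node is mid-update: $current -> $desired"
else
  echo "node is in sync"
fi
```

On a healthy cluster the two annotations converge once and stay equal; here they keep diverging because new rendered configs keep appearing.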
Investigating the issue, we found that the machineconfigpools are being regenerated, pushing new configurations to the nodes and then rebooting them (the output below was taken at a different moment, hence the mismatch between observedGeneration and the reboot count):

~~~
$ oc get mcp -o name | while read mcp; do echo "------------ $mcp -------------------"; oc get $mcp -oyaml | grep observed; done
------------ machineconfigpool.machineconfiguration.openshift.io/infra -------------------
    observedGeneration: 260
------------ machineconfigpool.machineconfiguration.openshift.io/master -------------------
    observedGeneration: 233
------------ machineconfigpool.machineconfiguration.openshift.io/worker -------------------
    observedGeneration: 261
~~~

Following this line of investigation we checked the machineconfigs and found two of them that are being continuously regenerated:

~~~
$ cat 03-machineconfigs.log | awk '/^Name:|Generation:/ {print $0}'
Name:         00-master
Generation:   3
Name:         00-worker
Generation:   3
...output suppressed...
Name:         99-master-a3e5f6eb-a1c2-4a80-8a5c-53ccc9f0b56e-registries
Generation:   147
Name:         99-worker-d747fbe5-23a7-4baf-bbe7-15798060d81d-registries
Generation:   175
...output suppressed...
~~~

I'm attaching the machineconfigs for your analysis. These two particular machineconfigs set the contents of the files "/etc/containers/registries.conf" and "/etc/containers/policy.json" with customer-specific content. These files are also set earlier by the default machineconfigs "01-master-container-runtime" and "01-worker-container-runtime".
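When rendered configs churn like this, diffing two consecutive rendered machineconfigs usually pinpoints which file keeps changing. On the cluster that would be an `oc get mc ... -o yaml` for the two names from the daemon log followed by a plain `diff`; the runnable part below is only a local illustration of the idea with made-up file fragments:

```shell
# On the cluster (names taken from the machine-config-daemon log above):
#   oc get mc rendered-worker-e09847e8fddf3cca3a3cfc89a033eee6 -o yaml > /tmp/old.yaml
#   oc get mc rendered-worker-ed86a1b4000a8548a66c6f2ac521aa73 -o yaml > /tmp/new.yaml
# Local illustration with made-up contents standing in for the real rendered configs:
printf '      - path: /etc/containers/registries.conf\n        contents: data:,A\n' > /tmp/old.yaml
printf '      - path: /etc/containers/registries.conf\n        contents: data:,B\n' > /tmp/new.yaml
diff /tmp/old.yaml /tmp/new.yaml || true   # non-zero exit from diff just means the files differ
```

If the only hunk that changes between generations is the registries/policy file content, that confirms the regeneration is driven by whatever feeds those files.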
The only interesting thing that I found about these two machineconfigs is that they carry this annotation:

~~~
Name:         99-master-a3e5f6eb-a1c2-4a80-8a5c-53ccc9f0b56e-registries
Annotations:  machineconfiguration.openshift.io/generated-by-controller-version: f6d1fe753cbcecb3aa1c2d3d3edd4a5d04ffca54

Name:         99-worker-d747fbe5-23a7-4baf-bbe7-15798060d81d-registries
Annotations:  machineconfiguration.openshift.io/generated-by-controller-version: f6d1fe753cbcecb3aa1c2d3d3edd4a5d04ffca54
~~~

which makes me think that they are probably regenerated by the machine-config-operator.

Version-Release number of selected component (if applicable):
OpenShift 4.3.18

How reproducible:
Not tested in lab

Steps to Reproduce:
1.
2.
3.

Actual results:
Nodes continuously reboot.

Expected results:
Nodes stay stable.

Additional info:
The cluster was recently installed. We stopped an external actor: for GitOps the customer is using ArgoCD, but the nodes still reboot after that. The customer has just removed the configurations they had added, which are these:

99-worker-z-container-registry-conf
99-master-z-container-registry-conf

Interestingly enough, these two machineconfigs are not the ones being regenerated, which makes me think that somehow they are triggering the regeneration of the ones mentioned above. The result of this deletion is going to be reviewed next Monday.
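For context, a custom machineconfig that overrides /etc/containers/registries.conf generally has the shape sketched below. This is illustrative only, assuming the usual Ignition 2.2 layout for OCP 4.3: the file contents are a placeholder, not the customer's actual data. The relevant mechanism is that the MCO folds any such object into a new rendered config, and every new rendered config drains and reboots the pool's nodes:

~~~
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  # Name of one of the customer objects mentioned above
  name: 99-worker-z-container-registry-conf
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 2.2.0          # Ignition spec version used by OCP 4.3
    storage:
      files:
        - path: /etc/containers/registries.conf
          filesystem: root
          mode: 0644
          contents:
            # Placeholder URL-encoded data; the real file held
            # customer-specific registry configuration
            source: data:,
~~~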
Can you please attach a must gather for the cluster?
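For reference, the usual collection command (needs cluster-admin; `--dest-dir` is optional):

~~~
oc adm must-gather --dest-dir=./must-gather
~~~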
This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1809007

In this case, we'd advise upgrading to 4.3.19 or higher to pick up the fix before applying the changes. Reassigning to the Node team to verify against the must-gather and analysis.
*** This bug has been marked as a duplicate of bug 1809007 ***