Description of problem:

The OpenShift IPI installation never finishes. The MCO shows as degraded right after installation with:

machineconfig.machineconfiguration.openshift.io "rendered-master-35c59c4037a4dd2a9f32df2e363e1342" not found

I suppose this happens when the control plane switches over from the bootstrap node to the master nodes. I can only suspect that the MachineConfigs are regenerated rather than carried over from the bootstrap node, and that this leads to this kind of issue.

This is similar in nature to https://bugzilla.redhat.com/show_bug.cgi?id=1881213 and https://bugzilla.redhat.com/show_bug.cgi?id=1881057

I am still investigating what is causing this behavior in my environment. However, even if this is driven by invalid configuration, there should be an easy path to troubleshoot this. For administrators, there is no way to figure out why this is happening, as the IPI bootstrap node is deleted along with its storage after the bootstrap operation. As far as I can tell, there is no trace of the old rendered configuration inside the new cluster other than in the master journal.

The installation shows a degraded MachineConfigOperator:
~~~
[root@openshift-jumpserver-0 ~]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master                                                      False     True       True       3              0                   0                     3                      43m
worker   rendered-worker-161afd8d86be52a2d2aebb20cf7d42ee   True      False      False      2              2                   2                     0                      43m
[root@openshift-jumpserver-0 ~]# oc describe mcp master
Name:         master
Namespace:
Labels:       machineconfiguration.openshift.io/mco-built-in=
              operator.machineconfiguration.openshift.io/required-for-upgrade=
              pools.operator.machineconfiguration.openshift.io/master=
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool
Metadata:
  Creation Timestamp:  2021-07-27T09:13:01Z
  Generation:          2
  Managed Fields:
    API Version:  machineconfiguration.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:machineconfiguration.openshift.io/mco-built-in:
          f:operator.machineconfiguration.openshift.io/required-for-upgrade:
          f:pools.operator.machineconfiguration.openshift.io/master:
      f:spec:
        .:
        f:configuration:
        f:machineConfigSelector:
          .:
          f:matchLabels:
            .:
            f:machineconfiguration.openshift.io/role:
        f:nodeSelector:
          .:
          f:matchLabels:
            .:
            f:node-role.kubernetes.io/master:
        f:paused:
    Manager:      machine-config-operator
    Operation:    Update
    Time:         2021-07-27T09:13:01Z
    API Version:  machineconfiguration.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:configuration:
          f:name:
          f:source:
      f:status:
        .:
        f:conditions:
        f:configuration:
        f:degradedMachineCount:
        f:machineCount:
        f:observedGeneration:
        f:readyMachineCount:
        f:unavailableMachineCount:
        f:updatedMachineCount:
    Manager:         machine-config-controller
    Operation:       Update
    Time:            2021-07-27T09:14:01Z
  Resource Version:  9243
  Self Link:         /apis/machineconfiguration.openshift.io/v1/machineconfigpools/master
  UID:               2da9b58d-785d-4ea2-a3bd-cd9b10222fa6
Spec:
  Configuration:
    Name:  rendered-master-7dddd0f5f1ed6745c5e68ef190b4e1c3
    Source:
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-master
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-master-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-master-kubelet
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-installer-ignition-master
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-master-generated-registries
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-master-mtu
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-master-ssh
  Machine Config Selector:
    Match Labels:
      machineconfiguration.openshift.io/role:  master
  Node Selector:
    Match Labels:
      node-role.kubernetes.io/master:
  Paused:  false
Status:
  Conditions:
    Last Transition Time:  2021-07-27T09:13:56Z
    Message:
    Reason:
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2021-07-27T09:14:01Z
    Message:
    Reason:
    Status:                False
    Type:                  Updated
    Last Transition Time:  2021-07-27T09:14:01Z
    Message:               All nodes are updating to rendered-master-7dddd0f5f1ed6745c5e68ef190b4e1c3
    Reason:
    Status:                True
    Type:                  Updating
    Last Transition Time:  2021-07-27T09:14:01Z
    Message:               Node openshift-master-2 is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-35c59c4037a4dd2a9f32df2e363e1342\" not found", Node openshift-master-0 is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-35c59c4037a4dd2a9f32df2e363e1342\" not found", Node openshift-master-1 is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-35c59c4037a4dd2a9f32df2e363e1342\" not found"
    Reason:                3 nodes are reporting degraded status on sync
    Status:                True
    Type:                  NodeDegraded
    Last Transition Time:  2021-07-27T09:14:01Z
    Message:
    Reason:
    Status:                True
    Type:                  Degraded
  Configuration:
  Degraded Machine Count:     3
  Machine Count:              3
  Observed Generation:        2
  Ready Machine Count:        0
  Unavailable Machine Count:  3
  Updated Machine Count:      0
Events:                       <none>
~~~

Looking at one of the master daemon logs:
~~~
[root@openshift-jumpserver-0 ~]# oc get pods -o wide
NAME                                         READY   STATUS    RESTARTS   AGE     IP                NODE                 NOMINATED NODE   READINESS GATES
machine-config-controller-7d9bcdf859-27cmd   1/1     Running   0          36m     172.25.0.24       openshift-master-2   <none>           <none>
machine-config-daemon-fbp6j                  2/2     Running   0          36m     192.168.123.202   openshift-master-2   <none>           <none>
machine-config-daemon-h8xjt                  2/2     Running   0          7m      192.168.123.220   openshift-worker-0   <none>           <none>
machine-config-daemon-t967q                  2/2     Running   0          3m57s   192.168.123.221   openshift-worker-1   <none>           <none>
machine-config-daemon-vdgmz                  2/2     Running   0          36m     192.168.123.200   openshift-master-0   <none>           <none>
machine-config-daemon-xwq8t                  2/2     Running   0          36m     192.168.123.201   openshift-master-1   <none>           <none>
machine-config-operator-699d8cf454-vvh47     1/1     Running   0          51m     172.24.0.12       openshift-master-1   <none>           <none>
machine-config-server-j9k89                  1/1     Running   0          36m     192.168.123.201   openshift-master-1   <none>           <none>
machine-config-server-rxdbv                  1/1     Running   0          36m     192.168.123.200   openshift-master-0   <none>           <none>
machine-config-server-vgmr9                  1/1     Running   0          36m     192.168.123.202   openshift-master-2   <none>           <none>
[root@openshift-jumpserver-0 ~]# oc logs -f machine-config-daemon-vdgmz
error: a container name must be specified for pod machine-config-daemon-vdgmz, choose one of: [machine-config-daemon oauth-proxy]
[root@openshift-jumpserver-0 ~]# oc logs -f machine-config-daemon-vdgmz -c machine-config-daemon
I0727 09:13:34.260662 11570 start.go:108] Version: v4.7.0-202105111858.p0-dirty (e3863b02b7403342cdf0f981889e8c3cfc2d86bb)
I0727 09:13:34.265089 11570 start.go:121] Calling chroot("/rootfs")
I0727 09:13:34.265302 11570 rpm-ostree.go:261] Running captured: rpm-ostree status --json
I0727 09:13:34.632123 11570 daemon.go:218] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e850d731de8eb871c5ec632e3e750e7e82e03e61c82b056f5c2b8af200e15a4d (47.83.202105111846-0)
I0727 09:13:34.735061 11570 start.go:97] Copied self to /run/bin/machine-config-daemon on host
I0727 09:13:34.736775 11570 metrics.go:105] Registering Prometheus metrics
I0727 09:13:34.736917 11570 metrics.go:110] Starting metrics listener on 127.0.0.1:8797
I0727 09:13:34.738590 11570 update.go:1904] Starting to manage node: openshift-master-0
I0727 09:13:34.748798 11570 rpm-ostree.go:261] Running captured: rpm-ostree status
I0727 09:13:34.803891 11570 daemon.go:849] State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e850d731de8eb871c5ec632e3e750e7e82e03e61c82b056f5c2b8af200e15a4d
              CustomOrigin: Managed by machine-config-operator
                   Version: 47.83.202105111846-0 (2021-05-11T18:49:55Z)

  ostree://3fdd1488024f054e39b1be508781d535d1ac7ed423bb3b4b656c2f345934220d
                   Version: 47.83.202103251640-0 (2021-03-25T16:44:03Z)
I0727 09:13:34.803995 11570 rpm-ostree.go:261] Running captured: journalctl --list-boots
I0727 09:13:34.810631 11570 daemon.go:856] journalctl --list-boots:
-1 2244dd692ce345f4be0bfc3522a8f0cf Tue 2021-07-27 09:08:18 UTC—Tue 2021-07-27 09:09:44 UTC
 0 76d333fa1287421eb069c3ea050e271e Tue 2021-07-27 09:09:53 UTC—Tue 2021-07-27 09:13:34 UTC
I0727 09:13:34.810728 11570 rpm-ostree.go:261] Running captured: systemctl list-units --state=failed --no-legend
I0727 09:13:34.819273 11570 daemon.go:871] systemd service state: OK
I0727 09:13:34.819342 11570 daemon.go:603] Starting MachineConfigDaemon
I0727 09:13:34.819478 11570 daemon.go:610] Enabling Kubelet Healthz Monitor
I0727 09:13:35.765623 11570 daemon.go:381] Node openshift-master-0 is part of the control plane
I0727 09:13:36.416662 11570 node.go:24] No machineconfiguration.openshift.io/currentConfig annotation on node openshift-master-0: map[k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac:0a:58:ac:1a:00:03 k8s.ovn.org/l3-gateway-config:{"default":{"mode":"shared","interface-id":"br-ex_openshift-master-0","mac-address":"52:54:00:68:7d:70","ip-addresses":["192.168.123.200/24"],"ip-address":"192.168.123.200/24","next-hops":["192.168.123.1"],"next-hop":"192.168.123.1","node-port-enable":"true","vlan-id":"0"}} k8s.ovn.org/node-chassis-id:6dd89371-92d8-4bfb-a1e8-9c24b4e4650b k8s.ovn.org/node-local-nat-ip:{"default":["169.254.15.58"]} k8s.ovn.org/node-mgmt-port-mac-address:e2:6d:58:dc:9e:b3 k8s.ovn.org/node-primary-ifaddr:{"ipv4":"192.168.123.200/24","ipv6":"fc00::5929:d49a:8c16:fef/64"} k8s.ovn.org/node-subnets:{"default":"172.26.0.0/23"} volumes.kubernetes.io/controller-managed-attach-detach:true], in cluster bootstrap, loading initial node annotation from /etc/machine-config-daemon/node-annotations.json
I0727 09:13:36.417226 11570 node.go:45] Setting initial node config: rendered-master-35c59c4037a4dd2a9f32df2e363e1342
I0727 09:13:36.464522 11570 daemon.go:767] In bootstrap mode
E0727 09:13:36.464642 11570 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-35c59c4037a4dd2a9f32df2e363e1342" not found
I0727 09:13:38.462718 11570 daemon.go:767] In bootstrap mode
E0727 09:13:38.462840 11570 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-35c59c4037a4dd2a9f32df2e363e1342" not found
I0727 09:13:54.489363 11570 daemon.go:767] In bootstrap mode
E0727 09:13:54.489418 11570 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-35c59c4037a4dd2a9f32df2e363e1342" not found
I0727 09:14:26.507564 11570 daemon.go:767] In bootstrap mode
E0727 09:14:26.507597 11570 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-35c59c4037a4dd2a9f32df2e363e1342" not found
I0727 09:15:26.524230 11570 daemon.go:767] In bootstrap mode
E0727 09:15:26.524298 11570 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-35c59c4037a4dd2a9f32df2e363e1342" not found
I0727 09:16:26.600918 11570 daemon.go:767] In bootstrap mode
E0727 09:16:26.601044 11570 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-35c59c4037a4dd2a9f32df2e363e1342" not found
I0727 09:17:26.622945 11570 daemon.go:767] In bootstrap mode
E0727 09:17:26.623036 11570 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-35c59c4037a4dd2a9f32df2e363e1342" not found
I0727 09:18:26.645998 11570 daemon.go:767] In bootstrap mode
E0727 09:18:26.646095 11570 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-35c59c4037a4dd2a9f32df2e363e1342" not found
E0727 09:18:45.526259 11570 writer.go:154] Error setting Degraded annotation for node openshift-master-0: unable to update node "&Node{ObjectMeta:{ 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},}": Patch "https://172.30.0.1:443/api/v1/nodes/openshift-master-0": http2: client connection lost
W0727 09:18:45.526347 11570 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0727 09:18:45.526381 11570 reflector.go:436] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
I0727 09:19:45.536336 11570 daemon.go:767] In bootstrap mode
E0727 09:19:45.536437 11570 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-35c59c4037a4dd2a9f32df2e363e1342" not found
I0727 09:20:45.555160 11570 daemon.go:767] In bootstrap mode
(...)
~~~

~~~
[root@openshift-jumpserver-0 ~]# oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          e3863b02b7403342cdf0f981889e8c3cfc2d86bb   3.2.0             33m
00-worker                                          e3863b02b7403342cdf0f981889e8c3cfc2d86bb   3.2.0             33m
01-master-container-runtime                        e3863b02b7403342cdf0f981889e8c3cfc2d86bb   3.2.0             33m
01-master-kubelet                                  e3863b02b7403342cdf0f981889e8c3cfc2d86bb   3.2.0             33m
01-worker-container-runtime                        e3863b02b7403342cdf0f981889e8c3cfc2d86bb   3.2.0             33m
01-worker-kubelet                                  e3863b02b7403342cdf0f981889e8c3cfc2d86bb   3.2.0             33m
99-installer-ignition-master                                                                  3.2.0             48m
99-installer-ignition-worker                                                                  3.2.0             48m
99-master-generated-registries                     e3863b02b7403342cdf0f981889e8c3cfc2d86bb   3.2.0             33m
99-master-mtu                                                                                 2.2.0             48m
99-master-ssh                                                                                 3.2.0             48m
99-worker-generated-registries                     e3863b02b7403342cdf0f981889e8c3cfc2d86bb   3.2.0             33m
99-worker-mtu                                                                                 2.2.0             48m
99-worker-ssh                                                                                 3.2.0             48m
rendered-master-7dddd0f5f1ed6745c5e68ef190b4e1c3   e3863b02b7403342cdf0f981889e8c3cfc2d86bb   3.2.0             33m
rendered-worker-161afd8d86be52a2d2aebb20cf7d42ee   e3863b02b7403342cdf0f981889e8c3cfc2d86bb   3.2.0             33m
~~~

On the other hand, when I look at a master node's journal, I can see the MachineConfig:
~~~
[root@openshift-master-0 ~]# journalctl | grep rendered-master-35c59c4037a4dd2a9f32df2e363e1342
Jul 27 09:09:20 openshift-master-0 machine-config-daemon[2431]: I0727 09:09:20.398278 2431 update.go:596] Checking Reconcilable for config mco-empty-mc to rendered-master-35c59c4037a4dd2a9f32df2e363e1342
Jul 27 09:09:20 openshift-master-0 machine-config-daemon[2431]: I0727 09:09:20.399853 2431 update.go:1904] Starting update from mco-empty-mc to rendered-master-35c59c4037a4dd2a9f32df2e363e1342: &{osUpdate:true kargs:false fips:false passwd:false files:false units:false kernelType:false extensions:false}
Jul 27 09:09:20 openshift-master-0 root[2459]: machine-config-daemon[2431]: Starting update from mco-empty-mc to rendered-master-35c59c4037a4dd2a9f32df2e363e1342: &{osUpdate:true kargs:false fips:false passwd:false files:false units:false kernelType:false extensions:false}
Jul 27 09:09:41 openshift-master-0 logger[2543]: rendered-master-35c59c4037a4dd2a9f32df2e363e1342
Jul 27 09:09:41 openshift-master-0 machine-config-daemon[2431]: I0727 09:09:41.597913 2431 update.go:1904] initiating reboot: Completing firstboot provisioning to rendered-master-35c59c4037a4dd2a9f32df2e363e1342
Jul 27 09:09:41 openshift-master-0 root[2545]: machine-config-daemon[2431]: initiating reboot: Completing firstboot provisioning to rendered-master-35c59c4037a4dd2a9f32df2e363e1342
Jul 27 09:09:41 openshift-master-0 systemd[1]: Started machine-config-daemon: Completing firstboot provisioning to rendered-master-35c59c4037a4dd2a9f32df2e363e1342.
Jul 27 09:09:41 openshift-master-0 systemd[1]: Stopping machine-config-daemon: Completing firstboot provisioning to rendered-master-35c59c4037a4dd2a9f32df2e363e1342...
Jul 27 09:09:41 openshift-master-0 systemd[1]: Stopped machine-config-daemon: Completing firstboot provisioning to rendered-master-35c59c4037a4dd2a9f32df2e363e1342.
~~~

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
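
The comparison made by hand above (rendered config names mentioned in the daemon log versus the names listed by `oc get mc`) could be scripted roughly as follows. This is only a sketch under assumptions: `missing_rendered_configs` is a hypothetical helper name, both inputs are assumed to have been saved to files beforehand, and the pattern assumes rendered names look like the ones in this report (`rendered-<pool>-<32 hex digits>` with a lowercase pool name).

```shell
# Hypothetical helper (not part of oc or the MCO): print every rendered-*
# MachineConfig name that appears in a log file but is missing from a list
# of MachineConfig names that exist in the cluster.
#
#   $1: file with one existing MachineConfig name per line
#   $2: file with machine-config-daemon log or journal output
missing_rendered_configs() {
  # Collect the distinct rendered config names mentioned in the log ...
  grep -o 'rendered-[a-z]*-[0-9a-f]\{32\}' "$2" | sort -u |
    while read -r name; do
      # ... and report the ones with no exact match in the list.
      grep -qxF -- "$name" "$1" || printf '%s\n' "$name"
    done
}

# Possible way to produce the two input files (file names are examples):
#   oc get mc -o name | sed 's|.*/||' > existing-mc.txt
#   oc logs machine-config-daemon-vdgmz -c machine-config-daemon > mcd.log
#   missing_rendered_configs existing-mc.txt mcd.log
```

Any name it prints is referenced by a node but does not exist as a MachineConfig object, which is the situation shown in the logs above.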
I think I found the culprit. I am pushing a password for the core user via Ignition:
~~~
password=$(ansible -m debug -a msg="{{ 'redhat' | password_hash('sha512') }}" localhost | awk '/msg/ {print $NF}' | sed 's/^"//' | sed 's/"$//')
for type in master worker ; do
  cat /root/openshift-install/${type}.ign | jq '. += {"passwd" : { "users" : [ { "name": "core", "passwordHash": "'$(echo -n $password)'"}]}}' | tee /root/openshift-install/${type}.ign
done
~~~

Down the road, this then leads to the issues with the MachineConfigOperator. I still believe that this is a bug because:
a) changing the password with Ignition for early deployment troubleshooting makes sense, in my opinion
b) if the MCO later decides to override this configuration, it should do so gracefully; it should not fail
c) the same issue bites us in the aforementioned Bugzillas and will also bite us in other situations (whenever a conflict occurs between Ignition and a later MachineConfig?)