Bug 1986344 - OpenShift IPI, MCO shows as degraded right after installation, machineconfig.machineconfiguration.openshift.io \"rendered-master-35c59c4037a4dd2a9f32df2e363e1342\" not found
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Yu Qi Zhang
QA Contact: Rio Liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-07-27 10:17 UTC by Andreas Karis
Modified: 2021-11-08 17:39 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-08 17:39:08 UTC
Target Upstream Version:
Embargoed:



Description Andreas Karis 2021-07-27 10:17:30 UTC
Description of problem:

The OpenShift IPI installation never finishes. Right after installation, the MCO shows as degraded with: machineconfig.machineconfiguration.openshift.io "rendered-master-35c59c4037a4dd2a9f32df2e363e1342" not found

I suspect this happens when the control plane switches over from the bootstrap node to the master nodes. My guess is that the MachineConfigs are regenerated rather than carried over from the bootstrap node, so the rendered config that the masters were provisioned with no longer exists in the new cluster.

This is similar in nature to https://bugzilla.redhat.com/show_bug.cgi?id=1881213 and https://bugzilla.redhat.com/show_bug.cgi?id=1881057

I am still investigating what is causing this behavior in my environment. However, even if it is caused by an invalid configuration, there should be an easy way to troubleshoot it. Administrators have no way to figure out why this is happening, because the IPI bootstrap node is deleted along with its storage after the bootstrap operation. As far as I can tell, the only trace of the old rendered configuration inside the new cluster is in the master nodes' journals.

After installation, the master MachineConfigPool is degraded:
~~~
[root@openshift-jumpserver-0 ~]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master                                                      False     True       True       3              0                   0                     3                      43m
worker   rendered-worker-161afd8d86be52a2d2aebb20cf7d42ee   True      False      False      2              2                   2                     0                      43m
[root@openshift-jumpserver-0 ~]# oc describe mcp master
Name:         master
Namespace:    
Labels:       machineconfiguration.openshift.io/mco-built-in=
              operator.machineconfiguration.openshift.io/required-for-upgrade=
              pools.operator.machineconfiguration.openshift.io/master=
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool
Metadata:
  Creation Timestamp:  2021-07-27T09:13:01Z
  Generation:          2
  Managed Fields:
    API Version:  machineconfiguration.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:machineconfiguration.openshift.io/mco-built-in:
          f:operator.machineconfiguration.openshift.io/required-for-upgrade:
          f:pools.operator.machineconfiguration.openshift.io/master:
      f:spec:
        .:
        f:configuration:
        f:machineConfigSelector:
          .:
          f:matchLabels:
            .:
            f:machineconfiguration.openshift.io/role:
        f:nodeSelector:
          .:
          f:matchLabels:
            .:
            f:node-role.kubernetes.io/master:
        f:paused:
    Manager:      machine-config-operator
    Operation:    Update
    Time:         2021-07-27T09:13:01Z
    API Version:  machineconfiguration.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:configuration:
          f:name:
          f:source:
      f:status:
        .:
        f:conditions:
        f:configuration:
        f:degradedMachineCount:
        f:machineCount:
        f:observedGeneration:
        f:readyMachineCount:
        f:unavailableMachineCount:
        f:updatedMachineCount:
    Manager:         machine-config-controller
    Operation:       Update
    Time:            2021-07-27T09:14:01Z
  Resource Version:  9243
  Self Link:         /apis/machineconfiguration.openshift.io/v1/machineconfigpools/master
  UID:               2da9b58d-785d-4ea2-a3bd-cd9b10222fa6
Spec:
  Configuration:
    Name:  rendered-master-7dddd0f5f1ed6745c5e68ef190b4e1c3
    Source:
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-master
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-master-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-master-kubelet
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-installer-ignition-master
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-master-generated-registries
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-master-mtu
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-master-ssh
  Machine Config Selector:
    Match Labels:
      machineconfiguration.openshift.io/role:  master
  Node Selector:
    Match Labels:
      node-role.kubernetes.io/master:  
  Paused:                              false
Status:
  Conditions:
    Last Transition Time:  2021-07-27T09:13:56Z
    Message:               
    Reason:                
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2021-07-27T09:14:01Z
    Message:               
    Reason:                
    Status:                False
    Type:                  Updated
    Last Transition Time:  2021-07-27T09:14:01Z
    Message:               All nodes are updating to rendered-master-7dddd0f5f1ed6745c5e68ef190b4e1c3
    Reason:                
    Status:                True
    Type:                  Updating
    Last Transition Time:  2021-07-27T09:14:01Z
    Message:               Node openshift-master-2 is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-35c59c4037a4dd2a9f32df2e363e1342\" not found", Node openshift-master-0 is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-35c59c4037a4dd2a9f32df2e363e1342\" not found", Node openshift-master-1 is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-35c59c4037a4dd2a9f32df2e363e1342\" not found"
    Reason:                3 nodes are reporting degraded status on sync
    Status:                True
    Type:                  NodeDegraded
    Last Transition Time:  2021-07-27T09:14:01Z
    Message:               
    Reason:                
    Status:                True
    Type:                  Degraded
  Configuration:
  Degraded Machine Count:     3
  Machine Count:              3
  Observed Generation:        2
  Ready Machine Count:        0
  Unavailable Machine Count:  3
  Updated Machine Count:      0
Events:                       <none>
~~~

The machine-config-daemon log on one of the masters shows the daemon looking for a rendered config that it cannot find:
~~~
[root@openshift-jumpserver-0 ~]# oc get pods -o wide
NAME                                         READY   STATUS    RESTARTS   AGE     IP                NODE                 NOMINATED NODE   READINESS GATES
machine-config-controller-7d9bcdf859-27cmd   1/1     Running   0          36m     172.25.0.24       openshift-master-2   <none>           <none>
machine-config-daemon-fbp6j                  2/2     Running   0          36m     192.168.123.202   openshift-master-2   <none>           <none>
machine-config-daemon-h8xjt                  2/2     Running   0          7m      192.168.123.220   openshift-worker-0   <none>           <none>
machine-config-daemon-t967q                  2/2     Running   0          3m57s   192.168.123.221   openshift-worker-1   <none>           <none>
machine-config-daemon-vdgmz                  2/2     Running   0          36m     192.168.123.200   openshift-master-0   <none>           <none>
machine-config-daemon-xwq8t                  2/2     Running   0          36m     192.168.123.201   openshift-master-1   <none>           <none>
machine-config-operator-699d8cf454-vvh47     1/1     Running   0          51m     172.24.0.12       openshift-master-1   <none>           <none>
machine-config-server-j9k89                  1/1     Running   0          36m     192.168.123.201   openshift-master-1   <none>           <none>
machine-config-server-rxdbv                  1/1     Running   0          36m     192.168.123.200   openshift-master-0   <none>           <none>
machine-config-server-vgmr9                  1/1     Running   0          36m     192.168.123.202   openshift-master-2   <none>           <none>
[root@openshift-jumpserver-0 ~]# oc logs -f machine-config-daemon-vdgmz
error: a container name must be specified for pod machine-config-daemon-vdgmz, choose one of: [machine-config-daemon oauth-proxy]
[root@openshift-jumpserver-0 ~]# oc logs -f machine-config-daemon-vdgmz -c machine-config-daemon
I0727 09:13:34.260662   11570 start.go:108] Version: v4.7.0-202105111858.p0-dirty (e3863b02b7403342cdf0f981889e8c3cfc2d86bb)
I0727 09:13:34.265089   11570 start.go:121] Calling chroot("/rootfs")
I0727 09:13:34.265302   11570 rpm-ostree.go:261] Running captured: rpm-ostree status --json
I0727 09:13:34.632123   11570 daemon.go:218] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e850d731de8eb871c5ec632e3e750e7e82e03e61c82b056f5c2b8af200e15a4d (47.83.202105111846-0)
I0727 09:13:34.735061   11570 start.go:97] Copied self to /run/bin/machine-config-daemon on host
I0727 09:13:34.736775   11570 metrics.go:105] Registering Prometheus metrics
I0727 09:13:34.736917   11570 metrics.go:110] Starting metrics listener on 127.0.0.1:8797
I0727 09:13:34.738590   11570 update.go:1904] Starting to manage node: openshift-master-0
I0727 09:13:34.748798   11570 rpm-ostree.go:261] Running captured: rpm-ostree status
I0727 09:13:34.803891   11570 daemon.go:849] State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e850d731de8eb871c5ec632e3e750e7e82e03e61c82b056f5c2b8af200e15a4d
              CustomOrigin: Managed by machine-config-operator
                   Version: 47.83.202105111846-0 (2021-05-11T18:49:55Z)

  ostree://3fdd1488024f054e39b1be508781d535d1ac7ed423bb3b4b656c2f345934220d
                   Version: 47.83.202103251640-0 (2021-03-25T16:44:03Z)
I0727 09:13:34.803995   11570 rpm-ostree.go:261] Running captured: journalctl --list-boots
I0727 09:13:34.810631   11570 daemon.go:856] journalctl --list-boots:
-1 2244dd692ce345f4be0bfc3522a8f0cf Tue 2021-07-27 09:08:18 UTC—Tue 2021-07-27 09:09:44 UTC
 0 76d333fa1287421eb069c3ea050e271e Tue 2021-07-27 09:09:53 UTC—Tue 2021-07-27 09:13:34 UTC
I0727 09:13:34.810728   11570 rpm-ostree.go:261] Running captured: systemctl list-units --state=failed --no-legend
I0727 09:13:34.819273   11570 daemon.go:871] systemd service state: OK
I0727 09:13:34.819342   11570 daemon.go:603] Starting MachineConfigDaemon
I0727 09:13:34.819478   11570 daemon.go:610] Enabling Kubelet Healthz Monitor
I0727 09:13:35.765623   11570 daemon.go:381] Node openshift-master-0 is part of the control plane
I0727 09:13:36.416662   11570 node.go:24] No machineconfiguration.openshift.io/currentConfig annotation on node openshift-master-0: map[k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac:0a:58:ac:1a:00:03 k8s.ovn.org/l3-gateway-config:{"default":{"mode":"shared","interface-id":"br-ex_openshift-master-0","mac-address":"52:54:00:68:7d:70","ip-addresses":["192.168.123.200/24"],"ip-address":"192.168.123.200/24","next-hops":["192.168.123.1"],"next-hop":"192.168.123.1","node-port-enable":"true","vlan-id":"0"}} k8s.ovn.org/node-chassis-id:6dd89371-92d8-4bfb-a1e8-9c24b4e4650b k8s.ovn.org/node-local-nat-ip:{"default":["169.254.15.58"]} k8s.ovn.org/node-mgmt-port-mac-address:e2:6d:58:dc:9e:b3 k8s.ovn.org/node-primary-ifaddr:{"ipv4":"192.168.123.200/24","ipv6":"fc00::5929:d49a:8c16:fef/64"} k8s.ovn.org/node-subnets:{"default":"172.26.0.0/23"} volumes.kubernetes.io/controller-managed-attach-detach:true], in cluster bootstrap, loading initial node annotation from /etc/machine-config-daemon/node-annotations.json
I0727 09:13:36.417226   11570 node.go:45] Setting initial node config: rendered-master-35c59c4037a4dd2a9f32df2e363e1342
I0727 09:13:36.464522   11570 daemon.go:767] In bootstrap mode
E0727 09:13:36.464642   11570 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-35c59c4037a4dd2a9f32df2e363e1342" not found
I0727 09:13:38.462718   11570 daemon.go:767] In bootstrap mode
E0727 09:13:38.462840   11570 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-35c59c4037a4dd2a9f32df2e363e1342" not found
I0727 09:13:54.489363   11570 daemon.go:767] In bootstrap mode
E0727 09:13:54.489418   11570 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-35c59c4037a4dd2a9f32df2e363e1342" not found
I0727 09:14:26.507564   11570 daemon.go:767] In bootstrap mode
E0727 09:14:26.507597   11570 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-35c59c4037a4dd2a9f32df2e363e1342" not found
I0727 09:15:26.524230   11570 daemon.go:767] In bootstrap mode
E0727 09:15:26.524298   11570 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-35c59c4037a4dd2a9f32df2e363e1342" not found
I0727 09:16:26.600918   11570 daemon.go:767] In bootstrap mode
E0727 09:16:26.601044   11570 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-35c59c4037a4dd2a9f32df2e363e1342" not found
I0727 09:17:26.622945   11570 daemon.go:767] In bootstrap mode
E0727 09:17:26.623036   11570 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-35c59c4037a4dd2a9f32df2e363e1342" not found
I0727 09:18:26.645998   11570 daemon.go:767] In bootstrap mode
E0727 09:18:26.646095   11570 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-35c59c4037a4dd2a9f32df2e363e1342" not found
E0727 09:18:45.526259   11570 writer.go:154] Error setting Degraded annotation for node openshift-master-0: unable to update node "&Node{ObjectMeta:{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},}": Patch "https://172.30.0.1:443/api/v1/nodes/openshift-master-0": http2: client connection lost
W0727 09:18:45.526347   11570 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0727 09:18:45.526381   11570 reflector.go:436] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
I0727 09:19:45.536336   11570 daemon.go:767] In bootstrap mode
E0727 09:19:45.536437   11570 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-35c59c4037a4dd2a9f32df2e363e1342" not found
I0727 09:20:45.555160   11570 daemon.go:767] In bootstrap mode
(...)
~~~

The cluster only contains rendered-master-7dddd0f5f1ed6745c5e68ef190b4e1c3; rendered-master-35c59c4037a4dd2a9f32df2e363e1342 is not in the list of MachineConfigs:
~~~
[root@openshift-jumpserver-0 ~]# oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          e3863b02b7403342cdf0f981889e8c3cfc2d86bb   3.2.0             33m
00-worker                                          e3863b02b7403342cdf0f981889e8c3cfc2d86bb   3.2.0             33m
01-master-container-runtime                        e3863b02b7403342cdf0f981889e8c3cfc2d86bb   3.2.0             33m
01-master-kubelet                                  e3863b02b7403342cdf0f981889e8c3cfc2d86bb   3.2.0             33m
01-worker-container-runtime                        e3863b02b7403342cdf0f981889e8c3cfc2d86bb   3.2.0             33m
01-worker-kubelet                                  e3863b02b7403342cdf0f981889e8c3cfc2d86bb   3.2.0             33m
99-installer-ignition-master                                                                  3.2.0             48m
99-installer-ignition-worker                                                                  3.2.0             48m
99-master-generated-registries                     e3863b02b7403342cdf0f981889e8c3cfc2d86bb   3.2.0             33m
99-master-mtu                                                                                 2.2.0             48m
99-master-ssh                                                                                 3.2.0             48m
99-worker-generated-registries                     e3863b02b7403342cdf0f981889e8c3cfc2d86bb   3.2.0             33m
99-worker-mtu                                                                                 2.2.0             48m
99-worker-ssh                                                                                 3.2.0             48m
rendered-master-7dddd0f5f1ed6745c5e68ef190b4e1c3   e3863b02b7403342cdf0f981889e8c3cfc2d86bb   3.2.0             33m
rendered-worker-161afd8d86be52a2d2aebb20cf7d42ee   e3863b02b7403342cdf0f981889e8c3cfc2d86bb   3.2.0             33m
~~~

On the other hand, the journal of a master node shows that the node was provisioned at first boot with the missing MachineConfig:
~~~
[root@openshift-master-0 ~]# journalctl | grep rendered-master-35c59c4037a4dd2a9f32df2e363e1342
Jul 27 09:09:20 openshift-master-0 machine-config-daemon[2431]: I0727 09:09:20.398278    2431 update.go:596] Checking Reconcilable for config mco-empty-mc to rendered-master-35c59c4037a4dd2a9f32df2e363e1342
Jul 27 09:09:20 openshift-master-0 machine-config-daemon[2431]: I0727 09:09:20.399853    2431 update.go:1904] Starting update from mco-empty-mc to rendered-master-35c59c4037a4dd2a9f32df2e363e1342: &{osUpdate:true kargs:false fips:false passwd:false files:false units:false kernelType:false extensions:false}
Jul 27 09:09:20 openshift-master-0 root[2459]: machine-config-daemon[2431]: Starting update from mco-empty-mc to rendered-master-35c59c4037a4dd2a9f32df2e363e1342: &{osUpdate:true kargs:false fips:false passwd:false files:false units:false kernelType:false extensions:false}
Jul 27 09:09:41 openshift-master-0 logger[2543]: rendered-master-35c59c4037a4dd2a9f32df2e363e1342
Jul 27 09:09:41 openshift-master-0 machine-config-daemon[2431]: I0727 09:09:41.597913    2431 update.go:1904] initiating reboot: Completing firstboot provisioning to rendered-master-35c59c4037a4dd2a9f32df2e363e1342
Jul 27 09:09:41 openshift-master-0 root[2545]: machine-config-daemon[2431]: initiating reboot: Completing firstboot provisioning to rendered-master-35c59c4037a4dd2a9f32df2e363e1342
Jul 27 09:09:41 openshift-master-0 systemd[1]: Started machine-config-daemon: Completing firstboot provisioning to rendered-master-35c59c4037a4dd2a9f32df2e363e1342.
Jul 27 09:09:41 openshift-master-0 systemd[1]: Stopping machine-config-daemon: Completing firstboot provisioning to rendered-master-35c59c4037a4dd2a9f32df2e363e1342...
Jul 27 09:09:41 openshift-master-0 systemd[1]: Stopped machine-config-daemon: Completing firstboot provisioning to rendered-master-35c59c4037a4dd2a9f32df2e363e1342.
~~~
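The comparison above (names in the daemon log or journal versus names in `oc get mc`) can be sketched as a small shell helper. This is only an illustration, not part of the MCO: the function name `missing_rendered` is made up, and the pattern assumes pool names made of lowercase letters (such as master and worker) followed by a 32-character hex hash, as seen in the output above.

```shell
#!/bin/sh
# Hypothetical helper (not an MCO tool): read log text on stdin and print
# every rendered-<pool>-<hash> name that is absent from MC_LIST_FILE.
# MC_LIST_FILE is expected to hold `oc get mc --no-headers` style output,
# where the first column is the MachineConfig name.
missing_rendered() {
    mc_list_file=$1
    # Assumption: pool names are lowercase letters, hash is 32 hex chars.
    grep -oE 'rendered-[a-z]+-[0-9a-f]{32}' | sort -u |
    while read -r name; do
        # Exact, whole-line match against the first column of the list.
        if ! awk '{print $1}' "$mc_list_file" | grep -qxF "$name"; then
            echo "$name"
        fi
    done
}
```

A possible way to use it, assuming the pod name from the listing above: save `oc get mc --no-headers` to a file such as mc.txt, then run `oc logs machine-config-daemon-vdgmz -c machine-config-daemon | missing_rendered mc.txt`. In the situation described here, it should print only the rendered-master-35c59... name.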


Comment 1 Andreas Karis 2021-07-27 11:23:30 UTC
I think I found the culprit. I am setting a password for the core user via Ignition:
~~~
password=$(ansible -m debug -a msg="{{ 'redhat'  | password_hash('sha512') }}" localhost | awk '/msg/ {print $NF}' | sed 's/^"//' | sed 's/"$//')
for type in master worker ; do
        cat /root/openshift-install/${type}.ign  | jq '. += {"passwd" : { "users" : [ { "name": "core", "passwordHash": "'$(echo -n $password)'"}]}}' | tee /root/openshift-install/${type}.ign
done
~~~

Down the road, this leads to the issues with the MachineConfigOperator.

I still believe that this is a bug because:

a) changing the password with Ignition for early deployment troubleshooting makes sense, IMO
b) if the MCO later decides to override this configuration, it should do so gracefully instead of failing
c) the same issue bites us in the aforementioned Bugzillas and will presumably bite us in other situations (possibly whenever a conflict occurs between Ignition and a later MachineConfig)
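
As a side note on the Ignition edit in this comment: piping a file through a filter and back into itself with tee can truncate the file before it has been fully read. That is not claimed to be the cause of this bug. A minimal sketch of a safer pattern, writing to a temporary file and replacing the original only on success, could look like this. The helper name `rewrite_in_place` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: rewrite_in_place FILE CMD [ARGS...]
# Runs CMD with FILE on stdin, captures stdout in a temporary file next to
# FILE, and replaces FILE only if CMD exits successfully.
rewrite_in_place() {
    file=$1
    shift
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    if "$@" < "$file" > "$tmp"; then
        # Note: the result keeps the temporary file's permissions (0600).
        mv "$tmp" "$file"
    else
        # Leave the original untouched when the filter fails.
        rm -f "$tmp"
        return 1
    fi
}
```

With the variables from the snippet above, the loop body could then be written as `rewrite_in_place /root/openshift-install/${type}.ign jq --arg hash "$password" '.passwd = {users: [{name: "core", passwordHash: $hash}]}'`, which also avoids splicing the hash into the jq program through shell quoting.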

