Bug 1881057 - dual-stack cluster comes up with MCO degraded (possibly due to install-time FeatureGate?)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Dan Winship
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-21 13:23 UTC by Dan Winship
Modified: 2021-07-02 08:21 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1881213
Environment:
Last Closed: 2020-10-27 16:43:33 UTC
Target Upstream Version:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator pull 2108 | 0 | None | closed | Bug 1881057: ignore the IPv6DualStack feature gate for the kubelet config | 2020-12-04 17:55:52 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:43:54 UTC

Description Dan Winship 2020-09-21 13:23:03 UTC
While trying to bring up a bare metal dual-stack cluster using dev-scripts, I found the MCO degraded:

  - lastTransitionTime: "2020-09-21T11:36:03Z"
    message: 'Unable to apply 4.6.0-0.ci.test-2020-09-21-103241-ci-ln-2mpl212: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with "3 nodes are reporting degraded status on sync": "Node master-0 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5\\\" not found\", Node master-2 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5\\\" not found\", Node master-1 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5\\\" not found\"", retrying'
    reason: RequiredPoolsFailed
    status: "True"
    type: Degraded

The referenced MachineConfig does not actually exist. MCD logs show:

  I0921 11:20:29.043729   11596 node.go:45] Setting initial node config: rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5
  I0921 11:20:29.061157   11596 daemon.go:781] In bootstrap mode
  E0921 11:20:29.061191   11596 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5" not found
  I0921 11:20:31.060836   11596 daemon.go:781] In bootstrap mode
  E0921 11:20:31.060885   11596 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5" not found
  ...
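
For triage, the repeated "not found" errors can be summarized mechanically. The following is a minimal Python sketch (standard library only) that pulls the missing rendered MachineConfig names out of MCD log text like the excerpt above. The regular expression and the function name `missing_rendered_configs` are illustrative assumptions, not part of the MCO.

```python
import re

# Matches the "Marking Degraded" lines as they appear in the MCD log excerpt.
DEGRADED_RE = re.compile(
    r"Marking Degraded due to: machineconfig\.machineconfiguration\.openshift\.io "
    r'"(?P<name>rendered-[a-z0-9-]+)" not found'
)


def missing_rendered_configs(log_text):
    """Count how often each rendered MachineConfig is reported as not found."""
    counts = {}
    for line in log_text.splitlines():
        match = DEGRADED_RE.search(line)
        if match:
            name = match.group("name")
            counts[name] = counts.get(name, 0) + 1
    return counts


sample_log = """\
I0921 11:20:29.043729   11596 node.go:45] Setting initial node config: rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5
I0921 11:20:29.061157   11596 daemon.go:781] In bootstrap mode
E0921 11:20:29.061191   11596 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5" not found
I0921 11:20:31.060836   11596 daemon.go:781] In bootstrap mode
E0921 11:20:31.060885   11596 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5" not found
"""

print(missing_rendered_configs(sample_log))
```

Comparing the names this returns against the MachineConfigs that actually exist in the cluster shows whether every node is waiting on the same missing rendered config.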

Based on a Slack discussion, one possible culprit is that we add a FeatureGate object to the install manifests:

  kind: FeatureGate
  metadata:
    name: cluster
  spec:
    featureSet: IPv6DualStackNoUpgrade

MCC saw this:

  I0921 11:26:04.574593       1 kubelet_config_features.go:152] Applied FeatureSet cluster on MachineConfigPool master

But the MachineConfig currently in use on the masters does not reflect it:

  sh-5.0# more /etc/kubernetes/kubelet.conf 
  ...
  featureGates:
    APIPriorityAndFairness: true
    LegacyNodeRoleBehavior: false
    NodeDisruptionExclusion: true
    RotateKubeletServerCertificate: true
    SCTPSupport: true
    ServiceNodeExclusion: true
    SupportPodPidsLimit: true
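
The same check can be scripted. Below is a minimal Python sketch that reads the `featureGates` block from kubelet config text and reports whether a given gate is present. The parser is deliberately naive and handles only the flat `name: true|false` layout shown above; the function name `parse_feature_gates` is hypothetical, and the sample contains only the lines quoted here (the elided parts of the file are left out).

```python
def parse_feature_gates(kubelet_conf_text):
    """Extract the top-level featureGates mapping from kubelet config text.

    Not a general YAML parser: it only understands a top-level
    `featureGates:` key followed by indented `name: true|false` lines.
    """
    gates = {}
    in_block = False
    for raw in kubelet_conf_text.splitlines():
        if not raw.strip():
            continue
        indented = raw[0] in " \t"
        if not indented and raw.strip() == "featureGates:":
            in_block = True
            continue
        if in_block:
            if not indented:
                break  # the next top-level key ends the block
            key, _, value = raw.strip().partition(":")
            gates[key.strip()] = value.strip().lower() == "true"
    return gates


kubelet_conf = """\
featureGates:
  APIPriorityAndFairness: true
  LegacyNodeRoleBehavior: false
  NodeDisruptionExclusion: true
  RotateKubeletServerCertificate: true
  SCTPSupport: true
  ServiceNodeExclusion: true
  SupportPodPidsLimit: true
"""

gates = parse_feature_gates(kubelet_conf)
print("IPv6DualStack" in gates)
```

For the config shown above, the `IPv6DualStack` gate is absent, which matches the observation that the masters are running a config rendered without the FeatureGate.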

Comment 2 Dan Winship 2020-09-21 14:10:30 UTC
So yeah, it seems like FeatureGates are only processed post-bootstrap. At bootstrap time, the MC components generate their configs ignoring the FeatureGate. Then when the non-bootstrap components come up, they process everything _with_ the FeatureGate and generate a different rendered MachineConfig than the one the nodes are expecting, and the pool can't recover.

This seems... probably not _easily_ fixable?

Actually, kubelet doesn't do anything useful with the `IPv6DualStack` feature gate in 1.19 anyway... maybe if I just patch MCO to ignore that gate when generating the kubelet config that will solve the problem for 4.6.
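
To make the mismatch and the proposed workaround concrete, here is a toy Python model. It is not the MCO's actual code: `rendered_name` is a hypothetical stand-in that names a rendered config by hashing its content, the default gates are a shortened made-up subset, and it is assumed that the `IPv6DualStackNoUpgrade` feature set turns on the `IPv6DualStack` gate.

```python
import hashlib
import json

# Gates dropped when building the kubelet config (the workaround proposed above).
IGNORED_KUBELET_GATES = frozenset({"IPv6DualStack"})


def kubelet_feature_gates(defaults, feature_set_gates, ignored=frozenset()):
    """Merge FeatureGate-derived gates into the defaults, skipping ignored ones."""
    merged = dict(defaults)
    for name, enabled in feature_set_gates.items():
        if name not in ignored:
            merged[name] = enabled
    return merged


def rendered_name(pool, gates):
    """Hypothetical stand-in for rendered config naming: a hash of the content."""
    payload = json.dumps(gates, sort_keys=True).encode("utf-8")
    return f"rendered-{pool}-{hashlib.sha256(payload).hexdigest()[:32]}"


defaults = {"RotateKubeletServerCertificate": True, "SCTPSupport": True}
# Assumption: the IPv6DualStackNoUpgrade feature set turns on this gate.
feature_set_gates = {"IPv6DualStack": True}

# Bootstrap renders without the FeatureGate; the in-cluster controller sees it.
at_bootstrap = rendered_name("master", kubelet_feature_gates(defaults, {}))
in_cluster = rendered_name("master", kubelet_feature_gates(defaults, feature_set_gates))
in_cluster_patched = rendered_name(
    "master",
    kubelet_feature_gates(defaults, feature_set_gates, IGNORED_KUBELET_GATES),
)

print(at_bootstrap == in_cluster)          # do the names agree without the patch?
print(at_bootstrap == in_cluster_patched)  # do they agree once the gate is ignored?
```

In this model the bootstrap and in-cluster names differ only because of the extra gate, so filtering it out before rendering brings the two back into agreement. That is the intent of the patch, under the assumptions stated above.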

Comment 4 Micah Abbott 2020-09-22 21:18:48 UTC
@Johnny is this something that your team could verify?

Comment 5 Johnny Liu 2020-09-23 03:24:12 UTC
Presently IPv6 is only supported on IPI Baremetal, so maybe the edge QE team can help with that.

Comment 6 Micah Abbott 2020-09-23 13:30:51 UTC
@Ariel could someone from your team verify this BZ?

Comment 7 Micah Abbott 2020-10-01 14:15:49 UTC
@Luksa @Lucie I'm looking for assistance getting this BZ verified. Is it possible that someone from your team could check this?

Comment 9 Micah Abbott 2020-10-01 17:55:07 UTC
I got access to a dual-stack bare metal environment that was installed using 4.6.0-fc.8.

The cluster looks healthy as far as I can tell, so I'm going to mark this one verified as SanityOnly.


```
[kni@r640-u09 ~]$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-fc.8   True        False         25m     Cluster version is 4.6.0-fc.8
[kni@r640-u09 ~]$ oc get nodes
NAME                                                STATUS   ROLES    AGE   VERSION
openshift-master-0.qe3.kni.lab.eng.bos.redhat.com   Ready    master   64m   v1.19.0+8a39924
openshift-master-1.qe3.kni.lab.eng.bos.redhat.com   Ready    master   62m   v1.19.0+8a39924
openshift-master-2.qe3.kni.lab.eng.bos.redhat.com   Ready    master   63m   v1.19.0+8a39924
openshift-worker-0.qe3.kni.lab.eng.bos.redhat.com   Ready    worker   35m   v1.19.0+8a39924
openshift-worker-1.qe3.kni.lab.eng.bos.redhat.com   Ready    worker   35m   v1.19.0+8a39924
[kni@r640-u09 ~]$ oc describe node/openshift-master-0.qe3.kni.lab.eng.bos.redhat.com
Name:               openshift-master-0.qe3.kni.lab.eng.bos.redhat.com
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=openshift-master-0.qe3.kni.lab.eng.bos.redhat.com
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
                    node.openshift.io/os_id=rhcos
Annotations:        k8s.ovn.org/l3-gateway-config:
                      {"default":{"mode":"shared","interface-id":"br-ex_openshift-master-0.qe3.kni.lab.eng.bos.redhat.com","mac-address":"98:03:9b:61:71:79","ip...
                    k8s.ovn.org/node-chassis-id: b049fb50-575c-402a-829b-cb66e0253ba5
                    k8s.ovn.org/node-join-subnets: {"default":"fd98:0:0:1::/64"}
                    k8s.ovn.org/node-local-nat-ip: {"default":["fd99::a184"]}
                    k8s.ovn.org/node-mgmt-port-mac-address: 72:b4:85:98:19:32
                    k8s.ovn.org/node-primary-ifaddr: {"ipv6":"2620:52:0:1386::91/128"}
                    k8s.ovn.org/node-subnets: {"default":"fd01:0:0:2::/64"}
                    machine.openshift.io/machine: openshift-machine-api/qe3-rxhxg-master-0
                    machineconfiguration.openshift.io/currentConfig: rendered-master-9d089e6353d1eb00acc37c491547b42c
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-9d089e6353d1eb00acc37c491547b42c
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 01 Oct 2020 12:48:10 -0400
Taints:             node-role.kubernetes.io/master:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  openshift-master-0.qe3.kni.lab.eng.bos.redhat.com
  AcquireTime:     <unset>
  RenewTime:       Thu, 01 Oct 2020 13:52:46 -0400
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Thu, 01 Oct 2020 13:48:08 -0400   Thu, 01 Oct 2020 12:48:11 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Thu, 01 Oct 2020 13:48:08 -0400   Thu, 01 Oct 2020 12:48:11 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Thu, 01 Oct 2020 13:48:08 -0400   Thu, 01 Oct 2020 12:48:11 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Thu, 01 Oct 2020 13:48:08 -0400   Thu, 01 Oct 2020 12:51:22 -0400   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  2620:52:0:1386::91
  Hostname:    openshift-master-0.qe3.kni.lab.eng.bos.redhat.com
...

[kni@r640-u09 ~]$ oc -n openshift-machine-config-operator logs po/machine-config-daemon-lmh58 machine-config-daemon
I1001 16:52:00.818712   16424 start.go:108] Version: v4.6.0-202009240159.p0-dirty (a3c9532c8e8f2efe9b0f739fbd761b32cc0bfa2b)
I1001 16:52:00.821428   16424 start.go:121] Calling chroot("/rootfs")
I1001 16:52:00.821486   16424 rpm-ostree.go:261] Running captured: rpm-ostree status --json
I1001 16:52:00.964096   16424 daemon.go:226] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b973b2f9e432b12388874a9c8d191e699106bcbf12d962c729b4e16307dbd83f (46.82.202009222340-0)
I1001 16:52:00.997093   16424 daemon.go:233] Installed Ignition binary version: 2.6.0
I1001 16:52:01.017542   16424 start.go:97] Copied self to /run/bin/machine-config-daemon on host
I1001 16:52:01.019779   16424 metrics.go:106] Registering Prometheus metrics
I1001 16:52:01.019883   16424 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
I1001 16:52:01.021413   16424 update.go:1565] Starting to manage node: openshift-master-0.qe3.kni.lab.eng.bos.redhat.com
I1001 16:52:01.024892   16424 rpm-ostree.go:261] Running captured: rpm-ostree status
I1001 16:52:01.056146   16424 daemon.go:863] State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b973b2f9e432b12388874a9c8d191e699106bcbf12d962c729b4e16307dbd83f
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202009222340-0 (2020-09-22T23:44:32Z)

  ostree://5d65bddfb072101a84501cd87b8abc650beb8dc0aa2bfeff022fc750cde52f1d
                   Version: 46.82.202009222340-0 (2020-09-22T23:44:32Z)
I1001 16:52:01.056167   16424 rpm-ostree.go:261] Running captured: journalctl --list-boots
I1001 16:52:01.061298   16424 daemon.go:870] journalctl --list-boots:
-1 2e4395008e0544118d9ade9c7a9c73e3 Thu 2020-10-01 16:44:33 UTC—Thu 2020-10-01 16:46:06 UTC
 0 bd1cb9b851854871922c5b3e4a35a6fe Thu 2020-10-01 16:47:29 UTC—Thu 2020-10-01 16:52:01 UTC
I1001 16:52:01.061322   16424 rpm-ostree.go:261] Running captured: systemctl list-units --state=failed --no-legend
I1001 16:52:01.067939   16424 daemon.go:885] systemd service state: OK
I1001 16:52:01.067953   16424 daemon.go:617] Starting MachineConfigDaemon
I1001 16:52:01.068077   16424 daemon.go:624] Enabling Kubelet Healthz Monitor
I1001 16:52:02.031362   16424 daemon.go:397] Node openshift-master-0.qe3.kni.lab.eng.bos.redhat.com is part of the control plane
I1001 16:52:02.062519   16424 controlplane.go:52] Set root blockdev /sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0/host1/target1:2:0/1:2:0:0/block/sdb to use scheduler bfq
I1001 16:52:03.610075   16424 node.go:24] No machineconfiguration.openshift.io/currentConfig annotation on node openshift-master-0.qe3.kni.lab.eng.bos.redhat.com: map[k8s.ovn.org/l3-gateway-config:{"default":{"mode":"shared","interface-id":"br-ex_openshift-master-0.qe3.kni.lab.eng.bos.redhat.com","mac-address":"98:03:9b:61:71:79","ip-addresses":["2620:52:0:1386::91/128"],"ip-address":"2620:52:0:1386::91/128","next-hops":["fe80::22d8:b00:c909:1e41"],"next-hop":"fe80::22d8:b00:c909:1e41","node-port-enable":"true","vlan-id":"0"}} k8s.ovn.org/node-chassis-id:b049fb50-575c-402a-829b-cb66e0253ba5 k8s.ovn.org/node-join-subnets:{"default":"fd98:0:0:1::/64"} k8s.ovn.org/node-local-nat-ip:{"default":["fd99::a184"]} k8s.ovn.org/node-mgmt-port-mac-address:72:b4:85:98:19:32 k8s.ovn.org/node-primary-ifaddr:{"ipv6":"2620:52:0:1386::91/128"} k8s.ovn.org/node-subnets:{"default":"fd01:0:0:2::/64"} volumes.kubernetes.io/controller-managed-attach-detach:true], in cluster bootstrap, loading initial node annotation from /etc/machine-config-daemon/node-annotations.json
I1001 16:52:03.610415   16424 node.go:45] Setting initial node config: rendered-master-9d089e6353d1eb00acc37c491547b42c
I1001 16:52:03.623167   16424 daemon.go:781] In bootstrap mode
E1001 16:52:03.623199   16424 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-9d089e6353d1eb00acc37c491547b42c" not found
I1001 16:52:05.623901   16424 daemon.go:781] In bootstrap mode
E1001 16:52:05.623925   16424 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-9d089e6353d1eb00acc37c491547b42c" not found
I1001 16:52:21.633843   16424 daemon.go:781] In bootstrap mode
I1001 16:52:21.633862   16424 daemon.go:809] Current+desired config: rendered-master-9d089e6353d1eb00acc37c491547b42c
I1001 16:52:21.637473   16424 daemon.go:1045] No bootstrap pivot required; unlinking bootstrap node annotations
I1001 16:52:21.637501   16424 daemon.go:1083] Validating against pending config rendered-master-9d089e6353d1eb00acc37c491547b42c
I1001 16:52:21.648520   16424 daemon.go:1094] Validated on-disk state
I1001 16:52:21.669143   16424 daemon.go:1135] Completing pending config rendered-master-9d089e6353d1eb00acc37c491547b42c
I1001 16:52:21.669161   16424 update.go:1565] completed update for config rendered-master-9d089e6353d1eb00acc37c491547b42c
I1001 16:52:21.677168   16424 daemon.go:1151] In desired config rendered-master-9d089e6353d1eb00acc37c491547b42c
```
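
The sanity check above comes down to the node's MCO annotations agreeing with each other. A minimal Python sketch of that comparison follows, using the annotation values from the `oc describe node` output; the helper name `node_config_settled` is hypothetical, and treating "current equals desired and state is Done" as healthy is an assumption about what this verification looked at.

```python
PREFIX = "machineconfiguration.openshift.io/"


def node_config_settled(annotations):
    """True when the node runs its desired rendered config and reports Done."""
    current = annotations.get(PREFIX + "currentConfig")
    desired = annotations.get(PREFIX + "desiredConfig")
    state = annotations.get(PREFIX + "state")
    return bool(current) and current == desired and state == "Done"


# Annotation values copied from the describe output for openshift-master-0.
master_0 = {
    PREFIX + "currentConfig": "rendered-master-9d089e6353d1eb00acc37c491547b42c",
    PREFIX + "desiredConfig": "rendered-master-9d089e6353d1eb00acc37c491547b42c",
    PREFIX + "state": "Done",
}

print(node_config_settled(master_0))
```

Note that the MCD log still shows two transient "not found" errors while in bootstrap mode before the rendered config appears; the difference from the original report is that the daemon then finds the config and completes the update.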

Comment 12 errata-xmlrpc 2020-10-27 16:43:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

