+++ This bug was initially created as a clone of Bug #1881057 +++

Trying to bring up a bare metal dual-stack cluster using dev-scripts, MCO is degraded:

  - lastTransitionTime: "2020-09-21T11:36:03Z"
    message: 'Unable to apply 4.6.0-0.ci.test-2020-09-21-103241-ci-ln-2mpl212: timed
      out waiting for the condition during syncRequiredMachineConfigPools: pool
      master has not progressed to latest configuration: configuration status for
      pool master is empty: pool is degraded because nodes fail with "3 nodes are
      reporting degraded status on sync": "Node master-0 is reporting:
      \"machineconfig.machineconfiguration.openshift.io
      \\\"rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5\\\" not found\", Node
      master-2 is reporting: \"machineconfig.machineconfiguration.openshift.io
      \\\"rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5\\\" not found\", Node
      master-1 is reporting: \"machineconfig.machineconfiguration.openshift.io
      \\\"rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5\\\" not found\"",
      retrying'
    reason: RequiredPoolsFailed
    status: "True"
    type: Degraded

The referenced MachineConfig does not actually exist. MCD logs show:

I0921 11:20:29.043729   11596 node.go:45] Setting initial node config: rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5
I0921 11:20:29.061157   11596 daemon.go:781] In bootstrap mode
E0921 11:20:29.061191   11596 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5" not found
I0921 11:20:31.060836   11596 daemon.go:781] In bootstrap mode
E0921 11:20:31.060885   11596 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5" not found
...
Based on a Slack discussion, one possible culprit is the fact that we add a FeatureGate object to the install manifests:

kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: IPv6DualStackNoUpgrade

MCC saw this:

I0921 11:26:04.574593       1 kubelet_config_features.go:152] Applied FeatureSet cluster on MachineConfigPool master

But the MachineConfig currently in use on the masters does not reflect it:

sh-5.0# more /etc/kubernetes/kubelet.conf
...
featureGates:
  APIPriorityAndFairness: true
  LegacyNodeRoleBehavior: false
  NodeDisruptionExclusion: true
  RotateKubeletServerCertificate: true
  SCTPSupport: true
  ServiceNodeExclusion: true
  SupportPodPidsLimit: true

--- Additional comment from Dan Winship on 2020-09-21 10:10:30 EDT ---

So yeah, it seems like FeatureGates are only processed post-bootstrap. So at bootstrap time, the MC components generate their configs ignoring the FeatureGate. Then when the non-bootstrap components come up, they process everything _with_ the FeatureGate and generate a different MachineConfig than the nodes are expecting, and it can't recover.

This seems... probably not _easily_ fixable? Actually, kubelet doesn't do anything useful with the `IPv6DualStack` feature gate in 1.19 anyway... maybe if I just patch MCO to ignore that gate when generating the kubelet config, that will solve the problem for 4.6.
https://github.com/openshift/machine-config-operator/pull/2108 adds the workaround suggested above, making MCO ignore the IPv6DualStack feature gate when generating the kubelet config, which results in the bootstrap and "real" configurations ending up the same. But (a) it's an ugly hack that should be replaced with a real solution, and (b) it won't work for 4.7 anyway, because kubelet will actually have code that depends on the IPv6DualStack feature gate in 1.20, so we'll need the gate to be enabled correctly.

To reproduce, just add a FeatureGate object to the install manifest directory, e.g.:

apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: IPv6DualStackNoUpgrade

...well, except that that one won't trigger the bug now, but `featureSet: LatencySensitive` should work.
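Concretely, the reproduction above can be staged like this (a sketch; the `manifests/` directory and the `99-featuregate.yaml` filename are assumptions, matching the usual `openshift-install create manifests` layout):

```shell
# Sketch: drop a FeatureGate manifest into the install manifest directory
# before running the installer, so it is present at bootstrap time.
# Assumes manifests/ was already generated (e.g. by `openshift-install
# create manifests`); the 99-featuregate.yaml name is arbitrary.
mkdir -p manifests
cat > manifests/99-featuregate.yaml <<'EOF'
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: LatencySensitive
EOF
# Sanity-check the staged manifest.
grep 'featureSet:' manifests/99-featuregate.yaml
```

After this, continuing the install should reproduce the bootstrap/post-bootstrap rendered-config mismatch described above.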
> Assignee: amurdaca → danw

uh, this is generic MCO code, not anything networking-related. (And note that this bz is for 4.7; I implemented an IPv6DualStack-specific hack for 4.6 so that the underlying bug didn't need to be fixed right away, but the underlying bug is not related to dual-stack or anything networking-specific, it's just about how the bootstrap MachineConfig is generated.)
So this would be on the Node team to implement day1 feature gate support - moving there.
This should affect 4.7 only, right? Would it make sense to change the Version field to 4.7? My understanding is that the 4.6 clone is fixed.
Yeah, we can bump this up to be 4.7 only.
FeatureGates are all Day 2 operations. The API server, kubelet, and controller manager all need to be synced with the allowed feature gate. Applying a FeatureGate on day 0 is not supported [1].

[1] https://docs.openshift.com/container-platform/4.6/nodes/clusters/nodes-cluster-enabling-features.html#nodes-cluster-enabling-features-cluster_nodes-cluster-enabling
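For reference, the supported Day 2 flow from the linked doc amounts to editing the cluster-scoped FeatureGate object on a running cluster (e.g. via `oc edit featuregate cluster`); a sketch of the resulting spec, with `LatencySensitive` as just an example feature set:

```
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: LatencySensitive   # example; enabling a feature set cannot be undone
```

The point of contention below is whether the same object can legitimately be supplied at install time instead.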
If we want to support Day 0 or Day 1 FeatureGates, then there needs to be a higher level epic.
(In reply to Ryan Phillips from comment #7)
> FeatureGates are all Day 2 operations. The API server, kubelet, and
> controller manager need to all be synced with the allowed featuregate.
> Applying a FeatureGate - day 0 - is not supported [1]

(The linked doc does not actually say that applying feature gates on day 0 is not supported. It's true that it doesn't document how you'd do it, but is that really the same thing?)

At the moment, we don't allow changing between single- and dual-stack after install time (most networking configuration is day-0-only), so if a cluster is going to be dual stack, that has to be configured at install time, which, at the moment, means a non-default feature gate must be set at install time.

It seems plausible this scenario may happen again in the future, though, at the same time, (a) we will _eventually_ support upgrading from single- to dual-stack post-install, meaning people could set the FeatureGate on day 2; and (b) dual-stack will eventually not require a non-default feature gate anyway, so the problem will just go away.

And for 4.7, we already have another workaround for this problem (https://github.com/openshift/machine-config-operator/pull/2277/commits/a0c44de3).

So maybe this is WONTFIX?
Thanks Dan... If we want to explore feature gate enablement, let's open an Epic. Since we have workarounds, I will close this bug as WONTFIX.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days