Bug 1881213 - Make MCO handle install-time FeatureGate, revert workaround for IPv6DualStack
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Ryan Phillips
QA Contact: Weinan Liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-21 20:02 UTC by Dan Winship
Modified: 2023-09-15 00:48 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1881057
Environment:
Last Closed: 2021-01-11 14:24:48 UTC
Target Upstream Version:
Embargoed:


Description Dan Winship 2020-09-21 20:02:19 UTC
+++ This bug was initially created as a clone of Bug #1881057 +++

Trying to bring up a bare metal dual-stack cluster using dev-scripts, MCO is degraded:

  - lastTransitionTime: "2020-09-21T11:36:03Z"
    message: 'Unable to apply 4.6.0-0.ci.test-2020-09-21-103241-ci-ln-2mpl212: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with "3 nodes are reporting degraded status on sync": "Node master-0 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5\\\" not found\", Node master-2 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5\\\" not found\", Node master-1 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5\\\" not found\"", retrying'
    reason: RequiredPoolsFailed
    status: "True"
    type: Degraded

The referenced MachineConfig does not actually exist. MCD logs show:

  I0921 11:20:29.043729   11596 node.go:45] Setting initial node config: rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5
  I0921 11:20:29.061157   11596 daemon.go:781] In bootstrap mode
  E0921 11:20:29.061191   11596 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5" not found
  I0921 11:20:31.060836   11596 daemon.go:781] In bootstrap mode
  E0921 11:20:31.060885   11596 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-4ffdaac60fdcb29578a2a0029f7dc5b5" not found
  ...

Based on a Slack discussion, one possible culprit is that we add a FeatureGate object to the install manifests:

  kind: FeatureGate
  metadata:
    name: cluster
  spec:
    featureSet: IPv6DualStackNoUpgrade

MCC saw this:

  I0921 11:26:04.574593       1 kubelet_config_features.go:152] Applied FeatureSet cluster on MachineConfigPool master

But the MachineConfig currently in use on the masters does not reflect it:

  sh-5.0# more /etc/kubernetes/kubelet.conf 
  ...
  featureGates:
    APIPriorityAndFairness: true
    LegacyNodeRoleBehavior: false
    NodeDisruptionExclusion: true
    RotateKubeletServerCertificate: true
    SCTPSupport: true
    ServiceNodeExclusion: true
    SupportPodPidsLimit: true

--- Additional comment from Dan Winship on 2020-09-21 10:10:30 EDT ---

So yeah, it seems like FeatureGates are only processed post-bootstrap. So at bootstrap time, the MC components generate their configs ignoring the FeatureGate. Then when the non-bootstrap components come up, they process everything _with_ the FeatureGate and generate a different MachineConfig than the nodes are expecting, and it can't recover.

This seems... probably not _easily_ fixable?

Actually, kubelet doesn't do anything useful with the `IPv6DualStack` feature gate in 1.19 anyway... maybe if I just patch MCO to ignore that gate when generating the kubelet config that will solve the problem for 4.6.
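
A minimal sketch of the mismatch described above, not MCO's actual code. It assumes the rendered config name is derived from a hash of the rendered content; `rendered_name` and the use of a truncated SHA-256 are hypothetical stand-ins chosen only to produce a 32-character suffix like the one in the logs. The default gates are the ones seen in `/etc/kubernetes/kubelet.conf` above.

```python
import hashlib
import json

# Hypothetical stand-in for the rendering step: the real operator merges many
# MachineConfigs; here the "config" is only the kubelet feature-gate map.
def rendered_name(pool, feature_gates):
    payload = json.dumps(feature_gates, sort_keys=True).encode()
    digest = hashlib.sha256(payload).hexdigest()[:32]  # assumed hash scheme
    return f"rendered-{pool}-{digest}"

# Feature gates found in kubelet.conf on the masters.
DEFAULT_GATES = {
    "APIPriorityAndFairness": True,
    "LegacyNodeRoleBehavior": False,
    "NodeDisruptionExclusion": True,
    "RotateKubeletServerCertificate": True,
    "SCTPSupport": True,
    "ServiceNodeExclusion": True,
    "SupportPodPidsLimit": True,
}

# Bootstrap render: the FeatureGate object is ignored.
bootstrap_name = rendered_name("master", DEFAULT_GATES)

# In-cluster render: the FeatureGate object is honored.
cluster_name = rendered_name("master", {**DEFAULT_GATES, "IPv6DualStack": True})

print(bootstrap_name)
print(cluster_name)
print("names match:", bootstrap_name == cluster_name)
```

Under these assumptions the two names differ, so the nodes ask for a rendered config that the in-cluster controller never produced.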

Comment 1 Dan Winship 2020-09-21 20:12:45 UTC
https://github.com/openshift/machine-config-operator/pull/2108 adds the workaround suggested above, making MCO ignore the IPv6DualStack feature gate when generating the kubelet config, which results in the bootstrap and "real" configurations ending up the same.

But (a) it's an ugly hack that should be replaced with a real solution, and (b) it won't work for 4.7 anyway, because kubelet will actually have code that depends on the IPv6DualStack feature gate in 1.20 so we'll need the gate to be enabled correctly.
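
Roughly, the workaround amounts to something like the following sketch. This is an illustration, not the code from the PR: `IGNORED_KUBELET_GATES`, `kubelet_feature_gates`, and `config_digest` are hypothetical names, and the hash is an assumed stand-in for however the rendered config is identified.

```python
import hashlib
import json

# Gates skipped when generating the kubelet config (the 4.6 workaround).
IGNORED_KUBELET_GATES = {"IPv6DualStack"}

def kubelet_feature_gates(gates):
    """Drop ignored gates before they reach the kubelet config."""
    return {name: on for name, on in gates.items()
            if name not in IGNORED_KUBELET_GATES}

def config_digest(gates):
    """Hypothetical content hash standing in for the rendered config name."""
    payload = json.dumps(kubelet_feature_gates(gates), sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:32]

# Bootstrap ignores the FeatureGate object; the in-cluster render honors it.
bootstrap_gates = {"SCTPSupport": True, "SupportPodPidsLimit": True}
cluster_gates = {**bootstrap_gates, "IPv6DualStack": True}

same_config = config_digest(bootstrap_gates) == config_digest(cluster_gates)
gate_reaches_kubelet = "IPv6DualStack" in kubelet_feature_gates(cluster_gates)

print("same rendered config:", same_config)
print("gate reaches kubelet:", gate_reaches_kubelet)
```

With the filter in place both renders agree, but the gate never reaches the kubelet, which is exactly why this cannot carry over to 4.7.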

To reproduce, just add a FeatureGate object to the install manifest directory, e.g.:

  apiVersion: config.openshift.io/v1
  kind: FeatureGate
  metadata:
    name: cluster
  spec:
    featureSet: IPv6DualStackNoUpgrade

Well, except that that one won't trigger the bug now (because of the workaround above), but `featureSet: LatencySensitive` should work.
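
For scripting the reproduction, a small helper along these lines could drop the manifest into place. It is a sketch under assumptions: the `manifests` subdirectory of the install directory and the file name `99-feature-gate.yaml` are guesses for illustration, and `write_feature_gate_manifest` is a hypothetical helper, not part of any installer tooling. The demo at the bottom writes into a temporary directory rather than a real install directory.

```python
import tempfile
from pathlib import Path

FEATURE_GATE_TEMPLATE = """\
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: {feature_set}
"""

def write_feature_gate_manifest(install_dir, feature_set="LatencySensitive"):
    """Write a FeatureGate manifest under <install_dir>/manifests."""
    manifests = Path(install_dir) / "manifests"  # assumed layout
    manifests.mkdir(parents=True, exist_ok=True)
    target = manifests / "99-feature-gate.yaml"  # hypothetical file name
    target.write_text(FEATURE_GATE_TEMPLATE.format(feature_set=feature_set))
    return target

# Demo against a throwaway directory.
with tempfile.TemporaryDirectory() as install_dir:
    path = write_feature_gate_manifest(install_dir)
    written = path.read_text()

print(written)
```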

Comment 2 Dan Winship 2020-09-23 15:53:37 UTC
> Assignee: amurdaca → danw

Uh, this is generic MCO code, not anything networking-related. (And note that this bz is for 4.7; I implemented an IPv6DualStack-specific hack for 4.6 so that the underlying bug didn't need to be fixed right away, but the underlying bug is not related to dual-stack or anything networking-specific, it's just about how the bootstrap MachineConfig is generated.)

Comment 3 Antonio Murdaca 2020-10-05 14:17:05 UTC
So this would be on the Node team to implement day1 feature gate support - moving there.

Comment 5 Federico Paolinelli 2020-11-06 16:45:58 UTC
This should affect 4.7 only, right? Would it make sense to change the Version field to 4.7?
My understanding is the 4.6 clone is fixed.

Comment 6 Urvashi Mohnani 2020-11-06 17:23:32 UTC
Yeah, we can bump this up to be 4.7 only.

Comment 7 Ryan Phillips 2020-12-10 17:35:02 UTC
FeatureGates are all Day 2 operations. The API server, kubelet, and controller manager need to all be synced with the allowed featuregate. Applying a FeatureGate - day 0 - is not supported [1].

[1] https://docs.openshift.com/container-platform/4.6/nodes/clusters/nodes-cluster-enabling-features.html#nodes-cluster-enabling-features-cluster_nodes-cluster-enabling

Comment 8 Ryan Phillips 2020-12-10 17:38:07 UTC
If we want to support Day 0 or Day 1 FeatureGates, then there needs to be a higher level epic.

Comment 9 Dan Winship 2020-12-10 19:15:30 UTC
(In reply to Ryan Phillips from comment #7)
> FeatureGates are all Day 2 operations. The API server, kubelet, and
> controller manager need to all be synced with the allowed featuregate.
> Applying a FeatureGate - day 0 - is not supported [1]

(The linked doc does not actually say that applying feature gates on day 0 is not supported. It's true that it doesn't document how you'd do it, but is that really the same thing?)

At the moment, we don't allow changing between single- and dual-stack after install time (most networking configuration is day-0-only), so if a cluster is going to be dual stack, that has to be configured at install time, which, at the moment, means a non-default feature gate must be set at install time.

It seems plausible that this scenario may happen again in the future. At the same time, though, (a) we will _eventually_ support upgrading from single- to dual-stack post-install, meaning people could set the FeatureGate on day 2; and (b) dual-stack will eventually not require a non-default feature gate anyway, so the problem will just go away.

And for 4.7, we already have another workaround for this problem (https://github.com/openshift/machine-config-operator/pull/2277/commits/a0c44de3)

So maybe this is WONTFIX?

Comment 11 Ryan Phillips 2021-01-11 14:24:48 UTC
Thanks Dan... If we want to explore featuregate enablement let's open an Epic. Since we have workarounds, I will close this bug as WONTFIX.

Comment 12 Red Hat Bugzilla 2023-09-15 00:48:29 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

