Description of problem:
dual stack with an ipv6 network fails on bootstrap phase

Version-Release number of selected component (if applicable):
4.7+

How reproducible:
always

Steps to Reproduce:
1. prepare an install config, for instance on upi, with dual stacks entries
2. deploy
3. bootstrap kubelet fails to start

Actual results:
installation doesn't proceed, fails with

Dec 14 16:10:18 qct8-bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster.
Dec 14 16:10:37 qct8-bootstrap bootkube.sh[2590]: Moving OpenShift manifests in with the rest of them
Dec 14 16:10:37 qct8-bootstrap bootkube.sh[2590]: Rendering Cluster Version Operator Manifests...
Dec 14 16:10:38 qct8-bootstrap bootkube.sh[2590]: Rendering CEO Manifests...
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: F1214 16:10:40.104291 1 render.go:66] machineNetwork is not found in install-config
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: goroutine 1 [running]:
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: k8s.io/klog/v2.stacks(0xc000134001, 0xc0004d8160, 0x5a, 0xad)
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: k8s.io/klog/v2.0/klog.go:996 +0xb9
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: k8s.io/klog/v2.(*loggingT).output(0x37cc440, 0xc000000003, 0x0, 0x0, 0xc0008151f0, 0x36ebfb9, 0x9, 0x42, 0xc0>
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: k8s.io/klog/v2.0/klog.go:945 +0x191
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: k8s.io/klog/v2.(*loggingT).printDepth(0x37cc440, 0xc000000003, 0x0, 0x0, 0x1, 0xc000529cd8, 0x1, 0x1)
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: k8s.io/klog/v2.0/klog.go:718 +0x165
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: k8s.io/klog/v2.(*loggingT).print(...)
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: k8s.io/klog/v2.0/klog.go:703
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: k8s.io/klog/v2.Fatal(...)
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: k8s.io/klog/v2.0/klog.go:1443
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: github.com/openshift/cluster-etcd-operator/pkg/cmd/render.NewRenderCommand.func1.1(0xc000920670)
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: github.com/openshift/cluster-etcd-operator/pkg/cmd/render/render.go:66 +0xf6

Expected results:
installation should proceed

Additional info:
dan indicated that this is a bug in cluster-etcd-operator; it wants to find the IPv4 machine network if the cluster is dual-stack, but it (accidentally) only looks at the first element of machineNetwork

the install config used for this test:

apiVersion: v1
baseDomain: karmalabs.com
compute:
- name: worker
  replicas: 0
controlPlane:
  name: master
  replicas: 3
metadata:
  name: qct8
networking:
  networkType: OVNKubernetes
  machineNetwork:
  - cidr: 2620:52:0:1302::/64
  - cidr: 10.0.0.0/16
  clusterNetwork:
  - cidr: fd01::/48
    hostPrefix: 64
  - cidr: 10.132.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - fd02::/112
  - 172.30.0.0/16
platform:
  none: {}
additionalTrustBundle: |
  -----BEGIN CERTIFICATE-----
  MIIF6zCCA9OgAwIBAgIUbe1d+p+o6nzXKTf3YFMPALPKkc8wDQYJKoZIhvcNAQEL
  BQAwdjELMAkGA1UEBhMCVVMxDzANBgNVBAgMBk1hZHJpZDEVMBMGA1UEBwwMU2Fu
  W4AwfWg95+4WNXwGNDE1bui1RZtEzCbjw6Fxiu9ASw==
  -----END CERTIFICATE-----
pullSecret: '{"auths": {"qct8-disconnecter:5000": {"auth": "ZHVtbXk6ZHVtbXk=", "email": "jhendrix"}}}'
sshKey: |
  ssh-rsa 7WVR0= root.lab.eng.bos.redhat.com
imageContentSources:
- mirrors:
  - qct8-disconnecter:5000/ocp4
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
- mirrors:
  - qct8-disconnecter:5000/ocp4
  source: quay.io/ocp-release
> Dec 14 16:10:38 qct8-bootstrap bootkube.sh[2590]: Rendering CEO Manifests...
> Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: F1214 16:10:40.104291 1 render.go:66] machineNetwork is not found in install-config

This is a bug in cluster-etcd-operator. It is trying to find the IPv4 value in machineNetwork, but it accidentally only looks at the first element of machineNetwork rather than the entire list (https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/cmd/render/render.go#L450).

If you swapped the two elements around to be

machineNetwork:
- cidr: 10.0.0.0/16
- cidr: 2620:52:0:1302::/64

then it would work, though of course that would change your cluster in other ways by making everything IPv4-primary rather than IPv6-primary.

In theory cluster-etcd-operator should just use the first value in machineNetwork, rather than trying to find the IPv4 value when the cluster is dual-stack. That is, rather than implementing "etcd listens on IPv6-only when the cluster is single-stack IPv6, and IPv4-only when the cluster is single-stack IPv4 or dual-stack", it should instead implement "etcd listens on the IP family of whatever the first element of machineNetwork is".

Failing that, the machineNetwork-parsing code in render.go needs to be fixed. I tried doing that but parsing YAML by hand like that is just gross and my first few attempts got it wrong (so I guess I can't blame the current code for having gotten it wrong too...)
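The "IP family of whatever the first element of machineNetwork is" rule can be sketched in Go as below. This is only an illustration under assumptions: firstMachineNetworkFamily is a hypothetical helper, not a function in cluster-etcd-operator, and it takes the CIDR strings as an already-extracted slice rather than reading the install config.

```go
package main

import (
	"fmt"
	"net"
)

// firstMachineNetworkFamily is a hypothetical helper (not part of
// cluster-etcd-operator). It returns "IPv4" or "IPv6" depending on the
// address family of the first CIDR in the list, ignoring later entries.
func firstMachineNetworkFamily(cidrs []string) (string, error) {
	if len(cidrs) == 0 {
		return "", fmt.Errorf("machineNetwork is empty")
	}
	ip, _, err := net.ParseCIDR(cidrs[0])
	if err != nil {
		return "", fmt.Errorf("invalid machineNetwork CIDR %q: %w", cidrs[0], err)
	}
	// To4 returns nil for addresses that are not IPv4.
	if ip.To4() != nil {
		return "IPv4", nil
	}
	return "IPv6", nil
}

func main() {
	// machineNetwork order taken from the install config in this report.
	family, err := firstMachineNetworkFamily([]string{"2620:52:0:1302::/64", "10.0.0.0/16"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(family) // prints "IPv6"
}
```

With the two entries swapped, the same helper would report IPv4, which matches the IPv4-primary behavior described above.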
Thanks for the research, Dan. I believe the parsing fix would be to change this line:

https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/cmd/render/render.go#L451

machineCIDR := fmt.Sprintf("%v", network)

to

machineCIDR := fmt.Sprintf("%v", network["cidr"])

Which I think we agree would technically work but would in this case result in the bootstrap etcd member binding to the IPv4 addr even though IPv6 is probably more consistent with the rest of the setup.

Picking the family based on the first CIDR in the machine network list sounds like it could produce the more consistent effect of the etcd bootstrap member binding to IPv6, but would be a more significant behavioral change.

Either way I think this code needs some test coverage so I'm not opposed to either way and would defer to you (or anybody else) with a strong opinion on the matter.
No, the line before that one is buggy too.

for _, network := range networking["machineNetwork"].([]interface{})[0].(map[string]interface{}) {
    machineCIDR := fmt.Sprintf("%v", network)

networking["machineNetwork"] is an array of objects. The code ought to be looping over each object in the array and checking the value of its "cidr" property. Instead it loops over each key/value pair of only the first object in the array, ignoring the keys and assuming all the values are CIDRs. I.e., if the install config looked like:

machineNetwork:
- cidr: 2620:52:0:1302::/64
  totallyFakeCIDR: 99.99.0.0/16
- cidr: 10.0.0.0/16

then it would return "99.99.0.0/16". What it should be doing is something like

for _, network := range networking["machineNetwork"].([]interface{}) {
    networkMap := network.(map[string]interface{})
    machineCIDR := networkMap["cidr"].(string)

I think? I tried a few things before and they kept not working...
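A self-contained sketch of the loop described above, for comparison with the existing code. It assumes the install config has already been unmarshalled into map[string]interface{} values (as the type assertions in render.go imply); firstIPv4MachineCIDR is a hypothetical name, not the actual function in cluster-etcd-operator, and it keeps the current "find the IPv4 entry" behavior rather than the first-element rule.

```go
package main

import (
	"fmt"
	"net"
)

// firstIPv4MachineCIDR is a hypothetical helper (not the real render.go code).
// It walks every element of networking["machineNetwork"], reads only the
// "cidr" key of each element, and returns the first CIDR that is IPv4.
func firstIPv4MachineCIDR(networking map[string]interface{}) (string, error) {
	networks, ok := networking["machineNetwork"].([]interface{})
	if !ok {
		return "", fmt.Errorf("machineNetwork is missing or not a list")
	}
	for _, network := range networks {
		networkMap, ok := network.(map[string]interface{})
		if !ok {
			continue // skip elements that are not objects
		}
		cidr, ok := networkMap["cidr"].(string)
		if !ok {
			continue // skip elements without a string "cidr" key
		}
		ip, _, err := net.ParseCIDR(cidr)
		if err != nil {
			return "", fmt.Errorf("invalid machineNetwork CIDR %q: %w", cidr, err)
		}
		if ip.To4() != nil {
			return cidr, nil
		}
	}
	return "", fmt.Errorf("no IPv4 CIDR found in machineNetwork")
}

func main() {
	// Mirrors the example above, including the bogus extra key.
	networking := map[string]interface{}{
		"machineNetwork": []interface{}{
			map[string]interface{}{
				"cidr":            "2620:52:0:1302::/64",
				"totallyFakeCIDR": "99.99.0.0/16",
			},
			map[string]interface{}{"cidr": "10.0.0.0/16"},
		},
	}
	cidr, err := firstIPv4MachineCIDR(networking)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cidr) // prints "10.0.0.0/16"
}
```

Using the two-value form of each type assertion avoids a panic on a malformed install config, and reading only the "cidr" key means unrelated values such as totallyFakeCIDR are never mistaken for a machine network.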
I have contacted the edge team for help setting up a dual-stack environment, since we have no environment that can simulate this cluster setup. They are trying it now, thanks. cc: @yprokule
Hi Dan, according to comment 7, it seems another issue has appeared. Could you help investigate whether the new issue is the same as the original issue in this bug? If yes, we may change the bug status back; if no, perhaps we can verify this bug and file a new bug to track the new issue. Thanks.
According to comments 7 and 9, QE has not hit this issue but hit a different one, and has filed a new bug to track it. Closing this bug; the dual-stack installer issue will be tracked in the new bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days