Bug 1907872 - dual stack with an ipv6 network fails on bootstrap phase [NEEDINFO]
Summary: dual stack with an ipv6 network fails on bootstrap phase
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Dan Mace
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-15 12:23 UTC by Karim Boumedhel
Modified: 2021-02-24 15:44 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A parsing bug in reading machine network CIDR. Consequence: The bootstrap rendering logic fails to detect a usable machine network CIDR when using IPv6 dual stack mode unless the IPv4 CIDR is the first element in the install config's machine network CIDR array. Fix: Fix the parsing logic to loop through all machine network CIDRs. Result: The IPv4 address is correctly located amongst the machine network CIDRs in dual stack mode.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:43:55 UTC
Target Upstream Version:
sbatsche: needinfo? (bbennett)
geliu: needinfo? (dmace)




Links
Github openshift cluster-etcd-operator pull 532 (closed): Bug 1907872: Make dual stack bootstrapping more reliable (last updated 2021-02-08 10:05:38 UTC)
Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:44:37 UTC)

Description Karim Boumedhel 2020-12-15 12:23:06 UTC
Description of problem:
dual stack with an ipv6 network fails on bootstrap phase


Version-Release number of selected component (if applicable):
4.7+

How reproducible:
always

Steps to Reproduce:
1. Prepare an install config with dual-stack entries, for instance on UPI
2. Deploy
3. The bootstrap kubelet fails to start

Actual results:
Installation doesn't proceed; it fails with:

Dec 14 16:10:18 qct8-bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster.
Dec 14 16:10:37 qct8-bootstrap bootkube.sh[2590]: Moving OpenShift manifests in with the rest of them
Dec 14 16:10:37 qct8-bootstrap bootkube.sh[2590]: Rendering Cluster Version Operator Manifests...
Dec 14 16:10:38 qct8-bootstrap bootkube.sh[2590]: Rendering CEO Manifests...
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: F1214 16:10:40.104291       1 render.go:66] machineNetwork is not found in install-config
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: goroutine 1 [running]:
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: k8s.io/klog/v2.stacks(0xc000134001, 0xc0004d8160, 0x5a, 0xad)
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]:         k8s.io/klog/v2@v2.3.0/klog.go:996 +0xb9
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: k8s.io/klog/v2.(*loggingT).output(0x37cc440, 0xc000000003, 0x0, 0x0, 0xc0008151f0, 0x36ebfb9, 0x9, 0x42, 0xc0>
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]:         k8s.io/klog/v2@v2.3.0/klog.go:945 +0x191
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: k8s.io/klog/v2.(*loggingT).printDepth(0x37cc440, 0xc000000003, 0x0, 0x0, 0x1, 0xc000529cd8, 0x1, 0x1)
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]:         k8s.io/klog/v2@v2.3.0/klog.go:718 +0x165
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: k8s.io/klog/v2.(*loggingT).print(...)
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]:         k8s.io/klog/v2@v2.3.0/klog.go:703
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: k8s.io/klog/v2.Fatal(...)
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]:         k8s.io/klog/v2@v2.3.0/klog.go:1443
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: github.com/openshift/cluster-etcd-operator/pkg/cmd/render.NewRenderCommand.func1.1(0xc000920670)
Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]:         github.com/openshift/cluster-etcd-operator/pkg/cmd/render/render.go:66 +0xf6



Expected results:
installation should proceed


Additional info:
Dan indicated that this is a bug in cluster-etcd-operator: it wants to find the IPv4 machine network if the cluster is dual-stack, but it (accidentally) only looks at the first element of machineNetwork.

the install config used for this test:

apiVersion: v1
baseDomain: karmalabs.com
compute:
- name: worker
  replicas: 0
controlPlane:
  name: master
  replicas: 3
metadata:
  name: qct8
networking:
  networkType: OVNKubernetes
  machineNetwork:
  - cidr: 2620:52:0:1302::/64
  - cidr: 10.0.0.0/16
  clusterNetwork:
  - cidr: fd01::/48
    hostPrefix: 64
  - cidr: 10.132.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - fd02::/112
  - 172.30.0.0/16
platform:
  none: {}
additionalTrustBundle: |
 -----BEGIN CERTIFICATE-----
 MIIF6zCCA9OgAwIBAgIUbe1d+p+o6nzXKTf3YFMPALPKkc8wDQYJKoZIhvcNAQEL
 BQAwdjELMAkGA1UEBhMCVVMxDzANBgNVBAgMBk1hZHJpZDEVMBMGA1UEBwwMU2Fu
 W4AwfWg95+4WNXwGNDE1bui1RZtEzCbjw6Fxiu9ASw==
 -----END CERTIFICATE-----
pullSecret: '{"auths": {"qct8-disconnecter:5000": {"auth": "ZHVtbXk6ZHVtbXk=", "email": "jhendrix@karmalabs.com"}}}'
sshKey: |
  ssh-rsa 7WVR0= root@qct-d14u08.cloud.lab.eng.bos.redhat.com
imageContentSources:
- mirrors:
  - qct8-disconnecter:5000/ocp4
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
- mirrors:
  - qct8-disconnecter:5000/ocp4
  source: quay.io/ocp-release

Comment 1 Dan Winship 2020-12-15 13:26:25 UTC
> Dec 14 16:10:38 qct8-bootstrap bootkube.sh[2590]: Rendering CEO Manifests...
> Dec 14 16:10:40 qct8-bootstrap bootkube.sh[2590]: F1214 16:10:40.104291       1 render.go:66] machineNetwork is not found in install-config

This is a bug in cluster-etcd-operator. It is trying to find the IPv4 value in machineNetwork, but it accidentally only looks at the first element of machineNetwork rather than the entire list. (https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/cmd/render/render.go#L450). If you swapped the two elements around to be

  machineNetwork:
  - cidr: 10.0.0.0/16
  - cidr: 2620:52:0:1302::/64

then it would work, though of course that would change your cluster in other ways by making everything IPv4-primary rather than IPv6-primary.

In theory cluster-etcd-operator should just use the first value in machineNetwork, rather than trying to find the IPv4 value when the cluster is dual-stack. That is, rather than implementing "etcd listens on IPv6-only when the cluster is single-stack IPv6, and IPv4-only when the cluster is single-stack IPv4 or dual-stack", it should instead implement "etcd listens on the IP family of whatever the first element of machineNetwork is".

Failing that, the machineNetwork-parsing code in render.go needs to be fixed. I tried doing that but parsing YAML by hand like that is just gross and my first few attempts got it wrong (so I guess I can't blame the current code for having gotten it wrong too...)
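
For illustration only, the "IP family of whatever the first element of machineNetwork is" policy suggested above could look roughly like the following Go sketch. The function name familyOfFirstCIDR and the plain []string input are assumptions made for this example; they are not taken from cluster-etcd-operator.

```go
package main

import (
	"fmt"
	"net"
)

// familyOfFirstCIDR is a hypothetical helper (not operator code). It reports
// the IP family of the first machineNetwork CIDR only, ignoring later entries.
func familyOfFirstCIDR(cidrs []string) (string, error) {
	if len(cidrs) == 0 {
		return "", fmt.Errorf("machineNetwork is empty")
	}
	ip, _, err := net.ParseCIDR(cidrs[0])
	if err != nil {
		return "", fmt.Errorf("invalid machineNetwork CIDR %q: %w", cidrs[0], err)
	}
	// To4 returns nil for addresses that are not IPv4.
	if ip.To4() != nil {
		return "IPv4", nil
	}
	return "IPv6", nil
}

func main() {
	// CIDR order taken from the install config in the description.
	family, err := familyOfFirstCIDR([]string{"2620:52:0:1302::/64", "10.0.0.0/16"})
	fmt.Println(family, err)
}
```

With this policy, swapping the two machineNetwork entries changes the reported family, which matches the IPv4-primary versus IPv6-primary distinction described in this comment.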

Comment 3 Dan Mace 2021-01-05 15:45:22 UTC
Thanks for the research, Dan.

I believe the parsing fix would be to change this line:

https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/cmd/render/render.go#L451

   machineCIDR := fmt.Sprintf("%v", network)

to

   machineCIDR := fmt.Sprintf("%v", network["cidr"])

Which I think we agree would technically work but would in this case result in the bootstrap etcd member binding to the IPv4 addr even though IPv6 is probably more consistent with the rest of the setup. Picking the family based on the first CIDR in the machine network list sounds like it could produce the more consistent effect of the etcd bootstrap member binding to IPv6, but would be a more significant behavioral change. Either way I think this code needs some test coverage so I'm not opposed to either way and would defer to you (or anybody else) with a strong opinion on the matter.

Comment 4 Dan Winship 2021-01-05 18:16:33 UTC
No, the line before that one is buggy too.

	for _, network := range networking["machineNetwork"].([]interface{})[0].(map[string]interface{}) {
		machineCIDR := fmt.Sprintf("%v", network)

networking["machineNetwork"] is an array of objects. The code ought to be looping over each object in the array, and checking the value of its "cidr" property, but instead it's looping over each key/value pair of only the first object in the array, but ignoring the keys and assuming all the values are CIDRs. ie, if the install config looked like:

    machineNetwork:
      - cidr: 2620:52:0:1302::/64
        totallyFakeCIDR: 99.99.0.0/16
      - cidr: 10.0.0.0/16

then it would return "99.99.0.0/16".
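
A rough reproduction of that behaviour, as a sketch: buggyIPv4Lookup below is a hypothetical stand-in that imitates the iteration pattern quoted above (values of the first object only) and assumes the surrounding code keeps the first value that parses as an IPv4 CIDR. The real render.go may differ in its details.

```go
package main

import (
	"fmt"
	"net"
)

// buggyIPv4Lookup is a hypothetical imitation of the flawed loop: it ranges
// over the key/value pairs of the FIRST machineNetwork object, ignores the
// keys, and returns the first value that parses as an IPv4 CIDR.
func buggyIPv4Lookup(networking map[string]interface{}) string {
	first := networking["machineNetwork"].([]interface{})[0].(map[string]interface{})
	for _, value := range first {
		candidate := fmt.Sprintf("%v", value)
		ip, _, err := net.ParseCIDR(candidate)
		if err == nil && ip.To4() != nil {
			return candidate
		}
	}
	return "" // nothing IPv4-looking in the first object
}

func main() {
	// Install config fragment from the example above, already decoded.
	withFakeKey := map[string]interface{}{
		"machineNetwork": []interface{}{
			map[string]interface{}{
				"cidr":            "2620:52:0:1302::/64",
				"totallyFakeCIDR": "99.99.0.0/16",
			},
			map[string]interface{}{"cidr": "10.0.0.0/16"},
		},
	}
	fmt.Printf("%q\n", buggyIPv4Lookup(withFakeKey))
}
```

Under the same assumptions, the reporter's config (IPv6 entry first, no extra keys) yields an empty result, which is consistent with the "machineNetwork is not found in install-config" failure.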

What it should be doing is something like

        for _, network := range networking["machineNetwork"].([]interface{}) {
                networkMap := network.(map[string]interface{})
                machineCIDR := networkMap["cidr"].(string)

I think? I tried a few things before and they kept not working...
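
One way the loop described in this comment might be written out in full is sketched below. It assumes the install config has already been decoded into map[string]interface{} values, as in the snippets above. findIPv4MachineCIDR is a hypothetical name, not the function from the eventual fix. Note that ranging over a slice with a single variable yields the index, so the sketch uses the two-variable form.

```go
package main

import (
	"fmt"
	"net"
)

// findIPv4MachineCIDR is a hypothetical helper (not the operator's code).
// It visits every machineNetwork entry, reads only its "cidr" key, and
// returns the first CIDR whose address is IPv4.
func findIPv4MachineCIDR(networking map[string]interface{}) (string, error) {
	entries, ok := networking["machineNetwork"].([]interface{})
	if !ok {
		return "", fmt.Errorf("machineNetwork is missing or not a list")
	}
	for _, entry := range entries {
		entryMap, ok := entry.(map[string]interface{})
		if !ok {
			continue // skip entries that are not objects
		}
		cidr, ok := entryMap["cidr"].(string)
		if !ok {
			continue // skip entries without a string "cidr"
		}
		ip, _, err := net.ParseCIDR(cidr)
		if err != nil {
			return "", fmt.Errorf("invalid machineNetwork CIDR %q: %w", cidr, err)
		}
		if ip.To4() != nil {
			return cidr, nil
		}
	}
	return "", fmt.Errorf("no IPv4 CIDR found in machineNetwork")
}

func main() {
	// machineNetwork from the install config in the description.
	networking := map[string]interface{}{
		"machineNetwork": []interface{}{
			map[string]interface{}{"cidr": "2620:52:0:1302::/64"},
			map[string]interface{}{"cidr": "10.0.0.0/16"},
		},
	}
	cidr, err := findIPv4MachineCIDR(networking)
	fmt.Println(cidr, err)
}
```

The comma-ok type assertions are a deliberate choice here: a malformed install config produces an error or a skipped entry instead of a panic during bootstrap rendering.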

Comment 6 ge liu 2021-02-05 03:44:03 UTC
I have contacted the edge team for help setting up a dual-stack environment, since we do not have an environment that can simulate this cluster configuration, and they are trying it. Thanks. cc: @yprokule

Comment 8 ge liu 2021-02-08 14:49:46 UTC
Hi Dan, according to comment 7, it seems that another issue has appeared. Could you help investigate whether the new issue is the same as the original issue in this bug? If yes, we may move the bug status back; if not, we can verify this bug and file a new bug to track the new issue. Thanks.

Comment 10 ge liu 2021-02-09 11:18:00 UTC
According to comments 7 and 9, QE has not hit this issue but did hit a different one, and has filed a new bug to track it. Closing this bug; the dual-stack installer issue will be tracked in the new bug.

Comment 13 errata-xmlrpc 2021-02-24 15:43:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

