Bug 1946079
Summary: | Virtual master is not getting an IP address | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Sabina Aledort <saledort>
Component: | Bare Metal Hardware Provisioning | Assignee: | Angus Salkeld <asalkeld>
Bare Metal Hardware Provisioning sub component: | cluster-baremetal-operator | QA Contact: | Ori Michaeli <omichael>
Status: | CLOSED ERRATA | Docs Contact: | jfrye
Severity: | urgent | |
Priority: | high | CC: | aos-bugs, asalkeld, beth.white, bfournie, derekh, jfrye, jlebon, keyoung, lshilin, mmethot, rbartal, shardy, stbenjam, walters, yjoseph, ykashtan, zbitter
Version: | 4.8 | Keywords: | AutomationBlocker, OtherQA, Triaged
Target Milestone: | --- | |
Target Release: | 4.8.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
Release Note text:
Previously, a node sometimes selected the incorrect IP version upon startup (IPv6 instead of IPv4, or vice versa). The node would fail to start because it did not receive an IP address. This is fixed by the Cluster Bare Metal Operator passing the IP option to the downloader (`ip=dhcp` or `ip=dhcp6`), so this is set correctly at startup and the node starts as expected.
-------
Cause:
When a node is coming up, it sometimes selects the incorrect IP version (v6 instead of v4, or the other way around).
Consequence:
The node will not get an IP address and fail to come up.
Fix:
cluster-baremetal-operator passes the IP Option to the downloader (ip=dhcp or ip=dhcp6) so that this is set correctly at start up.
Result:
The node comes up as expected.
|
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2021-07-27 22:57:19 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1973367 | |
Attachments: |
|
Description
Sabina Aledort
2021-04-04 06:46:35 UTC
Created attachment 1769020 [details]
cnfdd5-master-1.log
Created attachment 1769021 [details]
cnfdd5-master-2.log
Assigning to networking squad.

On the journal that is failing to get ignition data, we see network manager finishing before it gets an IPv4 address

[ 10.408640] ignition[793]: GET error: Get "https://10.19.16.122:22623/config/master": dial tcp 10.19.16.122:22623: connect: network is unreachable
[ 10.911753] NetworkManager[833]: <info> [1617287414.8827] device (enp1s0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
[ 10.915630] NetworkManager[833]: <info> [1617287414.8828] device (enp1s0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
[ 10.918542] NetworkManager[833]: <info> [1617287414.8828] device (enp1s0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
[ 10.921465] NetworkManager[833]: <info> [1617287414.8829] manager: NetworkManager state is now CONNECTED_LOCAL
[ 10.925318] NetworkManager[833]: <info> [1617287414.8833] manager: NetworkManager state is now CONNECTED_SITE
[ 10.927699] NetworkManager[833]: <info> [1617287414.8834] policy: set 'Wired Connection' (enp1s0) as default for IPv6 routing and DNS
[ 10.930355] NetworkManager[833]: <info> [1617287414.8834] device (enp1s0): Activation: successful, device activated.
[ 10.932787] NetworkManager[833]: <info> [1617287414.8835] manager: NetworkManager state is now CONNECTED_GLOBAL
[ 10.935193] NetworkManager[833]: <info> [1617287414.8840] manager: startup complete
[ 10.936754] NetworkManager[833]: <info> [1617287414.8840] quitting now that startup is complete
[ 10.944279] NetworkManager[833]: <info> [1617287414.9154] dhcp4 (enp1s0): canceled DHCP transaction
[ 10.947633] NetworkManager[833]: <info> [1617287414.9155] dhcp4 (enp1s0): state changed unknown -> done
[ 10.949953] NetworkManager[833]: <info> [1617287414.9157] manager: NetworkManager state is now CONNECTED_SITE
[ 10.952341] NetworkManager[833]: <info> [1617287414.9158] exiting (success)
[[ 10.979769] systemd[1]: Started dracut initqueue hook.

I'm wondering if it's possible there is a race condition between IPv4 and IPv6 IP address management

[ 10.927699] NetworkManager[833]: <info> [1617287414.8834] policy: set 'Wired Connection' (enp1s0) as default for IPv6 routing and DNS

Please note how AI solved this: https://bugzilla.redhat.com/show_bug.cgi?id=1931852#c8
I strongly believe IPI should do the same, i.e.
if the API IP is IPv4, add ip=dhcp
if it's IPv6, add ip=dhcp6

I think we agreed in a prior bug on this that it's OK to do `virt-edit rhcos-openstack.qcow2` and inject kernel arguments that way. We in fact do this in some cases in our CI, see https://github.com/coreos/coreos-assembler/blob/c19aee6f6a346f36d2b24a2d8499c55b3c805a2d/mantle/platform/qemu.go#L706

Long term though it would be greatly beneficial for metal IPI to converge on the Live ISO because that's all public APIs effectively with tooling built around it.

(In reply to Colin Walters from comment #8)
> I think we agreed in a prior bug on this that it's OK to do `virt-edit
> rhcos-openstack.qcow2` and inject kernel arguments that way. We in fact do
> this in some cases in our CI, see
> https://github.com/coreos/coreos-assembler/blob/c19aee6f6a346f36d2b24a2d8499c55b3c805a2d/mantle/platform/qemu.go#L706

Thanks for confirming, so we'd perhaps add something like:

sudo virt-edit -a image.qcow2 -m /dev/sda3 /boot/loader/entries/ostree-1-rhcos.conf -e "s/^options/options ip=dhcp/"

in the downloader container somewhere around here:

https://github.com/openshift/ironic-rhcos-downloader/blob/master/get-resource.sh#L65

(that's assuming virt-edit works OK inside a container...)
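
For illustration only, the effect of the substitution in the virt-edit command above can be sketched in Python. This is not code from the downloader: the function name and the sample boot loader entry are hypothetical, and the sketch assumes the expression is applied line by line, so that `^options` matches the start of the `options` line.

```python
import re

# Hypothetical boot loader entry, invented for this sketch; a real RHCOS
# entry will have different contents.
SAMPLE_ENTRY = """title Example OS (ostree)
version 1
options root=UUID=0000 rw
linux /ostree/example/vmlinuz
initrd /ostree/example/initramfs.img
"""


def add_kernel_arg(entry_text: str, ip_option: str) -> str:
    """Insert ip_option as the first kernel argument on the 'options' line.

    Mirrors the expression s/^options/options ip=dhcp/ from the comment,
    with re.MULTILINE so that ^ matches at the start of every line.
    """
    return re.sub(r"^options", f"options {ip_option}", entry_text, flags=re.MULTILINE)


if __name__ == "__main__":
    print(add_kernel_arg(SAMPLE_ENTRY, "ip=dhcp"))
```

Note that the substitution is not idempotent: applying it to an entry that already carries the argument inserts it a second time.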
We'd also need to wire in a new variable so the script knows if the MCS endpoint is accessible via ipv4 or ipv6 in a dual-stack environment (we can look at the API VIP IP to determine this as previously mentioned)

> Long term though it would be greatly beneficial for metal IPI converge on
> the Live ISO because that's all public APIs effectively with tooling built
> around it.

Ack, there is work in-progress around that (planned for 4.9), but since we may need to backport this interim fix probably modifying the image is preferable.

Hi,

We tested the suggested workaround (setting the kernel argument 'ip=dhcp' with 'virt-edit') and it seems to be working. All the masters are up.

[root@cnfdd5-installer ~]# oc get node
NAME                                             STATUS   ROLES    AGE     VERSION
cnfdd5.clus2.t5g.lab.eng.bos.redhat.com          Ready    worker   105s    v1.21.0-rc.0+aa1dc1f
cnfdd7.clus2.t5g.lab.eng.bos.redhat.com          Ready    worker   3m50s   v1.21.0-rc.0+aa1dc1f
cnfdd8.clus2.t5g.lab.eng.bos.redhat.com          Ready    worker   8m29s   v1.21.0-rc.0+aa1dc1f
dhcp19-17-115.clus2.t5g.lab.eng.bos.redhat.com   Ready    master   52m     v1.21.0-rc.0+aa1dc1f
dhcp19-17-116.clus2.t5g.lab.eng.bos.redhat.com   Ready    master   52m     v1.21.0-rc.0+aa1dc1f
dhcp19-17-117.clus2.t5g.lab.eng.bos.redhat.com   Ready    master   51m     v1.21.0-rc.0+aa1dc1f

Just to clarify, in this cluster (cnfdd5) we were hitting the bug in almost every deployment (at least once a day) as one or two masters didn't get an IP address. We applied the workaround twice today and all the masters joined the cluster.

I was able to get virt-edit working in a container, and so we're passing the API VIP to the ironic-rhcos-downloader, and based on that value we're updating the dhcp configuration. The two PRs are linked, and ready for review.
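
As a rough sketch of the selection described above (not the code from the linked PRs), the choice between `ip=dhcp` and `ip=dhcp6` can be derived from the IP version of the API VIP. The function name is hypothetical; the example addresses are the ones that appear in this bug.

```python
import ipaddress


def ip_option_for_vip(api_vip: str) -> str:
    """Return the kernel argument matching the IP version of the API VIP.

    Assumption for this sketch: an IPv4 VIP maps to ip=dhcp and an IPv6 VIP
    maps to ip=dhcp6, as proposed in the comments. Raises ValueError if
    api_vip is not a valid IP address.
    """
    address = ipaddress.ip_address(api_vip)
    if address.version == 4:
        return "ip=dhcp"
    return "ip=dhcp6"


if __name__ == "__main__":
    for vip in ("10.19.16.122", "fd02::1"):
        print(vip, "->", ip_option_for_vip(vip))
```

Because a single VIP is either IPv4 or IPv6, this gives one answer even on a dual-stack cluster; it says nothing about which IP versions the machine networks use.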
It looks like there are a few issues with the implementation of this in https://github.com/openshift/cluster-baremetal-operator/pull/148.

It looks like CBO is getting an IP on the service network for the API VIP (it's fd02::1), since it's using the SDN networking to reach the API VIP. I'm not sure we're really determining the right IP_OPTIONS to pass here. The machine networks could be a totally different IP version -- perhaps there's a better way to look and see what we're using for the machine networks to determine if it's IPv4/IPv6/Dualstack.

It also looks like the dualstack case is never handled, see the comment on https://github.com/openshift/cluster-baremetal-operator/pull/148/files/da0fdb5627d71ed60bcc1a5dfac8e9360b0d33a1#diff-1575ce96065be1a97bee923445ae60115c8ce02b4a2736788012df8162407100.

Lastly, we've hardcoded /dev/sda3 in the downloader containers, but that might not always be the case. There's a proposal that should work on any RHCOS image: https://github.com/openshift/ironic-rhcos-downloader/pull/40/files/52556f2395f22b8a586056b2d40c94d538645772#diff-4d7ed882fb9025e86dd94cc0b3034dec23a7201db1a0cf08109490a53528718f

One small update: I think the dualstack case is handled implicitly, since it matches neither case in https://github.com/openshift/cluster-baremetal-operator/pull/148/files/da0fdb5627d71ed60bcc1a5dfac8e9360b0d33a1#diff-1575ce96065be1a97bee923445ae60115c8ce02b4a2736788012df8162407100, but maybe explicit would be clearer since at least two people missed the subtlety there.

(In reply to Stephen Benjamin from comment #16)
> One small update: I think the dualstack case is handled implicitly, since it
> matches neither case in
> https://github.com/openshift/cluster-baremetal-operator/pull/148/files/da0fdb5627d71ed60bcc1a5dfac8e9360b0d33a1#diff-1575ce96065be1a97bee923445ae60115c8ce02b4a2736788012df8162407100,
> but maybe explicit would be clearer since at least two people missed the
> subtlety there.
https://github.com/openshift/cluster-baremetal-operator/pull/154

This was verified on CI.

*** Bug 1940128 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438