Bug 2087096
| Summary: | IPv6 ipi jobs failing: No route to registry virthost.ostest.test.metalkube.org:5000 | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Derek Higgins <derekh> |
| Component: | Bare Metal Hardware Provisioning | Assignee: | Bob Fournier <bfournie> |
| Bare Metal Hardware Provisioning sub component: | ironic | QA Contact: | Jad Haj Yahya <jhajyahy> |
| Status: | CLOSED DEFERRED | Docs Contact: | |
| Severity: | urgent | ||
| Priority: | medium | CC: | bfournie, dgoodwin, janders, lshilin, rpittau, stbenjam, tsedovic, zbitter |
| Version: | 4.11 | Keywords: | Triaged |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-03-09 01:19:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Derek Higgins
2022-05-17 10:08:30 UTC
I am not sure if this will help, but TRT's watcher notes from a few weeks ago had this note about NetworkManager in RHEL 8.6, if that's what you're picking up:
RHCOS based on 8.6 was reintroduced and Azure is failing with nodes to reboot, still investigating other problems with the new RHCOS. #wg-4_11-triage on Slack

NetworkManager team identified 2 bugs based on our logs, details in https://bugzilla.redhat.com/show_bug.cgi?id=2077605#c4

This problem is in the deployment scripts. CI is back up and running with a workaround, but I'm going to leave this bug open and urgent because the workaround will fail whenever Equinix starts using Rocky 8.6 as their base image, which could happen any day.

Not a release blocker, only affects dev-scripts.

Derek, the linked PR is merged. Can we close this / move to QE?

(In reply to Tomas Sedovic from comment #6)
> Derek, the linked PR is merged. Can we close this / move to QE?
Afraid not, the merged PR is only a workaround; we still need to find and fix the root cause.

Rocky 8.5 is no longer available, so we cannot rely on this workaround any longer (I removed it in https://github.com/openshift/release/pull/30488 because it was now breaking all dev-scripts jobs). Although bug 2077605 has been fixed, the fix has not actually shipped, so this bug is back:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/30488/rehearse-30488-pull-ci-openshift-metal3-dev-scripts-master-e2e-metal-ipi-proxy-ipv6/1547658252685152256#1:build-log.txt%3A14907

Looking at the logs, I don't see any evidence of the DHCP problem (DHCP isn't used to add ::1 to ostestbm, is it?).
The only thing I really see is that ostestbm doesn't look like it ever goes online:
Jul 14 19:28:28 cir-cir-19 NetworkManager[1842]: <info> [1657826908.4763] device (ostestbm): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Jul 14 19:28:28 cir-cir-19 NetworkManager[1842]: <info> [1657826908.4765] device (ostestbm): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Jul 14 19:28:28 cir-cir-19 NetworkManager[1842]: <info> [1657826908.4768] device (ostestbm): Activation: successful, device activated.
Jul 14 19:28:28 cir-cir-19 dbus-daemon[1762]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.7' (uid=0 pid=1842 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")
Jul 14 19:28:28 cir-cir-19 dbus-daemon[1762]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.nm-dispatcher.service': Unit dbus-org.freedesktop.nm-dispatcher.service not found.
Jul 14 19:28:28 cir-cir-19 kernel: IPv6: ADDRCONF(NETDEV_UP): ostestbm: link is not ready
I'd suggest someone actually needs to spin up a host from Equinix and look at the bridges to see why they're not up.

I did some investigation on this failure in dev-scripts using CentOS 8 Stream. This uses
$ NetworkManager --version
1.39.12-1.el8
The problem is that the ostestbm bridge interface is DOWN with NO-CARRIER, which causes the failure.
172: ostestbm: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 52:54:00:bc:a0:46 brd ff:ff:ff:ff:ff:ff
    inet6 fd2e:6f44:5dd8:c956::1/120 scope global tentative
       valid_lft forever preferred_lft forever
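The two symptoms in that output (the NO-CARRIER flag and the address stuck in the 'tentative' state) can be checked mechanically. The following is only a sketch: `bridge_state` is a hypothetical helper, not part of dev-scripts, and it assumes the input is the text printed by `ip addr show dev <bridge>` for a single interface.

```shell
# Hypothetical helper (not part of dev-scripts): classify a bridge from the
# text printed by `ip addr show dev <bridge>`, read on stdin.
bridge_state() {
    awk '
        # The first line carries the flag list between "<" and ">".
        NR == 1 {
            if ($0 ~ /<[^>]*NO-CARRIER[^>]*>/) carrier = "no-carrier"
            else if ($0 ~ /<[^>]*LOWER_UP[^>]*>/) carrier = "carrier"
            else carrier = "unknown"
        }
        # An inet6 address that has not finished duplicate address
        # detection is still marked "tentative".
        $1 == "inet6" && /tentative/ { tentative = 1 }
        END { print carrier (tentative ? " tentative" : "") }
    '
}

# Usage on a live host:
#   ip addr show dev ostestbm | bridge_state
```

For the output above this reports both symptoms together, which is the combination that leaves fd2e:6f44:5dd8:c956::1 unreachable.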
During boot I saw the following errors:
+(./02_configure_host.sh:210): main(): sudo ifup ostestbm
WARN : [ifup] You are using 'ifup' script provided by 'network-scripts', which are now deprecated.
WARN : [ifup] 'network-scripts' will be removed in one of the next major releases of RHEL.
WARN : [ifup] It is advised to switch to 'NetworkManager' instead - it provides 'ifup/ifdown' scripts as well.
ERROR : [/etc/sysconfig/network-scripts/ifup-ipv6] Global IPv6 forwarding is disabled in configuration, but not currently disabled in kernel
ERROR : [/etc/sysconfig/network-scripts/ifup-ipv6] Please restart network with '/sbin/service network restart'
INFO : [ipv6_wait_tentative] Waiting for interface ostestbm IPv6 address(es) to leave the 'tentative' state
Setting this in /etc/sysconfig/network
IPV6FORWARDING=yes
caused that ERROR to go away but it didn't resolve the issue with ostestbm
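For anyone who wants to confirm the same mismatch before editing the file, here is a rough sketch. `check_ipv6_forwarding` is a hypothetical helper; it assumes that a missing IPV6FORWARDING setting means "no" and that only the literal value `yes` enables forwarding. Both paths are parameters so it can be pointed at test files.

```shell
# Hypothetical helper: compare IPV6FORWARDING in a sysconfig file with the
# kernel's global IPv6 forwarding flag.
check_ipv6_forwarding() {
    local cfg_file="${1:-/etc/sysconfig/network}"
    local proc_file="${2:-/proc/sys/net/ipv6/conf/all/forwarding}"
    local cfg want have

    # The last IPV6FORWARDING= line wins. Assumption: unset means "no".
    cfg=$(sed -n 's/^IPV6FORWARDING=//p' "$cfg_file" 2>/dev/null | tail -n 1)
    case "$cfg" in
        yes|\"yes\") want=1 ;;
        *)           want=0 ;;
    esac

    # Fall back to 0 if the kernel file cannot be read.
    have=$(cat "$proc_file" 2>/dev/null || echo 0)

    if [ "$want" -ne "$have" ]; then
        echo "mismatch: config=$want kernel=$have"
        return 1
    fi
    echo "consistent: config=$want kernel=$have"
}

# Usage on a host:
#   check_ipv6_forwarding
```

A "mismatch: config=0 kernel=1" result corresponds to the ERROR quoted above. As noted, clearing it does not bring ostestbm up.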
I found a workaround here https://serverfault.com/questions/1062421/linux-ipv6-bridge-address-does-not-work-when-mac-address-is-forced that suggested adding a dummy interface to the bridge.
I added this to 02_configure_host.sh:
+# add a dummy interface to ensure the bridge comes up
+if [[ -n "${EXTERNAL_SUBNET_V6}" ]] && [ ! "$INT_IF" ]; then
+    sudo ip link add name ${BAREMETAL_NETWORK_NAME}-dummy up master ${BAREMETAL_NETWORK_NAME} type dummy
+fi
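One thing to watch if this is reused with other bridge names: Linux interface names are limited to 15 characters, and the "-dummy" suffix uses six of them. The sketch below is a hypothetical variant, not what was added to 02_configure_host.sh. `dummy_port_cmd` is an illustrative name; it only prints the same `ip link add` command after checking the name length, so it can be inspected before running.

```shell
# Hypothetical helper: print the ip(8) command that gives a bridge a dummy
# port, refusing names the kernel would reject.
dummy_port_cmd() {
    local bridge="$1"
    local dummy="${bridge}-dummy"

    # Linux interface names are limited to 15 characters (IFNAMSIZ is 16,
    # including the trailing NUL).
    if [ -z "$bridge" ] || [ "${#dummy}" -gt 15 ]; then
        echo "invalid dummy interface name: '${dummy}'" >&2
        return 1
    fi
    printf 'ip link add name %s up master %s type dummy\n' "$dummy" "$bridge"
}

# Usage on a host (creates the port only if it does not exist yet):
#   ip link show ostestbm-dummy >/dev/null 2>&1 || {
#       cmd=$(dummy_port_cmd ostestbm) && sudo $cmd
#   }
```

Printing the command rather than running it keeps the helper free of side effects; the `ip link show` guard in the usage note is what makes a re-run harmless.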
With that change ostestbm came up and the install continued.
214: ostestbm: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 52:54:00:b3:95:a0 brd ff:ff:ff:ff:ff:ff
    inet6 fd2e:6f44:5dd8:c956::1/120 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:feb3:95a0/64 scope link
       valid_lft forever preferred_lft forever
215: ostestbm-dummy: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master ostestbm state UNKNOWN group default qlen 1000
    link/ether 82:1a:0e:a3:d7:ee brd ff:ff:ff:ff:ff:ff
    inet6 fe80::801a:eff:fea3:d7ee/64 scope link
       valid_lft forever preferred_lft forever

(In reply to Bob Fournier from comment #11)
> With that change ostestbm came up and the install continued.

thanks Bob, that's great!
I confirm the workaround also works using pure NetworkManager without legacy scripts; tested on CentOS Stream 8.
Also tested on CentOS Stream 9:
$ NetworkManager --version
1.39.90-1.el9
without workaround:
Error: authenticating creds for "virthost.ostest.test.metalkube.org:5000": pinging container registry virthost.ostest.test.metalkube.org:5000: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": dial tcp [fd2e:6f44:5dd8:c956::1]:5000: connect: no route to host
with workaround:
sudo podman login --authfile /home/metalhead/private-mirror-ostest.json -u ocp-user -p ocp-pass virthost.ostest.test.metalkube.org:5000
Login Succeeded!

OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.
https://issues.redhat.com/browse/OCPBUGS-9270