Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2087096

Summary: IPv6 ipi jobs failing: No route to registry virthost.ostest.test.metalkube.org:5000
Product: OpenShift Container Platform
Component: Bare Metal Hardware Provisioning
Sub component: ironic
Reporter: Derek Higgins <derekh>
Assignee: Bob Fournier <bfournie>
QA Contact: Jad Haj Yahya <jhajyahy>
CC: bfournie, dgoodwin, janders, lshilin, rpittau, stbenjam, tsedovic, zbitter
Status: CLOSED DEFERRED
Severity: urgent
Priority: medium
Version: 4.11
Keywords: Triaged
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2023-03-09 01:19:23 UTC

Description Derek Higgins 2022-05-17 10:08:30 UTC
Seems to have started just over 10 hours ago

The error occurs early in dev-scripts:
+(./02_configure_host.sh:243): main(): sudo podman login --authfile /root/private-mirror-ostest.json -u ocp-user -p ocp-pass virthost.ostest.test.metalkube.org:5000
Error: authenticating creds for "virthost.ostest.test.metalkube.org:5000": pinging container registry virthost.ostest.test.metalkube.org:5000: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": dial tcp [fd2e:6f44:5dd8:c956::1]:5000: connect: no route to host

Current thinking is that it is related to the handling of legacy network-scripts.

Comment 2 Devan Goodwin 2022-05-18 11:03:21 UTC
I am not sure if this will help, but TRT's watcher notes from a few weeks ago had this note about NetworkManager in RHEL 8.6, if that's what you're picking up:  

RHCOS based on 8.6 was reintroduced and Azure is failing with nodes to reboot, still investigating other problems with the new RHCOS.
#wg-4_11-triage on Slack
NetworkManager team identified 2 bugs based on our logs, details in https://bugzilla.redhat.com/show_bug.cgi?id=2077605#c4

Comment 4 Derek Higgins 2022-05-19 10:18:25 UTC
This problem is in the deployment scripts. CI is back up and running with a workaround,
but I'm going to leave this bug open and urgent, as the workaround will fail whenever
Equinix starts using Rocky 8.6 as their base image, which could happen any day.

Comment 5 Derek Higgins 2022-06-03 16:39:24 UTC
Not a release blocker; this only affects dev-scripts.

Comment 6 Tomas Sedovic 2022-06-13 14:35:21 UTC
Derek, the linked PR is merged. Can we close this / move to QE?

Comment 7 Derek Higgins 2022-06-13 16:09:00 UTC
(In reply to Tomas Sedovic from comment #6)
> Derek, the linked PR is merged. Can we close this / move to QE?

Afraid not. The merged PR is only a workaround; we still need to find and fix the root cause.

Comment 8 Zane Bitter 2022-07-14 19:50:49 UTC
Rocky 8.5 is no longer available, so we cannot rely on this workaround any longer (I removed it in https://github.com/openshift/release/pull/30488 because it was now breaking all dev-scripts jobs).

Despite bug 2077605 having been fixed, the fix has not actually shipped, so this bug is back:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/30488/rehearse-30488-pull-ci-openshift-metal3-dev-scripts-master-e2e-metal-ipi-proxy-ipv6/1547658252685152256#1:build-log.txt%3A14907

Comment 9 Stephen Benjamin 2022-07-15 12:23:37 UTC
Looking at the logs, I don't see any evidence of the DHCP problem (DHCP isn't used to add ::1 to ostestbm, is it?).  The only thing I really see is ostestbm doesn't look like it ever goes online:

Jul 14 19:28:28 cir-cir-19 NetworkManager[1842]: <info>  [1657826908.4763] device (ostestbm): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Jul 14 19:28:28 cir-cir-19 NetworkManager[1842]: <info>  [1657826908.4765] device (ostestbm): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Jul 14 19:28:28 cir-cir-19 NetworkManager[1842]: <info>  [1657826908.4768] device (ostestbm): Activation: successful, device activated.
Jul 14 19:28:28 cir-cir-19 dbus-daemon[1762]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.7' (uid=0 pid=1842 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")
Jul 14 19:28:28 cir-cir-19 dbus-daemon[1762]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.nm-dispatcher.service': Unit dbus-org.freedesktop.nm-dispatcher.service not found.
Jul 14 19:28:28 cir-cir-19 kernel: IPv6: ADDRCONF(NETDEV_UP): ostestbm: link is not ready

I'd suggest someone actually needs to spin up a host from Equinix and look at the bridges to see why they're not up.

Comment 11 Bob Fournier 2022-08-25 20:39:17 UTC
I did some investigation on this failure in dev-scripts using CentOS 8 Stream. This uses
$ NetworkManager --version
1.39.12-1.el8

The problem is that the ostestbm interface on the bridge is Down with NO-CARRIER which causes the failure.
172: ostestbm: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 52:54:00:bc:a0:46 brd ff:ff:ff:ff:ff:ff
    inet6 fd2e:6f44:5dd8:c956::1/120 scope global tentative 
       valid_lft forever preferred_lft forever
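
For anyone reproducing this, the flag check above can be scripted. The following is only a minimal sketch: the helper name `link_carrier_state` is made up for illustration and is not part of dev-scripts. It reads one line of `ip -o link show` output and classifies it by the flags between `<` and `>`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of dev-scripts): classify one line of
# `ip -o link show` output by its interface flags.
link_carrier_state() {
    local line flags
    line=$(cat)
    # Flags are the comma-separated list between '<' and '>'.
    flags=${line#*<}
    flags=${flags%%>*}
    case ",${flags}," in
        *,NO-CARRIER,*) echo "no-carrier" ;;
        *,LOWER_UP,*)   echo "carrier" ;;
        *)              echo "unknown" ;;
    esac
}

# On a live host (ostestbm is the bridge name from this report):
#   ip -o link show dev ostestbm | link_carrier_state
```

Fed the `172: ostestbm: <NO-CARRIER,...>` line shown above, it prints `no-carrier`.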

During boot I saw the following errors:
+(./02_configure_host.sh:210): main(): sudo ifup ostestbm
WARN      : [ifup] You are using 'ifup' script provided by 'network-scripts', which are now deprecated.
WARN      : [ifup] 'network-scripts' will be removed in one of the next major releases of RHEL.
WARN      : [ifup] It is advised to switch to 'NetworkManager' instead - it provides 'ifup/ifdown' scripts as well.
ERROR     : [/etc/sysconfig/network-scripts/ifup-ipv6] Global IPv6 forwarding is disabled in configuration, but not currently disabled in kernel
ERROR     : [/etc/sysconfig/network-scripts/ifup-ipv6] Please restart network with '/sbin/service network restart'
INFO      : [ipv6_wait_tentative] Waiting for interface ostestbm IPv6 address(es) to leave the 'tentative' state

Setting this in /etc/sysconfig/network
IPV6FORWARDING=yes

caused that ERROR to go away but it didn't resolve the issue with ostestbm
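
As a rough illustration of the mismatch that the ifup-ipv6 ERROR describes, the sketch below compares the configured value with the kernel value. The helper name `ipv6_forwarding_check` is hypothetical, and treating an unset IPV6FORWARDING as "no" is an assumption based on the error text above. The kernel value lives in /proc/sys/net/ipv6/conf/all/forwarding.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare IPV6FORWARDING in a sysconfig-style file
# with a kernel forwarding value (0 or 1) and print "match" or "mismatch".
ipv6_forwarding_check() {
    local conf_file kernel_value configured
    conf_file=$1
    kernel_value=$2
    # The last assignment in the file wins; strip optional quotes.
    # Assumption: an unset IPV6FORWARDING is treated as "no".
    configured=$(sed -n 's/^IPV6FORWARDING=//p' "$conf_file" 2>/dev/null \
        | tail -n 1 | tr -d "\"'")
    if [ "$configured" = "yes" ]; then
        configured=1
    else
        configured=0
    fi
    if [ "$configured" = "$kernel_value" ]; then
        echo "match"
    else
        echo "mismatch"
    fi
}

# On a live host:
#   ipv6_forwarding_check /etc/sysconfig/network \
#       "$(cat /proc/sys/net/ipv6/conf/all/forwarding)"
```

This only reports the inconsistency behind the ERROR line. As noted, resolving it did not bring ostestbm up.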

I found a workaround here https://serverfault.com/questions/1062421/linux-ipv6-bridge-address-does-not-work-when-mac-address-is-forced that suggested adding a dummy interface to the bridge. 

I added this to 02_configure_host.sh:
+# add a dummy interface to ensure the bridge comes up
+if [[ -n "${EXTERNAL_SUBNET_V6}" ]] && [ ! "$INT_IF" ]; then
+    sudo ip link add name ${BAREMETAL_NETWORK_NAME}-dummy up master ${BAREMETAL_NETWORK_NAME} type dummy
+fi

With that change ostestbm came up and the install continued.
214: ostestbm: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 52:54:00:b3:95:a0 brd ff:ff:ff:ff:ff:ff
    inet6 fd2e:6f44:5dd8:c956::1/120 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:feb3:95a0/64 scope link 
       valid_lft forever preferred_lft forever
215: ostestbm-dummy: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master ostestbm state UNKNOWN group default qlen 1000
    link/ether 82:1a:0e:a3:d7:ee brd ff:ff:ff:ff:ff:ff
    inet6 fe80::801a:eff:fea3:d7ee/64 scope link 
       valid_lft forever preferred_lft forever
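
If 02_configure_host.sh can be re-run on the same host, the `ip link add` above would fail the second time because the dummy link already exists. A possible idempotent variant is sketched below. The function name `ensure_bridge_dummy` is hypothetical; the `ip link add` arguments are the same as in the change above.

```shell
#!/usr/bin/env bash
# Sketch of an idempotent variant of the workaround. The function name is
# hypothetical; the `ip link add` arguments match the change quoted above.
ensure_bridge_dummy() {
    local bridge dummy
    bridge=$1
    dummy="${bridge}-dummy"
    # Skip creation if the dummy link already exists (e.g. on a re-run).
    if ip link show dev "$dummy" >/dev/null 2>&1; then
        return 0
    fi
    # Create the dummy link, bring it up, and enslave it to the bridge.
    sudo ip link add name "$dummy" up master "$bridge" type dummy
}

# In 02_configure_host.sh this could be called as:
#   ensure_bridge_dummy "${BAREMETAL_NETWORK_NAME}"
```

The sketch leaves the EXTERNAL_SUBNET_V6 / INT_IF condition to the caller, so it would still sit inside the same `if` as in the change above.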

Comment 12 Riccardo Pittau 2022-08-26 08:23:20 UTC
(In reply to Bob Fournier from comment #11)

Thanks Bob, that's great!
I can confirm the workaround also works with pure NetworkManager, without the legacy scripts.
I tested it on CentOS Stream 8.

Comment 13 Riccardo Pittau 2022-08-26 08:52:02 UTC
Also tested on CentOS Stream 9

$ NetworkManager --version
1.39.90-1.el9

without workaround:
Error: authenticating creds for "virthost.ostest.test.metalkube.org:5000": pinging container registry virthost.ostest.test.metalkube.org:5000: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": dial tcp [fd2e:6f44:5dd8:c956::1]:5000: connect: no route to host

with workaround:
sudo podman login --authfile /home/metalhead/private-mirror-ostest.json -u ocp-user -p ocp-pass virthost.ostest.test.metalkube.org:5000
Login Succeeded!

Comment 14 Shiftzilla 2023-03-09 01:19:23 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9270