Bug 1943320
| Field | Value |
|---|---|
| Summary | Baremetal node loses connectivity with bonded interface and OVNKubernetes |
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | ovn-kubernetes |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | high |
| Version | 4.7 |
| Target Release | 4.8.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Andrew Austin <aaustin> |
| Assignee | Mohamed Mahmoud <mmahmoud> |
| QA Contact | Anurag saxena <anusaxen> |
| CC | anbhat, astoycos, aygarg, memodi, mifiedle, mmahmoud, rbrattai, sbelmasg, thaller, trozet, vkochuku, vlaad, william.caban, zzhao |
| Keywords | Reopened, UpcomingSprint |
| Flags | trozet: needinfo-, mmahmoud: needinfo- |
| Doc Type | No Doc Update |
| Last Closed | 2022-08-26 14:13:05 UTC |
| Type | Bug |
| Bug Depends On | 1945429 |
| Bug Blocks | 1951028 |
Description
Andrew Austin
2021-03-25 19:23:49 UTC
Can you please provide logs for ovs-configuration.service and NM, or full system journal logs?

Created attachment 1767773 [details]: system journal 30-03-2021

Created attachment 1767774 [details]: ovs-configuration service log 30-03-2021

Created attachment 1767775 [details]: NetworkManager log 30-03-2021
I ended up in a different failure mode when attempting to reproduce the issue again for logs. This time, bond0 is still available, but there is no active physical port attached to br-ex, so the node remains in NotReady status. I have also provided Mohamed with access to the lab cluster in question.

For reference, here is the isolinux config used to build the failing node, including the networking kernel args:

```
label linux
  menu label ^Install RHEL CoreOS on ocp2-worker-3.lab.signal9.gg
  kernel /images/vmlinuz
  initrd /images/initramfs.img,/images/tls-ca.initrd
  append nomodeset rd.neednet=1 console=tty0 ignition.firstboot ignition.platform.id=metal coreos.inst=yes coreos.inst.install_dev=sda coreos.live.rootfs_url=https://172.18.0.59:443/ocp2/rhcos-rootfs.img coreos.inst.ignition_url=https://172.18.0.59:443/ocp2/worker.ign bootdev=bond0 bond=bond0:eno1,eno2:mode=active-backup ip=172.18.0.72::172.18.0.1:255.255.255.0:ocp2-worker-3.lab.signal9.gg:bond0:none:172.18.42.10 ip=eno3:none ip=eno3d1:none nameserver=172.18.42.10 nameserver=172.18.42.11
```

This looks like a bug in NetworkManager to me. Auto-activation of eno2 (a child of bond0) brings up bond0 and takes the interface away from ovs-if-phys0; this results in the "bond0" connection being up with the same IP as ovs-if-br-ex. Created https://bugzilla.redhat.com/show_bug.cgi?id=1945429

For testing, I applied this patch to the MCO and confirmed I was able to successfully deploy a node that had been failing with 4.7.3. I have not added logic to undo this change at the end of the configure script when run with OpenShiftSDN.
```diff
diff --git a/templates/common/_base/files/configure-ovs-network.yaml b/templates/common/_base/files/configure-ovs-network.yaml
index f2b79b98..79645f9a 100644
--- a/templates/common/_base/files/configure-ovs-network.yaml
+++ b/templates/common/_base/files/configure-ovs-network.yaml
@@ -173,6 +173,13 @@ contents:
           connection.autoconnect-priority 100 802-3-ethernet.mtu ${iface_mtu} ${extra_phys_args}
       fi
+      # Move any bond member interfaces to the new ovs-if-phys0 connection
+      if [ "$(nmcli --get-values connection.type conn show ${old_conn})" == "bond" ]; then
+        new_conn=$(grep uuid= $NM_CONN_PATH/ovs-if-phys0.nmconnection | sed s/uuid=//)
+        sed -i s/master=${old_conn}/master=${new_conn}/ $NM_CONN_PATH/*.nmconnection
+        nmcli conn reload
+      fi
+
       nmcli conn up ovs-if-phys0
       if ! nmcli connection show ovs-if-br-ex &> /dev/null; then
```

Here is a more complete patch if you ultimately choose to handle the conflict on the MCO side. I don't think the reversion is perfect, but it results in a reachable system for debugging most of the time.

https://github.com/marbindrakon/machine-config-operator/commit/192441aabd51cc77f57b7f7f060d3da7eb369891

*** Bug 1937914 has been marked as a duplicate of this bug. ***

Thanks Andrew, I think it is a good start. Mohamed, can you please create a PR for this? I'm not sure we even need to revert setting the device as the master for the slaves. I think it is probably OK to leave it, but I will leave that up to you.

So we won't wait for the NM team's assessment of bug 1945429 before closing this?

From Beniamino's response, it seems they do not think it is a bug in NM. Even if it is, we would have to get NM fixed and carried all the way back to 4.6, which would take a considerable amount of time. I think it's safe for us to go ahead with a workaround to get this fixed asap.

*** Bug 1948440 has been marked as a duplicate of this bug. ***

@aaustin Any chance you can test the fix in your environment?
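The keyfile rewrite in the patch above can be exercised offline without nmcli. The sketch below is illustrative only: the keyfile contents and the UUID are invented, and a temp directory stands in for `$NM_CONN_PATH` (normally the NetworkManager system-connections directory); the `grep`/`sed` lines mirror the patch.

```shell
#!/bin/sh
# Sketch of the keyfile rewrite the patch performs, against a temp
# directory instead of the real NetworkManager connection store.
NM_CONN_PATH=$(mktemp -d)

# Invented bond-member keyfile still pointing at the old "bond0" master.
cat > "$NM_CONN_PATH/eno1.nmconnection" <<'EOF'
[connection]
id=eno1
type=ethernet
master=bond0
slave-type=bond
EOF

# Invented ovs-if-phys0 keyfile whose uuid becomes the new master.
cat > "$NM_CONN_PATH/ovs-if-phys0.nmconnection" <<'EOF'
[connection]
id=ovs-if-phys0
uuid=11111111-2222-3333-4444-555555555555
type=bond
EOF

old_conn=bond0
# Extract the uuid of the new connection, as in the patch.
new_conn=$(grep uuid= "$NM_CONN_PATH/ovs-if-phys0.nmconnection" | sed s/uuid=//)
# Repoint every keyfile that had the old connection as its master.
sed -i "s/master=${old_conn}/master=${new_conn}/" "$NM_CONN_PATH"/*.nmconnection

grep master= "$NM_CONN_PATH/eno1.nmconnection"
```

After the rewrite, `eno1.nmconnection` carries `master=11111111-2222-3333-4444-555555555555`; on a real node the patch then runs `nmcli conn reload` so NetworkManager picks up the edited keyfiles.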
OCP QE doesn't have an env immediately available.

Sure thing. I will post results once the cluster build completes.

Looking good to me. Tested using the 4.8-2021-04-16-184424 image stream; no intervention was required to bring up a worker using bonded interfaces defined via kernel arguments.

```
[root@ocp2-worker-4 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens192: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 00:50:56:b7:ff:45 brd ff:ff:ff:ff:ff:ff
3: ens224: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 00:50:56:b7:ff:45 brd ff:ff:ff:ff:ff:ff permaddr 00:50:56:b7:9f:70
5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP group default qlen 1000
    link/ether 00:50:56:b7:ff:45 brd ff:ff:ff:ff:ff:ff
6: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 22:6d:fa:c8:ff:0a brd ff:ff:ff:ff:ff:ff
7: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 00:50:56:b7:ff:45 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.73/24 brd 172.18.0.255 scope global noprefixroute br-ex
       valid_lft forever preferred_lft forever
```

```
[root@ocp2-worker-4 ~]# grep -C 5 new_device /usr/local/bin/configure-ovs.sh
    nmcli c add type ${iface_type} conn.interface ${iface} master ovs-port-phys0 con-name ovs-if-phys0 \
      connection.autoconnect-priority 100 802-3-ethernet.mtu ${iface_mtu} "${extra_phys_args[@]}"
  fi
  # Update connections with master property set to use the new device name
  new_device=$(nmcli --get-values connection.interface-name conn show ovs-if-phys0)
  for conn_uuid in $(nmcli -g UUID connection show) ; do
    if [ "$(nmcli -g connection.master connection show uuid "$conn_uuid")" != "$old_conn" ]; then
      continue
    fi
    nmcli conn mod uuid ${conn_uuid} connection.master ${new_device}
  done
  nmcli conn up ovs-if-phys0
  if ! nmcli connection show ovs-if-br-ex &> /dev/null; then
```

Thanks, Andrew.

Moving this bug to verified per comment 21.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update) and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Hello All,

Will this be backported to 4.6 and 4.7? The customer is building clusters with the 4.6 and 4.7 versions.

Regards,
Ayush Garg

(In reply to aygarg from comment #26)
> Will it be backported to 4.6 and 4.7? As the customer is building clusters
> with 4.6 and 4.7 versions.

This bug covers only the 4.8 version. For 4.6 and 4.7, please refer to:
https://bugzilla.redhat.com/show_bug.cgi?id=1951089
https://bugzilla.redhat.com/show_bug.cgi?id=1951028
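The configure-ovs.sh excerpt verified above repoints `connection.master` for every connection that was enslaved to the old bond connection. The offline sketch below mirrors that loop's logic with nmcli replaced by a flat file of `uuid name master` rows; the UUIDs, names, and the file format are all invented for illustration.

```shell
#!/bin/sh
# Offline sketch of the configure-ovs.sh loop: skip connections whose
# master is not the old bond connection, repoint the rest at the new
# device name. A flat file stands in for nmcli's connection list.
STATE=$(mktemp)
cat > "$STATE" <<'EOF'
aaaa-1111 eno1 old-bond-conn
aaaa-2222 eno2 old-bond-conn
aaaa-3333 ens3 none
EOF

old_conn=old-bond-conn   # connection being replaced
new_device=bond0         # in the real script: interface-name of ovs-if-phys0

changed=""
while read -r uuid name master; do
  if [ "$master" != "$old_conn" ]; then
    continue   # not enslaved to the old connection; leave it alone
  fi
  # Real script: nmcli conn mod uuid ${uuid} connection.master ${new_device}
  changed="$changed $uuid"
done < "$STATE"
echo "repointed:$changed"
```

With the sample rows above, only `aaaa-1111` and `aaaa-2222` are repointed; `aaaa-3333` is skipped because its master does not match the old connection.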