Bug 1887545 - 4.5 to 4.6 upgrade fails when external network is configured on a bond device: ovs-configuration service fails and node becomes unreachable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Tim Rozet
QA Contact: Anurag Saxena
URL:
Whiteboard:
Depends On:
Blocks: 1887596
Reported: 2020-10-12 18:44 UTC by Marius Cornea
Modified: 2021-02-24 15:50 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1887596
Environment:
Last Closed: 2021-02-24 15:25:25 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator pull 2152 | 0 | None | closed | Bug 1887545: Fix ovs-configuration detecting bond and vlan interfaces | 2021-02-18 14:13:09 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:25:56 UTC

Description Marius Cornea 2020-10-12 18:44:26 UTC
Description of problem:

4.5 to 4.6 upgrade fails when external network is configured on a bond device: ovs-configuration service fails and node becomes unreachable

This issue was observed on a baremetal IPI deployment whose nodes have the following NIC layout:

nic1: provisioning network
nic2: bond0 member
nic3: bond0 member
bond0: external network

Example from one of the master nodes:

2: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:b0:88:f0 brd ff:ff:ff:ff:ff:ff
    inet6 fd00:1101::3/64 scope global dynamic 
       valid_lft 10sec preferred_lft 10sec
    inet6 fe80::fb54:adad:caa3:615d/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: enp5s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc fq_codel master bond0 state UP group default qlen 1000
    link/ether 52:54:00:b6:79:9c brd ff:ff:ff:ff:ff:ff
4: enp6s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc fq_codel master bond0 state UP group default qlen 1000
    link/ether 52:54:00:b6:79:9c brd ff:ff:ff:ff:ff:ff
5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 52:54:00:b6:79:9c brd ff:ff:ff:ff:ff:ff
    inet 192.168.123.121/24 brd 192.168.123.255 scope global dynamic noprefixroute bond0
       valid_lft 3124sec preferred_lft 3124sec
    inet6 fe80::5054:ff:feb6:799c/64 scope link 
       valid_lft forever preferred_lft forever

When upgrading to 4.6, during the machine-config operator upgrade, the upgrade process blocks after the first node reboots because the worker node loses connectivity over the external network. The worker logs show that the cause is a failure of the ovs-configuration service:

-- Logs begin at Mon 2020-10-12 15:30:04 UTC, end at Mon 2020-10-12 18:34:43 UTC. --
Oct 12 16:48:44 worker-0-0 systemd[1]: Starting Configures OVS with proper host networking configuration...
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: + touch /var/run/ovs-config-executed
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: + '[' OVNKubernetes == OVNKubernetes ']'
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: + NM_CONN_PATH=/etc/NetworkManager/system-connections
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: + iface=
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: + counter=0
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: + '[' 0 -lt 12 ']'
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: ++ ip route show default
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: ++ awk '{if ($4 == "dev") print $5; exit}'
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: + iface=bond0
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: + [[ -n bond0 ]]
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: + echo 'IPv4 Default gateway interface found: bond0'
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: IPv4 Default gateway interface found: bond0
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: + break
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: + '[' bond0 = br-ex ']'
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: + '[' -z bond0 ']'
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: + iface_mac=52:54:00:1f:c4:d7
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: + echo 'MAC address found for iface: bond0: 52:54:00:1f:c4:d7'
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: MAC address found for iface: bond0: 52:54:00:1f:c4:d7
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: ++ awk '{print $5; exit}'
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: ++ ip link show bond0
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: + iface_mtu=1500
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: + [[ -z 1500 ]]
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: + echo 'MTU found for iface: bond0: 1500'
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: MTU found for iface: bond0: 1500
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: + nmcli connection show br-ex
Oct 12 16:48:44 worker-0-0 configure-ovs.sh[2095]: + nmcli c add type ovs-bridge conn.interface br-ex con-name br-ex 802-3-ethernet.mtu 1500 802-3-ethernet.cloned-mac-address 52:54:00:1f:c4:d7
Oct 12 16:48:45 worker-0-0 configure-ovs.sh[2095]: Connection 'br-ex' (f7dfecd3-2069-411c-b2b3-eaceed0c0fa4) successfully added.
Oct 12 16:48:45 worker-0-0 configure-ovs.sh[2095]: ++ nmcli --fields UUID,DEVICE conn show --active
Oct 12 16:48:45 worker-0-0 configure-ovs.sh[2095]: ++ awk '{print $1}'
Oct 12 16:48:45 worker-0-0 configure-ovs.sh[2095]: ++ grep bond0
Oct 12 16:48:45 worker-0-0 configure-ovs.sh[2095]: + old_conn=ad33d8b0-1f7b-cab9-9447-ba07f855b143
Oct 12 16:48:45 worker-0-0 configure-ovs.sh[2095]: + nmcli connection show ovs-port-phys0
Oct 12 16:48:45 worker-0-0 configure-ovs.sh[2095]: + nmcli c add type ovs-port conn.interface bond0 master br-ex con-name ovs-port-phys0
Oct 12 16:48:45 worker-0-0 configure-ovs.sh[2095]: Connection 'ovs-port-phys0' (0d8ba368-c8c0-4cd7-b796-8f6a061cb1bf) successfully added.
Oct 12 16:48:45 worker-0-0 configure-ovs.sh[2095]: + nmcli connection show ovs-port-br-ex
Oct 12 16:48:45 worker-0-0 configure-ovs.sh[2095]: + nmcli c add type ovs-port conn.interface br-ex master br-ex con-name ovs-port-br-ex
Oct 12 16:48:45 worker-0-0 configure-ovs.sh[2095]: Connection 'ovs-port-br-ex' (5620c242-73ec-4d6c-af4e-fd8827bc92f8) successfully added.
Oct 12 16:48:45 worker-0-0 configure-ovs.sh[2095]: + nmcli device disconnect bond0
Oct 12 16:48:45 worker-0-0 configure-ovs.sh[2095]: Device 'bond0' successfully disconnected.
Oct 12 16:48:45 localhost.localdomain configure-ovs.sh[2095]: + nmcli connection show ovs-if-phys0
Oct 12 16:48:45 localhost.localdomain configure-ovs.sh[2095]: + nmcli c add type 802-3-ethernet conn.interface bond0 master ovs-port-phys0 con-name ovs-if-phys0 connection.autoconnect-priority 100 802-3-ethernet.mtu 1500
Oct 12 16:48:45 localhost.localdomain configure-ovs.sh[2095]: Connection 'ovs-if-phys0' (e510ddbf-a2eb-4c10-a914-c4d080d78801) successfully added.
Oct 12 16:48:45 localhost.localdomain configure-ovs.sh[2095]: + nmcli conn up ovs-if-phys0
Oct 12 16:48:45 localhost.localdomain configure-ovs.sh[2095]: Error: Connection activation failed: No suitable device found for this connection (device enp4s0 not available because profile is not compatible with device (mismatching interface name)).
Oct 12 16:48:45 localhost.localdomain systemd[1]: ovs-configuration.service: Main process exited, code=exited, status=4/NOPERMISSION
Oct 12 16:48:45 localhost.localdomain systemd[1]: ovs-configuration.service: Failed with result 'exit-code'.
Oct 12 16:48:45 localhost.localdomain systemd[1]: Failed to start Configures OVS with proper host networking configuration.
Oct 12 16:48:45 localhost.localdomain systemd[1]: ovs-configuration.service: Consumed 285ms CPU time
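
The trace above shows configure-ovs.sh creating the ovs-if-phys0 profile with "type 802-3-ethernet" although the default-route interface it found is bond0. The following is a minimal sketch, in Python rather than the script's own shell, of how the profile type could be chosen from the kernel's view of the device. It is not the script's code and not the change made in the linked pull request. It assumes that Linux exposes a "bonding" directory under /sys/class/net/<iface> for bond masters; the function name detect_iface_type and the sysfs_root parameter are hypothetical.

```python
from pathlib import Path


def detect_iface_type(iface, sysfs_root="/sys/class/net"):
    """Return "bond" when sysfs shows a bonding directory for iface,
    otherwise fall back to "802-3-ethernet" (the type seen in the log).

    sysfs_root is a hypothetical parameter so the check can be pointed
    at a fake tree when no real bond device is available.
    """
    if (Path(sysfs_root) / iface / "bonding").is_dir():
        return "bond"
    return "802-3-ethernet"


if __name__ == "__main__":
    # On the affected node (before the bond is disconnected) this would
    # be expected to report "bond" for bond0.
    print(detect_iface_type("bond0"))
```

A check of this kind would have to run before "nmcli device disconnect bond0", because the later "ip a" output from the node no longer lists bond0 at all.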


Version-Release number of selected component (if applicable):
4.6.0-rc.2

How reproducible:
100%

Steps to Reproduce:
1. Deploy 4.5 via the baremetal IPI flow, with the external network on the nodes configured on top of a bond device. The initial bond configuration was set up via Ignition files:

{
  "ignition": {
    "version": "2.3.0"
  },
  "storage": {
    "files": [
      {
        "path": "/etc/sysconfig/network-scripts/ifcfg-enp5s0",
        "filesystem": "root",
        "mode": 436,
        "contents": {
          "source": "data:text/plain;charset=utf-8;base64,REVWSUNFPWVucDVzMApCT09UUFJPVE89bm9uZQpPTkJPT1Q9eWVzCk1BU1RFUj1ib25kMApTTEFWRT15ZXM="
        }
      },
      {
        "path": "/etc/sysconfig/network-scripts/ifcfg-enp6s0",
        "filesystem": "root",
        "mode": 436,
        "contents": {
          "source": "data:text/plain;charset=utf-8;base64,REVWSUNFPWVucDZzMApCT09UUFJPVE89bm9uZQpPTkJPT1Q9eWVzCk1BU1RFUj1ib25kMApTTEFWRT15ZXMK"
        }
      },
      {
        "path": "/etc/sysconfig/network-scripts/ifcfg-bond0",
        "filesystem": "root",
        "mode": 436,
        "contents": {
          "source": "data:text/plain;charset=utf-8;base64,Qk9ORElOR19PUFRTPWRvd25kZWxheT0wIGxhY3BfcmF0ZT1mYXN0IG1paW1vbj0xMDAgbW9kZT04MDIuM2FkIHVwZGVsYXk9MApUWVBFPUJvbmQKQk9ORElOR19NQVNURVI9eWVzCkJPT1RQUk9UTz1kaGNwCk5BTUU9Ym9uZDAKREVWSUNFPWJvbmQwCk9OQk9PVD15ZXM="
        }
      }
    ]
  }
}
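
The ifcfg contents in the Ignition config above are base64-encoded data URLs, so the bond settings are not readable at a glance. The sketch below decodes one of them with the Python standard library. The helper name decode_data_url is hypothetical, and it only handles the exact "data:text/plain;charset=utf-8;base64," prefix used in this report.

```python
import base64

PREFIX = "data:text/plain;charset=utf-8;base64,"

# "source" value of /etc/sysconfig/network-scripts/ifcfg-enp5s0, copied
# from the Ignition config in this report.
ENP5S0_SOURCE = (
    "data:text/plain;charset=utf-8;base64,"
    "REVWSUNFPWVucDVzMApCT09UUFJPVE89bm9uZQpPTkJPT1Q9eWVzCk1BU1RFUj1ib25kMApTTEFWRT15ZXM="
)


def decode_data_url(source):
    """Decode a base64 text data URL as used in the Ignition file entries."""
    if not source.startswith(PREFIX):
        raise ValueError("unsupported data URL prefix")
    return base64.b64decode(source[len(PREFIX):]).decode("utf-8")


if __name__ == "__main__":
    print(decode_data_url(ENP5S0_SOURCE))
```

For the enp5s0 entry this yields an ifcfg file that sets DEVICE=enp5s0, BOOTPROTO=none, ONBOOT=yes, MASTER=bond0 and SLAVE=yes. The same helper applies to the enp6s0 and bond0 entries.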


2. Upgrade to 4.6.0-rc.2

Actual results:
Upgrade fails and leaves one of the worker nodes without external network connectivity

Expected results:
Upgrade succeeds

Additional info:

nmcli and ip a output from the worker node which lost connectivity:

[core@localhost ~]$ nmcli con
NAME                UUID                                  TYPE        DEVICE 
Wired connection 1  95ea186e-e3b7-3399-811a-80ea135e5e82  ethernet    enp4s0 
br-ex               f7dfecd3-2069-411c-b2b3-eaceed0c0fa4  ovs-bridge  br-ex  
ovs-port-br-ex      5620c242-73ec-4d6c-af4e-fd8827bc92f8  ovs-port    br-ex  
ovs-port-phys0      0d8ba368-c8c0-4cd7-b796-8f6a061cb1bf  ovs-port    bond0  
bond0               ad33d8b0-1f7b-cab9-9447-ba07f855b143  bond        --     
ovs-if-phys0        e510ddbf-a2eb-4c10-a914-c4d080d78801  ethernet    --     
System enp5s0       9310e179-14b6-430a-6843-6491c047d532  ethernet    --     
System enp6s0       b43fa2aa-5a85-7b0a-9a20-469067dba6d6  ethernet    --     

[core@localhost ~]$ 
[core@localhost ~]$ nmcli dev
DEVICE  TYPE        STATE         CONNECTION         
enp4s0  ethernet    connected     Wired connection 1 
br-ex   ovs-bridge  connected     br-ex              
bond0   ovs-port    connected     ovs-port-phys0     
br-ex   ovs-port    connected     ovs-port-br-ex     
enp5s0  ethernet    disconnected  --                 
enp6s0  ethernet    disconnected  --                 
lo      loopback    unmanaged     --                 

[core@localhost ~]$ ip a 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:39:7a:a9 brd ff:ff:ff:ff:ff:ff
    inet6 fd00:1101::54/128 scope global dynamic noprefixroute 
       valid_lft 2893sec preferred_lft 2893sec
    inet6 fe80::cdb1:e04b:f7c:b20e/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:1f:c4:d7 brd ff:ff:ff:ff:ff:ff
4: enp6s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:d5:b0:0f brd ff:ff:ff:ff:ff:ff


find /etc/NetworkManager/system-connections/ -type f -print -exec cat {} \;
/etc/NetworkManager/system-connections/br-ex.nmconnection
[connection]
id=br-ex
uuid=f7dfecd3-2069-411c-b2b3-eaceed0c0fa4
type=ovs-bridge
interface-name=br-ex
permissions=

[ethernet]
cloned-mac-address=52:54:00:1F:C4:D7
mac-address-blacklist=
mtu=1500

[ovs-bridge]

[ipv4]
dns-search=
method=auto

[ipv6]
addr-gen-mode=stable-privacy
dns-search=
method=auto

[proxy]
/etc/NetworkManager/system-connections/ovs-port-phys0.nmconnection
[connection]
id=ovs-port-phys0
uuid=0d8ba368-c8c0-4cd7-b796-8f6a061cb1bf
type=ovs-port
interface-name=bond0
master=br-ex
permissions=
slave-type=ovs-bridge

[ovs-port]
/etc/NetworkManager/system-connections/ovs-port-br-ex.nmconnection
[connection]
id=ovs-port-br-ex
uuid=5620c242-73ec-4d6c-af4e-fd8827bc92f8
type=ovs-port
interface-name=br-ex
master=br-ex
permissions=
slave-type=ovs-bridge

[ovs-port]
/etc/NetworkManager/system-connections/ovs-if-phys0.nmconnection
[connection]
id=ovs-if-phys0
uuid=e510ddbf-a2eb-4c10-a914-c4d080d78801
type=ethernet
autoconnect-priority=100
interface-name=bond0
master=0d8ba368-c8c0-4cd7-b796-8f6a061cb1bf
permissions=
slave-type=ovs-port

[ethernet]
mac-address-blacklist=
mtu=1500

[ovs-interface]
type=system
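
The ovs-if-phys0 profile above pairs type=ethernet with interface-name=bond0, which is consistent with the "No suitable device found" activation error in the log. As a rough diagnostic sketch, such a mismatch can be spotted by reading the keyfile with Python's configparser. This assumes the keyfile is simple enough to be parsed as INI, which holds for the files shown here but is not guaranteed for every NetworkManager keyfile. The function names and the shortened sample text are illustrative only.

```python
import configparser

# Shortened copy of ovs-if-phys0.nmconnection from this report.
OVS_IF_PHYS0 = """\
[connection]
id=ovs-if-phys0
type=ethernet
interface-name=bond0
permissions=
slave-type=ovs-port

[ovs-interface]
type=system
"""


def profile_summary(keyfile_text):
    """Return (id, type, interface-name) from the [connection] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(keyfile_text)
    conn = parser["connection"]
    return conn.get("id"), conn.get("type"), conn.get("interface-name")


def is_ethernet_profile_on_bond(keyfile_text, bond_names):
    """True when an ethernet-type profile is bound to a known bond name."""
    _conn_id, conn_type, ifname = profile_summary(keyfile_text)
    return conn_type == "ethernet" and ifname in bond_names


if __name__ == "__main__":
    print(profile_summary(OVS_IF_PHYS0))
    print(is_ethernet_profile_on_bond(OVS_IF_PHYS0, {"bond0"}))
```

The set of bond names is passed in by the caller; on a live node it could come from the device listing, but that lookup is left out of the sketch.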

Comment 7 errata-xmlrpc 2021-02-24 15:25:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

