Created attachment 1719306
capture with port included to br-ex bridge

Description of problem:

This is a baremetal IPI environment simulated on VMs. During an upgrade attempt from 4.5.0-0.nightly-2020-10-03-071140 to 4.6.0-0.nightly-2020-10-02-065738, while the last operator (machine-config) was updating and after the worker nodes had been rebooted, I noticed increased CPU usage and packet loss when connecting to the VMs. Note that at this point the master nodes had not been rebooted yet.

Running tcpdump on the virtual switch interface on the hypervisor showed a huge number of ARP frames, which indicates a broadcast storm. To isolate the issue I left only one worker node powered on and removed the physical port from the br-ex bridge, which stopped the broadcast storm.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-10-02-065738

How reproducible:
Reproduced 2 times on a VM environment

Steps to Reproduce:
1. Deploy a 4.5 baremetal IPI deployment simulated on VMs with OVN Hybrid overlay enabled
2. Trigger upgrade to 4.6
3. Wait until machine-config operator starts upgrading
4. Wait for the worker nodes to be rebooted and get the br-ex bridge configured
5. On the hypervisor run:
   tcpdump -i baremetal-0 arp -n
   where baremetal-0 is the libvirt interface of the external network the nodes are connected to

Actual results:
Huge number of ARP frames, which indicates a broadcast storm

Expected results:
Normal number of ARP frames

Additional info:
After removing the port from the br-ex bridge by running 'ovs-vsctl del-port br-ex enp5s0' the broadcast storm stops. When adding it back with 'ovs-vsctl add-port br-ex enp5s0' a huge number of ARP frames appears immediately.

Attaching captures with and without the port connected to the br-ex bridge on the worker node.
Created attachment 1719308
capture with port removed from br-ex bridge
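One way to compare the two captures numerically is to count ARP frames per second. The helper below is only a sketch: it assumes tcpdump's default text output, where the first field is an HH:MM:SS.microseconds timestamp and ARP lines contain "ARP,". The function name arp_rate and the capture file name in the usage comment are hypothetical.

```shell
#!/bin/sh
# Count ARP frames per second from tcpdump text output read on stdin.
# Assumption: default tcpdump timestamps (HH:MM:SS.micros) in the first field.
arp_rate() {
    awk '/ARP,/ {
             split($1, t, "[.]")   # drop the fractional part of the timestamp
             count[t[1]]++
         }
         END { for (s in count) print s, count[s] }' | sort
}

# Usage (hypothetical capture file name):
#   tcpdump -n -r with-port.pcap arp | arp_rate
```

A capture taken during the storm should show per-second counts far above those in the capture taken with the port removed.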
[root@worker-0-0 core]# ovs-vsctl show
c4850a14-9674-45b3-ba97-d46e1f525870
    Bridge br-ex
        Port patch-br-ex_worker-0-0-to-br-int
            Interface patch-br-ex_worker-0-0-to-br-int
                type: patch
                options: {peer=patch-br-int-to-br-ex_worker-0-0}
        Port br-ex
            Interface br-ex
                type: internal
        Port enp5s0
            Interface enp5s0
        Port patch-br-local_worker-0-0-to-br-int
            Interface patch-br-local_worker-0-0-to-br-int
                type: patch
                options: {peer=patch-br-int-to-br-local_worker-0-0}
    Bridge br-ext
        fail_mode: secure
        Port ext
            Interface ext
                type: patch
                options: {peer=int}
        Port br-ext
            Interface br-ext
                type: internal
        Port ext-vxlan
            Interface ext-vxlan
                type: vxlan
                options: {dst_port="4789", key=flow, remote_ip=flow}
    Bridge br-int
        fail_mode: secure
        Port aa9d8666adf378e
            Interface aa9d8666adf378e
        Port ovn-6ae888-0
            Interface ovn-6ae888-0
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.123.120"}
        Port patch-br-int-to-br-ex_worker-0-0
            Interface patch-br-int-to-br-ex_worker-0-0
                type: patch
                options: {peer=patch-br-ex_worker-0-0-to-br-int}
        Port "2d5c228c291df21"
            Interface "2d5c228c291df21"
        Port "250942188c706bf"
            Interface "250942188c706bf"
        Port patch-br-int-to-br-local_worker-0-0
            Interface patch-br-int-to-br-local_worker-0-0
                type: patch
                options: {peer=patch-br-local_worker-0-0-to-br-int}
        Port "6bf97b802edac0a"
            Interface "6bf97b802edac0a"
        Port patch-br-int-to-lnet-node_local_switch
            Interface patch-br-int-to-lnet-node_local_switch
                type: patch
                options: {peer=patch-lnet-node_local_switch-to-br-int}
        Port adc8c326908e4c6
            Interface adc8c326908e4c6
        Port adbca919d99a261
            Interface adbca919d99a261
        Port ec7779a58a26600
            Interface ec7779a58a26600
        Port ovn-4370f2-0
            Interface ovn-4370f2-0
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.123.102"}
        Port br-int
            Interface br-int
                type: internal
        Port "69c88504773f48d"
            Interface "69c88504773f48d"
        Port int
            Interface int
                type: patch
                options: {peer=ext}
        Port ovn-2cf2b9-0
            Interface ovn-2cf2b9-0
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.123.123"}
        Port ovn-k8s-mp0
            Interface ovn-k8s-mp0
                type: internal
        Port b3b6431b038d414
            Interface b3b6431b038d414
        Port "54f65e329087557"
            Interface "54f65e329087557"
        Port "06c88cb9c305278"
            Interface "06c88cb9c305278"
    Bridge br-local
        Port br-local
            Interface br-local
                type: internal
        Port patch-lnet-node_local_switch-to-br-int
            Interface patch-lnet-node_local_switch-to-br-int
                type: patch
                options: {peer=patch-br-int-to-lnet-node_local_switch}
        Port ovn-k8s-gw0
            Interface ovn-k8s-gw0
                type: internal
    ovs_version: "2.13.2"
Looks like the br-ex bridge is connected to the br-int bridge through 2 patch ports: patch-br-ex_worker-0-0-to-br-int and patch-br-local_worker-0-0-to-br-int. Judging by its name, patch-br-local_worker-0-0-to-br-int does not belong on the br-ex bridge.
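A quick way to spot this condition on a node is to filter the br-ex port list for patch ports that do not carry the br-ex prefix. This is a sketch only: the "patch-br-ex_" naming convention is inferred from the output in this report, and the function name unexpected_patch_ports is hypothetical.

```shell
#!/bin/sh
# Read `ovs-vsctl list-ports br-ex` output on stdin and print every patch port
# whose name does not start with "patch-br-ex_".
# Assumption: legitimate br-ex patch ports follow the naming seen in this report.
unexpected_patch_ports() {
    # `|| true` keeps the exit status 0 when nothing unexpected is found
    grep '^patch-' | grep -v '^patch-br-ex_' || true
}

# Usage on a node:
#   ovs-vsctl list-ports br-ex | unexpected_patch_ports
```

Empty output means only the expected patch port is attached; any printed name is a candidate for the extra port described above.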
Marius, is it possible to get a must-gather on this? I'd need to see the ovs logs as to why that port was not removed on upgrade.
Or better, is this environment still available?
Ricky hopped on a broken cluster, debugged it, and found that the system ovs and container ovs are both running. This seems to be a dupe of 1880591.

*** This bug has been marked as a duplicate of bug 1880591 ***
Looking at Marius' cluster, this looks like a legitimate upgrade issue. The problem is that when we upgrade from 4.5 to 4.6, i.e. from the old local gw mode to the new local gw mode, we do not clean up the old ports that are still in the OVN DB. ovn-controller sees these ports in the DB and creates the extra patch port, which is the cause of this issue.

From the OVN DB I can see:

switch 89dea851-8220-4ae2-9881-704c79e61dae (ext_worker-0-0)
    port etor-GR_worker-0-0
        type: router
        addresses: ["52:54:00:94:dc:86"]
        router-port: rtoe-GR_worker-0-0
    port br-ex_worker-0-0
        type: localnet
        addresses: ["unknown"]
    port br-local_worker-0-0     <-----extra port from previous version
        type: localnet
        addresses: ["unknown"]

[root@master-0-0 ~]# ovn-nbctl --no-leader-only lsp-list 89dea851-8220-4ae2-9881-704c79e61dae
91cdbcd9-9e54-4b91-a032-6b48923c1cfe (br-ex_worker-0-0)
e993ce55-00fe-4384-9a4a-fbbef6770961 (br-local_worker-0-0)
466f7277-3a84-4881-bc51-0cc2872f358c (etor-GR_worker-0-0)

[root@master-0-0 ~]# ovn-nbctl --no-leader-only list logical_switch_port e993ce55-00fe-4384-9a4a-fbbef6770961
_uuid               : e993ce55-00fe-4384-9a4a-fbbef6770961
addresses           : [unknown]
dhcpv4_options      : []
dhcpv6_options      : []
dynamic_addresses   : []
enabled             : []
external_ids        : {}
ha_chassis_group    : []
name                : br-local_worker-0-0
options             : {network_name=physnet}
parent_name         : []
port_security       : []
tag                 : []
tag_request         : []
type                : localnet
up                  : false

[root@master-0-0 ~]# ovn-nbctl --no-leader-only list logical_switch_port 91cdbcd9-9e54-4b91-a032-6b48923c1cfe
_uuid               : 91cdbcd9-9e54-4b91-a032-6b48923c1cfe
addresses           : [unknown]
dhcpv4_options      : []
dhcpv6_options      : []
dynamic_addresses   : []
enabled             : []
external_ids        : {}
ha_chassis_group    : []
name                : br-ex_worker-0-0
options             : {network_name=physnet}
parent_name         : []
port_security       : []
tag                 : []
tag_request         : 0
type                : localnet
up                  : false

Both localnet ports carry the same network_name (physnet), so ovn-controller creates a patch port to br-ex for each of them. The solution here is to remove the old port from the OVN DB when we start.
It looks like deleting the old port from the OVN NB DB ends up deleting the extra patch port in OVS via ovn-controller:

[root@master-0-1 ~]# ovn-nbctl lsp-del e993ce55-00fe-4384-9a4a-fbbef6770961
[root@master-0-1 ~]#
[root@master-0-1 ~]# ovn-nbctl lsp-list 89dea851-8220-4ae2-9881-704c79e61dae
91cdbcd9-9e54-4b91-a032-6b48923c1cfe (br-ex_worker-0-0)
466f7277-3a84-4881-bc51-0cc2872f358c (etor-GR_worker-0-0)

[root@worker-0-0 ~]# ovs-vsctl show
c4850a14-9674-45b3-ba97-d46e1f525870
    Bridge br-ex
        Port patch-br-ex_worker-0-0-to-br-int
            Interface patch-br-ex_worker-0-0-to-br-int
                type: patch
                options: {peer=patch-br-int-to-br-ex_worker-0-0}
        Port br-ex
            Interface br-ex
                type: internal
        Port enp5s0
            Interface enp5s0
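The manual cleanup above could be scripted roughly as follows. This is a hedged sketch of a workaround, not the fix that shipped: it assumes stale ports can be recognized purely by the "br-local_" name prefix seen in this report, and the function name stale_localnet_uuids is hypothetical.

```shell
#!/bin/sh
# Read `ovn-nbctl lsp-list <switch>` output on stdin ("<uuid> (<name>)" per line)
# and print the UUID of every port whose name starts with "br-local_".
# Assumption: the name prefix alone identifies the leftover 4.5 localnet port.
stale_localnet_uuids() {
    awk '$2 ~ /^\(br-local_/ { print $1 }'
}

# Usage (deletes ports, so review the printed UUIDs first):
#   ovn-nbctl lsp-list 89dea851-8220-4ae2-9881-704c79e61dae \
#     | stale_localnet_uuids \
#     | while read -r uuid; do ovn-nbctl lsp-del "$uuid"; done
```

Matching on the name is a heuristic; checking that the port type is localnet before deleting would be a reasonable extra safeguard.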
Verified on 4.6.0-0.nightly-2020-10-09-033719

[kni@provisionhost-0-0 ~]$ ssh core@master-0-0 sudo ovs-vsctl list-ports br-ex
enp5s0
patch-br-ex_master-0-0-to-br-int
[kni@provisionhost-0-0 ~]$ ssh core@master-0-1 sudo ovs-vsctl list-ports br-ex
enp5s0
patch-br-ex_master-0-1-to-br-int
[kni@provisionhost-0-0 ~]$ ssh core@master-0-2 sudo ovs-vsctl list-ports br-ex
enp5s0
patch-br-ex_master-0-2-to-br-int
[kni@provisionhost-0-0 ~]$ ssh core@worker-0-0 sudo ovs-vsctl list-ports br-ex
enp5s0
patch-br-ex_worker-0-0-to-br-int
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196