Bug 1885517
| Summary: | Baremetal IPI: broadcast storm is triggered by br-ex bridge during upgrade from 4.5.0-0.nightly-2020-10-03-071140 to 4.6.0-0.nightly-2020-10-02-065738 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Marius Cornea <mcornea> |
| Component: | Networking | Assignee: | Tim Rozet <trozet> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Marius Cornea <mcornea> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | achernet, bbennett, ricarril, trozet, yprokule |
| Version: | 4.6 | Keywords: | Reopened, TestBlocker |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 16:47:49 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1886166 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
Created attachment 1719308 [details]
capture with port removed from br-ex bridge
[root@worker-0-0 core]# ovs-vsctl show
c4850a14-9674-45b3-ba97-d46e1f525870
    Bridge br-ex
        Port patch-br-ex_worker-0-0-to-br-int
            Interface patch-br-ex_worker-0-0-to-br-int
                type: patch
                options: {peer=patch-br-int-to-br-ex_worker-0-0}
        Port br-ex
            Interface br-ex
                type: internal
        Port enp5s0
            Interface enp5s0
        Port patch-br-local_worker-0-0-to-br-int
            Interface patch-br-local_worker-0-0-to-br-int
                type: patch
                options: {peer=patch-br-int-to-br-local_worker-0-0}
    Bridge br-ext
        fail_mode: secure
        Port ext
            Interface ext
                type: patch
                options: {peer=int}
        Port br-ext
            Interface br-ext
                type: internal
        Port ext-vxlan
            Interface ext-vxlan
                type: vxlan
                options: {dst_port="4789", key=flow, remote_ip=flow}
    Bridge br-int
        fail_mode: secure
        Port aa9d8666adf378e
            Interface aa9d8666adf378e
        Port ovn-6ae888-0
            Interface ovn-6ae888-0
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.123.120"}
        Port patch-br-int-to-br-ex_worker-0-0
            Interface patch-br-int-to-br-ex_worker-0-0
                type: patch
                options: {peer=patch-br-ex_worker-0-0-to-br-int}
        Port "2d5c228c291df21"
            Interface "2d5c228c291df21"
        Port "250942188c706bf"
            Interface "250942188c706bf"
        Port patch-br-int-to-br-local_worker-0-0
            Interface patch-br-int-to-br-local_worker-0-0
                type: patch
                options: {peer=patch-br-local_worker-0-0-to-br-int}
        Port "6bf97b802edac0a"
            Interface "6bf97b802edac0a"
        Port patch-br-int-to-lnet-node_local_switch
            Interface patch-br-int-to-lnet-node_local_switch
                type: patch
                options: {peer=patch-lnet-node_local_switch-to-br-int}
        Port adc8c326908e4c6
            Interface adc8c326908e4c6
        Port adbca919d99a261
            Interface adbca919d99a261
        Port ec7779a58a26600
            Interface ec7779a58a26600
        Port ovn-4370f2-0
            Interface ovn-4370f2-0
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.123.102"}
        Port br-int
            Interface br-int
                type: internal
        Port "69c88504773f48d"
            Interface "69c88504773f48d"
        Port int
            Interface int
                type: patch
                options: {peer=ext}
        Port ovn-2cf2b9-0
            Interface ovn-2cf2b9-0
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.123.123"}
        Port ovn-k8s-mp0
            Interface ovn-k8s-mp0
                type: internal
        Port b3b6431b038d414
            Interface b3b6431b038d414
        Port "54f65e329087557"
            Interface "54f65e329087557"
        Port "06c88cb9c305278"
            Interface "06c88cb9c305278"
    Bridge br-local
        Port br-local
            Interface br-local
                type: internal
        Port patch-lnet-node_local_switch-to-br-int
            Interface patch-lnet-node_local_switch-to-br-int
                type: patch
                options: {peer=patch-br-int-to-lnet-node_local_switch}
        Port ovn-k8s-gw0
            Interface ovn-k8s-gw0
                type: internal
    ovs_version: "2.13.2"
It looks like the br-ex bridge is connected to the br-int bridge through 2 patch ports: patch-br-ex_worker-0-0-to-br-int and patch-br-local_worker-0-0-to-br-int. Judging by its name, patch-br-local_worker-0-0-to-br-int should not be attached to the br-ex bridge.

Marius, is it possible to get a must-gather on this? I'd need to see the OVS logs to find out why that port was not removed on upgrade. Or better, is this environment still available?

Ricky hopped on a broken cluster, debugged it, and found that the system OVS and the container OVS are both running. This seems to be a dupe of 1880591.

*** This bug has been marked as a duplicate of bug 1880591 ***

Looking at Marius' cluster, this looks like a legitimate upgrade issue. The problem is that when we upgrade from 4.5 to 4.6, i.e. from the old local gw mode to the new local gw mode, we are not cleaning up the old ports that were already in the OVN DB. ovn-controller sees these ports in the DB and creates the extra patch port, which is the cause of this issue. From the OVN DB I can see:
switch 89dea851-8220-4ae2-9881-704c79e61dae (ext_worker-0-0)
    port etor-GR_worker-0-0
        type: router
        addresses: ["52:54:00:94:dc:86"]
        router-port: rtoe-GR_worker-0-0
    port br-ex_worker-0-0
        type: localnet
        addresses: ["unknown"]
    port br-local_worker-0-0 <-----extra port from previous version
        type: localnet
        addresses: ["unknown"]
[root@master-0-0 ~]# ovn-nbctl --no-leader-only lsp-list 89dea851-8220-4ae2-9881-704c79e61dae
91cdbcd9-9e54-4b91-a032-6b48923c1cfe (br-ex_worker-0-0)
e993ce55-00fe-4384-9a4a-fbbef6770961 (br-local_worker-0-0)
466f7277-3a84-4881-bc51-0cc2872f358c (etor-GR_worker-0-0)
[root@master-0-0 ~]# ovn-nbctl --no-leader-only list logical_switch_port e993ce55-00fe-4384-9a4a-fbbef6770961
_uuid : e993ce55-00fe-4384-9a4a-fbbef6770961
addresses : [unknown]
dhcpv4_options : []
dhcpv6_options : []
dynamic_addresses : []
enabled : []
external_ids : {}
ha_chassis_group : []
name : br-local_worker-0-0
options : {network_name=physnet}
parent_name : []
port_security : []
tag : []
tag_request : []
type : localnet
up : false
[root@master-0-0 ~]# ovn-nbctl --no-leader-only list logical_switch_port 91cdbcd9-9e54-4b91-a032-6b48923c1cfe
_uuid : 91cdbcd9-9e54-4b91-a032-6b48923c1cfe
addresses : [unknown]
dhcpv4_options : []
dhcpv6_options : []
dynamic_addresses : []
enabled : []
external_ids : {}
ha_chassis_group : []
name : br-ex_worker-0-0
options : {network_name=physnet}
parent_name : []
port_security : []
tag : []
tag_request : 0
type : localnet
up : false
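To check whether the external switch of another node carries the same leftover port, the lsp-list output above can be filtered by port name. Below is a minimal Python sketch of that check. It assumes that a `br-local_` name prefix marks a port left over from 4.5, which is inferred from this one cluster only, and the function name `stale_local_ports` is made up for illustration. It only reads text and does not touch the OVN DB.

```python
import re

# One line of `ovn-nbctl lsp-list` output looks like: "<uuid> (<port name>)"
LSP_LINE = re.compile(r"^\s*([0-9a-f-]{36})\s+\((.+)\)\s*$")


def stale_local_ports(lsp_list_output, prefix="br-local_"):
    """Return (uuid, name) pairs whose port name starts with `prefix`.

    The default prefix is an assumption based on the port seen in this bug.
    """
    stale = []
    for line in lsp_list_output.splitlines():
        match = LSP_LINE.match(line)
        if match and match.group(2).startswith(prefix):
            stale.append((match.group(1), match.group(2)))
    return stale


# lsp-list output for switch ext_worker-0-0, copied from this report.
sample = """\
91cdbcd9-9e54-4b91-a032-6b48923c1cfe (br-ex_worker-0-0)
e993ce55-00fe-4384-9a4a-fbbef6770961 (br-local_worker-0-0)
466f7277-3a84-4881-bc51-0cc2872f358c (etor-GR_worker-0-0)
"""

for uuid, name in stale_local_ports(sample):
    print(f"leftover port {name} ({uuid})")
```

On the sample this reports only br-local_worker-0-0, the port marked as extra in the ovn-nbctl output above.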
The solution here is to remove the old port from the OVN DB when we start. It looks like deleting it from OVN NB ends up deleting it in OVS via ovn-controller:
[root@master-0-1 ~]# ovn-nbctl lsp-del e993ce55-00fe-4384-9a4a-fbbef6770961
[root@master-0-1 ~]#
[root@master-0-1 ~]# ovn-nbctl lsp-list 89dea851-8220-4ae2-9881-704c79e61dae
91cdbcd9-9e54-4b91-a032-6b48923c1cfe (br-ex_worker-0-0)
466f7277-3a84-4881-bc51-0cc2872f358c (etor-GR_worker-0-0)
[root@worker-0-0 ~]# ovs-vsctl show
c4850a14-9674-45b3-ba97-d46e1f525870
    Bridge br-ex
        Port patch-br-ex_worker-0-0-to-br-int
            Interface patch-br-ex_worker-0-0-to-br-int
                type: patch
                options: {peer=patch-br-int-to-br-ex_worker-0-0}
        Port br-ex
            Interface br-ex
                type: internal
        Port enp5s0
            Interface enp5s0
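The before and after `ovs-vsctl show` outputs in this report can also be compared mechanically. Below is a minimal Python sketch, not part of the actual fix, that lists the ports of each bridge and flags patch ports on br-ex that look out of place. The function names are made up for illustration, and the rule "a patch port on br-ex is named patch-br-ex_..." is an assumption inferred from the port names in this report. The sample text is trimmed from the outputs above to the Bridge and Port lines.

```python
def ports_by_bridge(ovs_show_output):
    """Map each bridge name to the list of its port names.

    Only the leading keyword of each line is inspected, so the text may be
    indented or not.
    """
    bridges = {}
    current = None
    for raw in ovs_show_output.splitlines():
        line = raw.strip()
        if line.startswith("Bridge "):
            current = line.split(None, 1)[1].strip('"')
            bridges[current] = []
        elif line.startswith("Port ") and current is not None:
            bridges[current].append(line.split(None, 1)[1].strip('"'))
    return bridges


def unexpected_patch_ports(bridges, bridge="br-ex"):
    """Return ports on `bridge` named patch-* but not patch-<bridge>_*.

    Patch ports are recognised by name only; this naming rule is an
    assumption, not something ovs-vsctl guarantees.
    """
    expected_prefix = f"patch-{bridge}_"
    return [
        port
        for port in bridges.get(bridge, [])
        if port.startswith("patch-") and not port.startswith(expected_prefix)
    ]


# br-ex as seen on worker-0-0 before the stale logical port was deleted.
before = """\
Bridge br-ex
    Port patch-br-ex_worker-0-0-to-br-int
    Port br-ex
    Port enp5s0
    Port patch-br-local_worker-0-0-to-br-int
"""

# br-ex on the same node after `ovn-nbctl lsp-del`.
after = """\
Bridge br-ex
    Port patch-br-ex_worker-0-0-to-br-int
    Port br-ex
    Port enp5s0
"""

print(unexpected_patch_ports(ports_by_bridge(before)))
print(unexpected_patch_ports(ports_by_bridge(after)))
```

For the `before` sample the flagged port is patch-br-local_worker-0-0-to-br-int; for the `after` sample nothing is flagged.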
Verified on 4.6.0-0.nightly-2020-10-09-033719

[kni@provisionhost-0-0 ~]$ ssh core@master-0-0 sudo ovs-vsctl list-ports br-ex
enp5s0
patch-br-ex_master-0-0-to-br-int
[kni@provisionhost-0-0 ~]$ ssh core@master-0-1 sudo ovs-vsctl list-ports br-ex
enp5s0
patch-br-ex_master-0-1-to-br-int
[kni@provisionhost-0-0 ~]$ ssh core@master-0-2 sudo ovs-vsctl list-ports br-ex
enp5s0
patch-br-ex_master-0-2-to-br-int
[kni@provisionhost-0-0 ~]$ ssh core@worker-0-0 sudo ovs-vsctl list-ports br-ex
enp5s0
patch-br-ex_worker-0-0-to-br-int

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
Created attachment 1719306 [details]
capture with port included to br-ex bridge

Description of problem:
This is a baremetal IPI environment simulated on VMs. During an upgrade attempt from 4.5.0-0.nightly-2020-10-03-071140 to 4.6.0-0.nightly-2020-10-02-065738, while the last operator (machine-config) was updating and after the worker nodes had been rebooted, I noticed increased CPU usage and packet loss when connecting to the VMs. Note that at this point the master nodes had not been rebooted yet.

When running tcpdump on the virtual switch interface on the hypervisor I could see a huge number of ARP frames, which indicates a broadcast storm. To isolate the issue I left only one worker node powered on and removed the physical port from the br-ex bridge, which stopped the broadcast storm.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-10-02-065738

How reproducible:
Reproduced 2 times on a VM environment

Steps to Reproduce:
1. Deploy a 4.5 baremetal IPI deployment simulated on VMs with OVN Hybrid overlay enabled
2. Trigger upgrade to 4.6
3. Wait until the machine-config operator starts upgrading
4. Wait for the worker nodes to be rebooted and get the br-ex bridge configured
5. On the hypervisor run: tcpdump -i baremetal-0 arp -n
where baremetal-0 is the libvirt interface of the external network the nodes are connected to

Actual results:
Huge number of ARP frames, which indicates a broadcast storm

Expected results:
Normal number of ARP frames

Additional info:
After removing the port from the br-ex bridge by running 'ovs-vsctl del-port br-ex enp5s0' the broadcast storm stops. When adding it back with 'ovs-vsctl add-port br-ex enp5s0' we can immediately see a huge number of ARP frames. Attaching captures with and without the port connected to the br-ex bridge on the worker node.
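To put a number on "huge" when comparing the two captures, the text output of the tcpdump command from step 5 can be counted per second. Below is a minimal Python sketch under a few assumptions: the input is tcpdump's default one-line text format with a leading HH:MM:SS.micro timestamp, the sample lines are made up for illustration and are not taken from the attached captures, and the `STORM_THRESHOLD` value is hypothetical.

```python
from collections import Counter

# Hypothetical cut-off for the tiny sample below; a real threshold depends on
# how many hosts share the network.
STORM_THRESHOLD = 3


def arp_frames_per_second(tcpdump_lines):
    """Count ARP lines per wall-clock second in `tcpdump -n arp` text output."""
    counts = Counter()
    for line in tcpdump_lines:
        parts = line.split()
        # Expected shape: "<HH:MM:SS.micro> ARP, Request who-has ..."
        if len(parts) >= 2 and parts[1] == "ARP,":
            second = parts[0].split(".")[0]
            counts[second] += 1
    return counts


# Illustrative lines only (invented timestamps and addresses).
lines = [
    "10:00:01.000100 ARP, Request who-has 192.168.123.1 tell 192.168.123.120, length 28",
    "10:00:01.000200 ARP, Request who-has 192.168.123.1 tell 192.168.123.120, length 28",
    "10:00:01.000300 ARP, Request who-has 192.168.123.1 tell 192.168.123.120, length 28",
    "10:00:02.500000 ARP, Request who-has 192.168.123.1 tell 192.168.123.102, length 28",
]

counts = arp_frames_per_second(lines)
noisy_seconds = sorted(sec for sec, n in counts.items() if n >= STORM_THRESHOLD)
print(dict(counts))
print(noisy_seconds)
```

Running the same count on output captured with and without enp5s0 attached to br-ex would give a per-second rate for each case instead of a visual impression.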