Bug 1751733

Summary: OVS-DPDK bridges should be DOWN
Product: Red Hat OpenStack Reporter: Fouad Hallal <fhallal>
Component: openvswitch Assignee: David Marchand <dmarchan>
Status: CLOSED WONTFIX QA Contact: Eran Kuris <ekuris>
Severity: high Docs Contact:
Priority: high    
Version: 10.0 (Newton) CC: apevec, bcafarel, cfontain, chrisw, djuran, fbaudin, fhallal, kfida, njohnston, rhos-maint
Target Milestone: --- Keywords: Triaged, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-01-13 14:12:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1628227    

Comment 2 David Marchand 2019-11-12 15:44:39 UTC
I can see two ways to handle the initial performance issue that had been observed:
- put the bridge interfaces down, but we then need two building blocks (neighbour + routing information) for tunnels over IP, like VXLAN.
  OVS currently relies on the Linux kernel to provide both. An internal cache is filled with information coming from netlink.
  The cache can be manually populated too, but any update coming from netlink would flush the whole cache.

- have the PMD threads "offload" the syscall/transmission of a packet on the bridge interface to a non-PMD thread.
  This "offload" requires a non-blocking communication channel.
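
For the first option, manually populating the route and neighbour caches can be sketched with the existing ovs-appctl commands. This is only a minimal sketch under assumptions: the bridge name, subnet, peer IP and peer MAC are placeholder values, and the script only prints the commands unless DRY_RUN=0 is set. As noted above, entries added this way may be lost on the next netlink update.

```shell
#!/bin/sh
# Sketch: manually fill the OVS userspace route and neighbour caches.
# BRIDGE, SUBNET, PEER_IP and PEER_MAC are placeholder values (assumptions).
BRIDGE="${BRIDGE:-br-link}"
SUBNET="${SUBNET:-16.0.0.0/24}"
PEER_IP="${PEER_IP:-16.0.0.1}"
PEER_MAC="${PEER_MAC:-52:54:00:00:00:01}"
DRY_RUN="${DRY_RUN:-1}"   # 1: only print the commands, 0: execute them

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

populate_caches() {
    # Route used to select the egress port for the tunnel outer destination.
    run ovs-appctl ovs/route/add "$SUBNET" "$BRIDGE"
    # Neighbour entry used to resolve the MAC address of the remote endpoint.
    run ovs-appctl tnl/arp/set "$BRIDGE" "$PEER_IP" "$PEER_MAC"
    # Dump both caches to check the result.
    run ovs-appctl ovs/route/show
    run ovs-appctl tnl/arp/show
}

populate_caches
```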

Comment 3 David Marchand 2019-12-09 13:15:58 UTC
Talked to Christophe, focusing on the first proposed solution.
We agreed on a debugging session on his platform, most likely this week.

Comment 4 David Marchand 2019-12-13 16:18:16 UTC
Writing my current notes following debugging sessions with Christophe.

We can make use of a "dummy" netdev.

- This netdev carries the IP address of the tunnel endpoint and is put in the bridge receiving the encapsulated traffic.
Either os-net-config or neutron must configure this IP address by calling ovs-appctl netdev-dummy/ip4addr, and enable the VXLAN listener by configuring a route for the listening IP.
# ovs-appctl netdev-dummy/ip4addr dummy0 16.0.0.2/24
# ovs-appctl ovs/route/add 16.0.0.2/32 dummy0
# ovs-appctl ovs/route/add 16.0.0.0/24 dummy0

- No IP address is put on the bridge netdev itself, which is an issue for neutron, which checks for this IP address.
So a command has been added in OVS to dump IP addresses.
# ovs-appctl ovs/ip/show dummy0
16.0.0.2/24

- This netdev itself replies to ARP requests.
ICMPv6 is not handled; I wrote a patch for it.

- One additional problem identified during these sessions is that OVS 2.11 is missing the upstream change "userspace: Enable non-bridge port as tunnel endpoint.".

- By default, the dummy netdev picks a MAC address with a fixed format; I will investigate this before submitting all those changes upstream.
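
The configuration steps from these notes can be gathered into one script, as a sketch only. It derives the two routes (host /32 and connected subnet) from a single ADDR/PLEN value. The bridge name br-link and the ovs-vsctl add-port step are assumptions that are not part of the notes above. Upstream builds normally expose the dummy netdev type only when ovs-vswitchd runs with --enable-dummy, and whether the test RPM changes that is not stated here. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: configure a dummy netdev as local tunnel endpoint.
# br-link and the add-port step are assumptions, not taken from the notes.
DRY_RUN="${DRY_RUN:-1}"   # 1: only print the commands, 0: execute them

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# Print the network of an IPv4 ADDR/PLEN, e.g. 16.0.0.2/24 gives 16.0.0.0/24.
# The body runs in a subshell so that IFS and variables do not leak.
cidr_network() (
    addr="${1%/*}"
    plen="${1#*/}"
    IFS=.
    set -- $addr
    ip=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    if [ "$plen" -eq 0 ]; then
        mask=0
    else
        mask=$(( (0xffffffff << (32 - $plen)) & 0xffffffff ))
    fi
    net=$(( ip & mask ))
    echo "$(( (net >> 24) & 255 )).$(( (net >> 16) & 255 )).$(( (net >> 8) & 255 )).$(( net & 255 ))/$plen"
)

setup_dummy_endpoint() {
    # $1: bridge, $2: dummy port name, $3: endpoint address as ADDR/PLEN
    run ovs-vsctl add-port "$1" "$2" -- set Interface "$2" type=dummy
    run ovs-appctl netdev-dummy/ip4addr "$2" "$3"
    # Host route: makes OVS listen for tunnel traffic sent to this address.
    run ovs-appctl ovs/route/add "${3%/*}/32" "$2"
    # Connected route: reach the remote endpoints on the same subnet.
    run ovs-appctl ovs/route/add "$(cidr_network "$3")" "$2"
}

setup_dummy_endpoint br-link dummy0 16.0.0.2/24
```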

Christophe wants to test an RPM I provided him with the current changes.

Comment 5 David Marchand 2021-01-13 14:09:15 UTC
Revisiting the problem and trying to summarize.

This issue comes from the use of NORMAL actions in the OVS pipeline, which have a negative impact on performance.
On a bridge, such a NORMAL action means that packets for unknown destinations are flooded and end up on a tap iface.
The workaround is to put the tap ifaces down.
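
A minimal sketch of this workaround, assuming a bridge named br-ex (a placeholder): count the flows that use the NORMAL action, then put the bridge tap iface down. The link change is only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: detect NORMAL flows and put the bridge tap iface down.
# The bridge name br-ex is a placeholder (assumption).
DRY_RUN="${DRY_RUN:-1}"   # 1: only print the commands, 0: execute them

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# Read "ovs-ofctl dump-flows" output on stdin and print the number of
# flows whose action list contains NORMAL.
count_normal_flows() {
    grep -c 'actions=.*NORMAL' || true
}

# Put the kernel-visible tap iface of bridge $1 down.
bridge_down() {
    run ip link set dev "$1" down
}

bridge_down br-ex
```

On a live host, the flow count could be obtained with `ovs-ofctl dump-flows br-ex | count_normal_flows`.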


In OVS-DPDK + VXLAN setups, putting the tap iface down is a problem as the kernel is used to fulfill functions needed by OVS:
- provide routing information, to know how to route packets into an IP tunnel,
- provide neighbour info, to know how to build the outer IP header when encapsulating packets,
- reply to ARP when the remote tunnel endpoint tries to resolve this side of the IP tunnel.
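
The first two dependencies can be compared between the kernel and the OVS caches with a sketch like the following. The remote endpoint address 16.0.0.1 is a placeholder, and the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: compare kernel and OVS views for one remote tunnel endpoint.
# The address used in the call below is a placeholder (assumption).
DRY_RUN="${DRY_RUN:-1}"   # 1: only print the commands, 0: execute them

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

check_tunnel_deps() {
    # $1: IP address of the remote tunnel endpoint
    run ip route get "$1"                  # kernel routing decision
    run ip neigh show to "$1"              # kernel neighbour entry
    run ovs-appctl ovs/route/lookup "$1"   # route as cached by OVS
    run ovs-appctl tnl/arp/show            # neighbours as cached by OVS
}

check_tunnel_deps 16.0.0.1
```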


Now, let us reconsider this in the light of OVN getting into RHOSP and configuring OVS.
To route packets through tunnels, OVN writes fully described OF rules.
So there would be no need to keep the tap iface up.

Besides, OVN does not rely on NORMAL actions in its pipeline, so the performance issue should not occur anyway.

Comment 6 Franck Baudin 2021-01-13 14:12:39 UTC
As we move to OVN, we won't pursue fixing this bug; we will re-open it if OVN requires tap interfaces to be up in the future.