Bug 1891516

Summary: OVN incorrectly prefers static routes over directly connected routes
Product: Red Hat Enterprise Linux Fast Datapath Reporter: Tim Rozet <trozet>
Component: OVNAssignee: lorenzo bianconi <lorenzo.bianconi>
Status: CLOSED WONTFIX QA Contact: Jianlin Shi <jishi>
Severity: high Docs Contact:
Priority: unspecified    
Version: RHEL 8.0CC: ctrautma, lorenzo.bianconi
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-22 20:23:40 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Tim Rozet 2020-10-26 14:02:34 UTC
Description of problem:
Consider the following topology:

    100.64.0.1                                       10.244.0.6
ext--GR1 ------------OVN cluster router (DR)-----------pod1 
                     |
ext--GR2 -------------
    100.64.0.2

pod1 and GR 1 are on the same node. GR1 and GR2 both SNAT ingress traffic to them.

OVN DR has static routes for:

10.244.0.0/24 via 100.64.0.1 src-ip

Now if a packet comes from GR1 from external network (ext) it will be SNAT to 100.64.0.1 and send to pod1. pod 1 sends a reply packet and it goes back to 100.64.0.1 correctly.  However, if a packet ingresses at GR2, it gets SNAT to 100.64.0.2 and the packet is sent to pod 1. pod1 then replies and the packet goes to the DR. At this point OVN uses the static route and incorrectly sends the packet to GR1 (asymmetrical routing).

The correct behavior should have been to send the packet to GR2. This is because the destination is 100.64.0.2, and a router should always have a lower metric for directly connected routes. A workaround to this problem is to add a dummy static route with higher prefix match like:

100.64.0.1/32 via 100.64.0.1 dst-ip
100.64.0.2/32 via 100.64.0.2 dst-ip

^This is kind of silly, but it works.

Comment 1 Tim Rozet 2021-02-10 17:32:13 UTC
Here are the relevant lflows:
  table=10(lr_in_ip_routing   ), priority=65   , match=(ip4.dst == 100.64.0.2/32), action=(ip.ttl--; reg8[0..15] = 0; reg0 = 100.64.0.2; reg1 = 100.64.0.1; eth.src = 0a:58:64:40:00:01; outport = "rtoj-ovn_cluster_router"; flags.loopback = 1; next;)
  table=10(lr_in_ip_routing   ), priority=65   , match=(ip4.dst == 100.64.0.3/32), action=(ip.ttl--; reg8[0..15] = 0; reg0 = 100.64.0.3; reg1 = 100.64.0.1; eth.src = 0a:58:64:40:00:01; outport = "rtoj-ovn_cluster_router"; flags.loopback = 1; next;)
  table=10(lr_in_ip_routing   ), priority=65   , match=(ip4.dst == 100.64.0.4/32), action=(ip.ttl--; reg8[0..15] = 0; reg0 = 100.64.0.4; reg1 = 100.64.0.1; eth.src = 0a:58:64:40:00:01; outport = "rtoj-ovn_cluster_router"; flags.loopback = 1; next;)
  table=10(lr_in_ip_routing   ), priority=49   , match=(ip4.dst == 10.244.0.0/24), action=(ip.ttl--; reg8[0..15] = 0; reg0 = ip4.dst; reg1 = 10.244.0.1; eth.src = 0a:58:0a:f4:00:01; outport = "rtos-ovn-worker"; flags.loopback = 1; next;)
  table=10(lr_in_ip_routing   ), priority=49   , match=(ip4.dst == 10.244.1.0/24), action=(ip.ttl--; reg8[0..15] = 0; reg0 = ip4.dst; reg1 = 10.244.1.1; eth.src = 0a:58:0a:f4:01:01; outport = "rtos-ovn-control-plane"; flags.loopback = 1; next;)
  table=10(lr_in_ip_routing   ), priority=49   , match=(ip4.dst == 10.244.2.0/24), action=(ip.ttl--; reg8[0..15] = 0; reg0 = ip4.dst; reg1 = 10.244.2.1; eth.src = 0a:58:0a:f4:02:01; outport = "rtos-ovn-worker2"; flags.loopback = 1; next;)
  table=10(lr_in_ip_routing   ), priority=48   , match=(ip4.src == 10.244.0.0/24), action=(ip.ttl--; reg8[0..15] = 0; reg0 = 100.64.0.4; reg1 = 100.64.0.1; eth.src = 0a:58:64:40:00:01; outport = "rtoj-ovn_cluster_router"; flags.loopback = 1; next;)
  table=10(lr_in_ip_routing   ), priority=48   , match=(ip4.src == 10.244.1.0/24), action=(ip.ttl--; reg8[0..15] = 0; reg0 = 100.64.0.2; reg1 = 100.64.0.1; eth.src = 0a:58:64:40:00:01; outport = "rtoj-ovn_cluster_router"; flags.loopback = 1; next;)
  table=10(lr_in_ip_routing   ), priority=48   , match=(ip4.src == 10.244.2.0/24), action=(ip.ttl--; reg8[0..15] = 0; reg0 = 100.64.0.3; reg1 = 100.64.0.1; eth.src = 0a:58:64:40:00:01; outport = "rtoj-ovn_cluster_router"; flags.loopback = 1; next;)
  table=10(lr_in_ip_routing   ), priority=41   , match=(ip4.dst == 169.254.0.0/20), action=(ip.ttl--; reg8[0..15] = 0; reg0 = ip4.dst; reg1 = 169.254.0.2; eth.src = 0a:58:a9:fe:00:02; outport = "rtos-node_local_switch"; flags.loopback = 1; next;)
  table=10(lr_in_ip_routing   ), priority=33   , match=(ip4.dst == 100.64.0.0/16), action=(ip.ttl--; reg8[0..15] = 0; reg0 = ip4.dst; reg1 = 100.64.0.1; eth.src = 0a:58:64:40:00:01; outport = "rtoj-ovn_cluster_router"; flags.loopback = 1; next;)

The priority 65 flows here are the outcome of our workaround routes:
  table=10(lr_in_ip_routing   ), priority=65   , match=(ip4.dst == 100.64.0.2/32), action=(ip.ttl--; reg8[0..15] = 0; reg0 = 100.64.0.2; reg1 = 100.64.0.1; eth.src = 0a:58:64:40:00:01; outport = "rtoj-ovn_cluster_router"; flags.loopback = 1; next;)
  table=10(lr_in_ip_routing   ), priority=65   , match=(ip4.dst == 100.64.0.3/32), action=(ip.ttl--; reg8[0..15] = 0; reg0 = 100.64.0.3; reg1 = 100.64.0.1; eth.src = 0a:58:64:40:00:01; outport = "rtoj-ovn_cluster_router"; flags.loopback = 1; next;)
  table=10(lr_in_ip_routing   ), priority=65   , match=(ip4.dst == 100.64.0.4/32), action=(ip.ttl--; reg8[0..15] = 0; reg0 = 100.64.0.4; reg1 = 100.64.0.1; eth.src = 0a:58:64:40:00:01; outport = "rtoj-ovn_cluster_router"; flags.loopback = 1; next;)

100.64.0.1/32 via 100.64.0.1 dst-ip
100.64.0.2/32 via 100.64.0.2 dst-ip

The directly connected route has a priority of 33 here:
  table=10(lr_in_ip_routing   ), priority=33   , match=(ip4.dst == 100.64.0.0/16), action=(ip.ttl--; reg8[0..15] = 0; reg0 = ip4.dst; reg1 = 100.64.0.1; eth.src = 0a:58:64:40:00:01; outport = "rtoj-ovn_cluster_router"; flags.loopback = 1; next;)

As Lorenzo explained, this is calculated by prefix * 2 + 1. However, this route should have the highest possible priority (67 should be safe). This is because this is a directly connected route which has a lower metric in routing than any other static route. On most real routing appliances a directly connected route has a metric of 0, and a user is not able to add a static route with a metric any lower than 1. In other words, directly connected routes *always* have higher priority and cannot be overidden with static routes. If a user wants to do something different than traditional routing, they may use policy routing (available in OVN) to override the route lookup determined.

Comment 2 lorenzo bianconi 2021-02-18 22:17:21 UTC
(In reply to Tim Rozet from comment #0)
> Description of problem:
> Consider the following topology:
> 
>     100.64.0.1                                       10.244.0.6
> ext--GR1 ------------OVN cluster router (DR)-----------pod1 
>                      |
> ext--GR2 -------------
>     100.64.0.2
> 
> pod1 and GR 1 are on the same node. GR1 and GR2 both SNAT ingress traffic to
> them.
> 
> OVN DR has static routes for:
> 
> 10.244.0.0/24 via 100.64.0.1 src-ip
> 
> Now if a packet comes from GR1 from external network (ext) it will be SNAT
> to 100.64.0.1 and send to pod1. pod 1 sends a reply packet and it goes back
> to 100.64.0.1 correctly.  However, if a packet ingresses at GR2, it gets
> SNAT to 100.64.0.2 and the packet is sent to pod 1. pod1 then replies and
> the packet goes to the DR. At this point OVN uses the static route and
> incorrectly sends the packet to GR1 (asymmetrical routing).
> 
> The correct behavior should have been to send the packet to GR2. This is
> because the destination is 100.64.0.2, and a router should always have a
> lower metric for directly connected routes. A workaround to this problem is
> to add a dummy static route with higher prefix match like:
> 
> 100.64.0.1/32 via 100.64.0.1 dst-ip
> 100.64.0.2/32 via 100.64.0.2 dst-ip
> 
> ^This is kind of silly, but it works.

Based on the upstream discussion [0], the issue can be fixed using the policy routing table implemented in OVN without removing the capability to overwrite automatic routes with static ones according to the "longest prefix match".
@Tim: do you agree to close the bz as "not a bug"?

[0] https://mail.openvswitch.org/pipermail/ovs-dev/2021-February/380295.html

Comment 3 Tim Rozet 2021-02-22 20:05:15 UTC
Yeah, we can just keep the workaround for now. It doesn't hurt us to have it.