Bug 1849683
| Summary: | [RFE] Add support for stateful next hop (ECMP bypass) | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Tim Rozet <trozet> |
| Component: | OVN | Assignee: | Mark Michelson <mmichels> |
| Status: | CLOSED ERRATA | QA Contact: | Jianlin Shi <jishi> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | RHEL 8.0 | CC: | ctrautma, dceara, mmichels, nusiddiq, ricarril |
| Target Milestone: | --- | Keywords: | FutureFeature |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-09-16 16:01:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
I've been looking into this today, and I think I have some good foundations for this. But I want to run the idea by both the core OVN team and the OpenShift folk to make sure that everything is sound.

For external -> internal traffic

First order of business is detecting if we have received traffic from one of our configured ECMP routes and if so, which one. The only way I'm aware of to recognize that the packet came from one of our nexthop routes is based on source MAC.

Next, once we determine which of the ECMP routes we received the packet on, we need to invoke conntrack. My thought was that we could use ct_label or ct_mark to hold the ECMP group ID and ECMP route ID. In my example below, I use ct_label, but we could just as easily use ct_mark.

For internal -> external traffic

We need to establish a rule at a higher priority than typical ECMP rules that states that if ct_label holds ECMP IDs, use those instead of the usual selection method.

Let's run through a sample scenario using the diagram above. Let's say that 1.1.1.3 has MAC 00:00:01:01:01:03 and 1.1.1.4 has MAC 00:00:01:01:01:04. Our router port (gr-ext) has an IP address of 192.168.0.1 and a MAC of 00:00:00:00:00:01.

Currently, the ingress router pipeline would have flows like the following:

table=9 (lr_in_ip_routing), priority=64, match=(ip4.src == 10.0.0.6), action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = select(2, 3);)
table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 && reg8[16..31] == 2), action=(reg0 = 1.1.1.3; reg1 = 192.168.0.1; eth.src = 00:00:00:00:00:01; outport = lrp1;)
table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 && reg8[16..31] == 3), action=(reg0 = 1.1.1.4; reg1 = 192.168.0.1; eth.src = 00:00:00:00:00:01; outport = lrp1;)

In table 9, we select one of the two routes to use (I arbitrarily gave them IDs 2 and 3 here). And in table 10, we use that selected value to set the nexthop address in reg0.

Now let's add in the changes I described. Now the ingress router pipeline would look like this:

table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && eth.src == 00:00:01:01:01:03 && ip4.dst == 10.0.0.6), action=(ct_commit(ct_label=00000002/32);)
table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && eth.src == 00:00:01:01:01:04 && ip4.dst == 10.0.0.6), action=(ct_commit(ct_label=00000003/32);)
table=9 (lr_in_ip_routing), priority=100, match=(ct.rpl && ct_label[0..15] == 0 && ct_label[16..31] == 2 && ip4.src == 10.0.0.6), action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = 2;)
table=9 (lr_in_ip_routing), priority=100, match=(ct.rpl && ct_label[0..15] == 0 && ct_label[16..31] == 3 && ip4.src == 10.0.0.6), action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = 3;)
table=9 (lr_in_ip_routing), priority=64, match=(ip4.src == 10.0.0.6), action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = select(2, 3);)
table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 && reg8[16..31] == 2), action=(reg0 = 1.1.1.3; reg1 = 192.168.0.1; eth.src = 00:00:00:00:00:01; outport = lrp1;)
table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 && reg8[16..31] == 3), action=(reg0 = 1.1.1.4; reg1 = 192.168.0.1; eth.src = 00:00:00:00:00:01; outport = lrp1;)

The top four flows are new. The bottom three are identical to the previous ones. At some point prior to table 9, we need to check the source MAC and commit to conntrack if it matches one of our ECMP nexthops.
The flow also matches on the ip4 destination in order to limit conntrack use only to when we should need it. The ct_label's top four bits are the ECMP group ID (0) and the bottom four bits are the route ID (2 and 3, respectively). I put ? for the table because I'm not sure if this fits well in an existing table or if it is going to require a new one to be created. Now in table 9, we've added two new 100-priority flows. These will match on reply traffic and set reg8 to have the same contents as ct_label.

Does this seem like a reasonable approach?

In my previous comment, all of the "outport = lrp1" assignments should be "outport = gr-ext". (I really wish bugzilla allowed comment editing.)

(In reply to Mark Michelson from comment #1)
> I've been looking into this today, and I think I have some good foundations for this. But I want to run the idea by both the core OVN team and the OpenShift folk to make sure that everything is sound.
>
> For external -> internal traffic
>
> First order of business is detecting if we have received traffic from one of our configured ECMP routes and if so, which one. The only way I'm aware of to recognize that the packet came from one of our nexthop routes is based on source MAC.
>
> Next, once we determine which of the ECMP routes we received the packet on, we need to invoke conntrack. My thought was that we could use ct_label or ct_mark to hold the ECMP group ID and ECMP route ID. In my example below, I use ct_label, but we could just as easily use ct_mark.
>
> For internal -> external traffic
>
> We need to establish a rule at a higher priority than typical ECMP rules that states that if ct_label holds ECMP IDs, use those instead of the usual selection method.
>
> Let's run through a sample scenario using the diagram above. Let's say that 1.1.1.3 has MAC 00:00:01:01:01:03 and 1.1.1.4 has MAC 00:00:01:01:01:04. Our router port (gr-ext) has an IP address of 192.168.0.1 and a MAC of 00:00:00:00:00:01.

A bit of a nitpick, but the gr-ext IP should be in the same subnet as the ECMP next-hops. We don't support indirect next-hops in OVN. And setting reg0 to 1.1.1.3 and reg1 to 192.168.0.1 will cause ARP requests with arp.spa=192.168.0.1 && arp.tpa==1.1.1.3 to be generated. These would most likely be dropped as invalid at 1.1.1.3.

> Currently, the ingress router pipeline would have flows like the following:
>
> table=9 (lr_in_ip_routing), priority=64, match=(ip4.src == 10.0.0.6), action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = select(2, 3);)
> table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 && reg8[16..31] == 2), action=(reg0 = 1.1.1.3; reg1 = 192.168.0.1; eth.src = 00:00:00:00:00:01; outport = lrp1;)
> table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 && reg8[16..31] == 3), action=(reg0 = 1.1.1.4; reg1 = 192.168.0.1; eth.src = 00:00:00:00:00:01; outport = lrp1;)
>
> In table 9, we select one of the two routes to use (I arbitrarily gave them IDs 2 and 3 here). And in table 10, we use that selected value to set the nexthop address in reg0.
>
> Now let's add in the changes I described. Now the ingress router pipeline would look like this:

It seems to me like we miss a flow before this that would do ct() in the current zone to get the ct_state that's used below to determine if a packet is for a new session or not.

> table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && eth.src == 00:00:01:01:01:03 && ip4.dst == 10.0.0.6), action=(ct_commit(ct_label=00000002/32);)
> table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && eth.src == 00:00:01:01:01:04 && ip4.dst == 10.0.0.6), action=(ct_commit(ct_label=00000003/32);)

I'm a bit confused: how do we know the ETH addresses of the ECMP next-hops? These can be outside OVN (and they actually are outside in the diagram above)? Is the plan to install these flows dynamically based on MAC_Binding records?

> table=9 (lr_in_ip_routing), priority=100, match=(ct.rpl && ct_label[0..15] == 0 && ct_label[16..31] == 2 && ip4.src == 10.0.0.6), action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = 2;)
> table=9 (lr_in_ip_routing), priority=100, match=(ct.rpl && ct_label[0..15] == 0 && ct_label[16..31] == 3 && ip4.src == 10.0.0.6), action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = 3;)
> table=9 (lr_in_ip_routing), priority=64, match=(ip4.src == 10.0.0.6), action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = select(2, 3);)
> table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 && reg8[16..31] == 2), action=(reg0 = 1.1.1.3; reg1 = 192.168.0.1; eth.src = 00:00:00:00:00:01; outport = lrp1;)
> table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 && reg8[16..31] == 3), action=(reg0 = 1.1.1.4; reg1 = 192.168.0.1; eth.src = 00:00:00:00:00:01; outport = lrp1;)
>
> The top four flows are new. The bottom three are identical to the previous ones. At some point prior to table 9, we need to check the source MAC and commit to conntrack if it matches one of our ECMP nexthops. The flow also matches on the ip4 destination in order to limit conntrack use only to when we should need it. The ct_label's top four bits are the ECMP group ID (0) and the bottom four bits are the route ID (2 and 3, respectively). I put ? for the table because I'm not sure if this fits well in an existing table or if it is going to require a new one to be created. Now in table 9, we've added two new 100-priority flows. These will match on reply traffic and set reg8 to have the same contents as ct_label.
>
> Does this seem like a reasonable approach?

(In reply to Dumitru Ceara from comment #3)
> (In reply to Mark Michelson from comment #1)
> > I've been looking into this today, and I think I have some good foundations for this. But I want to run the idea by both the core OVN team and the OpenShift folk to make sure that everything is sound.
> >
> > For external -> internal traffic
> >
> > First order of business is detecting if we have received traffic from one of our configured ECMP routes and if so, which one. The only way I'm aware of to recognize that the packet came from one of our nexthop routes is based on source MAC.
> >
> > Next, once we determine which of the ECMP routes we received the packet on, we need to invoke conntrack. My thought was that we could use ct_label or ct_mark to hold the ECMP group ID and ECMP route ID. In my example below, I use ct_label, but we could just as easily use ct_mark.
> >
> > For internal -> external traffic
> >
> > We need to establish a rule at a higher priority than typical ECMP rules that states that if ct_label holds ECMP IDs, use those instead of the usual selection method.
> >
> > Let's run through a sample scenario using the diagram above. Let's say that 1.1.1.3 has MAC 00:00:01:01:01:03 and 1.1.1.4 has MAC 00:00:01:01:01:04. Our router port (gr-ext) has an IP address of 192.168.0.1 and a MAC of 00:00:00:00:00:01.
> >
> A bit of a nitpick, but the gr-ext IP should be in the same subnet as the ECMP next-hops. We don't support indirect next-hops in OVN. And setting reg0 to 1.1.1.3 and reg1 to 192.168.0.1 will cause ARP requests with arp.spa=192.168.0.1 && arp.tpa==1.1.1.3 to be generated. These would most likely be dropped as invalid at 1.1.1.3.

Thanks for the clarification. Let's just mentally replace all occurrences of 192.168.0.1 with 1.1.1.3 instead.

> > Currently, the ingress router pipeline would have flows like the following:
> >
> > table=9 (lr_in_ip_routing), priority=64, match=(ip4.src == 10.0.0.6), action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = select(2, 3);)
> > table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 && reg8[16..31] == 2), action=(reg0 = 1.1.1.3; reg1 = 192.168.0.1; eth.src = 00:00:00:00:00:01; outport = lrp1;)
> > table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 && reg8[16..31] == 3), action=(reg0 = 1.1.1.4; reg1 = 192.168.0.1; eth.src = 00:00:00:00:00:01; outport = lrp1;)
> >
> > In table 9, we select one of the two routes to use (I arbitrarily gave them IDs 2 and 3 here). And in table 10, we use that selected value to set the nexthop address in reg0.
> >
> > Now let's add in the changes I described. Now the ingress router pipeline would look like this:
> >
> It seems to me like we miss a flow before this that would do ct() in the current zone to get the ct_state that's used below to determine if a packet is for a new session or not.

OK, that's a mistake on my part.

> > table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && eth.src == 00:00:01:01:01:03 && ip4.dst == 10.0.0.6), action=(ct_commit(ct_label=00000002/32);)
> > table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && eth.src == 00:00:01:01:01:04 && ip4.dst == 10.0.0.6), action=(ct_commit(ct_label=00000003/32);)
>
> I'm a bit confused: how do we know the ETH addresses of the ECMP next-hops? These can be outside OVN (and they actually are outside in the diagram above)? Is the plan to install these flows dynamically based on MAC_Binding records?

Yes, that's exactly what I was thinking. Is there a more foolproof way to determine whether the previous hop was one of the ECMP nexthops?

> > table=9 (lr_in_ip_routing), priority=100, match=(ct.rpl && ct_label[0..15] == 0 && ct_label[16..31] == 2 && ip4.src == 10.0.0.6), action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = 2;)
> > table=9 (lr_in_ip_routing), priority=100, match=(ct.rpl && ct_label[0..15] == 0 && ct_label[16..31] == 3 && ip4.src == 10.0.0.6), action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = 3;)
> > table=9 (lr_in_ip_routing), priority=64, match=(ip4.src == 10.0.0.6), action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = select(2, 3);)
> > table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 && reg8[16..31] == 2), action=(reg0 = 1.1.1.3; reg1 = 192.168.0.1; eth.src = 00:00:00:00:00:01; outport = lrp1;)
> > table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 && reg8[16..31] == 3), action=(reg0 = 1.1.1.4; reg1 = 192.168.0.1; eth.src = 00:00:00:00:00:01; outport = lrp1;)
> >
> > The top four flows are new. The bottom three are identical to the previous ones. At some point prior to table 9, we need to check the source MAC and commit to conntrack if it matches one of our ECMP nexthops. The flow also matches on the ip4 destination in order to limit conntrack use only to when we should need it. The ct_label's top four bits are the ECMP group ID (0) and the bottom four bits are the route ID (2 and 3, respectively). I put ? for the table because I'm not sure if this fits well in an existing table or if it is going to require a new one to be created. Now in table 9, we've added two new 100-priority flows. These will match on reply traffic and set reg8 to have the same contents as ct_label.
> >
> > Does this seem like a reasonable approach?
> > > table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && eth.src ==
> > > 00:00:01:01:01:03 && ip4.dst == 10.0.0.6),
> > > action=(ct_commit(ct_label=00000002/32);)
> > > table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && eth.src ==
> > > 00:00:01:01:01:04 && ip4.dst == 10.0.0.6),
> > > action=(ct_commit(ct_label=00000003/32);)
> >
> > I'm a bit confused, how do we know the ETH addresses of the ECMP next-hops?
> > These can be outside OVN (and they actually are outside in the diagram
> > above)? Is the plan to install these flows dynamically based on MAC_Binding
> > records?
>
> Yes, that's exactly what I was thinking. Is there a more foolproof way to
> determine whether the previous hop was one of the ECMP nexthops?
>
Not that I can think of, unfortunately.
Is there no capability to dynamically just track every new ingress flow and label it with the incoming mac? Like:

table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && in_port=gr-ext && ip4.dst == 10.0.0.0/24), action=(ct_commit(ct_label=eth.src);)

Yep, that would work and is much easier than any of the MAC_Binding stuff that was suggested. Dumitru POC'd the idea here to show it works: http://pastebin.test.redhat.com/880881

I'll move forward with this. Thanks!

So to sum up, this is (approximately) how it's going to look now:

table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && inport == gr-ext && ip4.dst == 10.0.0.6), action=(ct_commit(ct_label=eth.src);)
table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && inport == gr-ext && ip4.dst == 10.0.0.6), action=(ct_commit(ct_label=eth.src);)
table=9 (lr_in_ip_routing), priority=100, match=(ct.rpl && ip4.src == 10.0.0.6 && ct_label[0..47] != 0), action=(ip.ttl--; flags.loopback = 1; eth.src = 00:00:00:00:00:01; reg1 = 1.1.1.1; outport = gr-ext;)
table=9 (lr_in_ip_routing), priority=64, match=(ip4.src == 10.0.0.6), action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = select(2, 3);)
table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 && reg8[16..31] == 2), action=(reg0 = 1.1.1.3; reg1 = 1.1.1.1; eth.src = 00:00:00:00:00:01; outport = gr-ext;)
table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 && reg8[16..31] == 3), action=(reg0 = 1.1.1.4; reg1 = 1.1.1.1; eth.src = 00:00:00:00:00:01; outport = gr-ext;)
table=11 (lr_in_policy), priority=65535, match=(ct.rpl && ct_label[0..47] != 0), action=(next;)
table=12 (lr_in_arp_resolve), priority=200, match=(ct.rpl && ct_label[0..47] != 0), action=(eth.dst = ct_label[0..47];)

This assumes that replying on the same route that incoming traffic was received on overrides all logical router policies. If that's not how this should work, then table 11 will need some adjusting from what I have here.

"This assumes that replying on the same route that incoming traffic was received on overrides all logical router policies."

At first glance I *think* that should be OK. We would still need to hit SNAT and unSNAT on the router. But usually return traffic leaving the GR will be going to where it came from. If we want to narrow the scope of the behavior, we could apply it only to specific routes. So in OVN config we would tag routes with this behavior, like:

route src-ip 10.0.0.6 via 1.1.1.3 ecmp auto-bypass
route src-ip 10.0.0.6 via 1.1.1.4 ecmp auto-bypass
route dst-ip 8.8.8.8/32 via 1.1.1.2 auto-bypass
route dst-ip 7.7.7.7/32 via 1.1.1.2

Then we would have bypass flows that matched this for adding the CT label:

table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && inport == gr-ext && ip4.dst == 10.0.0.6), action=(ct_commit(ct_label=eth.src);)
table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && inport == gr-ext && ip4.src == 8.8.8.8), action=(ct_commit(ct_label=eth.src);)
<no entry for 7.7.7.7>

lr_in_??? would need to happen before snat.

Thinking about this some more, I think it is fine to make this a global setting for the router so that all return ingress traffic is bypassed.

I've hit a bit of a snag here. We assumed that all ECMP routes would egress the same router port. However, this is not necessarily the case. It is a valid configuration to have ECMP routes that each egress from a different router port. By storing the source mac of ingress traffic, we can know the destination mac for subsequent egress packets, but that is not enough to know from which router port to source the packet. So in addition to the source mac, we also need to store the logical router port on which the packet was received, that logical router port's IP, and that logical router port's mac. Otherwise, I can't properly route egress reply traffic.

So the options here are as follows:

1) Restrict the use of symmetric ECMP replies to routes that all egress the same logical router port, and require that logical router port to be explicitly configured in the northbound ECMP routes. With this limitation, I can use the current proposed solution.

2) Use the mac_binding for determining which hop the ingress traffic came from, and store the ECMP route ID based on this. This would require us to ensure there is a mac_binding present, which may require some extra finagling (i.e., sending ARP/ND packets).

What do you think? AFAIK, this feature is only currently requested by OpenShift, and the restriction imposed in (1) should be valid for that use case.

Actually, on second thought, I guess I only have to store the logical router port ID. I don't need to store its IP and MAC. I'll continue on this path and let you know how it works.

Just an update: I've got code written here: https://github.com/putnopvut/ovn/tree/auto_next_hop

I've written a test in tests/system-ovn.at that exercises the new feature. However, when I run the test, I see no packets matching the flow that checks for conntrack replies:

cookie=0x90c4bd7d, duration=6.020s, table=18, n_packets=0, n_bytes=0, idle_age=6, priority=100,ct_state=+rpl+trk,ip,metadata=0x1,nw_src=10.0.0.0/24 actions=dec_ttl(),load:0x1->NXM_NX_REG10[0],mod_dl_src:00:00:04:01:02:03,load:0x14000001->NXM_NX_XXREG0[64..95],load:0x2->NXM_NX_REG15[],resubmit(,19)

Instead, in table 18, we're hitting this:

cookie=0x7b2618c1, duration=6.019s, table=18, n_packets=18, n_bytes=1764, idle_age=0, priority=48,ip,metadata=0x1,nw_src=10.0.0.0/24 actions=dec_ttl(),load:0x1->NXM_NX_REG10[0],load:0x1->OXM_OF_PKT_REG4[32..47],group:1

which is the ordinary ECMP selection flow. It appears the ct_state is not what I am expecting in this case. Once I get this debugged and make the behavior configurable, the patch will be ready for more formal testing and code review.

I figured out the problem. There was a mismatch of conntrack zones being used: conntrack was being committed in one zone, but then the state was being checked in a separate zone. I've made the test scenario pass now, but the way I did it may not hold water in code review. So the things to do now are:

1) Tweak the test scenario to only succeed under proper conditions
2) Ensure that the conntrack zone usage I added won't cause problems
3) Make symmetric reply behavior configurable

Patch posted upstream: https://patchwork.ozlabs.org/project/openvswitch/list/?series=191072

To test, you'll need to specify that ECMP routes should have symmetric replies. One way to do it is:

ovn-nbctl --ecmp-symmetric-reply lr-route-add <router> <prefix> <nexthop>

Another way is:

ovn-nbctl create Logical_Router_Static_Route prefix=<prefix> nexthop=<nexthop> options:ecmp_symmetric_reply=true

This patch has been merged into master upstream. I am now working on getting it merged into downstream so it will be available in FDP builds.

The patch is now merged into the fast-datapath-next branch of ovn2.13. It will be in the next FDP release of OVN.

Tested with the following script:
# foo -- R1 -- join -- R2 -- alice --+
#        |      |                    +-- server
# bar ---+      +----- R3 -- bob ----+
#
systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:20.0.31.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.31.25
systemctl restart ovn-controller
ovn-nbctl lr-add R1
ovn-nbctl lr-add R2
ovn-nbctl lr-add R3
ovn-nbctl set logical_router R2 options:chassis=hv1
ovn-nbctl set logical_router R3 options:chassis=hv1
ovn-nbctl ls-add foo
ovn-nbctl ls-add bar
ovn-nbctl ls-add alice
ovn-nbctl ls-add bob
ovn-nbctl ls-add join
ovn-nbctl lrp-add R1 foo 00:00:01:01:02:03 192.168.1.1/24 2001::1/64
ovn-nbctl lsp-add foo rp-foo -- set logical_switch_port rp-foo \
type=router options:router-port=foo addresses=\"00:00:01:01:02:03\"
ovn-nbctl lrp-add R1 bar 00:00:01:01:02:04 192.168.2.1/24 2002::1/64
ovn-nbctl lsp-add bar rp-bar -- set Logical_Switch_Port rp-bar \
type=router options:router-port=bar addresses=\"00:00:01:01:02:04\"
ovn-nbctl lrp-add R2 alice 00:00:02:01:02:03 172.16.1.1/24 3001::1/64
ovn-nbctl lsp-add alice rp-alice -- set Logical_Switch_Port rp-alice \
type=router options:router-port=alice addresses=\"00:00:02:01:02:03\"
ovn-nbctl lrp-add R3 bob 00:00:03:01:02:03 172.17.1.1/24 3002::1/64
ovn-nbctl lsp-add bob rp-bob -- set Logical_Switch_Port rp-bob \
type=router options:router-port=bob addresses=\"00:00:03:01:02:03\"
ovn-nbctl lrp-add R1 R1_join 00:00:04:01:02:03 20.0.0.1/24 4000::1/64
ovn-nbctl lsp-add join r1-join -- set Logical_Switch_Port r1-join \
type=router options:router-port=R1_join addresses='"00:00:04:01:02:03"'
ovn-nbctl lrp-add R2 R2_join 00:00:04:01:02:04 20.0.0.2/24 4000::2/64
ovn-nbctl lsp-add join r2-join -- set Logical_Switch_Port r2-join \
type=router options:router-port=R2_join addresses='"00:00:04:01:02:04"'
ovn-nbctl lrp-add R3 R3_join 00:00:04:01:02:05 20.0.0.3/24 4000::3/64
ovn-nbctl lsp-add join r3-join -- set Logical_Switch_Port r3-join \
type=router options:router-port=R3_join addresses='"00:00:04:01:02:05"'
ovn-nbctl lr-route-add R2 192.168.0.0/16 20.0.0.1
ovn-nbctl lr-route-add R3 192.168.0.0/16 20.0.0.1
ovn-nbctl lr-route-add R2 2001::/64 4000::1
ovn-nbctl lr-route-add R2 2002::/64 4000::1
ovn-nbctl lr-route-add R3 2001::/64 4000::1
ovn-nbctl lr-route-add R3 2002::/64 4000::1
ovn-nbctl lr-route-add R2 1.1.1.0/24 172.16.1.3
ovn-nbctl lr-route-add R3 1.1.1.0/24 172.17.1.4
ovn-nbctl lr-route-add R2 1111::/64 3001::3
ovn-nbctl lr-route-add R3 1111::/64 3002::4
ip netns add foo1
ovs-vsctl add-port br-int foo1 -- set interface foo1 type=internal
ip link set foo1 netns foo1
ip netns exec foo1 ip link set foo1 address f0:00:00:01:02:03
ip netns exec foo1 ip link set foo1 up
ip netns exec foo1 ip addr add 192.168.1.2/24 dev foo1
ip netns exec foo1 ip -6 addr add 2001::2/64 dev foo1
ip netns exec foo1 ip route add default via 192.168.1.1 dev foo1
ip netns exec foo1 ip -6 route add default via 2001::1 dev foo1
ovs-vsctl set interface foo1 external_ids:iface-id=foo1
ovn-nbctl lsp-add foo foo1 -- lsp-set-addresses foo1 "f0:00:00:01:02:03 192.168.1.2 2001::2"
ip netns add bar1
ip link add bar1 netns bar1 type veth peer name bar1_br
ip netns exec bar1 ip link set bar1 address f0:00:00:01:02:05
ip netns exec bar1 ip link set bar1 up
ip netns exec bar1 ip addr add 192.168.2.2/24 dev bar1
ip netns exec bar1 ip -6 addr add 2002::2/64 dev bar1
ip netns exec bar1 ip route add default via 192.168.2.1 dev bar1
ip netns exec bar1 ip -6 route add default via 2002::1 dev bar1
ip link set bar1_br up
ovs-vsctl add-port br-int bar1_br
ovs-vsctl set interface bar1_br external_ids:iface-id=bar1
ovn-nbctl lsp-add bar bar1 -- lsp-set-addresses bar1 "f0:00:00:01:02:05 192.168.2.2 2002::2"
ovs-vsctl add-br br_alice
ovs-vsctl add-br br_bob
ovs-vsctl set open . external-ids:ovn-bridge-mappings=net_alice:br_alice,net_bob:br_bob
ovn-nbctl lsp-add alice ln_alice
ovn-nbctl lsp-set-type ln_alice localnet
ovn-nbctl lsp-set-addresses ln_alice unknown
ovn-nbctl lsp-set-options ln_alice network_name=net_alice
ip netns add alice1
ovs-vsctl add-port br_alice alice1 -- set interface alice1 type=internal
ip link set alice1 netns alice1
ip netns exec alice1 ip link set alice1 address f0:00:00:01:02:04
ip netns exec alice1 ip link set alice1 up
ip netns exec alice1 ip addr add 172.16.1.3/24 dev alice1
ip netns exec alice1 ip -6 addr add 3001::3/64 dev alice1
ip netns exec alice1 ip route add default via 172.16.1.1 dev alice1
ip netns exec alice1 ip -6 route add default via 3001::1 dev alice1
ovn-nbctl lsp-add bob ln_bob
ovn-nbctl lsp-set-type ln_bob localnet
ovn-nbctl lsp-set-addresses ln_bob unknown
ovn-nbctl lsp-set-options ln_bob network_name=net_bob
ip netns add bob1
ip link add bob1 netns bob1 type veth peer name bob1_br
ip netns exec bob1 ip link set bob1 address f0:00:00:01:02:06
ip netns exec bob1 ip link set bob1 up
ip netns exec bob1 ip addr add 172.17.1.4/24 dev bob1
ip netns exec bob1 ip -6 addr add 3002::4/64 dev bob1
ip netns exec bob1 ip route add default via 172.17.1.1 dev bob1
ip netns exec bob1 ip -6 route add default via 3002::1 dev bob1
ip link set bob1_br up
ovs-vsctl add-port br_bob bob1_br
ip link add br_test type bridge
ip link set br_test up
ip link add a1 netns alice1 type veth peer name a1_br
ip link add b1 netns bob1 type veth peer name b1_br
ip link set a1_br master br_test
ip link set b1_br master br_test
ip link set a1_br up
ip link set b1_br up
ip netns exec alice1 ip link set a1 up
ip netns exec bob1 ip link set b1 up
ip netns exec alice1 ip addr add 1.1.1.1/24 dev a1
ip netns exec alice1 ip -6 addr add 1111::1/64 dev a1
ip netns exec bob1 ip addr add 1.1.1.2/24 dev b1
ip netns exec bob1 ip -6 addr add 1111::2/64 dev b1
ip netns exec alice1 sysctl -w net.ipv4.conf.all.forwarding=1
ip netns exec bob1 sysctl -w net.ipv4.conf.all.forwarding=1
ip netns exec alice1 sysctl -w net.ipv6.conf.all.forwarding=1
ip netns exec bob1 sysctl -w net.ipv6.conf.all.forwarding=1
ip netns add server
ip link add s1 netns server type veth peer name s1_br
ip link set s1_br master br_test
ip link set s1_br up
ip netns exec server ip link set s1 up
ip netns exec server ip addr add 1.1.1.10/24 dev s1
ip netns exec server ip route add default via 1.1.1.1 dev s1
ip netns exec server ip -6 addr add 1111::10/64 dev s1
ip netns exec server ip -6 route add default via 1111::1 dev s1
ip netns exec server sysctl -w net.ipv4.conf.all.rp_filter=0
ip netns exec server sysctl -w net.ipv4.conf.default.rp_filter=0
ovn-nbctl --ecmp-symmetric-reply lr-route-add R1 0.0.0.0/0 20.0.0.2
ovn-nbctl --ecmp-symmetric-reply lr-route-add R1 0.0.0.0/0 20.0.0.3
ovn-nbctl --ecmp-symmetric-reply lr-route-add R1 ::/0 4000::2
ovn-nbctl --ecmp-symmetric-reply lr-route-add R1 ::/0 4000::3
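An optional sanity check at this point is to confirm that the routes and the ecmp_symmetric_reply option landed in the NB database; the option name is the one from the patch discussion above, and the exact output format depends on the ovn-nbctl version:
ovn-nbctl lr-route-list R1
ovn-nbctl --columns=ip_prefix,nexthop,options list Logical_Router_Static_Route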
tested on ovn2.13-20.06.2-2.el8fdp.x86_64:
[root@dell-per740-12 bz1849683]# ip netns exec foo1 ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
72673: foo1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether f0:00:00:01:02:03 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.2/24 scope global foo1
valid_lft forever preferred_lft forever
inet6 2001::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::f200:ff:fe01:203/64 scope link
valid_lft forever preferred_lft forever
[root@dell-per740-12 bz1849683]# ip netns exec server ip route list default
default via 1.1.1.1 dev s1
[root@dell-per740-12 bz1849683]# ip netns exec server ip -6 route list default
default via 1111::1 dev s1 metric 1024 pref medium
[root@dell-per740-12 bz1849683]# ip netns exec foo1 nc -l 83219 -k &
[root@dell-per740-12 bz1849683]# ip netns exec bob1 tcpdump -i any -w bob1.pcap &
[root@dell-per740-12 bz1849683]# for i in {1..10}; do
ip netns exec server nc 2001::2 10010 <<< h; done
[root@dell-per740-12 bz1849683]# for i in {1..10}; do
ip netns exec server nc 192.168.1.2 10010 <<< h; done
[root@dell-per740-12 bz1849683]# tcpdump -r bob1.pcap host 2001::2 -nnle
reading from file bob1.pcap, link-type LINUX_SLL (Linux cooked v1)
dropped privs to tcpdump
02:39:50.042647 In 00:00:03:01:02:03 ethertype IPv6 (0x86dd), length 96: 2001::2.10010 > 1111::10.34782: Flags [S.], seq 3475408287, ack 3362943445, win 28560, options [mss 1440,sackOK,TS val 3312980532 ecr 872133314,nop,wscale 7], length 0
02:39:50.042673 Out 1a:4b:91:bb:db:46 ethertype IPv6 (0x86dd), length 96: 2001::2.10010 > 1111::10.34782: Flags [S.], seq 3475408287, ack 3362943445, win 28560, options [mss 1440,sackOK,TS val 3312980532 ecr 872133314,nop,wscale 7], length 0
02:39:50.042791 In 00:00:03:01:02:03 ethertype IPv6 (0x86dd), length 88: 2001::2.10010 > 1111::10.34782: Flags [.], ack 3, win 224, options [nop,nop,TS val 3312980533 ecr 872133315], length 0
02:39:50.042802 Out 1a:4b:91:bb:db:46 ethertype IPv6 (0x86dd), length 88: 2001::2.10010 > 1111::10.34782: Flags [.], ack 3, win 224, options [nop,nop,TS val 3312980533 ecr 872133315], length 0
02:39:50.042926 In 00:00:03:01:02:03 ethertype IPv6 (0x86dd), length 88: 2001::2.10010 > 1111::10.34782: Flags [F.], seq 1, ack 4, win 224, options [nop,nop,TS val 3312980533 ecr 872133315], length 0
02:39:50.042943 Out 1a:4b:91:bb:db:46 ethertype IPv6 (0x86dd), length 88: 2001::2.10010 > 1111::10.34782: Flags [F.], seq 1, ack 4, win 224, options [nop,nop,TS val 3312980533 ecr 872133315], length 0
02:39:50.203681 In 00:00:03:01:02:03 ethertype IPv6 (0x86dd), length 96: 2001::2.10010 > 1111::10.34790: Flags [S.], seq 3068695351, ack 1492396622, win 28560, options [mss 1440,sackOK,TS val 3312980693 ecr 872133475,nop,wscale 7], length 0
02:39:50.203698 Out 1a:4b:91:bb:db:46 ethertype IPv6 (0x86dd), length 96: 2001::2.10010 > 1111::10.34790: Flags [S.], seq 3068695351, ack 1492396622, win 28560, options [mss 1440,sackOK,TS val 3312980693 ecr 872133475,nop,wscale 7], length 0
02:39:50.203815 In 00:00:03:01:02:03 ethertype IPv6 (0x86dd), length 88: 2001::2.10010 > 1111::10.34790: Flags [.], ack 3, win 224, options [nop,nop,TS val 3312980694 ecr 872133476], length 0
02:39:50.203821 Out 1a:4b:91:bb:db:46 ethertype IPv6 (0x86dd), length 88: 2001::2.10010 > 1111::10.34790: Flags [.], ack 3, win 224, options [nop,nop,TS val 3312980694 ecr 872133476], length 0
02:39:50.203990 In 00:00:03:01:02:03 ethertype IPv6 (0x86dd), length 88: 2001::2.10010 > 1111::10.34790: Flags [F.], seq 1, ack 4, win 224, options [nop,nop,TS val 3312980694 ecr 872133476], length 0
02:39:50.204004 Out 1a:4b:91:bb:db:46 ethertype IPv6 (0x86dd), length 88: 2001::2.10010 > 1111::10.34790: Flags [F.], seq 1, ack 4, win 224, options [nop,nop,TS val 3312980694 ecr 872133476], length 0
<=== still get packets from foo1(2001::2) on bob1
[root@dell-per740-12 bz1849683]# tcpdump -r bob1.pcap host 192.168.1.2 -nnle
reading from file bob1.pcap, link-type LINUX_SLL (Linux cooked v1)
dropped privs to tcpdump
02:40:07.042605 In 00:00:03:01:02:03 ethertype IPv4 (0x0800), length 76: 192.168.1.2.10010 > 1.1.1.10.38468: Flags [S.], seq 1582010048, ack 4256554733, win 28960, options [mss 1460,sackOK,TS val 3813709470 ecr 2021325724,nop,wscale 7], length 0
02:40:07.042626 Out 1a:4b:91:bb:db:46 ethertype IPv4 (0x0800), length 76: 192.168.1.2.10010 > 1.1.1.10.38468: Flags [S.], seq 1582010048, ack 4256554733, win 28960, options [mss 1460,sackOK,TS val 3813709470 ecr 2021325724,nop,wscale 7], length 0
02:40:07.042730 In 00:00:03:01:02:03 ethertype IPv4 (0x0800), length 68: 192.168.1.2.10010 > 1.1.1.10.38468: Flags [.], ack 3, win 227, options [nop,nop,TS val 3813709471 ecr 2021325726], length 0
02:40:07.042738 Out 1a:4b:91:bb:db:46 ethertype IPv4 (0x0800), length 68: 192.168.1.2.10010 > 1.1.1.10.38468: Flags [.], ack 3, win 227, options [nop,nop,TS val 3813709471 ecr 2021325726], length 0
02:40:07.042845 In 00:00:03:01:02:03 ethertype IPv4 (0x0800), length 68: 192.168.1.2.10010 > 1.1.1.10.38468: Flags [F.], seq 1, ack 4, win 227, options [nop,nop,TS val 3813709471 ecr 2021325726], length 0
02:40:07.042859 Out 1a:4b:91:bb:db:46 ethertype IPv4 (0x0800), length 68: 192.168.1.2.10010 > 1.1.1.10.38468: Flags [F.], seq 1, ack 4, win 227, options [nop,nop,TS val 3813709471 ecr 2021325726], length 0
...
<=== still get ipv4 packet from foo1 (192.168.1.2) on bob1
As the default IPv4 route on the server is via 1.1.1.1 (alice1), nc traffic to foo1 (192.168.1.2) goes through alice1 and then to R2 (20.0.0.2). The return packets should therefore also go back through R2 and then to alice1, so bob1 should not receive them.
The same applies to the IPv6 traffic.
Mark, what do you think? Is anything wrong here?
Packages used for comment 20:
[root@dell-per740-12 bz1849683]# uname -a
Linux dell-per740-12.rhts.eng.pek2.redhat.com 4.18.0-232.el8.x86_64 #1 SMP Mon Aug 10 06:55:47 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
[root@dell-per740-12 bz1849683]# rpm -qa | grep -E "openvswitch|ovn"
ovn2.13-20.06.2-2.el8fdp.x86_64
ovn2.13-central-20.06.2-2.el8fdp.x86_64
openvswitch2.13-2.13.0-54.el8fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-common-1.0-7.noarch
python3-openvswitch2.13-2.13.0-54.el8fdp.x86_64
ovn2.13-host-20.06.2-2.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch
> [root@dell-per740-12 bz1849683]# ip netns exec server ip route list default
> default via 1.1.1.1 dev s1
> [root@dell-per740-12 bz1849683]# ip netns exec server ip -6 route list
> default
> default via 1111::1 dev s1 metric 1024 pref medium
>
> [root@dell-per740-12 bz1849683]# ip netns exec foo1 nc -l 83219 -k &
<=== correction: here is "ip netns exec foo1 nc -l 10010 -k &"
Hi Jianlin Shi,

Please add the below in your rep.sh:

ovn-nbctl set logical_router R1 options:chassis=hv1

Symmetric ECMP reply is only usable on gateway routers, and hence you need to set R1 to a chassis.

Thanks

(In reply to Numan Siddique from comment #23)
> Hi Jianlin Shi,
>
> Please add the below in your rep.sh
>
> ovn-nbctl set logical_router R1 options:chassis=hv1
>
> Symmetric ECMP reply is only usable on gateway routers, and hence you need to set R1 to a chassis.
>
> Thanks

It works after adding the setting: no packets are received on bob1.

When the routes are instead added with plain ECMP:

ovn-nbctl --ecmp lr-route-add R1 0.0.0.0/0 20.0.0.2
ovn-nbctl --ecmp lr-route-add R1 0.0.0.0/0 20.0.0.3
ovn-nbctl --ecmp lr-route-add R1 ::/0 4000::2
ovn-nbctl --ecmp lr-route-add R1 ::/0 4000::3

bob1 would receive packets. So --ecmp-symmetric-reply works.

Verified on both the rhel7 and rhel8 versions.
The complete script is:
# foo -- R1 -- join -- R2 -- alice --+
#        |      |                    +-- server
# bar ---+      +----- R3 -- bob ----+
#
systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:20.0.50.26:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.50.26
systemctl restart ovn-controller
ovn-nbctl lr-add R1
ovn-nbctl lr-add R2
ovn-nbctl lr-add R3
ovn-nbctl set logical_router R1 options:chassis=hv1
ovn-nbctl set logical_router R2 options:chassis=hv1
ovn-nbctl set logical_router R3 options:chassis=hv1
ovn-nbctl ls-add foo
ovn-nbctl ls-add bar
ovn-nbctl ls-add alice
ovn-nbctl ls-add bob
ovn-nbctl ls-add join
ovn-nbctl lrp-add R1 foo 00:00:01:01:02:03 192.168.1.1/24 2001::1/64
ovn-nbctl lsp-add foo rp-foo -- set logical_switch_port rp-foo \
type=router options:router-port=foo addresses=\"00:00:01:01:02:03\"
ovn-nbctl lrp-add R1 bar 00:00:01:01:02:04 192.168.2.1/24 2002::1/64
ovn-nbctl lsp-add bar rp-bar -- set Logical_Switch_Port rp-bar \
type=router options:router-port=bar addresses=\"00:00:01:01:02:04\"
ovn-nbctl lrp-add R2 alice 00:00:02:01:02:03 172.16.1.1/24 3001::1/64
ovn-nbctl lsp-add alice rp-alice -- set Logical_Switch_Port rp-alice \
type=router options:router-port=alice addresses=\"00:00:02:01:02:03\"
ovn-nbctl lrp-add R3 bob 00:00:03:01:02:03 172.17.1.1/24 3002::1/64
ovn-nbctl lsp-add bob rp-bob -- set Logical_Switch_Port rp-bob \
type=router options:router-port=bob addresses=\"00:00:03:01:02:03\"
ovn-nbctl lrp-add R1 R1_join 00:00:04:01:02:03 20.0.0.1/24 4000::1/64
ovn-nbctl lsp-add join r1-join -- set Logical_Switch_Port r1-join \
type=router options:router-port=R1_join addresses='"00:00:04:01:02:03"'
ovn-nbctl lrp-add R2 R2_join 00:00:04:01:02:04 20.0.0.2/24 4000::2/64
ovn-nbctl lsp-add join r2-join -- set Logical_Switch_Port r2-join \
type=router options:router-port=R2_join addresses='"00:00:04:01:02:04"'
ovn-nbctl lrp-add R3 R3_join 00:00:04:01:02:05 20.0.0.3/24 4000::3/64
ovn-nbctl lsp-add join r3-join -- set Logical_Switch_Port r3-join \
type=router options:router-port=R3_join addresses='"00:00:04:01:02:05"'
ovn-nbctl lr-route-add R2 192.168.0.0/16 20.0.0.1
ovn-nbctl lr-route-add R3 192.168.0.0/16 20.0.0.1
ovn-nbctl lr-route-add R2 2001::/64 4000::1
ovn-nbctl lr-route-add R2 2002::/64 4000::1
ovn-nbctl lr-route-add R3 2001::/64 4000::1
ovn-nbctl lr-route-add R3 2002::/64 4000::1
ovn-nbctl lr-route-add R2 1.1.1.0/24 172.16.1.3
ovn-nbctl lr-route-add R3 1.1.1.0/24 172.17.1.4
ovn-nbctl lr-route-add R2 1111::/64 3001::3
ovn-nbctl lr-route-add R3 1111::/64 3002::4
ip netns add foo1
ovs-vsctl add-port br-int foo1 -- set interface foo1 type=internal
ip link set foo1 netns foo1
ip netns exec foo1 ip link set foo1 address f0:00:00:01:02:03
ip netns exec foo1 ip link set foo1 up
ip netns exec foo1 ip addr add 192.168.1.2/24 dev foo1
ip netns exec foo1 ip -6 addr add 2001::2/64 dev foo1
ip netns exec foo1 ip route add default via 192.168.1.1 dev foo1
ip netns exec foo1 ip -6 route add default via 2001::1 dev foo1
ovs-vsctl set interface foo1 external_ids:iface-id=foo1
ovn-nbctl lsp-add foo foo1 -- lsp-set-addresses foo1 "f0:00:00:01:02:03 192.168.1.2 2001::2"
ip netns add bar1
ip link add bar1 netns bar1 type veth peer name bar1_br
ip netns exec bar1 ip link set bar1 address f0:00:00:01:02:05
ip netns exec bar1 ip link set bar1 up
ip netns exec bar1 ip addr add 192.168.2.2/24 dev bar1
ip netns exec bar1 ip -6 addr add 2002::2/64 dev bar1
ip netns exec bar1 ip route add default via 192.168.2.1 dev bar1
ip netns exec bar1 ip -6 route add default via 2002::1 dev bar1
ip link set bar1_br up
ovs-vsctl add-port br-int bar1_br
ovs-vsctl set interface bar1_br external_ids:iface-id=bar1
ovn-nbctl lsp-add bar bar1 -- lsp-set-addresses bar1 "f0:00:00:01:02:05 192.168.2.2 2002::2"
ovs-vsctl add-br br_alice
ovs-vsctl add-br br_bob
ovs-vsctl set open . external-ids:ovn-bridge-mappings=net_alice:br_alice,net_bob:br_bob
ovn-nbctl lsp-add alice ln_alice
ovn-nbctl lsp-set-type ln_alice localnet
ovn-nbctl lsp-set-addresses ln_alice unknown
ovn-nbctl lsp-set-options ln_alice network_name=net_alice
ip netns add alice1
ovs-vsctl add-port br_alice alice1 -- set interface alice1 type=internal
ip link set alice1 netns alice1
ip netns exec alice1 ip link set alice1 address f0:00:00:01:02:04
ip netns exec alice1 ip link set alice1 up
ip netns exec alice1 ip addr add 172.16.1.3/24 dev alice1
ip netns exec alice1 ip -6 addr add 3001::3/64 dev alice1
ip netns exec alice1 ip route add default via 172.16.1.1 dev alice1
ip netns exec alice1 ip -6 route add default via 3001::1 dev alice1
ovn-nbctl lsp-add bob ln_bob
ovn-nbctl lsp-set-type ln_bob localnet
ovn-nbctl lsp-set-addresses ln_bob unknown
ovn-nbctl lsp-set-options ln_bob network_name=net_bob
ip netns add bob1
ip link add bob1 netns bob1 type veth peer name bob1_br
ip netns exec bob1 ip link set bob1 address f0:00:00:01:02:06
ip netns exec bob1 ip link set bob1 up
ip netns exec bob1 ip addr add 172.17.1.4/24 dev bob1
ip netns exec bob1 ip -6 addr add 3002::4/64 dev bob1
ip netns exec bob1 ip route add default via 172.17.1.1 dev bob1
ip netns exec bob1 ip -6 route add default via 3002::1 dev bob1
ip link set bob1_br up
ovs-vsctl add-port br_bob bob1_br
ip link add br_test type bridge
ip link set br_test up
ip link add a1 netns alice1 type veth peer name a1_br
ip link add b1 netns bob1 type veth peer name b1_br
ip link set a1_br master br_test
ip link set b1_br master br_test
ip link set a1_br up
ip link set b1_br up
ip netns exec alice1 ip link set a1 up
ip netns exec bob1 ip link set b1 up
ip netns exec alice1 ip addr add 1.1.1.1/24 dev a1
ip netns exec alice1 ip -6 addr add 1111::1/64 dev a1
ip netns exec bob1 ip addr add 1.1.1.2/24 dev b1
ip netns exec bob1 ip -6 addr add 1111::2/64 dev b1
ip netns exec alice1 sysctl -w net.ipv4.conf.all.forwarding=1
ip netns exec bob1 sysctl -w net.ipv4.conf.all.forwarding=1
ip netns exec alice1 sysctl -w net.ipv6.conf.all.forwarding=1
ip netns exec bob1 sysctl -w net.ipv6.conf.all.forwarding=1
ip netns add server
ip link add s1 netns server type veth peer name s1_br
ip link set s1_br master br_test
ip link set s1_br up
ip netns exec server ip link set s1 up
ip netns exec server ip addr add 1.1.1.10/24 dev s1
ip netns exec server ip route add default via 1.1.1.1 dev s1
ip netns exec server ip -6 addr add 1111::10/64 dev s1
ip netns exec server ip -6 route add default via 1111::1 dev s1
ip netns exec server sysctl -w net.ipv4.conf.all.rp_filter=0
ip netns exec server sysctl -w net.ipv4.conf.default.rp_filter=0
ovn-nbctl --ecmp-symmetric-reply lr-route-add R1 0.0.0.0/0 20.0.0.2
ovn-nbctl --ecmp-symmetric-reply lr-route-add R1 0.0.0.0/0 20.0.0.3
ovn-nbctl --ecmp-symmetric-reply lr-route-add R1 ::/0 4000::2
ovn-nbctl --ecmp-symmetric-reply lr-route-add R1 ::/0 4000::3
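For reference, the pass/fail check used in the verification above boils down to the following (port 10010 and the namespace names are the ones from the earlier comments): with --ecmp-symmetric-reply on R1's routes, bob1 should capture no return packets for these connections, while with plain --ecmp routes it does.
ip netns exec foo1 nc -l 10010 -k &
ip netns exec bob1 tcpdump -i any -w bob1.pcap &
for i in {1..10}; do ip netns exec server nc 192.168.1.2 10010 <<< h; done
for i in {1..10}; do ip netns exec server nc 2001::2 10010 <<< h; done
# Expect no output from these two reads when symmetric reply is enabled:
tcpdump -r bob1.pcap host 192.168.1.2 -nnle
tcpdump -r bob1.pcap host 2001::2 -nnle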
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3769
Description of problem:

When using ECMP on a GR to balance egress traffic from a pod to multiple external gateways, ingress return traffic may be sent to the wrong gateway. Return traffic needs to go back to the same gateway and ignore ECMP. This is especially true for connection-oriented protocols like TCP.

Consider this topology (the original ASCII diagram, simplified):

  exgw1 (1.1.1.3) --+
                    +-- OVN external sw -- OVN GR -- join sw -- ovn cluster rtr -- ovn worker sw -- pod 10.0.0.6
  exgw2 (1.1.1.4) --+

  src-ip route match 10.0.0.6, send via ecmp 1.1.1.3, 1.1.1.4

1. First case: Pod -> Egress
The pod sends traffic and it hits the GR; traffic hashed on the 5-tuple always goes to the same gw. Return traffic targets the pod, so there is no problem. This case works fine.

2. Second case: External traffic -> Pod (Ingress)
gw 1.1.1.3 routes some traffic to the GR and then to the pod. The pod then responds to this traffic; it hits the GR and could now potentially be hashed and sent to 1.1.1.4. This is the issue.
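For completeness, a minimal sketch of how this topology could be configured once the feature requested here is available (router and prefix names are illustrative; --ecmp-symmetric-reply is the flag discussed in the comments above, and --policy=src-ip is ovn-nbctl's source-based-routing option):

# Steer pod egress traffic over both external gateways via ECMP, and pin
# reply traffic to the gateway each connection arrived on.
ovn-nbctl --policy=src-ip --ecmp-symmetric-reply lr-route-add GR 10.0.0.6/32 1.1.1.3
ovn-nbctl --policy=src-ip --ecmp-symmetric-reply lr-route-add GR 10.0.0.6/32 1.1.1.4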