1849683 – [RFE] Add support for stateful next hop (ECMP bypass)


Summary: [RFE] Add support for stateful next hop (ECMP bypass)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux Fast Datapath
Classification:	Red Hat
Component:	OVN
Sub Component:
Version:	RHEL 8.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Mark Michelson
QA Contact:	Jianlin Shi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-06-22 14:00 UTC by Tim Rozet
Modified:	2020-09-16 16:01 UTC (History)
CC List:	5 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-09-16 16:01:23 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2020:3769	0	None	None	None	2020-09-16 16:01:39 UTC

Description Tim Rozet 2020-06-22 14:00:43 UTC

Description of problem:

When using ECMP on a GR to balance egress traffic from a pod across multiple external gateways, replies to ingress traffic may be sent to the wrong gateway. Reply traffic needs to go back to the gateway the connection arrived from and ignore ECMP. This is especially true for connection-oriented protocols like TCP.

Consider this topology:

                                                                                                                                                                                                            
                                                                                                                                                                                                            
                                                                                                                                                                                                            
                                                                                                                                                                                                            
+----------------+1.1.1.3                                                                                                                                                                                   
|                |                                                 +-----------------+                                                                                                                      
|    exgw1       |---\                                             |                 |                                        +-----------------+             +---------------+          +-----------------+
|                |    -------\   +-------------------+             |                 |            +--------------+            |                 |             |               |          |                 |
+----------------+            ----                   |             |                 |            |              |            |                 |-------------|  ovn worker sw| -------- |   pod 10.0.0.6  |
                                 |    OVN external sw|-------------|   OVN GR        |------------|   join sw    |------------| ovn cluster rtr |             |               |          |                 |
                              -- |                   |             |                 |            |              |            |                 |10.0.0.1     +---------------+          +-----------------+
                           --/   +-------------------+             |                 |            +--------------+            |                 |                                                           
+----------------+      --/                                        |                 |                                        +-----------------+                                                           
|                |   --/                                           +-----------------+                                                                                                                      
|    exgw2       | -/                                                                                                                                                                                       
|                | 1.1.1.4                                                                                                                                                                                  
+----------------+                                                   src ip route match                                                                                                                     
                                                                     10.0.0.6, send via ecmp                                                                                                                
                                                                     1.1.1.3, 1.1.1.4                                                                                                                       
                                                                                                                                                                                                            
                                                                                                                                                                                                            
                                                                                                                                                                                                            

1. First case: Pod -> external (egress)
The pod sends traffic, which hits the GR. The traffic is hashed on the 5-tuple, so a given connection always goes to the same gateway. Return traffic targets the pod, so there is no problem. This case works fine.

2. Second case: external -> Pod (ingress)
Gateway 1.1.1.3 routes some traffic to the GR and then to the pod. When the pod responds, the reply hits the GR, where it may be hashed and sent to 1.1.1.4 instead. This is the issue.
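
The asymmetry in case 2 can be illustrated with a small model. This is only a sketch, not OVN's real hash: pick_gateway is a hypothetical stand-in for the ECMP selection step, and the client address and port numbers are made up for illustration. The point is just that the reply's gateway depends on the 5-tuple, not on which gateway delivered the original packet.

```python
import hashlib

GATEWAYS = ["1.1.1.3", "1.1.1.4"]  # ECMP next hops from the diagram


def pick_gateway(src_ip, dst_ip, proto, src_port, dst_port):
    """Hypothetical stand-in for ECMP selection: hash the 5-tuple, pick a member."""
    key = f"{src_ip},{dst_ip},{proto},{src_port},{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return GATEWAYS[digest[0] % len(GATEWAYS)]


def reply_gateways(client_ip, client_ports):
    """Gateway chosen for the pod's replies to one client, per client port."""
    return {
        port: pick_gateway("10.0.0.6", client_ip, "tcp", 8080, port)
        for port in client_ports
    }


# Assume every one of these connections entered the GR through 1.1.1.3.
chosen = reply_gateways("8.8.8.8", range(40000, 40064))
misrouted = [port for port, gw in chosen.items() if gw != "1.1.1.3"]
print(f"{len(misrouted)} of {len(chosen)} replies would leave via 1.1.1.4")
```

Any connection that lands in misrouted is one whose reply leaves through a gateway that never saw the original packet.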

Comment 1 Mark Michelson 2020-06-30 20:12:00 UTC

I've been looking into this today, and I think I have some good foundations for this. But I want to run the idea by both the core OVN team and the OpenShift folk to make sure that everything is sound.

For external -> internal traffic

First order of business is detecting if we have received traffic from one of our configured ECMP routes and if so, which one. The only way I'm aware of to recognize that the packet came from one of our nexthop routes is based on source MAC.

Next, once we determine which of the ECMP routes we received the packet on, we need to invoke conntrack. My thought was that we could use ct_label or ct_mark to hold the ECMP group ID and ECMP route ID. In my example below, I use ct_label, but we could just as easily use ct_mark.

For internal -> external traffic

We need to establish a rule at a higher priority than typical ECMP rules that states that if ct_label holds ECMP IDs, use those instead of the usual selection method.



Let's run through a sample scenario using the diagram above. Let's say that 1.1.1.3 has MAC 00:00:01:01:01:03 and 1.1.1.4 has MAC 00:00:01:01:01:04. Our router port (gr-ext) has an IP address of 192.168.0.1 and a MAC of 00:00:00:00:00:01.

Currently, the ingress router pipeline would have flows like the following:

table=9  (lr_in_ip_routing), priority=64, match=(ip4.src == 10.0.0.6), action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = select(2, 3);)
table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 && reg8[16..31] == 2), action=(reg0 = 1.1.1.3; reg1 = 192.168.0.1; eth.src = 00:00:00:00:00:01; outport = lrp1;)
table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 && reg8[16..31] == 3), action=(reg0 = 1.1.1.4; reg1 = 192.168.0.1; eth.src = 00:00:00:00:00:01; outport = lrp1;)

In table 9, we select one of the two routes to use (I arbitrarily gave them IDs 2 and 3 here). And in table 10, we use that selected value to set the nexthop address in reg0.

Now let's add in the changes I described. Now the ingress router pipeline would look like this:

table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && eth.src == 00:00:01:01:01:03 && ip4.dst == 10.0.0.6), action=(ct_commit(ct_label=00000002/32);)
table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && eth.src == 00:00:01:01:01:04 && ip4.dst == 10.0.0.6), action=(ct_commit(ct_label=00000003/32);)
table=9 (lr_in_ip_routing), priority=100, match=(ct.rpl && ct_label[0..15] == 0 && ct_label[16..31] == 2 && ip4.src == 10.0.0.6), action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = 2;)
table=9 (lr_in_ip_routing), priority=100, match=(ct.rpl && ct_label[0..15] == 0 && ct_label[16..31] == 3 && ip4.src == 10.0.0.6), action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = 3;)
table=9  (lr_in_ip_routing), priority=64, match=(ip4.src == 10.0.0.6), action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = select(2, 3);)
table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 && reg8[16..31] == 2), action=(reg0 = 1.1.1.3; reg1 = 192.168.0.1; eth.src = 00:00:00:00:00:01; outport = lrp1;)
table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 && reg8[16..31] == 3), action=(reg0 = 1.1.1.4; reg1 = 192.168.0.1; eth.src = 00:00:00:00:00:01; outport = lrp1;)

The top four flows are new. The bottom three are identical to the previous ones. At some point prior to table 9, we need to check the source MAC and commit to conntrack if it matches one of our ECMP nexthops. The flow also matches on the ip4 destination in order to limit conntrack use to the cases where we need it. The ct_label's top four hex digits are the ECMP group ID (0) and the bottom four hex digits are the route ID (2 and 3, respectively). I put ? for the table because I'm not sure if this fits well in an existing table or if it is going to require a new one to be created. In table 9, we've added two new 100-priority flows. These match on reply traffic and set reg8 to have the same contents as ct_label.
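
As a rough model of the flows above (a sketch under simplifying assumptions, not OVN code): conntrack is reduced to a dict keyed by a direction-independent connection key, the label is kept as a (group ID, route ID) pair rather than packed bits, and the helper names and client ports are hypothetical. It shows the priority-100 reply flow taking precedence over the priority-64 select.

```python
# Conntrack entries keyed by connection; the label holds (group_id, route_id).
conntrack = {}

ROUTES = {(0, 2): "1.1.1.3", (0, 3): "1.1.1.4"}  # table 10 in the example
MAC_TO_ROUTE = {
    "00:00:01:01:01:03": (0, 2),
    "00:00:01:01:01:04": (0, 3),
}


def conn_key(ip_a, port_a, ip_b, port_b):
    """Direction-independent key so a reply finds the original entry."""
    return frozenset([(ip_a, port_a), (ip_b, port_b)])


def ingress(eth_src, src_ip, src_port, dst_ip, dst_port):
    """New external -> pod connection: commit the ECMP IDs to conntrack."""
    key = conn_key(src_ip, src_port, dst_ip, dst_port)
    if key not in conntrack and eth_src in MAC_TO_ROUTE and dst_ip == "10.0.0.6":
        conntrack[key] = MAC_TO_ROUTE[eth_src]


def egress_nexthop(src_ip, src_port, dst_ip, dst_port, ecmp_select):
    """Pod -> external: the reply flow (priority 100) beats the select (priority 64)."""
    label = conntrack.get(conn_key(src_ip, src_port, dst_ip, dst_port))
    if label is not None:
        return ROUTES[label]           # reuse the stored IDs
    return ROUTES[(0, ecmp_select())]  # normal ECMP selection


# A connection arrives from 1.1.1.3; force the ECMP select to prefer route 3.
ingress("00:00:01:01:01:03", "8.8.8.8", 40000, "10.0.0.6", 8080)
reply = egress_nexthop("10.0.0.6", 8080, "8.8.8.8", 40000, lambda: 3)
fresh = egress_nexthop("10.0.0.6", 8080, "8.8.8.8", 40001, lambda: 3)
print(reply, fresh)
```

Here reply follows the stored label back to 1.1.1.3 even though the select would have picked route 3, while fresh (a connection with no conntrack entry) falls through to ordinary ECMP.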

Does this seem like a reasonable approach?

Comment 2 Mark Michelson 2020-06-30 23:41:24 UTC

In my previous comment, all the "outport = lrp1s" should be "outport = gr-ext".

(I really wish bugzilla allowed comment editing)

Comment 3 Dumitru Ceara 2020-07-01 10:59:10 UTC

(In reply to Mark Michelson from comment #1)
> I've been looking into this today, and I think I have some good foundations
> for this. But I want to run the idea by both the core OVN team and the
> OpenShift folk to make sure that everything is sound.
> 
> For external -> internal traffic
> 
> First order of business is detecting if we have received traffic from one of
> our configured ECMP routes and if so, which one. The only way I'm aware of
> to recognize that the packet came from one of our nexthop routes is based on
> source MAC.
> 
> Next, once we determine which of the ECMP routes we received the packet on,
> we need to invoke conntrack. My thought was that we could use ct_label or
> ct_mark to hold the ECMP group ID and ECMP route ID. In my example below, I
> use ct_label, but we could just as easily use ct_mark.
> 
> For internal -> external traffic
> 
> We need to establish a rule at a higher priority than typical ECMP rules
> that states that if ct_label holds ECMP IDs, use those instead of the usual
> selection method.
> 
> 
> 
> Let's run through a sample scenario using the diagram above. Let's say that
> 1.1.1.3 has MAC 00:00:01:01:01:03 and 1.1.1.4 has MAC 00:00:01:01:01:04. Our
> router port (gr-ext) has an IP address of 192.168.0.1 and a MAC of
> 00:00:00:00:01
> 

A bit of a nit pick but gr-ext IP should be in the same subnet as the ECMP next-hops. We don't support indirect next-hops in OVN. And setting reg0 to 1.1.1.3 and reg1 to 192.168.0.1 will cause ARP requests with arp.spa=192.168.0.1 && arp.tpa==1.1.1.3 to be generated. These would most likely be dropped as invalid at 1.1.1.3.

> Currently, the ingress router pipeline would have flows like the following:
> 
> table=9  (lr_in_ip_routing), priority=64, match=(ip4.src == 10.0.0.6),
> action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] =
> select(2, 3);)
> table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 &&
> reg8[16..31] == 2), action=(reg0 = 1.1.1.3; reg1 = 192.168.0.1; eth.src =
> 00:00:00:00:00:01; outport = lrp1;)
> table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 &&
> reg8[16..31] == 3), action=(reg0 = 1.1.1.4; reg1 = 192.168.0.1; eth.src =
> 00:00:00:00:00:01; outport = lrp1;)
> 
> In table 9, we select one of the two routes to use (I arbitrarily gave them
> IDs 2 and 3 here). And in table 10, we use that selected value to set the
> nexthop address in reg0.
> 
> Now let's add in the changes I described. Now the ingress router pipeline
> would look like this:
> 

It seems to me like we miss a flow before this that would do ct() in the current zone to get the ct_state that's used below to determine if a packet is for a new session or not.

> table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && eth.src ==
> 00:00:01:01:01:03 && ip4.dst == 10.0.0.6),
> action=(ct_commit(ct_label=00000002/32);)
> table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && eth.src ==
> 00:00:01:01:01:04 && ip4.dst == 10.0.0.6),
> action=(ct_commit(ct_label=00000003/32);)

I'm a bit confused, how do we know the ETH addresses of the ECMP next-hops? These can be outside OVN (and they actually are outside in the diagram above)? Is the plan to install these flows dynamically based on MAC_Binding records?

> table=9 (lr_in_ip_routing), priority=100, match=(ct.rpl && ct_label[0..15]
> == 0 && ct_label[16..31] == 2 && ip4.src == 10.0.0.6), action=(ip.ttl--;
> flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = 2;)
> table=9 (lr_in_ip_routing), priority=100, match=(ct.rpl && ct_label[0..15]
> == 0 && ct_label[16..31] == 3 && ip4.src == 10.0.0.6), action=(ip.ttl--;
> flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = 3;)
> table=9  (lr_in_ip_routing), priority=64, match=(ip4.src == 10.0.0.6),
> action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] =
> select(2, 3);)
> table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 &&
> reg8[16..31] == 2), action=(reg0 = 1.1.1.3; reg1 = 192.168.0.1; eth.src =
> 00:00:00:00:00:01; outport = lrp1;)
> table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 &&
> reg8[16..31] == 3), action=(reg0 = 1.1.1.4; reg1 = 192.168.0.1; eth.src =
> 00:00:00:00:00:01; outport = lrp1;)
> 
> The top four flows are new. The bottom three are identical to the previous
> ones. At some point prior to table 9, we need to check the source MAC and
> commit to conntrack if it matches one of our ECMP nexthops. The flow also
> matches on the ip4 destination in order to limit conntrack use only to when
> we should need it. The ct_label's top four bits are the ECMP group ID (0)
> and the bottom four bits are the route ID (2 and 3, respectively). I put ?
> for the table because I'm not sure if this fits well in an existing table or
> if it is going to require a new one to be created. Now in table 9, we've
> added two new 100-priority flows. These will match on reply traffic and sets
> reg8 to have the same contents as ct_label.
> 
> Does this seem like a reasonable approach?

Comment 4 Mark Michelson 2020-07-01 11:40:32 UTC

(In reply to Dumitru Ceara from comment #3)
> (In reply to Mark Michelson from comment #1)
> > I've been looking into this today, and I think I have some good foundations
> > for this. But I want to run the idea by both the core OVN team and the
> > OpenShift folk to make sure that everything is sound.
> > 
> > For external -> internal traffic
> > 
> > First order of business is detecting if we have received traffic from one of
> > our configured ECMP routes and if so, which one. The only way I'm aware of
> > to recognize that the packet came from one of our nexthop routes is based on
> > source MAC.
> > 
> > Next, once we determine which of the ECMP routes we received the packet on,
> > we need to invoke conntrack. My thought was that we could use ct_label or
> > ct_mark to hold the ECMP group ID and ECMP route ID. In my example below, I
> > use ct_label, but we could just as easily use ct_mark.
> > 
> > For internal -> external traffic
> > 
> > We need to establish a rule at a higher priority than typical ECMP rules
> > that states that if ct_label holds ECMP IDs, use those instead of the usual
> > selection method.
> > 
> > 
> > 
> > Let's run through a sample scenario using the diagram above. Let's say that
> > 1.1.1.3 has MAC 00:00:01:01:01:03 and 1.1.1.4 has MAC 00:00:01:01:01:04. Our
> > router port (gr-ext) has an IP address of 192.168.0.1 and a MAC of
> > 00:00:00:00:01
> > 
> 
> A bit of a nit pick but gr-ext IP should be in the same subnet as the ECMP
> next-hops. We don't support indirect next-hops in OVN. And setting reg0 to
> 1.1.1.3 and reg1 to 192.168.0.1 will cause ARP requests with
> arp.spa=192.168.0.1 && arp.tpa==1.1.1.3 to be generated. These would most
> likely be dropped as invalid at 1.1.1.3.

Thanks for the clarification. Let's just mentally replace all occurrences of 192.168.0.1 with 1.1.1.3 instead.

> 
> > Currently, the ingress router pipeline would have flows like the following:
> > 
> > table=9  (lr_in_ip_routing), priority=64, match=(ip4.src == 10.0.0.6),
> > action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] =
> > select(2, 3);)
> > table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 &&
> > reg8[16..31] == 2), action=(reg0 = 1.1.1.3; reg1 = 192.168.0.1; eth.src =
> > 00:00:00:00:00:01; outport = lrp1;)
> > table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 &&
> > reg8[16..31] == 3), action=(reg0 = 1.1.1.4; reg1 = 192.168.0.1; eth.src =
> > 00:00:00:00:00:01; outport = lrp1;)
> > 
> > In table 9, we select one of the two routes to use (I arbitrarily gave them
> > IDs 2 and 3 here). And in table 10, we use that selected value to set the
> > nexthop address in reg0.
> > 
> > Now let's add in the changes I described. Now the ingress router pipeline
> > would look like this:
> > 
> 
> It seems to me like we miss a flow before this that would do ct() in the
> current zone to get the ct_state that's used below to determine if a packet
> is for a new session or not.

OK, that's a mistake on my part.

> 
> > table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && eth.src ==
> > 00:00:01:01:01:03 && ip4.dst == 10.0.0.6),
> > action=(ct_commit(ct_label=00000002/32);)
> > table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && eth.src ==
> > 00:00:01:01:01:04 && ip4.dst == 10.0.0.6),
> > action=(ct_commit(ct_label=00000003/32);)
> 
> I'm a bit confused, how do we know the ETH addresses of the ECMP next-hops?
> These can be outside OVN (and they actually are outside in the diagram
> above)? Is the plan to install these flows dynamically based on MAC_Binding
> records?

Yes, that's exactly what I was thinking. Is there a more foolproof way to determine whether the previous hop was one of the ECMP nexthops?

> 
> > table=9 (lr_in_ip_routing), priority=100, match=(ct.rpl && ct_label[0..15]
> > == 0 && ct_label[16..31] == 2 && ip4.src == 10.0.0.6), action=(ip.ttl--;
> > flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = 2;)
> > table=9 (lr_in_ip_routing), priority=100, match=(ct.rpl && ct_label[0..15]
> > == 0 && ct_label[16..31] == 3 && ip4.src == 10.0.0.6), action=(ip.ttl--;
> > flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = 3;)
> > table=9  (lr_in_ip_routing), priority=64, match=(ip4.src == 10.0.0.6),
> > action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] =
> > select(2, 3);)
> > table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 &&
> > reg8[16..31] == 2), action=(reg0 = 1.1.1.3; reg1 = 192.168.0.1; eth.src =
> > 00:00:00:00:00:01; outport = lrp1;)
> > table=10 (lr_in_ip_routing_ecmp), priority=100, match=(reg8[0..15] == 0 &&
> > reg8[16..31] == 3), action=(reg0 = 1.1.1.4; reg1 = 192.168.0.1; eth.src =
> > 00:00:00:00:00:01; outport = lrp1;)
> > 
> > The top four flows are new. The bottom three are identical to the previous
> > ones. At some point prior to table 9, we need to check the source MAC and
> > commit to conntrack if it matches one of our ECMP nexthops. The flow also
> > matches on the ip4 destination in order to limit conntrack use only to when
> > we should need it. The ct_label's top four bits are the ECMP group ID (0)
> > and the bottom four bits are the route ID (2 and 3, respectively). I put ?
> > for the table because I'm not sure if this fits well in an existing table or
> > if it is going to require a new one to be created. Now in table 9, we've
> > added two new 100-priority flows. These will match on reply traffic and sets
> > reg8 to have the same contents as ct_label.
> > 
> > Does this seem like a reasonable approach?

Comment 5 Dumitru Ceara 2020-07-01 12:59:17 UTC

> > 
> > > table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && eth.src ==
> > > 00:00:01:01:01:03 && ip4.dst == 10.0.0.6),
> > > action=(ct_commit(ct_label=00000002/32);)
> > > table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && eth.src ==
> > > 00:00:01:01:01:04 && ip4.dst == 10.0.0.6),
> > > action=(ct_commit(ct_label=00000003/32);)
> > 
> > I'm a bit confused, how do we know the ETH addresses of the ECMP next-hops?
> > These can be outside OVN (and they actually are outside in the diagram
> > above)? Is the plan to install these flows dynamically based on MAC_Binding
> > records?
> 
> Yes, that's exactly what I was thinking. Is there a more foolproof way to
> determine whether the previous hop was one of the ECMP nexthops?
> 

Not that I can think of, unfortunately.

Comment 6 Tim Rozet 2020-07-01 14:32:22 UTC

Is there no capability to dynamically just track every new ingress flow and label it with the incoming mac? like:
table=? (lr_in_???), priority=100, match=((ct.new && !ct.est) && in_port=gr-ext && ip4.dst == 10.0.0.0/24),action=(ct_commit(ct_label=eth.src);)

Comment 7 Mark Michelson 2020-07-02 11:51:26 UTC

Yep that would work and is much easier than any of the MAC_Binding stuff that was suggested. Dumitru POC'd the idea here to show it works: http://pastebin.test.redhat.com/880881

I'll move forward with this. Thanks!

Comment 8 Mark Michelson 2020-07-02 12:31:43 UTC

So to sum up, this is (approximately) how it's going to look now.

table=?  (lr_in_???),             priority=100,   match=((ct.new && !ct.est) && inport == gr-ext && ip4.dst == 10.0.0.6), action=(ct_commit(ct_label=eth.src);)
table=?  (lr_in_???),             priority=100,   match=((ct.new && !ct.est) && inport == gr-ext && ip4.dst == 10.0.0.6), action=(ct_commit(ct_label=eth.src);)
table=9  (lr_in_ip_routing),      priority=100,   match=(ct.rpl && ip4.src == 10.0.0.6 && ct_label[0..47] != 0),          action=(ip.ttl--; flags.loopback = 1; eth.src = 00:00:00:00:00:01; reg1 = 1.1.1.1; outport = gr-ext;)
table=9  (lr_in_ip_routing),      priority=64,    match=(ip4.src == 10.0.0.6),                                            action=(ip.ttl--; flags.loopback = 1; reg8[0..15] = 0; reg8[16..31] = select(2, 3);)
table=10 (lr_in_ip_routing_ecmp), priority=100,   match=(reg8[0..15] == 0 && reg8[16..31] == 2),                          action=(reg0 = 1.1.1.3; reg1 = 1.1.1.1; eth.src = 00:00:00:00:00:01; outport = gr-ext;)
table=10 (lr_in_ip_routing_ecmp), priority=100,   match=(reg8[0..15] == 0 && reg8[16..31] == 3),                          action=(reg0 = 1.1.1.4; reg1 = 1.1.1.1; eth.src = 00:00:00:00:00:01; outport = gr-ext;)
table=11 (lr_in_policy),          priority=65535, match=(ct.rpl && ct_label[0..47] != 0),                                 action=(next;)
table=12 (lr_in_arp_resolve),     priority=200,   match=(ct.rpl && ct_label[0..47] != 0),                                 action=(eth.dst = ct_label[0..47];)

This assumes that replying on the same route that incoming traffic was received on overrides all logical router policies. If that's not how this should work, then table 11 will need some adjusting from what I have here.
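
To make the ct_label handling concrete, here is a minimal sketch of storing eth.src in the low 48 bits of the label and reading it back as eth.dst for replies. The function names are hypothetical; only the ct_label[0..47] range comes from the flows above.

```python
MAC_MASK = (1 << 48) - 1  # corresponds to ct_label[0..47]


def mac_to_int(mac):
    """Convert aa:bb:cc:dd:ee:ff to a 48-bit integer."""
    return int(mac.replace(":", ""), 16)


def int_to_mac(value):
    """Convert a 48-bit integer back to colon-separated hex."""
    raw = f"{value:012x}"
    return ":".join(raw[i:i + 2] for i in range(0, 12, 2))


def commit_label(eth_src):
    """Models ct_commit(ct_label=eth.src) on a new ingress connection."""
    return mac_to_int(eth_src) & MAC_MASK


def reply_eth_dst(ct_label):
    """Models the lr_in_arp_resolve flow: eth.dst = ct_label[0..47]."""
    mac_bits = ct_label & MAC_MASK
    if mac_bits == 0:
        return None  # no stored MAC: fall through to normal resolution
    return int_to_mac(mac_bits)


label = commit_label("00:00:01:01:01:03")
print(reply_eth_dst(label))
```

A label of 0 means nothing was committed, which is why the flows above match on ct_label[0..47] != 0.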

Comment 9 Tim Rozet 2020-07-02 13:47:24 UTC

"This assumes that replying on the same route that incoming traffic was received on overrides all logical router policies."
At first glance I *think* that should be OK. We would still need to hit SNAT and unSNAT on the router. But usually return traffic leaving GR will be going to where it came from.

If we want to narrow the scope of the behavior we could apply it only to specific routes. So in OVN config we would tag routes with this behavior like

route src-ip 10.0.0.6 via 1.1.1.3 ecmp auto-bypass
route src-ip 10.0.0.6 via 1.1.1.4 ecmp auto-bypass
route dst-ip 8.8.8.8/32 via 1.1.1.2 auto-bypass
route dst-ip 7.7.7.7/32 via 1.1.1.2

Then we would have bypass flows that matched this for adding CT label:
table=?  (lr_in_???),             priority=100,   match=((ct.new && !ct.est) && inport == gr-ext && ip4.dst == 10.0.0.6), action=(ct_commit(ct_label=eth.src);)
table=?  (lr_in_???),             priority=100,   match=((ct.new && !ct.est) && inport == gr-ext && ip4.src == 8.8.8.8), action=(ct_commit(ct_label=eth.src);)
<no entry for 7.7.7.7>

lr_in_??? would need to happen before snat.

Comment 10 Tim Rozet 2020-07-02 20:47:11 UTC

Thinking about this some more I think it is fine to make this a global setting for the router and all return ingress traffic is bypassed.

Comment 11 Mark Michelson 2020-07-08 14:11:59 UTC

I've hit a bit of a snag here. We assumed that all ECMP routes would egress the same router port. However, this is not necessarily the case. It is a valid configuration to have ECMP routes that each egress from a different router port. By storing the source MAC of ingress traffic, we can know the destination MAC for subsequent egress packets, but that is not enough to know from which router port to source the packet. So in addition to the source MAC, we need to also store the logical router port on which the packet was received, that logical router port's IP, and that logical router port's MAC. Otherwise, I can't properly route egress reply traffic.

So the options here are as follows:

1) Restrict the use of symmetric ECMP replies to routes that all egress the same logical router port and require that logical router port to be explicitly configured in the northbound ECMP routes. With this limitation, I can use the current proposed solution.
2) Use the mac_binding for determining which hop the ingress traffic came from, and store the ECMP route ID based on this. This would require us to ensure there is a mac_binding present, which may require some extra finagling (i.e. sending ARP/ND packets).

What do you think? AFAIK, this feature is only currently requested by OpenShift, and the restriction imposed in (1) should be valid for that use case.

Comment 12 Mark Michelson 2020-07-08 14:20:22 UTC

Actually on second thought, I guess I only have to store the logical router port ID. I don't need to store its IP and MAC. I'll continue on this path and let you know how it works.
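
A sketch of what storing the port ID alongside the MAC could look like. The bit layout (MAC in bits 0..47, port key in the next 16 bits), the port keys, and the second port name gr-ext2 are all assumptions made for illustration and are not necessarily what the patch uses.

```python
MAC_BITS = 48
PORT_BITS = 16  # assumed width for the port key

# Hypothetical port keys; gr-ext2 is an invented second router port.
PORTS = {1: "gr-ext", 2: "gr-ext2"}


def pack_label(eth_src, port_key):
    """Store the ingress source MAC and the ingress router port key together."""
    if not 0 < port_key < (1 << PORT_BITS):
        raise ValueError("port key out of range")
    mac = int(eth_src.replace(":", ""), 16)
    return (port_key << MAC_BITS) | mac


def unpack_label(label):
    """Recover (MAC string, port key) from a packed label."""
    mac = label & ((1 << MAC_BITS) - 1)
    port_key = (label >> MAC_BITS) & ((1 << PORT_BITS) - 1)
    raw = f"{mac:012x}"
    return ":".join(raw[i:i + 2] for i in range(0, 12, 2)), port_key


def reply_route(label):
    """For a reply, the label gives both the egress port and the destination MAC."""
    eth_dst, port_key = unpack_label(label)
    return {"outport": PORTS[port_key], "eth.dst": eth_dst}


route = reply_route(pack_label("00:00:01:01:01:04", 2))
print(route)
```

The port's own IP and MAC do not need to be stored because they can be looked up from the port once it is known.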

Comment 13 Mark Michelson 2020-07-18 00:44:25 UTC

Just an update:

I've got code written here: https://github.com/putnopvut/ovn/tree/auto_next_hop

I've written a test in tests/system-ovn.at that exercises the new feature. However, when I run the test, I see no packets matching the flow that checks for conntrack replies.

 cookie=0x90c4bd7d, duration=6.020s, table=18, n_packets=0, n_bytes=0, idle_age=6, priority=100,ct_state=+rpl+trk,ip,metadata=0x1,nw_src=10.0.0.0/24 actions=dec_ttl(),load:0x1->NXM_NX_REG10[0],mod_dl_src:00:00:04:01:02:03,load:0x14000001->NXM_NX_XXREG0[64..95],load:0x2->NXM_NX_REG15[],resubmit(,19)

Instead, in table 18, we're hitting this:

 cookie=0x7b2618c1, duration=6.019s, table=18, n_packets=18, n_bytes=1764, idle_age=0, priority=48,ip,metadata=0x1,nw_src=10.0.0.0/24 actions=dec_ttl(),load:0x1->NXM_NX_REG10[0],load:0x1->OXM_OF_PKT_REG4[32..47],group:1                   

which is the ordinary ECMP selection flow. It appears the ct_state is not what I am expecting in this case.

Once I get this debugged and make the behavior configurable, the patch will be ready for more formal testing and code review.

Comment 14 Mark Michelson 2020-07-18 01:01:42 UTC

I figured out the problem. There was a mismatch of conntrack zones being used. conntrack was being committed in one zone but then the state was being checked in a separate zone. I've made the test scenario pass now, but the way I did it may not hold water in code review.

So the things to do now are

1) Tweak the test scenario to only succeed under proper conditions
2) Ensure that conntrack zone usage I added won't cause problems
3) Make symmetric reply behavior configurable
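
The zone mismatch can be pictured with a toy model. This is a sketch only: the Conntrack class, the zone numbers, and the connection key are invented for illustration. An entry committed in one zone is simply not visible to a lookup in another, so the reply flow never matches.

```python
from collections import defaultdict


class Conntrack:
    """Toy conntrack: each zone has its own independent connection table."""

    def __init__(self):
        self.zones = defaultdict(dict)  # zone id -> {connection key: label}

    def commit(self, zone, key, label):
        self.zones[zone][key] = label

    def lookup(self, zone, key):
        return self.zones[zone].get(key)


ct = Conntrack()
key = frozenset([("8.8.8.8", 40000), ("10.0.0.6", 8080)])

# The ingress packet commits the label in zone 1 (hypothetical zone number).
ct.commit(zone=1, key=key, label=0x000001010103)

wrong_zone = ct.lookup(zone=2, key=key)  # reply checked in a different zone
same_zone = ct.lookup(zone=1, key=key)   # reply checked in the committing zone
print(wrong_zone, hex(same_zone))
```

With wrong_zone coming back empty, the packet would fall through to the ordinary ECMP selection flow, which matches the n_packets counters seen in comment 13.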

Comment 15 Mark Michelson 2020-07-20 21:00:53 UTC

Patch posted upstream: https://patchwork.ozlabs.org/project/openvswitch/list/?series=191072

To test, you'll need to specify that ecmp routes should have symmetric replies. One way to do it is:

ovn-nbctl --ecmp-symmetric-reply lr-route-add <prefix> <nexthop>

Another way is:

ovn-nbctl create Logical_Router_Static_Route prefix=<prefix> nexthop=<nexthop> options:ecmp_symmetric_reply=true

Comment 16 Mark Michelson 2020-07-30 19:00:13 UTC

This patch has been merged into master upstream. I am now working on getting it merged into downstream so it will be available in FDP builds.

Comment 17 Mark Michelson 2020-07-30 20:37:46 UTC

The patch is now merged into the fast-datapath-next branch of ovn2.13. It will be in the next FDP release of OVN.

Comment 20 Jianlin Shi 2020-08-27 06:45:44 UTC

tested with following script:

#    foo -- R1 -- join - R2 -- alice  --   |
#           |          |                 server
#    bar ----          - R3 --- bob ----   |
#

# Start the services and expose the NB/SB databases over TCP.
systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:20.0.31.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.31.25
systemctl restart ovn-controller

# Logical routers; R2 and R3 are gateway routers bound to chassis hv1.
ovn-nbctl lr-add R1
ovn-nbctl lr-add R2
ovn-nbctl lr-add R3

ovn-nbctl set logical_router R2 options:chassis=hv1
ovn-nbctl set logical_router R3 options:chassis=hv1

# Logical switches.
ovn-nbctl ls-add foo
ovn-nbctl ls-add bar
ovn-nbctl ls-add alice
ovn-nbctl ls-add bob
ovn-nbctl ls-add join

# Connect foo and bar to R1.
ovn-nbctl lrp-add R1 foo 00:00:01:01:02:03 192.168.1.1/24 2001::1/64
ovn-nbctl lsp-add foo rp-foo -- set logical_switch_port rp-foo \
        type=router options:router-port=foo addresses=\"00:00:01:01:02:03\"

ovn-nbctl lrp-add R1 bar 00:00:01:01:02:04 192.168.2.1/24 2002::1/64
ovn-nbctl lsp-add bar rp-bar -- set Logical_Switch_Port rp-bar \
        type=router options:router-port=bar addresses=\"00:00:01:01:02:04\"

# Connect alice to R2 and bob to R3.
ovn-nbctl lrp-add R2 alice 00:00:02:01:02:03 172.16.1.1/24 3001::1/64
ovn-nbctl lsp-add alice rp-alice -- set Logical_Switch_Port rp-alice \
        type=router options:router-port=alice addresses=\"00:00:02:01:02:03\"
ovn-nbctl lrp-add R3 bob 00:00:03:01:02:03 172.17.1.1/24 3002::1/64
ovn-nbctl lsp-add bob rp-bob -- set Logical_Switch_Port rp-bob \
        type=router options:router-port=bob addresses=\"00:00:03:01:02:03\"

ovn-nbctl lrp-add R1 R1_join 00:00:04:01:02:03 20.0.0.1/24 4000::1/64
ovn-nbctl lsp-add join r1-join -- set Logical_Switch_Port r1-join \
        type=router options:router-port=R1_join addresses='"00:00:04:01:02:03"'
ovn-nbctl lrp-add R2 R2_join 00:00:04:01:02:04 20.0.0.2/24 4000::2/64
ovn-nbctl lsp-add join r2-join -- set Logical_Switch_Port r2-join \                         
        type=router options:router-port=R2_join addresses='"00:00:04:01:02:04"'
ovn-nbctl lrp-add R3 R3_join 00:00:04:01:02:05 20.0.0.3/24 4000::3/64
ovn-nbctl lsp-add join r3-join -- set Logical_Switch_Port r3-join \
        type=router options:router-port=R3_join addresses='"00:00:04:01:02:05"'
                                      
ovn-nbctl lr-route-add R2 192.168.0.0/16 20.0.0.1            
ovn-nbctl lr-route-add R3 192.168.0.0/16 20.0.0.1          
ovn-nbctl lr-route-add R2 2001::/64 4000::1                     
ovn-nbctl lr-route-add R2 2002::/64 4000::1                    
ovn-nbctl lr-route-add R3 2001::/64 4000::1            
ovn-nbctl lr-route-add R3 2002::/64 4000::1                         
                                                                
ovn-nbctl lr-route-add R2 1.1.1.0/24 172.16.1.3                                             
ovn-nbctl lr-route-add R3 1.1.1.0/24 172.17.1.4      
ovn-nbctl lr-route-add R2 1111::/64 3001::3                         
ovn-nbctl lr-route-add R3 1111::/64 3002::4 

ip netns add foo1                                                     
ovs-vsctl add-port br-int foo1 -- set interface foo1 type=internal           
ip link set foo1 netns foo1                                        
ip netns exec foo1 ip link set foo1 address f0:00:00:01:02:03   
ip netns exec foo1 ip link set foo1 up                                     
ip netns exec foo1 ip addr add 192.168.1.2/24 dev foo1                  
ip netns exec foo1 ip -6 addr add 2001::2/64 dev foo1                
ip netns exec foo1 ip route add default via  192.168.1.1 dev foo1  
ip netns exec foo1 ip -6 route add default via 2001::1 dev foo1                
ovs-vsctl set interface foo1 external_ids:iface-id=foo1              
ovn-nbctl lsp-add foo foo1 -- lsp-set-addresses foo1 "f0:00:00:01:02:03 192.168.1.2 2001::2"
                                                                               
ip netns add bar1                                                    
ip link add bar1 netns bar1 type veth peer name bar1_br            
ip netns exec bar1 ip link set bar1 address f0:00:00:01:02:05                  
ip netns exec bar1 ip link set bar1 up
ip netns exec bar1 ip addr add 192.168.2.2/24 dev bar1
ip netns exec bar1 ip -6 addr add 2002::2/64 dev bar1
ip netns exec bar1 ip route add default via 192.168.2.1 dev bar1
ip netns exec bar1 ip -6 route add default via 2002::1 dev bar1
ip link set bar1_br up
ovs-vsctl add-port br-int bar1_br
ovs-vsctl set interface bar1_br external_ids:iface-id=bar1
ovn-nbctl lsp-add bar bar1 -- lsp-set-addresses bar1 "f0:00:00:01:02:05 192.168.2.2 2002::2"

ovs-vsctl add-br br_alice
ovs-vsctl add-br br_bob
ovs-vsctl set open . external-ids:ovn-bridge-mappings=net_alice:br_alice,net_bob:br_bob

ovn-nbctl lsp-add alice ln_alice
ovn-nbctl lsp-set-type ln_alice localnet
ovn-nbctl lsp-set-addresses ln_alice unknown
ovn-nbctl lsp-set-options ln_alice network_name=net_alice

ip netns add alice1
ovs-vsctl add-port br_alice alice1 -- set interface alice1 type=internal
ip link set alice1 netns alice1
ip netns exec alice1 ip link set alice1 address f0:00:00:01:02:04
ip netns exec alice1 ip link set alice1 up
ip netns exec alice1 ip addr add 172.16.1.3/24 dev alice1
ip netns exec alice1 ip -6 addr add 3001::3/64 dev alice1
ip netns exec alice1 ip route add default via 172.16.1.1 dev alice1
ip netns exec alice1 ip -6 route add default via 3001::1 dev alice1

ovn-nbctl lsp-add bob ln_bob
ovn-nbctl lsp-set-type ln_bob localnet
ovn-nbctl lsp-set-addresses ln_bob unknown
ovn-nbctl lsp-set-options ln_bob network_name=net_bob

ip netns add bob1                                                                           
ip link add bob1 netns bob1 type veth peer name bob1_br                        
ip netns exec bob1 ip link set bob1 address f0:00:00:01:02:06        
ip netns exec bob1 ip link set bob1 up                             
ip netns exec bob1 ip addr add 172.17.1.4/24 dev bob1                          
ip netns exec bob1 ip -6 addr add 3002::4/64 dev bob1
ip netns exec bob1 ip route add default via 172.17.1.1 dev bob1
ip netns exec bob1 ip -6 route add default via 3002::1 dev bob1
ip link set bob1_br up                                          
ovs-vsctl add-port br_bob bob1_br                              
                                                       
ip link add br_test type bridge                              
ip link set br_test up                                    
                                                                                            
ip link add a1 netns alice1 type veth peer name a1_br
ip link add b1 netns bob1 type veth peer name b1_br            
ip link set a1_br master br_test                               
ip link set b1_br master br_test                                                       
ip link set a1_br up             
ip link set b1_br up            
ip netns exec alice1 ip link set a1 up  
ip netns exec bob1 ip link set b1 up        
ip netns exec alice1 ip addr add 1.1.1.1/24 dev a1       
ip netns exec alice1 ip -6 addr add 1111::1/64 dev a1
ip netns exec bob1 ip addr add 1.1.1.2/24 dev b1   
ip netns exec bob1 ip -6 addr add 1111::2/64 dev b1                     
                                
ip netns exec alice1 sysctl -w net.ipv4.conf.all.forwarding=1    
ip netns exec bob1 sysctl -w net.ipv4.conf.all.forwarding=1
ip netns exec alice1 sysctl -w net.ipv6.conf.all.forwarding=1
ip netns exec bob1 sysctl -w net.ipv6.conf.all.forwarding=1
                                                                   
ip netns add server                                                
ip link add s1 netns server type veth peer name s1_br
ip link set s1_br master br_test                   
ip link set s1_br up                  
ip netns exec server ip link set s1 up                       
ip netns exec server ip addr add 1.1.1.10/24 dev s1        
ip netns exec server ip route add default via 1.1.1.1 dev s1 
ip netns exec server ip -6 addr add 1111::10/64 dev s1     
ip netns exec server ip -6 route add default via 1111::1 dev s1
ip netns exec server sysctl -w net.ipv4.conf.all.rp_filter=0 
ip netns exec server sysctl -w net.ipv4.conf.default.rp_filter=0
                                                                                                                   
ovn-nbctl --ecmp-symmetric-reply lr-route-add R1 0.0.0.0/0 20.0.0.2
ovn-nbctl --ecmp-symmetric-reply lr-route-add R1 0.0.0.0/0 20.0.0.3              
ovn-nbctl --ecmp-symmetric-reply lr-route-add R1 ::/0 4000::2   
ovn-nbctl --ecmp-symmetric-reply lr-route-add R1 ::/0 4000::3

Tested on ovn2.13-20.06.2-2.el8fdp.x86_64:

[root@dell-per740-12 bz1849683]# ip netns exec foo1 ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
72673: foo1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether f0:00:00:01:02:03 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.2/24 scope global foo1
       valid_lft forever preferred_lft forever
    inet6 2001::2/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::f200:ff:fe01:203/64 scope link 
       valid_lft forever preferred_lft forever

[root@dell-per740-12 bz1849683]# ip netns exec server ip route list default
default via 1.1.1.1 dev s1 
[root@dell-per740-12 bz1849683]# ip netns exec server ip -6 route list default
default via 1111::1 dev s1 metric 1024 pref medium

[root@dell-per740-12 bz1849683]# ip netns exec foo1 nc -l 83219 -k &
[root@dell-per740-12 bz1849683]# ip netns exec bob1 tcpdump -i any -w bob1.pcap &
[root@dell-per740-12 bz1849683]# for i in {1..10}; do
ip netns exec server nc 2001::2 10010 <<< h; done
[root@dell-per740-12 bz1849683]# for i in {1..10}; do
ip netns exec server nc 192.168.1.2 10010 <<< h; done

[root@dell-per740-12 bz1849683]# tcpdump  -r bob1.pcap  host 2001::2 -nnle
reading from file bob1.pcap, link-type LINUX_SLL (Linux cooked v1)
dropped privs to tcpdump
02:39:50.042647  In 00:00:03:01:02:03 ethertype IPv6 (0x86dd), length 96: 2001::2.10010 > 1111::10.34782: Flags [S.], seq 3475408287, ack 3362943445, win 28560, options [mss 1440,sackOK,TS val 3312980532 ecr 872133314,nop,wscale 7], length 0
02:39:50.042673 Out 1a:4b:91:bb:db:46 ethertype IPv6 (0x86dd), length 96: 2001::2.10010 > 1111::10.34782: Flags [S.], seq 3475408287, ack 3362943445, win 28560, options [mss 1440,sackOK,TS val 3312980532 ecr 872133314,nop,wscale 7], length 0
02:39:50.042791  In 00:00:03:01:02:03 ethertype IPv6 (0x86dd), length 88: 2001::2.10010 > 1111::10.34782: Flags [.], ack 3, win 224, options [nop,nop,TS val 3312980533 ecr 872133315], length 0
02:39:50.042802 Out 1a:4b:91:bb:db:46 ethertype IPv6 (0x86dd), length 88: 2001::2.10010 > 1111::10.34782: Flags [.], ack 3, win 224, options [nop,nop,TS val 3312980533 ecr 872133315], length 0
02:39:50.042926  In 00:00:03:01:02:03 ethertype IPv6 (0x86dd), length 88: 2001::2.10010 > 1111::10.34782: Flags [F.], seq 1, ack 4, win 224, options [nop,nop,TS val 3312980533 ecr 872133315], length 0
02:39:50.042943 Out 1a:4b:91:bb:db:46 ethertype IPv6 (0x86dd), length 88: 2001::2.10010 > 1111::10.34782: Flags [F.], seq 1, ack 4, win 224, options [nop,nop,TS val 3312980533 ecr 872133315], length 0
02:39:50.203681  In 00:00:03:01:02:03 ethertype IPv6 (0x86dd), length 96: 2001::2.10010 > 1111::10.34790: Flags [S.], seq 3068695351, ack 1492396622, win 28560, options [mss 1440,sackOK,TS val 3312980693 ecr 872133475,nop,wscale 7], length 0
02:39:50.203698 Out 1a:4b:91:bb:db:46 ethertype IPv6 (0x86dd), length 96: 2001::2.10010 > 1111::10.34790: Flags [S.], seq 3068695351, ack 1492396622, win 28560, options [mss 1440,sackOK,TS val 3312980693 ecr 872133475,nop,wscale 7], length 0
02:39:50.203815  In 00:00:03:01:02:03 ethertype IPv6 (0x86dd), length 88: 2001::2.10010 > 1111::10.34790: Flags [.], ack 3, win 224, options [nop,nop,TS val 3312980694 ecr 872133476], length 0
02:39:50.203821 Out 1a:4b:91:bb:db:46 ethertype IPv6 (0x86dd), length 88: 2001::2.10010 > 1111::10.34790: Flags [.], ack 3, win 224, options [nop,nop,TS val 3312980694 ecr 872133476], length 0
02:39:50.203990  In 00:00:03:01:02:03 ethertype IPv6 (0x86dd), length 88: 2001::2.10010 > 1111::10.34790: Flags [F.], seq 1, ack 4, win 224, options [nop,nop,TS val 3312980694 ecr 872133476], length 0
02:39:50.204004 Out 1a:4b:91:bb:db:46 ethertype IPv6 (0x86dd), length 88: 2001::2.10010 > 1111::10.34790: Flags [F.], seq 1, ack 4, win 224, options [nop,nop,TS val 3312980694 ecr 872133476], length 0

<=== bob1 still receives packets from foo1 (2001::2)

[root@dell-per740-12 bz1849683]# tcpdump  -r bob1.pcap  host 192.168.1.2 -nnle
reading from file bob1.pcap, link-type LINUX_SLL (Linux cooked v1)                                                                                                                                         
dropped privs to tcpdump
02:40:07.042605  In 00:00:03:01:02:03 ethertype IPv4 (0x0800), length 76: 192.168.1.2.10010 > 1.1.1.10.38468: Flags [S.], seq 1582010048, ack 4256554733, win 28960, options [mss 1460,sackOK,TS val 3813709470 ecr 2021325724,nop,wscale 7], length 0
02:40:07.042626 Out 1a:4b:91:bb:db:46 ethertype IPv4 (0x0800), length 76: 192.168.1.2.10010 > 1.1.1.10.38468: Flags [S.], seq 1582010048, ack 4256554733, win 28960, options [mss 1460,sackOK,TS val 3813709470 ecr 2021325724,nop,wscale 7], length 0
02:40:07.042730  In 00:00:03:01:02:03 ethertype IPv4 (0x0800), length 68: 192.168.1.2.10010 > 1.1.1.10.38468: Flags [.], ack 3, win 227, options [nop,nop,TS val 3813709471 ecr 2021325726], length 0      
02:40:07.042738 Out 1a:4b:91:bb:db:46 ethertype IPv4 (0x0800), length 68: 192.168.1.2.10010 > 1.1.1.10.38468: Flags [.], ack 3, win 227, options [nop,nop,TS val 3813709471 ecr 2021325726], length 0      
02:40:07.042845  In 00:00:03:01:02:03 ethertype IPv4 (0x0800), length 68: 192.168.1.2.10010 > 1.1.1.10.38468: Flags [F.], seq 1, ack 4, win 227, options [nop,nop,TS val 3813709471 ecr 2021325726], length 0
02:40:07.042859 Out 1a:4b:91:bb:db:46 ethertype IPv4 (0x0800), length 68: 192.168.1.2.10010 > 1.1.1.10.38468: Flags [F.], seq 1, ack 4, win 227, options [nop,nop,TS val 3813709471 ecr 2021325726], length 0
...

<=== bob1 still receives IPv4 packets from foo1 (192.168.1.2)

The default IPv4 route on server is via 1.1.1.1 (alice1), so an nc connection to foo1 (192.168.1.2) goes through alice1 and then R2 (20.0.0.2). The return packets should therefore also go through R2 and then alice1, and bob1 should not receive them.
The same applies to the IPv6 packets.

Mark, what do you think? Is anything wrong?
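
The expectation above can be checked mechanically instead of reading the capture by eye. The following is only a sketch and was not part of the original test: count_from is a hypothetical helper that reads the text output of "tcpdump -nn" on stdin and counts the packets whose source address matches. If symmetric reply is working, the count for foo1's addresses in bob1.pcap should be 0.

```shell
# count_from ADDR: hypothetical helper, not part of the original script.
# Reads `tcpdump -nn` text output on stdin and prints how many packets
# have ADDR as their source, i.e. lines of the form "ADDR.PORT > ...".
count_from() {
    awk -v src="$1" '
        {
            for (i = 1; i < NF; i++) {
                # The source field is the one right before the ">" token.
                if ($(i + 1) == ">" && index($i, src ".") == 1 &&
                    substr($i, length(src) + 2) ~ /^[0-9]+$/) {
                    n++
                    break
                }
            }
        }
        END { print n + 0 }'
}
```

Example use against the capture from this comment: "tcpdump -r bob1.pcap -nn 2>/dev/null | count_from 192.168.1.2", and likewise with 2001::2 for IPv6.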

Comment 21 Jianlin Shi 2020-08-27 06:46:40 UTC

Packages used for comment 20:

[root@dell-per740-12 bz1849683]# uname -a
Linux dell-per740-12.rhts.eng.pek2.redhat.com 4.18.0-232.el8.x86_64 #1 SMP Mon Aug 10 06:55:47 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
[root@dell-per740-12 bz1849683]# rpm -qa | grep -E "openvswitch|ovn"
ovn2.13-20.06.2-2.el8fdp.x86_64
ovn2.13-central-20.06.2-2.el8fdp.x86_64
openvswitch2.13-2.13.0-54.el8fdp.x86_64
kernel-kernel-networking-openvswitch-ovn-common-1.0-7.noarch
python3-openvswitch2.13-2.13.0-54.el8fdp.x86_64
ovn2.13-host-20.06.2-2.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch

Comment 22 Jianlin Shi 2020-08-31 14:49:07 UTC

> [root@dell-per740-12 bz1849683]# ip netns exec server ip route list default
> default via 1.1.1.1 dev s1 
> [root@dell-per740-12 bz1849683]# ip netns exec server ip -6 route list
> default
> default via 1111::1 dev s1 metric 1024 pref medium
> 
> [root@dell-per740-12 bz1849683]# ip netns exec foo1 nc -l 83219 -k &

<=== Correction: the command actually run here was "ip netns exec foo1 nc -l 10010 -k &"

Comment 23 Numan Siddique 2020-08-31 19:18:50 UTC

Hi Jianlin shi,

Please add the line below to your rep.sh:

ovn-nbctl set logical_router R1 options:chassis=hv1


Symmetric ECMP reply is only usable on gateway routers, so you need to bind R1 to a chassis.

Thanks
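
As comment 20 showed, the --ecmp-symmetric-reply routes are accepted on a router without a chassis, but replies are then not pinned to the ingress gateway. A quick pre-check can catch this before running the test. This is a minimal sketch under stated assumptions: is_gateway_router and the NBCTL override variable are hypothetical names introduced here, and the check relies only on the generic "ovn-nbctl get" database command.

```shell
# is_gateway_router ROUTER: hypothetical helper, not part of rep.sh.
# Succeeds when options:chassis is set on the logical router, which is
# what makes it a gateway router. NBCTL is an assumed override hook for
# dry runs; it defaults to the real ovn-nbctl.
is_gateway_router() {
    # `get` fails when the key is absent, so a failure means "not set".
    chassis=$("${NBCTL:-ovn-nbctl}" get Logical_Router "$1" options:chassis 2>/dev/null) || return 1
    [ -n "$chassis" ]
}
```

In the script this could be used as: is_gateway_router R1 || ovn-nbctl set logical_router R1 options:chassis=hv1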

Comment 24 Jianlin Shi 2020-09-01 01:51:18 UTC

(In reply to Numan Siddique from comment #23)
> Hi Jianlin shi,
> 
> Please add the below in your rep.sh
> 
> ovn-nbctl set logical_router R1 options:chassis=hv1
> 
> 
> Symmetric ECMP reply is only usable on gateway routers. And hence you need
> to set R1 to a chassis.
> 
> Thanks

It works after adding that setting: no packets are received on bob1.

When the routes are instead added as plain ECMP routes with:
ovn-nbctl --ecmp lr-route-add R1 0.0.0.0/0 20.0.0.2
ovn-nbctl --ecmp lr-route-add R1 0.0.0.0/0 20.0.0.3              
ovn-nbctl --ecmp lr-route-add R1 ::/0 4000::2   
ovn-nbctl --ecmp lr-route-add R1 ::/0 4000::3

bob1 does receive packets, so --ecmp-symmetric-reply works.
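
The comparison above can be repeated by swapping the default routes on R1 between the two flags. The following is a rough sketch, not the procedure actually used for verification: readd_default_routes and the NBCTL override variable are hypothetical, and it assumes that lr-route-del with only a prefix removes every route for that prefix (worth confirming on the installed ovn-nbctl version).

```shell
# readd_default_routes FLAG: hypothetical helper for the A/B comparison.
# FLAG is either --ecmp or --ecmp-symmetric-reply. NBCTL is an assumed
# override hook for dry runs; it defaults to the real ovn-nbctl.
readd_default_routes() {
    flag=$1
    nbctl=${NBCTL:-ovn-nbctl}
    # Drop the existing default routes on R1 (assumed: all next hops).
    "$nbctl" --if-exists lr-route-del R1 0.0.0.0/0
    "$nbctl" --if-exists lr-route-del R1 ::/0
    # Re-add both next hops for each address family with the chosen flag.
    for nexthop in 20.0.0.2 20.0.0.3; do
        "$nbctl" "$flag" lr-route-add R1 0.0.0.0/0 "$nexthop"
    done
    for nexthop in 4000::2 4000::3; do
        "$nbctl" "$flag" lr-route-add R1 ::/0 "$nexthop"
    done
}
```

Running "readd_default_routes --ecmp" and then "readd_default_routes --ecmp-symmetric-reply", with a fresh capture on bob1 after each, reproduces the two cases described in this comment.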

Comment 25 Jianlin Shi 2020-09-01 03:28:41 UTC

Verified on both the RHEL 7 and RHEL 8 versions.

The complete script is:

#    foo -- R1 -- join - R2 -- alice  --   |                                           
#           |          |                 server                 
#    bar ----          - R3 --- bob ----   |                               
#                                       
                                                                     
systemctl start openvswitch                                           
systemctl start ovn-northd                                                   
ovn-nbctl set-connection ptcp:6641                                 
ovn-sbctl set-connection ptcp:6642                                      
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:20.0.50.26:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.50.26
systemctl restart ovn-controller                                 
                                                                     
ovn-nbctl lr-add R1                                                
ovn-nbctl lr-add R2                                                                         
ovn-nbctl lr-add R3                                                  
                                                                   
ovn-nbctl set logical_router R1 options:chassis=hv1                            
ovn-nbctl set logical_router R2 options:chassis=hv1                  
ovn-nbctl set logical_router R3 options:chassis=hv1                
                                                                               
ovn-nbctl ls-add foo                                       
ovn-nbctl ls-add bar                                            
ovn-nbctl ls-add alice                                         
ovn-nbctl ls-add bob                                   
ovn-nbctl ls-add join                                        
                                                          
ovn-nbctl lrp-add R1 foo 00:00:01:01:02:03 192.168.1.1/24 2001::1/64                        
ovn-nbctl lsp-add foo rp-foo -- set logical_switch_port rp-foo \
        type=router options:router-port=foo addresses=\"00:00:01:01:02:03\"
                                                               
ovn-nbctl lrp-add R1 bar 00:00:01:01:02:04 192.168.2.1/24 2002::1/64                   
ovn-nbctl lsp-add bar rp-bar -- set Logical_Switch_Port rp-bar \
        type=router options:router-port=bar addresses=\"00:00:01:01:02:04\"
                                                            
ovn-nbctl lrp-add R2 alice 00:00:02:01:02:03 172.16.1.1/24 3001::1/64
ovn-nbctl lsp-add alice rp-alice -- set Logical_Switch_Port rp-alice \
        type=router options:router-port=alice addresses=\"00:00:02:01:02:03\"
ovn-nbctl lrp-add R3 bob 00:00:03:01:02:03 172.17.1.1/24 3002::1/64 
ovn-nbctl lsp-add bob rp-bob -- set Logical_Switch_Port rp-bob \        
        type=router options:router-port=bob addresses=\"00:00:03:01:02:03\"
                                                                 
ovn-nbctl lrp-add R1 R1_join 00:00:04:01:02:03 20.0.0.1/24 4000::1/64
ovn-nbctl lsp-add join r1-join -- set Logical_Switch_Port r1-join \
        type=router options:router-port=R1_join addresses='"00:00:04:01:02:03"'             
ovn-nbctl lrp-add R2 R2_join 00:00:04:01:02:04 20.0.0.2/24 4000::2/64
ovn-nbctl lsp-add join r2-join -- set Logical_Switch_Port r2-join \
        type=router options:router-port=R2_join addresses='"00:00:04:01:02:04"'
ovn-nbctl lrp-add R3 R3_join 00:00:04:01:02:05 20.0.0.3/24 4000::3/64
ovn-nbctl lsp-add join r3-join -- set Logical_Switch_Port r3-join \
        type=router options:router-port=R3_join addresses='"00:00:04:01:02:05"'

ovn-nbctl lr-route-add R2 192.168.0.0/16 20.0.0.1               
ovn-nbctl lr-route-add R3 192.168.0.0/16 20.0.0.1              
ovn-nbctl lr-route-add R2 2001::/64 4000::1            
ovn-nbctl lr-route-add R2 2002::/64 4000::1                  
ovn-nbctl lr-route-add R3 2001::/64 4000::1               
ovn-nbctl lr-route-add R3 2002::/64 4000::1                                                 
                                                                
ovn-nbctl lr-route-add R2 1.1.1.0/24 172.16.1.3                            
ovn-nbctl lr-route-add R3 1.1.1.0/24 172.17.1.4                
ovn-nbctl lr-route-add R2 1111::/64 3001::3                                            
ovn-nbctl lr-route-add R3 1111::/64 3002::4                     
                                                                           
ip netns add foo1                                           
ovs-vsctl add-port br-int foo1 -- set interface foo1 type=internal   
ip link set foo1 netns foo1                                           
ip netns exec foo1 ip link set foo1 address f0:00:00:01:02:03                
ip netns exec foo1 ip link set foo1 up                              
ip netns exec foo1 ip addr add 192.168.1.2/24 dev foo1                  
ip netns exec foo1 ip -6 addr add 2001::2/64 dev foo1                      
ip netns exec foo1 ip route add default via  192.168.1.1 dev foo1
ip netns exec foo1 ip -6 route add default via 2001::1 dev foo1      
ovs-vsctl set interface foo1 external_ids:iface-id=foo1            
ovn-nbctl lsp-add foo foo1 -- lsp-set-addresses foo1 "f0:00:00:01:02:03 192.168.1.2 2001::2"
                                                                     
ip netns add bar1                                                  
ip link add bar1 netns bar1 type veth peer name bar1_br                        
ip netns exec bar1 ip link set bar1 address f0:00:00:01:02:05        
ip netns exec bar1 ip link set bar1 up                             
ip netns exec bar1 ip addr add 192.168.2.2/24 dev bar1                         
ip netns exec bar1 ip -6 addr add 2002::2/64 dev bar1
ip netns exec bar1 ip route add default via 192.168.2.1 dev bar1
ip netns exec bar1 ip -6 route add default via 2002::1 dev bar1
ip link set bar1_br up                                 
ovs-vsctl add-port br-int bar1_br                            
ovs-vsctl set interface bar1_br external_ids:iface-id=bar1
ovn-nbctl lsp-add bar bar1 -- lsp-set-addresses bar1 "f0:00:00:01:02:05 192.168.2.2 2002::2"
                                                     
ovs-vsctl add-br br_alice                                      
ovs-vsctl add-br br_bob                                        
ovs-vsctl set open . external-ids:ovn-bridge-mappings=net_alice:br_alice,net_bob:br_bob
                                 
ovn-nbctl lsp-add alice ln_alice
ovn-nbctl lsp-set-type ln_alice localnet
ovn-nbctl lsp-set-addresses ln_alice unknown
ovn-nbctl lsp-set-options ln_alice network_name=net_alice

ip netns add alice1                                                 
ovs-vsctl add-port br_alice alice1 -- set interface alice1 type=internal
ip link set alice1 netns alice1                                            
ip netns exec alice1 ip link set alice1 address f0:00:00:01:02:04
ip netns exec alice1 ip link set alice1 up                           
ip netns exec alice1 ip addr add 172.16.1.3/24 dev alice1          
ip netns exec alice1 ip -6 addr add 3001::3/64 dev alice1                                   
ip netns exec alice1 ip route add default via 172.16.1.1 dev alice1  
ip netns exec alice1 ip -6 route add default via 3001::1 dev alice1
                                                                               
ovn-nbctl lsp-add bob ln_bob                                         
ovn-nbctl lsp-set-type ln_bob localnet                             
ovn-nbctl lsp-set-addresses ln_bob unknown                                     
ovn-nbctl lsp-set-options ln_bob network_name=net_bob
                                                                
ip netns add bob1                                              
ip link add bob1 netns bob1 type veth peer name bob1_br
ip netns exec bob1 ip link set bob1 address f0:00:00:01:02:06
ip netns exec bob1 ip link set bob1 up                    
ip netns exec bob1 ip addr add 172.17.1.4/24 dev bob1                                       
ip netns exec bob1 ip -6 addr add 3002::4/64 dev bob1
ip netns exec bob1 ip route add default via 172.17.1.1 dev bob1
ip netns exec bob1 ip -6 route add default via 3002::1 dev bob1
ip link set bob1_br up                                                                 
ovs-vsctl add-port br_bob bob1_br
                                
ip link add br_test type bridge         
ip link set br_test up                      
                                                         
ip link add a1 netns alice1 type veth peer name a1_br
ip link add b1 netns bob1 type veth peer name b1_br
ip link set a1_br master br_test                                        
ip link set b1_br master br_test
ip link set a1_br up                                             
ip link set b1_br up                      
ip netns exec alice1 ip link set a1 up                   
ip netns exec bob1 ip link set b1 up                     
ip netns exec alice1 ip addr add 1.1.1.1/24 dev a1                 
ip netns exec alice1 ip -6 addr add 1111::1/64 dev a1              
ip netns exec bob1 ip addr add 1.1.1.2/24 dev b1
ip netns exec bob1 ip -6 addr add 1111::2/64 dev b1
                                      
ip netns exec alice1 sysctl -w net.ipv4.conf.all.forwarding=1
ip netns exec bob1 sysctl -w net.ipv4.conf.all.forwarding=1
ip netns exec alice1 sysctl -w net.ipv6.conf.all.forwarding=1
ip netns exec bob1 sysctl -w net.ipv6.conf.all.forwarding=1

ip netns add server                             
ip link add s1 netns server type veth peer name s1_br
ip link set s1_br master br_test      
ip link set s1_br up                                         
ip netns exec server ip link set s1 up                     
ip netns exec server ip addr add 1.1.1.10/24 dev s1          
ip netns exec server ip route add default via 1.1.1.1 dev s1
ip netns exec server ip -6 addr add 1111::10/64 dev s1 
ip netns exec server ip -6 route add default via 1111::1 dev s1
ip netns exec server sysctl -w net.ipv4.conf.all.rp_filter=0
ip netns exec server sysctl -w net.ipv4.conf.default.rp_filter=0
                    
ovn-nbctl --ecmp-symmetric-reply lr-route-add R1 0.0.0.0/0 20.0.0.2
ovn-nbctl --ecmp-symmetric-reply lr-route-add R1 0.0.0.0/0 20.0.0.3
ovn-nbctl --ecmp-symmetric-reply lr-route-add R1 ::/0 4000::2
ovn-nbctl --ecmp-symmetric-reply lr-route-add R1 ::/0 4000::3

Comment 27 errata-xmlrpc 2020-09-16 16:01:23 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3769
