Bug 2064704

Summary: Wrong addition of VIPs to all logical router pods leading to triggering GARP on different locations
Product: Red Hat OpenStack Reporter: Luis Tomas Bolivar <ltomasbo>
Component: openstack-neutronAssignee: Luis Tomas Bolivar <ltomasbo>
Status: CLOSED ERRATA QA Contact: Maor <mblue>
Severity: high Docs Contact:
Priority: high    
Version: 17.0 (Wallaby)CC: chrisw, ctrautma, dceara, jiji, jishi, lorenzo.bianconi, mmichels, mtomaska, nusiddiq, scohen
Target Milestone: gaKeywords: Triaged
Target Release: 17.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-neutron-18.2.1-0.20220317031925.ca458d9.el9ost Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 2053013
: 2064706 (view as bug list) Environment:
Last Closed: 2022-09-21 12:19:39 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2053013    
Bug Blocks: 2064706, 2064709    

Description Luis Tomas Bolivar 2022-03-16 11:46:44 UTC
When nat-addresses=router option is used, GARP are sent for all NAT and loadbalancer VIPs. However, there is a flag named "exclude-lb-vips-from-garp"[1], that could be added to the options on the router gateway port, ensuring no GARPs are sent for the load balancer VIPs.

[1] https://patchwork.ozlabs.org/project/ovn/patch/183edfc446633c5c38d7d7361089d34432c527dd.1645793899.git.lorenzo.bianconi@redhat.com/


+++ This bug was initially created as a clone of Bug #2053013 +++

Description of problem:
When a loadbalancer is created in an OSP tenant network (VIP and members), and that tenant networks is connected to a router, which in turns is connected to the provider network, the ovn loadbalancer gets associated to the ovn logical router. This includes also the cr-lrp port (patch port connecting the router and the provider network, in OSP world, the router gateway port), and it can be seen by the entry on the nat_address of that port, which includes the VIP of the loadbalalancer.

This may cause problems as that means ovn-controller will send GARPs for that (internal, tenant network) IP. There is nothing blocking different tenants in OSP to create a subnet with the same CIDR and then a loadbalancer with the same VIP. If that is the case, there may be several ovn-controllers generating GARPs on the provider network for the same IP, each one with the MAC of the logical router port belonging to each user. This could be a problem for the physical network infrastructure.


Steps to Reproduce:
1. Create a router in OSP and attach it to the provider network
2. Create a tenant network/subnet and connect it to the router
3. Create a Load Balancer in OSP, with the VIP in  that tenant network

Actual results:
Check the VIP of the loadbalancer is on the OVN SB Port_Binding table, at the nat_addresses of the patch port connecting the router to the provider network:

datapath            : e3a0a334-9a02-41c7-a64d-6ea747839808
external_ids        : {"neutron:cidrs"="172.24.100.181/24 2001:db8::f816:3eff:fe77:7f9c/64", "neutron:device_id"="335cd008-216f-4571-a685-b0de5a7ffe50", "neutron:device_owner"="network:router_gateway", "neutron:network_name"=neutron-d923b3db-500d-4241-95be-c3869c72b36a, "neutron:port_name"="", "neutron:project_id"="", "neutron:revision_number"="6", "neutron:security_group_ids"=""}                                                                                                                            
logical_port        : "add962d2-21ab-4733-b6ef-35538eff25a8"
mac                 : [router]
nat_addresses       : ["fa:16:3e:77:7f:9c 172.24.100.181 is_chassis_resident(\"cr-lrp-add962d2-21ab-4733-b6ef-35538eff25a8\")", "fa:16:3e:77:7f:9c 172.24.100.229 *20.0.0.98* 172.24.100.112 is_chassis_resident(\"cr-lrp-add962d2-21ab-4733-b6ef-35538eff25a8\")"]
options             : {peer=lrp-add962d2-21ab-4733-b6ef-35538eff25a8}
parent_port         : []
tag                 : []
tunnel_key          : 4
type                : patch
up                  : false
virtual_parent      : []


Expected results:
In the example, ip 20.0.0.98 should not be there as that belongs to an IP in a tenant network that should not be advertized (GARP) in the provider network.

--- Additional comment from Luis Tomas Bolivar on 2022-02-11 09:28:03 UTC ---

Adding one extra data point, this not only can affect physical network infrastructure, but it is also a security breach on OCP on OSP with Kuryr, as LoadBalancer VIPs (belonging to services of type ClusterIP) that are meant to not be externally reachable, could be reached from the provider network

--- Additional comment from Mark Michelson on 2022-02-11 15:45:43 UTC ---

Hi, I'm not 100% sure this is an OVN bug, and that this might be a configuration issue.

In the northbound database, on logical switch ports, you can set options:nat-addresses in order to determine which connected logical router NAT addresses should be used in GARPs that OVN sends. You can configure options:nat-addresses in two different ways.

Way 1: Explicitly list all of the NAT addresses of connected router ports that you want to advertise.
Way 2: Use the keyword "router" in order to have OVN automatically find all connected router ports and advertise all NAT addresses on those routers.

The key here is that both of these ways require the CMS to set an option on the logical switch port. OVN will not do anything without CMS input.

I suspect that OpenStack is using Way 2, and that is why all connected logical router NAT addresses are being taken into account. What probably should be done instead is to switch to Way 1. This way, you can customize which addresses OVN sends GARPs for.

--- Additional comment from Luis Tomas Bolivar on 2022-02-11 16:16:20 UTC ---

(In reply to Mark Michelson from comment #2)
> Hi, I'm not 100% sure this is an OVN bug, and that this might be a
> configuration issue.
> 
> In the northbound database, on logical switch ports, you can set
> options:nat-addresses in order to determine which connected logical router
> NAT addresses should be used in GARPs that OVN sends. You can configure
> options:nat-addresses in two different ways.
> 
> Way 1: Explicitly list all of the NAT addresses of connected router ports
> that you want to advertise.
> Way 2: Use the keyword "router" in order to have OVN automatically find all
> connected router ports and advertise all NAT addresses on those routers.
> 
> The key here is that both of these ways require the CMS to set an option on
> the logical switch port. OVN will not do anything without CMS input.
> 
> I suspect that OpenStack is using Way 2, and that is why all connected
> logical router NAT addresses are being taken into account. What probably
> should be done instead is to switch to Way 1. This way, you can customize
> which addresses OVN sends GARPs for.

It seems you are right and neutron is using the second way for the gateway ports: https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/commands.py#L408.

Stil, I'm reading this https://mail.openvswitch.org/pipermail/ovs-dev/2017-February/328286.html, and not completely clear the behaviour is the expected one. There it is stated than when router is used:
+                Gratuitous ARPs will be sent for all SNAT and DNAT external IP
+                addresses and for all load balancer IP addresses defined on the
+                <ref column="options" key="router-port"/>'s logical router,
+                using the <ref column="options" key="router-port"/>'s MAC
+                address.

Why using the SNAT/DNAT for the VMs/logical_ports, but not using it for the load balancers IP and exposing them directly? Perhaps what is "wrong" is to consider the load balancers here, right? In the same way as VMs without a SNAT/DNAT entry are not exposed, the same should happen for the loadbalancers that do not have an SNAT/DNAT entry

--- Additional comment from Mark Michelson on 2022-02-22 15:44:24 UTC ---

To ovn-northd, load balancers are DNATs. In many cases, OVN treats a load balancer VIP the same as a DNAT external address. There is no reason to configure a load balancer's VIP as a DNAT as well, since this is redundant. This is why they are advertised in GARPs if you've set options:nat-addresses=router on the logical switch port. ovn-northd has no reason to assume that load balancers on the connected routers should not be advertised.

As I mentioned in my previous comment, there is a workaround that will get the behavior you want. In the interest of being more user friendly, we could add new keyword options for options:nat-addresses so that we can advertise only DNAT/SNAT addresses, or only load balancer VIPs on the connected router. But that falls into RFE territory rather than a bug fix.

--- Additional comment from lorenzo bianconi on 2022-02-24 17:11:56 UTC ---

this bz seems a duplicated of bz2054394

--- Additional comment from Numan Siddique on 2022-02-25 00:54:28 UTC ---

@Luis - Is the load balancer associated to provider logical switch too ?

If so,  then I think ovn octavia driver should not do that.

Either way, I think it is almost duplicate of bz2054394 and Lorenzo's patch could address this issue.

--- Additional comment from Luis Tomas Bolivar on 2022-02-25 07:21:00 UTC ---

(In reply to Numan Siddique from comment #6)
> @Luis - Is the load balancer associated to provider logical switch too ?
> 
> If so,  then I think ovn octavia driver should not do that.

I can check that out and ensure ovn-octavia driver does not add the loadbalancer to the logical switch from the provider. This will still work if there is a FIP associated to that loadbalancer, right?

That said, based on conversation with Mark, it seems the culprit is using the router keyword at the nat-addresses option

> 
> Either way, I think it is almost duplicate of bz2054394 and Lorenzo's patch
> could address this issue.

Yep, it looks the same. At least the solution proposed is similar to what Mark proposed in this one: "we could add new keyword options for options:nat-addresses so that we can advertise only DNAT/SNAT addresses, or only load balancer VIPs on the connected router. But that falls into RFE territory rather than a bug fix."

To me, for either OSP, as well as for OVN-Kubernetes this looks like a security issue. So, I hope it can be backported. Services of type "ClusterIP" are not supposed to be reachable from outside the cluster, and due to having the GARP sent for their VIP, that is not the case.

--- Additional comment from errata-xmlrpc on 2022-03-01 21:11:41 UTC ---

This bug has been added to advisory RHBA-2022:89290 by Mark Michelson (mmichels)

--- Additional comment from errata-xmlrpc on 2022-03-01 21:11:45 UTC ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2022:89290-01
https://errata.devel.redhat.com/advisory/89290

--- Additional comment from Jianlin Shi on 2022-03-02 02:12:27 UTC ---

Hi Mark,

How is this bug solved? with the RFE bz2054394 or any other way?
if it is solved by bz2054394, then we can set this bug as duplicated, how do you think?

--- Additional comment from Mark Michelson on 2022-03-03 13:46:58 UTC ---

Yes, this is solved by https://bugzilla.redhat.com/show_bug.cgi?id=2054394, so you can mark this as a duplicate if you would like.

--- Additional comment from Jianlin Shi on 2022-03-04 01:36:46 UTC ---

Comment 5 Maor 2022-05-17 14:25:31 UTC
Verified on 'RHOS-17.0-RHEL-9-20220414.n.1' with 'openstack-neutron-18.2.1-0.20220317031925.ca458d9.el9ost'.

The first verification was done according to reproducing steps mentioned on comment 0.
The second verification was done in similar way to reproducing steps mentioned in ovn core BZ [1].

First Verification:

1) Create a router in OSP and attach it to the provider network.
(overcloud) [stack@undercloud-0 ~]$ . ~/overcloudrc && openstack router create router_A
(overcloud) [stack@undercloud-0 ~]$ openstack router set router_A --external-gateway public

2) Create a tenant network/subnet and connect it to the router.
(overcloud) [stack@undercloud-0 ~]$ openstack network create net_A
(overcloud) [stack@undercloud-0 ~]$ openstack subnet create subnet_A --network net_A --subnet-range 192.168.3.0/24
(overcloud) [stack@undercloud-0 ~]$ openstack router add subnet router_A subnet_A

3) Create a Load Balancer in OSP, with the VIP in that tenant network
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer create --vip-subnet-id subnet_A --enable

4) Verify Load Balancer IP is not advertised outside in the provider network, according to patch port mentioned in comment 0. 
[heat-admin@controller-0 ~]$ export SBDB=$(sudo ovs-vsctl get open . external_ids:ovn-remote | sed -e 's/\"//g')
[heat-admin@controller-0 ~]$ alias ovn-sbctl="sudo podman exec ovn_controller ovn-sbctl --db=$SBDB"
[heat-admin@controller-0 ~]$ ovn-sbctl list Port_Binding

In 'nat_addresses' output, verified no IP address related to load balancer:
...
_uuid               : dc0fe975-4a5f-414c-850b-6cf2279977a1
chassis             : []
datapath            : 53816f1b-aa7f-403e-9539-366bf878038e
encap               : []
external_ids        : {"neutron:cidrs"="10.0.0.198/24 2620:52:0:13b8::1000:3e/64", "neutron:device_id"="56f46dcb-251b-4a35-b8ae-e0951e57ebbc", "neutron:device_owner"="net
work:router_gateway", "neutron:network_name"=neutron-dfbcd78b-4c1c-4ee4-b477-e7fddb884403, "neutron:port_name"="", "neutron:project_id"="", "neutron:revision_number"="4", "neutron:security_group_ids"=""}
gateway_chassis     : []
ha_chassis_group    : []
logical_port        : "e3780ea1-17a8-4040-9b97-8315ac51cce6"
mac                 : [router]
nat_addresses       : ["fa:16:3e:2d:6a:d1 10.0.0.198 is_chassis_resident(\"cr-lrp-e3780ea1-17a8-4040-9b97-8315ac51cce6\")"]
options             : {peer=lrp-e3780ea1-17a8-4040-9b97-8315ac51cce6}
parent_port         : []
requested_chassis   : []
tag                 : []
tunnel_key          : 3
type                : patch
up                  : false
virtual_parent      : []
...


Second Verification:

Captured ARP packets on all interfaces of gateway controller (controller scheduled with highest priority), capture done similarly to tcpdump capture verification in core ovn BZ (comments 9-12) [1].
There was at least 1 minute of capture before and after load balancer creation.
All ARP packets captured didn't contain IP addresses related to internal network/subnet of load balancer.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2054394#c9

Comment 10 errata-xmlrpc 2022-09-21 12:19:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543