Bug 2169349

Summary: [ovn provider] Avoid use of ovn-metadata-port for HM health check packets
Product: Red Hat OpenStack Reporter: Fernando Royo <froyo>
Component: python-ovn-octavia-provider Assignee: Fernando Royo <froyo>
Status: CLOSED ERRATA QA Contact: Omer Schwartz <oschwart>
Severity: high Docs Contact:
Priority: high    
Version: 17.0 (Wallaby) CC: gthiemon, jamsmith, mburns, mdemaced, oschwart
Target Milestone: ga Keywords: Triaged
Target Release: 17.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: python-ovn-octavia-provider-1.0.3-1.20230223161047.82a4691.el9ost Doc Type: Bug Fix
Doc Text:
Before this update, instances lost communication with the ovn-metadata-port because the load balancer health monitor replied to ARP requests for the OVN metadata agent's IP, causing requests destined for the metadata agent to be sent to a different MAC address. With this update, the ovn-controller performs back-end checks by using a dedicated port instead of the ovn-metadata-port. When creating a health monitor for a load balancer pool, ensure that there is an available IP in the VIP load balancer's subnet. This port is distinct for each subnet, and different health monitors in the same subnet can reuse it. Health monitor checks no longer impact ovn-metadata-port communication for instances.
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-08-16 01:13:48 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Fernando Royo 2023-02-13 11:39:32 UTC
Description of problem:

As soon as a HM is added to the LB pool, the backend servers (members) update the MAC for the ovn-metadata-port in their ARP tables, pointing it to the one used by the HM to do the health checks (the NB_Global svc_monitor_mac); this is implemented in [1].

Some additional context about the issue: for every backend IP in the load balancer for which a health check is configured, a new row is created in the Service_Monitor table, and based on that, ovn-controller periodically sends out the service monitor packets.


[1] https://github.com/ovn-org/ovn/blob/main/northd/ovn-northd.8.xml#L1431
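
To make the mechanism easier to inspect, a minimal sketch (assuming the TripleO container name ovn_controller used later in this report, and a Linux guest with iproute2 or net-tools inside the member):

podman exec -it ovn_controller ovn-sbctl list Service_Monitor   # one row per monitored backend IP/port
# Before the fix, src_ip in these rows is the ovn-metadata-port IP, and OVN answers ARP for that
# src_ip with the global svc_monitor_mac (the src_mac column), so members cache the wrong MAC.
# From inside an affected member, the neighbour entry for the metadata port IP shows that MAC:
ip neigh   # or: arp -n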



How reproducible:
100%

Steps to Reproduce:
1. Create a LB with some backend members attached to the pool
2. Create a HM for the LB pool (an illustrative command sketch follows these steps)
3. Try to communicate from/to the backend members through the ovn-metadata-port.
4. Also a new VM will not be able to use the ovn-metadata-port (e.g. include a bash script as ini
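
An illustrative version of steps 1-2 for the OVN provider; names are placeholders, and the member addresses and health monitor timings are only chosen to match the values seen later in this report:

openstack loadbalancer create --name lb_ovn --provider ovn --vip-subnet-id <subnet-id>
openstack loadbalancer pool create --name pool1 --loadbalancer lb_ovn --protocol TCP --lb-algorithm SOURCE_IP_PORT
openstack loadbalancer member create --name tcp_member1 --address 10.0.64.47 --protocol-port 8080 pool1
openstack loadbalancer member create --name tcp_member2 --address 10.0.64.56 --protocol-port 8080 pool1
# creating the HM is what adds the Service_Monitor rows and triggers the issue
openstack loadbalancer healthmonitor create --name hm1 --type TCP --delay 10 --timeout 5 --max-retries 4 --max-retries-down 3 pool1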

Actual results:
Backend servers lose communication over the ovn-metadata-port.

Expected results:
Backend servers don't lose communication over the ovn-metadata-port and HM health checks work.

Comment 4 Omer Schwartz 2023-05-05 09:00:17 UTC
Using the puddle RHOS-17.1-RHEL-9-20230426.n.1, I ran the following commands:
(overcloud) [stack@undercloud-0 ~]$ openstack port list | grep hm
| d7e6c687-6118-4d36-b764-ef8407a61dbb | ovn-lb-hm-576dfdfb-e8ea-4188-9b81-79b96472a3fb               | fa:16:3e:aa:20:71 | ip_address='10.0.64.3', subnet_id='576dfdfb-e8ea-4188-9b81-79b96472a3fb'                            | DOWN   |

We can see that the ovn-lb-hm port exists and uses ip_address='10.0.64.3', which should be the source IP the health monitor uses to check each member.
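
As an optional cross-check (not part of the original verification), the port details can be shown directly using the ID from the listing above:

openstack port show d7e6c687-6118-4d36-b764-ef8407a61dbb -c fixed_ips -c mac_address -c device_owner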


Some details about the LB members:
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer status show lb_ovn
...
"members": [
                            {
                                "id": "7cd7ebe8-f73c-4a2a-a22f-2b44bd4b8c06",
                                "name": "tcp_member1",
                                "operating_status": "ONLINE",
                                "provisioning_status": "ACTIVE",
                                "address": "10.0.64.47",
                                "protocol_port": 8080
                            },
                            {
                                "id": "73e2dd4c-de26-4a87-8b3e-892d0c6f9b09",
                                "name": "tcp_member2",
                                "operating_status": "ONLINE",
                                "provisioning_status": "ACTIVE",
                                "address": "10.0.64.56",
                                "protocol_port": 8080
                            }
                        ]


(overcloud) [stack@undercloud-0 ~]$ ssh tripleo-admin
Warning: Permanently added 'controller-0.ctlplane' (ED25519) to the list of known hosts.
Register this system with Red Hat Insights: insights-client --register
Create an account or view all your systems at https://red.ht/insights-dashboard
Last login: Fri May  5 08:32:35 2023 from 192.168.24.1
[tripleo-admin@controller-0 ~]$ sudo bash
[root@controller-0 tripleo-admin]# podman exec -it -uroot ovn_controller ovn-sbctl list Service_Monitor
_uuid               : edfacd21-5a89-40ab-ab01-ac8adb0fc39a
external_ids        : {}
ip                  : "10.0.64.47"
logical_port        : "f44a701b-9376-4a89-b544-57eca790b79c"
options             : {failure_count="3", interval="10", success_count="4", timeout="5"}
port                : 8080
protocol            : tcp
src_ip              : "10.0.64.3"
src_mac             : "16:ed:b6:15:9c:6a"
status              : online

_uuid               : 305c3adf-a42e-4852-844c-8032aca7a8e1
external_ids        : {}
ip                  : "10.0.64.56"
logical_port        : "ef1b7570-4de5-41fc-88b2-6c4f5b033269"
options             : {failure_count="3", interval="10", success_count="4", timeout="5"}
port                : 8080
protocol            : tcp
src_ip              : "10.0.64.3"
src_mac             : "16:ed:b6:15:9c:6a"
status              : online

We can see that both members use the 10.0.64.3 source IP (src_ip), and also that the member IP addresses match the ones we got from the loadbalancer status show command.
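
A quick additional cross-check (the grep pattern and the expectation below are mine, not part of the original verification): the probe source IP should belong to the dedicated ovn-lb-hm port, not to the ovn-metadata-port of the subnet.

openstack port list | grep 10.0.64.
# expected: ovn-lb-hm-<subnet-id> owns 10.0.64.3 (the src_ip above), while the metadata port
# keeps its own, different IP and MAC, so member ARP entries for it are no longer affected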



Communication via the ovn-metadata-port is also possible with this fix:

(overcloud) [stack@undercloud-0 ~]$ ssh tripleo-admin
Warning: Permanently added 'compute-0.ctlplane' (ED25519) to the list of known hosts.
Register this system with Red Hat Insights: insights-client --register
Create an account or view all your systems at https://red.ht/insights-dashboard
Last login: Fri May  5 08:46:40 2023 from 192.168.24.1
[tripleo-admin@compute-0 ~]$ ip net
ovnmeta-89249e30-ff7c-4748-8279-39c5b8c21a09 (id: 0)
[tripleo-admin@compute-0 ~]$ sudo ip net e ovnmeta-89249e30-ff7c-4748-8279-39c5b8c21a09 ssh cirros.64.47
cirros.64.47's password: 
$ date
Fri May  5 09:52:39 UTC 2023

The ssh connection was executed successfully via the ovn-metadata-port.
The BZ looks good to me and I am moving its status to verified.

Comment 15 errata-xmlrpc 2023-08-16 01:13:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577