Bug 2169349 - [ovn provider] Avoid use of ovn-metadata-port for HM health check packets
Summary: [ovn provider] Avoid use of ovn-metadata-port for HM health check packets
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-ovn-octavia-provider
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ga
Target Release: 17.1
Assignee: Fernando Royo
QA Contact: Omer Schwartz
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-02-13 11:39 UTC by Fernando Royo
Modified: 2023-08-16 01:14 UTC (History)
5 users

Fixed In Version: python-ovn-octavia-provider-1.0.3-1.20230223161047.82a4691.el9ost
Doc Type: Bug Fix
Doc Text:
Before this update, instances lost communication with the ovn-metadata-port because the load balancer health monitor replied to ARP requests for the OVN metadata agent's IP, causing requests destined for the metadata agent to be sent to a different MAC address. With this update, the ovn-controller performs back-end health checks from a dedicated port instead of the ovn-metadata-port. When creating a health monitor for a load balancer pool, ensure that an IP address is available in the load balancer VIP's subnet. This port is distinct for each subnet, and health monitors in the same subnet can reuse it. Health monitor checks no longer impact ovn-metadata-port communication for instances.
Clone Of:
Environment:
Last Closed: 2023-08-16 01:13:48 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 873426 0 None MERGED Avoid use of ovn metadata port IP for HM checks 2023-02-23 11:44:53 UTC
Red Hat Issue Tracker OSP-22285 0 None None None 2023-02-13 11:40:58 UTC
Red Hat Product Errata RHEA-2023:4577 0 None None None 2023-08-16 01:14:11 UTC

Description Fernando Royo 2023-02-13 11:39:32 UTC
Description of problem:

As soon as an HM is added to the LB pool, the backend servers (members) update the MAC for the ovn-metadata-port in their ARP tables, pointing it to the one used by the HM for the health checks (the NB_Global svc_monitor_mac); this is implemented in [1].

Some additional context about the issue: for every backend IP in the load balancer for which a health check is configured, a new row is created in the Service_Monitor table, and based on that, ovn-controller periodically sends out the service monitor packets.


[1] https://github.com/ovn-org/ovn/blob/main/northd/ovn-northd.8.xml#L1431
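The relevant pieces can be inspected directly on a deployment; the following is only an illustrative sketch (where each command runs -- OVN database host vs. backend member -- and the metadata port IP are assumptions that depend on the environment):

# MAC used by ovn-controller for the health check probes (published in NB_Global options)
ovn-nbctl get NB_Global . options:svc_monitor_mac
# One row per backend IP for which a health check is configured
ovn-sbctl list Service_Monitor
# On an affected member, the ARP entry for the metadata port IP (e.g. 10.0.64.2, assumed)
# shows the svc_monitor_mac instead of the ovn-metadata-port MAC
ip neigh show | grep 10.0.64.2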



How reproducible:
100%

Steps to Reproduce:
1. Create a LB with some backend members attached to the pool
2. Create an HM for the LB pool (a CLI sketch of steps 1-2 follows this list)
3. Try to communicate from/to the backend members through the ovn-metadata-port.
4. Also, a new VM will not be able to use the ovn-metadata-port (e.g. include a bash script as ini
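A minimal CLI sketch of steps 1-2 (the pool name, HM name and VIP subnet are illustrative placeholders; lb_ovn, the member address/port and the HM timings are taken from the verification output later in this report):

openstack loadbalancer create --name lb_ovn --provider ovn --vip-subnet-id private-subnet
openstack loadbalancer pool create --name pool1 --loadbalancer lb_ovn --protocol TCP --lb-algorithm SOURCE_IP_PORT
openstack loadbalancer member create --name tcp_member1 --address 10.0.64.47 --protocol-port 8080 pool1
openstack loadbalancer healthmonitor create --name hm1 --type TCP --delay 10 --timeout 5 --max-retries 4 pool1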

Actual results:
Backend servers lose communication over the ovn-metadata-port.

Expected results:
Backend servers do not lose communication over the ovn-metadata-port, and HM health checks work.

Comment 4 Omer Schwartz 2023-05-05 09:00:17 UTC
Using the puddle RHOS-17.1-RHEL-9-20230426.n.1, I ran the following commands:
(overcloud) [stack@undercloud-0 ~]$ openstack port list | grep hm
| d7e6c687-6118-4d36-b764-ef8407a61dbb | ovn-lb-hm-576dfdfb-e8ea-4188-9b81-79b96472a3fb               | fa:16:3e:aa:20:71 | ip_address='10.0.64.3', subnet_id='576dfdfb-e8ea-4188-9b81-79b96472a3fb'                            | DOWN   |

We can see that the ovn-lb-hm port exists and uses ip_address='10.0.64.3', which should be the source ip the health monitor uses for each member.
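As a purely illustrative extra check (not part of the original verification), the dedicated port could also be inspected directly; the column selection below is just one way to display it:

openstack port show ovn-lb-hm-576dfdfb-e8ea-4188-9b81-79b96472a3fb -c fixed_ips -c mac_address -c status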


Some details about the LB members:
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer status show lb_ovn
...
"members": [
                            {
                                "id": "7cd7ebe8-f73c-4a2a-a22f-2b44bd4b8c06",
                                "name": "tcp_member1",
                                "operating_status": "ONLINE",
                                "provisioning_status": "ACTIVE",
                                "address": "10.0.64.47",
                                "protocol_port": 8080
                            },
                            {
                                "id": "73e2dd4c-de26-4a87-8b3e-892d0c6f9b09",
                                "name": "tcp_member2",
                                "operating_status": "ONLINE",
                                "provisioning_status": "ACTIVE",
                                "address": "10.0.64.56",
                                "protocol_port": 8080
                            }
                        ]


(overcloud) [stack@undercloud-0 ~]$ ssh tripleo-admin@controller-0.ctlplane
Warning: Permanently added 'controller-0.ctlplane' (ED25519) to the list of known hosts.
Register this system with Red Hat Insights: insights-client --register
Create an account or view all your systems at https://red.ht/insights-dashboard
Last login: Fri May  5 08:32:35 2023 from 192.168.24.1
[tripleo-admin@controller-0 ~]$ sudo bash
[root@controller-0 tripleo-admin]# podman exec -it -uroot ovn_controller ovn-sbctl list Service_Monitor
_uuid               : edfacd21-5a89-40ab-ab01-ac8adb0fc39a
external_ids        : {}
ip                  : "10.0.64.47"
logical_port        : "f44a701b-9376-4a89-b544-57eca790b79c"
options             : {failure_count="3", interval="10", success_count="4", timeout="5"}
port                : 8080
protocol            : tcp
src_ip              : "10.0.64.3"
src_mac             : "16:ed:b6:15:9c:6a"
status              : online

_uuid               : 305c3adf-a42e-4852-844c-8032aca7a8e1
external_ids        : {}
ip                  : "10.0.64.56"
logical_port        : "ef1b7570-4de5-41fc-88b2-6c4f5b033269"
options             : {failure_count="3", interval="10", success_count="4", timeout="5"}
port                : 8080
protocol            : tcp
src_ip              : "10.0.64.3"
src_mac             : "16:ed:b6:15:9c:6a"
status              : online

We can see that both members use the 10.0.64.3 source IP (src_ip), and that the member IP addresses match the ones we got from the loadbalancer status show command.
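As an optional cross-check (illustrative only; <network> is a placeholder, and the device_owner value may be network:distributed or network:dhcp depending on the neutron version), the ovn-metadata-port keeps its own MAC, distinct from the src_mac shown above:

openstack port list --network <network> --device-owner network:distributed -c ID -c "MAC Address" -c "Fixed IP Addresses"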



Communication via metadata-port is also possible with this fix:

(overcloud) [stack@undercloud-0 ~]$ ssh tripleo-admin@compute-0.ctlplane
Warning: Permanently added 'compute-0.ctlplane' (ED25519) to the list of known hosts.
Register this system with Red Hat Insights: insights-client --register
Create an account or view all your systems at https://red.ht/insights-dashboard
Last login: Fri May  5 08:46:40 2023 from 192.168.24.1
[tripleo-admin@compute-0 ~]$ ip net
ovnmeta-89249e30-ff7c-4748-8279-39c5b8c21a09 (id: 0)
[tripleo-admin@compute-0 ~]$ sudo ip net e ovnmeta-89249e30-ff7c-4748-8279-39c5b8c21a09 ssh cirros.64.47
cirros.64.47's password: 
$ date
Fri May  5 09:52:39 UTC 2023
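From inside the member VM, metadata reachability could also be confirmed directly (an illustrative extra step, not part of the verification above, assuming the guest image ships curl):

$ curl http://169.254.169.254/openstack/latest/meta_data.json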

The ssh connection was executed successfully via the ovn-metadata-port.
The BZ looks good to me and I am moving its status to verified.

Comment 15 errata-xmlrpc 2023-08-16 01:13:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577

