Description of problem:
As soon as a HM is added to the LB pool, the backend servers (members) change the MAC in their ARP table for the ovn-metadata port, pointing it to the one used by the HM to do the health checks (the NB_Global svc_monitor_mac; this is implemented in [1]).

Some additional context about the issue: for every backend IP in the load balancer for which a health check is configured, a new row is created in the Service_Monitor table, and according to that row ovn-controller periodically sends out the service monitor packets.

[1] https://github.com/ovn-org/ovn/blob/main/northd/ovn-northd.8.xml#L1431

How reproducible:
100%

Steps to Reproduce:
1. Create a LB with some backend members attached to the pool.
2. Create a HM for the LB pool (see the sketch after these steps).
3. Try to communicate from/to the backend members through the ovn-metadata port.
4. Note that a new VM will also be unable to use the ovn-metadata port (e.g. one that includes a bash script as init script).

Actual results:
Backend servers lose communication over the ovn-metadata port.

Expected results:
Backend servers don't lose communication over the ovn-metadata port, and HM health checks work.
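A minimal reproduction sketch with the OpenStack CLI; the resource names (lb_ovn, pool_ovn, hm_ovn) and the <subnet> placeholder are illustrative, the member addresses and HM settings match the verification output below, and the commands assume the OVN provider driver:

(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer create --name lb_ovn --provider ovn --vip-subnet-id <subnet>
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer pool create --name pool_ovn --loadbalancer lb_ovn --protocol TCP --lb-algorithm SOURCE_IP_PORT
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer member create --name tcp_member1 --address 10.0.64.47 --protocol-port 8080 pool_ovn
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer member create --name tcp_member2 --address 10.0.64.56 --protocol-port 8080 pool_ovn
# Adding the HM is the step that triggers the ARP change on the members:
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer healthmonitor create --name hm_ovn --type TCP --delay 10 --timeout 5 --max-retries 4 --max-retries-down 3 pool_ovn

To observe the problem, compare the member's ARP entry for the metadata port IP against the svc_monitor_mac (run ovn-nbctl wherever it can reach the NB DB; the key may be unset, in which case ovn-northd generates a random one):

$ ovn-nbctl get NB_Global . options:svc_monitor_mac
# then, inside a member VM, the metadata IP resolves to that MAC
# instead of the ovn-metadata port MAC:
$ arp -an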
Using the puddle RHOS-17.1-RHEL-9-20230426.n.1, I ran the following commands:

(overcloud) [stack@undercloud-0 ~]$ openstack port list | grep hm
| d7e6c687-6118-4d36-b764-ef8407a61dbb | ovn-lb-hm-576dfdfb-e8ea-4188-9b81-79b96472a3fb | fa:16:3e:aa:20:71 | ip_address='10.0.64.3', subnet_id='576dfdfb-e8ea-4188-9b81-79b96472a3fb' | DOWN |

We can see that the ovn-lb-hm port exists and uses ip_address='10.0.64.3', which should be the source IP the health monitor uses for each member.

Some details about the LB members:

(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer status show lb_ovn
...
"members": [
  {
    "id": "7cd7ebe8-f73c-4a2a-a22f-2b44bd4b8c06",
    "name": "tcp_member1",
    "operating_status": "ONLINE",
    "provisioning_status": "ACTIVE",
    "address": "10.0.64.47",
    "protocol_port": 8080
  },
  {
    "id": "73e2dd4c-de26-4a87-8b3e-892d0c6f9b09",
    "name": "tcp_member2",
    "operating_status": "ONLINE",
    "provisioning_status": "ACTIVE",
    "address": "10.0.64.56",
    "protocol_port": 8080
  }
]

(overcloud) [stack@undercloud-0 ~]$ ssh tripleo-admin@controller-0.ctlplane
Warning: Permanently added 'controller-0.ctlplane' (ED25519) to the list of known hosts.
Register this system with Red Hat Insights: insights-client --register
Create an account or view all your systems at https://red.ht/insights-dashboard
Last login: Fri May 5 08:32:35 2023 from 192.168.24.1
[tripleo-admin@controller-0 ~]$ sudo bash
[root@controller-0 tripleo-admin]# podman exec -it -uroot ovn_controller ovn-sbctl list Service_Monitor
_uuid        : edfacd21-5a89-40ab-ab01-ac8adb0fc39a
external_ids : {}
ip           : "10.0.64.47"
logical_port : "f44a701b-9376-4a89-b544-57eca790b79c"
options      : {failure_count="3", interval="10", success_count="4", timeout="5"}
port         : 8080
protocol     : tcp
src_ip       : "10.0.64.3"
src_mac      : "16:ed:b6:15:9c:6a"
status       : online

_uuid        : 305c3adf-a42e-4852-844c-8032aca7a8e1
external_ids : {}
ip           : "10.0.64.56"
logical_port : "ef1b7570-4de5-41fc-88b2-6c4f5b033269"
options      : {failure_count="3", interval="10", success_count="4", timeout="5"}
port         : 8080
protocol     : tcp
src_ip       : "10.0.64.3"
src_mac      : "16:ed:b6:15:9c:6a"
status       : online

We can see that both members use the 10.0.64.3 source IP (src_ip), and that the member IP addresses match the ones we got from the loadbalancer status show command.

Communication via the ovn-metadata port is also possible with this fix:

(overcloud) [stack@undercloud-0 ~]$ ssh tripleo-admin@compute-0.ctlplane
Warning: Permanently added 'compute-0.ctlplane' (ED25519) to the list of known hosts.
Register this system with Red Hat Insights: insights-client --register
Create an account or view all your systems at https://red.ht/insights-dashboard
Last login: Fri May 5 08:46:40 2023 from 192.168.24.1
[tripleo-admin@compute-0 ~]$ ip netns
ovnmeta-89249e30-ff7c-4748-8279-39c5b8c21a09 (id: 0)
[tripleo-admin@compute-0 ~]$ sudo ip netns exec ovnmeta-89249e30-ff7c-4748-8279-39c5b8c21a09 ssh cirros@10.0.64.47
cirros@10.0.64.47's password:
$ date
Fri May 5 09:52:39 UTC 2023

The ssh connection was executed successfully via the ovn-metadata port. The BZ looks good to me, and I am moving its status to verified.
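As a final cross-check (a sketch, assuming the hm_ovn health monitor name from the reproduction steps above), the Service_Monitor options appear to map one-to-one to the Octavia health monitor settings:

(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer healthmonitor show hm_ovn -c delay -c timeout -c max_retries -c max_retries_down
# delay=10           -> options:interval
# timeout=5          -> options:timeout
# max_retries=4      -> options:success_count
# max_retries_down=3 -> options:failure_count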
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577