Description of problem:
Cannot ping the floating IP until the router is pinged.

Version-Release number of selected component (if applicable):

How reproducible:
Every time.

Steps to Reproduce:
1. If present, delete the OC instance/node + floating IP + router:
   openstack server delete small_test01
   openstack router remove subnet r1 $(openstack router show r1 -f json -c interfaces_info | jq -r .interfaces_info[0].subnet_id)
   openstack router delete r1
2. Create a new node. nova create can be used; I will add a link to the script used, as there are several commands involved in creating the instance (the ping test occurs in the script as well, which is where the script/test is failing). A rough sketch of these steps follows below.
3. Try to ping the floating IP.

Actual results:
Cannot ping the floating IP.

Expected results:
Can ping the floating IP.

Additional info:
If the router is pinged first, then pinging the floating IP works.
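For reference, here is a minimal sketch of the create/ping steps (steps 2 and 3). The network, subnet, image, and flavor names are placeholders, not the exact values from the script mentioned above:

  # Recreate the router and attach the tenant subnet
  openstack router create r1
  openstack router set r1 --external-gateway <external-net>
  openstack router add subnet r1 <private-subnet>

  # Boot the instance and attach a floating IP from the provider network
  openstack server create --image <image> --flavor <flavor> --network <private-net> small_test01
  openstack floating ip create <external-net>
  openstack server add floating ip small_test01 <floating-ip>

  # Step 3: ping the floating IP from the external host
  ping -c 4 <floating-ip>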
I investigated this issue with Rodolfo today. We created a new instance and router according to the description of this BZ, and we reproduced the issue. The FIP was centralized. The router's gateway was on compute-3 and the VM was on compute-0 in our case. When we pinged the FIP, ICMP requests were arriving properly at the VM and it was replying. The problem is that the ICMP reply was lost somewhere in br-int and never made it back to compute-3. Rodolfo is investigating the OF rules on compute-0 now, but it looks like an OVN issue rather than a Neutron one.
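For anyone retracing this, a rough sketch of the kind of checks we ran on the compute node (interface names, ports, and addresses below are illustrative, not the actual values):

  # On compute-0: confirm ICMP requests reach the VM tap and replies leave it
  tcpdump -nei tap<port-id> icmp

  # Dump the br-int OpenFlow tables to see where the reply is handled
  ovs-ofctl dump-flows br-int

  # Trace a synthetic ICMP reply through br-int
  ovs-appctl ofproto/trace br-int 'in_port=<vm-ofport>,icmp,dl_src=<vm-mac>,dl_dst=<router-mac>,nw_src=<vm-ip>,nw_dst=<external-src-ip>'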
After some more investigation we pinged the router's external port from the undercloud and then, as described in the bug description, ping to the FIP started working fine.
We compared the OF rules on compute-0 when it wasn't working and when it was working fine. The only difference was 2 additional OF rules present when it was working fine:

cookie=0xc1ee105f, duration=95.663s, table=66, n_packets=93, n_bytes=9114, idle_age=0, priority=100,reg0=0xa0000fe,reg15=0x2,metadata=0x3 actions=mod_dl_dst:f2:ec:a5:6f:4e:6c,load:0x1->NXM_NX_REG10[6]
cookie=0xc1ee105f, duration=95.663s, table=67, n_packets=0, n_bytes=0, idle_age=95, priority=100,arp,reg0=0xa0000fe,reg14=0x2,metadata=0x3,dl_src=f2:ec:a5:6f:4e:6c actions=load:0x1->NXM_NX_REG10[6]

It seems that the ICMP reply was hitting the rule from table 66. I don't know what exactly the MAC f2:ec:a5:6f:4e:6c is - it's definitely nothing related to Neutron directly. I'm moving this BZ to OVN for now for investigation there, as it looks like an OVN issue to me.
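For the record, a sketch of how the before/after flow comparison can be done (file names are illustrative):

  # On compute-0: capture flows while the FIP is unreachable
  ovs-ofctl dump-flows br-int > flows-before.txt
  # ... ping the router's external gateway address from the undercloud ...
  ovs-ofctl dump-flows br-int > flows-after.txt
  diff flows-before.txt flows-after.txt

  # Inspect only the two tables that gained rules
  ovs-ofctl dump-flows br-int table=66
  ovs-ofctl dump-flows br-int table=67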
One more thing. The OVN version which we used is:

[root@controller-0 /]# rpm -qa | grep ovn
ovn22.03-22.03.0-69.el9fdp.x86_64
rhosp-ovn-22.03-5.el9ost.noarch
ovn22.03-central-22.03.0-69.el9fdp.x86_64
rhosp-ovn-central-22.03-5.el9ost.noarch

And the OVS version:

[root@controller-0 heat-admin]# rpm -qa | grep openvswitch
openvswitch-selinux-extra-policy-1.0-31.el9fdp.noarch
openvswitch2.17-2.17.0-32.1.el9fdp.x86_64
openstack-network-scripts-openvswitch2.17-10.11.1-3.el9ost.x86_64
rhosp-network-scripts-openvswitch-2.17-5.el9ost.noarch
rhosp-openvswitch-2.17-5.el9ost.noarch
Hi, I'm doing some triage of this issue for the OVN team. I have some questions about the nature of the network setup here. First, is the ping going from one VM to another on the overlay, or is the ping originating externally and coming into the network via a gateway router? If it's two VMs pinging each other, are they both attached to the same logical router? Finally, in order to properly reproduce/fix this issue, we will need the northbound database from the cluster where you see the failure occur.
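If it helps, one way to grab the NB database (the socket paths below are the usual OVN defaults and may differ on your deployment, e.g. inside a container):

  # On a controller node: binary backup of the running NB database
  ovsdb-client backup unix:/var/run/ovn/ovnnb_db.sock > ovnnb_db.backup

  # Or a human-readable dump of the logical topology
  ovn-nbctl --db=unix:/var/run/ovn/ovnnb_db.sock show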
@mmichels The ping is going from one VM to another. I am assuming they are both attached to the same logical router. I am not sure where to get the database. I am hoping @hjensas will be able to answer these questions definitively. Thank you!
(In reply to Mark Michelson from comment #9)
> Hi, I'm doing some triage of this issue for the OVN team. I have some
> questions about the nature of the network setup here. First is the ping
> going from one VM to another on the overlay, or is the ping originating
> externally and coming into the network via a gateway router? If it's two VMs
> pinging each other, are they both attached to the same logical router?

The ping is not between OpenStack instances. The ping is from an external source; the source is L2 connected to the provider network where the floating IP is allocated.

> Finally, in order to properly reproduce/fix this issue, we will need the
> northbound database from the cluster where you see the failure occur.

There is a running reproducer if you would like to troubleshoot this on a live system; see comment 4 for details.

I reproduced the issue again, and used the script from the solution article[1] to get the OVN db content. In my case the router gateway was on compute-2, and the instance was running on compute-4. I captured the OVN db content both prior to pinging the router external gateway address and after pinging it, on both nodes (compute-2 and compute-4). I will upload the file to the BZ.

ovn-db-content-RHBZ2119194
├── compute-2
│   ├── compute-2-post-pinging-router-external-gateway-ovn-db-content.txt
│   └── compute-2-pre-pinging-router-external-gateway-ovn-db-content.txt
└── compute-4
    ├── compute-4-post-pinging-router-external-gateway-ovn-db-content.txt
    └── compute-4-pre-pinging-router-external-gateway-ovn-db-content.txt

[1] https://access.redhat.com/solutions/3776401
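For completeness, a sketch of the kind of per-node capture performed before and after the workaround ping (the exact commands live in the script from [1]; the commands and file naming below are assumptions):

  # On each compute node (compute-2 and compute-4), before the workaround:
  ovs-vsctl show > $(hostname)-pre-pinging-router-external-gateway-ovn-db-content.txt
  ovs-ofctl dump-flows br-int >> $(hostname)-pre-pinging-router-external-gateway-ovn-db-content.txt
  ovsdb-client dump unix:/var/run/openvswitch/db.sock >> $(hostname)-pre-pinging-router-external-gateway-ovn-db-content.txt
  # ... ping the router external gateway, then repeat into the *-post-* file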
After some back and forth we decided to move this to the OvS team, as the problem is beyond my scope of expertise. I'll keep the priority at medium because there is a workaround.
Well, I finally got it... The problem affects all packets that need slow-path actions and need to egress an IPv6 tunnel. A patch, including a reproducer, was sent upstream: https://patchwork.ozlabs.org/project/openvswitch/list/?series=331619
The fix was accepted upstream and backported all the way down to OVS 2.13. We will pick this up automatically in the next FDP release. Closing the BZ for now.