Bug 2119194

Summary: Cannot ping new instance floating ip until router is pinged
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Jason Paroly <jparoly>
Component: openvswitch2.17
Assignee: Eelco Chaudron <echaudro>
Status: CLOSED NEXTRELEASE
QA Contact: ovs-qe
Severity: unspecified
Docs Contact:
Priority: medium
Version: FDP 22.F
CC: amusil, astupnik, chrisw, ctrautma, echaudro, fleitner, froyo, hewang, hjensas, jhsiao, jiji, jlibosva, jparker, mkrcmari, mmichels, ralongi, ralonsoh, scohen, skaplons, tredaelli, ykarel
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openvswitch2.17-2.17.0-60.el9fdp
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-01-09 07:53:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2123168

Description Jason Paroly 2022-08-17 20:45:53 UTC
Description of problem:
Cannot ping floating ip until the router is pinged

Version-Release number of selected component (if applicable):


How reproducible:
every time

Steps to Reproduce:
1. If present, delete the OC instance/node, floating IP, and router:
openstack server delete small_test01
# look up the subnet attached to r1 and detach it before deleting the router
openstack router remove subnet r1 $(openstack router show r1 -f json -c interfaces_info | jq -r .interfaces_info[0].subnet_id)
openstack router delete r1
2. Create a new node.
Can use nova create; a link to the script used will be added, as there are several commands involved in creating the instance (the ping test also occurs in the script, which is where the script/test fails). A hedged sketch of typical commands is shown after this list.
3. Try to ping the floating IP.
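
A minimal sketch of a typical step-2 sequence; the instance/router names match the delete commands in step 1, but the image, flavor, and network names ("public", "private", "private-subnet") are hypothetical and the actual script referenced above may differ:

# recreate the router and attach it to the external and tenant networks
openstack router create r1
openstack router set r1 --external-gateway public
openstack router add subnet r1 private-subnet
# boot the instance and attach a floating IP from the provider network
openstack server create --image cirros --flavor m1.tiny --network private small_test01
openstack floating ip create public
openstack server add floating ip small_test01 <allocated-floating-ip>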

Actual results:
cannot ping floating ip


Expected results:
can ping floating ip

Additional info:
If the router is pinged first, then pinging the floating IP works. A hedged sketch of the workaround is below.
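
A minimal sketch of the workaround, assuming hypothetical addresses (10.0.0.254 for the router's external gateway, 10.0.0.50 for the instance's floating IP):

# ping the router's external gateway address first ...
ping -c 3 10.0.0.254
# ... after which the floating IP responds
ping -c 3 10.0.0.50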

Comment 6 Slawek Kaplonski 2022-08-25 10:26:49 UTC
I investigated this issue with Rodolfo today. We created a new instance and router according to the description of this BZ and reproduced the issue.
The FIP was centralized. The router's gateway was on compute-3 and the VM was on compute-0 in our case. When we pinged the FIP, the ICMP requests arrived properly at the VM and it replied. The problem is that the ICMP reply got lost somewhere in br-int and never made it back to compute-3.
Rodolfo is investigating the OF rules on compute-0 now, but it seems like an OVN issue, not directly a Neutron one. A hedged sketch of how such a drop can be traced is below.
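
A minimal sketch of tracing the reply through br-int with ofproto/trace, run on the compute node hosting the VM; the port name, MACs, and IPs here are hypothetical placeholders for the VM's tap port and the flow being debugged:

# replay the ICMP reply through the OpenFlow pipeline and print the verdict
ovs-appctl ofproto/trace br-int \
    in_port=tap1234,icmp,dl_src=fa:16:3e:aa:bb:cc,dl_dst=fa:16:3e:dd:ee:ff,nw_src=192.168.0.10,nw_dst=10.0.0.50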

Comment 7 Slawek Kaplonski 2022-08-25 11:22:03 UTC
After some more investigation, we pinged the router's external port from the undercloud and then, as described in the bug report, pinging the FIP started working fine. We compared the OF rules on compute-0 when it wasn't working and when it was working fine. The only difference was 2 additional OF rules present when it was working:

cookie=0xc1ee105f, duration=95.663s, table=66, n_packets=93, n_bytes=9114, idle_age=0, priority=100,reg0=0xa0000fe,reg15=0x2,metadata=0x3 actions=mod_dl_dst:f2:ec:a5:6f:4e:6c,load:0x1->NXM_NX_REG10[6]
cookie=0xc1ee105f, duration=95.663s, table=67, n_packets=0, n_bytes=0, idle_age=95, priority=100,arp,reg0=0xa0000fe,reg14=0x2,metadata=0x3,dl_src=f2:ec:a5:6f:4e:6c actions=load:0x1->NXM_NX_REG10[6]

and it seems that the ICMP reply was hitting the rule from table 66.
I don't know exactly what MAC f2:ec:a5:6f:4e:6c is - it's certainly nothing directly related to Neutron. A hedged sketch of how to capture this comparison and look up the MAC is below.
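
A minimal sketch of how the comparison and MAC lookup can be done, assuming it is run where br-int and the OVN southbound DB are reachable (ovn-sbctl typically needs a controller node or an explicit --db option):

# dump the flows while the FIP ping fails, again after it works, and diff them
ovs-ofctl dump-flows br-int > flows-before.txt
# ... ping the router's external port from the undercloud ...
ovs-ofctl dump-flows br-int > flows-after.txt
diff flows-before.txt flows-after.txt

# find the OVN logical port that owns the unknown MAC
ovn-sbctl list Port_Binding | grep -B 8 'f2:ec:a5:6f:4e:6c'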

I'm moving this BZ to OVN for now for investigation there, as it seems like an OVN issue to me.

Comment 8 Slawek Kaplonski 2022-08-25 11:23:22 UTC
One more thing. The OVN version we used is:

[root@controller-0 /]# rpm -qa | grep ovn
ovn22.03-22.03.0-69.el9fdp.x86_64
rhosp-ovn-22.03-5.el9ost.noarch
ovn22.03-central-22.03.0-69.el9fdp.x86_64
rhosp-ovn-central-22.03-5.el9ost.noarch

And OVS version:
[root@controller-0 heat-admin]# rpm -qa | grep openvswitch
openvswitch-selinux-extra-policy-1.0-31.el9fdp.noarch
openvswitch2.17-2.17.0-32.1.el9fdp.x86_64
openstack-network-scripts-openvswitch2.17-10.11.1-3.el9ost.x86_64
rhosp-network-scripts-openvswitch-2.17-5.el9ost.noarch
rhosp-openvswitch-2.17-5.el9ost.noarch

Comment 9 Mark Michelson 2022-08-31 13:54:52 UTC
Hi, I'm doing some triage of this issue for the OVN team. I have some questions about the nature of the network setup. First: is the ping going from one VM to another on the overlay, or is the ping originating externally and coming into the network via a gateway router? If it's two VMs pinging each other, are they both attached to the same logical router?

Finally, in order to properly reproduce/fix this issue, we will need the northbound database from the cluster where you see the failure occur.

Comment 10 Jason Paroly 2022-08-31 18:40:25 UTC
@mmichels The ping is going from one VM to another.  I am assuming they are both attached to the same logical router.  I am not sure where to get the database.  I am hoping @hjensas will be able to answer these questions definitively. Thank you!

Comment 11 Harald Jensås 2022-09-01 07:08:33 UTC
(In reply to Mark Michelson from comment #9)
> Hi, I'm doing some triage of this issue for the OVN team. I have some
> questions about the nature of the network setup here. First is the ping
> going from one VM to another on the overlay, or is the ping originating
> externally and coming into the network via a gateway router? If it's two VMs
> pinging each other, are they both attached to the same logical router?
> 

The ping is not between OpenStack instances.
The ping is from an external source; the source is L2-connected to the provider network where the floating IP is allocated.

> Finally, in order to properly reproduce/fix this issue, we will need the
> northbound database from the cluster where you see the failure occur.

There is a running reproducer if you would like to troubleshoot this on a live system; see comment 4 for details.

I reproduced the issue again and used the script from the solution article [1] to get the OVN db content.

In my case the router gateway was on compute-2, and the instance was running on compute-4.

I captured the OVN db content on both nodes (compute-2 and compute-4), both before and after pinging the router's external gateway address.

I will upload the file to the BZ.

ovn-db-content-RHBZ2119194
├── compute-2
│   ├── compute-2-post-pinging-router-external-gateway-ovn-db-content.txt
│   └── compute-2-pre-pinging-router-external-gateway-ovn-db-content.txt
└── compute-4
    ├── compute-4-post-pinging-router-external-gateway-ovn-db-content.txt
    └── compute-4-pre-pinging-router-external-gateway-ovn-db-content.txt


[1] https://access.redhat.com/solutions/3776401
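
A minimal sketch of the kind of state such a collection typically gathers; the exact script in [1] may differ, and ovn-nbctl/ovn-sbctl assume the NB/SB databases are reachable from where they are run:

ovn-nbctl show > nbdb.txt                  # northbound: logical switches/routers
ovn-sbctl show > sbdb.txt                  # southbound: chassis and port bindings
ovs-vsctl show > ovs.txt                   # local bridge and interface layout
ovs-ofctl dump-flows br-int > flows.txt    # OpenFlow rules installed on br-int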

Comment 30 Ales Musil 2022-10-13 12:51:35 UTC
After some back and forth, we decided to move this to the OvS team, as the problem is beyond my scope of expertise.
I'll keep the priority at medium because there is a workaround.

Comment 36 Eelco Chaudron 2022-12-07 16:22:14 UTC
Well, I finally got it... The problem is related to all packets that need slow-path actions and need to egress an IPv6 tunnel.

A patch was sent upstream including the reproducer:

https://patchwork.ozlabs.org/project/openvswitch/list/?series=331619
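
A minimal sketch of how one might spot affected traffic on a compute node; slow-pathed datapath flows carry userspace()/slow_path() actions, and the column listing is just one way to confirm the tunnel endpoints are IPv6:

# datapath flows handled in the slow path show userspace()/slow_path() actions
ovs-appctl dpctl/dump-flows -m | grep -E 'userspace|slow_path'
# check whether the tunnel interfaces (e.g. geneve) have IPv6 remote_ip options
ovs-vsctl --columns=name,type,options list Interface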

Comment 37 Eelco Chaudron 2023-01-09 07:53:57 UTC
The fix was accepted upstream and backported all the way down to OVS 2.13. We will pick it up automatically in the next FDP release.

Will close the BZ for now.