Description of problem:
Cannot ping the floating IP until the router is pinged.

Version-Release number of selected component (if applicable):

How reproducible:
Every time.

Steps to Reproduce:
1. If present, delete the OC instance/node + floating IP + router:
   openstack server delete small_test01
   openstack router remove subnet r1 $(openstack router show r1 -f json -c interfaces_info | jq -r .interfaces_info[0].subnet_id)
   openstack router delete r1
2. Create a new node. nova create can be used; I will add a link to the script used, as there are several commands involved in creating the instance (the ping test occurs in the script as well, which is where the script/test is failing). A rough sketch of these steps follows below.
3. Try to ping the floating IP.

Actual results:
Cannot ping the floating IP.

Expected results:
Can ping the floating IP.

Additional info:
If the router is pinged first, then pinging the floating IP works.
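For reference, here is a minimal sketch of the create/ping steps (steps 2 and 3). The network, subnet, image, and flavor names are placeholders, not the exact values from the script mentioned above:

  # Recreate the router and attach the tenant subnet
  openstack router create r1
  openstack router set r1 --external-gateway <external-net>
  openstack router add subnet r1 <private-subnet>

  # Boot the instance and attach a floating IP from the provider network
  openstack server create --image <image> --flavor <flavor> --network <private-net> small_test01
  openstack floating ip create <external-net>
  openstack server add floating ip small_test01 <floating-ip>

  # Step 3: ping the floating IP from the external host
  ping -c 4 <floating-ip>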
I investigated this issue with Rodolfo today. We created a new instance and router according to the description of this BZ, and we reproduced the issue. The FIP was centralized. The router's gateway was on compute-3 and the VM was on compute-0 in our case. When we pinged the FIP, ICMP requests were arriving properly at the VM and it was replying. The problem is that the ICMP reply was lost somewhere in br-int and never made it back to compute-3. Rodolfo is investigating the OF rules on compute-0 now, but it looks like an OVN issue rather than a Neutron one.
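For anyone retracing this, a rough sketch of the kind of checks we ran on the compute node (interface names, ports, and addresses below are illustrative, not the actual values):

  # On compute-0: confirm ICMP requests reach the VM tap and replies leave it
  tcpdump -nei tap<port-id> icmp

  # Dump the br-int OpenFlow tables to see where the reply is handled
  ovs-ofctl dump-flows br-int

  # Trace a synthetic ICMP reply through br-int
  ovs-appctl ofproto/trace br-int 'in_port=<vm-ofport>,icmp,dl_src=<vm-mac>,dl_dst=<router-mac>,nw_src=<vm-ip>,nw_dst=<external-src-ip>'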
After some more investigation we pinged the router's external port from the undercloud and then, as described in the bug description, ping to the FIP started working fine.
We compared the OF rules on compute-0 when it wasn't working and when it was working fine. The only difference was 2 additional OF rules present when it was working fine:

cookie=0xc1ee105f, duration=95.663s, table=66, n_packets=93, n_bytes=9114, idle_age=0, priority=100,reg0=0xa0000fe,reg15=0x2,metadata=0x3 actions=mod_dl_dst:f2:ec:a5:6f:4e:6c,load:0x1->NXM_NX_REG10[6]
cookie=0xc1ee105f, duration=95.663s, table=67, n_packets=0, n_bytes=0, idle_age=95, priority=100,arp,reg0=0xa0000fe,reg14=0x2,metadata=0x3,dl_src=f2:ec:a5:6f:4e:6c actions=load:0x1->NXM_NX_REG10[6]

It seems that the ICMP reply was hitting the rule from table 66. I don't know what exactly the MAC f2:ec:a5:6f:4e:6c is - it's definitely nothing related to Neutron directly. I'm moving this BZ to OVN for now for investigation there, as it looks like an OVN issue to me.
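For the record, a sketch of how the before/after flow comparison can be done (file names are illustrative):

  # On compute-0: capture flows while the FIP is unreachable
  ovs-ofctl dump-flows br-int > flows-before.txt
  # ... ping the router's external gateway address from the undercloud ...
  ovs-ofctl dump-flows br-int > flows-after.txt
  diff flows-before.txt flows-after.txt

  # Inspect only the two tables that gained rules
  ovs-ofctl dump-flows br-int table=66
  ovs-ofctl dump-flows br-int table=67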
One more thing. The OVN version which we used is:

[root@controller-0 /]# rpm -qa | grep ovn
ovn22.03-22.03.0-69.el9fdp.x86_64
rhosp-ovn-22.03-5.el9ost.noarch
ovn22.03-central-22.03.0-69.el9fdp.x86_64
rhosp-ovn-central-22.03-5.el9ost.noarch

And the OVS version:

[root@controller-0 heat-admin]# rpm -qa | grep openvswitch
openvswitch-selinux-extra-policy-1.0-31.el9fdp.noarch
openvswitch2.17-2.17.0-32.1.el9fdp.x86_64
openstack-network-scripts-openvswitch2.17-10.11.1-3.el9ost.x86_64
rhosp-network-scripts-openvswitch-2.17-5.el9ost.noarch
rhosp-openvswitch-2.17-5.el9ost.noarch
Hi, I'm doing some triage of this issue for the OVN team. I have some questions about the nature of the network setup here. First, is the ping going from one VM to another on the overlay, or is the ping originating externally and coming into the network via a gateway router? If it's two VMs pinging each other, are they both attached to the same logical router? Finally, in order to properly reproduce/fix this issue, we will need the northbound database from the cluster where you see the failure occur.
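If it helps, one way to grab the NB database (the socket paths below are the usual OVN defaults and may differ on your deployment, e.g. inside a container):

  # On a controller node: binary backup of the running NB database
  ovsdb-client backup unix:/var/run/ovn/ovnnb_db.sock > ovnnb_db.backup

  # Or a human-readable dump of the logical topology
  ovn-nbctl --db=unix:/var/run/ovn/ovnnb_db.sock show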
@mmichels The ping is going from one VM to another. I am assuming they are both attached to the same logical router. I am not sure where to get the database. I am hoping @hjensas will be able to answer these questions definitively. Thank you!
(In reply to Mark Michelson from comment #9)
> Hi, I'm doing some triage of this issue for the OVN team. I have some
> questions about the nature of the network setup here. First is the ping
> going from one VM to another on the overlay, or is the ping originating
> externally and coming into the network via a gateway router? If it's two VMs
> pinging each other, are they both attached to the same logical router?

The ping is not between OpenStack instances. The ping is from an external source; the source is L2 connected to the provider network where the floating IP is allocated.

> Finally, in order to properly reproduce/fix this issue, we will need the
> northbound database from the cluster where you see the failure occur.

There is a running reproducer if you would like to troubleshoot this on a live system; see comment 4 for details.

I reproduced the issue again, and used the script from the solution article[1] to get the OVN db content. In my case the router gateway was on compute-2, and the instance was running on compute-4. I captured the OVN db content both prior to pinging the router external gateway address and after pinging it, on both nodes (compute-2 and compute-4). I will upload the file to the BZ.

ovn-db-content-RHBZ2119194
├── compute-2
│   ├── compute-2-post-pinging-router-external-gateway-ovn-db-content.txt
│   └── compute-2-pre-pinging-router-external-gateway-ovn-db-content.txt
└── compute-4
    ├── compute-4-post-pinging-router-external-gateway-ovn-db-content.txt
    └── compute-4-pre-pinging-router-external-gateway-ovn-db-content.txt

[1] https://access.redhat.com/solutions/3776401
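For completeness, a sketch of the kind of per-node capture performed before and after the workaround ping (the exact commands live in the script from [1]; the commands and file naming below are assumptions):

  # On each compute node (compute-2 and compute-4), before the workaround:
  ovs-vsctl show > $(hostname)-pre-pinging-router-external-gateway-ovn-db-content.txt
  ovs-ofctl dump-flows br-int >> $(hostname)-pre-pinging-router-external-gateway-ovn-db-content.txt
  ovsdb-client dump unix:/var/run/openvswitch/db.sock >> $(hostname)-pre-pinging-router-external-gateway-ovn-db-content.txt
  # ... ping the router external gateway, then repeat into the *-post-* file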
After some back and forth we decided to move this to the OvS team, as the problem is beyond my scope of expertise. I'll keep the priority at medium because there is a workaround.
Well, I finally got it... The problem affects all packets that need slow-path actions and need to egress an IPv6 tunnel. A patch, including a reproducer, was sent upstream: https://patchwork.ozlabs.org/project/openvswitch/list/?series=331619
The fix was accepted upstream and backported all the way down to OVS 2.13. We will pick this up automatically in the next FDP release. Closing the BZ for now.