This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to get updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
Bug 2262654 - [17.1][OVN][DVR][SNAT] 2 time 20 second of ping loss in case of controller come up after crash for snat ports
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 17.1 (Wallaby)
Hardware: x86_64
OS: All
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Miro Tomaska
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-02-04 15:31 UTC by Luigi Tamagnone
Modified: 2025-01-10 09:45 UTC (History)
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2025-01-10 09:43:15 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FDP-441 0 None None None 2024-03-01 19:00:44 UTC
Red Hat Issue Tracker   OSP-31357 0 None None None 2025-01-10 09:43:14 UTC
Red Hat Issue Tracker OSP-33382 0 None None None 2025-01-10 09:45:00 UTC

Description Luigi Tamagnone 2024-02-04 15:31:10 UTC
Description of problem:
When a controller node crash is triggered, the SNAT IP first fails over to another running controller, and then, once the crashed controller comes back up, moves back to it. We expect a low number of lost packets in both transitions, but it seems that when the IP moves back, several packets are lost.

Example:
- crash of a controller node at 02/02/2024 10:46:14 CET
 -> from a laptop that can reach the SNAT IP assigned to the controller node: 
   64 bytes from 10.x.10.198: icmp_seq=2091 ttl=252 time=22.0 ms
     3 packets lost
   64 bytes from 10.x.10.198: icmp_seq=2095 ttl=252 time=27.6 ms
 -> from the instance (No FIP) to external IP: 
   64 bytes from 10.x.11.254: icmp_seq=1245 ttl=63 time=1.09 ms
     3 packets lost
   64 bytes from 10.x.11.254: icmp_seq=1249 ttl=63 time=2.69 ms

So when the crash is triggered, we see 3 packets lost, which is not so bad. 


- controller node comes back UP at 02/02/2024 10:54:56 CET
 -> from a laptop that can reach the SNAT IP assigned to the controller node: 
   64 bytes from 10.x.10.198: icmp_seq=2602 ttl=252 time=22.1 ms
     19 packets lost
   64 bytes from 10.x.10.198: icmp_seq=2622 ttl=252 time=21.7 ms
   ....
   64 bytes from 10.x.10.198: icmp_seq=2647 ttl=252 time=21.6 ms
     6 packets lost
   64 bytes from 10.x.10.198: icmp_seq=2655 ttl=252 time=22.8 ms
 -> from the instance (No FIP) to external IP: 
   64 bytes from 10.x.11.254: icmp_seq=1755 ttl=63 time=1.14 ms
     20 packets lost
   64 bytes from 10.x.11.254: icmp_seq=1776 ttl=63 time=4.26 ms

When the node comes back, more than 20 packets are lost, and in the SNAT IP case this appears to happen twice.

Version-Release number of selected component (if applicable):
Red Hat Openstack 17.1 (RHOSP17.1)


Steps to Reproduce:
1. Trigger a controller crash with `echo c > /proc/sysrq-trigger`.
2. From the VM, ping an external IP; or, from a host external to RHOSP, ping the SNAT IP.
3. When the controller node comes back up, several pings are lost within a specific interval.
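The steps above can be sketched roughly as follows. The IP, log path, and the `count_gaps` helper are placeholders for illustration; the helper simply derives loss counts like those quoted above from jumps in `icmp_seq` (e.g. a jump from 2602 to 2622 means 19 lost packets):

```shell
# Rough reproduction sketch; 10.x.10.198 and snat_ping.log are placeholders.

# 1. On one controller, trigger a kernel crash (step 1 above):
#      echo c > /proc/sysrq-trigger
# 2. Meanwhile, from a host outside RHOSP, ping the SNAT IP and keep a log:
#      ping 10.x.10.198 | tee snat_ping.log

# 3. After the node comes back, count lost packets per gap in icmp_seq.
count_gaps() {
    awk -F'icmp_seq=' 'NF > 1 {
        split($2, a, " "); seq = a[1] + 0
        # A gap between consecutive sequence numbers means lost packets.
        if (prev && seq > prev + 1)
            print "lost", seq - prev - 1, "packets before seq", seq
        prev = seq
    }' "$1"
}
```

Running `count_gaps snat_ping.log` over the ping output above would report the 19- and 6-packet loss windows seen after the controller returns.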

Actual results:
Ping loss lasting several seconds when the crashed controller comes back up.

Expected results:
1 to 3 pings lost.

Additional info:

Comment 9 Ihar Hrachyshka 2024-02-26 21:50:42 UTC
This may be related to the scenario described at https://mail.openvswitch.org/pipermail/ovs-discuss/2023-September/052688.html and there's an attempt at a fix here: https://patchwork.ozlabs.org/project/ovn/patch/ZYHI8eQadEUVKHog@SIT-SDELAP1003.int.lidl.net/

