Bug 2154930 - [ovn][bgp] VM takes several minutes to obtain its IPv6 address (slaac - provider network)
Keywords:
Status: MODIFIED
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn22.06
Version: RHEL 9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: lorenzo bianconi
QA Contact: Jianlin Shi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-12-19 17:00 UTC by Eduardo Olivares
Modified: 2023-07-13 07:35 UTC

Fixed In Version: ovn22.06-22.06.0-118.el8fdp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:



Links
System                  ID         Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker   FD-2588    0        None      None    None     2023-01-02 15:01:19 UTC
Red Hat Issue Tracker   OSP-20951  0        None      None    None     2022-12-19 17:13:24 UTC

Description Eduardo Olivares 2022-12-19 17:00:39 UTC
Description of problem:
A vlan provider network is created with an IPv4 subnet and an IPv6 subnet (with slaac).
A router is created and connected to that external network.
A port is created on that network.
A VM instance is created using that port (both RHEL9.0 and centos-stream-9 have been used).

We can access the instance using the following command from the compute node where the instance is spawned (root password needs to be configured on the image):
podman exec -ituroot nova_virtqemud virsh console <instance>

The instance obtains its IPv4 address during boot. However, its eth0 has no global IPv6 address (it only has an IPv4 address and an IPv6 link-local address).

The following flow has existed on the compute node since the moment the instance was created:
[root@cmp-2-0 ~]# ovs-ofctl dump-flows br-int |  grep "icmp_type=133.*20.01.0d.b9"
 cookie=0xfd928287, duration=2858.181s, table=16, n_packets=1, n_bytes=62, idle_age=2301, priority=50,icmp6,reg14=0x1,metadata=0x3,ipv6_dst=ff02::2,nw_ttl=255,icmp_type=133,icmp_code=0 actions=controller(userdata=00.00.00.08.00.00.00.00.00.01.de.10.00.00.00.65.86.00.00.00.ff.00.ff.ff.00.00.00.00.00.00.00.00.01.01.fa.16.3e.80.75.0b.05.01.00.00.00.00.05.dc.03.04.40.c0.ff.ff.ff.ff.ff.ff.ff.ff.00.00.00.00.20.01.0d.b9.00.00.00.00.00.00.00.00.00.00.00.00,pause),resubmit(,17)  
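
The userdata in the flow above appears to carry the RA template that OVN would send in reply to an RS. The following is a minimal sketch that decodes it, assuming the first 16 bytes are the controller action header and that the RA body (starting at the byte 0x86, ICMPv6 type 134) follows the RFC 4861 layout. The function name decode_ra and the 16-byte offset are assumptions made for illustration, not part of OVN.

```python
import ipaddress
import struct

# userdata copied from the flow in this report
USERDATA = (
    "00.00.00.08.00.00.00.00.00.01.de.10.00.00.00.65."
    "86.00.00.00.ff.00.ff.ff.00.00.00.00.00.00.00.00."
    "01.01.fa.16.3e.80.75.0b."
    "05.01.00.00.00.00.05.dc."
    "03.04.40.c0.ff.ff.ff.ff.ff.ff.ff.ff.00.00.00.00."
    "20.01.0d.b9.00.00.00.00.00.00.00.00.00.00.00.00"
)


def decode_ra(userdata: str, ra_offset: int = 16) -> dict:
    """Decode the RA template embedded in the controller userdata.

    ra_offset=16 is an assumption: it skips what looks like the
    action header so that parsing starts at ICMPv6 type 134 (0x86).
    """
    raw = bytes.fromhex(userdata.replace(".", ""))
    ra = raw[ra_offset:]
    # Fixed RA header: type, code, checksum, cur hop limit, flags,
    # router lifetime, reachable time, retrans timer (16 bytes).
    icmp_type, _code, _cksum, hop_limit, _flags, lifetime, _reach, _retrans = (
        struct.unpack("!BBHBBHII", ra[:16])
    )
    if icmp_type != 134:
        raise ValueError("no router advertisement at the assumed offset")
    info = {"hop_limit": hop_limit, "router_lifetime": lifetime}

    pos = 16
    while pos + 2 <= len(ra):
        opt_type, opt_len = ra[pos], ra[pos + 1] * 8  # length is in 8-byte units
        if opt_len == 0:
            raise ValueError("zero-length ND option")
        body = ra[pos + 2:pos + opt_len]
        if opt_type == 1:    # source link-layer address
            info["source_mac"] = ":".join(f"{b:02x}" for b in body[:6])
        elif opt_type == 5:  # MTU (2 reserved bytes, then 32-bit MTU)
            info["mtu"] = struct.unpack("!I", body[2:6])[0]
        elif opt_type == 3:  # prefix information
            prefix = ipaddress.IPv6Address(body[14:30])
            info["prefix"] = f"{prefix}/{body[0]}"
            info["onlink"] = bool(body[1] & 0x80)
            info["auto"] = bool(body[1] & 0x40)
        pos += opt_len
    return info


if __name__ == "__main__":
    print(decode_ra(USERDATA))
```

If the assumptions hold, the decoded MAC, MTU and prefix can be compared directly with the RA that the VM eventually receives.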



The previous flow had n_packets=0 for several minutes, and the ICMPv6 packets captured on the VM showed that it periodically sends RS messages that receive no reply (these packets were captured after the VM had booted and obtained its IPv4 address via DHCP):
11:03:50.955536 fa:16:3e:87:56:3f > 33:33:00:00:00:02, ethertype IPv6 (0x86dd), length 62: (flowlabel 0x320f8, hlim 255, next-header ICMPv6 (58) payload length: 8) fe80::f816:3eff:fe87:563f > ff02::2: [icmp6 sum ok] ICMP6, router solicitation, length 8
11:06:02.445531 fa:16:3e:87:56:3f > 33:33:00:00:00:02, ethertype IPv6 (0x86dd), length 62: (flowlabel 0x320f8, hlim 255, next-header ICMPv6 (58) payload length: 8) fe80::f816:3eff:fe87:563f > ff02::2: [icmp6 sum ok] ICMP6, router solicitation, length 8


Eventually, one of the RS messages is answered with an RA message, and the global IPv6 address is then added to the VM's eth0:
11:10:24.285042 fa:16:3e:87:56:3f > 33:33:00:00:00:02, ethertype IPv6 (0x86dd), length 62: (flowlabel 0x320f8, hlim 255, next-header ICMPv6 (58) payload length: 8) fe80::f816:3eff:fe87:563f > ff02::2: [icmp6 sum ok] ICMP6, router solicitation, length 8
11:10:24.287477 fa:16:3e:80:75:0b > fa:16:3e:87:56:3f, ethertype IPv6 (0x86dd), length 118: (flowlabel 0x320f8, hlim 255, next-header ICMPv6 (58) payload length: 64) fe80::f816:3eff:fe80:750b > fe80::f816:3eff:fe87:563f: [icmp6 sum ok] ICMP6, router advertisement, length 64
        hop limit 255, Flags [none], pref medium, router lifetime 65535s, reachable time 0ms, retrans timer 0ms
          source link-address option (1), length 8 (1): fa:16:3e:80:75:0b
          mtu option (5), length 8 (1):  1500
          prefix info option (3), length 32 (4): 2001:db9::/64, Flags [onlink, auto], valid time infinity, pref. time infinity
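
To put numbers on the delay, here is a small sketch that extracts the RS/RA timestamps from tcpdump text and computes the time from the first RS to the first RA. The capture lines are abbreviated from the ones in this report (only the timestamp and message type matter), the helper names are hypothetical, and all timestamps are assumed to fall on the same day.

```python
import re
from datetime import datetime

# Abbreviated from the capture in this report.
CAPTURE = """\
11:03:50.955536 fe80::f816:3eff:fe87:563f > ff02::2: ICMP6, router solicitation, length 8
11:06:02.445531 fe80::f816:3eff:fe87:563f > ff02::2: ICMP6, router solicitation, length 8
11:10:24.285042 fe80::f816:3eff:fe87:563f > ff02::2: ICMP6, router solicitation, length 8
11:10:24.287477 fe80::f816:3eff:fe80:750b > fe80::f816:3eff:fe87:563f: ICMP6, router advertisement, length 64
"""

LINE_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{6}) .*ICMP6, router (solicitation|advertisement)"
)


def parse_events(text):
    """Return (timestamp, kind) tuples for every RS/RA line, in capture order."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
            events.append((stamp, match.group(2)))
    return events


def seconds_until_first_ra(events):
    """Seconds between the first RS and the first RA in the capture."""
    first_rs = next(t for t, kind in events if kind == "solicitation")
    first_ra = next(t for t, kind in events if kind == "advertisement")
    return (first_ra - first_rs).total_seconds()


def unanswered_rs(events):
    """Count RS messages that are not immediately followed by an RA."""
    count = 0
    for i, (_, kind) in enumerate(events):
        if kind != "solicitation":
            continue
        followed_by_ra = i + 1 < len(events) and events[i + 1][1] == "advertisement"
        if not followed_by_ra:
            count += 1
    return count


if __name__ == "__main__":
    events = parse_events(CAPTURE)
    print(seconds_until_first_ra(events), unanswered_rs(events))
```

For the capture above, this gives a wait of roughly six and a half minutes with two unanswered RS messages, which matches the "6+ minutes" observation.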



ICMPv6 traffic was also captured on the compute node where this VM was running. The unanswered RS messages were seen on both the tap interface and the br-vlan.107 interface, whereas the answered RS message and its corresponding RA were seen only on the tap interface.


Version-Release number of selected component (if applicable):
RHOS-17.1-RHEL-9-20221130.n.1

How reproducible:
Very often it takes more than 2 minutes to obtain the IPv6 prefix; sometimes it takes 6+ minutes.

Steps to Reproduce:
run this playbook: https://gitlab.cee.redhat.com/eolivare/bgp-playbooks-templates/-/blob/master/playbooks/frrempest_only_vlan_provider.yml
using the following command: ansible-playbook -i `ir workspace inventory` playbooks/frrempest_only_vlan_provider.yml -vv


Actual results:
It takes too long for the VM to receive its IPv6 prefix, and several RS messages are not answered.


Expected results:
The VM receives its IPv6 prefix in a reasonable amount of time.

Comment 2 Mark Michelson 2023-01-03 20:13:50 UTC
Thanks for the detailed report. You managed to hit several of the points I was going to ask about.

The thing that's most intriguing to me is that the VM is able to get an IPv4 DHCP address on boot. This implies that things are working as usual with OVS/OVN, and that this isn't a case of unusually heavy load or anything similar that might lead to a delayed response from OVN.

I think the trick here is to find out what is happening to the RS packets that the VM sends. Are they being dropped by an earlier table in br-int? Are the packets never reaching br-int in the first place?
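
One possible way to narrow this down, sketched under the assumption that the output of `ovs-ofctl dump-flows br-int` has been saved as text at two points in time: pull the n_packets counter of every flow matching icmp_type=133 and compare the snapshots. The helper name rs_flow_counters is hypothetical, and the sample line is the flow from the description with its actions abbreviated.

```python
import re

# Flow from the description, with the actions abbreviated as "...".
SAMPLE = (
    " cookie=0xfd928287, duration=2858.181s, table=16, n_packets=1, "
    "n_bytes=62, idle_age=2301, priority=50,icmp6,reg14=0x1,metadata=0x3,"
    "ipv6_dst=ff02::2,nw_ttl=255,icmp_type=133,icmp_code=0 actions=..."
)

TABLE_RE = re.compile(r"\btable=(\d+)")
PACKETS_RE = re.compile(r"\bn_packets=(\d+)")


def rs_flow_counters(dump_text):
    """Return (table, n_packets) for each flow line matching icmp_type=133."""
    counters = []
    for line in dump_text.splitlines():
        if "icmp_type=133" not in line:
            continue
        table = TABLE_RE.search(line)
        packets = PACKETS_RE.search(line)
        if table and packets:
            counters.append((int(table.group(1)), int(packets.group(1))))
    return counters


if __name__ == "__main__":
    print(rs_flow_counters(SAMPLE))
```

If the VM sends an RS between two snapshots and the counter does not move, the packet most likely never reached that flow, either because it was handled in an earlier table or because it never entered br-int.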

The other big question I have here is if this issue is limited only to OVN/OVS 2.13. Does this same issue occur with newer versions?

Comment 3 Dumitru Ceara 2023-01-09 15:43:46 UTC
To add to Mark's comment, could you please also attach the OVN NB/SB DBs and the OVS DB from the node where the VM runs?

Thanks!

Comment 4 lorenzo bianconi 2023-01-12 17:12:02 UTC
I tried to reproduce the issue with upstream OVN and it works there: OVN immediately replies to a router solicitation with a router advertisement.

Comment 5 Eduardo Olivares 2023-01-13 16:45:13 UTC
Apologies for not having answered before.
I have an environment that can be used to reproduce this bug and I'll work on it next Monday. I won't remove my needinfo until then.

We have only reproduced it on:
OSP17.0: ovn22.03-22.03.0-118.el9fdp
OSP17.1: ovn22.06-22.06.0-75.el9fdp

We don't have BGP jobs with earlier OSP releases.


Regarding the priority, IMO it's high because it's something important affecting BGP, but not critical because, after the VM obtains the IPv6 address, everything goes well. So the only problem is the delay in obtaining the IPv6 address, which is big (maybe 6 min) but not huge.

Comment 9 OVN Bot 2023-02-16 05:07:24 UTC
ovn22.12 fast-datapath-rhel-8 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2170298
ovn22.12 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2170299
ovn22.09 fast-datapath-rhel-8 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2170301
ovn22.09 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2170302
ovn22.06 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2170303
ovn22.03 fast-datapath-rhel-8 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2170304
ovn22.03 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2170305
ovn-2021 fast-datapath-rhel-8 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2170306
ovn-2021 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2170307

