Bug 1760179
| Field | Value |
|---|---|
| Summary | IPv6 address never assigned, possibly "linklocal6: waiting for link-local addresses failed due to timeout" |
| Product | [Fedora] Fedora |
| Component | NetworkManager |
| Version | 29 |
| Status | CLOSED NOTABUG |
| Severity | unspecified |
| Priority | unspecified |
| Hardware | Unspecified |
| OS | Unspecified |
| Type | Bug |
| Reporter | Ian Wienand <iwienand> |
| Assignee | Lubomir Rintel <lkundrak> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | bgalvani, dcbw, fgiudici, gnome-sig, john.j5live, lkundrak, mclasen, rhughes, rstrode, sandmann, tdecacqu |
| Last Closed | 2019-10-11 05:36:38 UTC |
Description
Ian Wienand
2019-10-10 04:47:50 UTC
Created attachment 1624174: Boot where IPv6 is not configured
Created attachment 1624175: Boot where IPv6 is configured correctly
Created attachment 1624176: Boot where IPv6 is not configured (timestamps removed for side-by-side)
Created attachment 1624177: Boot where IPv6 is configured correctly (timestamps removed for side-by-side)
Created attachment 1624178: Boot where IPv6 is not configured, then a restart of the NetworkManager service
Created attachment 1624179: Increased link-local timeout
From the first log file (bad.txt):

<debug> [1570668746.9669] platform: signal: address 6 added: fe80::f816:3eff:fe32:1068/64 lft forever pref forever lifetime 1-0[4294967295,4294967295] dev 2 flags permanent,noprefixroute,tentative src kernel
...
<debug> [1570668765.3705] platform: signal: address 6 changed: fe80::f816:3eff:fe32:1068/64 lft forever pref forever lifetime 20-0[4294967295,4294967295] dev 2 flags permanent,noprefixroute src kernel

So, yes, the problem is that the link-local address remains tentative for too long because of the duplicate address detection (DAD) done by the kernel: about 18 seconds, while the timeout in NM is 15. It should usually take 1 or 2 seconds.

Do you have any special sysctl configuration for IPv6? What is the output of:

sysctl -a --pattern 'net.ipv6.conf.(all|ens3)'

?

> I have not exactly determined why the interface starts DOWN sometimes and UP other times? Perhaps something to do with ipv6 autoconfiguration and DAD requests from the kernel? Note I have no idea what's happening in the cloud provider that responds to these requests -- perhaps sometimes we get lucky and get a faster response but other times not?

I don't think this is the case; the kernel sends a neighbor discovery for the tentative address and, if there is no response within 1 second, it promotes the address to non-tentative. If somebody else is using the same address, the address should get the dadfailed flag, which doesn't happen according to the logs.

> But this seems to match the empirical behaviour that a "service NetworkManager restart" consistently brings up the interface -- by this time the LL address is populated and NM continues to configure the interface. The timeout being too short means that on the initial try NM gives up, and leaves the interface unconfigured.

> We run exactly the same VM images in quite a few different clouds and we only seem to hit this on a small minority of them.
But I think that lends further credence to the idea that "in the wild" 15 seconds isn't enough time to wait for the LL address to be configured.

Yes, that is strange, especially because after restarting NM the address becomes non-tentative much faster. But I think the interval only depends on kernel, not on external factors.

(In reply to Beniamino Galvani from comment #7)
> It should usually take 1 or 2 seconds.

Now I know what question to ask :) I can certainly talk to the cloud provider about why the response might be taking so long. There might be logs on their end that will help.

> Do you have special sysctl configuration for IPv6?

No, nothing in particular

> What is the output of:
>
> sysctl -a --pattern 'net.ipv6.conf.(all|ens3)'

# sysctl -a --pattern 'net.ipv6.conf.(all|ens3)'
net.ipv6.conf.all.accept_dad = 0
net.ipv6.conf.all.accept_ra = 1
net.ipv6.conf.all.accept_ra_defrtr = 1
net.ipv6.conf.all.accept_ra_from_local = 0
net.ipv6.conf.all.accept_ra_min_hop_limit = 1
net.ipv6.conf.all.accept_ra_mtu = 1
net.ipv6.conf.all.accept_ra_pinfo = 1
net.ipv6.conf.all.accept_ra_rt_info_max_plen = 0
net.ipv6.conf.all.accept_ra_rt_info_min_plen = 0
net.ipv6.conf.all.accept_ra_rtr_pref = 1
net.ipv6.conf.all.accept_redirects = 1
net.ipv6.conf.all.accept_source_route = 0
net.ipv6.conf.all.addr_gen_mode = 0
net.ipv6.conf.all.autoconf = 1
net.ipv6.conf.all.dad_transmits = 1
net.ipv6.conf.all.disable_ipv6 = 0
net.ipv6.conf.all.disable_policy = 0
net.ipv6.conf.all.drop_unicast_in_l2_multicast = 0
net.ipv6.conf.all.drop_unsolicited_na = 0
net.ipv6.conf.all.enhanced_dad = 1
net.ipv6.conf.all.force_mld_version = 0
net.ipv6.conf.all.force_tllao = 0
net.ipv6.conf.all.forwarding = 0
net.ipv6.conf.all.hop_limit = 64
net.ipv6.conf.all.ignore_routes_with_linkdown = 0
net.ipv6.conf.all.keep_addr_on_down = 0
net.ipv6.conf.all.max_addresses = 16
net.ipv6.conf.all.max_desync_factor = 600
net.ipv6.conf.all.mc_forwarding = 0
net.ipv6.conf.all.mldv1_unsolicited_report_interval = 10000
net.ipv6.conf.all.mldv2_unsolicited_report_interval = 1000
net.ipv6.conf.all.mtu = 1280
net.ipv6.conf.all.ndisc_notify = 0
net.ipv6.conf.all.ndisc_tclass = 0
net.ipv6.conf.all.optimistic_dad = 0
net.ipv6.conf.all.proxy_ndp = 0
net.ipv6.conf.all.regen_max_retry = 3
net.ipv6.conf.all.router_probe_interval = 60
net.ipv6.conf.all.router_solicitation_delay = 1
net.ipv6.conf.all.router_solicitation_interval = 4
net.ipv6.conf.all.router_solicitation_max_interval = 3600
net.ipv6.conf.all.router_solicitations = -1
net.ipv6.conf.all.seg6_enabled = 0
net.ipv6.conf.all.seg6_require_hmac = 0
net.ipv6.conf.all.suppress_frag_ndisc = 1
net.ipv6.conf.all.temp_prefered_lft = 86400
net.ipv6.conf.all.temp_valid_lft = 604800
net.ipv6.conf.all.use_oif_addrs_only = 0
net.ipv6.conf.all.use_optimistic = 0
net.ipv6.conf.all.use_tempaddr = 0
net.ipv6.conf.ens3.accept_dad = 1
net.ipv6.conf.ens3.accept_ra = 1
net.ipv6.conf.ens3.accept_ra_defrtr = 0
net.ipv6.conf.ens3.accept_ra_from_local = 0
net.ipv6.conf.ens3.accept_ra_min_hop_limit = 1
net.ipv6.conf.ens3.accept_ra_mtu = 1
net.ipv6.conf.ens3.accept_ra_pinfo = 0
net.ipv6.conf.ens3.accept_ra_rt_info_max_plen = 0
net.ipv6.conf.ens3.accept_ra_rt_info_min_plen = 0
net.ipv6.conf.ens3.accept_ra_rtr_pref = 0
net.ipv6.conf.ens3.accept_redirects = 1
net.ipv6.conf.ens3.accept_source_route = 0
net.ipv6.conf.ens3.addr_gen_mode = 1
net.ipv6.conf.ens3.autoconf = 1
net.ipv6.conf.ens3.dad_transmits = 1
net.ipv6.conf.ens3.disable_ipv6 = 0
net.ipv6.conf.ens3.disable_policy = 0
net.ipv6.conf.ens3.drop_unicast_in_l2_multicast = 0
net.ipv6.conf.ens3.drop_unsolicited_na = 0
net.ipv6.conf.ens3.enhanced_dad = 1
net.ipv6.conf.ens3.force_mld_version = 0
net.ipv6.conf.ens3.force_tllao = 0
net.ipv6.conf.ens3.forwarding = 0
net.ipv6.conf.ens3.hop_limit = 64
net.ipv6.conf.ens3.ignore_routes_with_linkdown = 0
net.ipv6.conf.ens3.keep_addr_on_down = 0
net.ipv6.conf.ens3.max_addresses = 16
net.ipv6.conf.ens3.max_desync_factor = 600
net.ipv6.conf.ens3.mc_forwarding = 0
net.ipv6.conf.ens3.mldv1_unsolicited_report_interval = 10000
net.ipv6.conf.ens3.mldv2_unsolicited_report_interval = 1000
net.ipv6.conf.ens3.mtu = 1450
net.ipv6.conf.ens3.ndisc_notify = 0
net.ipv6.conf.ens3.ndisc_tclass = 0
net.ipv6.conf.ens3.optimistic_dad = 0
net.ipv6.conf.ens3.proxy_ndp = 0
net.ipv6.conf.ens3.regen_max_retry = 3
net.ipv6.conf.ens3.router_probe_interval = 60
net.ipv6.conf.ens3.router_solicitation_delay = 30
net.ipv6.conf.ens3.router_solicitation_interval = 4
net.ipv6.conf.ens3.router_solicitation_max_interval = 3600
net.ipv6.conf.ens3.router_solicitations = -1
net.ipv6.conf.ens3.seg6_enabled = 0
net.ipv6.conf.ens3.seg6_require_hmac = 0
net.ipv6.conf.ens3.suppress_frag_ndisc = 1
net.ipv6.conf.ens3.temp_prefered_lft = 86400
net.ipv6.conf.ens3.temp_valid_lft = 604800
net.ipv6.conf.ens3.use_oif_addrs_only = 0
net.ipv6.conf.ens3.use_optimistic = 0
net.ipv6.conf.ens3.use_tempaddr = 0

(In reply to Beniamino Galvani from comment #8)
> > I have not exactly determined why the interface starts DOWN sometimes and UP other times? Perhaps something to do with ipv6 autoconfiguration and DAD requests from the kernel? Note I have no idea what's happening in the cloud provider that responds to these requests -- perhaps sometimes we get lucky and get a faster response but other times not?
>
> I don't think this is the case; kernel sends a neighbor discovery for the
> tentative address and if there is no response within 1 second it promotes
> the address to non-tentative. If somebody else is using the same address the
> address should get the dadfailed flag, which doesn't happen according to
> logs.

Hrm, it certainly remains tentative for longer than that. The host can actually boot, I can log in via ssh (over ipv4), quickly run "ip addr", and watch it change

---
-bash-4.2# ssh root.48.15
Last login: Thu Oct 10 04:28:35 2019 from 2001:44b8:3177:ac00:b321:eb44:783:2320
[root@ianw-test-glean-centos ~]# ip addr
...
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:32:10:68 brd ff:ff:ff:ff:ff:ff
    inet 192.168.48.15/24 brd 192.168.48.255 scope global dynamic noprefixroute ens3
       valid_lft 86395sec preferred_lft 86395sec
    inet6 fe80::f816:3eff:fe32:1068/64 scope link tentative noprefixroute
       valid_lft forever preferred_lft forever
---

> Yes, that is strange, especially because after restarting NM the address
> becomes non-tentative much faster. But I think the interval only depends on
> kernel, not on external factors.

I think that by the time I can restart NM, the address is already permanent, so it hasn't really gone back into the tentative state? ("becomes non-tentative much faster"; i.e. when restarting after login it no longer needs to change state)

Upon further debugging, I think that we have caused this by increasing the RA delay [1].

The root cause is (again, I think) that our network configuration tool has brought the interface UP before NetworkManager starts [2] (it does this in an attempt to probe which interfaces seem active and thus should be configured). This means that in some cases the interface can have accepted an RA and have an IPv6 address assigned; after that, NM will refuse to configure the interface any further. We had increased the RA delay to try to work around this before we understood this behaviour (very similar to what's described in [3]).

So I think this "bug" is a red herring related to all that. It still might be better if the timeout were configurable, but I don't think it's worth the effort. Hopefully this can be some breadcrumbs if anyone else experiences something similar.

[1] https://opendev.org/openstack/diskimage-builder/src/commit/5b5385cf84a422b0394f6bd95d7700f2f8a9bf86/diskimage_builder/elements/simple-init/post-install.d/80-simple-init#L60
[2] https://review.opendev.org/#/c/688031/
[3] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=755202

Ah, right. I didn't notice router_solicitation_delay also delays DAD. To avoid the problem with NM picking up the existing configuration, I think a better solution would be to flush addresses on the interface and bring it down.
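
For anyone landing here later, the "flush addresses and bring it down" suggestion could look roughly like the sketch below. This is only an illustration, not what the configuration tool in [2] actually does: the `run` and `reset_iface` helper names and the `DRY_RUN` variable are made up for this example, and the interface name is whatever was probed (ens3 in this report). It assumes it runs as root, after the probe and before NetworkManager starts.

```shell
#!/bin/sh
# Hypothetical helpers (not from the tool discussed in this bug): undo a
# pre-NetworkManager "is this link active?" probe so that NM starts from a
# clean interface instead of picking up kernel-assigned IPv6 state.

# Print the command instead of executing it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

reset_iface() {
    iface="$1"
    # Drop every address (IPv4 and IPv6) picked up while the link was UP.
    run ip addr flush dev "$iface"
    # Return the link to DOWN so NetworkManager brings it up itself.
    run ip link set dev "$iface" down
}
```

Calling `reset_iface ens3` with `DRY_RUN=1` set only prints the two `ip` commands, which is a cheap way to check the order before running it for real. Note that `ip addr flush dev` without a family option removes IPv4 addresses as well; `ip -6 addr flush dev` would limit the flush to IPv6.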