720188 – Managed DHCPv6 should not short-circuit SLAAC (causes a loop of device activation attempts/failures)

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	NetworkManager
Sub Component:
Version:	15
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Dan Williams
QA Contact:	Fedora Extras Quality Assurance

Reported:	2011-07-10 18:39 UTC by Tore Anderson
Modified:	2012-08-07 16:02 UTC (History)
Last Closed:	2012-08-07 16:02:13 UTC

Attachments
NetworkManager debug output (1022.42 KB, text/plain) 2011-07-10 18:39 UTC, Tore Anderson
ICMPv6 RS/RA and DHCPv6 packet dump (45.62 KB, application/x-pcap) 2011-07-10 18:40 UTC, Tore Anderson
Don't let managed DHCPv6 preempt SLAAC (803 bytes, patch) 2011-08-10 22:15 UTC, Tore Anderson

Description Tore Anderson 2011-07-10 18:39:12 UTC

Created attachment 512107 [details]
NetworkManager debug output

Description of problem:

When attempting to connect to my network, which uses SLAAC and stateful DHCPv6 simultaneously for configuration, NetworkManager activates the device, then instantly fails it while logging a failure (reason 'ip-config-unavailable'), and starts over. Given enough attempts (I've seen 50+ being necessary), it finally connects and stays connected.



Version-Release number of selected component (if applicable):

NetworkManager-0.8.9997-5.git20110702.fc15.x86_64



How reproducible:

Happens more often than 9 out of 10 times, I'd say.



Steps to Reproduce:
1. Attempt to connect to the network with IPv6 mode automatic.


  
Actual results:

The activation fails right after the systray applet has reported that it was successful. The activation process then restarts again and again until it finally succeeds.



Expected results:

That the network connection would reliably be activated on the first attempt.



Additional info:

I'm attaching a syslog containing debug output from NetworkManager while reproducing the problem (it required 38 attempts to finally connect). I disabled IPv4 to reduce the noise in the logs, but it happens when the IPv4 mode is set to «Automatic», too (the network is dual-stacked with DHCPv4 service). I'm also attaching a tcpdump containing all ICMPv6 RS/RA and DHCPv6 packets seen on the wire during the same time.

One log message that appears right before the device fails looks particularly suspicious to me, considering that the Valid Lifetime of the RA-provided address is exactly 30 days:

[nm-ip6-manager.c:510] nm_ip6_device_sync_from_netlink(): (eth0): RA-provided address no longer valid

The router on the network is a ZyXEL P-2812HNU-F3. I also tried connecting Windows 7 to the same network and had no similar problems.

I'll be happy to provide any further information if necessary.

Tore

Comment 1 Tore Anderson 2011-07-10 18:40:04 UTC

Created attachment 512108 [details]
ICMPv6 RS/RA and DHCPv6 packet dump

Comment 2 Tore Anderson 2011-07-23 09:07:24 UTC

FYI, I still get this behaviour with NetworkManager-0.8.9997-6.git20110721.fc15.x86_64.

Tore

Comment 3 Tore Anderson 2011-08-09 20:40:47 UTC

I've been trying to figure out what's going on here, and I see that when the connections fail, NetworkManager is first removing the kernel-configured SLAAC addresses from the device. From the first attempt in the attached syslog:

<debug> [nm-system.c:222] sync_addresses(): (eth0): syncing addresses (family 10)
<debug> [nm-system.c:275] sync_addresses(): (eth0): removing address 2001:840:3033:10:230:1bff:febc:7f23/64'

Later, nm_ip6_device_sync_from_netlink() is called. It has a loop with the comment «Look for any IPv6 addresses the kernel may have set for the device» that walks the list of addresses on the device:

<debug> [nm-ip6-manager.c:417] nm_ip6_device_sync_from_netlink(): (eth0): syncing with netlink (ra_flags 0x80000070) (state/target 'got-address'/'got-address')
<debug> [nm-ip6-manager.c:436] nm_ip6_device_sync_from_netlink(): (eth0): netlink address: fe80::230:1bff:febc:7f23
<debug> [nm-ip6-manager.c:458] nm_ip6_device_sync_from_netlink(): (eth0): addresses synced (state got-address)
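
For reference, the ra_flags value in the log above can be decoded with a small standalone sketch. It assumes that the logged value is the kernel's per-interface IPv6 flags word and that the bit values match those in the kernel's include/net/if_inet6.h; both points are assumptions here, and the SKETCH_ names and the decode_ra_flags() helper are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit values, mirroring the kernel's include/net/if_inet6.h.
 * The SKETCH_ prefix marks them as local to this example. */
#define SKETCH_IF_RA_OTHERCONF 0x80u
#define SKETCH_IF_RA_MANAGED   0x40u
#define SKETCH_IF_RA_RCVD      0x20u
#define SKETCH_IF_RS_SENT      0x10u
#define SKETCH_IF_READY        0x80000000u

/* Hypothetical decoded view of the flags word. */
struct ra_state {
    bool ready;      /* interface is ready for IPv6 */
    bool managed;    /* RA requested managed (stateful) DHCPv6 */
    bool otherconf;  /* RA requested "other configuration" via DHCPv6 */
    bool ra_rcvd;    /* a Router Advertisement has been received */
    bool rs_sent;    /* a Router Solicitation has been sent */
};

/* Split the flags word into its individual bits. */
static struct ra_state decode_ra_flags(uint32_t flags)
{
    struct ra_state s;

    s.ready     = (flags & SKETCH_IF_READY) != 0;
    s.managed   = (flags & SKETCH_IF_RA_MANAGED) != 0;
    s.otherconf = (flags & SKETCH_IF_RA_OTHERCONF) != 0;
    s.ra_rcvd   = (flags & SKETCH_IF_RA_RCVD) != 0;
    s.rs_sent   = (flags & SKETCH_IF_RS_SENT) != 0;
    return s;
}
```

Under those assumptions, decode_ra_flags(0x80000070) reports an RA received with the managed flag set and the other-configuration flag clear, which matches a network that requests stateful DHCPv6.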

Since the SLAAC-assigned address was removed a bit earlier, the loop doesn't run across it, and therefore never sets the «found_other» boolean to TRUE.

However, a bit further down in the function, the «found_other» boolean is checked, and if it isn't set, NM considers it to «have disappeared for some reason», and therefore fails the connection:

/* If for some reason an RA-provided address disappeared, we need
 * to make sure we fail the connection as it's no longer valid.
 */

<debug> [nm-ip6-manager.c:510] nm_ip6_device_sync_from_netlink(): (eth0): RA-provided address no longer valid

The RA-provided address disappeared because NM explicitly removed it moments earlier, so failing the connection on that basis doesn't make much sense...

Question is, *why* did NM remove the RA-provided address in the first place? I haven't figured that out yet, but I will continue looking the next time I get the time to debug further.

When the connection finally succeeds, the RA-provided address isn't removed by NM. I don't know what is different about that activation that allows it to succeed. I suspect some kind of a race condition, though.
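
The sequence described in this comment can be reproduced in isolation with a minimal sketch. It is not NetworkManager code: struct dev_addr and all function names are hypothetical, and it assumes that the scan sets «found_other» for any address that is not link-local, which the log excerpts suggest but do not confirm.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one address on the device (not an NM type). */
struct dev_addr {
    struct in6_addr ip;
    bool present;   /* false once the address has been removed */
};

/* Parse a textual IPv6 address; returns false on malformed input. */
static bool dev_addr_init(struct dev_addr *a, const char *text)
{
    a->present = true;
    return inet_pton(AF_INET6, text, &a->ip) == 1;
}

/* Stand-in for the earlier removal step: drop every non-link-local address. */
static void remove_non_link_local(struct dev_addr *addrs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!IN6_IS_ADDR_LINKLOCAL(&addrs[i].ip))
            addrs[i].present = false;
}

/* Stand-in for the later scan: assumed to set found_other for any
 * address that is still present and is not link-local. */
static bool scan_found_other(const struct dev_addr *addrs, size_t n)
{
    bool found_other = false;

    for (size_t i = 0; i < n; i++)
        if (addrs[i].present && !IN6_IS_ADDR_LINKLOCAL(&addrs[i].ip))
            found_other = true;
    return found_other;
}

/* In state 'got-address', a scan that finds no such address fails the device. */
static bool would_fail_device(const struct dev_addr *addrs, size_t n)
{
    return !scan_found_other(addrs, n);
}
```

With the two addresses from the log (fe80::230:1bff:febc:7f23 and 2001:840:3033:10:230:1bff:febc:7f23), would_fail_device() is false before remove_non_link_local() runs and true afterwards, so the failure is triggered by the removal itself and not by an expired lifetime.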

Tore

Comment 4 Tore Anderson 2011-08-10 22:11:43 UTC

I just posted a patch to the networkmanager mailing list, with the following description (will also attach the patch here for reference):

NetworkManager currently operates on the assumption that Managed
(Stateful) DHCPv6 preempts SLAAC. This is not the case; Managed DHCPv6
and SLAAC are completely orthogonal. My consumer-grade xDSL CPE (a ZyXEL
P-2812HNU-F3) does both at the same time by default, which is a
necessity to trigger the following bug:

Currently NetworkManager will abandon SLAAC activation if it sees that
Managed DHCPv6 is requested by the RA. As far as I have been able to
understand, this makes NetworkManager overlook the kernel-configured
SLAAC address, which in turn makes sync_addresses() remove it again at
a later stage, as it's being considered as an "unwanted alien" of some
sort.

However, right after the device activation has finished,
nm_ip6_device_sync_from_netlink() is run, which notices that the SLAAC
address has vanished, and figures (incorrectly) that it must have been
because the Valid Lifetime has reached zero and that the kernel has
therefore removed it. In response, nm_ip6_device_sync_from_netlink()
deactivates the entire interface, and the activation process starts over
again. Given enough attempts (more than a dozen most of the time, and
sometimes more than fifty have been necessary) NM will eventually manage
to permanently activate the interface, though I don't know exactly what
conditions are necessary for the activation to be a lasting success.

This patch fixes the problem completely for me, the device is now being
successfully activated on the first attempt every single time. It simply
removes the flawed assumption that Managed DHCPv6 short-circuits SLAAC,
and makes NM complete the SLAAC process regardless of Managed DHCPv6
being requested or not.

Tore

Comment 5 Tore Anderson 2011-08-10 22:15:27 UTC

Created attachment 517703 [details]
Don't let managed DHCPv6 preempt SLAAC

Comment 6 Tore Anderson 2011-08-10 22:43:39 UTC

Comment on attachment 517703 [details]
Don't let managed DHCPv6 preempt SLAAC

The patch turned out to be no good, as it breaks Managed DHCPv6 operation when there's no SLAAC at all.

I'm pretty sure I'm correct about the root cause why SLAAC+DHCPv6 operation is so unreliable, though...
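
One way to picture the two mechanisms as independent is the sketch below. It is not the attached patch or NM's logic. It assumes that SLAAC should be expected only when the RA carries at least one Prefix Information option with the autonomous (A) flag set, and that stateful DHCPv6 should run whenever the RA's managed (M) flag is set. struct ra_info, struct ip6_plan and plan_from_ra() are hypothetical names; the ND_* constants come from <netinet/icmp6.h>.

```c
#include <netinet/in.h>
#include <netinet/icmp6.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of one received RA (not an NM structure). */
struct ra_info {
    uint8_t ra_flags;             /* flags byte from the RA header */
    const uint8_t *prefix_flags;  /* flags byte of each Prefix Information option */
    size_t n_prefixes;
};

/* Hypothetical result: what the client should do for this RA. */
struct ip6_plan {
    bool expect_slaac;  /* keep and wait for kernel-configured SLAAC addresses */
    bool run_dhcpv6;    /* start stateful DHCPv6 */
};

/* Decide each mechanism from its own flag, without letting one preempt the other. */
static struct ip6_plan plan_from_ra(const struct ra_info *ra)
{
    struct ip6_plan plan = { false, false };

    /* SLAAC is possible only if some prefix is marked autonomous. */
    for (size_t i = 0; i < ra->n_prefixes; i++)
        if (ra->prefix_flags[i] & ND_OPT_PI_FLAG_AUTO)
            plan.expect_slaac = true;

    /* Managed DHCPv6 is requested by the M flag alone. */
    plan.run_dhcpv6 = (ra->ra_flags & ND_RA_FLAG_MANAGED) != 0;
    return plan;
}
```

For an RA like the one from the ZyXEL CPE (M flag set, one autonomous prefix) this yields both expect_slaac and run_dhcpv6. For an RA with the M flag set and no autonomous prefix it yields run_dhcpv6 only, which is the case the patch broke.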

Tore

Comment 7 Fedora End Of Life 2012-08-07 16:02:15 UTC

This message is a notice that Fedora 15 is now at end of life. Fedora
has stopped maintaining and issuing updates for Fedora 15. It is
Fedora's policy to close all bug reports from releases that are no
longer maintained. At this time, all open bugs with a Fedora 'version'
of '15' have been closed as WONTFIX.

(Please note: Our normal process is to give advanced warning of this
occurring, but we forgot to do that. A thousand apologies.)

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, feel free to reopen
this bug and simply change the 'version' to a later Fedora version.

Bug Reporter: Thank you for reporting this issue and we are sorry that
we were unable to fix it before Fedora 15 reached end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora, you are encouraged to click on
"Clone This Bug" (top right of this page) and open it against that
version of Fedora.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

The process we are following is described here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
