Bug 1207730
Summary: | Continuous IPv6 router solicitation loop | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Petr Sklenar <psklenar> |
Component: | NetworkManager | Assignee: | Dan Williams <dcbw> |
Status: | CLOSED ERRATA | QA Contact: | Desktop QE <desktop-qa-list> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 7.1 | CC: | danw, dcbw, idonev, jklimes, lrintel, psklenar, rkhan, thaller, vbenes, vkadlcik |
Target Milestone: | rc | Keywords: | Regression |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Router Advertisements without DNS configuration can no longer cause Router Solicitations to be sent in quick succession.
|
Story Points: | --- |
Clone Of: | 1044757 | Environment: | |
Last Closed: | 2015-11-19 11:01:18 UTC | Type: | Bug |
Bug Depends On: | 1044757 | ||
Bug Blocks: |
Description
Petr Sklenar
2015-03-31 15:06:26 UTC
Adding the Regression keyword, as this wasn't happening before.

Could you attach the full /var/log/messages from that time? The partial log doesn't have information on the actual solicitations. Or "journalctl -b -u NetworkManager" might give only the NM logs, if you are worried about exposing too much other stuff.

Some analysis:

15:28:51 - the start of the solicitation loop
16:28:47 - end of the solicitation loop

<<< nm gets restarted a bunch >>>

17:17:17 - eth0 has IPv6 configuration, which is assumed:
    2620:52:0:2258:d836:d0ff:fe6e:5033/64 lft 2591939sec pref 604739sec
    fec0::f101:d836:d0ff:fe6e:5033/64 lft 78194sec pref 6194sec
    fe80::d836:d0ff:fe6e:5033/64 lft forever pref forever
17:17:17 - solicitation sent
17:17:17 - 1st RA received
    dhcp-level none
    gateway fe80::42f2:e9ff:fea0:8d33 pref 2 exp 1801
    address 2620:52:0:2258:d836:d0ff:fe6e:5033 exp 3601
    route 2620:52:0:2258::/64 via :: pref 0 exp 3601
    dns_server fec0::f101:42f2:e9ff:fea0:8d33 exp 7201
17:17:18 - 2nd RA; from a different router (::fe), advertising the same prefix, with a greatly increased lifetime, and no DNS servers given
    dhcp-level none
    gateway fe80:52:0:2258::fe pref 2 exp 1801
    address 2620:52:0:2258:d836:d0ff:fe6e:5033 exp 2592001
    route 2620:52:0:2258::/64 via :: pref 0 exp 2592001
17:20:19 - 3rd RA; from ::ffe, only the address has changed

<<< more RAs from ::ffe, but no solicitations are sent >>>

17:47:09 - gateway :8d33 finally times out
18:15:14 - RA received from ::ffe; the DNS servers from :8d33 are going to time out in 123 seconds
18:17:18 - the DNS servers from RA #1 have timed out, and a solicitation is sent
18:17:18 - RA received from :8d33; this includes new DNS servers and a new prefix with a much shorter lifetime than ::ffe, and since it's most recent, this new lifetime is preferred
18:17:18 - RA received from ::ffe, which updates the address lifetime to a huge value again
19:17:19 - DNS servers from :8d33 have timed out again, a solicitation is sent

This run looks fine so far.
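As a cross-check on the timeline above: assuming NM asks for a refresh once an RA option is past half its lifetime (the behaviour described in the refined theory later in this report), the dns_server entry from the 1st RA (received 17:17:17, exp 7201) becomes due at 18:17:17. That agrees with the "123 seconds" logged at 18:15:14 and sits one second before the solicitation logged at 18:17:18. A minimal C sketch of that arithmetic follows; the function names are hypothetical and are not NetworkManager API.

```c
#include <assert.h>
#include <stdio.h>

/* Seconds since midnight for a log timestamp hh:mm:ss. */
long to_seconds(int h, int m, int s)
{
    return h * 3600L + m * 60L + s;
}

/* Assumed rule: an RA option is due for refresh at half its lifetime. */
long refresh_due(long received, long lifetime)
{
    assert(received >= 0 && lifetime >= 0);
    return received + lifetime / 2;
}

/* Write t (seconds since midnight) as hh:mm:ss into buf. */
void format_hms(long t, char *buf, size_t len)
{
    snprintf(buf, len, "%02ld:%02ld:%02ld",
             (t / 3600) % 24, (t / 60) % 60, t % 60);
}
```

Calling format_hms(refresh_due(to_seconds(17, 17, 17), 7201), buf, sizeof buf) writes 18:17:17 into buf.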
----------

Conclusions:

- clearly there's something wrong with NM's RA option timeout handling here
- the network has two IPv6 routers, which is not odd. What *is* odd is that (a) one router never sends announcements on its own (:8d33), and (b) the one that does send periodic announcements (::ffe) advertises the same prefix as :8d33 but doesn't send the DNS option, and it uses a huge address lifetime

I'd love to know why IS/IT has this IPv6 router configuration, as it's bound to make things a bit confusing for clients. My guess is that duelling routers and possibly some lifetime=0 RAs triggered a bug in NM's timeout handling.

Petr, do you think you could reproduce the situation if I gave you a NetworkManager build with some IPv6 debugging info turned on?

(In reply to Dan Williams from comment #15)
> Petr, do you think you could reproduce the situation if I gave you a
> NetworkManager build with some IPv6 debugging info turned on?

I guess yes, what should I set up?

More refined theory:

1) Two routers send RAs.
2) One router sends DNS servers or domains, the other does not.
3) The first router stops responding to solicitations, but the second continues to respond.
4) When the DNS servers/domains reach 1/2 their lifetime, a solicitation is triggered to refresh them.
5) The second router responds, but since its RA does not include any DNS servers or domains, the DNS servers/searches aren't refreshed. In any case, check_timestamps() gets called in response to any RA to clean up any stale information.
6) check_timestamps() walks through the DNS servers/domains and since they are still past 1/2 their lifetime (since they weren't refreshed), another solicitation is sent.
7) go to step 5

In any case, to prove/disprove this theory, I've done a scratch build with RA debugging turned on:

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=8964737

Grab those RPMs, and "rpm -Fvh" them (not Uvh, so you don't install new RPMs that you don't already have installed).
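Steps 4 to 7 of the refined theory can be sketched as a small simulation. This is a simplified model under stated assumptions, not NetworkManager's code: the struct and function names are hypothetical, each loop pass stands for one timestamp check with an assumed one-second solicitation/RA round trip, and the min_interval guard is only an illustration of one way to break the loop, not necessarily what the upstream fix does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of DNS information learned from an RA. */
struct dns_entry {
    long timestamp; /* second at which the entry was last refreshed */
    long lifetime;  /* advertised lifetime in seconds */
};

/* Steps 4 and 6: past half the lifetime, but not yet expired. */
bool needs_refresh(const struct dns_entry *e, long now)
{
    return now >= e->timestamp + e->lifetime / 2
        && now < e->timestamp + e->lifetime;
}

/*
 * Count the solicitations sent in [start, end). Each iteration models one
 * timestamp check, one second apart (assumed round trip). The answering RA
 * carries no DNS option, so e->timestamp is never updated (step 5).
 * min_interval is a hypothetical guard; 0 disables it.
 */
long count_solicitations(const struct dns_entry *e, long start, long end,
                         long min_interval)
{
    long sent = 0;
    long last_rs = 0;
    bool have_sent = false;

    assert(e->lifetime > 0 && start <= end && min_interval >= 0);
    for (long now = start; now < end; now++) {
        if (!needs_refresh(e, now))
            continue;
        if (have_sent && now - last_rs < min_interval)
            continue; /* guard: too soon after the previous solicitation */
        sent++;
        last_rs = now;
        have_sent = true;
    }
    return sent;
}
```

With the 7201-second DNS lifetime from the log and no guard, this model sends a solicitation on every pass from the half-lifetime mark until expiry (3601 of them); with a guard interval of 7201 seconds it sends one.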
Could you run that build for a while and, if you notice a loop, grab the logs? Thanks!

The issue could also happen if the IPv6 router removes the DNS servers/domains from its RA, but keeps responding to solicitations.

Fixes posted upstream as dcbw/rh1207730-rdisc-fixes.

This is a rogue router:

    dhcp-level none
    gateway fe80::42f2:e9ff:fea0:8d33 pref 2 exp 1801
    address 2620:52:0:2258:d836:d0ff:fe6e:5033 exp 3601
    route 2620:52:0:2258::/64 via :: pref 0 exp 3601
    dns_server fec0::f101:42f2:e9ff:fea0:8d33 exp 7201

I am trying to figure out who owns this machine and why he is doing this. The other RA is from the IT network switch. We do not send DNS information.

Petr, any chance you've been able to test out the packages above and see if they fix the issue?

(In reply to Dan Williams from comment #22)
> Petr, any chance you've been able to test out the packages above and see if
> they fix the issue?

I am sorry that I didn't reply. I am on it right now; I'll let you know the results in a day or two.

(In reply to Dan Williams from comment #19)
> Fixes posted upstream as dcbw/rh1207730-rdisc-fixes.

>> rdisc: split fake & linux test code; add testcases

Could you refactor the test files with the main() function to use nm-test-utils.h and NMTST_DEFINE()?

Pushed a fixup for that.

> rdisc: fix double-addition of gateways & routes if priority increases

this seems right:

    +    if (!new->lifetime)
    +        return FALSE;
         g_array_insert_val (rdisc->addresses, i, *new);
         return TRUE;

this seems wrong:

    +    if (new->lifetime)
    +        g_array_insert_val (rdisc->routes, CLAMP (insert_idx, 0, G_MAXINT), *new);
         return TRUE;

(at several places)

The linux test fails for me:

    $ sudo NMTST_DEBUG=debug src/rdisc/tests/test-rdisc-linux
    (src/rdisc/tests/test-rdisc-linux:32513): NetworkManager-WARNING **: <error> [1430239935.402838] [rdisc/nm-lndp-rdisc.c:68] send_rs(): (lo): cannot send router solicitation: -101.

And the test-rdisc-fake test takes painfully long. Can we reduce the timeouts there?
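The review point about the two fragments turns on what the return value means. A minimal C sketch of one consistent convention follows. It is not NetworkManager's code and rests on labelled assumptions: the return value means "the list changed", an advertisement with lifetime 0 withdraws a known route, and a hypothetical fixed-size list with a made-up prefix_id key stands in for the GArray of routes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_ROUTES 8

/* Hypothetical stand-in for a route learned from an RA. */
struct route {
    int prefix_id;         /* stand-in key for network/prefix length */
    unsigned int lifetime; /* seconds; 0 means "withdraw this route" */
};

struct route_list {
    struct route items[MAX_ROUTES];
    size_t len;
};

/* Apply one advertised route; return true only if the list changed. */
bool apply_route(struct route_list *list, const struct route *new_route)
{
    assert(list != NULL && new_route != NULL);
    for (size_t i = 0; i < list->len; i++) {
        if (list->items[i].prefix_id != new_route->prefix_id)
            continue;
        if (new_route->lifetime == 0) {
            /* Withdrawal of a known route: remove it, report a change. */
            memmove(&list->items[i], &list->items[i + 1],
                    (list->len - i - 1) * sizeof(list->items[0]));
            list->len--;
            return true;
        }
        if (list->items[i].lifetime == new_route->lifetime)
            return false; /* same data as before, nothing changed */
        list->items[i].lifetime = new_route->lifetime;
        return true;
    }
    /* Unknown route with lifetime 0 (or a full list): nothing is added,
     * so no change is reported. */
    if (new_route->lifetime == 0 || list->len == MAX_ROUTES)
        return false;
    list->items[list->len++] = *new_route;
    return true;
}
```

Under this convention, a lifetime=0 advertisement for a route that was never stored is a no-op and reports no change, which is the case the second quoted fragment would have reported as a change.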
(In reply to Thomas Haller from comment #24)
> (In reply to Dan Williams from comment #19)
> > Fixes posted upstream as dcbw/rh1207730-rdisc-fixes.
>
> >> rdisc: split fake & linux test code; add testcases
>
> could you refactor the test files with the main() function to use
> nm-test-utils.h and NMTST_DEFINE()?
>
> Pushed a fixup for that.

Looks good, thanks.

> > rdisc: fix double-addition of gateways & routes if priority increases
>
> this seems right:
>
> +    if (!new->lifetime)
> +        return FALSE;
>      g_array_insert_val (rdisc->addresses, i, *new);
>      return TRUE;
>
> this seems wrong:
>
> +    if (new->lifetime)
> +        g_array_insert_val (rdisc->routes, CLAMP (insert_idx, 0, G_MAXINT), *new);
>      return TRUE;
>
> (at several places)

Yeah, true. That also means that we have to add some code to return TRUE if the lifetime=0 in a few places, which I've done in a fixup. In half of the functions both removal and addition would have ended at the bottom here, so I made it consistent in the fixup so that removal always returns TRUE.

> The linux test fails for me:
>
> $ sudo NMTST_DEBUG=debug src/rdisc/tests/test-rdisc-linux
> (src/rdisc/tests/test-rdisc-linux:32513): NetworkManager-WARNING **: <error>
> [1430239935.402838] [rdisc/nm-lndp-rdisc.c:68] send_rs(): (lo): cannot send
> router solicitation: -101.

ENETUNREACH; I think that's because 'lo' doesn't actually have a valid IPv6LL address, so the packets can't go anywhere. I'm not sure there's a lot we can do about that, since 'lo' doesn't get an IPv6LL address ever. Running the test on a real ethernet interface with an IPv6LL address works though. Thoughts?

> And the test-rdisc-fake test takes painfully long. Can we reduce the
> timeouts there?

I've reduced the timeouts on the first three tests, but I don't want to reduce it too much for the solicitation loop test because I want to make sure that it still works correctly even on a more loaded machine. In any case, the tests now take 20 seconds instead of much more.
Is that OK? Repushed.

The branch looks good to me and all tests pass.

Merged upstream to git master and nm-1-0.

The following commits from nm-1-0 should be backported to 7.2:

5955f82f01f2da4b0107be6bc4f7c65d26d77b24 rdisc: add missing chain up to parent finalize/dispose
0be367846614de8e9678934e7be7c70871f7c983 rdisc: move most RA processing logic into base class
272943db4ba28a1681282332861d273e4bcc6e13 rdisc: fix leak of DNS domains
39fd8f7683d9bbda09a3b69ef7a3e7927b96b851 rdisc: split fake & linux test code; add testcases
415b7b3e257c5d5526618b0d6a89a6d7fb235f98 rdisc: fix double-addition of gateways & routes if priority increases
d96b05bd364a334f20b10f5322aabf01ba28423d rdisc: prevent solicitation loop for expiring DNS information (rh #1207730) (rh #1151665)

I am not able to reproduce this using the configuration from comment #0, commenting out the RDNSS and DNSSL sections, then killing the radvd daemon and restarting it. After an hour there is still only a small number of solicitations from NM. No flood seen.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-2315.html