The Fedora QA team uses an automated test system called openQA: https://openqa.fedoraproject.org/ what openQA basically does is run virtual machines - using qemu user-mode networking - and then...test stuff in them. Most of the tests are installer tests. Every so often, a test which uses a network repository will fail because the network was not correctly brought up. anaconda fails to configure the network repository, and when openQA tries to send some log files out from the VM (by uploading them to a web server it runs for this purpose), it cannot reach the network either. I've been debugging this by creating a special openQA test which simply boots the installer, tries to upload a log file (to see if the network is working), and dumps out a bunch of diagnostic info via the serial console if it hits the bug. I've been running this test over and over, 8 runs at a time. It seems like we hit the failure approximately once in every 30 or 40 runs of the test. From the diagnostic info we got out, it seems like what happens is that NetworkManager tries to configure the interface via DHCP, but decides the DHCP request failed after a 45 second timeout. We tried a couple of variations on the test. First we modified the test so NetworkManager uses its internal DHCP client rather than dhclient (by using an anaconda updates.img which adds an NM config file to specify that setting). I tried ~100 runs with that configuration and did not get a single failure. Second we modified the test to boot to dracut's emergency shell, rather than anaconda, and run 'dhclient ens4' directly from the shell. Again I tried ~100 runs of that, and it successfully brought up the network every time. So it seems like this only happens with NM+dhclient. Finally, we got a failed run with as much diagnostic info as we could: https://openqa.stg.fedoraproject.org/tests/37112 all the info is in the serial console log: https://openqa.stg.fedoraproject.org/tests/37112/file/serial0.txt under IP ADDR is the output of 'ip addr'. Under ROUTE is the output of 'route'. From these you can see that the interface has no IPv4 address and there is no default route. Under JOURNAL is the output of 'journalctl --no-pager', and under SYSLOG is the contents of /tmp/syslog , where some messages which don't appear to reach the journal show up. Crucially, in /tmp/syslog , we can see that there is actually what looks like a perfectly successful dhclient session: 22:17:15,711 INFO dhclient: Created duid "\000\004\307\326\2422\011\036N\246\217N\251 \030\353\023\313". 22:17:15,711 INFO dhclient: DHCPDISCOVER on ens4 to 255.255.255.255 port 67 interval 3 (xid=0x6ad3f212) 22:17:15,712 INFO dhclient: DHCPREQUEST on ens4 to 255.255.255.255 port 67 (xid=0x6ad3f212) 22:17:15,712 INFO dhclient: DHCPOFFER from 10.0.2.2 22:17:15,712 INFO dhclient: DHCPACK from 10.0.2.2 (xid=0x6ad3f212) 22:17:15,721 INFO dhclient: bound to 10.0.2.15 -- renewal in 38317 seconds. This is indeed right where NM ran dhclient - you can see that from the journal NM logs: Sep 02 22:17:15 localhost NetworkManager[1600]: <debug> [1472854635.4244] dhcp4 (ens4): running: /sbin/dhclient -d -q -sf /usr/libexec/nm-dhcp-helper -pf /var/run/dhclient-ens4.pid -lf /var/lib/NetworkManager/dhclient-a84395a1-4721-30e0-9fbb-e422b26ac329-ens4.lease -cf /var/lib/NetworkManager/dhclient-ens4.conf ens4 Sep 02 22:17:15 localhost NetworkManager[1600]: <info> [1472854635.4387] dhcp4 (ens4): dhclient started with pid 1681 however, it seems like NM never realizes that dhclient worked, and instead it gives up and kills the dhclient process 45 seconds later: Sep 02 22:18:00 localhost NetworkManager[1600]: [0;1;39m<warn> [1472854680.5422] dhcp4 (ens4): request timed out[0m Sep 02 22:18:00 localhost NetworkManager[1600]: <info> [1472854680.5428] dhcp4 (ens4): state changed unknown -> timeout Sep 02 22:18:00 localhost NetworkManager[1600]: <debug> [1472854680.5428] device[0x56481bc0f9e0] (ens4): new DHCPv4 client state 2 Sep 02 22:18:00 localhost NetworkManager[1600]: <debug> [1472854680.5428] device[0x56481bc0f9e0] (ens4): DHCPv4 failed: timeout 1, num tries left 3 Sep 02 22:18:00 localhost NetworkManager[1600]: <debug> [1472854680.5428] device[0x56481bc0f9e0] (ens4): remove_pending_action (0): 'dhcp4' Sep 02 22:18:00 localhost NetworkManager[1600]: <info> [1472854680.5428] manager: startup complete Sep 02 22:18:00 localhost NetworkManager[1600]: <debug> [1472854680.5444] kill child process 'dhcp-client-ens4' (1681): waiting up to 500 milliseconds for process to terminate normally after sending SIGTERM (15)... Sep 02 22:18:00 localhost NetworkManager[1600]: <debug> [1472854680.5528] kill child process 'dhcp-client-ens4' (1681): after sending SIGTERM (15), process 1681 exited by signal 15 (8416 usec elapsed) Sep 02 22:18:00 localhost NetworkManager[1600]: <info> [1472854680.5533] dhcp4 (ens4): canceled DHCP transaction, DHCP client pid 1681 Sep 02 22:18:00 localhost NetworkManager[1600]: <info> [1472854680.5533] dhcp4 (ens4): state changed timeout -> done so the problem is somewhere in NM's communication with its dhclient process, I guess. thaller is looking into this at present. This bug seems to occur in both Fedora 25 and Fedora Rawhide tests. The failures have been occurring for some time now, certainly we have failures with the git20160621.072358da snapshot. I don't believe we've actually hit this failure with the final 1.4.0 yet, but it's only been in Rawhide a few days and is not in F25 yet, so that may just be blind luck. I'll run my reproducer test on a recent Rawhide image and confirm whether I can trigger the bug there too.
issue in our Fedora QA issue tracker - https://phab.qadevel.cloud.fedoraproject.org/T836
OK, yeah, can confirm I can trigger the failure with an image from today's Rawhide compose as well: https://openqa.stg.fedoraproject.org/tests/37129 https://openqa.stg.fedoraproject.org/tests/37129/file/serial0.txt looks like it happens just the same. That image would have NetworkManager 1.4.0-2.fc26 in it.
In the NM log: <debug> [1472854635.6994] bus-manager: (dhcp) accepted connection 0x56481bb96650 on private socket <debug> [1472854635.6999] dhcp4 (ens4): unmapped DHCP state 'PREINIT' <debug> [1472854635.7002] dhcp4 (ens4): DHCP reason 'PREINIT' -> state 'unknown' <debug> [1472854635.7004] bus-manager: (dhcp) closed connection 0x56481bb96650 on private socket <debug> [1472854635.7194] bus-manager: (dhcp) accepted connection 0x56481bb96750 on private socket <debug> [1472854635.7221] bus-manager: (dhcp) closed connection 0x56481bb96750 on private socket it seems like dhclient connected to NetworkManager but nothing happens. So, NMBusManager receives the new private connection and emits PRIVATE_CONNECTION_NEW signal (https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/nm-bus-manager.c?id=673282ece652a2b8d55892f3ee2b78e67b652b65#n204) Then NMDhcpListener subscribes to the "Event" signal: https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/dhcp-manager/nm-dhcp-listener.c?id=673282ece652a2b8d55892f3ee2b78e67b652b65#n148 I wonder what guarantees here that we don't miss the Event before subscribing to the signal? Especially, because nm-dhcp-helper just works right through: https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/dhcp-manager/nm-dhcp-helper.c?id=673282ece652a2b8d55892f3ee2b78e67b652b65#n100
thaller asked me to try with a custom NM build which adds a sleep to try and work around what he thinks is the problem: http://koji.fedoraproject.org/koji/taskinfo?taskID=15474635 I've tried building and using an updates.img with those packages included and it takes effect, but network init seems to fail completely every time. Not sure if there's a bug in his patch or if it's a consequence of trying to update NM with an updates.img . I will see if I can build a new installer image with the updated NM build included instead.
OK, so I built a couple of netinst ISOs locally - one as a baseline, to make sure the issue still occurred with a self-built ISO, and one with that build of NM in it. And the results from that look hopeful. The baseline ISO does still reproduce the bug. And I've done >100 runs with the patched NM build and not hit the bug one time. So I think Thomas is on the right track.
fix on review: https://cgit.freedesktop.org/NetworkManager/NetworkManager/log/?h=th/dhcp-helper-sync-notify-rh1372854
> dhcp-helper: refactor logging to use logging macros +#define _NMLOG(always_enabled, level, ...) \ + G_STMT_START { \ + if ((always_enabled) || _NMLOG_ENABLED (level)) { \ + GTimeVal _tv; \ + \ + g_get_current_time (&_tv); \ + g_print ("nm-dhcp-helper[%ld] %-7s [%ld.%04ld] " _NM_UTILS_MACRO_FIRST (__VA_ARGS__) "\n", \ + (long) getpid (), \ + nm_utils_syslog_to_str (level), \ + (long) _tv.tv_sec, (long) (_tv.tv_sec, (_tv.tv_usec + 50) / 100) \ Here there is an extra argument, and casts are not required: _tv.tv_sec, (_tv.tv_usec + 50) / 100 \ Also, _tv.tv_usec is guaranteed to be < 10^6, but (_tv.tv_usec + 50) is not, and can be displayed incorrectly (5 digits); the same applies also to _nm_log_impl() in nm-logging.c The rest LGTM.
(In reply to Beniamino Galvani from comment #7) > > dhcp-helper: refactor logging to use logging macros fixed and repushed
(In reply to Thomas Haller from comment #8) > (In reply to Beniamino Galvani from comment #7) > > > dhcp-helper: refactor logging to use logging macros > > fixed and repushed LGTM
this would be a build for f25 with the suggested fix for testing http://koji.fedoraproject.org/koji/taskinfo?taskID=15514439
The issue seems to be present since 2007. Strange that it seems to happen now as frequently as 1/25 tries. It seems something during the boot of the live-cd is special. Or maybe the failure rate is just too low for regular users to get a hold on the issue (also, because NM will retry DHCP after timeout). Anyway. I think this is the right fix and tests correctly for me.
Adam, any chance to re-test with build from comment 10?
yep, I can run a few tests with that build for sure. I suspect this is one of those things where a regular user would just go 'huh, that's odd', reboot, find it works OK and not really think more about it, or be able to report a useful bug. You kinda need something 'industrial scale' like openQA to a) be really annoyed by it and b) be able to debug it.
th/dhcp-helper-sync-notify-rh1372854 branch looks good to me.
sorry, i didn't get time to do this today, was working on another annoying openqa issue all day. will do it tomorrow am.
OK, I ran my test for this 120 times with the new NM, no failure. Fix looks good.
@Adam, thank you for confirming. Very helpful.
branch merged upstream: master: https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=42519abdbf6c9778a9f9421e704181b22d4df684 nm-1-4: https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=0b182d0f450e7caac85096bddbeeef65bdabdd6a
if you could do f25 and rawhide builds (and an update for f25) that'd be awesome - I'd love to not have to worry about this problem on openQA any more :)
it's already in the making...
NetworkManager-1.4.0-3.fc25 has been submitted as an update to Fedora 25. https://bodhi.fedoraproject.org/updates/FEDORA-2016-ca1b48f89e
Thanks a lot! While you're fixing things, feel like taking a swing at https://bugzilla.redhat.com/show_bug.cgi?id=1374090 ? :)
NetworkManager-1.4.0-3.fc25 has been pushed to the Fedora 25 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-ca1b48f89e
NetworkManager-1.4.0-3.fc25 has been pushed to the Fedora 25 stable repository. If problems still persist, please make note of it in this bug report.