Bug 1372854 - Occasional failure to correctly bring up network via dhclient in anaconda
Summary: Occasional failure to correctly bring up network via dhclient in anaconda
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: NetworkManager
Version: 25
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: ---
Assignee: Thomas Haller
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 1373276
TreeView+ depends on / blocked
 
Reported: 2016-09-02 22:43 UTC by Adam Williamson
Modified: 2016-09-09 21:46 UTC (History)
7 users (show)

Fixed In Version: NetworkManager-1.4.0-3.fc25
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1373276 (view as bug list)
Environment:
Last Closed: 2016-09-09 21:46:31 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)

Description Adam Williamson 2016-09-02 22:43:40 UTC
The Fedora QA team uses an automated test system called openQA:

https://openqa.fedoraproject.org/

what openQA basically does is run virtual machines - using qemu user-mode networking - and then...test stuff in them. Most of the tests are installer tests.

Every so often, a test which uses a network repository will fail because the network was not correctly brought up. anaconda fails to configure the network repository, and when openQA tries to send some log files out from the VM (by uploading them to a web server it runs for this purpose), it cannot reach the network either.

I've been debugging this by creating a special openQA test which simply boots the installer, tries to upload a log file (to see if the network is working), and dumps out a bunch of diagnostic info via the serial console if it hits the bug. I've been running this test over and over, 8 runs at a time. It seems like we hit the failure approximately once in every 30 or 40 runs of the test.

From the diagnostic info we got out, it seems like what happens is that NetworkManager tries to configure the interface via DHCP, but decides the DHCP request failed after a 45 second timeout.

We tried a couple of variations on the test. First we modified the test so NetworkManager uses its internal DHCP client rather than dhclient (by using an anaconda updates.img which adds an NM config file to specify that setting). I tried ~100 runs with that configuration and did not get a single failure.

Second we modified the test to boot to dracut's emergency shell, rather than anaconda, and run 'dhclient ens4' directly from the shell. Again I tried ~100 runs of that, and it successfully brought up the network every time.

So it seems like this only happens with NM+dhclient. Finally, we got a failed run with as much diagnostic info as we could:

https://openqa.stg.fedoraproject.org/tests/37112

all the info is in the serial console log:

https://openqa.stg.fedoraproject.org/tests/37112/file/serial0.txt

under IP ADDR is the output of 'ip addr'. Under ROUTE is the output of 'route'. From these you can see that the interface has no IPv4 address and there is no default route. Under JOURNAL is the output of 'journalctl --no-pager', and under SYSLOG is the contents of /tmp/syslog , where some messages which don't appear to reach the journal show up.

Crucially, in /tmp/syslog , we can see that there is actually what looks like a perfectly successful dhclient session:

22:17:15,711 INFO dhclient: Created duid "\000\004\307\326\2422\011\036N\246\217N\251 \030\353\023\313".
22:17:15,711 INFO dhclient: DHCPDISCOVER on ens4 to 255.255.255.255 port 67 interval 3 (xid=0x6ad3f212)
22:17:15,712 INFO dhclient: DHCPREQUEST on ens4 to 255.255.255.255 port 67 (xid=0x6ad3f212)
22:17:15,712 INFO dhclient: DHCPOFFER from 10.0.2.2
22:17:15,712 INFO dhclient: DHCPACK from 10.0.2.2 (xid=0x6ad3f212)
22:17:15,721 INFO dhclient: bound to 10.0.2.15 -- renewal in 38317 seconds.

This is indeed right where NM ran dhclient - you can see that from the journal NM logs:

Sep 02 22:17:15 localhost NetworkManager[1600]: <debug> [1472854635.4244] dhcp4 (ens4): running: /sbin/dhclient -d -q -sf /usr/libexec/nm-dhcp-helper -pf /var/run/dhclient-ens4.pid -lf /var/lib/NetworkManager/dhclient-a84395a1-4721-30e0-9fbb-e422b26ac329-ens4.lease -cf /var/lib/NetworkManager/dhclient-ens4.conf ens4
Sep 02 22:17:15 localhost NetworkManager[1600]: <info>  [1472854635.4387] dhcp4 (ens4): dhclient started with pid 1681

however, it seems like NM never realizes that dhclient worked, and instead it gives up and kills the dhclient process 45 seconds later:

Sep 02 22:18:00 localhost NetworkManager[1600]: [0;1;39m<warn>  [1472854680.5422] dhcp4 (ens4): request timed out[0m
Sep 02 22:18:00 localhost NetworkManager[1600]: <info>  [1472854680.5428] dhcp4 (ens4): state changed unknown -> timeout
Sep 02 22:18:00 localhost NetworkManager[1600]: <debug> [1472854680.5428] device[0x56481bc0f9e0] (ens4): new DHCPv4 client state 2
Sep 02 22:18:00 localhost NetworkManager[1600]: <debug> [1472854680.5428] device[0x56481bc0f9e0] (ens4): DHCPv4 failed: timeout 1, num tries left 3
Sep 02 22:18:00 localhost NetworkManager[1600]: <debug> [1472854680.5428] device[0x56481bc0f9e0] (ens4): remove_pending_action (0): 'dhcp4'
Sep 02 22:18:00 localhost NetworkManager[1600]: <info>  [1472854680.5428] manager: startup complete
Sep 02 22:18:00 localhost NetworkManager[1600]: <debug> [1472854680.5444] kill child process 'dhcp-client-ens4' (1681): waiting up to 500 milliseconds for process to terminate normally after sending SIGTERM (15)...
Sep 02 22:18:00 localhost NetworkManager[1600]: <debug> [1472854680.5528] kill child process 'dhcp-client-ens4' (1681): after sending SIGTERM (15), process 1681 exited by signal 15 (8416 usec elapsed)
Sep 02 22:18:00 localhost NetworkManager[1600]: <info>  [1472854680.5533] dhcp4 (ens4): canceled DHCP transaction, DHCP client pid 1681
Sep 02 22:18:00 localhost NetworkManager[1600]: <info>  [1472854680.5533] dhcp4 (ens4): state changed timeout -> done

so the problem is somewhere in NM's communication with its dhclient process, I guess. thaller is looking into this at present.

This bug seems to occur in both Fedora 25 and Fedora Rawhide tests. The failures have been occurring for some time now, certainly we have failures with the git20160621.072358da snapshot. I don't believe we've actually hit this failure with the final 1.4.0 yet, but it's only been in Rawhide a few days and is not in F25 yet, so that may just be blind luck. I'll run my reproducer test on a recent Rawhide image and confirm whether I can trigger the bug there too.

Comment 1 Adam Williamson 2016-09-02 22:44:06 UTC
issue in our Fedora QA issue tracker - https://phab.qadevel.cloud.fedoraproject.org/T836

Comment 2 Adam Williamson 2016-09-02 22:54:29 UTC
OK, yeah, can confirm I can trigger the failure with an image from today's Rawhide compose as well:

https://openqa.stg.fedoraproject.org/tests/37129
https://openqa.stg.fedoraproject.org/tests/37129/file/serial0.txt

looks like it happens just the same. That image would have NetworkManager 1.4.0-2.fc26 in it.

Comment 3 Thomas Haller 2016-09-02 22:59:26 UTC
In the NM log:


<debug> [1472854635.6994] bus-manager: (dhcp) accepted connection 0x56481bb96650 on private socket
<debug> [1472854635.6999] dhcp4 (ens4): unmapped DHCP state 'PREINIT'
<debug> [1472854635.7002] dhcp4 (ens4): DHCP reason 'PREINIT' -> state 'unknown'
<debug> [1472854635.7004] bus-manager: (dhcp) closed connection 0x56481bb96650 on private socket
<debug> [1472854635.7194] bus-manager: (dhcp) accepted connection 0x56481bb96750 on private socket
<debug> [1472854635.7221] bus-manager: (dhcp) closed connection 0x56481bb96750 on private socket


it seems like dhclient connected to NetworkManager but nothing happens.




So, NMBusManager receives the new private connection and emits PRIVATE_CONNECTION_NEW signal (https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/nm-bus-manager.c?id=673282ece652a2b8d55892f3ee2b78e67b652b65#n204)

Then NMDhcpListener subscribes to the "Event" signal:
https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/dhcp-manager/nm-dhcp-listener.c?id=673282ece652a2b8d55892f3ee2b78e67b652b65#n148


I wonder what guarantees here that we don't miss the Event before subscribing to the signal? Especially, because nm-dhcp-helper just works right through:
https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/dhcp-manager/nm-dhcp-helper.c?id=673282ece652a2b8d55892f3ee2b78e67b652b65#n100

Comment 4 Adam Williamson 2016-09-02 23:48:28 UTC
thaller asked me to try with a custom NM build which adds a sleep to try and work around what he thinks is the problem:

http://koji.fedoraproject.org/koji/taskinfo?taskID=15474635

I've tried building and using an updates.img with those packages included and it takes effect, but network init seems to fail completely every time. Not sure if there's a bug in his patch or if it's a consequence of trying to update NM with an updates.img . I will see if I can build a new installer image with the updated NM build included instead.

Comment 5 Adam Williamson 2016-09-03 01:35:05 UTC
OK, so I built a couple of netinst ISOs locally - one as a baseline, to make sure the issue still occurred with a self-built ISO, and one with that build of NM in it. And the results from that look hopeful. The baseline ISO does still reproduce the bug. And I've done >100 runs with the patched NM build and not hit the bug one time. So I think Thomas is on the right track.

Comment 7 Beniamino Galvani 2016-09-06 08:25:01 UTC
> dhcp-helper: refactor logging to use logging macros

+#define _NMLOG(always_enabled, level, ...) \
+       G_STMT_START { \
+               if ((always_enabled) || _NMLOG_ENABLED (level)) { \
+                       GTimeVal _tv; \
+                       \
+                       g_get_current_time (&_tv); \
+                       g_print ("nm-dhcp-helper[%ld] %-7s [%ld.%04ld] " _NM_UTILS_MACRO_FIRST (__VA_ARGS__) "\n", \
+                                (long) getpid (), \
+                                nm_utils_syslog_to_str (level), \
+                                (long) _tv.tv_sec, (long) (_tv.tv_sec, (_tv.tv_usec + 50) / 100) \

Here there is an extra argument, and casts are not required:

                                 _tv.tv_sec, (_tv.tv_usec + 50) / 100 \

Also, _tv.tv_usec is guaranteed to be < 10^6, but (_tv.tv_usec + 50)
is not,	and can be displayed incorrectly (5 digits); the same applies
also to _nm_log_impl() in nm-logging.c

The rest LGTM.

Comment 8 Thomas Haller 2016-09-06 08:59:15 UTC
(In reply to Beniamino Galvani from comment #7)
> > dhcp-helper: refactor logging to use logging macros

fixed and repushed

Comment 9 Beniamino Galvani 2016-09-06 09:10:52 UTC
(In reply to Thomas Haller from comment #8)
> (In reply to Beniamino Galvani from comment #7)
> > > dhcp-helper: refactor logging to use logging macros
> 
> fixed and repushed

LGTM

Comment 10 Thomas Haller 2016-09-06 09:32:44 UTC
this would be a build for f25 with the suggested fix for testing http://koji.fedoraproject.org/koji/taskinfo?taskID=15514439

Comment 11 Thomas Haller 2016-09-06 11:45:24 UTC
The issue seems to be present since 2007. Strange that it seems to happen now as frequently as 1/25 tries. It seems something during the boot of the live-cd is special.
Or maybe the failure rate is just too low for regular users to get a hold on the issue (also, because NM will retry DHCP after timeout).

Anyway. I think this is the right fix and tests correctly for me.

Comment 12 Thomas Haller 2016-09-06 11:46:36 UTC
Adam, any chance to re-test with build from comment 10?

Comment 13 Adam Williamson 2016-09-06 16:58:12 UTC
yep, I can run a few tests with that build for sure.

I suspect this is one of those things where a regular user would just go 'huh, that's odd', reboot, find it works OK and not really think more about it, or be able to report a useful bug. You kinda need something 'industrial scale' like openQA to a) be really annoyed by it and b) be able to debug it.

Comment 14 Francesco Giudici 2016-09-06 17:44:23 UTC
th/dhcp-helper-sync-notify-rh1372854 branch looks good to me.

Comment 15 Adam Williamson 2016-09-07 06:42:00 UTC
sorry, i didn't get time to do this today, was working on another annoying openqa issue all day. will do it tomorrow am.

Comment 16 Adam Williamson 2016-09-07 21:43:57 UTC
OK, I ran my test for this 120 times with the new NM, no failure. Fix looks good.

Comment 17 Thomas Haller 2016-09-07 22:20:25 UTC
@Adam, thank you for confirming. Very helpful.

Comment 19 Adam Williamson 2016-09-07 22:31:30 UTC
if you could do f25 and rawhide builds (and an update for f25) that'd be awesome - I'd love to not have to worry about this problem on openQA any more :)

Comment 20 Thomas Haller 2016-09-07 22:48:47 UTC
it's already in the making...

Comment 21 Fedora Update System 2016-09-07 23:46:06 UTC
NetworkManager-1.4.0-3.fc25 has been submitted as an update to Fedora 25. https://bodhi.fedoraproject.org/updates/FEDORA-2016-ca1b48f89e

Comment 22 Adam Williamson 2016-09-07 23:54:18 UTC
Thanks a lot! While you're fixing things, feel like taking a swing at https://bugzilla.redhat.com/show_bug.cgi?id=1374090 ? :)

Comment 23 Fedora Update System 2016-09-09 06:25:45 UTC
NetworkManager-1.4.0-3.fc25 has been pushed to the Fedora 25 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-ca1b48f89e

Comment 24 Fedora Update System 2016-09-09 21:46:26 UTC
NetworkManager-1.4.0-3.fc25 has been pushed to the Fedora 25 stable repository. If problems still persist, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.