Bug 1372854
| Summary: | Occasional failure to correctly bring up network via dhclient in anaconda | |||
|---|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Adam Williamson <awilliam> | |
| Component: | NetworkManager | Assignee: | Thomas Haller <thaller> | |
| Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | |
| Severity: | high | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 25 | CC: | awilliam, bgalvani, dcbw, fgiudici, lkundrak, psimerda, thaller | |
| Target Milestone: | --- | |||
| Target Release: | --- | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | NetworkManager-1.4.0-3.fc25 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1373276 (view as bug list) | Environment: | ||
| Last Closed: | 2016-09-09 21:46:31 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1373276 | |||
|
Description
Adam Williamson
2016-09-02 22:43:40 UTC
issue in our Fedora QA issue tracker - https://phab.qadevel.cloud.fedoraproject.org/T836 OK, yeah, can confirm I can trigger the failure with an image from today's Rawhide compose as well: https://openqa.stg.fedoraproject.org/tests/37129 https://openqa.stg.fedoraproject.org/tests/37129/file/serial0.txt looks like it happens just the same. That image would have NetworkManager 1.4.0-2.fc26 in it. In the NM log: <debug> [1472854635.6994] bus-manager: (dhcp) accepted connection 0x56481bb96650 on private socket <debug> [1472854635.6999] dhcp4 (ens4): unmapped DHCP state 'PREINIT' <debug> [1472854635.7002] dhcp4 (ens4): DHCP reason 'PREINIT' -> state 'unknown' <debug> [1472854635.7004] bus-manager: (dhcp) closed connection 0x56481bb96650 on private socket <debug> [1472854635.7194] bus-manager: (dhcp) accepted connection 0x56481bb96750 on private socket <debug> [1472854635.7221] bus-manager: (dhcp) closed connection 0x56481bb96750 on private socket it seems like dhclient connected to NetworkManager but nothing happens. So, NMBusManager receives the new private connection and emits PRIVATE_CONNECTION_NEW signal (https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/nm-bus-manager.c?id=673282ece652a2b8d55892f3ee2b78e67b652b65#n204) Then NMDhcpListener subscribes to the "Event" signal: https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/dhcp-manager/nm-dhcp-listener.c?id=673282ece652a2b8d55892f3ee2b78e67b652b65#n148 I wonder what guarantees here that we don't miss the Event before subscribing to the signal? Especially, because nm-dhcp-helper just works right through: https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/dhcp-manager/nm-dhcp-helper.c?id=673282ece652a2b8d55892f3ee2b78e67b652b65#n100 thaller asked me to try with a custom NM build which adds a sleep to try and work around what he thinks is the problem: http://koji.fedoraproject.org/koji/taskinfo?taskID=15474635 I've tried building and using an updates.img with those packages included and it takes effect, but network init seems to fail completely every time. Not sure if there's a bug in his patch or if it's a consequence of trying to update NM with an updates.img . I will see if I can build a new installer image with the updated NM build included instead. OK, so I built a couple of netinst ISOs locally - one as a baseline, to make sure the issue still occurred with a self-built ISO, and one with that build of NM in it. And the results from that look hopeful. The baseline ISO does still reproduce the bug. And I've done >100 runs with the patched NM build and not hit the bug one time. So I think Thomas is on the right track. fix on review: https://cgit.freedesktop.org/NetworkManager/NetworkManager/log/?h=th/dhcp-helper-sync-notify-rh1372854 > dhcp-helper: refactor logging to use logging macros
+#define _NMLOG(always_enabled, level, ...) \
+ G_STMT_START { \
+ if ((always_enabled) || _NMLOG_ENABLED (level)) { \
+ GTimeVal _tv; \
+ \
+ g_get_current_time (&_tv); \
+ g_print ("nm-dhcp-helper[%ld] %-7s [%ld.%04ld] " _NM_UTILS_MACRO_FIRST (__VA_ARGS__) "\n", \
+ (long) getpid (), \
+ nm_utils_syslog_to_str (level), \
+ (long) _tv.tv_sec, (long) (_tv.tv_sec, (_tv.tv_usec + 50) / 100) \
Here there is an extra argument, and casts are not required:
_tv.tv_sec, (_tv.tv_usec + 50) / 100 \
Also, _tv.tv_usec is guaranteed to be < 10^6, but (_tv.tv_usec + 50)
is not, and can be displayed incorrectly (5 digits); the same applies
also to _nm_log_impl() in nm-logging.c
The rest LGTM.
(In reply to Beniamino Galvani from comment #7) > > dhcp-helper: refactor logging to use logging macros fixed and repushed (In reply to Thomas Haller from comment #8) > (In reply to Beniamino Galvani from comment #7) > > > dhcp-helper: refactor logging to use logging macros > > fixed and repushed LGTM this would be a build for f25 with the suggested fix for testing http://koji.fedoraproject.org/koji/taskinfo?taskID=15514439 The issue seems to be present since 2007. Strange that it seems to happen now as frequently as 1/25 tries. It seems something during the boot of the live-cd is special. Or maybe the failure rate is just too low for regular users to get a hold on the issue (also, because NM will retry DHCP after timeout). Anyway. I think this is the right fix and tests correctly for me. Adam, any chance to re-test with build from comment 10? yep, I can run a few tests with that build for sure. I suspect this is one of those things where a regular user would just go 'huh, that's odd', reboot, find it works OK and not really think more about it, or be able to report a useful bug. You kinda need something 'industrial scale' like openQA to a) be really annoyed by it and b) be able to debug it. th/dhcp-helper-sync-notify-rh1372854 branch looks good to me. sorry, i didn't get time to do this today, was working on another annoying openqa issue all day. will do it tomorrow am. OK, I ran my test for this 120 times with the new NM, no failure. Fix looks good. @Adam, thank you for confirming. Very helpful. branch merged upstream: master: https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=42519abdbf6c9778a9f9421e704181b22d4df684 nm-1-4: https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=0b182d0f450e7caac85096bddbeeef65bdabdd6a if you could do f25 and rawhide builds (and an update for f25) that'd be awesome - I'd love to not have to worry about this problem on openQA any more :) it's already in the making... NetworkManager-1.4.0-3.fc25 has been submitted as an update to Fedora 25. https://bodhi.fedoraproject.org/updates/FEDORA-2016-ca1b48f89e Thanks a lot! While you're fixing things, feel like taking a swing at https://bugzilla.redhat.com/show_bug.cgi?id=1374090 ? :) NetworkManager-1.4.0-3.fc25 has been pushed to the Fedora 25 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-ca1b48f89e NetworkManager-1.4.0-3.fc25 has been pushed to the Fedora 25 stable repository. If problems still persist, please make note of it in this bug report. |