Bug 1069695
| Summary: | NetworkManager dies reproducibly on docker container start | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Stephen Tweedie <sct> |
| Component: | NetworkManager | Assignee: | Dan Williams <dcbw> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Desktop QE <desktop-qa-list> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 7.0 | CC: | dcbw, fge, jeder, jklimes, jkoten, kdube, mjenner, rkhan, sct, thaller, vbenes |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | NetworkManager-0.9.9.1-2.git20140228.el7 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-06-13 11:52:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1069814 | | |
| Attachments: | | | |

Description
Stephen Tweedie
2014-02-25 14:25:46 UTC
Created attachment 867446
/var/log/messages extract
/var/log/messages extract from the start of the docker runs to the point of the assert failure.
NM creates a default-DHCP connection for vethdwrZl4 (because it's currently a subclass of 'ethernet' devices, and because NetworkManager-config-server isn't installed to suppress creation of the default DHCP connections). It then activates that connection, but the device gets removed before that can happen triggering an assertion.

Feb 25 14:12:08 rhel7 NetworkManager[739]: <info> (vethdwrZl4): created default wired connection 'Wired connection 1'
Feb 25 14:12:08 rhel7 NetworkManager[739]: <info> (vethdwrZl4): device state change: unavailable -> disconnected (reason 'none') [20 30 0]
Feb 25 14:12:08 rhel7 NetworkManager[739]: <info> Auto-activating connection 'Wired connection 1'.
Feb 25 14:12:08 rhel7 NetworkManager[739]: <info> (vethdwrZl4): device state change: disconnected -> unmanaged (reason 'removed') [30 10 36]
Feb 25 14:12:08 rhel7 NetworkManager: **
Feb 25 14:12:08 rhel7 NetworkManager: ERROR:nm-manager.c:2768:_internal_activate_device: assertion failed: (connection_needs_virtual_device (connection))

I have reproduced this issue and verified that the fixes from bug 1058843 work around the issue. I also have a few fixes that should be applied to be 100% sure the issue is fixed.

Please test RPMs here:

http://people.redhat.com/dcbw/NetworkManager/rh1069695/

and let me know if this fixes the issue.

(In reply to Dan Williams from comment #5)
> Please test RPMs here:
>
> http://people.redhat.com/dcbw/NetworkManager/rh1069695/
>
> and let me know if this fixes the issue.

Yes, looks like it's fixed to me --- 100 containers and no ill effects. I'll try a few more but first signs are good.

Would anyone mind if I made this bug public? We try to attach bug #s to our upstream commits, and if the bug isn't public then that's meaningless for the NM community. I don't see any non-public info in this bug, please let me know if you disagree.

Let me know if making this bug public is OK. Thanks!

Some additional fixes (not strictly necessary) posted to upstream branch dcbw/handle-activate-dev-remove.

(In reply to Dan Williams from comment #7)
> Would anyone mind if I made this bug public? We try to attach bug #s to our
> upstream commits, and if the bug isn't public then that's meaningless for
> the NM community. I don't see any non-public info in this bug, please let
> me know if you disagree.
>
> Let me know if making this bug public is OK. Thanks!

Sure, fine by me.
--Stephen

This bug is now public.

To be 100% clear, the original crash in this bug is fixed by patches for bug 1058843 and that will be in the next snapshot, NetworkManager-0.9.9.1-1 and later.

I'm keeping this bug open for reviews on the dcbw/activate-dev-remove until that's merged upstream and gets into a build. This branch contains some additional fixes for a bug that the docker use-case could experience in very specific circumstances, but should be very rare.

These patches look correct to me

pushed a small fixup for comments:
fixup! core: ensure ActiveConnections stay alive over activation paths
> core: correctly handle pre-activation dependency failure (rh #1069695)
_internal_activate_generic() log message could be more specific about the connection, device, etc.
Otherwise the code looks good to me.
(In reply to Jirka Klimes from comment #13)
> pushed a small fixup for comments:
> fixup! core: ensure ActiveConnections stay alive over activation paths

Thanks, squashed.

> > core: correctly handle pre-activation dependency failure (rh #1069695)
> _internal_activate_generic() log message could be more specific about the
> connection, device, etc.

Unfortunately we don't know in the manager whether the failure was due to a device being removed or unavailable, or whether a master connection failed. We'd have to have some kind of additional information on the ActiveConnection for that. So I'll leave that for later.

Branch merged to git master.

*** Bug 1059297 has been marked as a duplicate of this bug. ***

*** Bug 1074423 has been marked as a duplicate of this bug. ***

*** Bug 1059297 has been marked as a duplicate of this bug. ***

I cannot see any crashes when those 100 docker instances are upped. Not all devices are connected but that's probably different issue.

This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.
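
For anyone reading this bug later, the race from the original report (a default wired connection is queued for auto-activation, then the veth device disappears before activation runs) can be illustrated with a small toy model. This is only a sketch under stated assumptions: it is not NetworkManager code, and the class ToyManager and its methods are hypothetical names invented for illustration. It shows the general idea of reporting a vanished device as a failed activation rather than asserting, which may differ from how the merged patches actually handle it.

```python
"""Toy model of the race in this bug; not NetworkManager code."""


class ToyManager:
    """Hypothetical stand-in for the manager; all names are illustrative."""

    def __init__(self):
        self.devices = set()   # interface names currently present
        self.pending = []      # queued (connection, iface) auto-activations

    def device_added(self, iface):
        # A new ethernet-like device gets a default wired connection,
        # and that connection is queued for auto-activation.
        self.devices.add(iface)
        self.pending.append(("Wired connection 1", iface))

    def device_removed(self, iface):
        # The container runtime may remove the veth at any time.
        self.devices.discard(iface)

    def run_pending(self):
        # Process queued activations. A missing device is reported as a
        # failed activation instead of tripping an assertion.
        results = []
        for connection, iface in self.pending:
            if iface in self.devices:
                results.append((connection, iface, "activated"))
            else:
                results.append((connection, iface, "failed: device gone"))
        self.pending = []
        return results


manager = ToyManager()
manager.device_added("vethdwrZl4")
manager.device_removed("vethdwrZl4")   # removal wins the race
outcome = manager.run_pending()
print(outcome)
```

In this model the queued activation for vethdwrZl4 ends up as a failure entry because the device was removed first; a device that is still present when run_pending() is called is reported as activated.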