
Bug 1069695

Summary: NetworkManager dies reproducibly on docker container start
Product: Red Hat Enterprise Linux 7
Reporter: Stephen Tweedie <sct>
Component: NetworkManager
Assignee: Dan Williams <dcbw>
Status: CLOSED CURRENTRELEASE
QA Contact: Desktop QE <desktop-qa-list>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 7.0
CC: dcbw, fge, jeder, jklimes, jkoten, kdube, mjenner, rkhan, sct, thaller, vbenes
Target Milestone:
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: NetworkManager-0.9.9.1-2.git20140228.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-06-13 11:52:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1069814

Attachments:

Description	Flags
/var/log/messages extract	none

Description Stephen Tweedie 2014-02-25 14:25:46 UTC

Description of problem:
NetworkManager dies reproducibly when starting multiple docker containers, with the error

ERROR:nm-manager.c:2768:_internal_activate_device: assertion failed: (connection_needs_virtual_device (connection))

Version-Release number of selected component (if applicable):
NetworkManager-0.9.9.0-39.git20140131.el7.x86_64
using (from EPEL)
docker-io-0.8.0-3.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
Install docker (the docker-io package comes from EPEL):
# yum install http://dl.fedoraproject.org/pub/epel/beta/7/x86_64/epel-release-7-0.1.noarch.rpm
# yum install docker-io
# systemctl enable docker
# systemctl start docker

Download Fedora images:
# docker pull fedora

Create a set of simple network-enabled docker containers:

# time for n in $(seq 1 100) ; do docker run -d -t -i sct/fedora bash; done

This runs 100 shells from the docker runtime in the background.
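
A minimal sketch of the same loop that stops as soon as NetworkManager is no longer active, so the number of containers needed to trigger the crash gets recorded. The function name repro_until_crash and the START_CMD/CHECK_CMD override variables are hypothetical, and the default image name fedora is an assumption taken from the pull step (the loop above uses sct/fedora).

```shell
# Hypothetical hooks: override them for a dry run or a different image.
START_CMD="${START_CMD:-docker run -d -t -i fedora bash}"
CHECK_CMD="${CHECK_CMD:-systemctl is-active --quiet NetworkManager}"

# Start up to $1 containers, checking NetworkManager after each one.
# Returns 0 if it stayed active, 1 if it went inactive, 2 if a start failed.
repro_until_crash() {
    total="$1"
    n=1
    while [ "$n" -le "$total" ]; do
        $START_CMD >/dev/null || return 2
        if ! $CHECK_CMD; then
            echo "NetworkManager inactive after $n container(s)"
            return 1
        fi
        n=$((n + 1))
    done
    echo "NetworkManager still active after $total container(s)"
}
```

Run it as root with "repro_until_crash 100". The check is only as good as "systemctl is-active": if the service is restarted automatically between two checks, a crash may go unnoticed, so /var/log/messages remains the authoritative record.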

Actual results:
NetworkManager dies quickly; this is 100% reproducible for me.


Additional info:

The environment is a 2-vCPU, up-to-date (nightly) RHEL 7 VM on a Fedora 20 host, with default networking in both host and guest.

Networking initialised by docker:
"ip r"
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.42.1 

"ip a"
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP 
    link/ether fe:00:1d:90:09:fd brd ff:ff:ff:ff:ff:ff
    inet 172.17.42.1/16 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::c0b0:53ff:fecb:f0f5/64 scope link 
       valid_lft forever preferred_lft forever

This is the docker bridge interface on the host; each docker container gets an additional veth device bridged to it, similar to:

5: vethcfcDAD: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master docker0 state UP qlen 1000
    link/ether fe:05:90:96:c3:91 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc05:90ff:fe96:c391/64 scope link 
       valid_lft forever preferred_lft forever

"brctl show"
bridge name     bridge id               STP enabled     interfaces
docker0         8000.fe001d9009fd       no              veth0QyZRr
                                                        veth16xWmR
                                                        veth1pSrD5
                                                        veth2GA0ji
etc
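
To check how many veth devices are attached to the bridge at a given moment, the "ip" output can be filtered for lines naming docker0 as master. This is a rough sketch; the function name list_bridge_ports is hypothetical, and it assumes the "master <bridge>" field appears on the first line of each interface entry, as in the "ip a" output shown above.

```shell
# Read "ip link" / "ip addr" output on stdin and print the name of every
# interface whose master is the bridge given as $1.
list_bridge_ports() {
    awk -v br="$1" '
        {
            for (i = 1; i < NF; i++) {
                if ($i == "master" && $(i + 1) == br) {
                    name = $2
                    sub(/:$/, "", name)   # strip the trailing colon
                    sub(/@.*/, "", name)  # strip a "@peer" suffix if present
                    print name
                }
            }
        }'
}
```

For example, "ip -o link show | list_bridge_ports docker0 | wc -l" gives the number of ports currently enslaved to docker0.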

Comment 1 Stephen Tweedie 2014-02-25 14:27:05 UTC

Created attachment 867446 [details]
/var/log/messages extract

/var/log/messages extract from the start of the docker runs to the point of the assert failure.

Comment 5 Dan Williams 2014-02-26 19:57:08 UTC

NM creates a default DHCP connection for vethdwrZl4 (because veth is currently a subclass of 'ethernet' devices, and because NetworkManager-config-server isn't installed to suppress creation of the default DHCP connections).  It then tries to activate that connection, but the device is removed before the activation can happen, which triggers an assertion.

Feb 25 14:12:08 rhel7 NetworkManager[739]: <info> (vethdwrZl4): created default wired connection 'Wired connection 1'
Feb 25 14:12:08 rhel7 NetworkManager[739]: <info> (vethdwrZl4): device state change: unavailable -> disconnected (reason 'none') [20 30 0]
Feb 25 14:12:08 rhel7 NetworkManager[739]: <info> Auto-activating connection 'Wired connection 1'.
Feb 25 14:12:08 rhel7 NetworkManager[739]: <info> (vethdwrZl4): device state change: disconnected -> unmanaged (reason 'removed') [30 10 36]
Feb 25 14:12:08 rhel7 NetworkManager: **
Feb 25 14:12:08 rhel7 NetworkManager: ERROR:nm-manager.c:2768:_internal_activate_device: assertion failed: (connection_needs_virtual_device (connection))
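
The sequence above can be searched for in a longer log. The following is a rough sketch, not part of NetworkManager: the function name find_removed_during_activation is hypothetical, and it assumes the message wording and the "<info> (device):" prefix shown in the excerpt above.

```shell
# Read syslog lines on stdin and print each device that got a default wired
# connection and was later moved to another state with reason 'removed'.
find_removed_during_activation() {
    awk '
        /created default wired connection/ {
            dev = $0
            sub(/^.*<info> \(/, "", dev)   # drop everything up to "("
            sub(/\).*$/, "", dev)          # drop everything from ")"
            created[dev] = 1
        }
        /device state change: .*reason .removed./ {
            dev = $0
            sub(/^.*<info> \(/, "", dev)
            sub(/\).*$/, "", dev)
            if (dev in created) print dev
        }'
}
```

Usage: "find_removed_during_activation < /var/log/messages". A device listed here is only a candidate for the race; the assertion line itself still has to be present to confirm the crash.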

I have reproduced this issue and verified that the fixes from bug 1058843 work around the issue.  I also have a few fixes that should be applied to be 100% sure the issue is fixed.

Please test RPMs here:

http://people.redhat.com/dcbw/NetworkManager/rh1069695/

and let me know if this fixes the issue.
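
As noted at the start of this comment, the default DHCP connections are only created because NetworkManager-config-server is not installed. For setups that cannot take updated packages right away, a minimal sketch of a comparable drop-in follows. It relies on the "no-auto-default" key in the [main] section of NetworkManager.conf; the function name, the file name 99-no-auto-default.conf, and the assumption that the installed build reads a conf.d directory are all hypothetical and should be checked against the installed version.

```shell
# Write a drop-in into the directory given as $1 that stops NetworkManager
# from creating default wired connections for any device.
write_no_auto_default() {
    conf_dir="$1"
    mkdir -p "$conf_dir" || return 1
    cat > "$conf_dir/99-no-auto-default.conf" <<'EOF'
[main]
no-auto-default=*
EOF
}
```

Usage would be "write_no_auto_default /etc/NetworkManager/conf.d" followed by "systemctl restart NetworkManager". This only avoids the auto-activation path described above; it is not a substitute for the fixed packages.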

Comment 6 Stephen Tweedie 2014-02-27 10:59:51 UTC

(In reply to Dan Williams from comment #5)
> Please test RPMs here:
> 
> http://people.redhat.com/dcbw/NetworkManager/rh1069695/
> 
> and let me know if this fixes the issue.

Yes, looks like it's fixed to me --- 100 containers and no ill effects.  I'll try a few more but first signs are good.

Comment 7 Dan Williams 2014-02-27 18:07:38 UTC

Would anyone mind if I made this bug public?  We try to attach bug #s to our upstream commits, and if the bug isn't public then that's meaningless for the NM community.  I don't see any non-public info in this bug, please let me know if you disagree.

Let me know if making this bug public is OK.  Thanks!

Comment 8 Dan Williams 2014-02-27 18:34:30 UTC

Some additional fixes (not strictly necessary) posted to upstream branch dcbw/handle-activate-dev-remove.

Comment 9 Stephen Tweedie 2014-02-27 18:46:07 UTC

(In reply to Dan Williams from comment #7)
> Would anyone mind if I made this bug public?  We try to attach bug #s to our
> upstream commits, and if the bug isn't public then that's meaningless for
> the NM community.  I don't see any non-public info in this bug, please let
> me know if you disagree.
> 
> Let me know if making this bug public is OK.  Thanks!

Sure, fine by me.

--Stephen

Comment 10 Dan Williams 2014-02-28 16:07:05 UTC

This bug is now public.

Comment 11 Dan Williams 2014-02-28 16:11:16 UTC

To be 100% clear, the original crash in this bug is fixed by the patches for bug 1058843, which will be in the next snapshot, NetworkManager-0.9.9.1-1, and later builds.

I'm keeping this bug open for reviews on the dcbw/activate-dev-remove branch until that's merged upstream and gets into a build.  This branch contains some additional fixes for a bug that the docker use-case could hit in very specific circumstances, but which should be very rare.

Comment 12 Thomas Haller 2014-03-03 12:27:52 UTC

These patches look correct to me.

Comment 13 Jirka Klimes 2014-03-04 12:04:58 UTC

pushed a small fixup for comments:
fixup! core: ensure ActiveConnections stay alive over activation paths

> core: correctly handle pre-activation dependency failure (rh #1069695)
_internal_activate_generic() log message could be more specific about the connection, device, etc.

Otherwise the code looks good to me.

Comment 14 Dan Williams 2014-03-04 18:42:16 UTC

(In reply to Jirka Klimes from comment #13)
> pushed a small fixup for comments:
> fixup! core: ensure ActiveConnections stay alive over activation paths

Thanks, squashed.

> > core: correctly handle pre-activation dependency failure (rh #1069695)
> _internal_activate_generic() log message could be more specific about the
> connection, device, etc.

Unfortunately we don't know in the manager whether the failure was due to a device being removed or unavailable, or whether a master connection failed.  We'd have to have some kind of additional information on the ActiveConnection for that.  So I'll leave that for later.

Branch merged to git master.

Comment 15 Jirka Klimes 2014-03-06 09:47:56 UTC

*** Bug 1059297 has been marked as a duplicate of this bug. ***

Comment 17 Jirka Klimes 2014-03-10 09:35:16 UTC

*** Bug 1074423 has been marked as a duplicate of this bug. ***

Comment 18 Radek Bíba 2014-03-13 09:48:43 UTC

*** Bug 1059297 has been marked as a duplicate of this bug. ***

Comment 19 Vladimir Benes 2014-03-19 08:56:16 UTC

I cannot see any crashes when those 100 docker instances are brought up. Not all devices are connected, but that's probably a different issue.

Comment 20 Ludek Smid 2014-06-13 11:52:00 UTC

This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.