Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1068621

Summary: network.service and NetworkManager both try to activate interfaces at startup
Product: Red Hat Enterprise Linux 7
Reporter: Jan Tluka <jtluka>
Component: initscripts
Assignee: initscripts Maintenance Team <initscripts-maint-list>
Status: CLOSED CURRENTRELEASE
QA Contact: Jan Ščotka <jscotka>
Severity: high
Docs Contact:
Priority: high
Version: 7.0
CC: bcl, bgoncalv, danw, dcbw, dkupka, jscotka, jtluka, lnykryn, mkovarik, pspacek, svenkatr, swadeley, vbenes, vincent.y.chen
Target Milestone:
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: initscripts-9.49.15-1.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-06-13 12:24:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 722240, 782468, 1020613, 1025505, 1050219, 1057960, 1061221, 1062567, 1062801, 1063932, 1066200, 1067873, 1069502, 1070517, 1070921, 1073810, 1075057, 1077078

Attachments:

Description	Flags
bootup messages	none
patch to initscripts git	none
patch	none

Description Jan Tluka 2014-02-21 14:04:29 UTC

Description of problem:

When I configure a bond device via network-scripts and reboot the system, the device is not configured until I manually load the bonding kernel module.

# cat /etc/sysconfig/network-scripts/ifcfg-bond0 
DEVICE=bond0
IPADDR=192.168.2.1
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
NM_CONTROLLED=yes
BONDING_OPTS="mode=active-backup"

# cat /etc/sysconfig/network-scripts/ifcfg-ens10 
DEVICE=ens10
HWADDR=52:54:00:02:00:01
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
USERCTL=no
NM_CONTROLLED=yes

Version-Release number of selected component (if applicable):

# rpm -qa NetworkManager
NetworkManager-0.9.9.0-38.git20140131.el7.x86_64

How reproducible:
Every time

Steps to Reproduce:
1. prepare network-scripts as stated in description
2. reboot system
3. check the device is created and configured
# ip l show dev bond0

Actual results:
bond device is not created

Expected results:
bond device is created and configured

Additional info:

There seems to be another bug: after I manually load the bonding driver, the IPv4 address is still not configured on the device.

Comment 1 Dan Williams 2014-02-25 16:22:04 UTC

Can you attach /var/log/messages from the system from a bootup that shows the problem?

Comment 2 Jan Tluka 2014-02-25 17:57:27 UTC

Created attachment 867570 [details]
bootup messages

Here's the log from bootup. I reproduced the problem on a different system, so the slave device name has changed.

Comment 3 Dan Williams 2014-02-26 22:14:59 UTC

Feb 25 18:49:33 localhost network: Bringing up interface bond0:  Error: no device found for connection 'System bond0'.

The error message comes from nmcli, but nmcli should know that this is a software device for which a connection *can* be started even though the device doesn't yet exist.  So I think it's an nmcli bug?

Comment 4 Dan Winship 2014-03-04 19:14:07 UTC

The problem here (besides the probably-incorrect error message dcbw pointed out) seems to be that NM and network.service are both trying to activate all of the ONBOOT interfaces on boot. network.service recognizes that the devices are NM-controlled, and so uses "nmcli con up" to bring them up, but this still causes problems, because (a) it tries to bring some of them up when NM isn't ready for them, and (b) if NM has already brought the device up, then NM interprets the second attempt as a request to take the device down and then bring it back up again. The latter is suspected to be the cause of several other bugs involving iSCSI breakage (bug 1029677, bug 1058270, bug 1026777, bug 1066963).

The fix, we believe, is that network.service shouldn't try to start/stop NM-controlled interfaces at all, if NM is running. I'll attach a patch for that.

(This probably would have been noticed sooner except that it seems that possibly network.service is not actually getting started on some systems? qv bug 1003936 comment 10)
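
The idea proposed in this comment (skip NM-controlled interfaces while NM is running) can be sketched as a small shell check. This is only an illustration and not the attached patch: the function name `should_skip_for_nm` and its two arguments are hypothetical, and it assumes that an ifcfg file without an explicit `NM_CONTROLLED=no` counts as NM-controlled.

```shell
#!/bin/sh
# Hypothetical helper, not the attached patch.
# $1: path to an ifcfg file
# $2: "yes" if NetworkManager is currently running, anything else otherwise
# Returns 0 (skip this interface) when NM is running and the file does not
# opt out of NM control; returns 1 (handle it in network.service) otherwise.
should_skip_for_nm() {
    ifcfg_file=$1
    nm_running=$2

    # Without a running NM, network.service has to handle the interface.
    [ "$nm_running" = "yes" ] || return 1

    # Take the last NM_CONTROLLED assignment, drop quotes and spaces,
    # and lowercase it. Assumption: a missing value means NM-controlled.
    nm_controlled=$(sed -n 's/^NM_CONTROLLED=//p' "$ifcfg_file" \
        | tail -n 1 | tr -d "\"' " | tr '[:upper:]' '[:lower:]')

    case "$nm_controlled" in
        no|false) return 1 ;;
        *) return 0 ;;
    esac
}
```

A caller could derive the second argument from `systemctl is-active --quiet NetworkManager` and skip the "Bringing up interface" step whenever the helper returns 0.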

Comment 5 Dan Winship 2014-03-04 19:15:01 UTC

Created attachment 870605 [details]
patch to initscripts git

Comment 6 Lukáš Nykrýn 2014-03-05 09:42:32 UTC

Hmm, I must say that I don't like this approach. I am afraid that if we do that, we will end up with regression bugs from customers that service network stop will not stop the network.

But yes this has to be fixed somehow, but I need to think about it more.

Comment 7 Lukáš Nykrýn 2014-03-05 11:10:56 UTC

> (b) if NM has already brought the device
> up, then it interprets the second attempt as a request to take the device
> down and then bring it back up again.

BTW why are you doing that? I don't find that logical at all. I would expect that it will be a noop.

Comment 8 Lukáš Nykrýn 2014-03-05 11:38:17 UTC

I have discussed that with an architect from devexp and I would propose this solution. 
We will add Before=network.service to the NetworkManager-wait-online.service. That will fix the (a). For the (b) I really think that if the device is up it should be a noop. If you really insist that this is correct behavior, then we could add a check for it in ifup.
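
For illustration, the ordering proposed for (a) could be tried locally with a systemd drop-in rather than by editing the shipped unit. This is a sketch under assumptions, not what initscripts or NetworkManager ships: the function name `write_wait_online_dropin` and the file name `before-network.conf` are hypothetical, and the unit directory is passed in as an argument.

```shell
#!/bin/sh
# Sketch only: write a drop-in that orders NetworkManager-wait-online.service
# before network.service.
# $1: systemd unit directory (for example /etc/systemd/system)
write_wait_online_dropin() {
    dropin_dir="$1/NetworkManager-wait-online.service.d"
    mkdir -p "$dropin_dir" || return 1
    # The drop-in file name is hypothetical; any *.conf name is read.
    cat > "$dropin_dir/before-network.conf" <<'EOF'
[Unit]
Before=network.service
EOF
}
```

After writing the file under the real unit directory, `systemctl daemon-reload` is needed before systemd picks up the new ordering.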

Comment 9 Dan Winship 2014-03-05 15:38:41 UTC

(In reply to Lukáš Nykrýn from comment #6)
> Hmm, I must say that I don't like this approach. I am afraid that if we do
> that, we will end up with regression bugs from customers that service
> network stop will not stop the network.

The "stop" side is not currently causing any problems, so we could let "stop" keep acting on all interfaces, and only change "start". Although then I guess "service network stop; do stuff...; service network start" would lose.

(In reply to Lukáš Nykrýn from comment #7)
> > (b) if NM has already brought the device
> > up, then it interprets the second attempt as a request to take the device
> > down and then bring it back up again.
> 
> BTW why are you doing that? I don't find that logical at all. I would expect
> that it will be a noop.

Not 100% sure. It's been that way longer than I've been hacking on NM. One thing it's nice for is that it provides the equivalent of the Windows "Repair Connection" functionality; just re-click on the active connection in nm-applet, and it will do a DHCP renew, etc.

We have talked about changing this functionality though. But at one point in this cycle we accidentally broke it, and QA immediately noticed because it broke some of their testing scripts. So I'm not sure we can/should really change it at this point. (dcbw may have thoughts on this?)

(In reply to Lukáš Nykrýn from comment #8)
> I have discussed that with an architect from devexp and I would propose this
> solution. 
> We will add Before=network.service to the
> NetworkManager-wait-online.service. That will fix the (a).

I don't think that works in all cases, because NetworkManager-wait-online is only active if there is a service on the system that depends on network.target (which currently might always be the case, but in the future everyone is supposed to be clever and deal with network changes at runtime instead).

(Maybe there is some way in systemd syntax to say "network.service should require-and-come-after NetworkManager-wait-online if NetworkManager is active, but not if it isn't".)

> For the (b) I
> really think that if the device is up it should be a noop. If you really
> insist that this is correct behavior, then we could add a check for it in
> ifup.

I agree that we at least need to fix ifup's semantics to work how it used to. We can add a flag to nmcli saying "only activate if not already active", and ifup could then pass that flag. (And then if we changed NM's activation semantics in the future, that flag would just silently become a no-op.)

But then (whether we changed NM or we changed ifup) network.service would still log "Bringing up interface eth0", etc, even though it wasn't actually doing anything. Which is why it seemed more correct to me to change network.service rather than ifup.

Comment 10 Dan Williams 2014-03-05 16:40:08 UTC

(In reply to Lukáš Nykrýn from comment #8)
> I have discussed that with an architect from devexp and I would propose this
> solution. 
> We will add Before=network.service to the
> NetworkManager-wait-online.service. That will fix the (a). For the (b) I
> really think that if the device is up it should be a noop. If you really
> insist that this is correct behavior, then we could add a check for it in
> ifup.

We have two cases here if the device is already active:

1) a request to restart the *same* connection on the device

2) a request to start a different connection on the device


I'm fine with making #1 a NOP, because the only reason we allowed this in the first place (long long ago) was that drivers sucked and sometimes just died, but reactivating the device made things work again, so it was a quick shortcut to work around kernel bugs.  Those bugs are mostly fixed, and there's no real reason to keep this shortcut around.

Making #2 a NOP or an error *would* be a problem, because no NetworkManager client (nmcli, KDE, GNOME, nmtui, nm-applet, etc) explicitly disconnects the device and then reactivates it with the new connection.  This could also cause problems at startup when NetworkManager assumes the existing configuration of the device, but for whatever reason the assumed connection does not match the ifcfg ONBOOT=yes connection.  In that case, there are two different connections, and the one that the 'network' service will be starting is different from what's active on the device, and this would be allowed.  This could break things like, e.g., iSCSI or network-mounted /usr.

We could easily modify 'ifup' itself to do something like:

nmcli -t -f GENERAL.STATE dev show $DEVICE | grep -v "connecting\|connected"

and if that returns anything, the device is not active and can be started by the network service without conflict.

Otherwise, danw's suggestion of a flag for nmcli would work too.

Comment 11 Dan Williams 2014-03-05 16:41:16 UTC

Clarification: the regex I posted would match "disconnected" so we'd have to ensure that the grep only attempted to match "connecting" and "connected" as full words.
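
One way to get the whole-word matching described here is `grep -w`. The sketch below is an illustration only: the helper name `nm_state_is_active` is hypothetical, it reads the state text on stdin instead of calling nmcli, and it reverses the sense of the `grep -v` command above (it succeeds when the device is active).

```shell
#!/bin/sh
# Hypothetical helper: reads state text on stdin and succeeds only if it
# contains "connecting" or "connected" as a whole word.
# grep -w rejects a match that is embedded in a longer word, so
# "disconnected" does not count as "connected".
nm_state_is_active() {
    grep -qwE 'connecting|connected'
}
```

In ifup this could be fed from `nmcli -t -f GENERAL.STATE dev show $DEVICE`, with the network service starting the device only when the helper fails.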

Comment 12 Lukáš Nykrýn 2014-03-11 11:34:46 UTC

Created attachment 873067 [details]
patch

I will include the attached patch. It ensures that ifup will not call nmcli for a device handled by NM if that device is already in the connected or connecting state. I have also modified the LSB header of the network initscript to ensure that there will be an After dependency in systemd between network and NM.

Comment 14 Dan Williams 2014-03-11 15:41:33 UTC

Thanks Lukáš!

Comment 15 Dan Williams 2014-03-11 16:17:10 UTC

Do you want another ':' between ${1} and 'connecting'?

 LANG=C nmcli -t --fields device,state  dev status 2>/dev/null | grep -q "^\(${1}:connected\)\|\(${1}connecting\)$"

Otherwise I don't think the regex would not correctly match 'connecting' states...

Comment 16 Dan Williams 2014-03-11 16:39:58 UTC

Sorry for the double-negative there, to be clearer I mean:

I don't think the regex would match 'connecting' states without the ':' between "${1}:connecting" right?

Comment 17 Dan Williams 2014-03-12 14:29:40 UTC

Replying to myself: the actual patch used (https://git.fedorahosted.org/cgit/initscripts.git/commit/?id=2f00d21f7d0bf74de4d06d26a4475b91da90a4f7) does add the ':'.  The patch attached to this bug is a previous version.
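
The check discussed in comments 15 to 17 can also be written without a regex, which avoids the missing-':' class of mistake. This is a sketch and not the committed patch: the helper name `device_is_active` is hypothetical, and it assumes its input consists of `device:state` lines like those the `nmcli -t --fields device,state dev status` command quoted above is expected to print.

```shell
#!/bin/sh
# Hypothetical helper, not the committed patch.
# $1: device name. stdin: lines of the form "device:state".
# Succeeds if the device has a line whose state is connected or connecting.
# -F treats the patterns as fixed strings and -x requires a whole-line
# match, so "disconnected" or a longer device name cannot match by accident.
device_is_active() {
    grep -qxF -e "${1}:connected" -e "${1}:connecting"
}
```

A possible use in ifup would be `LANG=C nmcli -t --fields device,state dev status 2>/dev/null | device_is_active "$DEVICE"`, skipping the nmcli activation when it succeeds.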

Comment 18 Dan Williams 2014-03-13 19:42:35 UTC

*** Bug 1035487 has been marked as a duplicate of this bug. ***

Comment 19 Dan Williams 2014-03-13 19:44:53 UTC

*** Bug 1070557 has been marked as a duplicate of this bug. ***

Comment 21 Dan Williams 2014-03-20 22:09:51 UTC

*** Bug 1073409 has been marked as a duplicate of this bug. ***

Comment 22 Dan Williams 2014-03-20 22:12:28 UTC

*** Bug 1058270 has been marked as a duplicate of this bug. ***

Comment 23 Ludek Smid 2014-06-13 12:24:49 UTC

This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.

Comment 25 Petr Spacek 2014-07-16 13:23:48 UTC

Could you backport the fix from your comment #12 to Fedora, please? It seems that we have hit the same problem in our automated testing infrastructure built on top of Fedora 20.

Comment 26 Lukáš Nykrýn 2014-07-22 14:22:01 UTC

Why are you using the network initscript and NM together in Fedora? By default, network is not enabled, and we are trying to push people to leave it that way.