Bug 1068621
| Summary: | network.service and NetworkManager both try to activate interfaces at startup | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Jan Tluka <jtluka> |
| Component: | initscripts | Assignee: | initscripts Maintenance Team <initscripts-maint-list> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Jan Ščotka <jscotka> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.0 | CC: | bcl, bgoncalv, danw, dcbw, dkupka, jscotka, jtluka, lnykryn, mkovarik, pspacek, svenkatr, swadeley, vbenes, vincent.y.chen |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | initscripts-9.49.15-1.el7 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-06-13 12:24:49 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 722240, 782468, 1020613, 1025505, 1050219, 1057960, 1061221, 1062567, 1062801, 1063932, 1066200, 1067873, 1069502, 1070517, 1070921, 1073810, 1075057, 1077078 | | |
Description
Jan Tluka
2014-02-21 14:04:29 UTC
Can you attach /var/log/messages from a bootup that shows the problem?

Created attachment 867570
bootup messages

Here's the log from bootup. I reproduced it on a different system, so the slave device name has changed.
Feb 25 18:49:33 localhost network: Bringing up interface bond0: Error: no device found for connection 'System bond0'.

The error message comes from nmcli, but nmcli should know that this is a software device for which a connection *can* be started even though the device doesn't yet exist. So I think it's an nmcli bug?

The problem here (besides the probably-incorrect error message dcbw pointed out) seems to be that NM and network.service are both trying to activate all of the ONBOOT interfaces on boot. network.service recognizes that the devices are NM-controlled, and so uses "nmcli con up" to bring them up, but this still causes problems, because (a) it tries to bring some of them up when NM isn't ready for them, and (b) if NM has already brought the device up, then NM interprets the second attempt as a request to take the device down and then bring it back up again. The latter is suspected to be the cause of several other bugs involving iSCSI breakage (bug 1029677, bug 1058270, bug 1026777, bug 1066963).

The fix, we believe, is that network.service shouldn't try to start/stop NM-controlled interfaces at all, if NM is running. I'll attach a patch for that.

(This probably would have been noticed sooner except that it seems that possibly network.service is not actually getting started on some systems? qv bug 1003936 comment 10)

Created attachment 870605
patch to initscripts git
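
The idea proposed above (network.service leaves NM-controlled interfaces alone while NM is running) could look roughly like the sketch below. It is only an illustration and is not the attached patch. The function name `should_leave_to_nm`, the caller-supplied "NM is running" flag, and the assumption that an ifcfg file without an NM_CONTROLLED key counts as NM-controlled are all hypothetical choices made for this example.

```shell
#!/bin/sh
# Hypothetical helper, not taken from initscripts.
# $1 = path to an ifcfg file (must contain a '/', so '.' does not search PATH)
# $2 = "yes" if NetworkManager is running, anything else otherwise
# Succeeds (exit 0) when the interface should be left to NetworkManager.
should_leave_to_nm() {
    # Run in a subshell so variables set by the ifcfg file do not leak out.
    (
        NM_CONTROLLED=yes        # assumed default when the key is absent
        . "$1"
        case "$NM_CONTROLLED" in
            no|No|NO|false|0) exit 1 ;;   # explicitly not NM-controlled
        esac
        [ "$2" = yes ]           # only defer to NM if it is actually running
    )
}

# Example use inside a start loop (the flag would come from the caller):
#   if should_leave_to_nm /etc/sysconfig/network-scripts/ifcfg-eth0 "$nm_running"; then
#       continue    # NetworkManager will activate it
#   fi
```

Taking the "NM is running" state as an argument keeps the helper free of any assumption about how that state is detected.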
Hmm, I must say that I don't like this approach. I am afraid that if we do that, we will end up with regression bugs from customers that service network stop will not stop the network. Yes, this has to be fixed somehow, but I need to think about it more.

> (b) if NM has already brought the device
> up, then it interprets the second attempt as a request to take the device
> down and then bring it back up again.

BTW why are you doing that? I don't find that logical at all. I would expect that it will be a noop.
I have discussed that with an architect from devexp and I would propose this solution. We will add Before=network.service to the NetworkManager-wait-online.service. That will fix the (a). For the (b) I really think that if the device is up it should be a noop. If you really insist that this is correct behavior, then we could add a check for it in ifup.

(In reply to Lukáš Nykrýn from comment #6)
> Hmm, I must say that I don't like this approach. I am afraid that if we do
> that, we will end up with regression bugs from customers that service
> network stop will not stop the network.

The "stop" side is not currently causing any problems, so we could let "stop" keep acting on all interfaces, and only change "start". Although then I guess "service network stop; do stuff...; service network start" would lose.

(In reply to Lukáš Nykrýn from comment #7)
> > (b) if NM has already brought the device
> > up, then it interprets the second attempt as a request to take the device
> > down and then bring it back up again.
>
> BTW why are you doing that? I don't find that logical at all. I would expect
> that it will be a noop.

Not 100% sure. It's been that way longer than I've been hacking on NM. One thing it's nice for is that it provides the equivalent of the Windows "Repair Connection" functionality; just re-click on the active connection in nm-applet, and it will do a DHCP renew, etc. We have talked about changing this functionality though. But at one point in this cycle we accidentally broke it, and QA immediately noticed because it broke some of their testing scripts. So I'm not sure we can/should really change it at this point. (dcbw may have thoughts on this?)

(In reply to Lukáš Nykrýn from comment #8)
> I have discussed that with an architect from devexp and I would propose this
> solution.
> We will add Before=network.service to the
> NetworkManager-wait-online.service. That will fix the (a).

I don't think that works in all cases, because NetworkManager-wait-online is only active if there is a service on the system that depends on network.target (which currently might always be the case, but in the future everyone is supposed to be clever and deal with network changes at runtime instead). (Maybe there is some way in systemd syntax to say "network.service should require-and-come-after NetworkManager-wait-online if NetworkManager is active, but not if it isn't".)

> For the (b) I
> really think that if the device is up it should be a noop. If you really
> insist that this is correct behavior, then we could add a check for it in
> ifup.

I agree that we at least need to fix ifup's semantics to work how it used to. We can add a flag to nmcli saying "only activate if not already active", and ifup could then pass that flag. (And then if we changed NM's activation semantics in the future, that flag would just silently become a no-op.) But then (whether we changed NM or we changed ifup) network.service would still log "Bringing up interface eth0", etc, even though it wasn't actually doing anything. Which is why it seemed more correct to me to change network.service rather than ifup.

(In reply to Lukáš Nykrýn from comment #8)
> I have discussed that with an architect from devexp and I would propose this
> solution.
> We will add Before=network.service to the
> NetworkManager-wait-online.service. That will fix the (a). For the (b) I
> really think that if the device is up it should be a noop. If you really
> insist that this is correct behavior, then we could add a check for it in
> ifup.
We have two cases here if the device is already active:

1) a request to restart the *same* connection on the device
2) a request to start a different connection on the device

I'm fine with making #1 a NOP, because the only reason we allowed this in the first place (long long ago) was that drivers sucked and sometimes just died, but reactivating the device made things work again, so it was a quick shortcut to work around kernel bugs. Those bugs are mostly fixed, and there's no real reason to keep this shortcut around.

Making #2 a NOP or an error *would* be a problem, because no NetworkManager client (nmcli, KDE, GNOME, nmtui, nm-applet, etc) explicitly disconnects the device and then reactivates it with the new connection. This could also cause problems at startup when NetworkManager assumes the existing configuration of the device, but for whatever reason the assumed connection does not match the ifcfg ONBOOT=yes connection. In that case, there are two different connections, and the one that the 'network' service will be starting is different than what's active on the device, and this would be allowed. This could break things like eg iSCSI or network mounted /usr.

We could easily modify 'ifup' itself to do something like:

nmcli -t -f GENERAL.STATE dev show $DEVICE | grep -v "connecting\|connected"

and if that returns anything, the device is not active and can be started by the network service without conflict. Otherwise, danw's suggestion of a flag for nmcli would work too.

Clarification: the regex I posted would match "disconnected" so we'd have to ensure that the grep only attempted to match "connecting" and "connected" as full words.

Created attachment 873067
patch
I will include the attached patch. It ensures that ifup will not call nmcli for a device handled by NM if that device is already in the connected or connecting state. I have also modified the LSB header of the network initscript to ensure that there will be an After dependency in systemd between network and NM.
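
The guard described above can be sketched without a regex at all, which also avoids the "disconnected" substring problem raised earlier. This is an illustration only, not the code from the patch: the function name `nm_device_is_active` is hypothetical, and it assumes input lines of the exact form `device:state` with the state reported as plain `connected` or `connecting`.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the patch.
# Reads "device:state" lines on stdin and succeeds (exit 0) if device $1
# is listed as exactly "connected" or "connecting".
# -F: treat the patterns as fixed strings (device names are not regexes)
# -x: the whole line must match, so "eth0:disconnected" is never accepted
nm_device_is_active() {
    grep -q -x -F -e "$1:connected" -e "$1:connecting"
}

# Example use in an ifup-like script, feeding it the nmcli invocation
# quoted elsewhere in this report:
#   if LANG=C nmcli -t --fields device,state dev status 2>/dev/null \
#          | nm_device_is_active "$DEVICE"; then
#       exit 0    # already handled by NetworkManager, nothing to do
#   fi
```

If nmcli reports a state with extra detail after the word "connecting", the exact-line match would miss it, so the accepted strings would need adjusting for such output.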
Thanks Lukáš! Do you want another ':' between ${1} and 'connecting'?
LANG=C nmcli -t --fields device,state dev status 2>/dev/null | grep -q "^\(${1}:connected\)\|\(${1}connecting\)$"
Otherwise I don't think the regex would not correctly match 'connecting' states...
Sorry for the double-negative there, to be clearer I mean:
I don't think the regex would match 'connecting' states without the ':' between "${1}:connecting" right?
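
The concern can be checked with a small experiment. The sketch below compares the pattern as quoted above with a variant that has the ':' added to the second alternative. The helper names `line_matches` and `report`, the device name `eth0`, and the sample `device:state` lines are illustrative assumptions. Note that `\|` in a basic regular expression is a GNU grep extension.

```shell
#!/bin/sh
# Succeeds if the line in $2 matches the grep pattern in $1.
line_matches() {
    printf '%s\n' "$2" | grep -q "$1"
}

# Prints "match" or "no match" for pattern $1 against line $2.
report() {
    if line_matches "$1" "$2"; then echo "match"; else echo "no match"; fi
}

dev=eth0   # illustrative device name

# Pattern as quoted in the comment: no ':' before "connecting".
pattern_without_colon="^\(${dev}:connected\)\|\(${dev}connecting\)$"
# Same pattern with the ':' added to the second alternative.
pattern_with_colon="^\(${dev}:connected\)\|\(${dev}:connecting\)$"

for line in "${dev}:connected" "${dev}:connecting" "${dev}:disconnected"; do
    printf '%s -> without colon: %s, with colon: %s\n' "$line" \
        "$(report "$pattern_without_colon" "$line")" \
        "$(report "$pattern_with_colon" "$line")"
done
```

Under these assumptions, only the variant with the ':' recognizes a `eth0:connecting` line, and neither variant accepts `eth0:disconnected`.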
Replying to myself, the actual patch used (https://git.fedorahosted.org/cgit/initscripts.git/commit/?id=2f00d21f7d0bf74de4d06d26a4475b91da90a4f7) does add the ':'. The patch attached to this bug is a previous version.

*** Bug 1035487 has been marked as a duplicate of this bug. ***

*** Bug 1070557 has been marked as a duplicate of this bug. ***

*** Bug 1073409 has been marked as a duplicate of this bug. ***

*** Bug 1058270 has been marked as a duplicate of this bug. ***

This request was resolved in Red Hat Enterprise Linux 7.0. Contact your manager or support representative in case you have further questions about the request.

Could you backport the fix from your comment #12 to Fedora, please? It seems that we have hit the same problem in our automated testing infrastructure built on top of Fedora 20.

Why are you using the network initscript and NM together in Fedora? By default network is not enabled and we are trying to push people to leave it that way.