Bug 1450444 - NM wrongly delays startup complete with 'carrier wait'
Summary: NM wrongly delays startup complete with 'carrier wait'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: NetworkManager
Version: 7.3
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: rc
: ---
Assignee: Thomas Haller
QA Contact: Desktop QE
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-05-12 14:54 UTC by Thomas Haller
Modified: 2017-08-01 09:28 UTC (History)
8 users (show)

Fixed In Version: NetworkManager-1.8.0-2.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-01 09:28:57 UTC


Attachments (Terms of Use)
logfile (9.23 KB, text/plain)
2017-05-12 14:54 UTC, Thomas Haller
no flags Details
[log] show the issue with current master (31.78 KB, text/plain)
2017-05-15 11:20 UTC, Thomas Haller
no flags Details
[log] logfile with the fix (32.09 KB, text/plain)
2017-05-15 11:20 UTC, Thomas Haller
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:2299 normal SHIPPED_LIVE Moderate: NetworkManager and libnl3 security, bug fix and enhancement update 2017-08-01 12:40:28 UTC

Description Thomas Haller 2017-05-12 14:54:19 UTC
Created attachment 1278197 [details]
logfile

NetworkManger-wait-online.service blocks boot (depending on configuration).

NetworkManger-wait-online waits until NM declares "startup complete" in the logfile. As long as there are pending actions (as we call them) for a device, startup complete is not yet reached.

NM might wrongly queue "carrier wait" pending action, even if the device already has carrier. As a result, the action lingers there until timeout and blocks boot by almost 5 seconds.

Comment 1 Thomas Haller 2017-05-12 14:58:37 UTC
please review th/device-carrier-wait-rh1450444 branch

Comment 2 Beniamino Galvani 2017-05-12 20:39:37 UTC
LGTM

Comment 3 Thomas Haller 2017-05-15 11:20:08 UTC
Created attachment 1278920 [details]
[log] show the issue with current master

Comment 4 Thomas Haller 2017-05-15 11:20:59 UTC
Created attachment 1278921 [details]
[log] logfile with the fix

Note that startup-complete happens much early.

Comment 5 Beniamino Galvani 2017-05-15 13:09:49 UTC
At a second sight, I don't understand why do we need to start the
timeout when the carrier is up... is it because we don't always
refresh the link in nm_device_bring_up() and so priv->carrier might be
out of sync? If so, I think this could be easily solved by refreshing
the link in nm_device_bring_up() before the check?

Now we add the pending action later when the carrier goes down, but at
that point we may already have emitted startup-complete.

Comment 6 Thomas Haller 2017-05-15 14:03:02 UTC
(In reply to Beniamino Galvani from comment #5)
> At a second sight, I don't understand why do we need to start the
> timeout when the carrier is up... is it because we don't always
> refresh the link in nm_device_bring_up() and so priv->carrier might be
> out of sync? If so, I think this could be easily solved by refreshing
> the link in nm_device_bring_up() before the check?

Note sure that we really *need* it.

I think when you set the link up, the NIC or the ethernet switch might shortly take away carrier.

I think the current solution is a bit more complicated, but would avoid problems in that case.


> Now we add the pending action later when the carrier goes down, but at
> that point we may already have emitted startup-complete.

"pending-actions" have only one purpose, to delay startup complete. Once startup complete is reached, those pending actions are useless (but not harmful).

It might be a problem that we have carrier and declare startup complete too early. Before, we would always add additional 5 seconds in that case. After the patch, we must make sure to have some other pending actions that delay startup. I think we do that already, for example NM_PENDING_ACTION_AUTOACTIVATE). But this needs testing.

Comment 7 Beniamino Galvani 2017-05-15 14:17:17 UTC
(In reply to Thomas Haller from comment #6)
> Note sure that we really *need* it.
> 
> I think when you set the link up, the NIC or the ethernet switch might
> shortly take away carrier.

> I think the current solution is a bit more complicated, but would avoid
> problems in that case.

> It might be a problem that we have carrier and declare startup complete too
> early. Before, we would always add additional 5 seconds in that case. After
> the patch, we must make sure to have some other pending actions that delay
> startup. I think we do that already, for example
> NM_PENDING_ACTION_AUTOACTIVATE). But this needs testing.

Ok, then looks good to me.

Comment 8 Lubomir Rintel 2017-05-15 15:34:19 UTC
LGTM

Comment 11 Vladimir Benes 2017-06-28 14:26:33 UTC
NetworkManager.x86_64 1:1.4.0-20.el7_3
čen 28 09:17:52 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Starting Network Manager Wait Online...
čen 28 09:17:57 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Started Network Manager Wait Online.

NetworkManager-1.8.0-1.el7.x86_64
čen 28 09:48:37 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Starting Network Manager Wait Online...
čen 28 09:48:42 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Started Network Manager Wait Online.

NetworkManager-1.8.0-9.el7.x86_64
čen 28 09:19:37 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Starting Network Manager Wait Online...
čen 28 09:19:37 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Started Network Manager Wait Online.

so it's fixed!

Comment 12 errata-xmlrpc 2017-08-01 09:28:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2299


Note You need to log in before you can comment on or make changes to this bug.