Bugzilla will be upgraded to version 5.0. The upgrade date is tentatively scheduled for 2 December 2018, pending final testing and feedback.
Bug 1450444 - NM wrongly delays startup complete with 'carrier wait'
NM wrongly delays startup complete with 'carrier wait'
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: NetworkManager (Show other bugs)
7.3
Unspecified Unspecified
unspecified Severity unspecified
: rc
: ---
Assigned To: Thomas Haller
Desktop QE
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2017-05-12 10:54 EDT by Thomas Haller
Modified: 2017-08-01 05:28 EDT (History)
8 users (show)

See Also:
Fixed In Version: NetworkManager-1.8.0-2.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-01 05:28:57 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
logfile (9.23 KB, text/plain)
2017-05-12 10:54 EDT, Thomas Haller
no flags Details
[log] show the issue with current master (31.78 KB, text/plain)
2017-05-15 07:20 EDT, Thomas Haller
no flags Details
[log] logfile with the fix (32.09 KB, text/plain)
2017-05-15 07:20 EDT, Thomas Haller
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:2299 normal SHIPPED_LIVE Moderate: NetworkManager and libnl3 security, bug fix and enhancement update 2017-08-01 08:40:28 EDT

  None (edit)
Description Thomas Haller 2017-05-12 10:54:19 EDT
Created attachment 1278197 [details]
logfile

NetworkManger-wait-online.service blocks boot (depending on configuration).

NetworkManger-wait-online waits until NM declares "startup complete" in the logfile. As long as there are pending actions (as we call them) for a device, startup complete is not yet reached.

NM might wrongly queue "carrier wait" pending action, even if the device already has carrier. As a result, the action lingers there until timeout and blocks boot by almost 5 seconds.
Comment 1 Thomas Haller 2017-05-12 10:58:37 EDT
please review th/device-carrier-wait-rh1450444 branch
Comment 2 Beniamino Galvani 2017-05-12 16:39:37 EDT
LGTM
Comment 3 Thomas Haller 2017-05-15 07:20 EDT
Created attachment 1278920 [details]
[log] show the issue with current master
Comment 4 Thomas Haller 2017-05-15 07:20 EDT
Created attachment 1278921 [details]
[log] logfile with the fix

Note that startup-complete happens much early.
Comment 5 Beniamino Galvani 2017-05-15 09:09:49 EDT
At a second sight, I don't understand why do we need to start the
timeout when the carrier is up... is it because we don't always
refresh the link in nm_device_bring_up() and so priv->carrier might be
out of sync? If so, I think this could be easily solved by refreshing
the link in nm_device_bring_up() before the check?

Now we add the pending action later when the carrier goes down, but at
that point we may already have emitted startup-complete.
Comment 6 Thomas Haller 2017-05-15 10:03:02 EDT
(In reply to Beniamino Galvani from comment #5)
> At a second sight, I don't understand why do we need to start the
> timeout when the carrier is up... is it because we don't always
> refresh the link in nm_device_bring_up() and so priv->carrier might be
> out of sync? If so, I think this could be easily solved by refreshing
> the link in nm_device_bring_up() before the check?

Note sure that we really *need* it.

I think when you set the link up, the NIC or the ethernet switch might shortly take away carrier.

I think the current solution is a bit more complicated, but would avoid problems in that case.


> Now we add the pending action later when the carrier goes down, but at
> that point we may already have emitted startup-complete.

"pending-actions" have only one purpose, to delay startup complete. Once startup complete is reached, those pending actions are useless (but not harmful).

It might be a problem that we have carrier and declare startup complete too early. Before, we would always add additional 5 seconds in that case. After the patch, we must make sure to have some other pending actions that delay startup. I think we do that already, for example NM_PENDING_ACTION_AUTOACTIVATE). But this needs testing.
Comment 7 Beniamino Galvani 2017-05-15 10:17:17 EDT
(In reply to Thomas Haller from comment #6)
> Note sure that we really *need* it.
> 
> I think when you set the link up, the NIC or the ethernet switch might
> shortly take away carrier.

> I think the current solution is a bit more complicated, but would avoid
> problems in that case.

> It might be a problem that we have carrier and declare startup complete too
> early. Before, we would always add additional 5 seconds in that case. After
> the patch, we must make sure to have some other pending actions that delay
> startup. I think we do that already, for example
> NM_PENDING_ACTION_AUTOACTIVATE). But this needs testing.

Ok, then looks good to me.
Comment 8 Lubomir Rintel 2017-05-15 11:34:19 EDT
LGTM
Comment 11 Vladimir Benes 2017-06-28 10:26:33 EDT
NetworkManager.x86_64 1:1.4.0-20.el7_3
čen 28 09:17:52 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Starting Network Manager Wait Online...
čen 28 09:17:57 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Started Network Manager Wait Online.

NetworkManager-1.8.0-1.el7.x86_64
čen 28 09:48:37 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Starting Network Manager Wait Online...
čen 28 09:48:42 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Started Network Manager Wait Online.

NetworkManager-1.8.0-9.el7.x86_64
čen 28 09:19:37 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Starting Network Manager Wait Online...
čen 28 09:19:37 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Started Network Manager Wait Online.

so it's fixed!
Comment 12 errata-xmlrpc 2017-08-01 05:28:57 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2299

Note You need to log in before you can comment on or make changes to this bug.