RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1450444 - NM wrongly delays startup complete with 'carrier wait'
Summary: NM wrongly delays startup complete with 'carrier wait'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: NetworkManager
Version: 7.3
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: rc
: ---
Assignee: Thomas Haller
QA Contact: Desktop QE
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-05-12 14:54 UTC by Thomas Haller
Modified: 2017-08-01 09:28 UTC (History)
8 users (show)

Fixed In Version: NetworkManager-1.8.0-2.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-01 09:28:57 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
logfile (9.23 KB, text/plain)
2017-05-12 14:54 UTC, Thomas Haller
no flags Details
[log] show the issue with current master (31.78 KB, text/plain)
2017-05-15 11:20 UTC, Thomas Haller
no flags Details
[log] logfile with the fix (32.09 KB, text/plain)
2017-05-15 11:20 UTC, Thomas Haller
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:2299 0 normal SHIPPED_LIVE Moderate: NetworkManager and libnl3 security, bug fix and enhancement update 2017-08-01 12:40:28 UTC

Description Thomas Haller 2017-05-12 14:54:19 UTC
Created attachment 1278197 [details]
logfile

NetworkManger-wait-online.service blocks boot (depending on configuration).

NetworkManger-wait-online waits until NM declares "startup complete" in the logfile. As long as there are pending actions (as we call them) for a device, startup complete is not yet reached.

NM might wrongly queue "carrier wait" pending action, even if the device already has carrier. As a result, the action lingers there until timeout and blocks boot by almost 5 seconds.

Comment 1 Thomas Haller 2017-05-12 14:58:37 UTC
please review th/device-carrier-wait-rh1450444 branch

Comment 2 Beniamino Galvani 2017-05-12 20:39:37 UTC
LGTM

Comment 3 Thomas Haller 2017-05-15 11:20:08 UTC
Created attachment 1278920 [details]
[log] show the issue with current master

Comment 4 Thomas Haller 2017-05-15 11:20:59 UTC
Created attachment 1278921 [details]
[log] logfile with the fix

Note that startup-complete happens much early.

Comment 5 Beniamino Galvani 2017-05-15 13:09:49 UTC
At a second sight, I don't understand why do we need to start the
timeout when the carrier is up... is it because we don't always
refresh the link in nm_device_bring_up() and so priv->carrier might be
out of sync? If so, I think this could be easily solved by refreshing
the link in nm_device_bring_up() before the check?

Now we add the pending action later when the carrier goes down, but at
that point we may already have emitted startup-complete.

Comment 6 Thomas Haller 2017-05-15 14:03:02 UTC
(In reply to Beniamino Galvani from comment #5)
> At a second sight, I don't understand why do we need to start the
> timeout when the carrier is up... is it because we don't always
> refresh the link in nm_device_bring_up() and so priv->carrier might be
> out of sync? If so, I think this could be easily solved by refreshing
> the link in nm_device_bring_up() before the check?

Note sure that we really *need* it.

I think when you set the link up, the NIC or the ethernet switch might shortly take away carrier.

I think the current solution is a bit more complicated, but would avoid problems in that case.


> Now we add the pending action later when the carrier goes down, but at
> that point we may already have emitted startup-complete.

"pending-actions" have only one purpose, to delay startup complete. Once startup complete is reached, those pending actions are useless (but not harmful).

It might be a problem that we have carrier and declare startup complete too early. Before, we would always add additional 5 seconds in that case. After the patch, we must make sure to have some other pending actions that delay startup. I think we do that already, for example NM_PENDING_ACTION_AUTOACTIVATE). But this needs testing.

Comment 7 Beniamino Galvani 2017-05-15 14:17:17 UTC
(In reply to Thomas Haller from comment #6)
> Note sure that we really *need* it.
> 
> I think when you set the link up, the NIC or the ethernet switch might
> shortly take away carrier.

> I think the current solution is a bit more complicated, but would avoid
> problems in that case.

> It might be a problem that we have carrier and declare startup complete too
> early. Before, we would always add additional 5 seconds in that case. After
> the patch, we must make sure to have some other pending actions that delay
> startup. I think we do that already, for example
> NM_PENDING_ACTION_AUTOACTIVATE). But this needs testing.

Ok, then looks good to me.

Comment 8 Lubomir Rintel 2017-05-15 15:34:19 UTC
LGTM

Comment 11 Vladimir Benes 2017-06-28 14:26:33 UTC
NetworkManager.x86_64 1:1.4.0-20.el7_3
čen 28 09:17:52 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Starting Network Manager Wait Online...
čen 28 09:17:57 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Started Network Manager Wait Online.

NetworkManager-1.8.0-1.el7.x86_64
čen 28 09:48:37 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Starting Network Manager Wait Online...
čen 28 09:48:42 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Started Network Manager Wait Online.

NetworkManager-1.8.0-9.el7.x86_64
čen 28 09:19:37 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Starting Network Manager Wait Online...
čen 28 09:19:37 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Started Network Manager Wait Online.

so it's fixed!

Comment 12 errata-xmlrpc 2017-08-01 09:28:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2299


Note You need to log in before you can comment on or make changes to this bug.