Red Hat Bugzilla – Bug 1450444
NM wrongly delays startup complete with 'carrier wait'
Last modified: 2017-08-01 05:28:57 EDT
Created attachment 1278197 [details] logfile NetworkManger-wait-online.service blocks boot (depending on configuration). NetworkManger-wait-online waits until NM declares "startup complete" in the logfile. As long as there are pending actions (as we call them) for a device, startup complete is not yet reached. NM might wrongly queue "carrier wait" pending action, even if the device already has carrier. As a result, the action lingers there until timeout and blocks boot by almost 5 seconds.
please review th/device-carrier-wait-rh1450444 branch
LGTM
Created attachment 1278920 [details] [log] show the issue with current master
Created attachment 1278921 [details] [log] logfile with the fix Note that startup-complete happens much early.
At a second sight, I don't understand why do we need to start the timeout when the carrier is up... is it because we don't always refresh the link in nm_device_bring_up() and so priv->carrier might be out of sync? If so, I think this could be easily solved by refreshing the link in nm_device_bring_up() before the check? Now we add the pending action later when the carrier goes down, but at that point we may already have emitted startup-complete.
(In reply to Beniamino Galvani from comment #5) > At a second sight, I don't understand why do we need to start the > timeout when the carrier is up... is it because we don't always > refresh the link in nm_device_bring_up() and so priv->carrier might be > out of sync? If so, I think this could be easily solved by refreshing > the link in nm_device_bring_up() before the check? Note sure that we really *need* it. I think when you set the link up, the NIC or the ethernet switch might shortly take away carrier. I think the current solution is a bit more complicated, but would avoid problems in that case. > Now we add the pending action later when the carrier goes down, but at > that point we may already have emitted startup-complete. "pending-actions" have only one purpose, to delay startup complete. Once startup complete is reached, those pending actions are useless (but not harmful). It might be a problem that we have carrier and declare startup complete too early. Before, we would always add additional 5 seconds in that case. After the patch, we must make sure to have some other pending actions that delay startup. I think we do that already, for example NM_PENDING_ACTION_AUTOACTIVATE). But this needs testing.
(In reply to Thomas Haller from comment #6) > Note sure that we really *need* it. > > I think when you set the link up, the NIC or the ethernet switch might > shortly take away carrier. > I think the current solution is a bit more complicated, but would avoid > problems in that case. > It might be a problem that we have carrier and declare startup complete too > early. Before, we would always add additional 5 seconds in that case. After > the patch, we must make sure to have some other pending actions that delay > startup. I think we do that already, for example > NM_PENDING_ACTION_AUTOACTIVATE). But this needs testing. Ok, then looks good to me.
merged upstream: https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=8ca05e9fd2dd1919180962bb0953ac6aa5536d9a
NetworkManager.x86_64 1:1.4.0-20.el7_3 čen 28 09:17:52 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Starting Network Manager Wait Online... čen 28 09:17:57 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Started Network Manager Wait Online. NetworkManager-1.8.0-1.el7.x86_64 čen 28 09:48:37 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Starting Network Manager Wait Online... čen 28 09:48:42 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Started Network Manager Wait Online. NetworkManager-1.8.0-9.el7.x86_64 čen 28 09:19:37 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Starting Network Manager Wait Online... čen 28 09:19:37 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Started Network Manager Wait Online. so it's fixed!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:2299