Bug 1450444
| Summary: | NM wrongly delays startup complete with 'carrier wait' | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Thomas Haller <thaller> | ||||||||
| Component: | NetworkManager | Assignee: | Thomas Haller <thaller> | ||||||||
| Status: | CLOSED ERRATA | QA Contact: | Desktop QE <desktop-qa-list> | ||||||||
| Severity: | unspecified | Docs Contact: | |||||||||
| Priority: | unspecified | ||||||||||
| Version: | 7.3 | CC: | atragler, bgalvani, fgiudici, lrintel, rkhan, sukulkar, thaller, vbenes | ||||||||
| Target Milestone: | rc | ||||||||||
| Target Release: | --- | ||||||||||
| Hardware: | Unspecified | ||||||||||
| OS: | Unspecified | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | NetworkManager-1.8.0-2.el7 | Doc Type: | If docs needed, set a value | ||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | Environment: | ||||||||||
| Last Closed: | 2017-08-01 09:28:57 UTC | Type: | Bug | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Attachments: |
|
||||||||||
please review th/device-carrier-wait-rh1450444 branch LGTM Created attachment 1278920 [details]
[log] show the issue with current master
Created attachment 1278921 [details]
[log] logfile with the fix
Note that startup-complete happens much early.
At a second sight, I don't understand why do we need to start the timeout when the carrier is up... is it because we don't always refresh the link in nm_device_bring_up() and so priv->carrier might be out of sync? If so, I think this could be easily solved by refreshing the link in nm_device_bring_up() before the check? Now we add the pending action later when the carrier goes down, but at that point we may already have emitted startup-complete. (In reply to Beniamino Galvani from comment #5) > At a second sight, I don't understand why do we need to start the > timeout when the carrier is up... is it because we don't always > refresh the link in nm_device_bring_up() and so priv->carrier might be > out of sync? If so, I think this could be easily solved by refreshing > the link in nm_device_bring_up() before the check? Note sure that we really *need* it. I think when you set the link up, the NIC or the ethernet switch might shortly take away carrier. I think the current solution is a bit more complicated, but would avoid problems in that case. > Now we add the pending action later when the carrier goes down, but at > that point we may already have emitted startup-complete. "pending-actions" have only one purpose, to delay startup complete. Once startup complete is reached, those pending actions are useless (but not harmful). It might be a problem that we have carrier and declare startup complete too early. Before, we would always add additional 5 seconds in that case. After the patch, we must make sure to have some other pending actions that delay startup. I think we do that already, for example NM_PENDING_ACTION_AUTOACTIVATE). But this needs testing. (In reply to Thomas Haller from comment #6) > Note sure that we really *need* it. > > I think when you set the link up, the NIC or the ethernet switch might > shortly take away carrier. > I think the current solution is a bit more complicated, but would avoid > problems in that case. > It might be a problem that we have carrier and declare startup complete too > early. Before, we would always add additional 5 seconds in that case. After > the patch, we must make sure to have some other pending actions that delay > startup. I think we do that already, for example > NM_PENDING_ACTION_AUTOACTIVATE). But this needs testing. Ok, then looks good to me. LGTM merged upstream: https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=8ca05e9fd2dd1919180962bb0953ac6aa5536d9a NetworkManager.x86_64 1:1.4.0-20.el7_3 čen 28 09:17:52 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Starting Network Manager Wait Online... čen 28 09:17:57 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Started Network Manager Wait Online. NetworkManager-1.8.0-1.el7.x86_64 čen 28 09:48:37 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Starting Network Manager Wait Online... čen 28 09:48:42 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Started Network Manager Wait Online. NetworkManager-1.8.0-9.el7.x86_64 čen 28 09:19:37 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Starting Network Manager Wait Online... čen 28 09:19:37 ip-10-101-150-49.eu-west-1.compute.internal systemd[1]: Started Network Manager Wait Online. so it's fixed! Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:2299 |
Created attachment 1278197 [details] logfile NetworkManger-wait-online.service blocks boot (depending on configuration). NetworkManger-wait-online waits until NM declares "startup complete" in the logfile. As long as there are pending actions (as we call them) for a device, startup complete is not yet reached. NM might wrongly queue "carrier wait" pending action, even if the device already has carrier. As a result, the action lingers there until timeout and blocks boot by almost 5 seconds.