Bug 1851103 - Use of NetworkManager-wait-online.service in rhcos-growpart.service
Summary: Use of NetworkManager-wait-online.service in rhcos-growpart.service
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Jonathan Lebon
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-06-25 15:47 UTC by Micah Abbott
Modified: 2021-02-24 15:13 UTC (History)

Fixed In Version:
Doc Type: Removed functionality
Doc Text:
The `rhcos-growpart.service` has been removed in favor of configuring disks via Ignition at install time. Users that need to change disk configuration after the initial install of RHCOS should reprovision their systems with the necessary disk configuration changes.
Clone Of: 1846169
Environment:
Last Closed: 2021-02-24 15:12:59 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift os pull 484 | 0 | None | closed | overlay: nuke rhcos-growpart | 2021-02-15 22:05:14 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:13:02 UTC

Description Micah Abbott 2020-06-25 15:47:43 UTC
Cloned for rhcos-growpart.service


+++ This bug was initially created as a clone of Bug #1846169 +++

RHCOS currently (44.81.202004260825-0) makes use of NetworkManager-wait-online.service to try to ensure that the network is up before starting a couple of services:

    $ systemctl list-dependencies --reverse NetworkManager-wait-online.service
    NetworkManager-wait-online.service
    ● └─network-online.target
    ●   ├─console-login-helper-messages-issuegen.service
    ●   └─rhcos-growpart.service

This is largely considered to be bad practice (more detail can be found on https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/) and we should move away from this paradigm. There is the following note in rhcos-growpart.service, so this will probably resolve itself, but the issuegen service deserves another look:

    This is a hack; in the future we'll just resize in the initramfs like on FCOS.

I'm filing this issue because I've seen this service fail sporadically in the past and I'm seeing it pretty reliably on Packet.net. As far as I can tell, it doesn't result in any failures, though it's definitely a distraction and potentially a red herring.
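
As a rough illustration (not part of the original report), the reverse-dependency listing above can be checked programmatically, for example in a triage script. The sketch below assumes the tree layout shown in the `systemctl list-dependencies --reverse` output above; the helper name `reverse_dependents` is hypothetical.

```python
def reverse_dependents(listing: str) -> list[str]:
    """Return the unit names listed below the root unit in the output of
    `systemctl list-dependencies --reverse <unit>`."""
    units = []
    # The first line is the root unit itself, so skip it.
    for line in listing.splitlines()[1:]:
        # Drop the leading status bullet, indentation and tree-drawing characters.
        name = line.lstrip(" \t●○│├└─").strip()
        if name:
            units.append(name)
    return units


# Sample text copied from the listing in this report.
sample = """NetworkManager-wait-online.service
● └─network-online.target
●   ├─console-login-helper-messages-issuegen.service
●   └─rhcos-growpart.service
"""

services = [u for u in reverse_dependents(sample) if u.endswith(".service")]
print(services)
```

Filtering on the `.service` suffix leaves only the two services that are ordered behind `network-online.target`, which are the ones discussed in this bug.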

--- Additional comment from Micah Abbott on 2020-06-11 16:08:13 UTC ---

Per the first comment, there doesn't seem to be any impact on cluster health, but it is a potential red herring during debug/triage. So setting this as low priority/low severity, targeted for 4.6.

--- Additional comment from Robert Fairley on 2020-06-17 15:44:37 UTC ---

Thanks for reporting. I will be looking into this next sprint to see if there is any problem with issuegen. I have sometimes seen issuegen fail alongside other failing services, so issuegen may have an incorrect systemd dependency.

--- Additional comment from Dusty Mabe on 2020-06-22 16:08:47 UTC ---

Hey Alex, what do you suggest as a fix? console-login-helper-messages-issuegen.service and rhcos-growpart.service already specify deps directly on network-online.target, which is what https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ tells us to do IIUC.

--- Additional comment from Colin Walters on 2020-06-22 16:57:19 UTC ---

The CLHM one I feel like we should fix by having the console output dynamically regenerate if something changes.

CLHM is actually part of slowing down the default FCOS login speed because of this.

That said for RHCOS of course we've only *grown* the set of things that depend on this since e.g.
https://github.com/openshift/machine-config-operator/pull/1206

--- Additional comment from Dusty Mabe on 2020-06-22 17:51:07 UTC ---

(In reply to Colin Walters from comment #4)
> The CLHM one I feel like we should fix by having the console output
> dynamically regenerate if something changes.

I don't know a whole lot about how console output works, so I may be speaking incorrectly. Wouldn't we still have to wait some amount of time because we only want to output to the console once? Right now we have `Before=systemd-user-sessions.service` in console-login-helper-messages-issuegen.service, which I assume is necessary if we only want to output to the console once.

--- Additional comment from Robert Fairley on 2020-06-22 21:25:03 UTC ---

(In reply to Dusty Mabe from comment #5)
> (In reply to Colin Walters from comment #4)
> > The CLHM one I feel like we should fix by having the console output
> > dynamically regenerate if something changes.
> 
> I don't know a whole lot about how console output works so i may be speaking
> incorrectly. Wouldn't we still have to wait some amount of time because we
> only want to output to the console once? Right now we have
> `Before=systemd-user-sessions.service` in
> console-login-helper-messages-issuegen.service which I assume is necessary
> if we only want to output to the console once.

Yes, I believe the `Before=systemd-user-sessions.service` ordering was there to help ensure that the final `issue` snippet from the CLHM issuegen service is generated before the serial console starts and displays it. (`Before=` still does not guarantee this, so it still races.) CLHM has this currently, and the service in Container Linux also had it: https://github.com/coreos/init/blob/a1dbdc3a956e82b45ae756c27ad0981afaaef60c/systemd/system/issuegen.service#L3.

(In reply to Colin Walters from comment #4)
> The CLHM one I feel like we should fix by having the console output
> dynamically regenerate if something changes.
> 
> CLHM is actually part of slowing down the default FCOS login speed because
> of this.
> 
> That said for RHCOS of course we've only *grown* the set of things that
> depend on this since e.g.
> https://github.com/openshift/machine-config-operator/pull/1206

We recently merged https://github.com/coreos/console-login-helper-messages/pull/47, which lets services that write a generated issue snippet use `After=console-login-helper-messages-issuegen.path`. That guarantees the issue message gets regenerated whenever a file is written in `/run/console-login-helper-messages/issue.d`, as long as the writes stay below the systemd start rate limit. (This hasn't landed in the RPM yet, though.) From a brief check just now, `agetty --reload` looks promising as a way to have agetty reload the displayed issue message when a new snippet is dropped in `/run/console-login-helper-messages/issue.d`, so issuegen could call it and we could write to the console multiple times before login. That way, we should be able to remove `Before=systemd-user-sessions.service`. I'll see how well this works and check whether it reduces login time.

--- Additional comment from Robert Fairley on 2020-06-22 21:30:20 UTC ---

The part of `/usr/libexec/console-login-helper-messages/issuegen` that generates the combined `issue` file written to the serial console doesn't need to depend on `network-online.target` for regenerating the issue, but the part that finds the interfaces and IP address information does. Currently these parts are together in the same script, called by `console-login-helper-messages-issuegen.service`, but some work has been started to split these out: https://github.com/coreos/console-login-helper-messages/pull/43. This may help avoid the unexpected dependency/distraction noted in the report. Will try to recreate the failure scenario and see if the splitting out helps to fix this.

--- Additional comment from Robert Fairley on 2020-06-22 21:33:58 UTC ---

Actually, the interface information generation is triggered by a udev rule currently, so with the .path unit mentioned above, splitting the udev script, and using `agetty --reload`, it may be possible to remove the dependency on `network-online.target` altogether.

--- Additional comment from Colin Walters on 2020-06-23 00:14:13 UTC ---

> I don't know a whole lot about how console output works so i may be speaking incorrectly. Wouldn't we still have to wait some amount of time because we only want to output to the console once?

On non-serial consoles (e.g. physical screens) we can refresh the display.

On serial consoles...yeah.  It'd add to the visual noise.

Ultimately though the network status can change at any time, so whatever we output to the console can become a lie (e.g. in DHCP cases, our IP can change, etc.)

> Actually, the interface information generation is triggered by a udev rule currently, so with the .path unit mentioned above, splitting the udev script, and using `agetty --reload`, it may be possible to remove the dependency on `network-online.target` altogether.

That'd be great!

(To be honest I didn't dive deep into debugging our bootup speed but...last I compared the difference with a stock Debian cloud image was *shocking* - we take ~7s to get to a login prompt, the Debian cloud image was ~1s.  I think we simply are doing way way more overall, including having an initramfs at all but still, serializing logins on the network is exactly something that the Debian image *isn't* doing)

--- Additional comment from Robert Fairley on 2020-06-25 01:04:25 UTC ---

Fix to remove the network-online.target dependency: https://github.com/coreos/console-login-helper-messages/pull/49

This turned out to be quite simple, not requiring the other changes I mentioned in above comments (though the other changes would still be improvements). The network-online.target dependency also hadn't been present in Container Linux's version of issuegen.

If there is a reliable reproducer with these services failing together, I can try to verify that this fix avoids that. Otherwise, once the change has been released in an RPM, it should be enough to verify that `/usr/lib/systemd/system/console-login-helper-messages-issuegen.service` no longer includes the `network-online.target` dependency and that the `issue` message at the console still looks correct, something like below.

```
Fedora 32 (Cloud Edition)
Kernel 5.6.6-300.fc32.x86_64 on an x86_64 (ttyS0)

SSH host key: SHA256:JpNaIK3B6i9dxt9jN+OsAxZKGZ4YdU9PDtJ2ir3mffs (RSA)
SSH host key: SHA256:r1zPg/Ou0B+gYx3BiUPW6CqOzIooTqKLd5gTVcQHMT8 (ECDSA)
SSH host key: SHA256:cP4tasb8uMZx/R3deht6wZrfqFANzW4Aa9zgHxjfneo (ED25519)
eth0: 10.0.2.15 fec0::5054:ff:fe12:3456
```
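
A minimal sketch of the unit-file check described above. It scans the `[Unit]` section of a unit file's text for dependency or ordering directives that name `network-online.target`. The function name and both sample unit texts are hypothetical, since the real unit file contents are not shown in this report. The sketch ignores drop-in files and line continuations.

```python
# [Unit] directives that can pull in or order against another unit.
DEPENDENCY_KEYS = {"After", "Wants", "Requires", "Requisite", "BindsTo"}


def network_online_dependencies(unit_text: str,
                                target: str = "network-online.target") -> list[str]:
    """Return the [Unit] directive names in unit_text that reference target."""
    hits = []
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section != "Unit" or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # A directive value is a space-separated list of unit names.
        if key.strip() in DEPENDENCY_KEYS and target in value.split():
            hits.append(key.strip())
    return hits


# Hypothetical unit text, for illustration only.
unit_before_fix = """[Unit]
Description=Generate console issue (hypothetical example)
Before=systemd-user-sessions.service
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
"""

print(network_online_dependencies(unit_before_fix))
```

On a real system the text would come from reading `/usr/lib/systemd/system/console-login-helper-messages-issuegen.service`. An empty result means that file carries no direct reference to `network-online.target`.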

Comment 5 Micah Abbott 2020-09-14 20:03:46 UTC
We are unable to address this as part of 4.6; moving to 4.7

Comment 6 Micah Abbott 2020-10-05 13:14:45 UTC
We are working on higher priority items for the 4.6 release; marking for UpcomingSprint

Comment 7 Micah Abbott 2020-10-25 18:41:48 UTC
We are working on higher priority items for the 4.6 release; marking for UpcomingSprint

Comment 8 Sohan Kunkerkar 2020-12-04 17:40:22 UTC
We are working on higher priority items for the 4.7 release; marking for UpcomingSprint

Comment 9 Benjamin Gilbert 2021-01-09 12:40:40 UTC
rhcos-growpart.service can be removed once we have an upgrade path in place for the legacy LUKS volume.

We're unable to address this for 4.7.  Dropping target release.

Comment 10 Benjamin Gilbert 2021-01-14 19:37:43 UTC
Correction: rhcos-growpart.service should run from the original bootimage, before updating into new machine-os-content, so this is not blocked on an upgrade path.

Comment 11 Micah Abbott 2021-02-04 17:45:47 UTC
We're removing the service as part of 4.7; this will no longer be an issue.

Comment 13 Benjamin Gilbert 2021-02-06 00:38:39 UTC
Red Hat Enterprise Linux CoreOS 47.83.202102051942-0
  Part of OpenShift 4.7, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.7/architecture/architecture-rhcos.html

---
[core@coreos ~]$ systemctl list-dependencies --reverse NetworkManager-wait-online.service
NetworkManager-wait-online.service
● └─network-online.target
[core@coreos ~]$ systemctl status rhcos-growpart.service
Unit rhcos-growpart.service could not be found.

Comment 16 errata-xmlrpc 2021-02-24 15:12:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

