Bug 1846169

Summary: Use of NetworkManager-wait-online.service in console-login-helper-messages-issuegen.service
Product: OpenShift Container Platform Reporter: Alex Crawford <crawford>
Component: RHCOSAssignee: Micah Abbott <miabbott>
Status: CLOSED ERRATA QA Contact: Michael Nguyen <mnguyen>
Severity: low Docs Contact:
Priority: low    
Version: 4.4CC: bbreard, dustymabe, imcleod, jligon, miabbott, nstielau, walters
Target Milestone: ---Keywords: UpcomingSprint
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1851103 (view as bug list) Environment:
Last Closed: 2020-10-27 16:06:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Alex Crawford 2020-06-10 23:42:13 UTC
RHCOS currently (44.81.202004260825-0) makes use of NetworkManager-wait-online.service to try to ensure that the network is up before starting a couple of services:

    $ systemctl list-dependencies --reverse NetworkManager-wait-online.service
    NetworkManager-wait-online.service
    ● └─network-online.target
    ●   ├─console-login-helper-messages-issuegen.service
    ●   └─rhcos-growpart.service

This is largely considered to be bad practice (more detail can be found on https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/) and we should move away from this paradigm. There is the following note in rhcos-growpart.service, so this will probably resolve itself, but the issuegen service deserves another look:

    This is a hack; in the future we'll just resize in the initramfs like on FCOS.

I'm filing this issue because I've seen this service fail sporadically in the past and I'm seeing it pretty reliably on Packet.net. As far as I can tell, it doesn't result in any failures, though it's definitely a distraction and potentially a red herring.

Comment 1 Micah Abbott 2020-06-11 16:08:13 UTC
Per the first comment, there doesn't seem to be any impact to the cluster health, but is a potential red herring during debug/triage.  So setting this as a low prioriry/low severity targeted for 4.6.

Comment 2 Robert Fairley 2020-06-17 15:44:37 UTC
Thanks for reporting, will be looking into this next sprint to see if there is any problem with issuegen. I have seen issuegen fail sometimes with other failing services, there may be some incorrect systemd dependency that issuegen has.

Comment 3 Dusty Mabe 2020-06-22 16:08:47 UTC
Hey Alex, What do you suggest as a fix? console-login-helper-messages-issuegen.service and rhcos-growpart.service already specify deps directly on network-online.target, which is what https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ tells us to do IIUC.

Comment 4 Colin Walters 2020-06-22 16:57:19 UTC
The CLHM one I feel like we should fix by having the console output dynamically regenerate if something changes.

CLHM is actually part of slowing down the default FCOS login speed because of this.

That said for RHCOS of course we've only *grown* the set of things that depend on this since e.g.
https://github.com/openshift/machine-config-operator/pull/1206

Comment 5 Dusty Mabe 2020-06-22 17:51:07 UTC
(In reply to Colin Walters from comment #4)
> The CLHM one I feel like we should fix by having the console output
> dynamically regenerate if something changes.

I don't know a whole lot about how console output works so i may be speaking incorrectly. Wouldn't we still have to wait some amount of time because we only want to output to the console once? Right now we have `Before=systemd-user-sessions.service` in console-login-helper-messages-issuegen.service which I assume is necessary if we only want to output to the console once.

Comment 6 Robert Fairley 2020-06-22 21:25:03 UTC
(In reply to Dusty Mabe from comment #5)
> (In reply to Colin Walters from comment #4)
> > The CLHM one I feel like we should fix by having the console output
> > dynamically regenerate if something changes.
> 
> I don't know a whole lot about how console output works so i may be speaking
> incorrectly. Wouldn't we still have to wait some amount of time because we
> only want to output to the console once? Right now we have
> `Before=systemd-user-sessions.service` in
> console-login-helper-messages-issuegen.service which I assume is necessary
> if we only want to output to the console once.

Yes, the `Before=systemd-user-sessions.service` was there to help ensure the final generated `issue` message snippet in the CLHM issuegen service would be generated before the serial console started and displayed the issue I believe. (There's still no guarantee of this using Before=, so it still races). CLHM has this currently, and the service in Container Linux also had this https://github.com/coreos/init/blob/a1dbdc3a956e82b45ae756c27ad0981afaaef60c/systemd/system/issuegen.service#L3.

(In reply to Colin Walters from comment #4)
> The CLHM one I feel like we should fix by having the console output
> dynamically regenerate if something changes.
> 
> CLHM is actually part of slowing down the default FCOS login speed because
> of this.
> 
> That said for RHCOS of course we've only *grown* the set of things that
> depend on this since e.g.
> https://github.com/openshift/machine-config-operator/pull/1206

We recently merged https://github.com/coreos/console-login-helper-messages/pull/47 which would let services writing a generated issue snippet do `After=console-login-helper-messages-issuegen.path`, which would guarantee the issue message gets regenerated upon file write in `/run/console-login-helper-messages/issue.d` (as long as file writes are below the systemd start rate limit) (this hasn't landed in the RPM yet though). With a brief check just now, `agetty --reload` looks promising to have agetty reload the displayed issue message upon dropping a new snippet in `/run/console-login-helper-messages/issue.d`, so issuegen could call this and we could write to the console multiple times before login. That way, we should be able to remove `Before=systemd-user-sessions.service`. I'll see how well this works, and will check if this reduces login time.

Comment 7 Robert Fairley 2020-06-22 21:30:20 UTC
The part of `/usr/libexec/console-login-helper-messages/issuegen` that generates the combined `issue` file written to the serial console doesn't need to depend on `network-online.target` for regenerating the issue, but the part that finds the interfaces and IP address information does. Currently these parts are together in the same script, called by `console-login-helper-messages-issuegen.service`, but some work has been started to split these out: https://github.com/coreos/console-login-helper-messages/pull/43. This may help avoid the unexpected dependency/distraction noted in the report. Will try to recreate the failure scenario and see if the splitting out helps to fix this.

Comment 8 Robert Fairley 2020-06-22 21:33:58 UTC
Actually, the interface information generation is triggered by a udev rule currently, so with the .path unit mentioned above, splitting the udev script, and using `agetty --reload`, it may be possible to remove the dependency on `network-online.target` altogether.

Comment 9 Colin Walters 2020-06-23 00:14:13 UTC
> I don't know a whole lot about how console output works so i may be speaking incorrectly. Wouldn't we still have to wait some amount of time because we only want to output to the console once?

On non-serial consoles (e.g. physical screens) we can refresh the display.

On serial consoles...yeah.  It'd add to the visual noise.

Ultimately though the network status can change at any time, so whatever we output to the console can become a lie (e.g. in DHCP cases, our IP will can change etc.)

> Actually, the interface information generation is triggered by a udev rule currently, so with the .path unit mentioned above, splitting the udev script, and using `agetty --reload`, it may be possible to remove the dependency on `network-online.target` altogether.

That'd be great!

(To be honest I didn't dive deep into debugging our bootup speed but...last I compared the difference with a stock Debian cloud image was *shocking* - we take ~7s to get to a login prompt, the Debian cloud image was ~1s.  I think we simply are doing way way more overall, including having an initramfs at all but still, serializing logins on the network is exactly something that the Debian image *isn't* doing)

Comment 10 Robert Fairley 2020-06-25 01:04:25 UTC
Fix to remove the network-online.target dependency: https://github.com/coreos/console-login-helper-messages/pull/49

This turned out to be quite simple, not requiring the other changes I mentioned in above comments (though the other changes would still be improvements). The network-online.target dependency also hadn't been present in Container Linux's version of issuegen.

If there is a reliable reproducer with these services failing together, I can try to verify that this fix avoids that. Otherwise, verifying that `/usr/lib/systemd/system/console-login-helper-messages-issuegen.service` does not include the `network-online.target` dependency, and that the `issue` message at the console is still correct looking something like below, once the change has been released in an RPM, should be enough.

```
Fedora 32 (Cloud Edition)
Kernel 5.6.6-300.fc32.x86_64 on an x86_64 (ttyS0)

SSH host key: SHA256:JpNaIK3B6i9dxt9jN+OsAxZKGZ4YdU9PDtJ2ir3mffs (RSA)
SSH host key: SHA256:r1zPg/Ou0B+gYx3BiUPW6CqOzIooTqKLd5gTVcQHMT8 (ECDSA)
SSH host key: SHA256:cP4tasb8uMZx/R3deht6wZrfqFANzW4Aa9zgHxjfneo (ED25519)
eth0: 10.0.2.15 fec0::5054:ff:fe12:3456
```

Comment 11 Micah Abbott 2020-06-25 15:48:45 UTC
Cloned to https://bugzilla.redhat.com/show_bug.cgi?id=1851103 for rhcos-growpart

Comment 13 Robert Fairley 2020-06-25 18:39:52 UTC
https://github.com/coreos/console-login-helper-messages/pull/49 now merged.

Comment 14 Micah Abbott 2020-06-25 19:09:42 UTC
Gotta keep this in POST until we land the fix in an RHCOS build.

Comment 15 Dusty Mabe 2020-06-25 19:44:38 UTC
(In reply to Dusty Mabe from comment #3)
> Hey Alex, What do you suggest as a fix?
> console-login-helper-messages-issuegen.service and rhcos-growpart.service
> already specify deps directly on network-online.target, which is what
> https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ tells us to
> do IIUC.

I think the answer to my question above is that I was missing what Alex was pointing to in https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/. Originally I thought he was pointing to:

"If you need to delay you service after the network is up, include After=network-online.target Wants=network-online.target"

but I now think he was pointing at:

"If you are a developer, instead of wondering what to do about network.target, please just fix your program to be friendly to dynamically changing network configuration." 

which makes a little more sense.

Comment 16 Colin Walters 2020-06-25 20:53:15 UTC
> "If you are a developer, instead of wondering what to do about network.target, please just fix your program to be friendly to dynamically changing network configuration." 
> which makes a little more sense.

Yep, exactly.

(But like I said for us I think by far the biggest offenders are crio/kubelet but we can at least chase them once we've fixed the OS)

Comment 19 Michael Nguyen 2020-07-10 19:32:06 UTC
Verified on 46.82.202007071437-0 which is included in OCP 4.6.0-0.nightly-2020-07-10-135238

ibm-p8-kvm-03-guest-02 login: 
Red Hat Enterprise Linux CoreOS 46.82.202007071437-0 (Ootpa) 4.6
SSH host key: SHA256:QIvTXsYzAGvmiqetm7k0t13YS/+8Xyq8YfJcK3szAz8 (ECDSA)
SSH host key: SHA256:fpwcG3CIb+9xE+2ain6NfLkwK4zhU6qQobTXITNLXKE (ED25519)
SSH host key: SHA256:M/Ykj2Cj73CPDBdWrQOCddfbGF1ODYe5Ueiz6vhk15g (RSA)
ens3: 10.0.2.15 fec0::5054:ff:fe12:3456
ibm-p8-kvm-03-guest-02 login: core
Password: 
Red Hat Enterprise Linux CoreOS 46.82.202007071437-0
  Part of OpenShift 4.6, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.6/architecture/architecture-rhcos.html

---
[core@ibm-p8-kvm-03-guest-02 ~]$ rpm-ostree status
State: idle
Deployments:
* ostree://a36f1c5bfdb54f8f9a8164688bd219fcfe983b8c173e7fd5365269692cd0dbe4
                   Version: 46.82.202007071437-0 (2020-07-07T14:41:41Z)
uegen.service vm-03-guest-02 ~]$ systemctl cat console-login-helper-messages-issuegen.service
# /usr/lib/systemd/system/console-login-helper-messages-issuegen.service
[Unit]
Description=Generate /run/issue.d/console-login-helper-messages.issue
Documentation=https://github.com/coreos/console-login-helper-messages
Before=systemd-user-sessions.service
After=sshd-keygen.target

[Service]
Type=oneshot
# This service expects to be started repeatedly after exiting (whenever a new
# runtime issue snippet is dropped into the directory watched by
# `console-login-helper-messages-issuegen.path`). `RemainAfterExit` is thus
# explicitly set to `no`, so that the service can re-execute (otherwise,
# `start` on this service after it has exited once before does not have any
# effect).
RemainAfterExit=no
ExecStart=/usr/libexec/console-login-helper-messages/issuegen

[Install]
WantedBy=multi-user.target

Comment 23 errata-xmlrpc 2020-10-27 16:06:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196