Bug 1839923 - Very slow initial boot of rhcos installer kernel and initramfs on T440s laptop
Summary: Very slow initial boot of rhcos installer kernel and initramfs on T440s laptop
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.4
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Colin Walters
QA Contact: Michael Nguyen
URL:
Whiteboard: Singleton
Depends On:
Blocks:
Reported: 2020-05-26 03:04 UTC by Steven Ellis
Modified: 2023-01-19 15:17 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-10 13:55:33 UTC
Target Upstream Version:
Embargoed:


Description Steven Ellis 2020-05-26 03:04:57 UTC
Description of problem:

A PXE-based bare-metal install onto a T440s laptop is incredibly slow and almost unusable.

Version-Release number of the following components:
openshift-install 4.4.3
Client Version: 4.4.3

rhcos-4.4.3-x86_64-installer-initramfs.x86_64.img
rhcos-4.4.3-x86_64-installer-kernel-x86_64

How reproducible:

Steps to Reproduce:
1. UEFI PXE Boot T440s
2. Grub picks up correct kernel and initramfs
3. After the "input: TPPS/2" boot message, the boot process slows to a crawl

Actual results:

I will attach a screenshot of the boot process.

Expected results:

I can UEFI PXE boot the primary NUC devices in my lab without issue. The T440s was to be used as an additional worker node.

Additional info:

Comment 3 Steven Ellis 2020-05-26 04:50:08 UTC
It appears that the initial install will complete if I remove the following from the PXE GRUB config:

console=tty0 console=ttyS0

Likewise, when I first boot the laptop with the initial raw metal CoreOS image, I again need to remove the same options from GRUB or the boot slows to a total crawl.

Comment 4 Steven Ellis 2020-05-26 04:53:14 UTC
Further checking has now confirmed that the entry I needed to remove is

console=ttyS0

Comment 6 Colin Walters 2020-05-26 13:18:59 UTC
This might be related to entropy:
https://bugzilla.redhat.com/show_bug.cgi?id=1778762
Try adding `random.trust_cpu=on` to the kernel cmdline?

Comment 7 Stephen Benjamin 2020-05-26 15:55:01 UTC
This doesn't appear to be related to openshift-installer or baremetal IPI platform.

Comment 8 Micah Abbott 2020-05-26 19:57:05 UTC
Marking as low priority and targeting for 4.6; while we would like to support RHCOS + OCP on all kinds of hardware, running on a laptop isn't a primary target for us.  Let us know if the suggestion from comment #6 helps the situation.

Comment 10 Colin Walters 2020-06-10 23:00:00 UTC
I think you usually want
console=ttyS0,115200n8
because without the speed set it may be much slower.  That's what we have in the default kernel arguments.

Though probably on physical hardware like this we should have an obvious option to remove it via coreos-installer, since it's not necessary.

So...if removing it from your PXE config fixes it, do we need this bug?
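As a rough illustration of why the speed setting matters, here is a minimal sketch that parses a `console=` argument and estimates serial throughput. It assumes the 9600n8 default that the kernel's serial-console documentation gives when no options follow the device name, and it assumes one start bit and one stop bit per character. The names `SerialConsole`, `parse_console`, and `chars_per_second` are hypothetical helpers, not part of any tool mentioned here, and this is arithmetic, not a diagnosis of this bug.

```python
import re
from typing import NamedTuple


class SerialConsole(NamedTuple):
    device: str
    baud: int
    parity: str      # "n" (none), "o" (odd) or "e" (even)
    data_bits: int


# Assumption: default used when "console=ttyS0" carries no options.
DEFAULT_OPTIONS = "9600n8"

# Baud rate, optional parity letter, optional data-bit count.
# A trailing flow-control flag such as "r" is ignored by this sketch.
_OPTIONS_RE = re.compile(r"^(\d+)([noe])?(\d)?")


def parse_console(arg: str) -> SerialConsole:
    """Parse e.g. 'console=ttyS0,115200n8' into its parts."""
    value = arg.split("=", 1)[1]
    device, _, options = value.partition(",")
    match = _OPTIONS_RE.match(options or DEFAULT_OPTIONS)
    if match is None:
        raise ValueError("unrecognised console options: %r" % options)
    baud = int(match.group(1))
    parity = match.group(2) or "n"
    data_bits = int(match.group(3) or 8)
    return SerialConsole(device, baud, parity, data_bits)


def chars_per_second(console: SerialConsole) -> int:
    """Characters per second, assuming 1 start bit and 1 stop bit."""
    parity_bits = 0 if console.parity == "n" else 1
    frame_bits = 1 + console.data_bits + parity_bits + 1
    return console.baud // frame_bits


if __name__ == "__main__":
    for arg in ("console=ttyS0", "console=ttyS0,115200n8"):
        print(arg, chars_per_second(parse_console(arg)), "chars/s")
```

Under these assumptions a console without a speed manages about 960 characters per second against about 11520 at 115200n8, a factor of twelve for every line of boot output written synchronously to the port.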

Comment 11 Steven Ellis 2020-06-16 07:10:17 UTC
Colin - I've been doing some more testing with OCP 4.4 and 4.5-rc1

This doesn't just impact the bootstrap over PXE. The default GRUB config also has "console=ttyS0,115200n8" which means once we've pulled the bare metal image onto the hardware we hit the same issue again.

At present, on each reboot I have to manually remove the value from the GRUB config; once Ignition completes the setup of the node, I can SSH in and remove it under /boot/....
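To check whether a node that has already booted still carries the serial console argument, something like the following sketch could be run over SSH. It only reads `/proc/cmdline`, the running kernel's command line; the helper name `serial_consoles` is hypothetical, and the simple whitespace split assumes no quoted arguments containing spaces.

```python
from pathlib import Path
from typing import List


def serial_consoles(cmdline: str) -> List[str]:
    """Return every console=ttyS<N>[,options] argument found in cmdline."""
    return [arg for arg in cmdline.split() if arg.startswith("console=ttyS")]


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():  # only present on Linux
        found = serial_consoles(proc_cmdline.read_text())
        print(found if found else "no serial console argument on this boot")
```

An empty result after a reboot would suggest the manual edit has been made persistent; a non-empty one means the argument came back from the on-disk boot configuration.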

I'm surprised no one else has ever hit this with hardware testing.

Having a dig around, it appears this issue occurs with other Linux platforms
 - https://github.com/cloud-hypervisor/cloud-hypervisor/issues/163

Not sure how to progress this further

Comment 12 Colin Walters 2020-06-17 17:41:21 UTC
I believe there may be a real bug here but currently this is a "Singleton" issue - only one known instance.

You can help us by doing PXE boots of Fedora CoreOS, traditional RHEL, Debian/Ubuntu, or whatever, and gathering data as to whether this is somehow truly specific to RHCOS or whether it's hardware related.

Comment 14 Steven Ellis 2020-06-25 22:02:41 UTC
New tests with Fedora CoreOS and RHEL8


menuentry 'Fedora CoreOS tftp dhcp debug'  {

  linuxefi pxelinux/coreos/fedora-coreos-31.20200127.3.0-live-kernel-x86_64 rd.neednet=1 console=tty0 console=ttyS0 coreos.inst=yes coreos.inst.install_dev=sda ip=dhcp
  initrdefi pxelinux/coreos/fedora-coreos-31.20200127.3.0-live-initramfs.x86_64.img

}

Still hangs/delays during boot.

If I remove console=ttyS0, boot proceeds normally.


For RHEL8 I'm using the vmlinuz/initrd from the RHEL 8.2 ISO

menuentry 'RHEL 8 T440s serial console tftp' {
  set root=(http,10.1.10.9)

  linuxefi /repo/rhel8/images/pxeboot/vmlinuz rd.neednet=1 console=tty0 console=ttyS0 inst.stage2=http://10.1.10.9/repo/rhel8/  ip=dhcp
  initrdefi /repo/rhel8/images/pxeboot/initrd.img
}

This also experiences the same issues on boot if I have a console=ttyS0 entry.
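For repeating this A/B comparison across several menu entries, the serial console argument can be stripped from a GRUB config mechanically. The following is a minimal sketch under the assumption that kernel arguments sit on `linux` or `linuxefi` lines as in the entries above; `drop_serial_console` is a hypothetical helper, and the result should be reviewed before it replaces a real config.

```python
import re

# Lines that carry kernel arguments in the menu entries shown above.
_BOOT_LINE = re.compile(r"^\s*(linux|linuxefi)\s")

# A console=ttyS<N> argument with optional ",options", plus the
# whitespace in front of it, so no double spaces are left behind.
_SERIAL_ARG = re.compile(r"\s+console=ttyS\d+(?:,\S*)?(?=\s|$)")


def drop_serial_console(grub_cfg: str) -> str:
    """Return grub_cfg with serial console arguments removed.

    Only linux/linuxefi lines are touched; a trailing newline on the
    input is not preserved.
    """
    out = []
    for line in grub_cfg.splitlines():
        if _BOOT_LINE.match(line):
            line = _SERIAL_ARG.sub("", line)
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    entry = (
        "menuentry 'example' {\n"
        "  linuxefi /vmlinuz rd.neednet=1 console=tty0 console=ttyS0 ip=dhcp\n"
        "  initrdefi /initrd.img\n"
        "}"
    )
    print(drop_serial_console(entry))
```

The `console=tty0` argument is left alone, so output still reaches the local display.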

Comment 15 Colin Walters 2020-07-10 13:55:33 UTC
Moving this to https://github.com/coreos/fedora-coreos-tracker/issues/567

Comment 16 Nils Koenig 2022-06-30 09:55:18 UTC
(In reply to Micah Abbott from comment #8)
> Marking as low priority and targeting for 4.6; while we would like to
> support RHCOS + OCP on all kinds of hardware, running on a laptop isn't a
> primary target for us.  Let us know if the suggestion from comment #6
> helps the situation.

I also see this behavior on "real" servers (Lenovo x3850 X6), so I wouldn't say it only happens on laptop hardware and is therefore low priority.

Comment 17 Peter Lauterbach 2023-01-19 15:17:52 UTC
I would advocate raising the priority of this defect. This also affected my ASUS PN50 (mini PC) SNO install.
This mini PC runs the latest Fedora, Fedora CoreOS, and Silverblue without any issues. But every OpenShift SNO Assisted Installer deployment failed with the default serial console settings, because the node got into a state where it could not reply to the Assisted Installer in a timely fashion.

I discovered from pcfe's blog that this bz is the root cause of the failure.
https://blog.pcfe.net/hugo/posts/2021-08-22-asus-pn50-ocp4-worker/

Manually intercepting the CoreOS boot and removing the serial console from the kernel args allowed SNO to install properly, after months of failures with OCP 4.10 and OCP 4.11.

The good news is that OCP 4.12 removes these serial console boot args, so SNO installs should work fine, but the underlying cause of the slow boot on these broken machines remains a mystery.

I can make this machine available for diagnostics and troubleshooting; please advise if there is a minimal set of logs and troubleshooting steps needed to isolate the root cause.

