Description of problem:
PXE based bare metal install onto T440s laptop is incredibly slow and almost unusable

Version-Release number of the following components:
openshift-install 4.4.3
Client Version: 4.4.3
rhcos-4.4.3-x86_64-installer-initramfs.x86_64.img
rhcos-4.4.3-x86_64-installer-kernel-x86_64

How reproducible:

Steps to Reproduce:
1. UEFI PXE Boot T440s
2. Grub picks up correct kernel and initramfs
3. After "input: TPPS/2" boot entry boot process slows to a crawl

Actual results:
I will attach a screen shot of the boot process

Expected results:
I can UEFI PXE boot the primary NUC devices in my lab without issue. The T440s was to be used as an additional worker node.

Additional info:
It appears that the initial install will complete if I remove the following from the PXE GRUB config: console=tty0 console=ttyS0. Likewise, when I first boot the laptop with the initial RAW metal CoreOS image, I again need to remove the same options from GRUB, or the boot slows to a total crawl.
Further checking has now confirmed that the entry I need to remove is console=ttyS0.
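For anyone scripting this workaround, here is a minimal sketch of dropping the serial console argument from a kernel command line string. The function name `strip_serial_console` is hypothetical and not part of any installer tooling; it only edits text and leaves it to you to write the result back into your PXE or GRUB config.

```python
def strip_serial_console(cmdline: str, device: str = "ttyS0") -> str:
    """Return cmdline with every console=<device>[,options] argument removed."""
    kept = []
    for arg in cmdline.split():
        key, _, value = arg.partition("=")
        # The value may carry options after a comma, e.g. "ttyS0,115200n8".
        if key == "console" and value.split(",", 1)[0] == device:
            continue
        kept.append(arg)
    return " ".join(kept)


if __name__ == "__main__":
    # Kernel arguments in the style of the PXE entries used in this bug.
    args = "rd.neednet=1 console=tty0 console=ttyS0 coreos.inst=yes ip=dhcp"
    print(strip_serial_console(args))
```

Other console= entries, such as console=tty0, are left untouched, so the local display keeps working.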
This might be related to entropy: https://bugzilla.redhat.com/show_bug.cgi?id=1778762 Try adding `random.trust_cpu=on` to the kernel cmdline?
This doesn't appear to be related to openshift-installer or baremetal IPI platform.
Marking as low priority and targeting for 4.6; while we would like to support RHCOS + OCP on all kinds of hardware, running on a laptop isn't a primary target for us. Let us know if the suggestion from comment #6 helps the situation.
I think you usually want console=ttyS0,115200n8 because without the speed set it may be much slower. That's what we have in the default kernel arguments. Though probably on physical hardware like this we should have an obvious option to remove it via coreos-installer, since it's not necessary. So...if removing it from your PXE config fixes it, do we need this bug?
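To put rough numbers on the speed point, here is a back-of-the-envelope sketch. It assumes the kernel falls back to 9600n8 when no options follow console=ttyS0 (the default named in the kernel's serial console documentation) and that console output is written synchronously. The 200,000-byte log size is a hypothetical figure for illustration, not a measurement from this bug.

```python
BITS_PER_BYTE_8N1 = 10  # 1 start bit + 8 data bits + 1 stop bit, no parity


def seconds_to_send(num_bytes: int, baud: int) -> float:
    """Time needed to push num_bytes through an 8N1 serial line at the given baud rate."""
    return num_bytes * BITS_PER_BYTE_8N1 / baud


if __name__ == "__main__":
    log_bytes = 200_000  # hypothetical amount of boot output
    for baud in (9600, 115200):
        print(f"{baud:>6} baud: {seconds_to_send(log_bytes, baud):.1f} s")
```

Under these assumptions the same output takes 12 times longer at 9600 baud than at 115200. This only covers the baud-rate effect; it would not by itself explain a slowdown that still occurs with console=ttyS0,115200n8.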
Colin - I've been doing some more testing with OCP 4.4 and 4.5-rc1. This doesn't just impact the bootstrap over PXE. The default GRUB config also has "console=ttyS0,115200n8", which means that once we've pulled the bare metal image onto the hardware we hit the same issue again. At present, on each reboot I have to manually remove the value from the GRUB config, and when IGN completes the setup of the node I can SSH in and remove it under /boot/.... I'm surprised no one else has ever hit this with hardware testing. Having a dig around, it appears this issue occurs with other Linux platforms: https://github.com/cloud-hypervisor/cloud-hypervisor/issues/163 Not sure how to progress this further.
I believe there may be a real bug here, but currently this is a "Singleton" issue: only one known instance. You can help us by doing PXE boots of Fedora CoreOS, traditional RHEL, Debian/Ubuntu, or whatever, and gathering data as to whether this is somehow truly specific to RHCOS or whether it's hardware related.
New tests with Fedora CoreOS and RHEL8

menuentry 'Fedora CoreOS tftp dhcp debug' {
    linuxefi pxelinux/coreos/fedora-coreos-31.20200127.3.0-live-kernel-x86_64 rd.neednet=1 console=tty0 console=ttyS0 coreos.inst=yes coreos.inst.install_dev=sda ip=dhcp
    initrdefi pxelinux/coreos/fedora-coreos-31.20200127.3.0-live-initramfs.x86_64.img
}

Still hangs/delay during boot. If I remove console=ttyS0 boot proceeds normally

For RHEL8 I'm using the vmlinuz/initrd from the RHEL 8.2 ISO

menuentry 'RHEL 8 T440s serial console tftp' {
    set root=(http,10.1.10.9)
    linuxefi /repo/rhel8/images/pxeboot/vmlinuz rd.neednet=1 console=tty0 console=ttyS0 inst.stage2=http://10.1.10.9/repo/rhel8/ ip=dhcp
    initrdefi /repo/rhel8/images/pxeboot/initrd.img
}

This also experiences the same issues on boot if I have a console=ttyS0 entry
Moving this to https://github.com/coreos/fedora-coreos-tracker/issues/567
(In reply to Micah Abbott from comment #8)
> Marking as low priority and targeting for 4.6; while we would like to
> support RHCOS + OCP on all kinds of hardware, running on a laptop isn't a
> primary target for us. Let us know if the suggestion from comment #6
> helps the situation.

I see this behavior also on "real" servers (Lenovo x3850 X6), so I wouldn't say it's only happening on laptop HW and therefore low prio.
I would advocate raising the priority of this defect. It also affected my ASUS PN50 (mini PC) SNO install. This mini PC runs the latest Fedora, Fedora CoreOS, and Silverblue without any issues, but every OpenShift SNO Assisted Installer deployment failed with the default serial console settings, because the node got into a state where it could not reply to the Assisted Installer in a timely fashion. I discovered from pcfe's blog that this bz is the root cause of the failure: https://blog.pcfe.net/hugo/posts/2021-08-22-asus-pn50-ocp4-worker/ Manually intercepting the CoreOS boot and removing the serial console from the kernel args allowed SNO to install properly, after months of failures with OCP 4.10 and OCP 4.11. The good news is that OCP 4.12 removes these serial console boot args, so SNO installs should work fine, but the underlying cause of the slow boot on these broken machines remains a mystery. I can make this machine available for diagnostics and troubleshooting; please advise if there is a minimal set of logs and troubleshooting steps needed to isolate the root cause.