1443690 – PXELinux BIOS reboot loop on Ubuntu 16.04

Summary: PXELinux BIOS reboot loop on Ubuntu 16.04

Keywords:
Status:	CLOSED CANTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	qemu-kvm
Sub Component:
Version:	7.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	low
Target Milestone:	pre-dev-freeze
Target Release:	---
Assignee:	Ladi Prosek
QA Contact:	Chao Yang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-04-19 18:20 UTC by Yamakasi
Modified:	2021-06-10 12:13 UTC (History)
CC List:	14 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-08-31 07:10:46 UTC
Target Upstream Version:
Embargoed:

Attachments
PXE Menu (3.75 KB, image/png) 2017-04-19 18:20 UTC, Yamakasi

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1421656	0	medium	CLOSED	Host is stuck after provisioning on rhv 4.X	2023-09-15 00:01:16 UTC

Internal Links: 1421656

Description Yamakasi 2017-04-19 18:20:12 UTC

Created attachment 1272701
PXE Menu

Description of problem:

After I have successfully built a host using Foreman, it gets the PXE
menu shown in the attachment.

With the first option selected, this menu gets into a reboot loop:

Chainload into bootloader on the first disk



When I manually switch to:

Chainload into bootloader on the first disk - alternative

the host boots as normal.


Why does this suddenly happen on Debian/Ubuntu and not on Fedora/CentOS?


Version-Release number of selected component (if applicable):

oVirt Engine Version: 4.1.1.8-1.el7.centos
Foreman Version 1.14.2
vdsm-4.18.13-1.el7.centos


How reproducible:

Provision a VM from Foreman

Steps to Reproduce:
1. Create VM on Foreman
2. Build it
3. Check the console for booting after a successful build

Actual results:

VM keeps rebooting after PXE menu finished counting down to 0

Expected results:

Booting VM

Additional info:

Comment 1 Yaniv Kaul 2017-04-20 06:37:28 UTC

That sounds like a QEMU issue, not oVirt.
In any case, please provide logs. Specifically, vdsm.log will let us know what the libvirt command line is (whether the disk is bootable, etc.).

If it's guest OS specific, perhaps it's a Foreman / deployment issue?

Comment 2 Michal Skrivanek 2017-04-20 07:30:26 UTC

the PXE menu is provided by Foreman, isn't it?
Does the Ubuntu VM have multiple disks? Is the first one set as "bootable" in oVirt GUI?

Comment 3 Yamakasi 2017-04-20 09:11:31 UTC

Yes the menu is provided by Foreman.

The VM has one disk and is bootable (enabled).

I will check the logs and post them later on. Any info I can check more would be great.

Comment 4 Yaniv Kaul 2017-04-20 09:20:59 UTC

Can you check your configuration with the Foreman community?

Comment 5 Yamakasi 2017-04-20 09:33:33 UTC

I already checked it with them and discussed it with Lukas Zapletal. He told me that they have already seen this a couple of times on oVirt as well but didn't report it upstream. Earlier this was working fine, and Foreman has not been updated in the meantime.

Comment 6 Yaniv Kaul 2017-04-20 09:42:45 UTC

(In reply to Yamakasi from comment #5)
> I already checked it with them and discussed it with Lukas Zapletal. He told
> me that they have already seen this a couple of times on oVirt as well but
> didn't report it upstream. Earlier this was working fine, and Foreman has not
> been updated in the meantime.

Perhaps it's worthwhile reporting it upstream on Foreman, someone may know the issue better?

Comment 7 Yamakasi 2017-04-20 09:59:33 UTC

They point back to oVirt, saying it should be a QEMU/libvirt issue; they have also seen it on KVM. To be honest, I think it's on their side as well, but since the alternative way works and they think it's an oVirt issue, they don't bother.

This is the PXE menu they create for a normal (previously working) PXELinux BIOS setup.

The alternative entry boots normally when selected manually.

DEFAULT menu
PROMPT 0
MENU TITLE PXE Menu
TIMEOUT 200
TOTALTIMEOUT 6000
ONTIMEOUT local


LABEL local
  MENU LABEL Chainload into bootloader on the first disk
  MENU DEFAULT
  LOCALBOOT 0

LABEL local_legacy
  MENU LABEL Chainload into bootloader on the first disk - alternative
  COM32 chain.c32
  APPEND hd0
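
To check which entry a menu like the one above falls through to on timeout, a small script can help. This is a minimal Python sketch, not a Foreman or SYSLINUX tool: the function name timeout_entry_directives is hypothetical, and it assumes a flat configuration with whitespace-separated keywords and no INCLUDE directives.

```python
def timeout_entry_directives(config_text):
    """Return (target, directives) for the LABEL block that ONTIMEOUT names.

    Hypothetical helper: handles only flat PXELinux configs, no INCLUDE.
    """
    target = None
    blocks = {}
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        keyword = parts[0].upper()
        rest = parts[1].strip() if len(parts) > 1 else ""
        if keyword == "ONTIMEOUT":
            target = rest
        elif keyword == "LABEL":
            # Start collecting directives for a new menu entry.
            current = rest
            blocks[current] = []
        elif current is not None:
            blocks[current].append(line)
    return target, blocks.get(target, [])
```

If the returned directives contain LOCALBOOT, the timeout path relies on the firmware's boot order rather than on chain.c32.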

Comment 8 Lukas Zapletal 2017-04-20 10:43:03 UTC

Hey guys, let me jump in. Several Foreman users have already reported that chainloading via LOCALBOOT no longer works in oVirt. It's a regression, but I am not able to tell from which version it happens. I was not able to reproduce it with libvirt myself; I only have oVirt 3.5, which was working fine.

Now, we found that COM32 chainbooting does work in these cases, and that is the workaround we tell users to use for now. But we would like someone from the virt group to take a look and tell us why this regressed.

If you say it's something in QEMU, let's just flip this to the correct RHEL BZ component and see what the guys are gonna tell us. Thanks for the help!

Comment 10 Yamakasi 2017-04-20 13:25:26 UTC

The temporary fix (provided by Lukas Zapletal) is:

Change the Foreman provisioning template: 

pxelinux_default_local_boot.erb

...
TOTALTIMEOUT 6000
ONTIMEOUT local
...

To:

...
TOTALTIMEOUT 6000
ONTIMEOUT local_legacy
...


Then it boots normally.
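
The same change can be scripted for a rendered configuration. This is a minimal Python sketch, not part of Foreman: the function name set_ontimeout is hypothetical, and it assumes plain PXELinux configuration text in which the new target label already exists.

```python
import re


def set_ontimeout(config_text, label):
    """Return config_text with every ONTIMEOUT target replaced by `label`.

    Hypothetical helper; it does not verify that `label` is defined.
    """
    pattern = re.compile(r"^([ \t]*ONTIMEOUT[ \t]+)\S.*$",
                         flags=re.MULTILINE | re.IGNORECASE)
    # Keep the original keyword and indentation, swap only the target.
    return pattern.sub(lambda m: m.group(1) + label, config_text)
```

For example, set_ontimeout(text, "local_legacy") switches the timeout entry to the chain.c32 variant and leaves the TIMEOUT and TOTALTIMEOUT lines untouched.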

Comment 12 Lukas Zapletal 2017-04-21 08:53:42 UTC

Yamakasi, can you please provide us details of your VM that was not booting?

virsh dumpxml vm_name

Also provide the versions of libvirt and the other components:

rpm -qa | egrep 'virt|kvm|qemu|bios'

I am shooting in the dark with this. Michal, do you need more information? I am not very familiar with RHEV. I assume we are interested in the chipset and BIOS of the VM, which libvirt should provide.

Comment 13 Michal Skrivanek 2017-04-21 08:58:47 UTC

Ideally yes, as I don't expect this has much to do with oVirt. If we can reproduce it in a simple way with plain QEMU or without Foreman, then we can isolate this to the right component.

Comment 14 Yamakasi 2017-04-21 09:17:37 UTC

I have emailed Yaniv the start details of the VM. As there were hostnames in it, I emailed it to him personally. Yaniv, can you check those?

I need to create another test VM, as I'm already moving further with the one I was seeing this on.

Comment 15 Lukas Zapletal 2017-04-21 09:29:12 UTC

The reproducer should be easy: create a VM with PXE configuration and put this into pxelinux.cfg/default:

DEFAULT menu
PROMPT 0
MENU TITLE PXE Menu
TIMEOUT 200
TOTALTIMEOUT 6000
ONTIMEOUT local

LABEL local
  MENU LABEL Chainload into bootloader on the first disk
  MENU DEFAULT
  LOCALBOOT 0

If this is a problem, I could create an ISO file that would simulate this behavior without PXE.

Comment 16 Yamakasi 2017-04-21 10:17:28 UTC

As described, that is the problem, since local_legacy fixes it.

Comment 17 Yaniv Kaul 2017-05-08 11:30:23 UTC

Looking at this, I don't see how oVirt is involved in this. Looks like a QEMU/Ubuntu issue to me.

Comment 18 Lukas Zapletal 2017-05-09 11:04:09 UTC

Reassigning to platform, component qemu.

Can you guys take a look at this BZ and tell us what can be wrong? This is a regression we see in PXELinux (SYSLINUX). In short, the LOCALBOOT option no longer works with QEMU from RHEL7 or RHEV4. This is known to be buggy with some hardware, but it was working in virt environments previously:

http://www.syslinux.org/wiki/index.php?title=Hardware_Compatibility#LOCALBOOT

Comment 20 Ladi Prosek 2017-05-17 14:12:23 UTC

I have tested this on a few QEMU and iPXE builds and it worked everywhere. Can you guys give me version information and VM details as requested in comment #12? Thanks!

Comment 21 Ladi Prosek 2017-05-17 14:18:35 UTC

(In reply to Ladi Prosek from comment #20)
> I have tested this on a few QEMU and iPXE builds and it worked everywhere.
> Can you guys give me version information and VM details as requested in
> comment #12? Thanks!

Actually - is iPXE involved in this at all? In my setup I let a VM boot into iPXE and serve pxelinux from a DHCP/TFTP server. Please provide as much information as possible, ideally a VM disk image + QEMU command line. Thanks!

Comment 22 Lukas Zapletal 2017-05-17 15:15:34 UTC

Hey, iPXE is indeed not involved at all; this is a PXELinux issue and its LOCALBOOT statement compatibility. It's part of the SYSLINUX package in RHEL.

Comment 23 Ladi Prosek 2017-05-17 15:32:05 UTC

(In reply to Lukas Zapletal from comment #22)
> Hey, iPXE is indeed not involved at all; this is a PXELinux issue and its
> LOCALBOOT statement compatibility. It's part of the SYSLINUX package in RHEL.

Right - the canonical way of loading pxelinux seems to be DHCP / over network. What exactly does the VM boot into and how does it load pxelinux?

Comment 24 Lukas Zapletal 2017-05-17 15:58:23 UTC

So the use case is that our customers do keep PXE configuration with LOCALBOOT option (see above) and keep BIOS/UEFI settings to boot from network. Provisioned servers boot from local drive (that's where we have the regression) and servers can be easily scheduled for re-provisioning by PXE configuration changes.

To reproduce the issue, create a VM on QEMU matching this oVirt version: 4.1.1.8-1.el7.centos (vdsm-4.18.13-1.el7.centos) and boot it from network with the following configuration:

DEFAULT menu
PROMPT 0
MENU TITLE PXE Menu
TIMEOUT 200
TOTALTIMEOUT 6000
ONTIMEOUT local
LABEL local
  MENU LABEL Chainload into bootloader on the first disk
  MENU DEFAULT
  LOCALBOOT 0

I randomly encountered this issue on various QEMU/KVM versions. If you can't tell right away what can be wrong, I can try to reproduce on Fedora or RHEL7 libvirt.

Comment 25 Ladi Prosek 2017-05-17 17:39:42 UTC

(In reply to Lukas Zapletal from comment #24)
> So the use case is that our customers do keep PXE configuration with
> LOCALBOOT option (see above) and keep BIOS/UEFI settings to boot from
> network. Provisioned servers boot from local drive (that's where we have the
> regression) and servers can be easily scheduled for re-provisioning by PXE
> configuration changes.

Makes sense and understood.
 
> To reproduce the issue, create a VM on QEMU matching this oVirt version:
> 4.1.1.8-1.el7.centos (vdsm-4.18.13-1.el7.centos) and boot it from network
> with the following configuration:

I'm confused - how come that iPXE is not involved if the VM boots from network? Is this really different from what I described in comment 21?

I would still prefer to know versions of all virt components (kernel and up), host cpu details (output of /proc/cpuinfo) and the QEMU command-line if possible.

Thanks!

> DEFAULT menu
> PROMPT 0
> MENU TITLE PXE Menu
> TIMEOUT 200
> TOTALTIMEOUT 6000
> ONTIMEOUT local
> LABEL local
>   MENU LABEL Chainload into bootloader on the first disk
>   MENU DEFAULT
>   LOCALBOOT 0
> 
> I randomly encountered this issue on various QEMU/KVM versions. If you can't
> tell right away what can be wrong, I can try to reproduce on Fedora or RHEL7
> libvirt.

Comment 26 Ladi Prosek 2017-05-18 08:36:56 UTC

I looked closer at what 'LOCALBOOT n' does. It makes pxelinux end its execution and return to its caller with AX=n.

https://git.kernel.org/pub/scm/boot/syslinux/syslinux.git/tree/core/pxeboot.c

The documentation says that 4 and 5 have special meaning, corresponding to PXENV_STATUS_KEEP_UNDI and PXENV_STATUS_KEEP_ALL in the PXE spec.

Unless you prove me wrong, the caller in our case is iPXE, which treats the return value as an error code - 0 is success, anything else is failure. See pxe_start_nbp in:

https://git.ipxe.org/ipxe.git/blob/HEAD:/src/arch/x86/interface/pxe/pxe_call.c#l368

In either case, iPXE will continue booting from other network interfaces or executing the iPXE script if available. If it doesn't succeed it will hand execution back to BIOS which will continue going through its list of boot devices.


So 'LOCALBOOT n' does not really guarantee that anything local will be booted. It merely says: I'm exiting and will let the machine firmware do its thing. This is in contrast with the alternative method of 'COM32 chain.c32, APPEND hd0' which actually makes pxelinux read the boot sector from disk and execute it.


Now as to why 'LOCALBOOT n' would stop working, the most obvious explanation would be that the BIOS boot sequence is not correct. If I remove the hard drive from my boot sequence, 'LOCALBOOT n' ultimately leads to 'No bootable devices.' printed by SeaBIOS.

It could also be a bug in the virt stack somewhere. We recently saw a bug in instruction emulation in KVM which reproduced only with iPXE (it exercises a lot of edge cases due to its use of real mode, long mode, protected mode and everything in between). Please provide at least a QEMU command line. Thanks!
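
The difference between the two menu entries can be illustrated with a toy model. This is a Python sketch under simplifying assumptions, not SeaBIOS, iPXE or pxelinux code: the function name, the device names and the action strings are all hypothetical, and the firmware is assumed to simply walk its boot list in order.

```python
def firmware_boot(boot_order, pxe_action):
    """Toy model of a BIOS walking its boot list.

    boot_order: device names in priority order, e.g. ["network", "hd0"].
    pxe_action: "localboot" (pxelinux just returns to its caller) or
                "chain_hd0" (pxelinux loads the disk boot sector itself).
    Returns a description of what ends up booting, or None if nothing does.
    """
    for device in boot_order:
        if device == "network":
            if pxe_action == "chain_hd0":
                # chain.c32 reads the boot sector directly, so the BIOS
                # boot order no longer matters.
                return "hd0 (via chain.c32)"
            # LOCALBOOT: control falls back to the firmware, which moves
            # on to the next entry in its list.
            continue
        if device == "hd0":
            return "hd0 (via BIOS fallback)"
    return None  # corresponds to "No bootable devices."
```

In this model, firmware_boot(["network"], "localboot") returns None while firmware_boot(["network"], "chain_hd0") still reaches the disk, which matches the behaviour described above.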

Comment 27 Lukas Zapletal 2017-05-18 13:45:46 UTC

Ladi,

although Foreman does support iPXE configurations and direct HTTP network boots, what most of our users do is iPXE chainbooting into PXELinux, and that is the case here.

I reproduced this with PXELinux from F25 (syslinux-tftpboot-6.04-0.1.fc25.noarch) and from CentOS 7.4 using clients CentOS 7.4 and Debian 8. It won't boot.

When testing with libvirt, make sure you only check the PXE device for booting and deselect other devices, because the fallback mechanism in BIOS would boot it after failure.

Comment 28 Ladi Prosek 2017-05-18 13:54:34 UTC

(In reply to Lukas Zapletal from comment #27)
> Ladi,
> 
> although Foreman does support iPXE configurations and direct HTTP network
> boots, what most of our users do is iPXE chainbooting into PXELinux, and that
> is the case here.

Got it. I think that we mean the same thing. Note however that even if iPXE chainboots into PXELinux, it is still "on the stack" and will continue running its code after PXELinux exits.
 
> I reproduced this with PXELinux from F25
> (syslinux-tftpboot-6.04-0.1.fc25.noarch) and from CentOS 7.4 using clients
> CentOS 7.4 and Debian 8. It won't boot.
> 
> When testing with libvirt, make sure you only check the PXE device for booting
> and deselect other devices, because the fallback mechanism in BIOS would boot it
> after failure.

Wait, if you deselect other devices, it sure won't boot into them using the 'LOCALBOOT n' command. That's the gist of comment 26 -- 'LOCALBOOT n' *depends* on BIOS boot fallback.

Comment 30 Ladi Prosek 2017-08-31 07:10:46 UTC

Closing as CANTFIX. Per comment 26, 'LOCALBOOT n' doesn't work if the BIOS boot sequence is not set up correctly.


Part of the confusion may come from the fact that 'LOCALBOOT n' has two different implementations in the SYSLinux code base, one used in SYSLinux:

http://repo.or.cz/syslinux.git/blob/HEAD:/core/localboot.c

and the other one in PXELinux:

http://repo.or.cz/syslinux.git/blob/HEAD:/core/pxeboot.c

The former reads the disk boot sector, the latter just returns.

Comment 31 Lukas Zapletal 2018-03-19 13:00:58 UTC

Thanks for the help. For the record:

http://www.syslinux.org/wiki/index.php?title=SYSLINUX#LOCALBOOT_type

http://www.syslinux.org/wiki/index.php?title=Hardware_Compatibility#LOCALBOOT

Looks like LOCALBOOT has problems on some hardware. We might want to default to chain.c32 in Satellite.
