Bug 1443690
| Summary: | PXELinux BIOS reboot loop on Ubuntu 16.04 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Yamakasi <yamakasi.014> |
| Component: | qemu-kvm | Assignee: | Ladi Prosek <lprosek> |
| Status: | CLOSED CANTFIX | QA Contact: | Chao Yang <chayang> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.3 | CC: | ajambhul, bugs, chayang, jinzhao, juzhang, knoel, lzap, michal.skrivanek, michen, rbalakri, virt-maint, yamakasi.014, yfu, ykaul |
| Target Milestone: | pre-dev-freeze | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-31 07:10:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

That sounds like a QEMU issue, not oVirt. In any case, please provide logs. Specifically, vdsm.log will tell us what the libvirt command line is (whether the disk is bootable, etc.).

If it's guest-OS specific, perhaps it's a Foreman/deployment issue? The PXE menu is provided by Foreman, isn't it?

Does the Ubuntu VM have multiple disks? Is the first one set as "bootable" in the oVirt GUI?

Yes, the menu is provided by Foreman. The VM has one disk and it is bootable (enabled). I will check the logs and post them later on. Any pointers on what else I can check would be great.

Can you check your configuration with the Foreman community?

I already checked it with them and discussed it with Lukas Zapletal. He told me that they have already seen this a couple of times on oVirt as well but didn't report it upstream. Earlier this was working fine, and Foreman has not been updated in the meantime.

(In reply to Yamakasi from comment #5)
> I already checked it with them and discussed it with Lukas Zapletal. He told
> me that they have already seen this a couple of times on oVirt as well but
> didn't report it upstream. Earlier this was working fine, and Foreman has not
> been updated in the meantime.

Perhaps it's worthwhile reporting it upstream on Foreman; someone may know the issue better?

They refer back to oVirt, since in their view it should be a QEMU/libvirt issue; they have also seen it on KVM. To be honest, I think it's on their side as well, but as the alternative way works and they think it's an oVirt issue, they don't bother. This is the PXE menu they create for a normal (previously working) PXELinux BIOS boot; the alternative entry boots normally when selected manually:

    DEFAULT menu
    PROMPT 0
    MENU TITLE PXE Menu
    TIMEOUT 200
    TOTALTIMEOUT 6000
    ONTIMEOUT local

    LABEL local
      MENU LABEL Chainload into bootloader on the first disk
      MENU DEFAULT
      LOCALBOOT 0

    LABEL local_legacy
      MENU LABEL Chainload into bootloader on the first disk - alternative
      COM32 chain.c32
      APPEND hd0

Hey guys, let me jump in.
There are several Foreman users already reporting that chainloading via LOCALBOOT no longer works in oVirt. It's a regression, but I am not able to tell from which version this happens. I was not able to reproduce it with libvirt myself; I only have oVirt 3.5, which was working fine. Now, we found that COM32 chainbooting does work in these cases, and that is the workaround we tell users to use for now. But we would like someone from the virt group to take a look and tell us why this regressed. If you say it's something in QEMU, let's just flip this to the correct RHEL BZ component and see what those guys tell us. Thanks for the help!

The temporary fix (provided by Lukas Zapletal) is to change the Foreman provisioning template pxelinux_default_local_boot.erb from:

    ...
    TOTALTIMEOUT 6000
    ONTIMEOUT local
    ...

to:

    ...
    TOTALTIMEOUT 6000
    ONTIMEOUT local_legacy
    ...

Then it boots normally.

Yamakasi, can you please provide us with details of the VM that was not booting?

    virsh dumpxml vm_name

Also provide us with the versions of libvirt and other components:

    rpm -qa | egrep 'virt|kvm|qemu|bios'

I am shooting in the dark with this. Michal, do you need more information? I am not very familiar with RHEV. I assume we are interested in the chipset and BIOS of the VM, which libvirt should provide.

Ideally yes. As I don't expect this has much to do with oVirt, if we can reproduce it in a simple way with plain QEMU or without Foreman, then we can isolate this to the right component.

I have emailed Yaniv the start details of the VM. As there were hostnames in them, I emailed them to him personally. Yaniv, can you check those? I need to create another test VM, as I have already moved further along with the one I was seeing this on.
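As an illustration of the template workaround described above, here is a minimal Python sketch of the text change. It is not Foreman code: the function name `use_chainload_fallback` is hypothetical, and it assumes the template contains a plain `ONTIMEOUT local` line as shown in the menu quoted in this thread.

```python
import re


def use_chainload_fallback(template_text: str) -> str:
    """Point ONTIMEOUT at the local_legacy label instead of local.

    Only a line that reads exactly 'ONTIMEOUT local' (ignoring spaces/tabs)
    is rewritten; 'LABEL local' and an existing 'ONTIMEOUT local_legacy'
    are left untouched.
    """
    return re.sub(
        r"^([ \t]*ONTIMEOUT[ \t]+)local[ \t]*$",
        r"\1local_legacy",
        template_text,
        flags=re.MULTILINE,
    )


if __name__ == "__main__":
    # Fragment of the menu quoted in this thread.
    sample = "TOTALTIMEOUT 6000\nONTIMEOUT local\n\nLABEL local\n"
    print(use_chainload_fallback(sample))
```

In practice the template would be edited in the Foreman UI; the sketch only shows which line changes.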
The reproducer should be easy: create a VM with a PXE configuration and put this into pxelinux.cfg/default:

    DEFAULT menu
    PROMPT 0
    MENU TITLE PXE Menu
    TIMEOUT 200
    TOTALTIMEOUT 6000
    ONTIMEOUT local

    LABEL local
      MENU LABEL Chainload into bootloader on the first disk
      MENU DEFAULT
      LOCALBOOT 0

If this is a problem, I could create an ISO file that would simulate this behavior without PXE.

As described, that is the problem, as local_legacy fixes it.

Looking at this, I don't see how oVirt is involved. It looks like a QEMU/Ubuntu issue to me. Reassigning to platform, component qemu.

Can you guys take a look at this BZ and tell us what could be wrong? This is a regression we see in PXELinux (SYSLINUX). In short, the LOCALBOOT option no longer works with QEMU from RHEL7 or RHEV4. This is known to be buggy with some hardware, but it was working in virt environments previously: http://www.syslinux.org/wiki/index.php?title=Hardware_Compatibility#LOCALBOOT

I have tested this on a few QEMU and iPXE builds and it worked everywhere. Can you guys give me the version information and VM details as requested in comment #12? Thanks!

(In reply to Ladi Prosek from comment #20)
> I have tested this on a few QEMU and iPXE builds and it worked everywhere.
> Can you guys give me the version information and VM details as requested in
> comment #12? Thanks!

Actually, is iPXE involved in this at all? In my setup I let a VM boot into iPXE and serve pxelinux from a DHCP/TFTP server. Please provide as much information as possible, ideally a VM disk image + QEMU command line. Thanks!

Hey, iPXE is indeed not involved at all; this is a PXELinux issue and its LOCALBOOT statement compatibility. It's part of the SYSLINUX package in RHEL.

(In reply to Lukas Zapletal from comment #22)
> Hey, iPXE is indeed not involved at all; this is a PXELinux issue and its
> LOCALBOOT statement compatibility. It's part of the SYSLINUX package in RHEL.

Right, the canonical way of loading pxelinux seems to be DHCP / over network.
What exactly does the VM boot into and how does it load pxelinux?

So the use case is that our customers keep a PXE configuration with the LOCALBOOT option (see above) and keep the BIOS/UEFI settings set to boot from network. Provisioned servers boot from the local drive (that's where we have the regression), and servers can easily be scheduled for re-provisioning by PXE configuration changes.

To reproduce the issue, create a VM on QEMU matching this oVirt version: 4.1.1.8-1.el7.centos (vdsm-4.18.13-1.el7.centos) and boot it from network with the following configuration:

    DEFAULT menu
    PROMPT 0
    MENU TITLE PXE Menu
    TIMEOUT 200
    TOTALTIMEOUT 6000
    ONTIMEOUT local

    LABEL local
      MENU LABEL Chainload into bootloader on the first disk
      MENU DEFAULT
      LOCALBOOT 0

I randomly encountered this issue on various QEMU/KVM versions. If you can't tell right away what could be wrong, I can try to reproduce it on Fedora or RHEL7 libvirt.

(In reply to Lukas Zapletal from comment #24)
> So the use case is that our customers keep a PXE configuration with the
> LOCALBOOT option (see above) and keep the BIOS/UEFI settings set to boot from
> network. Provisioned servers boot from the local drive (that's where we have
> the regression), and servers can easily be scheduled for re-provisioning by
> PXE configuration changes.

Makes sense and understood.

> To reproduce the issue, create a VM on QEMU matching this oVirt version:
> 4.1.1.8-1.el7.centos (vdsm-4.18.13-1.el7.centos) and boot it from network
> with the following configuration:

I'm confused: how come iPXE is not involved if the VM boots from network? Is this really different from what I described in comment 21? I would still prefer to know the versions of all virt components (kernel and up), host CPU details (output of /proc/cpuinfo) and the QEMU command line if possible. Thanks!
> DEFAULT menu
> PROMPT 0
> MENU TITLE PXE Menu
> TIMEOUT 200
> TOTALTIMEOUT 6000
> ONTIMEOUT local
> LABEL local
> MENU LABEL Chainload into bootloader on the first disk
> MENU DEFAULT
> LOCALBOOT 0
>
> I randomly encountered this issue on various QEMU/KVM versions. If you can't
> tell right away what could be wrong, I can try to reproduce it on Fedora or
> RHEL7 libvirt.

I looked closer at what 'LOCALBOOT n' does. It makes pxelinux end its execution and return to its caller with AX=n.

https://git.kernel.org/pub/scm/boot/syslinux/syslinux.git/tree/core/pxeboot.c

The documentation says that 4 and 5 have special meaning, corresponding to PXENV_STATUS_KEEP_UNDI and PXENV_STATUS_KEEP_ALL in the PXE spec.

Unless you prove me wrong, the caller in our case is iPXE, which treats the return value as an error code: 0 is success, anything else is failure. See pxe_start_nbp in:

https://git.ipxe.org/ipxe.git/blob/HEAD:/src/arch/x86/interface/pxe/pxe_call.c#l368

In either case, iPXE will continue booting from other network interfaces or executing the iPXE script if available. If it doesn't succeed, it will hand execution back to the BIOS, which will continue going through its list of boot devices.

So 'LOCALBOOT n' does not really guarantee that anything local will be booted. It merely says: I'm exiting and will let the machine firmware do its thing. This is in contrast with the alternative method of 'COM32 chain.c32, APPEND hd0', which actually makes pxelinux read the boot sector from disk and execute it.

Now as to why 'LOCALBOOT n' would stop working, the most obvious explanation would be that the BIOS boot sequence is not correct. If I remove the hard drive from my boot sequence, 'LOCALBOOT n' ultimately leads to 'No bootable devices.' printed by SeaBIOS.

It could also be a bug in the virt stack somewhere.
We recently saw a bug in instruction emulation in KVM which reproduced only with iPXE (it exercises a lot of edge cases due to its use of real mode, long mode, protected mode and everything in between). Please provide at least a QEMU command line. Thanks!

Ladi,

although Foreman does support iPXE configurations and direct HTTP network boots, what most of our users do is iPXE chainbooting into PXELinux, and that is the case here.

I reproduced this with PXELinux from F25 (syslinux-tftpboot-6.04-0.1.fc25.noarch) and from CentOS 7.4, using CentOS 7.4 and Debian 8 clients. It won't boot.

When testing with libvirt, make sure you only check the PXE device for booting and deselect other devices, because the fallback mechanism in the BIOS would boot the disk after the failure.

(In reply to Lukas Zapletal from comment #27)
> Ladi,
>
> although Foreman does support iPXE configurations and direct HTTP network
> boots, what most of our users do is iPXE chainbooting into PXELinux, and that
> is the case here.

Got it. I think that we mean the same thing. Note however that even if iPXE chainboots into PXELinux, it is still "on the stack" and will continue running its code after PXELinux exits.

> I reproduced this with PXELinux from F25
> (syslinux-tftpboot-6.04-0.1.fc25.noarch) and from CentOS 7.4, using CentOS 7.4
> and Debian 8 clients. It won't boot.
>
> When testing with libvirt, make sure you only check the PXE device for booting
> and deselect other devices, because the fallback mechanism in the BIOS would
> boot the disk after the failure.

Wait, if you deselect other devices, it sure won't boot into them using the 'LOCALBOOT n' command. That's the gist of comment 26: 'LOCALBOOT n' *depends* on BIOS boot fallback.

Closing as CANTFIX. Per comment 26, 'LOCALBOOT n' doesn't work if the BIOS boot sequence is not set up correctly.
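Since the conclusion above hinges on the BIOS boot sequence, one way to check it is to look at the boot devices in the `virsh dumpxml vm_name` output requested earlier. The following is a rough Python sketch, not part of oVirt or libvirt: the helper names `boot_devices` and `localboot_can_fall_through` are hypothetical, and it assumes the domain XML uses either `<boot dev='...'/>` elements under `<os>` or per-device `<boot order='N'/>` elements, which are the two forms libvirt documents.

```python
import xml.etree.ElementTree as ET
from typing import List


def boot_devices(domain_xml: str) -> List[str]:
    """Return the boot devices of a libvirt domain in boot order."""
    root = ET.fromstring(domain_xml)

    # Form 1: <os><boot dev='network'/><boot dev='hd'/></os>
    os_level = [b.get("dev") for b in root.findall("./os/boot")]
    if os_level:
        return os_level

    # Form 2: <boot order='N'/> inside individual devices.
    ordered = []
    for dev in root.findall("./devices/*"):
        boot = dev.find("boot")
        if boot is not None and boot.get("order"):
            kind = "network" if dev.tag == "interface" else dev.get("device", dev.tag)
            ordered.append((int(boot.get("order")), kind))
    return [kind for _, kind in sorted(ordered)]


def localboot_can_fall_through(devices: List[str]) -> bool:
    """True if a hard disk is listed after the network entry."""
    if "network" not in devices:
        return False
    later = devices[devices.index("network") + 1:]
    return any(d in ("hd", "disk") for d in later)


if __name__ == "__main__":
    # Made-up domain XML fragment for illustration only.
    xml = "<domain><os><boot dev='network'/><boot dev='hd'/></os></domain>"
    print(boot_devices(xml), localboot_can_fall_through(xml and boot_devices(xml)))
```

If the second function returns False for a VM, 'LOCALBOOT n' has nothing to fall back to, which matches the explanation in comment 26.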
Part of the confusion may come from the fact that 'LOCALBOOT n' has two different implementations in the SYSLINUX code base, one used in SYSLINUX:

http://repo.or.cz/syslinux.git/blob/HEAD:/core/localboot.c

and the other one in PXELinux:

http://repo.or.cz/syslinux.git/blob/HEAD:/core/pxeboot.c

The former reads the disk boot sector; the latter just returns.

Thanks for the help. For the record:

http://www.syslinux.org/wiki/index.php?title=SYSLINUX#LOCALBOOT_type
http://www.syslinux.org/wiki/index.php?title=Hardware_Compatibility#LOCALBOOT

Looks like LOCALBOOT has problems on some hardware. We might want to default to chain.c32 in Satellite.
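To make the difference between the two menu entries concrete, here is a toy Python model of the behavior described in comment 26. It is a deliberate simplification and not how SeaBIOS, iPXE or PXELinux are implemented: the function name `firmware_boot` and the device labels are hypothetical, and the model ignores the iPXE script and multiple network interfaces.

```python
from typing import List


def firmware_boot(boot_order: List[str], pxe_entry: str) -> str:
    """Walk the firmware boot order and report what ends up booting.

    pxe_entry is either 'LOCALBOOT 0' or 'chain.c32 hd0', i.e. the menu
    entry PXELinux runs once it has been loaded over the network.
    """
    for device in boot_order:
        if device == "network":
            if pxe_entry == "chain.c32 hd0":
                # PXELinux itself reads the disk boot sector and runs it,
                # regardless of what else is in the boot order.
                return "disk (chainloaded by pxelinux)"
            # LOCALBOOT n under PXELinux: just return to the caller, so the
            # firmware moves on to the next entry in its list.
            continue
        if device == "hd":
            return "disk (firmware fallback)"
    return "No bootable devices."


if __name__ == "__main__":
    for order in (["network", "hd"], ["network"]):
        for entry in ("LOCALBOOT 0", "chain.c32 hd0"):
            print(order, entry, "->", firmware_boot(order, entry))
```

With only the network device in the boot order, the LOCALBOOT entry has nothing to fall back to, while the chain.c32 entry still reaches the disk, which is consistent with the workaround reported in this bug.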
Created attachment 1272701 [details]
PXE Menu

Description of problem:
When I have built a host successfully using Foreman, it gets the PXE menu shown in the attachment. With the first option selected, this menu gets into a reboot loop:

    Chainload into bootloader on the first disk

When I manually switch to:

    Chainload into bootloader on the first disk - alternative

the host boots as normal.

Why does this suddenly happen on Debian/Ubuntu and not on Fedora/CentOS?

Version-Release number of selected component (if applicable):
oVirt Engine Version: 4.1.1.8-1.el7.centos
Foreman Version 1.14.2
vdsm-4.18.13-1.el7.centos

How reproducible:
Provision a VM from Foreman

Steps to Reproduce:
1. Create VM on Foreman
2. Build it
3. Check the console for booting after a successful build

Actual results:
VM keeps rebooting after the PXE menu finishes counting down to 0

Expected results:
Booting VM

Additional info: