Created attachment 1721859 [details]
Domain XML of the VM that doesn't boot

Description of problem:
While launching a VM, we hit "No bootable device" at the BIOS stage. PCI addresses for the virtio block devices are manually set, but not for all of the disks. For "vda" and "vdb" there is no PCI address defined. The issue doesn't reproduce when we manually define PCI addresses for all of the block devices.

Version-Release number of selected component (if applicable):
libvirt version: 6.0.0, package: 25.2.module+el8.2.1+7722+a9e38cf3 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2020-08-20-23:19:32, ), qemu version: 4.2.0qemu-kvm-4.2.0-29.module+el8.2.1+7712+3c3fe332.2, kernel: 4.18.0-193.24.1.el8_2.dt1.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a VM from the attached domain XML.
2. Observe the console and the graphics output.

Actual results:
"No bootable device" on the graphics output at the BIOS stage.

Expected results:
The VM manages to boot. Libvirt should honor user-defined PCI addresses and not reuse them for other devices.

Additional info:
Created attachment 1721860 [details] QEMU log
I don't see any reused PCI address in the QEMU command line. Would you mind providing the domain XML after starting it? That is, while it is hanging and showing "No bootable device", just run "virsh dumpxml $VM".
Created attachment 1721876 [details] virsh dumpxml output of a running VM
Thanks, the XML shows 27 distinct PCI addresses, that is, none of them is reused. The problem is that the first two disks (vda, vdb), which did not have any PCI address assigned, got addresses automatically, and since they were attached on buses 0x03 and 0x04 while the other disks are on bus 0x00, the disks are likely enumerated in a completely different order.

If you want to increase the chance of the disks being enumerated in the order you want them to be, you have two options:

- explicitly set PCI addresses for all disks
- leave all disk PCI address assignment to libvirt and don't set the address explicitly on any disk

There's really no way libvirt could infer what you're trying to achieve with your PCI addresses and assign the missing ones in the same way. BTW, no address assignment can guarantee the disks will be seen in the same order in the guest OS anyway, as it can enumerate them in an unexpected way.

Also, I would strongly suggest adding <boot order='1'/> to the disk you want to boot from. It seems the correct one is selected anyway (as long as you wanted to boot from vda), but being explicit is always better.

Can you confirm vda is the correct bootable disk? And does the issue reproduce when you do not manually assign a PCI address to any disk?
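For example, a disk stanza with both an explicit PCI address and an explicit boot order could look roughly like this (the image path, target name, and address values are purely illustrative, not taken from your XML; note that per-device <boot order='n'/> is used *instead of* <boot dev='hd'/> in the <os> element, not together with it):

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/example-os.qcow2'/>
  <target dev='vda' bus='virtio'/>
  <boot order='1'/>
  <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
</disk>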
Hi Jiri,

Thanks a lot for the attention!

To your question, the issue doesn't reproduce when I leave the PCI addresses to be assigned automatically.

The strange part is that the issue doesn't reproduce on other builds of QEMU+libvirt. For example, it works with the following package versions:

libvirt version: 6.0.0, package: 16.fc31 (Unknown, 2020-04-07-15:55:55, ), qemu version: 4.2.0qemu-kvm-4.2.0-27.fc31, kernel: 3.10.0-1062.9.1.el7.x86_64

IIUC from your response, even if I manually assign addresses for all the disks, the order may still change on every boot. Is that correct? Can you please elaborate on why the order cannot be guaranteed even when manually assigning the addresses?

TIA
Igor
(I had typed everything below the "..." two days ago, but it was obscured behind other tabs before I hit <Save>, so it didn't get read. Much of it is in agreement with what Jirka says - in particular, use <boot order='1'/> on the device you want to boot from, and *don't* use <boot dev='hd'/> in the <os> element.)

But first, to answer your most recent question: No, the order will *not* change on every boot. What Jiri is saying is that we may *assume* that the disks will be assigned their order exactly the same as their PCI addresses, but in the end it is up to the guest OS to decide how to number the disks. It will be consistent from one boot to the next, but there's really no way to directly control it from the level of libvirt or qemu. (Why is it that you want to manually assign the PCI addresses of some disks, but not others, anyway? What problem are you trying to solve?)

Now, to the fuller explanation I had previously typed:

...

(I wrote this first paragraph assuming that you were trying to designate the "first" disk by naming it "vda" - possibly that isn't what you were doing, but I just want to make sure you know that doesn't work.) Although the target device name is included in <disk> elements (and is actually *mandatory*, for purely historical reasons), it is not possible to force any particular name for a disk device in the guest by setting it; as a matter of fact, the <target dev='blah'> name isn't even passed to qemu, because qemu has no way to accept it. Instead, the disk devices are named by the guest OS itself, usually in the order that they are encountered while probing for disk devices. Thus, just by saying <target dev='vda'/> in the config, you don't actually make that disk "vda"; the only thing you accomplish is to stop libvirt from complaining that you haven't given a name to the device.

So, in your case, since you've set up several empty disks on the root bus, and since the guest kernel will detect those disks before detecting the disks that actually contain the OS (which are assigned to pcie-root-ports, which apparently are probed later), the first of the empty disks will end up being "vda", and the BIOS will attempt to boot from that empty disk, not find any OS, and just stop at that first "failure". (According to the BIOS, it's *not* a failure, since it found executable code in the MBR and executed it. The fact that this code merely looked for an OS, printed "No Bootable Disk", and then halted (rather than returning to the BIOS so it could try the next disk) is not in the realm of libvirt or qemu to fix.)

It *is* possible to specify the order in which disks are checked for booting, but your config hasn't done that; it simply specifies <boot dev='hd'/>, which just means "try booting from one of the hard disks first" without specifying a particular order. If you want to be more specific, you'll need to use the <boot order='n'/> element in the specific devices *instead of* <boot dev='blah'/> in the <os> element. So, in your example config, you would have:

...
<os>
  <type arch='x86_64' machine='pc-q35-rhel8.2.0'>hvm</type>
  <bootmenu enable='yes'/>   <== I think this is optional, but it's in all my examples
  <smbios mode='sysinfo'/>
</os>
...
<devices>
  <disk>   <!-- this is the disk you want to boot from -->
    <boot order='1'/>
    ...
  </disk>

You can specify <boot order='2'/> etc. on other disks, but keep in mind that if there is any sort of valid code in the MBR of a disk, the BIOS will latch onto that, execute it, and most likely be stuck there (because most MBR code, when it doesn't find an OS, just prints out "Can't find a bootable OS" and halts, rather than passing control back to the BIOS, which could then move on to the next disk). It's really better to only put <boot order='n'/> in the config of disks that you may actually want to boot from.

If you don't want to / can't mess with removing the <boot dev='hd'/>, you can just manually assign addresses to the boot disk and the other unaddressed disk, as well as to all the extra disks.

Alternately, you could try using virtio-scsi disks instead of virtio. In this case, all the disks would be connected to a single virtio-scsi controller, which is itself connected to a single PCI address. It would be easier to say whether or not this would work well for you if you could say why you want some disks to be at manually assigned PCI addresses and some not, though.

<disk type='file' device='disk'>
  ...
  <target dev='sda' bus='scsi'/>
  ...
</disk>
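To make the controller relationship explicit, a slightly fuller sketch of that layout (the controller index, image path, and drive address below are illustrative):

<controller type='scsi' index='0' model='virtio-scsi'/>
...
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/example-data.qcow2'/>
  <target dev='sda' bus='scsi'/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>

This way only the virtio-scsi controller itself occupies a PCI slot, and the individual disks hang off it as SCSI targets.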
Also, you say in Comment 5 that you don't see this problem with libvirt-6.0.0 (+ different, older versions of other packages). Can you provide the XML of a running guest that successfully booted when using those package versions (as well as the final qemu command line from /etc/libvirt/qemu/$guestname.log), so that we can compare with the XML and command line of the non-working combination?

(AND - I just noticed from the log you've already provided that the versions you're testing are using -blockdev instead of -disk (or whatever it was that was used previously - that's not my area so I never paid attention). It's possible that a change in behavior could be the result of some part of that (either in qemu or in libvirt), and getting the XML + qemu log for the working setup would be the first step to determining that.)
(BTW, the previous comment should not be interpreted as saying that any such behavior change should be considered a bug; my opinion is rather that if it worked the way you're describing in the past, it was purely coincidental and not something that should be relied on.)
Created attachment 1722572 [details] virsh dump of a setup where the problematic VM configuration does work
Created attachment 1722573 [details] QEMU log of the working case
Hi @laine,

Thanks a lot for the detailed information.

Here is some background - I was asked to assist our CNV QE team with identifying issues in automated tests. In KubeVirt, the functional test suite runs upstream against Kubernetes clusters, while downstream it runs against an OpenShift Container Platform cluster. I have noticed a test that constantly fails downstream but not upstream. There is a difference between the upstream and downstream cluster environments: the host OSes are different, and the package versions inside the hosts and the containers are also different.

I am not the author of the automated PCI test, but I will patch the test soon to make it manually assign addresses to all the block devices.

I've attached the logs of the working case. I've noticed that in the working case <boot dev='hd'/> is used in the <os> element. I don't know where it comes from; downstream I don't see it being added. Is it possible that <boot dev='hd'/> has been added automatically by libvirt, or in some indirect way?

TIA
Igor
Hmm. This is getting more confusing :-/

1) The libvirt XML of the two running systems is identical aside from UUIDs, CPU flags, and the names of the boot images (which are auto-generated)

2) The qemu logs are also identical except for UUIDs, CPU flags, and smbios options

3) Both the working and non-working tests are running libvirt-6.0.0 (although slightly different variations); in other words, both are using -blockdev for disks, so my idea that that might be the cause was incorrect.

Perhaps this is a difference in seabios versions. Can you compare the versions of those packages?

Gerd, do you know of any change in seabios that would cause either the selection order of disks, or the behavior when encountering empty disks, to change? Is there a difference in the qemu commandline -smbios options that could make a difference? Any suggestions what to look for?

(a short summary to hopefully avoid making you read everything - when the boot disk is assigned to a pcie-root-port, and several "empty" disks are manually assigned to slots on the root bus (with no explicit boot order options set for devices), some combination of package versions yields a bootable guest, while other combinations of package versions lead to "No Bootable Device". I had thought that bootability in this case could *never* be relied on - I'm actually more confused by the system that *does* boot than by the one that doesn't :-))
> Gerd, do you know of any change in seabios that would cause either the
> selection order of disks, or the behavior when encountering empty disks, to
> change? Is there a difference in the qemu commandline -smbios options that
> could make a difference? Any suggestions what to look for?

seabios 1.14 got some optimizations for block device initialization. In case a 'HALT' entry is present in the boot order file ("-boot strict=on"), seabios will not initialize block devices which didn't get a boot index assigned.

In theory this should not change behavior; seabios should not have considered those devices for boot anyway. In practice it might make a difference.

Maybe because there is some corner case where the new 1.14 seabios skips a device it should not have skipped.

Maybe because seabios can't initialize all devices due to -ENOMEM, so 1.14, picking bootable devices only, initializes just the device needed and makes the system boot.

> (a short summary to hopefully avoid making you read everything - when the
> boot disk is assigned to a pcie-root-port, and several "empty" disks are
> manually assigned to slots on the root bus (with no explicit boot order
> options set for devices), some combination of package versions yields a
> bootable guest, while other combinations of package versions lead to "No
> Bootable Device". I had thought that bootability in this case could *never*
> be relied on - I'm actually more confused by the system that *does* boot
> than by the one that doesn't :-))

Indeed. I'd strongly recommend tagging the disk you want to boot from with "<boot order='1'/>" and being done with it. Otherwise you'll have undefined behavior.

I think in case of an unspecified boot order seabios uses PCI scan order, so with the PCI addresses being the same, seabios behaviour should be the same too.

Also, having lots of disks can cause problems due to seabios running out of memory. The number of working disks may vary depending on seabios version and configuration.

seabios logs should help clarify what is going on here.
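For reference, a rough sketch of what libvirt's <boot order='1'/> turns into on the qemu command line (the device id, bus name, and drive/node name below are made up for illustration):

-boot strict=on \
-device virtio-blk-pci,bus=pci.3,addr=0x0,drive=libvirt-1-format,id=virtio-disk0,bootindex=1

With strict=on (the 'HALT' entry mentioned above), only devices that were given a bootindex are considered for booting, and with seabios 1.14 only those devices get initialized at all, so the "empty" disks no longer consume the limited BIOS memory.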
(In reply to Gerd Hoffmann from comment #13)
> seabios logs should help clarify what is going on here.

Hi Gerd, can you please tell me how I can enable the seabios logs?

TIA
Igor
> Hi Gerd, can you please tell me how I can enable the seabios logs?

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>rhel7bios-el7-firmwarelog</name>
  [ ... ]
  </devices>
  <qemu:commandline>
    <qemu:arg value='-chardev'/>
    <qemu:arg value='file,id=firmwarelog,path=/tmp/qemu-firmware.log'/>
    <qemu:arg value='-device'/>
    <qemu:arg value='isa-debugcon,iobase=0x402,chardev=firmwarelog'/>
  </qemu:commandline>
</domain>
Created attachment 1724234 [details] logs of the seaBIOS with the problematic pci configuration
Hi Gerd,

Can you please take a look at the attached seaBIOS log? It was taken during the reproduction of the issue.

TIA
Igor
> Maybe because seabios can't initialize all devices due to -ENOMEM, so 1.14,
> picking bootable devices only, initializes just the device needed and makes
> the system boot.

Seems it is this one. The failing seabios is version 1.13, and the log shows seabios runs out of memory after initializing 10 virtio-blk devices:

[ ... ]
WARNING - Unable to allocate resource at vp_find_vq:301!
fail to find vq for virtio-blk 00:0c.0
[ ... ]
Thank you @kraxel for the information. In which release version will the 1.14 seaBIOS be included?
(In reply to Igor Bezukh from comment #19)
> Thank you @kraxel for the information.
>
> In which release version will the 1.14 seaBIOS be included?

In the upcoming rhel-av-8.3.0.

Moving to the seabios component for further tracking, although this is probably already fixed there.
(In reply to Jaroslav Suchanek from comment #20)
> (In reply to Igor Bezukh from comment #19)
> > Thank you @kraxel for the information.
> >
> > In which release version will the 1.14 seaBIOS be included?
>
> In the upcoming rhel-av-8.3.0.
>
> Moving to the seabios component for further tracking, although this is
> probably already fixed there.

Yes, should be fixed already. Testing with the 1.14 seabios package (virt:8.3 module) doesn't hurt though ;)
OK, so let's wait for QA to validate that the bug is fixed in 8.3.
Tested this bz in rhel8.3; it works well. The guest can boot successfully.

Host env:
seabios-1.14.0-1.module+el8.3.0+7638+07cf13d2.x86_64
kernel-4.18.0-240.el8.x86_64
qemu-kvm-5.1.0-13.module+el8.3.0+8382+afc3bbea.x86_64
libvirt-client-6.6.0-6.module+el8.3.0+8125+aefcf088.x86_64

XML file uploaded as an attachment (comment 25). Thanks!
Created attachment 1726077 [details] The xml file
close/currentrelease, I'd say, with 8.3 being released now. It also depends on the seabios rebase, bug 1809772.