Bug 1888720 - [q35-virtio] VM hangs with 'No Bootable Device' message when using user-defined PCI addresses for some of the block devices
Summary: [q35-virtio] VM hangs with 'No Bootable Device' message when using user-defined PCI addresses for some of the block devices
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: seabios
Version: 8.2
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 8.3
Assignee: Gerd Hoffmann
QA Contact: leidwang@redhat.com
URL:
Whiteboard:
Depends On: 1809772
Blocks:
 
Reported: 2020-10-15 15:07 UTC by Igor Bezukh
Modified: 2021-01-11 03:24 UTC
CC List: 18 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-09 06:50:36 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
Domain XML of the VM that doesn't boot (7.04 KB, text/plain)
2020-10-15 15:07 UTC, Igor Bezukh
QEMU log (10.17 KB, text/plain)
2020-10-15 15:09 UTC, Igor Bezukh
virsh dumpxml output of a running VM (12.12 KB, text/plain)
2020-10-15 16:22 UTC, Igor Bezukh
virsh dump of a setup where the problematic VM configuration does work (11.88 KB, text/plain)
2020-10-19 08:17 UTC, Igor Bezukh
QEMU log of the working case (10.04 KB, text/plain)
2020-10-19 08:17 UTC, Igor Bezukh
logs of the seaBIOS with the problematic pci configuration (18.86 KB, text/plain)
2020-10-26 15:50 UTC, Igor Bezukh
The xml file (4.01 KB, text/plain)
2020-11-03 01:18 UTC, leidwang@redhat.com

Description Igor Bezukh 2020-10-15 15:07:58 UTC
Created attachment 1721859 [details]
Domain XML of the VM that doesn't boot

Description of problem:

While launching a VM we hit the "No bootable device" message at the BIOS stage.
PCI addresses for the virtio block devices are manually set, but not for all of
the disks: for "vda" and "vdb" no PCI address is defined.
The issue does not reproduce when we manually define PCI addresses for all the
block devices.
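
For illustration, the pattern in the domain XML is roughly the following (a
sketch only; the bus/slot values below are placeholders and are not copied
from the attached XML):

    <disk type='file' device='disk'>
      <target dev='vda' bus='virtio'/>
      <!-- no <address> element, so libvirt assigns a PCI address itself -->
    </disk>
    <disk type='file' device='disk'>
      <target dev='vdc' bus='virtio'/>
      <!-- user-defined PCI address -->
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </disk>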


Version-Release number of selected component (if applicable):

libvirt version: 6.0.0, package: 25.2.module+el8.2.1+7722+a9e38cf3 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2020-08-20-23:19:32, ), qemu version: 4.2.0qemu-kvm-4.2.0-29.module+el8.2.1+7712+3c3fe332.2, kernel: 4.18.0-193.24.1.el8_2.dt1.x86_64


How reproducible:
100%


Steps to Reproduce:
1. Create a VM from the attached domain XML.
2. Observe the console and the graphics.


Actual results:
"No bootable device" on the graphics output at the BIOS stage.

Expected results:
The VM manages to boot. Libvirt should take the user-defined PCI addresses into account and not reuse them for other devices.


Additional info:

Comment 1 Igor Bezukh 2020-10-15 15:09:05 UTC
Created attachment 1721860 [details]
QEMU log

Comment 2 Jiri Denemark 2020-10-15 15:37:30 UTC
I don't see any reused PCI address in the QEMU command line. Would you mind
providing the domain XML after starting it? That is, while it is hanging and
showing "No bootable device", just run "virsh dumpxml $VM".

Comment 3 Igor Bezukh 2020-10-15 16:22:49 UTC
Created attachment 1721876 [details]
virsh dumpxml output of a running VM

Comment 4 Jiri Denemark 2020-10-15 21:03:14 UTC
Thanks, the XML shows 27 distinct PCI addresses, that is, none of them is
reused. The problem is that the first two disks (vda, vdb) which did not have
any PCI address assigned got their addresses automatically, and since they were
attached on bus 0x03 and 0x04 while the other disks are on bus 0x00, the disks
are likely enumerated in a completely different order. If you want to increase
the chance of the disks being enumerated in the order you want them to be, you
have two options:

    - explicitly set PCI addresses for all disks
    - leave all disk PCI address assignment on libvirt and don't set the
      address explicitly on any disk

There's really no way libvirt could infer what you're trying to achieve with
your PCI addresses to assign the missing ones in the same way.

BTW, no address assignment can guarantee the disks will be seen in the same
order in the guest OS anyway, as it can enumerate them in an unexpected way.

Also, I would strongly suggest adding <boot order='1'/> to the disk you want
to boot from. It seems the correct one is selected anyway (as long as you
wanted to boot from vda), but being explicit is always better.
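
For illustration (a sketch only; the address values below are placeholders,
not taken from your XML), a disk pinned both to a PCI address and to the
first boot position would look like:

    <disk type='file' device='disk'>
      ...
      <target dev='vda' bus='virtio'/>
      <boot order='1'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </disk>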

Can you confirm vda is the correct bootable disk?

And does the issue reproduce when you do not manually assign a PCI address for
any disk?

Comment 5 Igor Bezukh 2020-10-16 06:22:39 UTC
Hi Jiri,

Thanks a lot for the attention!

To your question, the issue doesn't reproduce when I leave the PCI addresses to be assigned automatically.
The strange part is that the issue doesn't reproduce on other builds of QEMU+libvirt. For example, it works with the following package versions:

libvirt version: 6.0.0, package: 16.fc31 (Unknown, 2020-04-07-15:55:55, ), qemu version: 4.2.0qemu-kvm-4.2.0-27.fc31, kernel: 3.10.0-1062.9.1.el7.x86_64

IIUC from your response, even if I manually assign addresses for all the disks, the order may still change on every boot. Is that correct?

Can you please elaborate on why the order cannot be guaranteed even when manually assigning the addresses?

TIA
Igor

Comment 6 Laine Stump 2020-10-17 04:17:36 UTC
(I had typed everything below the "..." two days ago, but it was obscured behind other tabs before I hit <Save>, so it didn't get read. Much of it is in agreement with what Jirka says, in particular: use <boot order='1'/> on the device you want to boot from, and *don't* use <boot dev='hd'/> in the <os> element.)

But first, to answer your most recent question: No, the order will *not* change on every boot. What Jiri is saying is that we may *assume* that the disks will be ordered exactly according to their PCI addresses, but in the end it is up to the guest OS to decide how to number the disks. It will be consistent from one boot to the next, but there's really no way to directly control it from the level of libvirt or qemu.

(Why is it that you want to manually assign the PCI address of some disks, but not others, anyway? What problem are you trying to solve?)

Now, to the fuller explanation I had previously typed:

...

(I wrote this first paragraph assuming that you were trying to designate the "first" disk by naming it "vda" - possibly that isn't what you were doing, but I just want to make sure you know that doesn't work).

Although the target device name is included in <disk> elements (and is actually *mandatory* for purely historical reasons), it is not possible to force any particular name for a disk device in the guest by setting it; as a matter of fact, the <target dev='blah'> name isn't even passed to qemu, because qemu has no way to accept it. Instead, the disk devices are named by the guest OS itself, usually in the order that they are encountered while probing for disk devices. Thus, just by saying <target dev='vda'/> in the config, you don't actually make that disk "vda"; the only thing you accomplish is to keep libvirt from complaining that you haven't given a name to the device.

So, in your case, since you've set up several empty disks on the root bus, and since the guest kernel will detect those disks before detecting the disks that actually contain the OS (which are assigned to pcie-root-ports, which apparently are probed later), the first of the empty disks will end up being "vda", and the BIOS will attempt to boot from that empty disk, not find any OS, and just stop at that first "failure". (According to the BIOS, it's *not* a failure, since it found executable code in the MBR and executed it. The fact that this code merely looked for an OS, printed "No Bootable Disk", and halted (rather than returning to the BIOS so it could try the next disk) is not in the realm of libvirt or qemu to fix.)

It *is* possible to specify the order in which disks are checked for booting, but your config hasn't done that; it simply specifies

    <boot dev='hd'/>

which just means "try booting from one of the hard disks first" without specifying a particular order. If you want to be more specific, you'll need to use the <boot order='n'/> element in the specific devices *instead of* <boot dev='blah'/> in the <os> element. So, in your example config, you would have:


    ...
    <os>
      <type arch='x86_64' machine='pc-q35-rhel8.2.0'>hvm</type>
      <bootmenu enable='yes'/>  <== I think this is optional, but it's in all my examples
      <smbios mode='sysinfo'/>
    </os>

    ...
    <devices>
      <disk>
         <!-- this is the disk you want to boot from -->
         <boot order='1'/>
         ...
      </disk>


You can specify <boot order='2'/> etc. on other disks, but keep in mind that if there is any sort of valid code in the MBR of a disk, the BIOS will latch onto that, execute it, and most likely be stuck there (because most MBR code, when it doesn't find an OS, just prints out "Can't find a bootable OS" and halts, rather than passing control back to the BIOS, which could then move on to the next disk). It's really better to only put <boot order='n'/> in the config of disks that you may actually want to boot.

If you don't want to / can't mess with removing the <boot dev='hd'/>, you can just manually assign PCI addresses to the boot disk and the other unaddressed disk, as well as to all the extra disks.

Alternately, you could try using virtio-scsi disks instead of virtio-blk. In that case, all the disks would be connected to a single virtio-scsi controller that itself sits at a single PCI address. It would be easier to say whether or not this would work well for you if you could say why it is that you want some disks to be at manually assigned PCI addresses and some not, though.

<disk type='file' device='disk'>
  ...
  <target dev='sda' bus='scsi'/>
  ...
</disk>
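
The disks would then hang off a single controller element, roughly like the one below (a sketch; libvirt adds such a controller automatically if it is missing, and the PCI address shown is just a placeholder):

<controller type='scsi' index='0' model='virtio-scsi'>
  <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
</controller>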

Comment 7 Laine Stump 2020-10-17 17:10:56 UTC
Also, you say in Comment 5 that you don't see this problem with libvirt-6.0.0 (+ different older versions of other packages). Can you provide the XML of a running guest that successfully booted when using those package versions (as well as the final qemu commandline from /var/log/libvirt/qemu/$guestname.log), so that we can compare with the XML and commandline of the non-working combination?

(AND - I just noticed from the log you've already provided that the versions you're testing are using -blockdev instead of -drive (or whatever it was that was used previously - that's not my area so I never paid attention). It's possible that a change in behavior could be the result of some part of that (either in qemu or in libvirt), and getting the XML + qemu log for the working setup would be the first step to determining that.)

Comment 8 Laine Stump 2020-10-17 17:24:38 UTC
(BTW, the previous comment should not be interpreted as saying that any such behavior change should be considered a bug; my opinion is rather that if it worked the way you're describing in the past, that this was purely coincidental, and not something that should be relied on).

Comment 9 Igor Bezukh 2020-10-19 08:17:17 UTC
Created attachment 1722572 [details]
virsh dump of a setup where the problematic VM configuration does work

Comment 10 Igor Bezukh 2020-10-19 08:17:59 UTC
Created attachment 1722573 [details]
QEMU log of the working case

Comment 11 Igor Bezukh 2020-10-19 12:27:27 UTC
Hi @laine 

Thanks a lot for the detailed information. 
Here is some background: I was asked to assist our CNV QE team with identifying issues in automated tests. In KubeVirt the functional test suite runs upstream against Kubernetes clusters, while downstream it runs against an OpenShift Container Platform cluster. I noticed a test that is constantly failing downstream but not upstream. There is a difference between the upstream and downstream cluster environments: the host OSs are different, and the package versions inside the hosts and the containers are also different.

I am not the author of the automated PCI test, but I will patch the test soon, to make it manually assign addresses to all the block devices.

I've attached the logs of the working case. I've noticed that in the working case <boot dev='hd'/> is used in the <os> element. I don't know where it comes from.
Downstream I don't see it being added.

Is it possible that <boot dev='hd'/> has been added automatically by libvirt, or in some other indirect way?

TIA
Igor

Comment 12 Laine Stump 2020-10-20 01:53:19 UTC
Hmm. This is getting more confused :-/

1) The libvirt XML of the two running systems is identical aside from UUIDs, CPU flags, and the names of the boot images (which are auto-generated)

2) The qemu logs are also identical except for UUIDs, CPU flags, and smbios options

3) both the working and non-working tests are running libvirt-6.0.0 (although slightly different builds)

(in other words, both are using -blockdev for disks, so my idea that it might be the cause was incorrect.)

Perhaps this is a difference in seabios versions. Can you compare the versions of those packages?

Gerd, do you know of any change in seabios that would cause either the selection order of disks, or the behavior when encountering empty disks, to change? Is there a difference in the qemu commandline -smbios options that could make a difference? Any suggestions what to look for?

(a short summary to hopefully avoid making you read everything - when the boot disk is assigned to a pcie-root-port, and several "empty" disks are manually assigned to slots on the root bus (with no explicit boot order options set for devices), some combination of package versions yields a bootable guest, while other combinations of package versions leads to "No Bootable Device". I had thought that bootability in this case could *never* be relied on - I'm actually more confused by the system that *does* boot than by the one that doesn't :-))

Comment 13 Gerd Hoffmann 2020-10-20 05:45:33 UTC
> Gerd, do you know of any change in seabios that would cause either the
> selection order of disks, or the behavior when encountering empty disks, to
> change? Is there a difference in the qemu commandline -smbios options that
> could make a difference? Any suggestions what to look for?

seabios 1.14 got some optimizations for block device initialization.
In case a 'HALT' entry is present in the boot order file ("-boot strict=on"),
seabios will not initialize block devices which didn't get a boot index
assigned.

In theory this should not change behavior, seabios should not have considered
those devices for boot anyway.  In practice it might make a difference.

Maybe because there is some corner case where the new 1.14 seabios skips
a device it should not have skipped.

Maybe because seabios can't initialize all devices due to -ENOMEM, so 1.14,
by picking bootable devices only, initializes just the device needed and makes
the system boot.

> (a short summary to hopefully avoid making you read everything - when the
> boot disk is assigned to a pcie-root-port, and several "empty" disks are
> manually assigned to slots on the root bus (with no explicit boot order
> options set for devices), some combination of package versions yields a
> bootable guest, while other combinations of package versions leads to "No
> Bootable Device". I had thought that bootability in this case could *never*
> be relied on - I'm actually more confused by the system that *does* boot
> than by the one that doesn't :-))

Indeed.  I'd strongly recommend tagging the disk you want to boot from
with "<boot order='1'/>" and being done with it.  Otherwise you'll have
undefined behavior.

I think in case of unspecified boot order seabios uses pci scan order,
so with pci addresses being the same seabios behaviour should be the
same too.

Also having lots of disks can cause problems due to seabios running out
of memory.  The number of working disks may vary depending on seabios
version and configuration.

seabios logs should help clarify what is going on here.
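
For reference, a rough sketch of how this maps to the qemu command line (the
drive id below is a placeholder, and the exact options depend on the
libvirt/qemu versions): the per-device <boot order='1'/> becomes a bootindex
property, and strict boot ordering adds the 'HALT' entry mentioned above:

    -device virtio-blk-pci,drive=drive-vda,bootindex=1 \
    -boot strict=on

With "-boot strict=on" seabios stops at the HALT entry at the end of the
bootorder list instead of falling back to devices without a boot index.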

Comment 14 Igor Bezukh 2020-10-20 10:44:56 UTC
(In reply to Gerd Hoffmann from comment #13)
> > Gerd, do you know of any change in seabios that would cause either the
> > selection order of disks, or the behavior when encountering empty disks, to
> > change? Is there a difference in the qemu commandline -smbios options that
> > could make a difference? Any suggestions what to look for?
> 
> seabios 1.14 got some optimizations for block device initialization.
> In case a 'HALT' entry is present in the boot order file ("-boot strict=on")
> seabios will not initialize block devices which didn't got a boot index
> assigned.
> 
> In theory this should not change behavior, seabios should not have considered
> those devices for boot anyway.  In practice it might make a difference.
> 
> Maybe because there is some corner case where the new 1.14 seabios skips
> a device it should not have skipped.
> 
> Maybe because seabios can't initialize all devices due to -ENOMEM, so 1.14
> picking bootable devices only initializes the device needed and makes the
> system boot.
> 
> > (a short summary to hopefully avoid making you read everything - when the
> > boot disk is assigned to a pcie-root-port, and several "empty" disks are
> > manually assigned to slots on the root bus (with no explicit boot order
> > options set for devices), some combination of package versions yields a
> > bootable guest, while other combinations of package versions leads to "No
> > Bootable Device". I had thought that bootability in this case could *never*
> > be relied on - I'm actually more confused by the system that *does* boot
> > than by the one that doesn't :-))
> 
> Indeed.  I'd strongly recommend to tag the disk you want boot from
> with "<boot order='1'/>" and be done with it.  Otherwise you'll have
> undefined behavior.
> 
> I think in case of unspecified boot order seabios uses pci scan order,
> so with pci addresses being the same seabios behaviour should be the
> same too.
> 
> Also having lots of disks can cause problems due to seabios running out
> of memory.  The number of working disks may vary depending on seabios
> version and configuration.
> 
> seabios logs should help clarify what is going on here.

@gerd hoff

Hi Gerd, can you please tell me how I can enable the seabios logs?

TIA
Igor

Comment 15 Gerd Hoffmann 2020-10-20 11:48:44 UTC
> Hi Gerd, can you please tell how can I enable the seabios logs?

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>rhel7bios-el7-firmwarelog</name>
  [ ... ]
  </devices>
  <qemu:commandline>
    <qemu:arg value='-chardev'/>
    <qemu:arg value='file,id=firmwarelog,path=/tmp/qemu-firmware.log'/>
    <qemu:arg value='-device'/>
    <qemu:arg value='isa-debugcon,iobase=0x402,chardev=firmwarelog'/>
  </qemu:commandline>
</domain>
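
For reference, the same two arguments on a raw qemu command line would be:

    -chardev file,id=firmwarelog,path=/tmp/qemu-firmware.log \
    -device isa-debugcon,iobase=0x402,chardev=firmwarelog

The seabios log then ends up in /tmp/qemu-firmware.log on the host.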

Comment 16 Igor Bezukh 2020-10-26 15:50:22 UTC
Created attachment 1724234 [details]
logs of the seaBIOS with the problematic pci configuration

Comment 17 Igor Bezukh 2020-10-26 15:54:08 UTC
Hi Gerd,

Can you please take a look at the attached seaBIOS log?

It was taken during the reproduction of the issue.

TIA
Igor

Comment 18 Gerd Hoffmann 2020-10-27 07:28:09 UTC
> Maybe because seabios can't initialize all devices due to -ENOMEM, so 1.14
> picking bootable devices only initializes the device needed and makes the
> system boot.

Seems it is this one.  The failing seabios is version 1.13, and the log
shows seabios runs out of memory after initializing 10 virtio-blk devices:

   [ ... ]
   WARNING - Unable to allocate resource at vp_find_vq:301!
   fail to find vq for virtio-blk 00:0c.0
   [ ... ]

Comment 19 Igor Bezukh 2020-10-27 08:38:24 UTC
Thank you @kraxel for the information.

In which release version will the 1.14 seaBIOS be included?

Comment 20 Jaroslav Suchanek 2020-10-27 09:17:59 UTC
(In reply to Igor Bezukh from comment #19)
> Thank you @kraxel for the information.
> 
> In which release version will the 1.14 seaBIOS be included?

In the upcoming rhel-av-8.3.0.

Moving to the seabios component for further tracking, although this is probably already fixed there.

Comment 21 Gerd Hoffmann 2020-10-27 10:11:04 UTC
(In reply to Jaroslav Suchanek from comment #20)
> (In reply to Igor Bezukh from comment #19)
> > Thank you @kraxel for the information.
> > 
> > In which release version will the 1.14 seaBIOS be included?
> 
> In the upcoming rhel-av-8.3.0.
> 
> Moving the seabios component for further tracking, although this is probably
> already fixed there.

Yes, should be fixed already.
Testing with the 1.14 seabios package (virt:8.3 module) doesn't hurt though ;)

Comment 22 Igor Bezukh 2020-11-02 08:47:42 UTC
OK, so let's wait for QA to validate that the bug is fixed in 8.3.

Comment 24 leidwang@redhat.com 2020-11-03 01:16:25 UTC
Tested this bz on RHEL 8.3 and it works well. The guest can boot successfully.

Host env:

seabios-1.14.0-1.module+el8.3.0+7638+07cf13d2.x86_64
kernel-4.18.0-240.el8.x86_64
qemu-kvm-5.1.0-13.module+el8.3.0+8382+afc3bbea.x86_64
libvirt-client-6.6.0-6.module+el8.3.0+8125+aefcf088.x86_64

XML file uploaded as an attachment (comment 25). Thanks!

Comment 25 leidwang@redhat.com 2020-11-03 01:18:35 UTC
Created attachment 1726077 [details]
The xml file

Comment 27 Gerd Hoffmann 2020-11-09 06:50:36 UTC
I'd say close/currentrelease, with 8.3 being released now.
This also depends on the seabios rebase, bug 1809772.

