Failed to reproduce this issue with latest qemu-kvm-rhev build for RHEL7.5.z.
1. I see the component is "qemu-kvm", but we don't support SR-IOV(VF) on qemu-kvm officially, we support qemu-kvm-rhev.
2. I tested with the latest RHEL7.5.z qemu-kvm-rhev from the QEMU side and cannot reproduce it.
NIC: Intel XL710
I failed to build a rom file for the VF (building from git://git.ipxe.org/ipxe.git, which doesn't support the VF's device id 8086154c), so I tested with the PF and a rom file for it.
Test steps with VF:
1. Enable one VF, and bind it to vfio-pci.
2. Boot VM with the VF, but no romfile for it.
-boot menu=on \
-device vfio-pci,host=04:02.0,id=vf \
After guest boot, in boot menu, first is the hard disk, and no PXE for the VF. Guest can boot with hard disk by default normally.
Test steps with PF:
1. Bind PF to vfio-pci.
2. Boot VM with the PF and the given romfile:
-boot menu=on \
-device vfio-pci,host=04:00.1,id=pf,rombar=1,romfile="/root/ipxe/src/bin/80861583.rom" \
After guest boot up, in boot menu, first is the hard disk, then Legacy option rom, and at last it is the iPXE for the PF. Guest can boot with hard disk by default normally.
3. Full qemu command line:
-name 'avocado-vt-vm1' \
-sandbox off \
-machine pc \
-device VGA,bus=pci.0,addr=0x2 \
-device nec-usb-xhci,id=usb1,bus=pci.0,addr=0x3 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 \
-drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/rhel75-64-virtio-scsi.qcow2 \
-device scsi-hd,id=image1,drive=drive_image1 \
-m 7168 \
-smp 6,maxcpus=6,cores=3,threads=1,sockets=2 \
-cpu 'Haswell-noTSX',+kvm_pv_unhalt \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-vnc :0 \
-rtc base=utc,clock=host,driftfix=slew \
-monitor stdio \
-device vfio-pci,host=04:00.1,id=pf,rombar=1,romfile="/root/ipxe/src/bin/80861583.rom" \ ----> this is for pf, the one for vf already listed above.
-boot menu=on \
Created attachment 1497002 [details]
Screenshot of boot menu when booting with PF and an available romfile
Created attachment 1497003 [details]
Screenshot of boot menu when booting with VF and no available romfile
Switching to libvirt, since we should be trying to reproduce with libvirt first, not direct QEMU invocation, and it's possible that libvirt is not configuring QEMU correctly.
I found that both guest xml files show the machine type "pc-i440fx-rhel7.5.0".
And the "installed-rpms" shows the related package version as below:
I have tried to install libvirt and qemu with the same version, but the default machine type is "pc-i440fx-rhel7.0.0", the guest with 7.5.0 can not start.
# virsh start rh
error: Failed to start domain rh
error: internal error: process exited while connecting to monitor: qemu-kvm: -machine pc-i440fx-rhel7.5.0,accel=kvm,usb=off,dump-guest-core=off: Unsupported machine type
Use -machine help to list supported machines!
# rpm -q qemu-kvm
# /usr/libexec/qemu-kvm -machine help
Supported machines are:
none empty machine
pc RHEL 7.0.0 PC (i440FX + PIIX, 1996) (alias of pc-i440fx-rhel7.0.0)
pc-i440fx-rhel7.0.0 RHEL 7.0.0 PC (i440FX + PIIX, 1996) (default)
rhel6.6.0 RHEL 6.6.0 PC
rhel6.5.0 RHEL 6.5.0 PC
rhel6.4.0 RHEL 6.4.0 PC
rhel6.3.0 RHEL 6.3.0 PC
rhel6.2.0 RHEL 6.2.0 PC
rhel6.1.0 RHEL 6.1.0 PC
rhel6.0.0 RHEL 6.0.0 PC
I'm curious how it could happen? I think the customer should use "qemu-kvm-rhev" not qemu-kvm.
(In reply to firstname.lastname@example.org from comment #5)
> I found both the guest xml shows the machine type is "pc-i440fx-rhel7.5.0".
> And the "installed-rpms" shows the related package version as below:
Hmm... from the customer's sos-report:
$ grep kvm installed-rpms
libvirt-daemon-kvm-3.9.0-14.el7_5.5.x86_64 Fri May 18 20:13:00 2018
qemu-kvm-common-rhev-2.10.0-21.el7_5.3.x86_64 Fri May 18 20:09:53 2018
qemu-kvm-rhev-2.10.0-21.el7_5.3.x86_64 Fri May 18 20:12:51 2018
Where did you pull that info from? Sorry if I'm missing the obvious.
Sorry, my bad, the correct version in the sos report is:
In the guest xml, there are several hostdev devices, all of them PFs, and no rom file path is specified (check instance-000003a3.xml)
<hostdev mode='subsystem' type='pci' managed='yes'>
<address domain='0x0000' bus='0x86' slot='0x00' function='0x0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
According to my experience, pxe by PF/VF only works when rom file is specified in the xml like:
<hostdev mode='subsystem' type='pci' managed='yes'>
<address domain='0x0000' bus='0x86' slot='0x00' function='0x0'/>
** <rom file='/usr/share/ipxe/80861570.rom'/> **
<address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
I have tested with the xml and an x710-2 card, and still cannot reproduce.
Laine, is it possible to pxe boot from a PF without specifying the rom file?
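For anyone checking a guest definition for the situation described above, here is a minimal sketch that lists each PCI hostdev and whether a `<rom>` element is present. The sample XML is an assumption written for illustration (it includes the `<source>` wrapper that libvirt uses around the host address, and reuses the addresses and rom path quoted above); it is not the customer's instance-000003a3.xml. The helper name `hostdev_rom_report` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative domain fragment (assumed, not taken from the sos-report).
DOMAIN_XML = """
<domain type='kvm'>
  <devices>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x86' slot='0x00' function='0x0'/>
      </source>
      <rom file='/usr/share/ipxe/80861570.rom'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x86' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
  </devices>
</domain>
"""

def hostdev_rom_report(domain_xml):
    """Return (host address, rom file, rom bar) for every hostdev in the XML."""
    root = ET.fromstring(domain_xml)
    report = []
    for hostdev in root.iter("hostdev"):
        src = hostdev.find("source/address")
        rom = hostdev.find("rom")
        host = None
        if src is not None:
            host = "{}:{}.{}".format(
                src.get("bus"), src.get("slot"), src.get("function"))
        rom_file = rom.get("file") if rom is not None else None
        rom_bar = rom.get("bar") if rom is not None else None
        report.append((host, rom_file, rom_bar))
    return report

for entry in hostdev_rom_report(DOMAIN_XML):
    print(entry)
```

A hostdev reported with neither a rom file nor a rom bar setting relies on the QEMU default for the ROM BAR.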
Good morning, can we get an update about this bugzilla, please. It is quite urgent.
Thanks in advance.
First - there is a misleading statement somewhere back in here - the default setting for rom bar was changed not by libvirt, but by qemu, and it happened back in 2011 (first entered rhel in RHEL6.2, libvirt-0.9.4, qemu-0.12), so it's far beyond the date that this would be a new behavior caused by a change in default value.
Second - wow! looking into this has been a trip down memory lane and shows me how much of the last almost-10 years I've completely forgotten about! :-)
For example, take a look at Bug 888635
Since you are assigning a PF, I think you need to take into account that it may have its own ROM on the card, which would explain why seabios is detecting a bootable ROM.
According to Bug 888635, since 2013 libvirt has added "-boot strict=on" to all qemu commandlines, which is supposed to cause it to only attempt booting from devices that have an explicit boot order given (without that, by default SeaBIOS would *still* attempt to boot from that device if all devices that have a specified priority fail to boot.) As long as the generated qemu commandlines contain -boot strict=on then libvirt is doing all that it can.
However, even if boot strict=on wasn't in the commandline, this shouldn't be a problem as long as there is a higher priority device that *does* boot (even if the only operation once booted is to fail, or simply to reboot, thus jumping back to the top of the list of devices to attempt booting. I have a faint recollection of creating a tiny disk image in the past for exactly this purpose.)
In the screenshot of the boot menu, I see iPXE as the 3rd choice, implying that the SCSI disk will be selected for booting before iPXE is attempted. Is that not happening?
Can you provide the libvirt XML of the guest while it's running, and also the qemu commandline generated from it (tail /var/log/libvirt/$guestname.log)? I'd like to see if -boot strict=on is present.
The qemu logs are empty:
[root@server01 qemu]# ls -arlth instance-000003a1.log*
-rw-------. 1 root root 0 Oct 21 03:14 instance-000003a1.log-20181021
-rw-------. 1 root root 0 Oct 28 03:34 instance-000003a1.log-20181028
-rw-------. 1 root root 0 Nov 4 03:11 instance-000003a1.log-20181104
-rw-------. 1 root root 0 Nov 11 03:46 instance-000003a1.log-20181111
-rw-------. 1 root root 0 Nov 11 03:46 instance-000003a1.log
In the running command line (from ps -ef) you can see that "-boot strict=on" is present.
qemu 8271 1 99 Sep24 ? 735-00:18:09 /usr/libexec/qemu-kvm -name guest=instance-000003a1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-3-instance-000003a1/master-key.aes -machine pc-i440fx-rhel7.5.0,accel=kvm,usb=off,dump-guest-core=off -cpu Skylake-Server-IBRS,ss=on,hypervisor=on,tsc_adjust=on,clflushopt=on,pku=on,stibp=on -m 32768 -realtime mlock=off -smp 16,sockets=16,cores=1,threads=1 -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/3-instance-000003a1,share=yes,size=34359738368,host-nodes=0,policy=bind -numa node,nodeid=0,cpus=0-15,memdev=ram-node0 -uuid b8f80e76-2b6c-4042-8539-89f53dbc311d -smbios type=1,manufacturer=Red Hat,product=OpenStack Compute,version=14.1.0-22.el7ost,serial=5a9751e3-43c7-4e09-b579-c965068ab7be,uuid=b8f80e76-2b6c-4042-8539-89f53dbc311d,family=Virtual Machine -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-3-instance-000003a1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/b8f80e76-2b6c-4042-8539-89f53dbc311d/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/var/lib/nova/instances/b8f80e76-2b6c-4042-8539-89f53dbc311d/disk.config,format=raw,if=none,id=drive-ide0-0-0,readonly=on,cache=none -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,fd=29,id=hostnet0,vhost=on,vhostfd=31 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:3e:76:9c,bus=pci.0,addr=0x3 -add-fd set=2,fd=33 -chardev file,id=charserial0,path=/dev/fdset/2,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device 
isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 172.16.52.53:0 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device vfio-pci,host=3b:00.0,id=hostdev0,bus=pci.0,addr=0x5,rombar=0 -device vfio-pci,host=3b:00.1,id=hostdev1,bus=pci.0,addr=0x6,rombar=0 -device vfio-pci,host=3b:00.2,id=hostdev2,bus=pci.0,addr=0x7,rombar=0 -device vfio-pci,host=3d:00.0,id=hostdev3,bus=pci.0,addr=0x8,rombar=0 -device vfio-pci,host=3d:00.1,id=hostdev4,bus=pci.0,addr=0x9,rombar=0 -device vfio-pci,host=3d:00.2,id=hostdev5,bus=pci.0,addr=0xa,rombar=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0xb -msg timestamp=on
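The two things being looked for in that ps output (whether "-boot strict=on" is present, and what rombar is set to on each vfio-pci device) can be checked mechanically. The following is a rough sketch only; the function names are hypothetical and the sample string is a shortened excerpt of the command line above, not the full thing.

```python
import shlex

def boot_strict_enabled(cmdline):
    """Return True if any -boot option in the command line sets strict=on."""
    args = shlex.split(cmdline)
    for flag, value in zip(args, args[1:]):
        if flag == "-boot" and "strict=on" in value.split(","):
            return True
    return False

def vfio_rombar_settings(cmdline):
    """Map each vfio-pci host address to its rombar value (None if unset)."""
    args = shlex.split(cmdline)
    result = {}
    for flag, value in zip(args, args[1:]):
        if flag != "-device":
            continue
        driver, *props = value.split(",")
        if driver != "vfio-pci":
            continue
        kv = dict(p.split("=", 1) for p in props if "=" in p)
        result[kv.get("host")] = kv.get("rombar")
    return result

# Shortened excerpt of the ps output above, for illustration.
sample = (
    "/usr/libexec/qemu-kvm -no-shutdown -boot strict=on "
    "-device vfio-pci,host=3b:00.0,id=hostdev0,bus=pci.0,addr=0x5,rombar=0 "
    "-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0xb"
)
print(boot_strict_enabled(sample))
print(vfio_rombar_settings(sample))
```

Because ps drops the original shell quoting, values containing spaces (such as the -smbios manufacturer string) split into several tokens; that does not affect the -boot and -device pairs this sketch looks at.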
> The qemu logs are empty:
Heh. Looks like the guest hasn't been restarted in a long time, and there is some weekly log rollover script that's being overzealous about gzipping and moving the logs even if they are 0 length. Thanks for grabbing the ps output though - it gives the same info I was looking for.
As you say, -boot strict=on *is* on the commandline, so only the devices specified should be used for booting (as far as I understand boot strict=on anyway). (The example you've given has rom bar=off for the devices, but there is no reason boot strict would change if rombar wasn't specified.)
Also, I'm surprised by the statement that SeaBIOS on this system is attempting to boot from the NIC devices *before* the disk. Even without boot strict that doesn't make sense. Can you verify/confirm that? Also, can you confirm that the behavior was the same when using <boot order='n'/> rather than <boot dev='hd'/>?
(I noticed that the config is using <boot dev='hd'/> in <os> rather than setting <boot order='n'/> in the disk device. I don't know if that should make any difference to boot strict, but I wanted to point it out for Gerd, who is getting the followup question)
Is our long-time understanding of the purpose of "-boot strict=on" correct - that it instructs SeaBIOS to attempt booting *only* from those devices that are explicitly given a boot order, or that are listed in the boot devices (depending on which of the two methods is chosen in the config)?
> Also, I'm surprised by the statement that SeaBIOS on this system is
> attempting to boot from the NIC devices *before* the disk. Even without boot
> strict that doesn't make sense. Can you verify/confirm that? Also, can you
> confirm that the behavior was the same when using <boot order='n'/> rather
> than <boot dev='hd'/>?
Extending that question: Does it try to pxeboot before the boot menu shows up?
> Is our long-time understanding of the purpose of "-boot strict=on" correct -
> that it instructs SeaBIOS to attempt booting *only* from those devices that
> are explicitly given a boot order, or that are listed in the boot devices
> (depending on which of the two methods is chosen in the config)?
That is correct.
Typically a boot looks like this:
========== [ cut here ] ==========
SeaBIOS (version rel-1.11.0-50-g14221cd86e-prebuilt.qemu-project.org)
iPXE (http://ipxe.org) 00:03.0 C980 PCI2.10 PnP PMM+07F913B0+07EF13B0 C980
[ note: option rom is loaded here, it should initialize and register a boot
entry for the nic ]
Press ESC for boot menu.
Booting from Hard Disk...
Boot failed: could not read the boot disk
Booting from ROM...
iPXE (PCI 00:03.0) starting execution...ok
[ note: pxeboot should happen here, more ipxe messages follow ]
========== [ cut here ] ==========
It is possible though that the card's option rom is rude and goes kick the pxeboot right after loading instead of properly registering a boot entry.
> It is possible though that the card's option rom is rude and goes kick the
> pxeboot right after loading instead of properly registering a boot entry.
Are you suggesting that the ROM might directly start up pxeboot when its initialize function is called, thus pre-empting anything else set in the BIOS? If so, that would be worthy of a very strong hand slap :-/ But wouldn't that cause a similar problem when booting the host system?
Also, I've noticed that on my F29 system, even emulated devices show up in the boot menu (in spite of having -boot strict=on in the commandline) - apparently qemu is finding boot roms for them, mapping them into guest memory space, and then SeaBIOS is offering to boot from them. This sounds counter to what you have confirmed as proper behavior when -boot strict is on...
And just to make Gerd's extra question to the BZ reporter more visible, I'll repeat it:
> Does it try to pxeboot before the boot menu shows up?
(In reply to Laine Stump from comment #16)
> > It is possible though that the card's option rom is rude and goes kick the
> > pxeboot right after loading instead of properly registering a boot entry.
> Are you suggesting that the ROM might directly start up pxeboot when its
> initialize function is called, thus pre-empting anything else set in the
> BIOS?
> If so, that would be worthy of a very strong hand slap :-/
> But wouldn't that cause a similar problem when booting the host system?
Maybe PF and VF have different option rom images.
Possibly it is configurable. Option roms sometimes offer some kind of setup, typically announced with something along the lines of "press <hotkey> for <device> setup" at option rom load time (on the host, or in the guest, or both). If that exists it is worth digging there to see whether this behavior can be turned off somewhere in the setup.
Failing that, it is worth checking whether the hardware in question is supported by ipxe, and should that be the case, using the ipxe rom instead of the one provided by the hardware.
> Also, I've noticed that on my F29 system, even emulated devices show up in
> the boot menu (in spite of having -boot strict=on in the commandline) -
> apparently qemu is finding boot roms for them, mapping them into guest
> memory space, and then SeaBIOS is offering to boot from them. This sounds
> counter to what you have confirmed as proper behavior when -boot strict is
> on...
strict=on only affects automatic boot, i.e. seabios will not fall back to devices without a bootindex=x entry (i.e. the nic in this case) when it could not boot from a device with a bootindex=x entry (i.e. the disk in this case).
Manually picking the nic in the boot menu is always possible, no matter whether strict is on or off.
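As a way to picture the distinction just described, here is a toy model (not SeaBIOS code; the function names and the device tuples are made up for illustration). Devices with a bootindex are tried in order during automatic boot, devices without one are only used as a fallback when strict is off, and the boot menu offers everything either way.

```python
def automatic_boot_order(devices, strict):
    """Return device names in the order an automatic boot would try them.

    devices: list of (name, bootindex) tuples, bootindex is None if unset.
    """
    indexed = sorted(
        (d for d in devices if d[1] is not None), key=lambda d: d[1])
    order = [name for name, _ in indexed]
    if not strict:
        # Without strict=on, devices lacking a bootindex remain as fallback.
        order += [name for name, idx in devices if idx is None]
    return order

def boot_menu_entries(devices):
    """The boot menu lists every bootable device, whatever strict is set to."""
    return [name for name, _ in devices]

devices = [("nic", None), ("disk", 1)]
print(automatic_boot_order(devices, strict=True))
print(automatic_boot_order(devices, strict=False))
print(boot_menu_entries(devices))
```

In this model the nic never appears in the automatic order with strict on, yet it is still selectable from the menu, which matches what was observed on the F29 system.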
Screenshot and video are not that helpful unfortunately.
Is it possible to enable the boot menu, so the "Press ESC for boot menu." line shows up on the screen?
libvirt xml for that:
Even more helpful would be a seabios logfile.
Can be obtained this way:
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
[ ... ]
Then attach /tmp/qemu-firmware.log to this bug.
The information requested is attached in the bugzilla
Hmm. That creates more questions than it answers.
The log (comment 22) looks fairly normal. Option roms are loaded and initialized. They register boot entries, as seen in the boot menu (both logfile and display video). The virtio disk is first in the menu, as it should be.
And then, according to the log, seabios boots from the hard disk. Apparently it successfully passed control to whatever it loaded from the disk. There is no error message and something (some boot loader probably) invokes a bunch of vgabios calls.
That is not consistent with the video though.
The "Booting from Hard Disk..." line which is in the log should have been printed to the vga display too. But it isn't there. And a "Booting from ROM..." message which seabios prints before booting via option rom isn't there either (not in the logfile and not in the display video).
So I'm wondering how the pxerom is invoked in the first place ...
What happens after the search for a boot server times out?
Does the guest actually boot from the hard disk then?
What does the tail of qemu-firmware.log look like while the pxerom is running (and waiting for a dhcp response)?
I have attached two recordings where you can see the vm screen and the qemu-firmware log at the same time. In the one with the boot menu I chose the first entry as the boot device.
Ok, the nic option rom hooks into interrupt 19.
I don't think this is how things are supposed to work.
For legacy option roms hooking into int19 is the way to take over control for boot.
For modern option roms which register a BEV (and show up with a descriptive name in the boot menu) this is not needed because the BEV also contains the entry vector which the bios can call to kick pxe boot.
The intel nic rom does both (register BEV and hook int19) though.
Apparently at least some real hardware has a config option in the bios setup to enable/disable int19 hooks (see https://superuser.com/questions/1000339/interrupt-19-capture-bios-option). seabios allows this unconditionally. This nicely explains why we see different behavior on physical hardware.
So, what are our options?
(a) Add a config option to seabios, similar to real hardware.
Problem with this is that we have to wire this up through all virt
management layers so people can actually make use of it.
(b) Try to do something clever in seabios to avoid the need for a config option,
for example checking whether the option rom registered a BEV and, in case it
did, not allowing it to also hook int19.
(c) Backport commit "d2063b7693 [intelxl] Add driver for Intel 40 Gigabit
Ethernet NICs" to our ipxe package, then go for
in the libvirt config.
This is a test build for variant (c), the ipxe update.
It might provide useful information, but option (c) doesn't help much in production - they can already eliminate the problem by setting <rom bar='off'/> in the libvirt config, but apparently adding a custom <rom> element to the config is problematic to do in OpenStack.
(Also, note that changing the default for rombar to "off" isn't acceptable either, since its default has been "on" since 2011, through 2 major releases of RHEL. Changing it would create havoc among all those installations that expect it to be on.)
If it's possible to "do something clever" as you suggest in (b), that would be the most useful.
The real problem here AFAICS is that the firmware in this ultra-modern 40Gb NIC is pulling shenanigans that would have been acceptable in 1985, but in this age really aren't. Should we be filing a bug with Intel (where the problem really lies IMO)?
> If it's possible to "do something clever" as you suggest in (b), that would
> be the most useful.
Ok, can try that.
> The real problem here AFAICS is that the firmware in this ultra-modern 40Gb
> NIC is pulling shenanigans that would have been acceptable in 1985, but in
> this age really aren't. Should we be filing a bug with Intel (where the
> problem really lies IMO)?
Well, it is more like 1995. I think the BEV mechanism was added in the mid-90s when PCI support showed up in PCs. More than two decades ago.
But, yes, you have a point here. It is kinda silly to care about backward compatibility with bios firmware from the 90s in an option rom for PCI express hardware.
Filing a bug with Intel is worth trying.
Has the seabios update now.
It simply disallows any int19 changes for pnp roms, unconditionally. It also prints a debug message with some rom info, so we can refine the logic should that be needed. So, please try this with seabios logfile enabled (see comment 21).
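To make the described behavior concrete, here is a toy model of it in Python. It is an illustration only and not the actual SeaBIOS patch (which is C): the interrupt vector table is modeled as a dict, the ROM as a dict with a hypothetical `is_pnp` flag and `init` callback, and the log message text is invented.

```python
INT19 = 0x19  # bootstrap loader interrupt

def init_option_rom(ivt, rom, log):
    """Run a ROM's init code, then undo any int19 change made by a PnP ROM."""
    saved = ivt[INT19]
    rom["init"](ivt)
    if rom["is_pnp"] and ivt[INT19] != saved:
        # Debug message so the decision can be reviewed in the firmware log.
        log.append("{}: pnp rom changed int19, reverting".format(rom["name"]))
        ivt[INT19] = saved

def hooking_init(ivt):
    """Stand-in for a ROM whose init code captures int19."""
    ivt[INT19] = "nic-rom-handler"

log = []
ivt = {INT19: "bios-boot-handler"}
init_option_rom(
    ivt, {"name": "nic", "is_pnp": True, "init": hooking_init}, log)
print(ivt[INT19], log)

# A legacy (non-PnP) ROM is left alone, since hooking int19 is its only
# way to take over the boot.
legacy_ivt = {INT19: "bios-boot-handler"}
init_option_rom(
    legacy_ivt, {"name": "old", "is_pnp": False, "init": hooking_init}, log)
print(legacy_ivt[INT19])
```

With the hook reverted, the PnP ROM can still be started through its registered BEV entry, so the boot order decides when (or whether) pxe boot runs.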
Seems like there is nothing libvirt should fix. I am moving this bug to seabios component. Please reset it back if you disagree. Thanks.
ping, any test results with the seabios update?
(In reply to Gerd Hoffmann from comment #37)
> ping, any test results with the seabios update?
Negative feedback. Please see the following recording:
The file can be found in the internal Google Drive created by Edu Alcaniz (c#27).
> Negative feedback. Please see the following recording:
Why negative? What is the problem?
From the screen recording it looks like everything works as intended.
Customer feedback verbatim:
"Thanks for update . I have tried and it seems to be skipping pxe boot attempts. Please find attached screen recording shows the log from vm console and /tmp/qemu-firmware.log."
Ah, wait, I just rewatched the video. It does try to pxe-boot first, it can't find an interface, and then it falls back to booting from Hard Drive. Right?
(In reply to Irina Petrova from comment #41)
> Customer feedback verbatim:
> "Thanks for update . I have tried and it seems to be skipping pxe boot
> attempts. Please find attached screen recording shows the log from vm
> console and /tmp/qemu-firmware.log."
> Ah, wait, I just rewatched the video. It does try to pxe-boot first, it
> can't find an interface, and then it falls back to booting from Hard Drive.
> Right?
The pxe roms are loaded (this is where the messages printed come from).
The roms are not started for pxe boot (no attempt to dhcp, compare with the other videos), so boot ordering (hard drive has highest priority) works again.
The idea that appears to emerge on the upstream SeaBIOS list is similar to (c) in comment 29. Generally disabling Int19 hooking for oproms is considered risky (it could regress valid oproms, if I understand correctly). And a blacklist of specific oproms, if it existed, should not be maintained within SeaBIOS. (I hope that I've summarized the discussion more or less faithfully.)
But option (c) requires a config special case for that particular card (and the desire to avoid that was the entire purpose of this BZ being filed). And if you're going to require a config change, you may as well just change the config by disabling the rombar, which already works without needing to coordinate with the backport of a commit in ipxe.
I think everyone agrees that the real culprit is the firmware on the card though, and Alex W. told me yesterday that this particular card (Intel XL710) permits updates to its firmware, so how about suggesting that Intel make a firmware update available for the card that doesn't capture int 19h, then the customer can install that update on their hardware once and be done with it - no config changes needed.
I think at the very least someone who knows how to navigate Intel firmware bug reporting should report this to them - maybe it truly is an oversight (if it wasn't, I would expect this behavior from other Intel SRIOV NIC cards, and this is the only one I've heard of), and they'll just respond with "Oh yeah, how did we miss that?!?" (or maybe they'll respond with "Yes, it does do that, and we had to do it because XYZ motherboard had obscure problem PDQ, and this was the only way to fix it.", but at least then we'll know the reason).
Upstream discussion is still in progress.
May I ask for an additional test?
With the updated seabios and boot menu enabled (see comment 21), will the pxeboot start correctly if one of the NICs is picked in the boot menu?
Patch has been updated after some upstream discussions:
New test packages are available:
Can you please test this version too?
Behavior should be identical to the previous version (no pxe boot by default, but pxe boot via the boot menu should be possible), except for a slightly changed message text in the debug log when seabios reverts the int19 redirection.
upstream commit 0932c20560574696cf87ddd12623e8c423ee821b
The seabios rebase (probably to the not-yet-available 1.13) will pick up the fix.