Bug 2101787

Summary: [rhel.8.7] Loading a kernel/initrd is sometimes very slow
Product: Red Hat Enterprise Linux 8
Component: seabios
Version: 8.7
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Yedidyah Bar David <didi>
Assignee: Gerd Hoffmann <kraxel>
QA Contact: Xueqiang Wei <xuwei>
CC: coli, jinzhao, juzhang, kkiwi, kraxel, mrezanin, msobczyk, virt-maint, ymankad, zhguo
Target Milestone: rc
Keywords: Performance, Regression, Triaged
Hardware: Unspecified
OS: Unspecified
Fixed In Version: seabios-1.16.0-3.module+el8.7.0+16134+e5908aa2
Clones: 2108555 (view as bug list)
Bug Blocks: 2108555
Last Closed: 2022-11-08 09:20:10 UTC
Type: Bug

Attachments:
HostedEngine-console.log

Description Yedidyah Bar David 2022-06-28 11:14:19 UTC
Created attachment 1893165 [details]
HostedEngine-console.log

Description of problem:

Some time ago, oVirt's automated testing [1] started failing, timing out on a VM's start. At some point we added logging of the VM's console to a file, which revealed that the VM booted, reached the grub menu, and selected an entry, then got stuck. Later we tried manually running linux/initrd commands from the grub console and saw that they are sometimes very slow.

Today I spent some time on this, pushed a patch [2] to add a progress indicator (using 'insmod progress') to grub, and saw that loading is indeed slow - often much slower than expected, around 100KiB/s - but not always: sometimes it gets faster, up to around 10MiB/s (which is, I guess, the expected/normal speed).

I then tried to bisect the regression manually by downgrading relevant components to older versions: qemu-kvm and grub, which didn't help, and then seabios-bin, which did. Hence filing the bug against this component.

We are using CentOS Stream 8, and the current/broken version is seabios-bin-1.16.0-1.module_el8.7.0+1140+ff0772f9.noarch.

With seabios-bin-1.15.0-1.module_el8.6.0+1087+b42c8331.noarch, all my tests showed better throughput - the minimum was ~400KiB/s, often 1-2MiB/s, sometimes more than 20MiB/s.

[1] https://github.com/oVirt/ovirt-system-tests/

[2] https://github.com/oVirt/ovirt-system-tests/pull/203

Version-Release number of selected component (if applicable):
seabios-bin-1.16.0-1.module_el8.7.0+1140+ff0772f9.noarch

How reproducible:
Often, but not always; I am not sure what determines it.

Steps to Reproduce:
1. Boot a VM with a large (the default) kernel+initrd

Actual results:
Sometimes it's very slow

Expected results:
Should always be as fast as the platform allows

Additional info:

Attaching a console log produced by the patch in [2] above. You can view it comfortably with:

$ cat HostedEngine-console.log | tr '\r' '\n' | less -R +/'\[ vml|\[ ini'

Comment 1 Gerd Hoffmann 2022-06-29 09:04:21 UTC
What exactly is the boot device?
Can you capture a firmware log?

  <qemu:commandline>
    <qemu:arg value='-chardev'/>
    <qemu:arg value='file,id=firmware,path=/tmp/edk2-devel-q35.log'/>
    <qemu:arg value='-device'/>
    <qemu:arg value='isa-debugcon,iobase=0x402,chardev=firmware'/>
  </qemu:commandline>
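
Note that for libvirt to accept a <qemu:commandline> element, the QEMU driver namespace has to be declared on the domain root element; a minimal sketch (paths and log file name as in the snippet above):

```xml
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <!-- ... rest of the domain definition ... -->
  <qemu:commandline>
    <qemu:arg value='-chardev'/>
    <qemu:arg value='file,id=firmware,path=/tmp/edk2-devel-q35.log'/>
    <qemu:arg value='-device'/>
    <qemu:arg value='isa-debugcon,iobase=0x402,chardev=firmware'/>
  </qemu:commandline>
</domain>
```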

Comment 2 Yedidyah Bar David 2022-06-29 10:04:27 UTC
(In reply to Gerd Hoffmann from comment #1)
> What exactly is the boot device?

Attaching the libvirt xml (copied from a typical successful run of our CI; it should be mostly identical to the current case).

> Can you capture a firmware log?
> 
>   <qemu:commandline>
>     <qemu:arg value='-chardev'/>
>     <qemu:arg value='file,id=firmware,path=/tmp/edk2-devel-q35.log'/>
>     <qemu:arg value='-device'/>
>     <qemu:arg value='isa-debugcon,iobase=0x402,chardev=firmware'/>
>   </qemu:commandline>

I'll try.

Comment 5 Gerd Hoffmann 2022-06-29 10:25:14 UTC
Three virtio-blk changes between 1.15 and 1.16:

a05af290bac5 virtio-blk: split large IO according to size_max
815d7498655b virtio-blk: abstract a function named virtio_blk_op_one_segment to handle r/w request
27b573d4f5a5 virtio-blk: add feature VIRTIO_BLK_F_SIZE_MAX and VIRTIO_BLK_F_SEG_MAX

They make seabios query the host for the maximum request size; if an I/O request is larger than that, seabios splits it into smaller ones. This should essentially
be a no-op, but maybe there is a bug in there that makes seabios split requests even
when it is not needed, thereby slowing down the boot.

The firmware log (grep for lines with virtio-blk) should help figure this out.

Comment 8 Yedidyah Bar David 2022-06-30 06:12:18 UTC
Attached some console and firmware logs.

Files named *-03 are from a plain unattended boot.

Files named *-04* are from a run where I entered the grub prompt and ran several 'initrd' commands to load the default initramfs image (~67 MB). During some of them, I imposed cpu load on the host (a plain 'while true; do :; done &'). Running more than one endless loop in parallel made no difference; apparently we (vdsm/libvirt, no idea which) somehow reserve a cpu for the qemu process (the host, actually a nested-kvm VM, has 2 cpus). While the cpu was loaded, throughput was ~100KB/s; when it was not, throughput was faster, but varied significantly.

Files named *-05* are from a flow similar to 04, but after downgrading seabios-bin to 1.15. Throughput during cpu load is ~500KB/s; without cpu load it's more than 1MB/s (in previous cases it also went much higher; not sure what changed. Perhaps the various tests cause memory fragmentation or something similar that keeps it rather slow even when unloaded - just a wild guess).

Comment 9 Gerd Hoffmann 2022-06-30 10:59:03 UTC
(In reply to Yedidyah Bar David from comment #8)
> Attached some console and firmware logs.

Seems to be corrupted or incomplete. I get a decompression error, and only HostedEngine-console.log-02 unpacked; the other files are missing.

Comment 20 Yanan Fu 2022-07-29 01:51:20 UTC
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 28 Xueqiang Wei 2022-08-09 08:09:28 UTC
According to Comment 21 and Comment 25, set status to VERIFIED. Thanks.

Comment 30 errata-xmlrpc 2022-11-08 09:20:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7472