Bug 847546

Summary: seabios sits for 3 seconds doing nothing during boot
Product: [Fedora] Fedora Reporter: Richard W.M. Jones <rjones>
Component: kernelAssignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED NOTABUG QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rawhideCC: amit.shah, berrange, cfergeau, dwmw2, gansalmon, itamar, jonathan, kernel-maint, knoel, madhu.chinakonda, pbonzini, rjones, scottt.tw, virt-maint
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-08-13 16:33:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Richard W.M. Jones 2012-08-12 21:26:36 UTC
Description of problem:

$ annotate-output +'%T:%N' guestfish -a /dev/null run -v
[...]
22:17:54:457493465 E: [00514ms] /usr/bin/qemu-kvm [...]
22:17:54:460912406 E: Google, Inc.
22:17:54:464330486 E: Serial Graphics Adapter 01/14/12
22:17:54:467727552 E: SGABIOS $Id: sgabios.S 8 2010-04-22 00:03:40Z nlaredo $ (\
mockbuild@) Sat Jan 14 23:18:18 UTC 2012
22:17:54:471107681 E: Term: 80x24
22:17:54:474397757 E: 4 0
22:17:54:478236572 E: x1b[2J^MSeaBIOS (version 1.7.0-20120722_040125-)^M
22:17:54:481615939 E: ^M
22:17:54:624122918 E: ^MBooting from ROM...^M
22:17:57:767809071 E: ^MProbing EDD (edd=off to disable)... ok^M
22:17:58:033771349 E: x1b[2J[    0.000000] Initializing cgroup subsys cpuset
22:17:58:037214742 E: [    0.000000] Initializing cgroup subsys cpu

Notice the 3 second pause 22:17:54:624122918 -> 22:17:57:767809071.
That's 3/8ths of the entire time libguestfs takes to boot up.

Version-Release number of selected component (if applicable):

$ rpm -qf /usr/share/qemu/sgabios.bin
qemu-system-x86-1.1.1-1.fc18.x86_64

How reproducible:

100%

Steps to Reproduce:

Run this command:
  annotate-output +'%T:%N' guestfish -a /dev/null run -v
in Fedora 18+.

Observe that it hangs for 3 seconds.

FYI here is the full qemu command line we are using:

/usr/bin/qemu-kvm     -global virtio-blk-pci.scsi=off     -nodefconfig     -nodefaults     -nographic     -device virtio-scsi-pci,id=scsi     -drive file=/dev/null,id=hd0,if=none     -device scsi-hd,drive=hd0     -drive file=/var/tmp/.guestfs-1000/root.2326,snapshot=on,id=appliance,if=none,cache=unsafe     -device scsi-hd,drive=appliance     -machine accel=kvm:tcg     -m 500     -no-reboot     -no-hpet     -device virtio-serial     -serial stdio     -device sga     -chardev socket,path=/tmp/libguestfslnVlGO/guestfsd.sock,id=channel0     -device virtserialport,chardev=channel0,name=org.libguestfs.channel.0     -kernel /var/tmp/.guestfs-1000/kernel.2326     -initrd /var/tmp/.guestfs-1000/initrd.2326     -append 'panic=1 console=ttyS0 udevtimeout=600 no_timer_check acpi=off printk.time=1 cgroup_disable=memory root=/dev/sdb selinux=0 guestfs_verbose=1 TERM=xterm '

Comment 1 Richard W.M. Jones 2012-08-12 21:37:36 UTC
Also tested with qemu 2:1.2-0.1.20120806git3e430569.fc18.

This also suffers from the same problem -- 3 second delay in
sgabios.

It also contains some sort of bug in virtio-scsi emulation
so that the kernel crashes when the guest does 'insmod virtio_scsi.ko'.
I'll open a separate bug about that.

Comment 2 Richard W.M. Jones 2012-08-12 21:55:19 UTC
Sorry, wrong component.  The bios is located in the
sgabios package:

sgabios-bin-0-0.20110622SVN.fc17.noarch

(I tried the later sgabios, but it is completely
broken.  I will file a separate bug ...)

Comment 3 Richard W.M. Jones 2012-08-12 22:29:23 UTC
... SeaBIOS ...

Comment 4 Richard W.M. Jones 2012-08-13 12:18:32 UTC
Even enabling full debugging in seabios didn't make the
problem obvious, so I took a Fedora 17 (baremetal) host
and updated it component by component to Fedora 18 until
the bug manifested itself.

The component to blame is the *host kernel* (NB: not the
guest kernel).

Updating from 3.4.0-1.fc17.x86_64 -> 3.6.0-0.rc1.git4.1.fc18.x86_64
is what triggers the problem.

So I guess this is some sort of regression in the KVM module,
perhaps causing guest memory or I/O access to be slower.

Comment 5 Richard W.M. Jones 2012-08-13 13:00:50 UTC
I tracked this down using gdb, and it does seem to be a
regression (severe slow down) in the 'rep insb' instruction.

(gdb) info registers 
eax            0x500011 5242897
ecx            0x354990 3492240
edx            0x511    1297
ebx            0x0      0
esp            0x6f40   0x6f40
ebp            0x6f40   0x6f40
esi            0x0      0
edi            0x2b1400 2823168
eip            0x15e    0x15e
eflags         0x86     [ PF SF ]
cs             0xc380   50048
ss             0x0      0
ds             0xc380   50048
es             0x10     16
fs             0x0      0
gs             0x0      0
(gdb) x/5i ($cs*16)+$eip
   0xc395e:     rep insb (%dx),%es:(%edi)   <--- here
   0xc3961:     mov    $0xa,%ax
   0xc3964:     mov    $0x510,%dx
   0xc3967:     out    %ax,(%dx)
   0xc3968:     mov    $0x511,%dx

Port 0x511 is PORT_QEMU_CFG_DATA, ie. fw_cfg data.

I don't know the full size of the read or the destination,
because by the time I hit ^C it had already copied maybe
half of the data, so %ecx and %edi have already decremented/
incremented from the initial values.

Lots of places in the seabios code call qemu_cfg_read and
it could be any one of them, so still debugging this.

Comment 6 Josh Boyer 2012-08-13 13:55:39 UTC
(In reply to comment #4)
> Even enabling full debugging in seabios didn't make the
> problem obvious, so I took a Fedora 17 (baremetal) host
> and updated it component by component to Fedora 18 until
> the bug manifested itself.
> 
> The component to blame is the *host kernel* (NB: not the
> guest kernel).
> 
> Updating from 3.4.0-1.fc17.x86_64 -> 3.6.0-0.rc1.git4.1.fc18.x86_64
> is what triggers the problem.
> 
> So I guess this is some sort of regression in the KVM module,
> perhaps causing guest memory or I/O access to be slower.

That has all of the various kernel debugging options enabled.  They are known to slow things down.  You might want to test a non-debug kernel.  Grab 

http://koji.fedoraproject.org/koji/buildinfo?buildID=345823

or wait for the first -rc2 kernel to be built if that one has bugs that hold you up.

Comment 7 Richard W.M. Jones 2012-08-13 14:35:47 UTC
I don't know why there is a regression.

However I do have a fix for this, which is to apply Gleb's
(non-upstream) patch series to improve rep ins performance:

http://comments.gmane.org/gmane.comp.emulators.kvm.devel/95259

I applied this series on top of 3.6.0-0.rc1.git3.2 and
essentially the pause goes away.  Going from "Booting the ROM"
to the first kernel message is about 0.8 seconds (versus 3 before).

Comment 8 Richard W.M. Jones 2012-08-13 14:37:22 UTC
(In reply to comment #6)
> That has all of the various kernel debugging options enabled.  They are
> known to slow things down.  You might want to test a non-debug kernel.  Grab 

How can I find out if a kernel has debugging options enabled?

Comment 9 Richard W.M. Jones 2012-08-13 14:43:18 UTC
(In reply to comment #7)
> http://comments.gmane.org/gmane.comp.emulators.kvm.devel/95259
> 
> I applied this series on top of 3.6.0-0.rc1.git3.2 and
> essentially the pause goes away.  Going from "Booting the ROM"
> to the first kernel message is about 0.8 seconds (versus 3 before).

Just to be clearer about this.  I tested two identical host
kernels except that one had Gleb's patch series applied.  So
whether or not debugging is enabled in this kernel, the patch
series still makes a big difference.

Comment 10 Josh Boyer 2012-08-13 14:46:29 UTC
(In reply to comment #8)
> (In reply to comment #6)
> > That has all of the various kernel debugging options enabled.  They are
> > known to slow things down.  You might want to test a non-debug kernel.  Grab 
> 
> How can I find out if a kernel has debugging options enabled?

If you're using rawhide, we build one kernel without debug set per RC/release.  So the first build of rc1, rc2, etc will have them off.  Otherwise, they're always enabled.  Anything with -gitX will have debug options enabled.

Comment 11 Richard W.M. Jones 2012-08-13 16:27:16 UTC
(In reply to comment #6)
> That has all of the various kernel debugging options enabled.  They are
> known to slow things down.  You might want to test a non-debug kernel.

What I did was I removed Gleb's patches and cherry picked
d1027f01240c3426211a80146f8c36c2b5736865 (which disables
debug options).

So now I have a kernel which has debugging off, but is otherwise
completely identical to the one I was testing before (eg. in
comment 4).

I rebooted the host with this kernel and ran the test again.

The delay is now ~ 1.2 seconds, so it does seem likely that
the regression was caused by adding debugging options.

Nevertheless Gleb's patches are a clear benefit (even with
debugging on).

Comment 12 Richard W.M. Jones 2012-08-13 16:33:39 UTC
OK I'm going to close this because I'm satisfied with
the boot times I'm getting without debug enabled.