Bug 720995

Summary: kvm gets stuck while booting 32 bit guest on 64 bit host with smp
Product: [Fedora] Fedora
Reporter: Ilkka Tengvall <ikke>
Component: qemu
Assignee: Fedora Virtualization Maintainers <virt-maint>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: 15
CC: amit.shah, berrange, clalance, crobinso, dougsland, dwmw2, ehabkost, extras-orphan, itamar, jaswinder, jforbes, knoel, markmc, notting, pcfe, quintela, rdassen, scottt.tw, tburke, virt-maint
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2012-05-29 00:54:45 UTC
Bug Blocks: 494832

Description Ilkka Tengvall 2011-07-13 13:51:16 UTC
Description of problem:

Recently I haven't been able to boot my 32-bit Linux guests because KVM gets stuck while the guest kernel boots. KVM consumes 200% CPU when the guest is given 2 CPUs.


Version-Release number of selected component (if applicable):

qemu-kvm-0.14.0-7.fc15.x86_64
kernel-2.6.38.8-32.fc15.x86_64


How reproducible:

Most boots fail (an estimated ~98%).

Steps to Reproduce:
1. Download the RHEL or CentOS vmlinuz and initrd:
   wget ftp://ftp.funet.fi/pub/Linux/mirrors/centos/5.6/os/i386/isolinux/{vmlinuz,initrd.img}
2. Start the guest:
   qemu-kvm -M pc-0.14 -enable-kvm -m 2048 -smp 2,sockets=2,cores=1,threads=1 -name CentOS -kernel vmlinuz -initrd initrd.img

  
Actual results:

Booting stops around the time the kernel frees unused memory, but the last console line shown may be slightly earlier or later. The hang is timing dependent (a race) rather than tied to a specific console output line.

Expected results:

The guest boots successfully.

Additional info:

I tried to take a perf trace of it, but the CPU is too busy for perf to do so.
I'm willing to take more traces if someone tells me exactly what is needed.

The system is an up-to-date Fedora 15, with this CPU:
Intel(R) Core(TM) i7 CPU       X 000  @ 3.33GHz
and 24GB of RAM. The same happened to me on an older Xeon too.

Comment 1 Patrick C. F. Ernzer 2011-07-13 14:13:14 UTC
Some additions after talking with Ikke and doing some tests as well.

 - The qemu-kvm line is for everybody's ease of testing. Ikke gets the same result when using virt-manager in clicky mode.
 - RHEL 6.1 host works fine (did change the -M to rhel5.5.0 as there is no pc-0.14 on el6), did 10 successful tests in a row
 - F15 x86_64 also fails for me, roughly 2 out of 3 attempts fail
 - A test with the latest qemu and kernel from rawhide will follow shortly (Ikke; bc-blade-02 in the test network downstairs)

Comment 2 Patrick C. F. Ernzer 2011-07-13 15:13:32 UTC
(In reply to comment #1)
> some additions after talking with Ikke and doing some tests as well.
> 
>  - RHEL 6.1 host works fine (did change the -M to rhel5.5.0 as there is no
> pc-0.14 on el6), did 10 successful tests in a row

Ignore the above. The 6.1 test was done on an IBM LS21, which has an AMD CPU, and further testing reveals that F15 x86_64 as a host also works on that hardware.

sorry

>  - F15 x86_64 also fails for me, roughly 2 out of 3 attempts fail

That remains as written. The test was done on an Intel CPU (my workstation), but I will not install RHEL 6.1 or rawhide on that box, as it's my main work tool.

Ikke has the same problem: he sees the failure on his workstation, and changing distro there would hinder his other work too much.

I'll hunt for a box in the lab here that actually reproduces the issue with an F15 x86_64 host and report back (I have an HP pizzabox in mind).

Comment 3 Patrick C. F. Ernzer 2011-07-13 16:41:48 UTC
re-testing done on an Intel CPU (Ikke; acpi4-15 in the test network)

F15 host: same as reported by Ikke, about half the boots fail.

Updated the F15 host to rawhide: 10 consecutive successful boots of the guest.

I guess we can close this as RAWHIDE; is that OK with you, Ikke? Rawhide being rawhide, I would not blindly update my workstation the way I did with that test machine. YMMV.

Another question: do you want me to redo the RHEL 6.1 test round on this box? (For now, I guess leaving it on rawhide so you can also have a look is more reasonable for your use case.)

Comment 4 Ilkka Tengvall 2011-07-14 08:36:32 UTC
Thanks pcfe, I tried it several times on the box you set up, and it does not seem to get stuck. I also spent an hour searching the KVM mailing list for a fix for this regression and finally found it:

http://marc.info/?l=linux-kernel&m=129942743310538&w=4

and for anyone else hitting the same problem, here is a workaround:

add kernel command line parameter "clocksource=acpi_pm" and it will boot \o/
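
For the reproducer from the description, the workaround could be sketched roughly as below. This assumes the parameter is handed to the guest kernel through qemu's -append option, which is used together with -kernel. The function name boot_with_workaround and the DRY_RUN switch are made up for illustration and are not part of qemu.

```shell
# Sketch only: the reproducer command from the description plus the workaround.
# Assumption: -append carries the guest kernel command line when booting
# with -kernel/-initrd. boot_with_workaround and DRY_RUN are hypothetical names.
boot_with_workaround() {
    # With DRY_RUN set to a non-empty value, the command is printed
    # instead of starting a guest.
    ${DRY_RUN:+echo} qemu-kvm -M pc-0.14 -enable-kvm -m 2048 \
        -smp 2,sockets=2,cores=1,threads=1 -name CentOS \
        -kernel vmlinuz -initrd initrd.img \
        -append "clocksource=acpi_pm"
}
```

Calling boot_with_workaround after setting DRY_RUN=1 prints the full command line for review; calling it without DRY_RUN set starts the guest.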

Please update KVM in Fedora to fix the regression.

Comment 5 Ilkka Tengvall 2011-07-14 10:21:40 UTC
This is the fix needed:

$ git show 1aa8ceef
commit 1aa8ceef0312a6aae7dd863a120a55f1637b361d
Author: Nikola Ciprich <extmaillist>
Date:   Wed Mar 9 23:36:51 2011 +0100

    KVM: fix kvmclock regression due to missing clock update
    
    commit 387b9f97750444728962b236987fbe8ee8cc4f8c moved kvm_request_guest_time_update(vcpu),
    breaking 32bit SMP guests using kvm-clock. Fix this by moving (new) clock update function
    to proper place.
    
    Signed-off-by: Nikola Ciprich <nikola.ciprich>
    Acked-by: Zachary Amsden <zamsden>
    Signed-off-by: Avi Kivity <avi>

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 01f08a6..f1e4025 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2127,8 +2127,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
                if (check_tsc_unstable()) {
                        kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
                        vcpu->arch.tsc_catchup = 1;
-                       kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
                }
+               kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
                if (vcpu->cpu != cpu)
                        kvm_migrate_timers(vcpu);
                vcpu->cpu = cpu;
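
Since the commit message says the regression hits 32-bit SMP guests using kvm-clock, it can help to check which clocksource a guest is actually running on. A minimal sketch, assuming the usual Linux sysfs clocksource files are present in the guest; the function names are made up for illustration, and the optional directory argument exists only so the check can be tried against a test directory.

```shell
# Sketch only: report the active clocksource of the running system.
# Assumption: the standard sysfs location below exists in the guest.
# current_clocksource and uses_kvmclock are hypothetical helper names.
current_clocksource() {
    dir=${1:-/sys/devices/system/clocksource/clocksource0}
    cat "$dir/current_clocksource"
}

# Exit status 0 when the active clocksource is kvm-clock, non-zero otherwise.
uses_kvmclock() {
    [ "$(current_clocksource "$1")" = "kvm-clock" ]
}
```

Run inside the guest, uses_kvmclock should succeed on an affected setup and fail once the guest has been booted with clocksource=acpi_pm.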

Comment 6 Patrick C. F. Ernzer 2011-07-14 13:31:24 UTC
Thanks Ikke. Let's see what the owner of the bug thinks.
Setting NEEDINFO on the bug owner.

Comment 7 Justin M. Forbes 2011-07-25 17:05:07 UTC
This seems to be correct and will make the next update.

Comment 8 Fedora Admin XMLRPC Client 2012-03-15 17:53:14 UTC
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 9 Cole Robinson 2012-05-29 00:54:45 UTC
Verified this code is in f15 kernel git, so closing as CURRENTRELEASE