Bug 820112 - CPU usage is 100% after booting a win2k8R2 guest with -smp 48 and -m 256GB; the guest sometimes BSODs during bootup on an AMD host
Status: CLOSED CANTFIX
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: qemu-kvm
Version: 6.3
Hardware: Unspecified OS: Unspecified
Priority: urgent Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Gleb Natapov
QA Contact: Virtualization Bugs
Duplicates: 823839
Depends On:
Blocks: 851382
 
Reported: 2012-05-09 04:04 EDT by Mike Cao
Modified: 2013-01-09 19:56 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-11-14 07:45:05 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
ftrace (14.95 MB, text/plain)
2012-05-09 04:20 EDT, Mike Cao

Description Mike Cao 2012-05-09 04:04:41 EDT
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Start the guest with -smp 48 and -m 256G
CLI:/usr/libexec/qemu-kvm -m 256G -smp 48,cores=48,sockets=1,threads=1 -cpu cpu64-rhel6,+x2apic,family=0xf -drive file=amd-max-sut.raw,format=raw,if=none,id=drive-virtio0,cache=none,werror=stop,rerror=stop -device virtio-blk-pci,drive=drive-virtio0,id=virtio-blk-pci0,bootindex=1 -netdev tap,sndbuf=0,id=hostnet0,script=/etc/qemu-ifup0,downscript=no,vhost=on -device virtio-net-pci,netdev=hostnet0,mac=00:10:1a:61:72:01,bus=pci.0,addr=0x4,id=virtio-net-pci0 -uuid 7678e130-f1a7-4157-875a-8defcdb27af7 -rtc base=localtime,clock=host,driftfix=slew -no-kvm-pit-reinjection -chardev socket,id=111a,path=/tmp/amd-max-sut,server,nowait -mon chardev=111a,mode=readline -name amd-max-sut -vnc :1 

Actual results:
1. The guest sometimes BSODs (will attach the dump file later).
2. After guest login, per-CPU usage is 100% and the guest is very slow to respond.

Expected results:


Additional info:
This is a test blocker for the SVVP test.
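
For reference, the -smp option in the command line above describes the guest topology as 1 socket x 48 cores x 1 thread. The following is a minimal Python sketch that parses such a value and checks that the product matches the vCPU count. The helper names parse_smp and is_consistent are hypothetical, and qemu's own validation rules may differ between versions.

```python
def parse_smp(option):
    """Parse a "-smp" value such as "48,cores=48,sockets=1,threads=1"."""
    topology = {}
    for field in option.split(","):
        if "=" in field:
            key, value = field.split("=", 1)
        else:
            # A bare leading number is the vCPU count.
            key, value = "cpus", field
        topology[key] = int(value)
    return topology


def is_consistent(topology):
    """True when sockets * cores * threads equals the vCPU count.

    If no topology keys are given at all, there is nothing to check.
    """
    keys = ("sockets", "cores", "threads")
    if not any(key in topology for key in keys):
        return True
    product = 1
    for key in keys:
        product *= topology.get(key, 1)
    return product == topology["cpus"]


smp = parse_smp("48,cores=48,sockets=1,threads=1")
print(smp, is_consistent(smp))
```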
Comment 1 Mike Cao 2012-05-09 04:20:45 EDT
Created attachment 583182 [details]
ftrace
Comment 6 Mike Cao 2012-05-09 07:16:36 EDT
This only occurs on an AMD host; the bug did not occur on an Intel host.
Comment 21 Mike Cao 2012-05-11 06:34:36 EDT
Hi, Vadim

I tried upstream qemu-kvm.
After guest bootup, the "System" process consumes all the CPU resources for more than 15 minutes. The guest is very slow to respond.

I did not reproduce the graphics driver crash during boot (comment #5) w/ upstream qemu-kvm and the upstream kernel.

Best Regards,
Mike
Comment 22 Vadim Rozenfeld 2012-05-11 07:42:35 EDT
(In reply to comment #21)

OK, now please try running VM with the following cpu options 
",+x2apic,hv_spinlocks=1000,hv_relaxed,hv_vapic"

Cheers,
Vadim.
Comment 23 Mike Cao 2012-05-13 23:22:11 EDT
(In reply to comment #22)

Tried upstream qemu-kvm w/ -cpu host,,+x2apic,hv_spinlocks=1000,hv_relaxed,hv_vapic on the AMD host; still hit the same issue.

*note* I did not hit this issue on an Intel host.
Comment 28 Mike Cao 2012-05-16 05:11:59 EDT
Re-tested this bug w/ the following scenarios:

1. Start the guest w/ -smp 48 -m 256G -rtc base=localtime,clock=host,driftfix=none (-numa node) *12

Actual Results:
The guest always BSODs during boot, referring to comment #5.
After guest bootup, CPU usage is 2%.

2. Start the guest w/ -smp 48 -m 256G -rtc base=localtime,clock=host,driftfix=slew (-numa node) *12

Actual Results:
The guest always BSODs during boot, referring to comment #5.
After guest bootup, CPU usage is 2%.
The guest will BSOD; the dumps are similar to Bug 801196.

3. Start the guest w/ -smp 48 -m 256G -rtc base=localtime,clock=host,driftfix=none w/o numa

Actual Results:
The guest always BSODs during boot, referring to comment #5.
After guest bootup, CPU usage is 100%.


4. Start the guest w/ -smp 32 -m 256G -rtc base=localtime,clock=host,driftfix=none (-numa node) *12

Actual Results:
After guest bootup, CPU usage is 2%.

5. Start the guest w/ -smp 32 -m 256G -rtc base=localtime,clock=host,driftfix=slew (-numa node) *12

Actual Results:
After guest bootup, CPU usage is 2%.
The guest will BSOD; the dumps are similar to Bug 801196.


Mike
Comment 29 Dor Laor 2012-05-16 05:37:10 EDT
(In reply to comment #28)
> Re-tested this bug w/ the following scenarios:
> 
> 1. Start the guest w/ -smp 48 -m 256G -rtc base=localtime,clock=host,driftfix=none
> (-numa node) *12 

What's "*12"?
Gleb, care to provide recommended params for it?
 
> Actual Results: 
> The guest always BSODs during boot, referring to comment #5.

What about using the qxl driver instead?

> After guest bootup, CPU usage is 2%.

Does this mean the guest managed to survive the BSOD?
Comment 30 Mike Cao 2012-05-16 06:38:49 EDT
(In reply to comment #29)
> (In reply to comment #28)
> > Re-tested this bug w/ the following scenarios:
> > 
> > 1. Start the guest w/ -smp 48 -m 256G -rtc base=localtime,clock=host,driftfix=none
> > (-numa node) *12 
> 
> What's "*12"?

</usr/libexec/qemu-kvm XXXX > -numa node -numa node -numa node -numa node -numa node -numa node -numa node -numa node -numa node -numa node -numa node -numa node 

> Gleb, care to provide recommended params for it?
> 
> > Actual Results: 
> > The guest always BSODs during boot, referring to comment #5.
> 
> What about using the qxl driver instead?

xfu tried this and did not hit the graphics-driver BSOD.

> 
> > After guest bootup, CPU usage is 2%.
> 
> Does this mean the guest managed to survive the BSOD?

I hit 2 kinds of BSOD in this bug.
One is related to the graphics driver (referring to comment #5). This kind of BSOD occurs 6 out of 8 times during guest bootup when I use -smp 48 and -vnc.

The other is related to CLOCK_WATCHDOG_TIMEOUT (referring to Bug 801196). This kind of BSOD occurs almost 100% of the time after I log in to the guest when using -rtc driftfix=slew and -numa node.
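
For reference, the "(-numa node) *12" shorthand from comment #28 expands to twelve repeated -numa arguments. The following is a minimal Python sketch of that expansion. The helper name numa_args is hypothetical and not part of any qemu tooling; the explicit_ids variant produces the "node,nodeid=N" form that appears in later comments of this report.

```python
def numa_args(count, explicit_ids=False):
    """Return qemu command-line arguments declaring `count` NUMA nodes.

    explicit_ids=False gives bare "-numa node" pairs (comment #30);
    explicit_ids=True gives "-numa node,nodeid=N" pairs.
    """
    args = []
    for node in range(count):
        spec = "node,nodeid=%d" % node if explicit_ids else "node"
        args.extend(["-numa", spec])
    return args


# Expand the "*12" shorthand into the literal argument string.
print(" ".join(numa_args(12)))
```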
Comment 37 Ademar Reis 2012-05-21 18:07:40 EDT
Too late for RHEL6.3, postponing to 6.4 (ask for the z-stream if necessary).
Comment 43 Gleb Natapov 2012-06-09 13:13:37 EDT
*** Bug 823839 has been marked as a duplicate of this bug. ***
Comment 47 Ronen Hod 2012-07-17 05:20:13 EDT
Might be the same issue as Bug 821377
Comment 48 Ronen Hod 2012-07-25 12:52:11 EDT
We suspect that it might be related to bug 842211 (in POST)
Gleb tested that bug fix using the brew build https://brewweb.devel.redhat.com/taskinfo?taskID=4639681
so until we have official 6.4 kernel build, can you please try it.

Thanks, Ronen.
Comment 49 FuXiangChun 2012-07-26 06:17:55 EDT
(In reply to comment #48)

The host used the kernel below to re-test this bug. It still has the same issue.
https://brewweb.devel.redhat.com/taskinfo?taskID=4639681
Comment 50 Gleb Natapov 2012-07-26 06:24:13 EDT
(In reply to comment #49)

Which issue exactly? There are a lot of issues discussed throughout this BZ. What I am interested in checking is running qemu with 12 numa nodes specified to see if we get a BSOD like in comment #5.
Comment 51 FuXiangChun 2012-07-26 23:22:55 EDT
(In reply to comment #50)

Tested two scenarios with this kernel:
  https://brewweb.devel.redhat.com/taskinfo?taskID=4639681

1. Boot the guest without NUMA
   /usr/libexec/qemu-kvm -m 256G -smp 48,cores=48,sockets=1,threads=1
 
 Test result:
 The guest doesn't show a BSOD, but CPU utilization is still 100%.

2. Boot the guest with NUMA
   /usr/libexec/qemu-kvm -m 256G -smp 48,cores=48,sockets=1,threads=1 ..... -numa node,nodeid=0 -numa node,nodeid=1 -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 -numa node,nodeid=8 -numa node,nodeid=9 -numa node,nodeid=10 -numa node,nodeid=11
 
 Test result:
 The guest works well.


host info:
 cpu:24cores,AMD Opteron(tm) Processor 6168
 memory size:264516632 kB
 Numa number:
# cat /proc/buddyinfo

Node 0, zone      DMA      0      1      3      2      0      0      1      1      0      1      3 
Node 0, zone    DMA32     89     50     47     16      8      7     74     77     36      2     36 
Node 0, zone   Normal    186     43     21      9     24     16     14     14     21      1      2 
Node 1, zone   Normal    105    193    122     42     21     12      8      9     30      2      1 
Node 2, zone   Normal     57     23     20      7     10     12     10      7      5     14      2 
Node 3, zone   Normal     10     15     41     19     11     10      4      4     33      0      1
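
For reference, the /proc/buddyinfo output above lists one line per (node, zone) pair, so the number of host NUMA nodes can be read from the distinct node IDs. The following is a minimal Python sketch of that count, using the listing from this comment as sample input. The helper name numa_node_ids is hypothetical.

```python
import re

# Sample input copied from the /proc/buddyinfo listing in this comment.
SAMPLE = """\
Node 0, zone      DMA      0      1      3      2      0      0      1      1      0      1      3
Node 0, zone    DMA32     89     50     47     16      8      7     74     77     36      2     36
Node 0, zone   Normal    186     43     21      9     24     16     14     14     21      1      2
Node 1, zone   Normal    105    193    122     42     21     12      8      9     30      2      1
Node 2, zone   Normal     57     23     20      7     10     12     10      7      5     14      2
Node 3, zone   Normal     10     15     41     19     11     10      4      4     33      0      1
"""


def numa_node_ids(buddyinfo_text):
    """Return the sorted distinct node IDs found in buddyinfo-style text."""
    ids = set()
    for line in buddyinfo_text.splitlines():
        match = re.match(r"Node\s+(\d+),\s+zone\s+\S+", line)
        if match:
            ids.add(int(match.group(1)))
    return sorted(ids)


print(numa_node_ids(SAMPLE))  # prints [0, 1, 2, 3]
```

On a live host the same function could be fed the contents of /proc/buddyinfo. For this host it reports 4 nodes, while the guest in scenario 2 was given 12.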
Comment 52 Mike Cao 2012-07-26 23:28:08 EDT
(In reply to comment #51)

I don't think we used the right environment to verify this bug.
Please find a host w/ 48 cores and 512 GB memory and re-test it.

Mike
Comment 53 FuXiangChun 2012-07-31 02:31:56 EDT
I will re-test it when I reserve a big machine.
Comment 54 FuXiangChun 2012-08-01 01:59:04 EDT
Gleb,
   The Brew build is closed. I want to use it to re-test this bug on another big machine. Could you provide it for me again?
  https://brewweb.devel.redhat.com/taskinfo?taskID=4639681
Comment 55 Gleb Natapov 2012-08-01 02:27:02 EDT
(In reply to comment #54)

RHEL6 kernels starting from 2.6.32-290.el6 already include the patch.
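
For reference, the following is a minimal Python sketch of checking a kernel release string against that threshold. The helper name has_patch is hypothetical, the check assumes the "2.6.32-<build>" naming of RHEL 6 kernels, and it does not account for z-stream kernels that might carry a backport under a lower build number. The 279 release string is only an illustrative input.

```python
import re

FIRST_FIXED_BUILD = 290  # from this comment: 2.6.32-290.el6


def has_patch(release):
    """Return True if a RHEL 6 kernel release is at or past build 290.

    Anything that does not look like "2.6.32-<build>..." returns False.
    """
    match = re.match(r"2\.6\.32-(\d+)", release)
    if match is None:
        return False
    return int(match.group(1)) >= FIRST_FIXED_BUILD


for release in ("2.6.32-279.el6.x86_64", "2.6.32-293.el6.x86_64"):
    print(release, has_patch(release))
```

On a running host the release string could come from platform.release() or `uname -r`.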
Comment 56 FuXiangChun 2012-08-03 05:13:37 EDT
Verified this bug with 2.6.32-293.el6.x86_64:

1. If the guest is booted with NUMA, it works well (CPU usage is about 1%~2%).
2. If the guest is booted without NUMA, it still shows a BSOD.

cli:
/usr/libexec/qemu-kvm -m 256G -smp 48,cores=48,sockets=1,threads=1 -name win2k8r2 -rtc base=localtime,clock=host,driftfix=slew -drive file=/root/win2k8r2.qcow2,if=none,id=virtio0,format=qcow2,cache=none -device ide-drive,drive=virtio0,id=virtio0-device -monitor stdio -vnc :1 -k en-us -numa node,nodeid=0 -numa node,nodeid=1 -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 -numa node,nodeid=8 -numa node,nodeid=9 -numa node,nodeid=10 -numa node,nodeid=11

host:
Mem:512G
cpu:48 cores, amd-6172

# cat /proc/buddyinfo
Node 0, zone      DMA      1      2      2      2      2      1      2      1      1      0      3 
Node 0, zone    DMA32     15      4      7      6      6     16      7      8      6      4    599 
Node 0, zone   Normal    541    590    324    121     49     25      6      5      3     12  15281 
Node 1, zone   Normal   1054    768    427    208    121     64     53     37     20     34  15946 
Node 2, zone   Normal   1074    539    369    179     76     35     21     14     11     27  15176 
Node 3, zone   Normal   1257    604    297    152     71     38     26     24     15     31  15937 
Node 4, zone   Normal   1611   1054    563    321    171    119     55     45     35     56  15895 
Node 5, zone   Normal   1538    949    509    287    188     79     45     29     14     27  15946 
Node 6, zone   Normal    880   1016    494    255    145     90     77     54     43     55  15893 
Node 7, zone   Normal   1598    782    334    162     89     33     24     22     19     33  15942
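
For reference, each numeric column in /proc/buddyinfo is the count of free blocks of order 0 to 10, so the free pages in a zone are the sum of count * 2^order. The following is a minimal Python sketch that totals free memory per node, using two lines from the listing above as sample input. The helper name free_pages_by_node is hypothetical, and the 4 KiB page size is an assumption about this x86_64 host.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages on this x86_64 host

# Two sample lines copied from the /proc/buddyinfo listing in this comment.
SAMPLE = """\
Node 0, zone   Normal    541    590    324    121     49     25      6      5      3     12  15281
Node 1, zone   Normal   1054    768    427    208    121     64     53     37     20     34  15946
"""


def free_pages_by_node(buddyinfo_text):
    """Sum free pages per node; column i counts free blocks of 2**i pages."""
    totals = {}
    for line in buddyinfo_text.splitlines():
        parts = line.replace(",", " ").split()
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1])
        counts = [int(value) for value in parts[4:]]
        pages = sum(count << order for order, count in enumerate(counts))
        totals[node] = totals.get(node, 0) + pages
    return totals


for node, pages in sorted(free_pages_by_node(SAMPLE).items()):
    print("node %d: %.1f GiB free" % (node, pages * PAGE_SIZE / 2.0 ** 30))
```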
Comment 57 Gleb Natapov 2012-08-03 06:26:58 EDT
(In reply to comment #56)
> Verified this bug with 2.6.32-293.el6.x86_64:
> 
> 1. If the guest is booted with NUMA, it works well (CPU usage is about 1%~2%).
So you are no longer seeing BSODs in this config during boot with vnc? This is good.
Comment 58 FuXiangChun 2012-08-06 01:29:54 EDT
(In reply to comment #57)
> (In reply to comment #56)
> > Verified this bug with 2.6.32-293.el6.x86_64:
> > 
> > 1. If the guest is booted with NUMA, it works well (CPU usage is about 1%~2%).
> So you are no longer seeing BSODs in this config during boot with vnc? This
> is good.

Yes, I always boot the guest with vnc.

If the guest is booted with spicec qxl, it doesn't show a BSOD but CPU usage is 100%.

Summary:

   1. with vnc 
     result: BSOD

   2. with vnc and numa
     result: the guest works well (and CPU usage is normal)
   
   3. with spice qxl
     result: the guest doesn't show a BSOD but CPU usage is 100% (guest response is very slow)
Comment 61 Mike Cao 2012-09-04 03:42:16 EDT
(In reply to comment #58)
> Summary:
> 
>    1. with vnc 
>      result: BSOD
> 
>    2. with vnc and numa
>      result: the guest works well (and CPU usage is normal)

If we do not hit BSODs w/ VNC + numa, then it is not a test blocker for SVVP.
Comment 62 Mike Cao 2012-11-02 06:50:22 EDT
I was asked to run the SVVP test on the Windows Server 2012 platform on a RHEL6.3.z host, so it is a test blocker for me.
Comment 63 Mike Cao 2012-11-02 06:53:03 EDT
(In reply to comment #62)

Sorry, I updated the wrong bug.
Please ignore this comment.
Comment 65 Ademar Reis 2012-11-02 14:27:12 EDT
(In reply to comment #63)

I'm assuming adding TestBlocker flag was a mistake as well, and therefore I'm removing it. Please re-add it if it's indeed a test blocker to you.
Comment 66 Karen Noel 2012-11-14 07:46:32 EST
*** Bug 873613 has been marked as a duplicate of this bug. ***
