820112 – CPU usage is 100% after booting a win2k8R2 guest with -smp 48 and -m 256GB; the guest sometimes BSODs during bootup on an AMD host


Keywords:
Status:	CLOSED CANTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	qemu-kvm
Sub Component:
Version:	6.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Gleb Natapov
QA Contact:	Virtualization Bugs
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	823839
Depends On:
Blocks:	851382

Reported:	2012-05-09 08:04 UTC by Mike Cao
Modified:	2013-01-10 00:56 UTC
CC List:	23 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2012-11-14 12:45:05 UTC
Target Upstream Version:
Embargoed:

Attachments
ftrace (14.95 MB, text/plain) 2012-05-09 08:20 UTC, Mike Cao

Description Mike Cao 2012-05-09 08:04:41 UTC

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Start the guest with -smp 48 and -m 256G.
CLI:/usr/libexec/qemu-kvm -m 256G -smp 48,cores=48,sockets=1,threads=1 -cpu cpu64-rhel6,+x2apic,family=0xf -drive file=amd-max-sut.raw,format=raw,if=none,id=drive-virtio0,cache=none,werror=stop,rerror=stop -device virtio-blk-pci,drive=drive-virtio0,id=virtio-blk-pci0,bootindex=1 -netdev tap,sndbuf=0,id=hostnet0,script=/etc/qemu-ifup0,downscript=no,vhost=on -device virtio-net-pci,netdev=hostnet0,mac=00:10:1a:61:72:01,bus=pci.0,addr=0x4,id=virtio-net-pci0 -uuid 7678e130-f1a7-4157-875a-8defcdb27af7 -rtc base=localtime,clock=host,driftfix=slew -no-kvm-pit-reinjection -chardev socket,id=111a,path=/tmp/amd-max-sut,server,nowait -mon chardev=111a,mode=readline -name amd-max-sut -vnc :1 
  
Actual results:
1. The guest sometimes BSODs (a dump file will be attached later).
2. After guest login, usage of every CPU is 100% and the guest is very slow to respond.

Expected results:


Additional info:
This is a test blocker for the SVVP test.

Comment 1 Mike Cao 2012-05-09 08:20:45 UTC

Created attachment 583182 [details]
ftrace

Comment 6 Mike Cao 2012-05-09 11:16:36 UTC

This only occurs on an AMD host; the bug does not occur on an Intel host.

Comment 21 Mike Cao 2012-05-11 10:34:36 UTC

Hi, Vadim

I tried the upstream qemu-kvm.
After guest bootup, the "System" process consumes all the CPU resources for more than 15 minutes. The guest is very slow to respond.

I did not reproduce the graphics driver's crash during boot (comment #5) w/ upstream qemu-kvm and upstream kernel.

Best Regards,
Mike

Comment 22 Vadim Rozenfeld 2012-05-11 11:42:35 UTC

(In reply to comment #21)
> Hi, Vadim
> 
> I tried the upstream qemu-kvm.
> After guest bootup, the "System" process consumes all the CPU resources for
> more than 15 minutes. The guest is very slow to respond.
> 
> I did not reproduce the graphics driver's crash during boot (comment #5) w/
> upstream qemu-kvm and upstream kernel.
> 
> Best Regards,
> Mike

OK, now please try running VM with the following cpu options 
",+x2apic,hv_spinlocks=1000,hv_relaxed,hv_vapic"

Cheers,
Vadim.

Comment 23 Mike Cao 2012-05-14 03:22:11 UTC

(In reply to comment #22)
> OK, now please try running VM with the following cpu options 
> ",+x2apic,hv_spinlocks=1000,hv_relaxed,hv_vapic"
> 
> Cheers,
> Vadim.

Tried upstream qemu-kvm w/ -cpu host,,+x2apic,hv_spinlocks=1000,hv_relaxed,hv_vapic on the AMD host; still hit the same issue.

*note* I did not hit this issue on an Intel host.

Comment 28 Mike Cao 2012-05-16 09:11:59 UTC

Re-tested this bug w/ the following scenarios:

1. Start guest w/ -smp 48 -m 256G -rtc base=localtime,clock=host,driftfix=none (-numa node) *12

Actual results:
Guest always BSODs during boot, referring to comment #5.
After guest bootup, CPU usage is 2%.

2. Start guest w/ -smp 48 -m 256G -rtc base=localtime,clock=host,driftfix=slew (-numa node) *12

Actual results:
Guest always BSODs during boot, referring to comment #5.
After guest bootup, CPU usage is 2%.
Guest will BSOD; dumps are similar to Bug 801196.

3. Start guest w/ -smp 48 -m 256G -rtc base=localtime,clock=host,driftfix=none w/o numa

Actual results:
Guest always BSODs during boot, referring to comment #5.
After guest bootup, CPU usage is 100%.

4. Start guest w/ -smp 32 -m 256G -rtc base=localtime,clock=host,driftfix=none (-numa node) *12

Actual results:
After guest bootup, CPU usage is 2%.

5. Start guest w/ -smp 32 -m 256G -rtc base=localtime,clock=host,driftfix=slew (-numa node) *12

Actual results:
After guest bootup, CPU usage is 2%.
Guest will BSOD; dumps are similar to Bug 801196.

Mike

Comment 29 Dor Laor 2012-05-16 09:37:10 UTC

(In reply to comment #28)
> Re-tested this bug w/ the following scenarios:
> 
> 1.Start Guest w/ -smp 48 -m 256G -rtc base=localtime,clock=host,driftfix=none
> (-numa node) *12 

What's "*12"?
Gleb, care to provide recommended params for it?
 
> Actual Results: 
> Guest always BSOD during boot ,referring to comment #5 .

What about using the qxl driver instead?

> After Guest bootup ,the usage of cpu is 2%

Does this mean the guest managed to survive the BSOD?

Comment 30 Mike Cao 2012-05-16 10:38:49 UTC

(In reply to comment #29)
> (In reply to comment #28)
> > Re-tested this bug w/ the following scenarios:
> > 
> > 1.Start Guest w/ -smp 48 -m 256G -rtc base=localtime,clock=host,driftfix=none
> > (-numa node) *12 
> 
> What's "*12" ?

</usr/libexec/qemu-kvm XXXX > -numa node -numa node -numa node -numa node -numa node -numa node -numa node -numa node -numa node -numa node -numa node -numa node 

> Gleb, care to provide recommended params for it?
> 
> > Actual Results: 
> > Guest always BSOD during boot ,referring to comment #5 .
> 
> What about using the qxl driver instead?

xfu tried this and did not hit the BSOD related to the graphics driver.

> 
> > After Guest bootup ,the usage of cpu is 2%
> 
> Does this mean the guest managed to survive the BSOD?

There are 2 kinds of BSOD I hit in this bug.
One is related to the graphics driver (referring to comment #5). This kind of BSOD occurs 6 out of 8 times during guest bootup when I use -smp 48 and -vnc.

The other is related to CLOCK_WATCHDOG_TIMEOUT (referring to Bug 801196). This kind of BSOD occurs almost 100% of the time after I log in to the guest when using -rtc driftfix=slew and -numa node.
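
The "(-numa node) *12" shorthand expands to twelve repeated -numa node options, which is easy to mistype by hand. A minimal sketch in Python of generating that fragment, assuming only the options already quoted in this bug. The helper name numa_node_args is hypothetical, the optional nodeid= form is the explicit variant of the same option, and the command is only printed, never executed:

```python
def numa_node_args(count, with_ids=False):
    """Build `count` repetitions of `-numa node` as an argv fragment."""
    args = []
    for node_id in range(count):
        # With with_ids=True each node gets an explicit nodeid= suboption.
        value = f"node,nodeid={node_id}" if with_ids else "node"
        args += ["-numa", value]
    return args


# Base options taken from the command line in the description; the
# remaining device options are omitted here for brevity.
base = ["/usr/libexec/qemu-kvm", "-m", "256G",
        "-smp", "48,cores=48,sockets=1,threads=1"]

# Print the command instead of running it, so the sketch is safe anywhere.
print(" ".join(base + numa_node_args(12)))
```

Building the list programmatically keeps the node count in one place, so switching between the 12-node and no-NUMA scenarios is a single argument change.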

Comment 37 Ademar Reis 2012-05-21 22:07:40 UTC

Too late for RHEL6.3, postponing to 6.4 (ask for the z-stream if necessary).

Comment 43 Gleb Natapov 2012-06-09 17:13:37 UTC

*** Bug 823839 has been marked as a duplicate of this bug. ***

Comment 47 Ronen Hod 2012-07-17 09:20:13 UTC

Might be the same issue as Bug 821377

Comment 48 Ronen Hod 2012-07-25 16:52:11 UTC

We suspect that it might be related to bug 842211 (in POST).
Gleb tested that bug fix using the brew build https://brewweb.devel.redhat.com/taskinfo?taskID=4639681
so until we have an official 6.4 kernel build, can you please try it?

Thanks, Ronen.

Comment 49 FuXiangChun 2012-07-26 10:17:55 UTC

(In reply to comment #48)
> We suspect that it might be related to bug 842211 (in POST)
> Gleb tested that bug fix using the brew build
> https://brewweb.devel.redhat.com/taskinfo?taskID=4639681
> so until we have official 6.4 kernel build, can you please try it.
> 
> Thanks, Ronen.

Re-tested this bug with the host running the kernel below. It still has the same issue.
https://brewweb.devel.redhat.com/taskinfo?taskID=4639681

Comment 50 Gleb Natapov 2012-07-26 10:24:13 UTC

(In reply to comment #49)
> host use below kernel to re-test this bug. still has the same issue.  
> https://brewweb.devel.redhat.com/taskinfo?taskID=4639681

Which issue exactly? There are a lot of issues discussed throughout this BZ. What I am interested in checking is running qemu with 12 numa nodes specified to see if we get a BSOD like in comment #5.

Comment 51 FuXiangChun 2012-07-27 03:22:55 UTC

(In reply to comment #50)
> Which issue exactly? There is a lot of issues discussed throughout this BZ.
> What I am interested in checking is to run qemu with 12 numa nodes specified
> and see if we get a BSOD like in comment#5.

Tested two scenarios with this kernel:
  https://brewweb.devel.redhat.com/taskinfo?taskID=4639681

1. boot guest without numa
   /usr/libexec/qemu-kvm -m 256G -smp 48,cores=48,sockets=1,threads=1
 
 testing result:
 The guest does not show a BSOD, but CPU utilization is still 100%.

2. boot guest with numa
   /usr/libexec/qemu-kvm -m 256G -smp 48,cores=48,sockets=1,threads=1 ..... -numa node,nodeid=0 -numa node,nodeid=1 -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 -numa node,nodeid=8 -numa node,nodeid=9 -numa node,nodeid=10 -numa node,nodeid=11
 
 testing result:
 The guest works well.


host info:
 cpu:24cores,AMD Opteron(tm) Processor 6168
 memory size:264516632 kB
 Numa number:
# cat /proc/buddyinfo

Node 0, zone      DMA      0      1      3      2      0      0      1      1      0      1      3 
Node 0, zone    DMA32     89     50     47     16      8      7     74     77     36      2     36 
Node 0, zone   Normal    186     43     21      9     24     16     14     14     21      1      2 
Node 1, zone   Normal    105    193    122     42     21     12      8      9     30      2      1 
Node 2, zone   Normal     57     23     20      7     10     12     10      7      5     14      2 
Node 3, zone   Normal     10     15     41     19     11     10      4      4     33      0      1
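
The /proc/buddyinfo output above is used here only to show how many NUMA nodes the host has. A minimal sketch in Python of extracting that count from the same text, assuming the "Node N, zone ..." line format shown above. The function name numa_node_ids is hypothetical, and the sample is copied from this comment rather than read from a live host:

```python
import re


def numa_node_ids(buddyinfo_text):
    """Return the sorted NUMA node ids named in /proc/buddyinfo-style text."""
    found = re.findall(r"^Node\s+(\d+),\s+zone", buddyinfo_text,
                       flags=re.MULTILINE)
    # A node can appear once per zone (DMA, DMA32, Normal), so deduplicate.
    return sorted({int(n) for n in found})


# Sample copied from the host info above (Node 0 has three zones).
SAMPLE = """\
Node 0, zone      DMA      0      1      3      2      0      0      1      1      0      1      3
Node 0, zone    DMA32     89     50     47     16      8      7     74     77     36      2     36
Node 0, zone   Normal    186     43     21      9     24     16     14     14     21      1      2
Node 1, zone   Normal    105    193    122     42     21     12      8      9     30      2      1
Node 2, zone   Normal     57     23     20      7     10     12     10      7      5     14      2
Node 3, zone   Normal     10     15     41     19     11     10      4      4     33      0      1
"""

nodes = numa_node_ids(SAMPLE)
print(f"host NUMA nodes: {len(nodes)} {nodes}")
```

On a real host the same function could be fed the contents of /proc/buddyinfo. For this sample it reports four host nodes, while the guest in scenario 2 was started with twelve -numa node options.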

Comment 52 Mike Cao 2012-07-27 03:28:08 UTC

(In reply to comment #51)

I don't think we used the right environment to verify this bug.
Please find a host w/ 48 cores and 512 GB memory and re-test it.

Mike

Comment 53 FuXiangChun 2012-07-31 06:31:56 UTC

I will re-test it when I reserve a big machine.

Comment 54 FuXiangChun 2012-08-01 05:59:04 UTC

Gleb,
   The brew build is closed. I want to use it to re-test this bug on another big machine. Could you provide it for me again?
  https://brewweb.devel.redhat.com/taskinfo?taskID=4639681

Comment 55 Gleb Natapov 2012-08-01 06:27:02 UTC

(In reply to comment #54)
> Gleb,
>    The brew build is closed. I want to use it to re-test this bug on another
> big machine. Could you provide it for me again?
>   https://brewweb.devel.redhat.com/taskinfo?taskID=4639681

RHEL 6 kernels starting from 2.6.32-290.el6 already include the patch.
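
Before re-testing, it can help to confirm that the host kernel is at or past the build named above. A minimal sketch in Python, assuming the RHEL 6 release string format seen in this bug (for example 2.6.32-293.el6.x86_64). The function name kernel_has_patch is hypothetical, and the check compares only the main build number, so it says nothing about z-stream kernels that may carry a backport:

```python
import re

# First build said to include the patch, per the comment above.
FIRST_FIXED_BUILD = 290


def kernel_has_patch(release):
    """Return True if `release` is a 2.6.32-N kernel with N >= 290."""
    match = re.match(r"^2\.6\.32-(\d+)(?:\.|$)", release)
    if match is None:
        # Not a RHEL 6 style 2.6.32 release string; make no claim about it.
        return False
    return int(match.group(1)) >= FIRST_FIXED_BUILD


for release in ("2.6.32-293.el6.x86_64", "2.6.32-279.el6.x86_64"):
    print(release, kernel_has_patch(release))
```

On a live host the release string could come from `uname -r` or from Python's platform.release().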

Comment 56 FuXiangChun 2012-08-03 09:13:37 UTC

Tested this bug with 2.6.32-293.el6.x86_64.

1. If the guest is booted with NUMA, it works well (CPU usage is about 1%~2%).
2. If the guest is booted without NUMA, it still shows a BSOD.

cli:
/usr/libexec/qemu-kvm -m 256G -smp 48,cores=48,sockets=1,threads=1 -name win2k8r2 -rtc base=localtime,clock=host,driftfix=slew -drive file=/root/win2k8r2.qcow2,if=none,id=virtio0,format=qcow2,cache=none -device ide-drive,drive=virtio0,id=virtio0-device -monitor stdio -vnc :1 -k en-us -numa node,nodeid=0 -numa node,nodeid=1 -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 -numa node,nodeid=8 -numa node,nodeid=9 -numa node,nodeid=10 -numa node,nodeid=11

host:
Mem:512G
cpu:48 cores, amd-6172

# cat /proc/buddyinfo
Node 0, zone      DMA      1      2      2      2      2      1      2      1      1      0      3 
Node 0, zone    DMA32     15      4      7      6      6     16      7      8      6      4    599 
Node 0, zone   Normal    541    590    324    121     49     25      6      5      3     12  15281 
Node 1, zone   Normal   1054    768    427    208    121     64     53     37     20     34  15946 
Node 2, zone   Normal   1074    539    369    179     76     35     21     14     11     27  15176 
Node 3, zone   Normal   1257    604    297    152     71     38     26     24     15     31  15937 
Node 4, zone   Normal   1611   1054    563    321    171    119     55     45     35     56  15895 
Node 5, zone   Normal   1538    949    509    287    188     79     45     29     14     27  15946 
Node 6, zone   Normal    880   1016    494    255    145     90     77     54     43     55  15893 
Node 7, zone   Normal   1598    782    334    162     89     33     24     22     19     33  15942

Comment 57 Gleb Natapov 2012-08-03 10:26:58 UTC

(In reply to comment #56)
> verify this bug with 2.6.32-293.el6.x86_64
> 
> 1.if boot guest with Numa, guest works well(cpu usage is about 1%~2%)
So you are no longer seeing BSODs in this config during boot with vnc? This is good.

Comment 58 FuXiangChun 2012-08-06 05:29:54 UTC

(In reply to comment #57)
> (In reply to comment #56)
> > verify this bug with 2.6.32-293.el6.x86_64
> > 
> > 1.if boot guest with Numa, guest works well(cpu usage is about 1%~2%)
> So you are no longer seeing BSODs in this config during boot with vnc? This
> is good.

Yes, I always boot the guest with vnc.

If I boot the guest with spice qxl, the guest does not show a BSOD but CPU usage is 100%.

Summary:

   1. with vnc
     result: BSOD

   2. with vnc and numa
     result: guest works well (and CPU usage is normal)

   3. with spice qxl
     result: guest does not show a BSOD but CPU usage is 100% (guest response is very slow)

Comment 61 Mike Cao 2012-09-04 07:42:16 UTC

(In reply to comment #58)
> Yes, I always boot the guest with vnc.
> 
> If I boot the guest with spice qxl, the guest does not show a BSOD but CPU
> usage is 100%.
> 
> Summary:
> 
>    1. with vnc
>      result: BSOD
> 
>    2. with vnc and numa
>      result: guest works well (and CPU usage is normal)

If we do not hit BSODs w/ VNC + NUMA, then it is not a test blocker for SVVP.

Comment 62 Mike Cao 2012-11-02 10:50:22 UTC

I was asked to run the SVVP test on the Windows Server 2012 platform on a RHEL6.3.z host, so it is a test blocker for me.

Comment 63 Mike Cao 2012-11-02 10:53:03 UTC

(In reply to comment #62)
> Since I was asked to run svvp test over windows server 2012 platform on
> RHEL6.3.z host .
> 
> So It is a testblocker to me

Sorry, I updated the wrong bug.
Please ignore this comment.

Comment 65 Ademar Reis 2012-11-02 18:27:12 UTC

(In reply to comment #63)
> Sorry for updating to a wrong bug 
> ignore this comment pls.

I'm assuming that adding the TestBlocker flag was a mistake as well, so I'm removing it. Please re-add it if it's indeed a test blocker for you.

Comment 66 Karen Noel 2012-11-14 12:46:32 UTC

*** Bug 873613 has been marked as a duplicate of this bug. ***
