This service will be undergoing maintenance at 00:00 UTC, 2016-08-01. It is expected to last about 1 hours
Bug 664902 - Windows guests hit BSOD
Windows guests hit BSOD
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kvm (Show other bugs)
5.6
Unspecified Unspecified
medium Severity medium
: rc
: ---
Assigned To: Paolo Bonzini
Virtualization Bugs
:
Depends On:
Blocks: Rhel5KvmTier3
  Show dependency treegraph
 
Reported: 2010-12-21 22:31 EST by juzhang
Modified: 2013-01-09 18:26 EST (History)
9 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-12-29 08:15:14 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)
BOSD's guest screenshot and win7 debug info (68.00 KB, application/x-gzip)
2010-12-21 22:31 EST, juzhang
no flags Details

  None (edit)
Description juzhang 2010-12-21 22:31:17 EST
Created attachment 470119 [details]
BOSD's guest screenshot and win7 debug info

Description of problem:
Concurrent run 14 guests on one host and run app in each guest,at last.win2008 32/64 and win7-64 hit BSOD.
Please note,I am not very clear how long run before hit BSOD.



Version-Release number of selected component (if applicable):
#rpm -qa | grep libvirt
libvirt-0.8.2-15.el5
libvirt-python-0.8.2-15.el5
libvirt-0.8.2-15.el5
libvirt-cim-0.5.8-3.el5

kvm version:

kvm-83-223.el5

Host Kernel
#uname -r
2.6.18-237.el5


How reproducible:


Steps to Reproduce:
1.Concurrent run 14 guests on one host.
#virsh list
 Id Name                 State
----------------------------------
  9 rhel4u8-i386         running
 10 rhel4u8-x86_64       running
 12 rhel5u5-i386         running
 13 rhel5u5-x86_64       running
 14 rhel6-i386           running
 16 rhel6-x86_64         running
 17 winxp                running
 18 win2003-32           running
 19 win2003-63           running
 22 win2008r2-64         running
 25 win7-32              running
 26 win7-64              running
 27 win2008-32           running
 28 win2008-64           running

 Some of guest kvm CML
#/usr/libexec/qemu-kvm -S -M rhel5.4.0 -m 2008 -smp 2,sockets=2,cores=1,threads=1 -name win2008-32 -uuid 0983859a-cc00-0b14-c7ab-3518855cd487 -monitor unix:/var/lib/libvirt/qemu/win2008-32.monitor,server,nowait -no-kvm-pit-reinjection -boot c -drive file=/mnt/win2008-32-virtio.qcow2,if=ide,bus=0,unit=0,boot=on,format=qcow2,cache=none -net nic,macaddr=52:54:00:53:d6:3d,vlan=0 -net tap,fd=16,vlan=0 -serial pty -parallel none -usb -vnc 127.0.0.1:3 -k en-us -vga cirrus -balloon virtio

#/usr/libexec/qemu-kvm -S -M rhel5.4.0 -m 2008 -smp 2,sockets=2,cores=1,threads=1 -name win2008-64 -uuid 9b5a0a7b-5f6e-68f1-762d-5ab5ae7f4ac0 -monitor unix:/var/lib/libvirt/qemu/win2008-64.monitor,server,nowait -no-kvm-pit-reinjection -boot c -drive file=/media/img/win2008-64-virtio.qcow2,if=ide,bus=0,unit=0,boot=on,format=qcow2,cache=none -net nic,macaddr=52:54:00:0d:4c:f1,vlan=0 -net tap,fd=16,vlan=0 -serial pty -parallel none -usb -vnc 127.0.0.1:4 -k en-us -vga cirrus -balloon virtio

2.In every windows guest,we use clock.redhat.com ntp server rather then mircosoft ntp server.and install cygwin app in order to run scripts.
#!/bin/sh
ntpdate -b clock.redhat.com;
for ((;;));
do
echo ------------------------`date`-------------------------->>timekeep.txt
ntpdate -qb clock.redhat.com >>timekeep.txt
sleep 7200
done

3.In windows in2008 32/64,we also run oracle DB.
  
Actual results:
win2008 32/64 and win7-64 hit BSOD. 

Expected results:
All guest run well

Additional info:
1.The attachment is BOSD's guest screenshot and win7 debug info.
---snip of win7 debug info---
CLOCK_WATCHDOG_TIMEOUT (101)

An expected clock interrupt was not received on a secondary processor in an

MP system within the allocated interval. This indicates that the specified

processor is hung and not processing interrupts.

Arguments:

Arg1: 00000060, Clock interrupt time out interval in nominal clock ticks.

Arg2: 00000000, 0.

Arg3: 803d1120, The PRCB address of the hung processor.

Arg4: 00000001, 0.



Debugging Details:

------------------





BUGCHECK_STR:  CLOCK_WATCHDOG_TIMEOUT_2_PROC



DEFAULT_BUCKET_ID:  INTEL_CPU_MICROCODE_ZERO



PROCESS_NAME:  System



CURRENT_IRQL:  1c



STACK_TEXT:  

8039db80 81aea8c5 00000101 00000060 00000000 nt!KeBugCheckEx+0x1e

8039dbb4 81aec6bd 83c3a31b 000000d1 8039dc50 nt!KeUpdateRunTime+0xd5

8039dbb4 81a4576f 83c3a31b 000000d1 8039dc50 nt!KeUpdateSystemTime+0xed

8039dc50 87421622 87421466 8039dc68 00040000 nt!KeIpiGenericCall+0x97

WARNING: Stack unwind information not available. Following frames may be wrong.

8039dd38 81ae720f 83a41138 00000000 83a41002 sptd+0x14622

8039dd4c 81a73602 83cff368 00000000 000001d8 nt!KiExitDispatcher+0x1db

8039dd84 81a839c6 83df4030 83cff368 83cff368 nt!IoReuseIrp+0x27

00000000 00000000 00000000 00000000 00000000 nt!IofCallDriver+0x63





STACK_COMMAND:  kb



SYMBOL_NAME:  ANALYSIS_INCONCLUSIVE



FOLLOWUP_NAME:  MachineOwner



MODULE_NAME: Unknown_Module



IMAGE_NAME:  Unknown_Image



DEBUG_FLR_IMAGE_TIMESTAMP:  0



FAILURE_BUCKET_ID:  CLOCK_WATCHDOG_TIMEOUT_2_PROC_ANALYSIS_INCONCLUSIVE



BUCKET_ID:  CLOCK_WATCHDOG_TIMEOUT_2_PROC_ANALYSIS_INCONCLUSIVE



Followup: MachineOwner

---------

2.#cat /proc/cpuinfo
-----------
processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Xeon(R) CPU           X3323  @ 2.50GHz

3.#free
             total       used       free     shared    buffers     cached
Mem:      24665660   24458500     207160          0     248980    4855064
-/+ buffers/cache:   19354456    5311204
Swap:      1052248    1020252      31996

4.#top
24756 root      15   0 2220m 2.0g  48m S 191.6  8.4   1913:47 qemu-kvm                                                                                       
10217 root      15   0  433m  37m  11m S  3.7  0.2   1:56.28 /usr/share/virt                                                                                 
11152 root      15   0 2224m 2.0g 3476 S  2.0  8.4   1993:27 qemu-kvm                                                                                        
 9880 root      15   0 2224m 2.0g 3492 S  1.3  8.4   3133:07 qemu-kvm                                                                                        
 9912 root      15   0 2224m 2.0g 3476 S  1.3  8.4   1089:00 qemu-kvm                                                                                        
 9849 root      15   0 1242m 1.0g 3500 S  1.0  4.4   2201:26 qemu-kvm                                                                                        
25289 root      15   0 2224m 1.9g 2512 S  0.7  8.3  31:01.53 qemu-kvm                                                                                        
 3890 root      15   0  232m  15m 2724 S  0.3  0.1   4:23.98 libvirtd                                                                                        
 8128 root      15   0 2206m 848m 2192 S  0.3  3.5  29:11.85 qemu-kvm                                                                                        
 8441 root      16   0 2206m 1.0g 2188 S  0.3  4.2  24:42.74 qemu-kvm                                                                                        
 9369 root      15   0 2207m 880m 2180 S  0.3  3.7  11:29.21 qemu-kvm                                                                                        
 9409 root      15   0  106m  19m 2692 S  0.3  0.1   0:30.30 sshd                                                                                            
10065 root      15   0 2224m 1.9g 3480 S  0.3  8.3 448:59.55 qemu-kvm
Comment 1 Paolo Bonzini 2010-12-22 04:44:46 EST
How many cores does the host have?  This BSOD is quite common in case you're overcommitting resources on the host.

Unfortunately, the "!analyze -v" output from the debugger will always be of the boot processor (the one checking the watchdog) rather than the hung processor.  It's better if you include backtraces from all the processors.  You can change processor in WinDBG with the "~NN" command (where NN is from 0 to #processors-1) and then type "kb" to print the backtrace.
Comment 2 juzhang 2010-12-22 21:26:34 EST
(In reply to comment #1)
> How many cores does the host have?  This BSOD is quite common in case you're
> overcommitting resources on the host.
Just four processors.
#cat /proc/cpuinfo | grep processor
processor	: 0
processor	: 1
processor	: 2
processor	: 3
> 
> Unfortunately, the "!analyze -v" output from the debugger will always be of the
> boot processor (the one checking the watchdog) rather than the hung processor. 
> It's better if you include backtraces from all the processors.  You can change
> processor in WinDBG with the "~NN" command (where NN is from 0 to
> #processors-1) and then type "kb" to print the backtrace.
I assign two cores to this guest.
0: kd> ~NN
       ^ Syntax error in '~NN'
0: kd> ~0
0: kd> kb
ChildEBP RetAddr  Args to Child
WARNING: Frame IP not in any known module. Following frames may be wrong.
88553a60 81aba8c5 00000101 00000060 00000000 0x81adcb8d
88553a94 81abc6bd 88553a1b 000000d1 88553b24 0x81aba8c5
88553aa4 81a53bfe badb0d00 00000000 00000000 0x81abc6bd
88553b24 81a6a94b 88553b84 00000000 a4310000 0x81a53bfe
88553c08 81a46448 a4311000 00001000 88553d10 0x81a6a94b
88553c98 81a4751b 84100a5c 00000000 00000001 0x81a46448
88553cec 81a44695 84899880 88553d10 00000000 0x81a4751b
88553d44 81ab4e22 83849ff0 00000000 8383f020 0x81a44695
88553d7c 81be4c42 83849ff0 e6e092ad 00000000 0x81ab4e22
88553dc0 81a4df4e 81ab4d25 00000000 00000000 0x81be4c42
00000000 00000000 00000000 00000000 00000000 0x81a4df4e
0: kd> ~1
Can't switch processors on a kernel triage dump
Comment 3 Paolo Bonzini 2010-12-23 04:34:04 EST
If you have so many guests running at the same time, I strongly suggest you make Windows guests uniprocessor.

Ok to close as NOTABUG?
Comment 4 juzhang 2010-12-23 04:53:11 EST
(In reply to comment #3)
> If you have so many guests running at the same time, I strongly suggest you
> make Windows guests uniprocessor.
> 
> Ok to close as NOTABUG?
It's ok for me.if you make this decision according to infos what I have provided.
Comment 5 Dor Laor 2010-12-23 07:51:24 EST
I would not recommend just closing it.
First, try run windows as it should be - using the pmtimer - docs:
http://cleo.tlv.redhat.com/qumrawiki/KVM/TimeKeeping

Also, please us -M rhel5.5 and beyond
Comment 6 juzhang 2010-12-24 01:34:20 EST
(In reply to comment #5)
> I would not recommend just closing it.
> First, try run windows as it should be - using the pmtimer - docs:
> http://cleo.tlv.redhat.com/qumrawiki/KVM/TimeKeeping
> 
> Also, please us -M rhel5.5 and beyond

Hi,Dor

   Just confirm,Run again with -M rhel5.5 and beyond using the comment0's stress?
In comment0 situation,cpu overcommit rate is very high.almost 10X overcommit.
Comment 7 Dor Laor 2010-12-26 08:13:39 EST
(In reply to comment #6)
> (In reply to comment #5)
> > I would not recommend just closing it.
> > First, try run windows as it should be - using the pmtimer - docs:
> > http://cleo.tlv.redhat.com/qumrawiki/KVM/TimeKeeping
> > 
> > Also, please us -M rhel5.5 and beyond
> 
> Hi,Dor
> 
>    Just confirm,Run again with -M rhel5.5 and beyond using the comment0's
> stress?
> In comment0 situation,cpu overcommit rate is very high.almost 10X overcommit.

The main important issue is to use the correct flags like in the above wiki - -rtc-td-hack is the main important one
Comment 8 juzhang 2010-12-27 06:04:18 EST
> The main important issue is to use the correct flags like in the above wiki -
> -rtc-td-hack is the main important one

Concurrent run 5 windows guests on one host and run app in each guest,will check results and update results tmr.

1.In every windows guest,we use clock.redhat.com ntp server rather then
mircosoft ntp server.and install cygwin app in order to run scripts.
#!/bin/sh
ntpdate -b clock.redhat.com;
for ((;;));
do
echo ------------------------`date`-------------------------->>timekeep.txt
ntpdate -qb clock.redhat.com >>timekeep.txt
sleep 7200
done

2.In windows in2008 32 and win7-64,we also run oracle DB.


Each guest CML with -rtc-td-hack and -M rhel5.6.0

#ps -aux | grep qemu-kvm
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.7/FAQ
root     12804 25.6 12.6 4412292 3115696 pts/3 Sl+  04:14  25:26 /usr/libexec/qemu-kvm -M rhel5.6.0 -m 4096 -smp 1 -name win7-32 -monitor stdio -boot c -drive file=/mnt/win7-32-virtio.qcow2,if=ide,boot=on,format=qcow2,werror=stop -net nic,macaddr=54:52:00:27:cd:13,vlan=0,model=rtl8139 -net tap,vlan=0 -usb -usbdevice tablet -vnc :1 -k en-us -vga cirrus -rtc-td-hack
root     12821 24.4 16.7 4412292 4137060 pts/4 Sl+  04:14  24:10 /usr/libexec/qemu-kvm -M rhel5.6.0 -m 4096 -smp 1 -name win7-64 -monitor stdio -boot c -drive file=/mnt/win7-64-virtio.qcow2,if=ide,boot=on,format=qcow2,werror=stop -net nic,macaddr=54:52:00:27:cd:14,vlan=0,model=rtl8139 -net tap,vlan=0 -usb -usbdevice tablet -vnc :2 -k en-us -vga cirrus -rtc-td-hack
root     12840 14.5 16.6 4419296 4117552 pts/5 Sl+  04:14  14:21 /usr/libexec/qemu-kvm -M rhel5.6.0 -m 4096 -smp 1 -name win2008-32 -monitor stdio -boot c -drive file=/root/img/win2008-32-virtio.qcow2,if=ide,boot=on,format=qcow2,werror=stop -net nic,macaddr=54:52:00:27:cd:15,vlan=0,model=rtl8139 -net tap,vlan=0 -usb -usbdevice tablet -vnc :3 -k en-us -vga cirrus -rtc-td-hack
root     12857  2.5 16.6 4412292 4117272 pts/7 Sl+  04:14   2:31 /usr/libexec/qemu-kvm -M rhel5.6.0 -m 4096 -smp 1 -name win2008-64 -monitor stdio -boot c -drive file=/root/img/win2008-64-virtio.qcow2,if=ide,boot=on,format=qcow2,werror=stop -net nic,macaddr=54:52:00:27:cd:16,vlan=0,model=rtl8139 -net tap,vlan=0 -usb -usbdevice tablet -vnc :4 -k en-us -vga cirrus -rtc-td-hack
root     12874  2.6 16.6 4412292 4114608 pts/6 Sl+  04:14   2:38 /usr/libexec/qemu-kvm -M rhel5.6.0 -m 4096 -smp 1 -name win2008-64 -monitor stdio -boot c -drive file=/root/img/win2008r2-64-virtio.qcow2,if=ide,boot=on,format=qcow2,werror=stop -net nic,macaddr=54:52:00:27:cd:17,vlan=0,model=rtl8139 -net tap,vlan=0 -usb -usbdevice tablet -vnc :5 -k en-us -vga cirrus -rtc-td-hack
Comment 9 Dor Laor 2010-12-27 07:57:19 EST
(In reply to comment #8)
> > The main important issue is to use the correct flags like in the above wiki -
> > -rtc-td-hack is the main important one
> 
> Concurrent run 5 windows guests on one host and run app in each guest,will
> check results and update results tmr.
> 
> 1.In every windows guest,we use clock.redhat.com ntp server rather then
> mircosoft ntp server.and install cygwin app in order to run scripts.
> #!/bin/sh
> ntpdate -b clock.redhat.com;

Please use the standard windows NTP and not some funny cygwin code
Comment 10 juzhang 2010-12-27 21:55:47 EST
> Please use the standard windows NTP and not some funny cygwin code
Ok,I will use the standard windows NTP rather than cygwin.and update results later.

Anyway,I paste my results using cygwin code,maybe useful.
for now,almost run 16 hours.no windows guest were crash.for win7-32/64,no found time drift.for windows2008-64,offset is -6.167691.
-----snip of time logs from win7-32---

------------------------Mon Dec 27 17:47:34 CST 2010--------------------------
server 66.187.233.4, stratum 1, offset -0.029536, delay 0.30057^M
27 Dec 17:47:36 ntpdate.exe[3196]: step time server 66.187.233.4 offset -0.029536 sec
-------------------------------------------------------------
----------------------------------------------------------------
------------------------Tue Dec 28 09:47:51 CST 2010--------------------------
server 66.187.233.4, stratum 1, offset -0.257216, delay 0.28958^M
28 Dec 09:47:53 ntpdate.exe[3824]: step time server 66.187.233.4 offset -0.257216 sec

-----snip of time logs from win7-64---
------------------------Mon Dec 27 17:50:50 CST 2010--------------------------
server 66.187.233.4, stratum 1, offset 0.002618, delay 0.28014^M
27 Dec 17:50:52 ntpdate.exe[3524]: step time server 66.187.233.4 offset 0.002618 sec^M
---------------------------------------------------
------------------------Tue Dec 28 09:51:08 CST 2010--------------------------
server 66.187.233.4, stratum 1, offset -0.198457, delay 0.29611^M
28 Dec 09:51:09 ntpdate.exe[2372]: step time server 66.187.233.4 offset -0.198457 sec

-----snip of time logs from win2008-64--- please note,this offset is -6.167691 sec

------------------------Mon Dec 27 09:46:23 PST 2010--------------------------
server 66.187.233.4, stratum 1, offset 0.026871, delay 0.30544^M
27 Dec 01:46:26 ntpdate.exe[920]: step time server 66.187.233.4 offset 0.026871 sec^M

--------------------------------------------------------------
------------------------Tue Dec 28 01:47:08 PST 2010--------------------------
server 66.187.233.4, stratum 1, offset -6.167691, delay 0.28793^M
27 Dec 17:47:11 ntpdate.exe[1872]: step time server 66.187.233.4 offset -6.167691 sec
Comment 11 juzhang 2010-12-29 07:30:12 EST
(In reply to comment #9)
> (In reply to comment #8)
> > > The main important issue is to use the correct flags like in the above wiki -
> > > -rtc-td-hack is the main important one
> > 
> > Concurrent run 5 windows guests on one host and run app in each guest,will
> > check results and update results tmr.
> > 
> > 1.In every windows guest,we use clock.redhat.com ntp server rather then
> > mircosoft ntp server.and install cygwin app in order to run scripts.
> > #!/bin/sh
> > ntpdate -b clock.redhat.com;
> 
> Please use the standard windows NTP and not some funny cygwin code

Direct run windows .bat rather than cygwin code
@echo on
ntpdate -b clock.redhat.com
:end
echo ------------------------%time%-------------------------->>timekeep.txt
ntpdate -qb clock.redhat.com >>timekeep.txt
ping /n 7200 127.0.0.1>nul
goto :end


Almost run 32 hours.no windows guest were crash.however,I think win2008-64 time drift is a little big,offset is -6.362924

----snip of win2008-64 timekeeping log---

------------------------21:42:19.86--------------------------
server 66.187.233.4, stratum 1, offset 0.002129, delay 0.29085
27 Dec 21:42:21 ntpdate[728]: step time server 66.187.233.4 offset 0.002129 sec
------------------------23:42:20.55--------------------------
server 66.187.233.4, stratum 1, offset -0.190169, delay 0.29376
27 Dec 23:42:21 ntpdate[2792]: step time server 66.187.233.4 offset -0.190169 sec
------------------------ 1:42:21.74--------------------------
server 66.187.233.4, stratum 1, offset -0.510236, delay 0.29533
28 Dec 01:42:23 ntpdate[2468]: step time server 66.187.233.4 offset -0.510236 sec
------------------------ 3:42:22.39--------------------------
server 66.187.233.4, stratum 1, offset -0.568400, delay 0.34065
28 Dec 03:42:23 ntpdate[528]: step time server 66.187.233.4 offset -0.568400 sec
------------------------ 5:42:22.98--------------------------
server 66.187.233.4, stratum 1, offset -0.716122, delay 0.29054
28 Dec 05:42:24 ntpdate[3048]: step time server 66.187.233.4 offset -0.716122 sec
------------------------ 7:42:23.52--------------------------
server 66.187.233.4, stratum 1, offset -1.006065, delay 0.28987
28 Dec 07:42:24 ntpdate[1776]: step time server 66.187.233.4 offset -1.006065 sec
------------------------ 9:42:24.35--------------------------
server 66.187.233.4, stratum 1, offset -1.762275, delay 0.28784
28 Dec 09:42:25 ntpdate[320]: step time server 66.187.233.4 offset -1.762275 sec
------------------------11:42:25.12--------------------------
server 66.187.233.4, stratum 1, offset -2.195595, delay 0.29216
28 Dec 11:42:26 ntpdate[2332]: step time server 66.187.233.4 offset -2.195595 sec
------------------------13:42:25.90--------------------------
server 66.187.233.4, stratum 1, offset -2.607375, delay 0.29236
28 Dec 13:42:27 ntpdate[2344]: step time server 66.187.233.4 offset -2.607375 sec
------------------------15:42:26.68--------------------------
server 66.187.233.4, stratum 1, offset -2.870626, delay 0.28792
28 Dec 15:42:27 ntpdate[220]: step time server 66.187.233.4 offset -2.870626 sec
------------------------17:42:27.05--------------------------
server 66.187.233.4, stratum 1, offset -2.918016, delay 0.29269
28 Dec 17:42:28 ntpdate[2500]: step time server 66.187.233.4 offset -2.918016 sec
------------------------19:42:27.44--------------------------
server 66.187.233.4, stratum 1, offset -2.960739, delay 0.29117
28 Dec 19:42:28 ntpdate[2344]: step time server 66.187.233.4 offset -2.960739 sec
------------------------21:42:28.08--------------------------
server 66.187.233.4, stratum 1, offset -3.450191, delay 0.30730
28 Dec 21:42:29 ntpdate[988]: step time server 66.187.233.4 offset -3.450191 sec
------------------------23:42:29.60--------------------------
server 66.187.233.4, stratum 1, offset -5.592107, delay 0.28990
28 Dec 23:42:30 ntpdate[2892]: step time server 66.187.233.4 offset -5.592107 sec
------------------------ 1:42:30.13--------------------------
server 66.187.233.4, stratum 1, offset -5.731235, delay 0.29111
29 Dec 01:42:31 ntpdate[228]: step time server 66.187.233.4 offset -5.731235 sec
------------------------ 3:42:31.10--------------------------
server 66.187.233.4, stratum 1, offset -6.362924, delay 0.29282
29 Dec 03:42:32 ntpdate[1036]: step time server 66.187.233.4 offset -6.362924 sec
Comment 12 Dor Laor 2010-12-29 08:15:14 EST
Probably NTP error rate is larger than 0.05% so NTP stops updating the local time. On Linux there is an option to force ntp update despite time changes. I should be done here as well.
I'm closing the bug since the blue screen stopped and the time drift looks reasonable (assuming NTP froced update)

Note You need to log in before you can comment on or make changes to this bug.