Bug 664902
| Summary: | Windows guests hit BSOD | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | juzhang <juzhang> | ||||
| Component: | kvm | Assignee: | Paolo Bonzini <pbonzini> | ||||
| Status: | CLOSED NOTABUG | QA Contact: | Virtualization Bugs <virt-bugs> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 5.6 | CC: | dyuan, eblake, michen, mkenneth, pbonzini, tburke, virt-maint, xen-maint, ydu | ||||
| Target Milestone: | rc | ||||||
| Target Release: | --- | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2010-12-29 13:15:14 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 580946 | ||||||
| Attachments: |
|
||||||
How many cores does the host have? This BSOD is quite common in case you're overcommitting resources on the host. Unfortunately, the "!analyze -v" output from the debugger will always be of the boot processor (the one checking the watchdog) rather than the hung processor. It's better if you include backtraces from all the processors. You can change processor in WinDBG with the "~NN" command (where NN is from 0 to #processors-1) and then type "kb" to print the backtrace. (In reply to comment #1) > How many cores does the host have? This BSOD is quite common in case you're > overcommitting resources on the host. Just four processors. #cat /proc/cpuinfo | grep processor processor : 0 processor : 1 processor : 2 processor : 3 > > Unfortunately, the "!analyze -v" output from the debugger will always be of the > boot processor (the one checking the watchdog) rather than the hung processor. > It's better if you include backtraces from all the processors. You can change > processor in WinDBG with the "~NN" command (where NN is from 0 to > #processors-1) and then type "kb" to print the backtrace. I assign two cores to this guest. 0: kd> ~NN ^ Syntax error in '~NN' 0: kd> ~0 0: kd> kb ChildEBP RetAddr Args to Child WARNING: Frame IP not in any known module. Following frames may be wrong. 88553a60 81aba8c5 00000101 00000060 00000000 0x81adcb8d 88553a94 81abc6bd 88553a1b 000000d1 88553b24 0x81aba8c5 88553aa4 81a53bfe badb0d00 00000000 00000000 0x81abc6bd 88553b24 81a6a94b 88553b84 00000000 a4310000 0x81a53bfe 88553c08 81a46448 a4311000 00001000 88553d10 0x81a6a94b 88553c98 81a4751b 84100a5c 00000000 00000001 0x81a46448 88553cec 81a44695 84899880 88553d10 00000000 0x81a4751b 88553d44 81ab4e22 83849ff0 00000000 8383f020 0x81a44695 88553d7c 81be4c42 83849ff0 e6e092ad 00000000 0x81ab4e22 88553dc0 81a4df4e 81ab4d25 00000000 00000000 0x81be4c42 00000000 00000000 00000000 00000000 00000000 0x81a4df4e 0: kd> ~1 Can't switch processors on a kernel triage dump If you have so many guests running at the same time, I strongly suggest you make Windows guests uniprocessor. Ok to close as NOTABUG? (In reply to comment #3) > If you have so many guests running at the same time, I strongly suggest you > make Windows guests uniprocessor. > > Ok to close as NOTABUG? It's ok for me.if you make this decision according to infos what I have provided. I would not recommend just closing it. First, try run windows as it should be - using the pmtimer - docs: http://cleo.tlv.redhat.com/qumrawiki/KVM/TimeKeeping Also, please us -M rhel5.5 and beyond (In reply to comment #5) > I would not recommend just closing it. > First, try run windows as it should be - using the pmtimer - docs: > http://cleo.tlv.redhat.com/qumrawiki/KVM/TimeKeeping > > Also, please us -M rhel5.5 and beyond Hi,Dor Just confirm,Run again with -M rhel5.5 and beyond using the comment0's stress? In comment0 situation,cpu overcommit rate is very high.almost 10X overcommit. (In reply to comment #6) > (In reply to comment #5) > > I would not recommend just closing it. > > First, try run windows as it should be - using the pmtimer - docs: > > http://cleo.tlv.redhat.com/qumrawiki/KVM/TimeKeeping > > > > Also, please us -M rhel5.5 and beyond > > Hi,Dor > > Just confirm,Run again with -M rhel5.5 and beyond using the comment0's > stress? > In comment0 situation,cpu overcommit rate is very high.almost 10X overcommit. The main important issue is to use the correct flags like in the above wiki - -rtc-td-hack is the main important one > The main important issue is to use the correct flags like in the above wiki -
> -rtc-td-hack is the main important one
Concurrent run 5 windows guests on one host and run app in each guest,will check results and update results tmr.
1.In every windows guest,we use clock.redhat.com ntp server rather then
mircosoft ntp server.and install cygwin app in order to run scripts.
#!/bin/sh
ntpdate -b clock.redhat.com;
for ((;;));
do
echo ------------------------`date`-------------------------->>timekeep.txt
ntpdate -qb clock.redhat.com >>timekeep.txt
sleep 7200
done
2.In windows in2008 32 and win7-64,we also run oracle DB.
Each guest CML with -rtc-td-hack and -M rhel5.6.0
#ps -aux | grep qemu-kvm
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.7/FAQ
root 12804 25.6 12.6 4412292 3115696 pts/3 Sl+ 04:14 25:26 /usr/libexec/qemu-kvm -M rhel5.6.0 -m 4096 -smp 1 -name win7-32 -monitor stdio -boot c -drive file=/mnt/win7-32-virtio.qcow2,if=ide,boot=on,format=qcow2,werror=stop -net nic,macaddr=54:52:00:27:cd:13,vlan=0,model=rtl8139 -net tap,vlan=0 -usb -usbdevice tablet -vnc :1 -k en-us -vga cirrus -rtc-td-hack
root 12821 24.4 16.7 4412292 4137060 pts/4 Sl+ 04:14 24:10 /usr/libexec/qemu-kvm -M rhel5.6.0 -m 4096 -smp 1 -name win7-64 -monitor stdio -boot c -drive file=/mnt/win7-64-virtio.qcow2,if=ide,boot=on,format=qcow2,werror=stop -net nic,macaddr=54:52:00:27:cd:14,vlan=0,model=rtl8139 -net tap,vlan=0 -usb -usbdevice tablet -vnc :2 -k en-us -vga cirrus -rtc-td-hack
root 12840 14.5 16.6 4419296 4117552 pts/5 Sl+ 04:14 14:21 /usr/libexec/qemu-kvm -M rhel5.6.0 -m 4096 -smp 1 -name win2008-32 -monitor stdio -boot c -drive file=/root/img/win2008-32-virtio.qcow2,if=ide,boot=on,format=qcow2,werror=stop -net nic,macaddr=54:52:00:27:cd:15,vlan=0,model=rtl8139 -net tap,vlan=0 -usb -usbdevice tablet -vnc :3 -k en-us -vga cirrus -rtc-td-hack
root 12857 2.5 16.6 4412292 4117272 pts/7 Sl+ 04:14 2:31 /usr/libexec/qemu-kvm -M rhel5.6.0 -m 4096 -smp 1 -name win2008-64 -monitor stdio -boot c -drive file=/root/img/win2008-64-virtio.qcow2,if=ide,boot=on,format=qcow2,werror=stop -net nic,macaddr=54:52:00:27:cd:16,vlan=0,model=rtl8139 -net tap,vlan=0 -usb -usbdevice tablet -vnc :4 -k en-us -vga cirrus -rtc-td-hack
root 12874 2.6 16.6 4412292 4114608 pts/6 Sl+ 04:14 2:38 /usr/libexec/qemu-kvm -M rhel5.6.0 -m 4096 -smp 1 -name win2008-64 -monitor stdio -boot c -drive file=/root/img/win2008r2-64-virtio.qcow2,if=ide,boot=on,format=qcow2,werror=stop -net nic,macaddr=54:52:00:27:cd:17,vlan=0,model=rtl8139 -net tap,vlan=0 -usb -usbdevice tablet -vnc :5 -k en-us -vga cirrus -rtc-td-hack
(In reply to comment #8) > > The main important issue is to use the correct flags like in the above wiki - > > -rtc-td-hack is the main important one > > Concurrent run 5 windows guests on one host and run app in each guest,will > check results and update results tmr. > > 1.In every windows guest,we use clock.redhat.com ntp server rather then > mircosoft ntp server.and install cygwin app in order to run scripts. > #!/bin/sh > ntpdate -b clock.redhat.com; Please use the standard windows NTP and not some funny cygwin code > Please use the standard windows NTP and not some funny cygwin code
Ok,I will use the standard windows NTP rather than cygwin.and update results later.
Anyway,I paste my results using cygwin code,maybe useful.
for now,almost run 16 hours.no windows guest were crash.for win7-32/64,no found time drift.for windows2008-64,offset is -6.167691.
-----snip of time logs from win7-32---
------------------------Mon Dec 27 17:47:34 CST 2010--------------------------
server 66.187.233.4, stratum 1, offset -0.029536, delay 0.30057^M
27 Dec 17:47:36 ntpdate.exe[3196]: step time server 66.187.233.4 offset -0.029536 sec
-------------------------------------------------------------
----------------------------------------------------------------
------------------------Tue Dec 28 09:47:51 CST 2010--------------------------
server 66.187.233.4, stratum 1, offset -0.257216, delay 0.28958^M
28 Dec 09:47:53 ntpdate.exe[3824]: step time server 66.187.233.4 offset -0.257216 sec
-----snip of time logs from win7-64---
------------------------Mon Dec 27 17:50:50 CST 2010--------------------------
server 66.187.233.4, stratum 1, offset 0.002618, delay 0.28014^M
27 Dec 17:50:52 ntpdate.exe[3524]: step time server 66.187.233.4 offset 0.002618 sec^M
---------------------------------------------------
------------------------Tue Dec 28 09:51:08 CST 2010--------------------------
server 66.187.233.4, stratum 1, offset -0.198457, delay 0.29611^M
28 Dec 09:51:09 ntpdate.exe[2372]: step time server 66.187.233.4 offset -0.198457 sec
-----snip of time logs from win2008-64--- please note,this offset is -6.167691 sec
------------------------Mon Dec 27 09:46:23 PST 2010--------------------------
server 66.187.233.4, stratum 1, offset 0.026871, delay 0.30544^M
27 Dec 01:46:26 ntpdate.exe[920]: step time server 66.187.233.4 offset 0.026871 sec^M
--------------------------------------------------------------
------------------------Tue Dec 28 01:47:08 PST 2010--------------------------
server 66.187.233.4, stratum 1, offset -6.167691, delay 0.28793^M
27 Dec 17:47:11 ntpdate.exe[1872]: step time server 66.187.233.4 offset -6.167691 sec
(In reply to comment #9) > (In reply to comment #8) > > > The main important issue is to use the correct flags like in the above wiki - > > > -rtc-td-hack is the main important one > > > > Concurrent run 5 windows guests on one host and run app in each guest,will > > check results and update results tmr. > > > > 1.In every windows guest,we use clock.redhat.com ntp server rather then > > mircosoft ntp server.and install cygwin app in order to run scripts. > > #!/bin/sh > > ntpdate -b clock.redhat.com; > > Please use the standard windows NTP and not some funny cygwin code Direct run windows .bat rather than cygwin code @echo on ntpdate -b clock.redhat.com :end echo ------------------------%time%-------------------------->>timekeep.txt ntpdate -qb clock.redhat.com >>timekeep.txt ping /n 7200 127.0.0.1>nul goto :end Almost run 32 hours.no windows guest were crash.however,I think win2008-64 time drift is a little big,offset is -6.362924 ----snip of win2008-64 timekeeping log--- ------------------------21:42:19.86-------------------------- server 66.187.233.4, stratum 1, offset 0.002129, delay 0.29085 27 Dec 21:42:21 ntpdate[728]: step time server 66.187.233.4 offset 0.002129 sec ------------------------23:42:20.55-------------------------- server 66.187.233.4, stratum 1, offset -0.190169, delay 0.29376 27 Dec 23:42:21 ntpdate[2792]: step time server 66.187.233.4 offset -0.190169 sec ------------------------ 1:42:21.74-------------------------- server 66.187.233.4, stratum 1, offset -0.510236, delay 0.29533 28 Dec 01:42:23 ntpdate[2468]: step time server 66.187.233.4 offset -0.510236 sec ------------------------ 3:42:22.39-------------------------- server 66.187.233.4, stratum 1, offset -0.568400, delay 0.34065 28 Dec 03:42:23 ntpdate[528]: step time server 66.187.233.4 offset -0.568400 sec ------------------------ 5:42:22.98-------------------------- server 66.187.233.4, stratum 1, offset -0.716122, delay 0.29054 28 Dec 05:42:24 ntpdate[3048]: step time server 66.187.233.4 offset -0.716122 sec ------------------------ 7:42:23.52-------------------------- server 66.187.233.4, stratum 1, offset -1.006065, delay 0.28987 28 Dec 07:42:24 ntpdate[1776]: step time server 66.187.233.4 offset -1.006065 sec ------------------------ 9:42:24.35-------------------------- server 66.187.233.4, stratum 1, offset -1.762275, delay 0.28784 28 Dec 09:42:25 ntpdate[320]: step time server 66.187.233.4 offset -1.762275 sec ------------------------11:42:25.12-------------------------- server 66.187.233.4, stratum 1, offset -2.195595, delay 0.29216 28 Dec 11:42:26 ntpdate[2332]: step time server 66.187.233.4 offset -2.195595 sec ------------------------13:42:25.90-------------------------- server 66.187.233.4, stratum 1, offset -2.607375, delay 0.29236 28 Dec 13:42:27 ntpdate[2344]: step time server 66.187.233.4 offset -2.607375 sec ------------------------15:42:26.68-------------------------- server 66.187.233.4, stratum 1, offset -2.870626, delay 0.28792 28 Dec 15:42:27 ntpdate[220]: step time server 66.187.233.4 offset -2.870626 sec ------------------------17:42:27.05-------------------------- server 66.187.233.4, stratum 1, offset -2.918016, delay 0.29269 28 Dec 17:42:28 ntpdate[2500]: step time server 66.187.233.4 offset -2.918016 sec ------------------------19:42:27.44-------------------------- server 66.187.233.4, stratum 1, offset -2.960739, delay 0.29117 28 Dec 19:42:28 ntpdate[2344]: step time server 66.187.233.4 offset -2.960739 sec ------------------------21:42:28.08-------------------------- server 66.187.233.4, stratum 1, offset -3.450191, delay 0.30730 28 Dec 21:42:29 ntpdate[988]: step time server 66.187.233.4 offset -3.450191 sec ------------------------23:42:29.60-------------------------- server 66.187.233.4, stratum 1, offset -5.592107, delay 0.28990 28 Dec 23:42:30 ntpdate[2892]: step time server 66.187.233.4 offset -5.592107 sec ------------------------ 1:42:30.13-------------------------- server 66.187.233.4, stratum 1, offset -5.731235, delay 0.29111 29 Dec 01:42:31 ntpdate[228]: step time server 66.187.233.4 offset -5.731235 sec ------------------------ 3:42:31.10-------------------------- server 66.187.233.4, stratum 1, offset -6.362924, delay 0.29282 29 Dec 03:42:32 ntpdate[1036]: step time server 66.187.233.4 offset -6.362924 sec Probably NTP error rate is larger than 0.05% so NTP stops updating the local time. On Linux there is an option to force ntp update despite time changes. I should be done here as well. I'm closing the bug since the blue screen stopped and the time drift looks reasonable (assuming NTP froced update) |
Created attachment 470119 [details] BOSD's guest screenshot and win7 debug info Description of problem: Concurrent run 14 guests on one host and run app in each guest,at last.win2008 32/64 and win7-64 hit BSOD. Please note,I am not very clear how long run before hit BSOD. Version-Release number of selected component (if applicable): #rpm -qa | grep libvirt libvirt-0.8.2-15.el5 libvirt-python-0.8.2-15.el5 libvirt-0.8.2-15.el5 libvirt-cim-0.5.8-3.el5 kvm version: kvm-83-223.el5 Host Kernel #uname -r 2.6.18-237.el5 How reproducible: Steps to Reproduce: 1.Concurrent run 14 guests on one host. #virsh list Id Name State ---------------------------------- 9 rhel4u8-i386 running 10 rhel4u8-x86_64 running 12 rhel5u5-i386 running 13 rhel5u5-x86_64 running 14 rhel6-i386 running 16 rhel6-x86_64 running 17 winxp running 18 win2003-32 running 19 win2003-63 running 22 win2008r2-64 running 25 win7-32 running 26 win7-64 running 27 win2008-32 running 28 win2008-64 running Some of guest kvm CML #/usr/libexec/qemu-kvm -S -M rhel5.4.0 -m 2008 -smp 2,sockets=2,cores=1,threads=1 -name win2008-32 -uuid 0983859a-cc00-0b14-c7ab-3518855cd487 -monitor unix:/var/lib/libvirt/qemu/win2008-32.monitor,server,nowait -no-kvm-pit-reinjection -boot c -drive file=/mnt/win2008-32-virtio.qcow2,if=ide,bus=0,unit=0,boot=on,format=qcow2,cache=none -net nic,macaddr=52:54:00:53:d6:3d,vlan=0 -net tap,fd=16,vlan=0 -serial pty -parallel none -usb -vnc 127.0.0.1:3 -k en-us -vga cirrus -balloon virtio #/usr/libexec/qemu-kvm -S -M rhel5.4.0 -m 2008 -smp 2,sockets=2,cores=1,threads=1 -name win2008-64 -uuid 9b5a0a7b-5f6e-68f1-762d-5ab5ae7f4ac0 -monitor unix:/var/lib/libvirt/qemu/win2008-64.monitor,server,nowait -no-kvm-pit-reinjection -boot c -drive file=/media/img/win2008-64-virtio.qcow2,if=ide,bus=0,unit=0,boot=on,format=qcow2,cache=none -net nic,macaddr=52:54:00:0d:4c:f1,vlan=0 -net tap,fd=16,vlan=0 -serial pty -parallel none -usb -vnc 127.0.0.1:4 -k en-us -vga cirrus -balloon virtio 2.In every windows guest,we use clock.redhat.com ntp server rather then mircosoft ntp server.and install cygwin app in order to run scripts. #!/bin/sh ntpdate -b clock.redhat.com; for ((;;)); do echo ------------------------`date`-------------------------->>timekeep.txt ntpdate -qb clock.redhat.com >>timekeep.txt sleep 7200 done 3.In windows in2008 32/64,we also run oracle DB. Actual results: win2008 32/64 and win7-64 hit BSOD. Expected results: All guest run well Additional info: 1.The attachment is BOSD's guest screenshot and win7 debug info. ---snip of win7 debug info--- CLOCK_WATCHDOG_TIMEOUT (101) An expected clock interrupt was not received on a secondary processor in an MP system within the allocated interval. This indicates that the specified processor is hung and not processing interrupts. Arguments: Arg1: 00000060, Clock interrupt time out interval in nominal clock ticks. Arg2: 00000000, 0. Arg3: 803d1120, The PRCB address of the hung processor. Arg4: 00000001, 0. Debugging Details: ------------------ BUGCHECK_STR: CLOCK_WATCHDOG_TIMEOUT_2_PROC DEFAULT_BUCKET_ID: INTEL_CPU_MICROCODE_ZERO PROCESS_NAME: System CURRENT_IRQL: 1c STACK_TEXT: 8039db80 81aea8c5 00000101 00000060 00000000 nt!KeBugCheckEx+0x1e 8039dbb4 81aec6bd 83c3a31b 000000d1 8039dc50 nt!KeUpdateRunTime+0xd5 8039dbb4 81a4576f 83c3a31b 000000d1 8039dc50 nt!KeUpdateSystemTime+0xed 8039dc50 87421622 87421466 8039dc68 00040000 nt!KeIpiGenericCall+0x97 WARNING: Stack unwind information not available. Following frames may be wrong. 8039dd38 81ae720f 83a41138 00000000 83a41002 sptd+0x14622 8039dd4c 81a73602 83cff368 00000000 000001d8 nt!KiExitDispatcher+0x1db 8039dd84 81a839c6 83df4030 83cff368 83cff368 nt!IoReuseIrp+0x27 00000000 00000000 00000000 00000000 00000000 nt!IofCallDriver+0x63 STACK_COMMAND: kb SYMBOL_NAME: ANALYSIS_INCONCLUSIVE FOLLOWUP_NAME: MachineOwner MODULE_NAME: Unknown_Module IMAGE_NAME: Unknown_Image DEBUG_FLR_IMAGE_TIMESTAMP: 0 FAILURE_BUCKET_ID: CLOCK_WATCHDOG_TIMEOUT_2_PROC_ANALYSIS_INCONCLUSIVE BUCKET_ID: CLOCK_WATCHDOG_TIMEOUT_2_PROC_ANALYSIS_INCONCLUSIVE Followup: MachineOwner --------- 2.#cat /proc/cpuinfo ----------- processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Xeon(R) CPU X3323 @ 2.50GHz 3.#free total used free shared buffers cached Mem: 24665660 24458500 207160 0 248980 4855064 -/+ buffers/cache: 19354456 5311204 Swap: 1052248 1020252 31996 4.#top 24756 root 15 0 2220m 2.0g 48m S 191.6 8.4 1913:47 qemu-kvm 10217 root 15 0 433m 37m 11m S 3.7 0.2 1:56.28 /usr/share/virt 11152 root 15 0 2224m 2.0g 3476 S 2.0 8.4 1993:27 qemu-kvm 9880 root 15 0 2224m 2.0g 3492 S 1.3 8.4 3133:07 qemu-kvm 9912 root 15 0 2224m 2.0g 3476 S 1.3 8.4 1089:00 qemu-kvm 9849 root 15 0 1242m 1.0g 3500 S 1.0 4.4 2201:26 qemu-kvm 25289 root 15 0 2224m 1.9g 2512 S 0.7 8.3 31:01.53 qemu-kvm 3890 root 15 0 232m 15m 2724 S 0.3 0.1 4:23.98 libvirtd 8128 root 15 0 2206m 848m 2192 S 0.3 3.5 29:11.85 qemu-kvm 8441 root 16 0 2206m 1.0g 2188 S 0.3 4.2 24:42.74 qemu-kvm 9369 root 15 0 2207m 880m 2180 S 0.3 3.7 11:29.21 qemu-kvm 9409 root 15 0 106m 19m 2692 S 0.3 0.1 0:30.30 sshd 10065 root 15 0 2224m 1.9g 3480 S 0.3 8.3 448:59.55 qemu-kvm