Bug 1110305 - BSOD - CLOCK_WATCHDOG_TIMEOUT_2 - Win 7SP1 guest, need to set hv_relaxed
Status: CLOSED CURRENTRELEASE
Product: oVirt
Classification: Community
Component: vdsm
Version: 3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.5.0
Assigned To: Francesco Romani
QA Contact: Pavel Novotny
Whiteboard: virt
Keywords: Triaged
Depends On:
Blocks: 1073943 1083529
Reported: 2014-06-17 08:11 EDT by Markus Stockhausen
Modified: 2016-02-10 14:49 EST

See Also:
Fixed In Version: ovirt-3.5.0-beta2
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-10-17 08:40:40 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
1 cpu hypervisor (43.64 KB, image/png), 2014-06-17 11:15 EDT, Markus Stockhausen
2 timedrift hypervisor (34.68 KB, image/png), 2014-06-17 11:16 EDT, Markus Stockhausen
3 memory hypervisor (34.86 KB, image/png), 2014-06-17 11:16 EDT, Markus Stockhausen
4 infiniband/NFS hypervisor (40.73 KB, image/png), 2014-06-17 11:17 EDT, Markus Stockhausen
5 - io bytes NFS server (17.15 KB, image/png), 2014-06-17 11:18 EDT, Markus Stockhausen
6 IOs NFS (17.04 KB, image/png), 2014-06-17 11:18 EDT, Markus Stockhausen
7 io times NFS (18.50 KB, image/png), 2014-06-17 11:19 EDT, Markus Stockhausen
8 cpu nfs server (15.16 KB, image/png), 2014-06-17 11:19 EDT, Markus Stockhausen
9 swap io hypervisor (35.34 KB, image/png), 2014-06-17 11:20 EDT, Markus Stockhausen
10 swap usage hypervisor (12.23 KB, image/png), 2014-06-17 11:21 EDT, Markus Stockhausen


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 27619 master MERGED vm: hyperv: initial windows hyperv support Never
oVirt gerrit 29238 master MERGED core: enable hyperv optimization on windows Never
oVirt gerrit 30188 ovirt-engine-3.5 MERGED core: enable hyperv optimization on windows Never
oVirt gerrit 30254 ovirt-3.5 MERGED vm: hyperv: initial windows hyperv support Never

Description Markus Stockhausen 2014-06-17 08:11:33 EDT
Description of problem:


Windows 7 SP1 VM was killed with BSOD during normal operation. 

Version-Release number of selected component (if applicable):

Hardware: Xeon X5650
Fedora 20
qemu 1.6.2

How reproducible:

Unknown

Steps to Reproduce:

Don't know

Actual results:

VM crashed


Expected results:

VM should run

Additional info:

*******************
******************
*****************

Might be related to BZ990824 

*******************
******************
*****************

Analysis of memory dump:

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

CLOCK_WATCHDOG_TIMEOUT (101)
An expected clock interrupt was not received on a secondary processor in an
MP system within the allocated interval. This indicates that the specified
processor is hung and not processing interrupts.
Arguments:
Arg1: 0000000000000061, Clock interrupt time out interval in nominal clock ticks.
Arg2: 0000000000000000, 0.
Arg3: fffff88002e40180, The PRCB address of the hung processor.
Arg4: 0000000000000001, 0.

Debugging Details:
------------------

Page a0f3f not present in the dump file. Type ".hh dbgerr004" for details
Unable to open image file: C:\Program Files (x86)\Debugging Tools for Windows (x86)\sym\hal.dll\4CE7C66949000\hal.dll
The system cannot find the file specified.

Unable to open image file: C:\Program Files (x86)\Debugging Tools for Windows (x86)\sym\hal.dll\4CE7C66949000\hal.dll
The system cannot find the file specified.


BUGCHECK_STR:  CLOCK_WATCHDOG_TIMEOUT_2_PROC

DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

PROCESS_NAME:  WmiPrvSE.exe

CURRENT_IRQL:  d

STACK_TEXT:  
fffff880`04a25318 fffff800`02931a4a : 00000000`00000101 00000000`00000061 00000000`00000000 fffff880`02e40180 : nt!KeBugCheckEx
fffff880`04a25320 fffff800`028e46f7 : 00000000`00000000 fffff800`00000001 00000000`00026160 00000000`00000000 : nt! ?? ::FNODOBFM::`string'+0x4e3e
fffff880`04a253b0 fffff800`02826895 : fffff800`0284c3c0 fffff880`04a25560 fffff800`0284c3c0 fffffa80`00000000 : nt!KeUpdateSystemTime+0x377
fffff880`04a254b0 fffff800`028d7113 : fffff800`02a55e80 00000000`00000001 ffffffff`fffffd80 00000000`00000005 : hal!HalpHpetClockInterrupt+0x8d
fffff880`04a254e0 fffff800`028af939 : 00000000`016d8330 00000000`000007ff fffffa80`05d4f060 fffff800`02b97abd : nt!KiInterruptDispatchNoLock+0x163
fffff880`04a25670 fffff800`02b96bdf : 00000000`00000000 fffff880`04a25ca0 00000000`00000000 00000000`016d7ec0 : nt!KeFlushProcessWriteBuffers+0x65
fffff880`04a256e0 fffff800`02be6416 : 00000000`001ba350 fffff800`00000100 fffff880`04a25870 00000000`00000000 : nt!ExpGetProcessInformation+0x7f
fffff880`04a25830 fffff800`02be6e6d : 00000000`001ba350 fffff960`001a61b3 00000000`001ba350 00000000`00000b3a : nt!ExpQuerySystemInformation+0xfb4
fffff880`04a25be0 fffff800`028d9e53 : fffffa80`05d1b640 00000000`00000001 fffff880`04a25ca0 fffffa80`03793cc0 : nt!NtQuerySystemInformation+0x4d
fffff880`04a25c20 00000000`77b8161a : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13
00000000`0163f9f8 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x77b8161a


STACK_COMMAND:  kb

SYMBOL_NAME:  ANALYSIS_INCONCLUSIVE

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: Unknown_Module

IMAGE_NAME:  Unknown_Image

DEBUG_FLR_IMAGE_TIMESTAMP:  0

FAILURE_BUCKET_ID:  X64_CLOCK_WATCHDOG_TIMEOUT_2_PROC_ANALYSIS_INCONCLUSIVE

BUCKET_ID:  X64_CLOCK_WATCHDOG_TIMEOUT_2_PROC_ANALYSIS_INCONCLUSIVE

Followup: MachineOwner
---------
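
As a rough reading aid for the bugcheck arguments above, the following sketch converts Arg1 (the timeout in nominal clock ticks) into milliseconds. The helper name `decode_clock_watchdog` is hypothetical, and the 15.625 ms tick length is an assumption (a common Windows default); the dump itself does not state the tick length.

```python
# Assumption: 15.625 ms (64 Hz) is the nominal Windows clock tick.
# The memory dump does not state this value.
NOMINAL_TICK_MS = 15.625


def decode_clock_watchdog(arg1, arg3):
    """Summarize Arg1 (timeout in ticks) and Arg3 (PRCB address) of bugcheck 0x101."""
    return {
        "timeout_ticks": arg1,
        "timeout_ms": arg1 * NOMINAL_TICK_MS,
        "hung_prcb": hex(arg3),
    }


# Values copied from the !analyze output above.
INFO = decode_clock_watchdog(0x61, 0xFFFFF88002E40180)
print(INFO)
```

Under that assumption, the secondary processor failed to take a clock interrupt for roughly one and a half seconds before the guest bugchecked.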

*******************
******************
*****************

QEMU command line:


/usr/bin/qemu-system-x86_64 \
  -machine accel=kvm \
  -name colvm42 \
  -S \
  -machine pc-1.0,accel=kvm,usb=off \
  -cpu Nehalem \
  -m 4096 \
  -realtime mlock=off \
  -smp 2,maxcpus=160,sockets=80,cores=2,threads=1 \
  -uuid 3b839558-a7df-4d70-9f06-e2a0c4b8d095 \
  -smbios type=1,manufacturer=oVirt,product=oVirt Node,version=20-3,serial=75E79C3D-B774-11DF-935C-0019998D0D3A,uuid=3b839558-a7df-4d70-9f06-e2a0c4b8d095 \
  -no-user-config \
  -nodefaults \
  -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/colvm42.monitor,server,nowait \
  -mon chardev=charmonitor,id=monitor,mode=control \
  -rtc base=2014-06-16T15:55:55,driftfix=slew \
  -no-shutdown \
  -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
  -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x5 \
  -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 \
  -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw,serial= \
  -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 \
  -drive file=/rhev/data-center/mnt/10.10.30.251:_var_nas1_OVirtIB/965ca3b6-4f9c-4e81-b6e8-5ed4a9e58545/images/f2132f99-775c-4943-93e6-a56a9f42bf30/08d14339-d111-4eee-a91e-bbae2f681c52,if=none,id=drive-virtio-disk0,format=raw,serial=f2132f99-775c-4943-93e6-a56a9f42bf30,cache=none,werror=stop,rerror=stop,aio=threads \
  -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
  -netdev tap,fd=44,id=hostnet0,vhost=on,vhostfd=45 \
  -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:0c:29:b4:38:19,bus=pci.0,addr=0x3 \
  -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/3b839558-a7df-4d70-9f06-e2a0c4b8d095.com.redhat.rhevm.vdsm,server,nowait \
  -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm \
  -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/3b839558-a7df-4d70-9f06-e2a0c4b8d095.org.qemu.guest_agent.0,server,nowait \
  -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 \
  -chardev spicevmc,id=charchannel2,name=vdagent \
  -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 \
  -chardev pty,id=charconsole0 \
  -device virtconsole,chardev=charconsole0,id=console0 \
  -spice tls-port=5908,addr=192.168.11.44,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on \
  -k en-us \
  -device qxl-vga,id=video0,ram_size=67108864,vram_size=33554432,bus=pci.0,addr=0x2 \
  -device intel-hda,id=sound0,bus=pci.0,addr=0x4 \
  -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 \
  -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8
Comment 1 Markus Stockhausen 2014-06-17 11:10:14 EDT
The crash occurred again on several VMs. This happened during the start of a single VM.
We run that node in an oVirt NFS environment and collect OS data, so I am attaching graphs of everything.

1) CPU of node colovn04 - the hypervisor node
2) Timedrift of node colovn04 (just in case that helps)
3) Memory usage of node colovn04 - yellow are KSM pages - the black line shows "uncompressed" KSM pages
4) Infiniband interface bytes - NFS is residing on that interface

5) NFS server IO Bytes
6) NFS Server IOs
7) NFS server average IO times
8) NFS server CPU usage
Comment 2 Markus Stockhausen 2014-06-17 11:15:54 EDT
Created attachment 909635
1 cpu hypervisor
Comment 3 Markus Stockhausen 2014-06-17 11:16:24 EDT
Created attachment 909636
2 timedrift hypervisor
Comment 4 Markus Stockhausen 2014-06-17 11:16:57 EDT
Created attachment 909638
3 memory hypervisor
Comment 5 Markus Stockhausen 2014-06-17 11:17:39 EDT
Created attachment 909639
4 infiniband/NFS hypervisor
Comment 6 Markus Stockhausen 2014-06-17 11:18:14 EDT
Created attachment 909640
5 - io bytes NFS server
Comment 7 Markus Stockhausen 2014-06-17 11:18:50 EDT
Created attachment 909641
6 IOs NFS
Comment 8 Markus Stockhausen 2014-06-17 11:19:17 EDT
Created attachment 909642
7 io times NFS
Comment 9 Markus Stockhausen 2014-06-17 11:19:46 EDT
Created attachment 909643
8 cpu nfs server
Comment 10 Markus Stockhausen 2014-06-17 11:20:30 EDT
Created attachment 909644
9 swap io hypervisor
Comment 11 Markus Stockhausen 2014-06-17 11:21:00 EDT
Created attachment 909645
10 swap usage hypervisor
Comment 12 Markus Stockhausen 2014-06-17 11:23:47 EDT
9/10 show swap IOs and usage on the hypervisor node

Kernel on the hypervisor is 3.14.4-200.fc20.x86_64
Comment 13 Cole Robinson 2014-06-17 11:46:50 EDT
There's a kbase article about this:

https://access.redhat.com/site/solutions/755943
https://bugzilla.redhat.com/show_bug.cgi?id=990824

The suggested solution is to pass this with libvirt:

<domain ...>
  <features>
    <hyperv>
      <relaxed state='on'/>
    </hyperv>
  </features>
</domain>

So oVirt should be doing that for Windows 7 guests; reassigning.
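
The libvirt snippet above could be applied to an existing domain definition roughly as follows. This is only a minimal sketch using Python's standard `xml.etree.ElementTree`; it is not the VDSM patch, and the function name `enable_hyperv_relaxed` and the sample domain XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def enable_hyperv_relaxed(domain_xml):
    """Return domain XML with <features><hyperv><relaxed state='on'/> set."""
    root = ET.fromstring(domain_xml)
    # Reuse existing <features>/<hyperv>/<relaxed> elements, else create them.
    features = root.find("features")
    if features is None:
        features = ET.SubElement(root, "features")
    hyperv = features.find("hyperv")
    if hyperv is None:
        hyperv = ET.SubElement(features, "hyperv")
    relaxed = hyperv.find("relaxed")
    if relaxed is None:
        relaxed = ET.SubElement(hyperv, "relaxed")
    relaxed.set("state", "on")
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily shortened domain definition for illustration only.
SAMPLE = "<domain type='kvm'><name>win7</name><features><acpi/></features></domain>"
PATCHED = enable_hyperv_relaxed(SAMPLE)
print(PATCHED)
```

The function is idempotent: running it on already patched XML leaves a single `<relaxed>` element and keeps other features such as `<acpi/>` untouched.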
Comment 14 Markus Stockhausen 2014-06-17 15:47:59 EDT
Similar bug where qemu parametrization could be enhanced: BZ1107835
Comment 15 Federico Simoncelli 2014-06-18 07:10:42 EDT
Francesco, can we handle this?

(In reply to Cole Robinson from comment #13)
> There's a kbase article about this:
> 
> https://access.redhat.com/site/solutions/755943
> https://bugzilla.redhat.com/show_bug.cgi?id=990824
> 
> The suggested solution is to pass this with libvirt:
> 
> <domain ...>
>   <features>
>     <hyperv>
>       <relaxed state='on'/>
>     </hyperv>
>   </features>
> </domain>
> 
> So ovirt should be doing that for windows 7 guests, reassigning
Comment 16 Francesco Romani 2014-06-18 07:20:07 EDT
Yes, there are already plans and a patch in progress:
https://bugzilla.redhat.com/show_bug.cgi?id=1083529
http://gerrit.ovirt.org/#/c/27619/3

However, a few details still need to be sorted out to have proper support.
Comment 17 Michal Skrivanek 2014-06-25 10:41:26 EDT
(fixing product)
Comment 18 Michal Skrivanek 2014-06-25 10:50:20 EDT
(In reply to Francesco Romani from comment #16)
We may try to expedite the hv_relaxed part; that's the simplest one.
Since it's not a regression, AFAIK, I wouldn't block 3.5 for now.
Comment 19 Markus Stockhausen 2014-06-25 12:52:54 EDT
A short update. So far I cannot tell whether the bug is fixed by the "relaxed" setting. The errors were sporadic (about once every two weeks), so no direct before/after comparison is possible.

For setting the parameter I simply rely on Cole Robinson's comment 13.
Comment 20 Francesco Romani 2014-06-26 03:26:46 EDT
VDSM patch posted for review.
Comment 21 Francesco Romani 2014-06-27 05:00:09 EDT
VDSM patch merged, Engine patch posted
Comment 22 Francesco Romani 2014-07-18 04:31:29 EDT
It turns out the VDSM patch was merged after 3.5 branched.
Posted backports:
http://gerrit.ovirt.org/#/c/30254/
http://gerrit.ovirt.org/#/c/30255/
Comment 23 Pavel Novotny 2014-08-05 10:36:29 EDT
Verified in vdsm-4.16.0-42.git3bfad86.el6.x86_64 (oVirt 3.5 beta2).

Windows guests now have the hv_relaxed flag enabled; the QEMU process now looks like:

10774 ?        Sl     0:10 /usr/libexec/qemu-kvm -name win7 -S -M rhel6.5.0 -cpu Nehalem,hv_relaxed -enable-kvm -m 1024 ...
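
A check like the one above can be scripted. The sketch below splits a QEMU command line and looks for `hv_relaxed` among the `-cpu` flags; the helper names are hypothetical, and both sample command lines are shortened versions of the ones quoted in this bug.

```python
import shlex


def cpu_flags(cmdline):
    """Return the comma-separated values of the -cpu argument (CPU model first)."""
    args = shlex.split(cmdline)
    for i, arg in enumerate(args[:-1]):
        if arg == "-cpu":
            return args[i + 1].split(",")
    return []


def has_hv_relaxed(cmdline):
    """True if hv_relaxed appears among the -cpu flags (after the model name)."""
    return "hv_relaxed" in cpu_flags(cmdline)[1:]


# Shortened command lines taken from this bug report.
FIXED = "/usr/libexec/qemu-kvm -name win7 -S -M rhel6.5.0 -cpu Nehalem,hv_relaxed -enable-kvm -m 1024"
BROKEN = "/usr/bin/qemu-system-x86_64 -machine accel=kvm -name colvm42 -S -cpu Nehalem -m 4096"

print(has_hv_relaxed(FIXED), has_hv_relaxed(BROKEN))
```

On a live host the command line would come from the process list (for example `ps` output) rather than from a literal string.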
Comment 24 Sandro Bonazzola 2014-10-17 08:40:40 EDT
oVirt 3.5 has been released and should include the fix for this issue.
