Bug 1110305 - BSOD - CLOCK_WATCHDOG_TIMEOUT_2 - Win 7SP1 guest, need to set hv_relaxed
Summary: BSOD - CLOCK_WATCHDOG_TIMEOUT_2 - Win 7SP1 guest, need to set hv_relaxed
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: vdsm   
Version: 3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.5.0
Assignee: Francesco Romani
QA Contact: Pavel Novotny
URL:
Whiteboard: virt
Keywords: Triaged
Depends On:
Blocks: 1073943 1083529
 
Reported: 2014-06-17 12:11 UTC by Markus Stockhausen
Modified: 2016-02-10 19:49 UTC
CC List: 21 users

Fixed In Version: ovirt-3.5.0-beta2
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-10-17 12:40:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
1 cpu hypervisor (43.64 KB, image/png), 2014-06-17 15:15 UTC, Markus Stockhausen
2 timedrift hypervisor (34.68 KB, image/png), 2014-06-17 15:16 UTC, Markus Stockhausen
3 memory hypervisor (34.86 KB, image/png), 2014-06-17 15:16 UTC, Markus Stockhausen
4 infiniband/NFS hypervisor (40.73 KB, image/png), 2014-06-17 15:17 UTC, Markus Stockhausen
5 - io bytes NFS server (17.15 KB, image/png), 2014-06-17 15:18 UTC, Markus Stockhausen
6 IOs NFS (17.04 KB, image/png), 2014-06-17 15:18 UTC, Markus Stockhausen
7 io times NFS (18.50 KB, image/png), 2014-06-17 15:19 UTC, Markus Stockhausen
8 cpu nfs server (15.16 KB, image/png), 2014-06-17 15:19 UTC, Markus Stockhausen
9 swap io hypervisor (35.34 KB, image/png), 2014-06-17 15:20 UTC, Markus Stockhausen
10 swap usage hypervisor (12.23 KB, image/png), 2014-06-17 15:21 UTC, Markus Stockhausen


External Trackers
Tracker       ID     Priority          Status  Summary                                      Last Updated
oVirt gerrit  27619  master            MERGED  vm: hyperv: initial windows hyperv support   Never
oVirt gerrit  29238  master            MERGED  core: enable hyperv optimization on windows  Never
oVirt gerrit  30188  ovirt-engine-3.5  MERGED  core: enable hyperv optimization on windows  Never
oVirt gerrit  30254  ovirt-3.5         MERGED  vm: hyperv: initial windows hyperv support   Never

Description Markus Stockhausen 2014-06-17 12:11:33 UTC
Description of problem:


A Windows 7 SP1 VM crashed with a BSOD during normal operation.

Version-Release number of selected component (if applicable):

Hardware XEON x5650 
Fedora 20
qemu 1.6.2

How reproducible:

Unknown

Steps to Reproduce:

Don't know

Actual results:

VM crashed


Expected results:

VM should run

Additional info:

*******************
******************
*****************

Might be related to BZ990824 

*******************
******************
*****************

Analysis of memory dump:

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

CLOCK_WATCHDOG_TIMEOUT (101)
An expected clock interrupt was not received on a secondary processor in an
MP system within the allocated interval. This indicates that the specified
processor is hung and not processing interrupts.
Arguments:
Arg1: 0000000000000061, Clock interrupt time out interval in nominal clock ticks.
Arg2: 0000000000000000, 0.
Arg3: fffff88002e40180, The PRCB address of the hung processor.
Arg4: 0000000000000001, 0.

Debugging Details:
------------------

Page a0f3f not present in the dump file. Type ".hh dbgerr004" for details
Unable to open image file: C:\Program Files (x86)\Debugging Tools for Windows (x86)\sym\hal.dll\4CE7C66949000\hal.dll
The system cannot find the file specified.

Unable to open image file: C:\Program Files (x86)\Debugging Tools for Windows (x86)\sym\hal.dll\4CE7C66949000\hal.dll
The system cannot find the file specified.


BUGCHECK_STR:  CLOCK_WATCHDOG_TIMEOUT_2_PROC

DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

PROCESS_NAME:  WmiPrvSE.exe

CURRENT_IRQL:  d

STACK_TEXT:  
fffff880`04a25318 fffff800`02931a4a : 00000000`00000101 00000000`00000061 00000000`00000000 fffff880`02e40180 : nt!KeBugCheckEx
fffff880`04a25320 fffff800`028e46f7 : 00000000`00000000 fffff800`00000001 00000000`00026160 00000000`00000000 : nt! ?? ::FNODOBFM::`string'+0x4e3e
fffff880`04a253b0 fffff800`02826895 : fffff800`0284c3c0 fffff880`04a25560 fffff800`0284c3c0 fffffa80`00000000 : nt!KeUpdateSystemTime+0x377
fffff880`04a254b0 fffff800`028d7113 : fffff800`02a55e80 00000000`00000001 ffffffff`fffffd80 00000000`00000005 : hal!HalpHpetClockInterrupt+0x8d
fffff880`04a254e0 fffff800`028af939 : 00000000`016d8330 00000000`000007ff fffffa80`05d4f060 fffff800`02b97abd : nt!KiInterruptDispatchNoLock+0x163
fffff880`04a25670 fffff800`02b96bdf : 00000000`00000000 fffff880`04a25ca0 00000000`00000000 00000000`016d7ec0 : nt!KeFlushProcessWriteBuffers+0x65
fffff880`04a256e0 fffff800`02be6416 : 00000000`001ba350 fffff800`00000100 fffff880`04a25870 00000000`00000000 : nt!ExpGetProcessInformation+0x7f
fffff880`04a25830 fffff800`02be6e6d : 00000000`001ba350 fffff960`001a61b3 00000000`001ba350 00000000`00000b3a : nt!ExpQuerySystemInformation+0xfb4
fffff880`04a25be0 fffff800`028d9e53 : fffffa80`05d1b640 00000000`00000001 fffff880`04a25ca0 fffffa80`03793cc0 : nt!NtQuerySystemInformation+0x4d
fffff880`04a25c20 00000000`77b8161a : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13
00000000`0163f9f8 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x77b8161a


STACK_COMMAND:  kb

SYMBOL_NAME:  ANALYSIS_INCONCLUSIVE

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: Unknown_Module

IMAGE_NAME:  Unknown_Image

DEBUG_FLR_IMAGE_TIMESTAMP:  0

FAILURE_BUCKET_ID:  X64_CLOCK_WATCHDOG_TIMEOUT_2_PROC_ANALYSIS_INCONCLUSIVE

BUCKET_ID:  X64_CLOCK_WATCHDOG_TIMEOUT_2_PROC_ANALYSIS_INCONCLUSIVE

Followup: MachineOwner
---------

*******************
******************
*****************

QEMU command line:


/usr/bin/qemu-system-x86_64 -machine accel=kvm -name colvm42 -S -machine pc-1.0,accel=kvm,usb=off -cpu Nehalem -m 4096 -realtime mlock=off -smp 2,maxcpus=160,sockets=80,cores=2,threads=1 -uuid 3b839558-a7df-4d70-9f06-e2a0c4b8d095 -smbios type=1,manufacturer=oVirt,product=oVirt Node,version=20-3,serial=75E79C3D-B774-11DF-935C-0019998D0D3A,uuid=3b839558-a7df-4d70-9f06-e2a0c4b8d095 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/colvm42.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2014-06-16T15:55:55,driftfix=slew -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x5 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/mnt/10.10.30.251:_var_nas1_OVirtIB/965ca3b6-4f9c-4e81-b6e8-5ed4a9e58545/images/f2132f99-775c-4943-93e6-a56a9f42bf30/08d14339-d111-4eee-a91e-bbae2f681c52,if=none,id=drive-virtio-disk0,format=raw,serial=f2132f99-775c-4943-93e6-a56a9f42bf30,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=44,id=hostnet0,vhost=on,vhostfd=45 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:0c:29:b4:38:19,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/3b839558-a7df-4d70-9f06-e2a0c4b8d095.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/3b839558-a7df-4d70-9f06-e2a0c4b8d095.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -chardev pty,id=charconsole0 -device virtconsole,chardev=charconsole0,id=console0 -spice tls-port=5908,addr=192.168.11.44,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -device qxl-vga,id=video0,ram_size=67108864,vram_size=33554432,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8

Comment 1 Markus Stockhausen 2014-06-17 15:10:14 UTC
The crash occurred again on several VMs. It happened while a single VM was starting.
We run that node in an oVirt NFS environment and collect OS data, so I am attaching graphs of everything.

1) CPU of node colovn04 - the hypervisor node
2) Timedrift of node colovn04 (just in case that helps)
3) Memory usage of node colovn04 - yellow are KSM pages - the black line shows "uncompressed" KSM pages
4) Infiniband interface bytes - NFS is residing on that interface

5) NFS server IO Bytes
6) NFS Server IOs
7) NFS server average IO times
8) NFS server CPU usage

Comment 2 Markus Stockhausen 2014-06-17 15:15:54 UTC
Created attachment 909635 [details]
1 cpu hypervisor

Comment 3 Markus Stockhausen 2014-06-17 15:16:24 UTC
Created attachment 909636 [details]
2 timedrift hypervisor

Comment 4 Markus Stockhausen 2014-06-17 15:16:57 UTC
Created attachment 909638 [details]
3 memory hypervisor

Comment 5 Markus Stockhausen 2014-06-17 15:17:39 UTC
Created attachment 909639 [details]
4 infiniband/NFS hypervisor

Comment 6 Markus Stockhausen 2014-06-17 15:18:14 UTC
Created attachment 909640 [details]
5 - io bytes NFS server

Comment 7 Markus Stockhausen 2014-06-17 15:18:50 UTC
Created attachment 909641 [details]
6 IOs NFS

Comment 8 Markus Stockhausen 2014-06-17 15:19:17 UTC
Created attachment 909642 [details]
7 io times NFS

Comment 9 Markus Stockhausen 2014-06-17 15:19:46 UTC
Created attachment 909643 [details]
8 cpu nfs server

Comment 10 Markus Stockhausen 2014-06-17 15:20:30 UTC
Created attachment 909644 [details]
9 swap io hypervisor

Comment 11 Markus Stockhausen 2014-06-17 15:21:00 UTC
Created attachment 909645 [details]
10 swap usage hypervisor

Comment 12 Markus Stockhausen 2014-06-17 15:23:47 UTC
Attachments 9 and 10 show swap I/O and swap usage on the hypervisor node.

The kernel on the hypervisor is 3.14.4-200.fc20.x86_64.

Comment 13 Cole Robinson 2014-06-17 15:46:50 UTC
There's a kbase article about this:

https://access.redhat.com/site/solutions/755943
https://bugzilla.redhat.com/show_bug.cgi?id=990824

The suggested solution is to pass this with libvirt:

<domain ...>
  <features>
    <hyperv>
      <relaxed state='on'/>
    </hyperv>
  </features>
</domain>

So oVirt should be doing that for Windows 7 guests; reassigning.
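
For illustration only, here is a minimal Python sketch of how a management layer could add that element to an existing libvirt domain XML string. It is not the VDSM patch; the function name `enable_hyperv_relaxed` and the sample domain are hypothetical, and only the standard library is used.

```python
import xml.etree.ElementTree as ET


def enable_hyperv_relaxed(domain_xml: str) -> str:
    """Return domain_xml with <hyperv><relaxed state='on'/></hyperv> under <features>.

    Hypothetical helper: missing <features>, <hyperv> or <relaxed>
    elements are created, existing ones are reused.
    """
    root = ET.fromstring(domain_xml)

    features = root.find("features")
    if features is None:
        features = ET.SubElement(root, "features")

    hyperv = features.find("hyperv")
    if hyperv is None:
        hyperv = ET.SubElement(features, "hyperv")

    relaxed = hyperv.find("relaxed")
    if relaxed is None:
        relaxed = ET.SubElement(hyperv, "relaxed")
    relaxed.set("state", "on")

    return ET.tostring(root, encoding="unicode")


# Made-up minimal domain definition, used only to exercise the helper.
example = "<domain type='kvm'><name>win7</name><features><acpi/></features></domain>"
patched = enable_hyperv_relaxed(example)
print(patched)
```

Calling the helper twice on the same XML leaves a single `<relaxed>` element, so it is safe to apply to domains that already carry the setting.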

Comment 14 Markus Stockhausen 2014-06-17 19:47:59 UTC
Similar bug where qemu parametrization could be enhanced: BZ1107835

Comment 15 Federico Simoncelli 2014-06-18 11:10:42 UTC
Francesco, can we handle this?

(In reply to Cole Robinson from comment #13)
> There's a kbase article about this:
> 
> https://access.redhat.com/site/solutions/755943
> https://bugzilla.redhat.com/show_bug.cgi?id=990824
> 
> The suggested solution is to pass this with libvirt:
> 
> <domain ...>
>   <features>
>     <hyperv>
>       <relaxed state='on'/>
>     </hyperv>
>   </features>
> </domain>
> 
> So oVirt should be doing that for Windows 7 guests; reassigning.

Comment 16 Francesco Romani 2014-06-18 11:20:07 UTC
Yes, there are already plans and a patch in progress:
https://bugzilla.redhat.com/show_bug.cgi?id=1083529
http://gerrit.ovirt.org/#/c/27619/3

However, a few details still need to be sorted out to have proper support.

Comment 17 Michal Skrivanek 2014-06-25 14:41:26 UTC
(fixing product)

Comment 18 Michal Skrivanek 2014-06-25 14:50:20 UTC
(In reply to Francesco Romani from comment #16)
We may try to expedite the hv_relaxed part; that's the simplest one.
Since it's not a regression, AFAIK, I would not block 3.5 on it for now.

Comment 19 Markus Stockhausen 2014-06-25 16:52:54 UTC
A short update. So far I cannot tell whether the bug still occurs with the "relaxed" setting. The errors were sporadic (about once every two weeks), so no direct before/after comparison is possible yet.

For setting the parameter I simply rely on Cole Robinson's comment 13.

Comment 20 Francesco Romani 2014-06-26 07:26:46 UTC
VDSM patch posted for review.

Comment 21 Francesco Romani 2014-06-27 09:00:09 UTC
VDSM patch merged, Engine patch posted

Comment 22 Francesco Romani 2014-07-18 08:31:29 UTC
It turns out the VDSM patch was merged after 3.5 was branched.
Posted backports:
http://gerrit.ovirt.org/#/c/30254/
http://gerrit.ovirt.org/#/c/30255/

Comment 23 Pavel Novotny 2014-08-05 14:36:29 UTC
Verified in vdsm-4.16.0-42.git3bfad86.el6.x86_64 (oVirt 3.5 beta2).

Windows guests now have the hv_relaxed flag enabled, i.e., the QEMU process now looks like:

10774 ?        Sl     0:10 /usr/libexec/qemu-kvm -name win7 -S -M rhel6.5.0 -cpu Nehalem,hv_relaxed -enable-kvm -m 1024 ...
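
The same check can be scripted. The following Python sketch is a hypothetical helper, not part of VDSM or QEMU: it splits a QEMU command line and reports whether the `-cpu` option carries `hv_relaxed`. It assumes the flag is spelled exactly as in the process listing above; other spellings are not handled.

```python
import shlex
from typing import List


def cpu_flags(qemu_cmdline: str) -> List[str]:
    """Return the comma-separated parts of the -cpu argument (model first), or [] if absent."""
    args = shlex.split(qemu_cmdline)
    for i, arg in enumerate(args[:-1]):
        if arg == "-cpu":
            return args[i + 1].split(",")
    return []


def has_hv_relaxed(qemu_cmdline: str) -> bool:
    """True if hv_relaxed appears among the CPU flags (everything after the model name)."""
    return "hv_relaxed" in cpu_flags(qemu_cmdline)[1:]


# Shortened command lines based on the description and on the listing above.
before = "/usr/bin/qemu-system-x86_64 -machine accel=kvm -name colvm42 -S -cpu Nehalem -m 4096"
after = "/usr/libexec/qemu-kvm -name win7 -S -M rhel6.5.0 -cpu Nehalem,hv_relaxed -enable-kvm -m 1024"

print(has_hv_relaxed(before))
print(has_hv_relaxed(after))
```

On a live host the command line would come from the process table (for example `ps` output) instead of a literal string.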

Comment 24 Sandro Bonazzola 2014-10-17 12:40:40 UTC
oVirt 3.5 has been released and should include the fix for this issue.

