Bug 1110305

Summary: BSOD - CLOCK_WATCHDOG_TIMEOUT_2 - Win 7SP1 guest, need to set hv_relaxed
Product: [Retired] oVirt Reporter: Markus Stockhausen <mst>
Component: vdsm Assignee: Francesco Romani <fromani>
Status: CLOSED CURRENTRELEASE QA Contact: Pavel Novotny <pnovotny>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.4 CC: amit.shah, bazulay, berrange, cfergeau, crobinso, dougsland, dwmw2, fromani, fsimonce, gklein, iheim, itamar, mavital, mgoldboi, michal.skrivanek, pbonzini, rbalakri, rjones, scottt.tw, virt-maint, yeylon
Target Milestone: --- Keywords: Triaged
Target Release: 3.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: virt
Fixed In Version: ovirt-3.5.0-beta2 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-10-17 12:40:40 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1073943, 1083529    
Attachments:
Description Flags
1 cpu hypervisor
none
2 timedrift hypervisor
none
3 memory hypervisor
none
4 infiniband/NFS hypervisor
none
5 - io bytes NFS server
none
6 IOs NFS
none
7 io times NFS
none
8 cpu nfs server
none
9 swap io hypervisor
none
10 swap usage hypervisor
none

Description Markus Stockhausen 2014-06-17 12:11:33 UTC
Description of problem:


Windows 7 SP1 VM was killed with BSOD during normal operation. 

Version-Release number of selected component (if applicable):

Hardware XEON x5650 
Fedora 20
qemu 1.6.2

How reproducible:

Unknown

Steps to Reproduce:

Don't know

Actual results:

VM crashed


Expected results:

VM should run

Additional info:

*******************
******************
*****************

Might be related to BZ990824 

*******************
******************
*****************

Analysis of memory dump:

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

CLOCK_WATCHDOG_TIMEOUT (101)
An expected clock interrupt was not received on a secondary processor in an
MP system within the allocated interval. This indicates that the specified
processor is hung and not processing interrupts.
Arguments:
Arg1: 0000000000000061, Clock interrupt time out interval in nominal clock ticks.
Arg2: 0000000000000000, 0.
Arg3: fffff88002e40180, The PRCB address of the hung processor.
Arg4: 0000000000000001, 0.

Debugging Details:
------------------

Page a0f3f not present in the dump file. Type ".hh dbgerr004" for details
Unable to open image file: C:\Program Files (x86)\Debugging Tools for Windows (x86)\sym\hal.dll\4CE7C66949000\hal.dll
The system cannot find the specified file.

Unable to open image file: C:\Program Files (x86)\Debugging Tools for Windows (x86)\sym\hal.dll\4CE7C66949000\hal.dll
The system cannot find the specified file.


BUGCHECK_STR:  CLOCK_WATCHDOG_TIMEOUT_2_PROC

DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

PROCESS_NAME:  WmiPrvSE.exe

CURRENT_IRQL:  d

STACK_TEXT:  
fffff880`04a25318 fffff800`02931a4a : 00000000`00000101 00000000`00000061 00000000`00000000 fffff880`02e40180 : nt!KeBugCheckEx
fffff880`04a25320 fffff800`028e46f7 : 00000000`00000000 fffff800`00000001 00000000`00026160 00000000`00000000 : nt! ?? ::FNODOBFM::`string'+0x4e3e
fffff880`04a253b0 fffff800`02826895 : fffff800`0284c3c0 fffff880`04a25560 fffff800`0284c3c0 fffffa80`00000000 : nt!KeUpdateSystemTime+0x377
fffff880`04a254b0 fffff800`028d7113 : fffff800`02a55e80 00000000`00000001 ffffffff`fffffd80 00000000`00000005 : hal!HalpHpetClockInterrupt+0x8d
fffff880`04a254e0 fffff800`028af939 : 00000000`016d8330 00000000`000007ff fffffa80`05d4f060 fffff800`02b97abd : nt!KiInterruptDispatchNoLock+0x163
fffff880`04a25670 fffff800`02b96bdf : 00000000`00000000 fffff880`04a25ca0 00000000`00000000 00000000`016d7ec0 : nt!KeFlushProcessWriteBuffers+0x65
fffff880`04a256e0 fffff800`02be6416 : 00000000`001ba350 fffff800`00000100 fffff880`04a25870 00000000`00000000 : nt!ExpGetProcessInformation+0x7f
fffff880`04a25830 fffff800`02be6e6d : 00000000`001ba350 fffff960`001a61b3 00000000`001ba350 00000000`00000b3a : nt!ExpQuerySystemInformation+0xfb4
fffff880`04a25be0 fffff800`028d9e53 : fffffa80`05d1b640 00000000`00000001 fffff880`04a25ca0 fffffa80`03793cc0 : nt!NtQuerySystemInformation+0x4d
fffff880`04a25c20 00000000`77b8161a : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13
00000000`0163f9f8 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x77b8161a


STACK_COMMAND:  kb

SYMBOL_NAME:  ANALYSIS_INCONCLUSIVE

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: Unknown_Module

IMAGE_NAME:  Unknown_Image

DEBUG_FLR_IMAGE_TIMESTAMP:  0

FAILURE_BUCKET_ID:  X64_CLOCK_WATCHDOG_TIMEOUT_2_PROC_ANALYSIS_INCONCLUSIVE

BUCKET_ID:  X64_CLOCK_WATCHDOG_TIMEOUT_2_PROC_ANALYSIS_INCONCLUSIVE

Followup: MachineOwner
---------

*******************
******************
*****************

QEMU command line:


/usr/bin/qemu-system-x86_64 -machine accel=kvm -name colvm42 -S -machine pc-1.0,accel=kvm,usb=off -cpu Nehalem -m 4096 -realtime mlock=off -smp 2,maxcpus=160,sockets=80,cores=2,threads=1 -uuid 3b839558-a7df-4d70-9f06-e2a0c4b8d095 -smbios type=1,manufacturer=oVirt,product=oVirt Node,version=20-3,serial=75E79C3D-B774-11DF-935C-0019998D0D3A,uuid=3b839558-a7df-4d70-9f06-e2a0c4b8d095 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/colvm42.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2014-06-16T15:55:55,driftfix=slew -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x5 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/mnt/10.10.30.251:_var_nas1_OVirtIB/965ca3b6-4f9c-4e81-b6e8-5ed4a9e58545/images/f2132f99-775c-4943-93e6-a56a9f42bf30/08d14339-d111-4eee-a91e-bbae2f681c52,if=none,id=drive-virtio-disk0,format=raw,serial=f2132f99-775c-4943-93e6-a56a9f42bf30,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=44,id=hostnet0,vhost=on,vhostfd=45 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:0c:29:b4:38:19,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/3b839558-a7df-4d70-9f06-e2a0c4b8d095.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/3b839558-a7df-4d70-9f06-e2a0c4b8d095.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev 
spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -chardev pty,id=charconsole0 -device virtconsole,chardev=charconsole0,id=console0 -spice tls-port=5908,addr=192.168.11.44,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -device qxl-vga,id=video0,ram_size=67108864,vram_size=33554432,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8

Comment 1 Markus Stockhausen 2014-06-17 15:10:14 UTC
The crash occurred again on several VMs. This happened during the start of a single VM.
We are running that node in an oVirt NFS environment and collecting OS data, so I am attaching graphs of everything.

1) CPU of node colovn04 - the hypervisor node
2) Timedrift of node colovn04 (just in case that helps)
3) Memory usage of node colovn04 - yellow are KSM pages - the black line shows "uncompressed" KSM pages
4) Infiniband interface bytes - NFS is residing on that interface

5) NFS server IO Bytes
6) NFS Server IOs
7) NFS server average IO times
8) NFS server CPU usage

Comment 2 Markus Stockhausen 2014-06-17 15:15:54 UTC
Created attachment 909635 [details]
1 cpu hypervisor

Comment 3 Markus Stockhausen 2014-06-17 15:16:24 UTC
Created attachment 909636 [details]
2 timedrift hypervisor

Comment 4 Markus Stockhausen 2014-06-17 15:16:57 UTC
Created attachment 909638 [details]
3 memory hypervisor

Comment 5 Markus Stockhausen 2014-06-17 15:17:39 UTC
Created attachment 909639 [details]
4 infiniband/NFS hypervisor

Comment 6 Markus Stockhausen 2014-06-17 15:18:14 UTC
Created attachment 909640 [details]
5 - io bytes NFS server

Comment 7 Markus Stockhausen 2014-06-17 15:18:50 UTC
Created attachment 909641 [details]
6 IOs NFS

Comment 8 Markus Stockhausen 2014-06-17 15:19:17 UTC
Created attachment 909642 [details]
7 io times NFS

Comment 9 Markus Stockhausen 2014-06-17 15:19:46 UTC
Created attachment 909643 [details]
8 cpu nfs server

Comment 10 Markus Stockhausen 2014-06-17 15:20:30 UTC
Created attachment 909644 [details]
9 swap io hypervisor

Comment 11 Markus Stockhausen 2014-06-17 15:21:00 UTC
Created attachment 909645 [details]
10 swap usage hypervisor

Comment 12 Markus Stockhausen 2014-06-17 15:23:47 UTC
9/10 show swap IOs and usage on the hypervisor node

Kernel on the hypervisor is 3.14.4-200.fc20.x86_64

Comment 13 Cole Robinson 2014-06-17 15:46:50 UTC
There's a kbase article about this:

https://access.redhat.com/site/solutions/755943
https://bugzilla.redhat.com/show_bug.cgi?id=990824

The suggested solution is to pass this with libvirt:

<domain ...>
  <features>
    <hyperv>
      <relaxed state='on'/>
    </hyperv>
  </features>
</domain>

So oVirt should be doing that for Windows 7 guests; reassigning.
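For illustration, the libvirt setting above could be injected into a domain definition with Python's standard library as sketched below. This is a hypothetical helper, not the actual VDSM patch (see comment 20 for that); the function name `add_hyperv_relaxed` is invented for this example.

```python
import xml.etree.ElementTree as ET

def add_hyperv_relaxed(domain_xml):
    """Add <features><hyperv><relaxed state='on'/></hyperv></features>
    to a libvirt domain XML string. Hypothetical helper for illustration;
    not taken from the VDSM source."""
    root = ET.fromstring(domain_xml)
    # Reuse existing <features>/<hyperv> elements if present.
    features = root.find('features')
    if features is None:
        features = ET.SubElement(root, 'features')
    hyperv = features.find('hyperv')
    if hyperv is None:
        hyperv = ET.SubElement(features, 'hyperv')
    relaxed = hyperv.find('relaxed')
    if relaxed is None:
        relaxed = ET.SubElement(hyperv, 'relaxed')
    relaxed.set('state', 'on')
    return ET.tostring(root, encoding='unicode')

xml_in = "<domain type='kvm'><name>win7</name></domain>"
xml_out = add_hyperv_relaxed(xml_in)
# xml_out now contains a features/hyperv/relaxed element with state='on'
```

With this element present, libvirt translates it into the `hv_relaxed` CPU flag on the QEMU command line, which is what comment 23 later verifies.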

Comment 14 Markus Stockhausen 2014-06-17 19:47:59 UTC
Similar bug where qemu parametrization could be enhanced: BZ1107835

Comment 15 Federico Simoncelli 2014-06-18 11:10:42 UTC
Francesco, can we handle this?

(In reply to Cole Robinson from comment #13)
> There's a kbase article about this:
> 
> https://access.redhat.com/site/solutions/755943
> https://bugzilla.redhat.com/show_bug.cgi?id=990824
> 
> The suggested solution is to pass this with libvirt:
> 
> <domain ...>
>   <features>
>     <hyperv>
>       <relaxed state='on'/>
>     </hyperv>
>   </features>
> </domain>
> 
> So ovirt should be doing that for windows 7 guests, reassigning

Comment 16 Francesco Romani 2014-06-18 11:20:07 UTC
Yes, there are already plans and patch floating:
https://bugzilla.redhat.com/show_bug.cgi?id=1083529
http://gerrit.ovirt.org/#/c/27619/3

However, a few details still need to be sorted out to have proper support.

Comment 17 Michal Skrivanek 2014-06-25 14:41:26 UTC
(fixing product)

Comment 18 Michal Skrivanek 2014-06-25 14:50:20 UTC
(In reply to Francesco Romani from comment #16)
we may try to expedite the hv_relaxed part…that's the simplest one
since it's not a regression, AFAIK, I'd not block 3.5 for now

Comment 19 Markus Stockhausen 2014-06-25 16:52:54 UTC
A short update. So far I cannot tell whether the bug still occurs with the "relaxed" setting. The errors were sporadic (once in two weeks), so no direct before/after comparison is possible.

For setting the parameter I simply relied on Cole Robinson's comment 13.

Comment 20 Francesco Romani 2014-06-26 07:26:46 UTC
VDSM patch posted for review.

Comment 21 Francesco Romani 2014-06-27 09:00:09 UTC
VDSM patch merged, Engine patch posted

Comment 22 Francesco Romani 2014-07-18 08:31:29 UTC
It turns out the VDSM patch was merged after 3.5 branched.
Posted backports:
http://gerrit.ovirt.org/#/c/30254/
http://gerrit.ovirt.org/#/c/30255/

Comment 23 Pavel Novotny 2014-08-05 14:36:29 UTC
Verified in vdsm-4.16.0-42.git3bfad86.el6.x86_64 (oVirt 3.5 beta2).

Windows guests now have the hv_relaxed flag enabled, i.e. the QEMU process command line now looks like:

10774 ?        Sl     0:10 /usr/libexec/qemu-kvm -name win7 -S -M rhel6.5.0 -cpu Nehalem,hv_relaxed -enable-kvm -m 1024 ...
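The verification above was done by inspecting the QEMU process command line by hand. A minimal sketch of the same check in Python (the helper `cpu_has_hv_relaxed` is hypothetical, written for this example only):

```python
def cpu_has_hv_relaxed(cmdline):
    """Return True if a QEMU command-line string enables hv_relaxed
    via the -cpu option. Illustrative only; mirrors the manual check
    done during verification."""
    args = cmdline.split()
    for i, arg in enumerate(args):
        if arg == '-cpu' and i + 1 < len(args):
            # -cpu value is "model,flag1,flag2,..."; skip the model name.
            flags = args[i + 1].split(',')
            return 'hv_relaxed' in flags[1:]
    return False

cmd = ("/usr/libexec/qemu-kvm -name win7 -S -M rhel6.5.0 "
       "-cpu Nehalem,hv_relaxed -enable-kvm -m 1024")
print(cpu_has_hv_relaxed(cmd))  # → True
```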

Comment 24 Sandro Bonazzola 2014-10-17 12:40:40 UTC
oVirt 3.5 has been released and should include the fix for this issue.