Description of problem:
Windows 7 SP1 VM was killed with a BSOD during normal operation.

Version-Release number of selected component (if applicable):
Hardware: Xeon X5650
Fedora 20
qemu 1.6.2

How reproducible:
Unknown

Steps to Reproduce:
Don't know

Actual results:
VM crashed

Expected results:
VM should run

Additional info:
******************* ****************** *****************
Might be related to BZ990824
******************* ****************** *****************
Analysis of memory dump:

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

CLOCK_WATCHDOG_TIMEOUT (101)
An expected clock interrupt was not received on a secondary processor in an
MP system within the allocated interval. This indicates that the specified
processor is hung and not processing interrupts.
Arguments:
Arg1: 0000000000000061, Clock interrupt time out interval in nominal clock ticks.
Arg2: 0000000000000000, 0.
Arg3: fffff88002e40180, The PRCB address of the hung processor.
Arg4: 0000000000000001, 0.

Debugging Details:
------------------

Page a0f3f not present in the dump file. Type ".hh dbgerr004" for details
Unable to open image file: C:\Program Files (x86)\Debugging Tools for Windows (x86)\sym\hal.dll\4CE7C66949000\hal.dll
The system cannot find the file specified.
Unable to open image file: C:\Program Files (x86)\Debugging Tools for Windows (x86)\sym\hal.dll\4CE7C66949000\hal.dll
The system cannot find the file specified.

BUGCHECK_STR: CLOCK_WATCHDOG_TIMEOUT_2_PROC
DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT
PROCESS_NAME: WmiPrvSE.exe
CURRENT_IRQL: d

STACK_TEXT:
fffff880`04a25318 fffff800`02931a4a : 00000000`00000101 00000000`00000061 00000000`00000000 fffff880`02e40180 : nt!KeBugCheckEx
fffff880`04a25320 fffff800`028e46f7 : 00000000`00000000 fffff800`00000001 00000000`00026160 00000000`00000000 : nt! ?? ::FNODOBFM::`string'+0x4e3e
fffff880`04a253b0 fffff800`02826895 : fffff800`0284c3c0 fffff880`04a25560 fffff800`0284c3c0 fffffa80`00000000 : nt!KeUpdateSystemTime+0x377
fffff880`04a254b0 fffff800`028d7113 : fffff800`02a55e80 00000000`00000001 ffffffff`fffffd80 00000000`00000005 : hal!HalpHpetClockInterrupt+0x8d
fffff880`04a254e0 fffff800`028af939 : 00000000`016d8330 00000000`000007ff fffffa80`05d4f060 fffff800`02b97abd : nt!KiInterruptDispatchNoLock+0x163
fffff880`04a25670 fffff800`02b96bdf : 00000000`00000000 fffff880`04a25ca0 00000000`00000000 00000000`016d7ec0 : nt!KeFlushProcessWriteBuffers+0x65
fffff880`04a256e0 fffff800`02be6416 : 00000000`001ba350 fffff800`00000100 fffff880`04a25870 00000000`00000000 : nt!ExpGetProcessInformation+0x7f
fffff880`04a25830 fffff800`02be6e6d : 00000000`001ba350 fffff960`001a61b3 00000000`001ba350 00000000`00000b3a : nt!ExpQuerySystemInformation+0xfb4
fffff880`04a25be0 fffff800`028d9e53 : fffffa80`05d1b640 00000000`00000001 fffff880`04a25ca0 fffffa80`03793cc0 : nt!NtQuerySystemInformation+0x4d
fffff880`04a25c20 00000000`77b8161a : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13
00000000`0163f9f8 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x77b8161a

STACK_COMMAND: kb
SYMBOL_NAME: ANALYSIS_INCONCLUSIVE
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: Unknown_Module
IMAGE_NAME: Unknown_Image
DEBUG_FLR_IMAGE_TIMESTAMP: 0
FAILURE_BUCKET_ID: X64_CLOCK_WATCHDOG_TIMEOUT_2_PROC_ANALYSIS_INCONCLUSIVE
BUCKET_ID: X64_CLOCK_WATCHDOG_TIMEOUT_2_PROC_ANALYSIS_INCONCLUSIVE
Followup: MachineOwner
---------
******************* ****************** *****************
QEMU command line:
/usr/bin/qemu-system-x86_64 -machine accel=kvm -name colvm42 -S -machine pc-1.0,accel=kvm,usb=off -cpu Nehalem -m 4096 -realtime mlock=off -smp 2,maxcpus=160,sockets=80,cores=2,threads=1 -uuid 3b839558-a7df-4d70-9f06-e2a0c4b8d095 -smbios type=1,manufacturer=oVirt,product=oVirt Node,version=20-3,serial=75E79C3D-B774-11DF-935C-0019998D0D3A,uuid=3b839558-a7df-4d70-9f06-e2a0c4b8d095 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/colvm42.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2014-06-16T15:55:55,driftfix=slew -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x5 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/mnt/10.10.30.251:_var_nas1_OVirtIB/965ca3b6-4f9c-4e81-b6e8-5ed4a9e58545/images/f2132f99-775c-4943-93e6-a56a9f42bf30/08d14339-d111-4eee-a91e-bbae2f681c52,if=none,id=drive-virtio-disk0,format=raw,serial=f2132f99-775c-4943-93e6-a56a9f42bf30,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=44,id=hostnet0,vhost=on,vhostfd=45 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:0c:29:b4:38:19,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/3b839558-a7df-4d70-9f06-e2a0c4b8d095.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/3b839558-a7df-4d70-9f06-e2a0c4b8d095.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -chardev pty,id=charconsole0 -device virtconsole,chardev=charconsole0,id=console0 -spice tls-port=5908,addr=192.168.11.44,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -device qxl-vga,id=video0,ram_size=67108864,vram_size=33554432,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8
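The STACK_TEXT in the dump analysis above is hard to scan as raw frames. One way to read it is to keep only the call-site column of each frame line. The Python sketch below does that. It assumes the `<Child-SP> <RetAddr> : <four args> : <Call Site>` layout seen in this dump. The helper name `call_chain` is hypothetical and is not part of WinDbg or any Microsoft tooling.

Applied to the middle frames, the helper shows that the HPET clock interrupt was taken while the processor was inside `nt!KeFlushProcessWriteBuffers`. That function was reached from `nt!NtQuerySystemInformation` further down the stack.

```python
import re

# Assumed WinDbg frame layout (as in this dump):
#   <Child-SP> <RetAddr> : <arg1> <arg2> <arg3> <arg4> : <Call Site>
FRAME_RE = re.compile(r"^[0-9a-f`]+ [0-9a-f`]+ : (?:[0-9a-f`]+ ){4}: (.+)$")


def call_chain(stack_text: str) -> list[str]:
    """Hypothetical helper: call sites from innermost (top of stack) to outermost."""
    sites = []
    for line in stack_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            # Drop the "+0x..." offset so only the symbol name remains.
            sites.append(match.group(1).split("+", 1)[0])
    return sites


STACK = """\
fffff880`04a254b0 fffff800`028d7113 : fffff800`02a55e80 00000000`00000001 ffffffff`fffffd80 00000000`00000005 : hal!HalpHpetClockInterrupt+0x8d
fffff880`04a254e0 fffff800`028af939 : 00000000`016d8330 00000000`000007ff fffffa80`05d4f060 fffff800`02b97abd : nt!KiInterruptDispatchNoLock+0x163
fffff880`04a25670 fffff800`02b96bdf : 00000000`00000000 fffff880`04a25ca0 00000000`00000000 00000000`016d7ec0 : nt!KeFlushProcessWriteBuffers+0x65
"""

if __name__ == "__main__":
    # prints: hal!HalpHpetClockInterrupt <- nt!KiInterruptDispatchNoLock <- nt!KeFlushProcessWriteBuffers
    print(" <- ".join(call_chain(STACK)))
```

Lines that are not frames, such as `STACK_COMMAND: kb`, are ignored. The whole `!analyze -v` output can therefore be pasted in unchanged.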
Crash occurred again on several VMs. It happened while a single VM was being started.
We are running that node in an oVirt NFS environment and collecting OS data, so I am attaching graphs of everything:
1) CPU of node colovn04 (the hypervisor node)
2) Time drift of node colovn04 (just in case that helps)
3) Memory usage of node colovn04: yellow are KSM pages; the black line shows "uncompressed" KSM pages
4) InfiniBand interface bytes (NFS resides on that interface)
5) NFS server I/O bytes
6) NFS server IOs
7) NFS server average I/O times
8) NFS server CPU usage
Created attachment 909635: 1 cpu hypervisor
Created attachment 909636: 2 timedrift hypervisor
Created attachment 909638: 3 memory hypervisor
Created attachment 909639: 4 infiniband/NFS hypervisor
Created attachment 909640: 5 - io bytes NFS server
Created attachment 909641: 6 IOs NFS
Created attachment 909642: 7 io times NFS
Created attachment 909643: 8 cpu nfs server
Created attachment 909644: 9 swap io hypervisor
Created attachment 909645: 10 swap usage hypervisor
Attachments 9 and 10 show swap I/O and swap usage on the hypervisor node. The kernel on the hypervisor is 3.14.4-200.fc20.x86_64.
There's a kbase article about this:
https://access.redhat.com/site/solutions/755943
https://bugzilla.redhat.com/show_bug.cgi?id=990824

The suggested solution is to pass this with libvirt:

<domain ...>
  <features>
    <hyperv>
      <relaxed state='on'/>
    </hyperv>
  </features>
</domain>

So oVirt should be doing that for Windows 7 guests; reassigning.
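The suggestion above is a change to the libvirt domain XML. For a one-off test on a plain libvirt host, the edit can be sketched with Python's standard `xml.etree.ElementTree`.

This is not how VDSM or oVirt Engine implement the fix, and the function name `enable_hv_relaxed` is hypothetical. The sketch assumes the domain XML uses no namespaces. It creates `<features>` and `<hyperv>` only when they are missing, so applying it twice still leaves a single `<relaxed>` element.

```python
import xml.etree.ElementTree as ET


def _child(parent: ET.Element, tag: str) -> ET.Element:
    """Return the first direct child named `tag`, creating it if absent."""
    node = parent.find(tag)
    if node is None:
        node = ET.SubElement(parent, tag)
    return node


def enable_hv_relaxed(domain_xml: str) -> str:
    """Hypothetical helper: ensure <features><hyperv><relaxed state='on'/> in a libvirt domain XML string."""
    root = ET.fromstring(domain_xml)
    if root.tag != "domain":
        raise ValueError("expected a libvirt <domain> document")
    relaxed = _child(_child(_child(root, "features"), "hyperv"), "relaxed")
    relaxed.set("state", "on")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "<domain type='kvm'><name>colvm42</name><features><acpi/></features></domain>"
    print(enable_hv_relaxed(sample))
```

Existing `<features>` children, such as `<acpi/>`, are preserved. On a host managed only by libvirt, the same change can be made by hand with `virsh edit`.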
Similar bug where the QEMU parameterization could be improved: BZ1107835
Francesco, can we handle this?

(In reply to Cole Robinson from comment #13)
> There's a kbase article about this:
>
> https://access.redhat.com/site/solutions/755943
> https://bugzilla.redhat.com/show_bug.cgi?id=990824
>
> The suggested solution is to pass this with libvirt:
>
> <domain ...>
>   <features>
>     <hyperv>
>       <relaxed state='on'/>
>     </hyperv>
>   </features>
> </domain>
>
> So ovirt should be doing that for windows 7 guests, reassigning
Yes, there are already plans and a patch floating around:
https://bugzilla.redhat.com/show_bug.cgi?id=1083529
http://gerrit.ovirt.org/#/c/27619/3
However, a few details still need to be sorted out before support is complete.
(fixing product)
(In reply to Francesco Romani from comment #16)
We may try to expedite the hv_relaxed part; that's the simplest one. Since it's not a regression, AFAIK, I would not block 3.5 for now.
A short update: so far I cannot tell whether the bug still occurs with the "relaxed" setting. The errors were sporadic (about once every two weeks), so a direct before/after comparison is not yet possible. For setting the parameter I simply relied on Cole Robinson's comment 13.
VDSM patch posted for review.
VDSM patch merged, Engine patch posted
It turns out the VDSM patch was merged after 3.5 branched. Posted backports:
http://gerrit.ovirt.org/#/c/30254/
http://gerrit.ovirt.org/#/c/30255/
Verified in vdsm-4.16.0-42.git3bfad86.el6.x86_64 (oVirt 3.5 beta2). Windows guests now have the hv_relaxed flag enabled, i.e., the QEMU process now looks like:
10774 ? Sl 0:10 /usr/libexec/qemu-kvm -name win7 -S -M rhel6.5.0 -cpu Nehalem,hv_relaxed -enable-kvm -m 1024 ...
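The same check can be scripted for other guests by inspecting the `-cpu` argument of the QEMU command line. The minimal Python sketch below uses hypothetical helper names (`cpu_flags`, `has_hv_relaxed`). It assumes the command line is available as one string, for example from `ps` output like the line above. It also assumes the flag is spelled exactly `hv_relaxed`, as in this verification; other spellings are not matched.

```python
import shlex


def cpu_flags(cmdline: str) -> list[str]:
    """Hypothetical helper: items after the CPU model in a QEMU '-cpu MODEL[,flag,...]' argument."""
    args = shlex.split(cmdline)
    for opt, value in zip(args, args[1:]):
        if opt == "-cpu":
            # Drop the model name (e.g. "Nehalem") and keep the comma-separated flags.
            return value.split(",")[1:]
    return []


def has_hv_relaxed(cmdline: str) -> bool:
    """Hypothetical helper: True if '-cpu' lists the hv_relaxed flag."""
    return "hv_relaxed" in cpu_flags(cmdline)


if __name__ == "__main__":
    after = "/usr/libexec/qemu-kvm -name win7 -S -M rhel6.5.0 -cpu Nehalem,hv_relaxed -enable-kvm -m 1024"
    before = "/usr/bin/qemu-system-x86_64 -machine accel=kvm -name colvm42 -S -cpu Nehalem -m 4096"
    print(has_hv_relaxed(after), has_hv_relaxed(before))  # True False
```

Run against the original crash report's command line (`-cpu Nehalem`), the check returns False. Against the verified vdsm-4.16 command line, it returns True.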
oVirt 3.5 has been released and should include the fix for this issue.