Bug 612460
| Summary: | [WHQL] [vhost:on]W2k8-32 guest hang during virtio-win NDISTest6.5(MPE) testing | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Qunfang Zhang <qzhang> | ||||
| Component: | virtio-win | Assignee: | Yvugenfi <yvugenfi> | ||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Virtualization Bugs <virt-bugs> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | low | ||||||
| Version: | 6.0 | CC: | amit.shah, ddumas, lihuang, llim, mkenneth, mst, ndai, syeghiay, tburke, virt-maint, vrozenfe, ykaul, yvugenfi | ||||
| Target Milestone: | rc | ||||||
| Target Release: | --- | ||||||
| Hardware: | All | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2010-11-11 16:31:00 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
Hi, Yan, Vadim and Michael

I don't know which component this bug should be submitted to, so I am submitting it to qemu-kvm. Please correct it if I am wrong. If any additional information is needed, please ping me or add a comment.

As I see it there are two issues here:

1. host error seen in dmesg. I think this is same as bug 602607.
Please test with this host kernel to verify:
https://brewweb.devel.redhat.com/taskinfo?taskID=2574750
(fixes host error when guest ring is corrupted)
and report. We have bz 602607 to track that.

2. some bug which corrupts the ring. Possibly virtio win.
Assigning to that component for examination.

(In reply to comment #3)
> As I see it there are two issues here:
>
> 1. host error seen in dmesg. I think this is same as bug 602607.
> Please test with this host kernel to verify:
> https://brewweb.devel.redhat.com/taskinfo?taskID=2574750
> (fixes host error when guest ring is corrupted)
> and report. We have bz 602607 to track that.

Re-tested with this kernel. During the testing the guest did not hang anymore but got a BSOD, and the error code is 7e. Memory dump files:
http://10.66.65.120/mem-dump/MEMORY-win2k8-32-mstkernel.DMP
http://10.66.65.120/mem-dump/Mini070910-01win2k8-32-mstkernel.dmp

>
> 2. some bug which corrupts the ring. Possibly virtio win.
> Assigning to that component for examination.

(In reply to comment #3)
> 2. some bug which corrupts the ring. Possibly virtio win.
> Assigning to that component for examination.

Ring management didn't change for ages. I suggest testing without vhost first.

Maybe it's the published used one? We have it in userspace too.
You can try disabling with:
-global virtio-net-pci.publish_used=off

(In reply to comment #7)
> we have it in userspace too.
> You can try disabling with:
> -global virtio-net-pci.publish_used=off

Tested with -global virtio-net-pci.publish_used=off, and the guest did not hang anymore. It gets the same BSOD as in Comment 4.

Yan asks to test it w/o vhost. Can you please do it to isolate the issue?
(In reply to comment #9)
> Yan asks to test it w/o vhost. Can you please do it for isolating the issue?

The test is running now, will update result later.

Tested without vhost=on: the client guest got a BSOD, and the error code is 0x7e. Screenshot will be attached. But win2k8-32 cannot produce a dump file when using 6G memory. So, should I re-test with a smaller memory size to get the dump file?

Please use the following registry settings:
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl]
"AlwaysKeepMemoryDump"=dword:00000001

Reboot after applying.

Also please keep MPE memory dumps. If this is a dump that falls under MS errata, we might need to send it to MS for review in order to pass a test.

(In reply to comment #12)
> Please use following registry settings:
> [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl]
> "AlwaysKeepMemoryDump"=dword:00000001
>
> reboot after applying.
>
> Also please keep MPE memory dumps, if this is a dump that falls under MS errata
> - we might need to send it to MS for review in order to pass a test.

OK, will update bz after getting the result.

(In reply to comment #13)
> (In reply to comment #12)
> > Please use following registry settings:
> > [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl]
> > "AlwaysKeepMemoryDump"=dword:00000001
> >
> > reboot after applying.
> >
> > Also please keep MPE memory dumps, if this is a dump that falls under MS errata
> > - we might need to send it to MS for review in order to pass a test.
>
> OK, will update bz after get result.

Microsoft (R) Windows Debugger Version 6.10.0003.233 X86
Copyright (c) Microsoft Corporation. All rights reserved.
Loading Dump File [C:\Users\DTMLLUAdminUser\Desktop\MEMORY-#612460-without-vhost.DMP]
Kernel Summary Dump File: Only kernel address space is available

Symbol search path is: SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols
Executable search path is:
Windows Server 2008/Windows Vista SP1 Kernel Version 6001 (Service Pack 1) MP (4 procs) Free x86 compatible
Product: Server, suite: TerminalServer DataCenter SingleUserTS
Built by: 6001.18427.x86fre.vistasp1_gdr.100218-0019
Machine Name:
Kernel base = 0x8163b000 PsLoadedModuleList = 0x81752c70
Debug session time: Tue Jul 20 15:12:52.403 2010 (GMT-7)
System Uptime: 0 days 0:51:33.966
Loading Kernel Symbols
...............................................................
........................................................
Loading User Symbols
Loading unloaded module list
.......
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck 7E, {80000003, 98e337b3, 94d60c1c, 94d60918}

*** ERROR: Module load completed but symbols could not be loaded for ndprot61.sys
Probably caused by : ndprot61.sys ( ndprot61+2d7b3 )

Followup: MachineOwner
---------

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (7e)
This is a very common bugcheck. Usually the exception address pinpoints
the driver/function that caused the problem. Always note this address
as well as the link date of the driver/image that contains this address.
Arguments:
Arg1: 80000003, The exception code that was not handled
Arg2: 98e337b3, The address that the exception occurred at
Arg3: 94d60c1c, Exception Record Address
Arg4: 94d60918, Context Record Address

Debugging Details:
------------------

EXCEPTION_CODE: (HRESULT) 0x80000003 (2147483651) - One or more arguments are invalid

FAULTING_IP:
ndprot61+2d7b3
98e337b3 cc int 3

EXCEPTION_RECORD: 94d60c1c -- (.exr 0xffffffff94d60c1c)
ExceptionAddress: 98e337b3 (ndprot61+0x0002d7b3)
ExceptionCode: 80000003 (Break instruction exception)
ExceptionFlags: 00000000
NumberParameters: 3
Parameter[0]: 00000000
Parameter[1]: 940b1d78
Parameter[2]: 00000023

CONTEXT: 94d60918 -- (.cxr 0xffffffff94d60918)
eax=00000001 ebx=00000000 ecx=816f91de edx=00000023 esi=940b1d78 edi=00000000
eip=98e337b3 esp=94d60ce4 ebp=94d60cec iopl=0 nv up ei ng nz na pe nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00000286
ndprot61+0x2d7b3:
98e337b3 cc int 3
Resetting default scope

DEFAULT_BUCKET_ID: INTEL_CPU_MICROCODE_ZERO

BUGCHECK_STR: 0x7E

PROCESS_NAME: System

CURRENT_IRQL: 0

ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION} Breakpoint A breakpoint has been reached.

EXCEPTION_PARAMETER1: 00000000

EXCEPTION_PARAMETER2: 940b1d78

EXCEPTION_PARAMETER3: 00000023

LAST_CONTROL_TRANSFER: from 98e99232 to 98e337b3

STACK_TEXT:
WARNING: Stack unwind information not available. Following frames may be wrong.
94d60cec 98e99232 00000000 9861d624 0000000c ndprot61+0x2d7b3
94d60d10 98e996a6 9863e700 0000000f 00000064 ndprot61+0x93232
94d60d68 98eb47f0 00000000 9861d624 9861d640 ndprot61+0x936a6
94d60d7c 81810b54 9861d640 010c6c87 00000000 ndprot61+0xae7f0
94d60dc0 81669a5e 98eb47b0 9861d640 00000000 nt!PspSystemThreadStartup+0x9d
00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16

FOLLOWUP_IP:
ndprot61+2d7b3
98e337b3 cc int 3

SYMBOL_STACK_INDEX: 0

SYMBOL_NAME: ndprot61+2d7b3

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: ndprot61

IMAGE_NAME: ndprot61.sys

DEBUG_FLR_IMAGE_TIMESTAMP: 4b08150e

STACK_COMMAND: .cxr 0xffffffff94d60918 ; kb

FAILURE_BUCKET_ID: 0x7E_VRF_ndprot61+2d7b3

BUCKET_ID: 0x7E_VRF_ndprot61+2d7b3

Followup: MachineOwner
---------

Memory dump file:
http://10.66.65.120/mem-dump/MEMORY-%23612460-without-vhost.DMP

Summary of my test results:
Tested with "-global virtio-net-pci.publish_used=off" for both vhost=on and off.
1. vhost=on
Guest met a BSOD (error code is 0x7e)
http://10.66.65.120/mem-dump/MEMORY-2k8-32-6.5MPE-vhostON-usedoff-612460.DMP
2. vhost=off
The first time, the test passed. The second time, the guest got a BSOD (error code is 0x7e)
http://10.66.65.120/mem-dump/MEMORY-2k8-32-6.5MPE-vhostoff-usedoff-612460-7E.DMP

Hi, all
To make things clearer, I will change the status to ASSIGNED because it can be reproduced in virtio-win-1.1.8-0. After it changes to ON_QA again, I will verify it in the new version.
Thanks
Qunfang

PUBLISH_USED was removed in qemu-kvm-0.12.1.2-2.99.el6. Can you try that package?

Just to be clear - a BSOD in the MPE test still doesn't mean the test failed. We have ERRATA from MS on their BUG that might cause a BSOD. Each crash dump from the MPE test should be investigated to see if it is related to the MS bug or is something else related to us. I will check those dumps.

(In reply to comment #28)
> PUBLISH_USED was removed in qemu-kvm-0.12.1.2-2.99.el6. Can you try that
> package?

The spice related issue, bug 617463, is blocking me.
So will verify this bug after 617463 is fixed.

Tested with the qemu-kvm build provided by Alex in bug 617463:
https://brewweb.devel.redhat.com/taskinfo?taskID=2625606
Booted the guest with vhost=on and did NOT add the "publish_used" option. The guest does not hang any more. Guest got a BSOD and the error code is 0x7e.
Memory dump file:
http://10.66.65.120/mem-dump/MEMORY-2k8-32-MPE-fixed.DMP

qzhang -> Yan
Could you help to check if it is MS errata? Then I can change the status to VERIFIED. :-)
Thanks~

Hi, Yan

As described in Comment 33, does the BSOD fall into MS errata? And could I change the status to VERIFIED?

Thanks~

(In reply to comment #37)
> Hi, Yan
>
> As described in Comment 33, does the BSOD fall into MS errata? And could I
> change the status to VERIFIED?
>
> Thanks~

No, this is a traffic hang crash, please retest without vhost.

See comment #26 - it was already tested without vhost and passed.

According to Comment 38 and Comment 38, I re-tested with vhost=on using virtio-win-1.1.10-0, and this issue still exists with vhost=on. So will change the status to ASSIGNED.

Update:
Finished testing win7-64, win2k8-R2 and win2k8-64 without vhost, NDISTest6.5 passed. (In fact, all jobs passed.)
I will change the status to VERIFIED after finishing all guests.

Packages version:
virtio-win-1.1.12.0
kernel-2.6.32-66.el6
qemu-kvm-0.12.1.2-2.112.el6

(In reply to comment #50)
> Update:
> Finished testing win7-64, win2k8-R2 and win2k8-64 without vhost, NDISTest6.5
> passed. (In fact, all jobs passed.)
> I will change the status to VERIFIED after finish all guests.

Sorry, for win2k3 and winxp, there's no NDISTest6.5(MPE), so this job is passed without vhost=on.

>
> Packages version:
> virtio-win-1.1.12.0
> kernel-2.6.32-66.el6
> qemu-kvm-0.12.1.2-2.112.el6

According to Comment 51, I will change the status to VERIFIED.

Red Hat Enterprise Linux 6.0 is now available and should resolve the problem described in this bug report. This report is therefore being closed with a resolution of CURRENTRELEASE. You may reopen this bug report if the solution does not work for you.
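When several dumps from different runs have to be compared (vhost on/off, publish_used on/off), the one-line bugcheck summary from WinDbg is the quickest thing to diff. The following is a minimal sketch, assuming the summary line has the `BugCheck 7E, {a, b, c, d}` shape shown in the debugger output above. The function name `parse_bugcheck` is hypothetical, and the argument labels are simply copied from the "Arguments:" text for bugcheck 7E in this report; other bugcheck codes give the four arguments different meanings.

```python
import re

# Labels copied from the "Arguments:" section of the !analyze -v output for
# bugcheck 7E in this report. Assumption: they only apply to code 0x7E.
ARG_LABELS_7E = (
    "exception code that was not handled",
    "address that the exception occurred at",
    "exception record address",
    "context record address",
)


def parse_bugcheck(line):
    """Parse a summary line such as 'BugCheck 7E, {a, b, c, d}' (hex values)."""
    m = re.search(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}", line)
    if m is None:
        raise ValueError("not a BugCheck summary line: %r" % line)
    code = int(m.group(1), 16)
    args = [int(tok.strip(), 16) for tok in m.group(2).split(",")]
    return code, args


# Summary line taken verbatim from the dump quoted in this bug.
code, args = parse_bugcheck(
    "BugCheck 7E, {80000003, 98e337b3, 94d60c1c, 94d60918}"
)
if code == 0x7E:
    for label, value in zip(ARG_LABELS_7E, args):
        print("%-40s 0x%08x" % (label, value))
```

Two dumps whose first argument is 0x80000003 and whose faulting address falls in the same module (here ndprot61) are likely hitting the same breakpoint, which is the kind of grouping needed before deciding whether a crash falls under the MS errata.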
Created attachment 430284 [details]
dmesg of host when win2k8 guest hang

Description of problem:
When I run the WHQL virtio-nic NDISTest6.5(MPE) test, the guest hangs at the "Start NDISTest client" job and consumes 100% CPU.

Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.90.el6.x86_64
virtio-win-1.1.7-2
2.6.32-37.el6.x86_64
Tried two different seabios versions:
seabios-0.5.1-0.5.20100108git669c991.el6.x86_64
seabios-0.5.1-2.el6

How reproducible:
100%

Steps to Reproduce:
1. Boot win2k8-32 guest, with the command line:
/usr/libexec/qemu-kvm -m 6G -smp 4 -cpu qemu64,+x2apic -usbdevice tablet -drive file=win2k8-32-nic1.qcow2,if=none,id=drive-ide0-0-0,werror=stop,rerror=stop,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup-private -device virtio-net-pci,netdev=hostnet0,mac=00:1a:08:09:02:01,id=229-nic1-1,bus=pci.0,addr=0x4 -netdev tap,id=hostnet1,vhost=on,script=/etc/qemu-ifup-private -device virtio-net-pci,netdev=hostnet1,mac=00:1a:08:09:02:02,id=229-nic1-2,bus=pci.0,addr=0x5 -netdev tap,id=hostnet2,script=/etc/qemu-ifup -device e1000,netdev=hostnet2,mac=00:1a:08:09:04:01,id=229-nic1-3,bus=pci.0,addr=0x6 -boot c -uuid a4f39443-bdc8-4171-a8df-ba981fa58643 -rtc-td-hack -no-kvm-pit-reinjection -monitor stdio -name win2k8x86NIC1-229 -spice port=5930,disable-ticketing -vga qxl
2. Run virtio-win NDISTest6.5(MPE) testing
3.

Actual results:
Guest hangs at the "Start NDISTest client" job.

Expected results:
Test passes.

Additional info:
dmesg of host will be attached.
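The retests requested in this bug vary only two things in the command line above: whether `vhost=on` is present on the tap netdev, and whether `-global virtio-net-pci.publish_used=off` is added. A minimal sketch of generating those variants, assuming the option spellings used in this report; the helper names `virtio_nic_args` and `publish_used_args` are hypothetical, and `publish_used` only exists in builds before qemu-kvm-0.12.1.2-2.99.el6 according to the comments.

```python
def virtio_nic_args(index, mac, dev_id, addr, vhost=True,
                    script="/etc/qemu-ifup-private"):
    """Build the -netdev/-device pair for one virtio NIC, as in the report."""
    netdev_id = "hostnet%d" % index
    netdev = "tap,id=%s" % netdev_id
    if vhost:
        # Dropping this suboption is the "test without vhost" case.
        netdev += ",vhost=on"
    netdev += ",script=%s" % script
    device = "virtio-net-pci,netdev=%s,mac=%s,id=%s,bus=pci.0,addr=%s" % (
        netdev_id, mac, dev_id, hex(addr))
    return ["-netdev", netdev, "-device", device]


def publish_used_args(disable):
    """Optional global switch; only meaningful on builds that still have it."""
    if disable:
        return ["-global", "virtio-net-pci.publish_used=off"]
    return []


# Enumerate the four combinations discussed in the comments.
for vhost in (True, False):
    for disable_publish_used in (True, False):
        args = virtio_nic_args(0, "00:1a:08:09:02:01", "229-nic1-1", 0x4,
                               vhost=vhost)
        args += publish_used_args(disable_publish_used)
        print(" ".join(args))
```

Keeping everything else in the command line fixed makes it easier to attribute a hang or BSOD to the one option that changed.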
#kvm_stat

kvm statistics

 efer_reload                  0       0
 exits                276231562    4985
 fpu_reload            35596534    1748
 halt_exits             4187565     315
 halt_wakeup            3581909     316
 host_state_reload     35712126    1774
 hypercalls                   0       0
 insn_emulation       148743765    1283
 insn_emulation_fail          0       0
 invlpg                       0       0
 io_exits              32028828    1174
 irq_exits             88847512    2119
 irq_injections        85466480     516
 irq_window                   0       0
 largepages                5626       0
 mmio_exits              392044      19
 mmu_cache_miss             973       0
 mmu_flooded                  0       0
 mmu_pde_zapped               0       0
 mmu_pte_updated              0       0
 mmu_pte_write                0       0
 mmu_recycled                 0       0
 mmu_shadow_zapped         1097       0
 mmu_unsync                   0       0
 nmi_injections               6       0
 nmi_window                   0       0
 pf_fixed                 87247       0
 pf_guest                     0       0
 remote_tlb_flush           582       0
 request_irq                  0       0
 signal_exits                10       0
 tlb_flush                    0       0
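To see which counters are still moving while the guest spins at 100% CPU, the kvm_stat snapshot can be parsed and sorted. This is a rough sketch, assuming each counter appears as a name followed by two integers, and assuming (not stated in the report) that the first number is a running total and the second a recent per-interval count. The function name `parse_kvm_stat` is hypothetical, and the sample is a four-row excerpt of the output above.

```python
import re


def parse_kvm_stat(text):
    """Return {counter: (total, recent)} from kvm_stat-style text.

    Works on either one-counter-per-line output or a flattened paste,
    because it only looks for 'name int int' triples.
    """
    stats = {}
    for name, total, recent in re.findall(r"([a-z_]+)\s+(\d+)\s+(\d+)", text):
        stats[name] = (int(total), int(recent))
    return stats


# Excerpt of the snapshot attached to this bug.
sample = """
kvm statistics
 exits                276231562    4985
 insn_emulation       148743765    1283
 io_exits              32028828    1174
 irq_exits             88847512    2119
"""

stats = parse_kvm_stat(sample)
# Rank counters by the second column, busiest first.
busiest = sorted(stats, key=lambda k: stats[k][1], reverse=True)
for name in busiest:
    print("%-20s %12d %8d" % (name, stats[name][0], stats[name][1]))
```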