Bug 819457
| Summary: | windows 2k8r2sp1 guests hang after running over 10 days in a stressed host | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Chao Yang <chayang> |
| Component: | qemu-kvm | Assignee: | Yvugenfi <yvugenfi> |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.3 | CC: | acathrow, amit.shah, areis, bsarathy, dyasny, gleb, juzhang, michen, mkenneth, mst, rhod, shuang, syeghiay, tburke, virt-maint, vrozenfe |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-02-21 07:34:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 806717 | | |
| Attachments: | | | |
Description
Chao Yang
2012-05-07 10:10:33 UTC
Most of the time, info registers shows RIP at fffff88002ed7bb0:

```
virsh # qemu-monitor-command 2k8r2sp1-e-5-4 --hmp "info registers"
RAX=0000000000000005 RBX=fffffa8001642000 RCX=fffffa8001645000 RDX=000000000000c090
RSI=000000000000002d RDI=fffffa800134ed00 RBP=0000000000000000 RSP=fffff8800bfa7720
R8 =0000000000000001 R9 =fffff800014db990 R10=0000000000000440 R11=fffff8800bfa7738
R12=fffffa8001645000 R13=fffffa8001ad4440 R14=fffffa8001644d78 R15=0000000000001000
RIP=fffff88002ed7bb0 RFL=00000206 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
CS =0010 0000000000000000 00000000 00209b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
FS =0053 000000007efa6000 00003c00 0040f300 DPL=3 DS   [-WA]
GS =002b fffff8000160fd00 ffffffff 00c0f300 DPL=3 DS   [-WA]
LDT=0000 0000000000000000 000fffff 00000000
TR =0040 fffff80001314080 00000067 00008b00 DPL=0 TSS64-busy
GDT=     fffff80001313000 0000007f
IDT=     fffff80001313080 00000fff
CR0=80050031 CR2=000000000206e638 CR3=000000002e370000 CR4=000006f8
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
FCW=027f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=fa00000000000000 4008 FPR7=0000000000000000 0000
XMM00=00000000000000000000000000000000 XMM01=00000000000000000000000000000000
XMM02=00000000000000000000000000000000 XMM03=00000000000000000000000000000000
XMM04=00000000000000000000000000000000 XMM05=00000000000000000000000000000000
XMM06=00000000000000000000000000000000 XMM07=00000000000000000000000000000000
XMM08=00000000000000000000000000000000 XMM09=00000000000000000000000000000000
XMM10=00000000000000000000000000000000 XMM11=00000000000000000000000000000000
XMM12=00000000000000000000000000000000 XMM13=00000000000000000000000000000000
XMM14=00000000000000000000000000000000 XMM15=00000000000000000000000000000000
```

Additional Info:

1. top info:

```
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
Cpu(s): 33.8%us, 37.3%sy, 0.0%ni, 27.3%id, 0.7%wa, 0.0%hi, 0.9%si, 0.0%st
Mem:  16320928k total, 14428284k used,  1892644k free,   138788k buffers
Swap: 16465912k total,  2157312k used, 14308600k free,  1288080k cached

  PID USER  PR NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
20941 qemu  20  0 1789m 840m 3352 S 177.5  5.3 33615:10 qemu-kvm
```

2. CLI:

```
qemu 20941 114 5.2 1832928 860920 ? Sl Apr18 33612:50 /usr/libexec/qemu-kvm -S -M rhel6.3.0 -cpu Westmere -enable-kvm -m 1024 -smp 4,sockets=2,cores=2,threads=1 -name 2k8r2sp1-e-5-2 -uuid e397193c-ecfa-4ae9-a5d0-f4f22fb9e8fc -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6Server-6.3.0.2.el6,serial=36363136-3935-4E43-4731-34315337474E_3C:D9:2B:09:AB:42,uuid=e397193c-ecfa-4ae9-a5d0-f4f22fb9e8fc -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/2k8r2sp1-e-5-2.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2012-04-18T03:17:20,driftfix=slew -no-shutdown -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x4 -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/rhev/data-center/de33e94d-36e6-44e7-ae8f-46c375dc2a53/e2a25fb8-18f5-44f0-82c5-3720f3740485/images/6513a62a-9399-426a-b32b-603fad536f20/c8a07a0f-5c93-4f39-9d7d-b5f8c0c4f6b7,if=none,id=drive-virtio-disk0,format=qcow2,serial=6513a62a-9399-426a-b32b-603fad536f20,cache=none,werror=stop,rerror=stop,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/var/run/vdsm/e397193c-ecfa-4ae9-a5d0-f4f22fb9e8fc.vfd,if=none,id=drive-fdc0-0-0,readonly=on,format=raw,serial= -global isa-fdc.driveA=drive-fdc0-0-0 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=31 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:42:0b:21,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/2k8r2sp1-e-5-2.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev spicevmc,id=charchannel1,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0 -chardev pty,id=charconsole0 -device virtconsole,chardev=charconsole0,id=console0 -spice port=5908,tls-port=5909,addr=0,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=inputs -k en-us -vga qxl -global qxl-vga.vram_size=67108864
```

3. kvm_stat:

```
efer_reload                    0       0
exits               175851548805 1682941
fpu_reload           46923002684  182639
halt_exits          103190203397  105012
halt_wakeup          55249330387   75384
host_state_reload    98975416613 1151225
hypercalls                     0       0
insn_emulation      100774171405  119187
insn_emulation_fail            0       0
invlpg                         0       0
io_exits             67751809286  903260
irq_exits            41574151563   31472
irq_injections      119520678241  126902
irq_window            5462424774    3373
largepages                107867       0
mmio_exits             242841734       0
mmu_cache_miss            715527       0
mmu_flooded                    0       0
mmu_pde_zapped                 0       0
mmu_pte_updated                0       0
mmu_pte_write             507880       0
mmu_recycled              331990       0
mmu_shadow_zapped         707614       0
mmu_unsync                     0       0
nmi_injections           2416248       1
nmi_window               2409231       1
pf_fixed               240254895      24
pf_guest                       0       0
remote_tlb_flush       175100347    1891
request_irq                    0       0
signal_exits                3100       0
tlb_flush                  31580       0
```

---

I connected to libvirt; here is some output of the monitor commands. If this is not useful information, please tell me which info you need. Thanks.
```
virsh # qemu-monitor-command 2k8r2sp1-e-5-4 --hmp "info status"
VM status: running
virsh # qemu-monitor-command 2k8r2sp1-e-5-4 --hmp "sendkey ctrl-alt-delete"
virsh #
```

---

top info: I noticed that thread id 3458 is stuck at TIME+ 0:00.54, Processor 15.

```
top - 16:30:54 up 22 days,  1:58,  4 users,  load average: 2.92, 3.02, 3.06
Tasks:  10 total,   3 running,   7 sleeping,   0 stopped,   0 zombie
Cpu(s): 33.9%us, 40.3%sy, 0.0%ni, 23.7%id, 1.4%wa, 0.0%hi, 0.7%si, 0.0%st
Mem:  16320928k total, 14194644k used,  2126284k free,   149280k buffers
Swap: 16465912k total,  2181988k used, 14283924k free,   885124k cached

 PID USER  PR NI  VIRT  RES SHR S %CPU %MEM     TIME+  P COMMAND
3450 qemu  20  0 2086m 1.3g 13m R 83.2  8.0  19738:50 12 qemu-kvm
3451 qemu  20  0 2086m 1.3g 13m R 13.3  8.0   2347:59  4 qemu-kvm
3453 qemu  20  0 2086m 1.3g 13m S 13.3  8.0   2259:16  4 qemu-kvm
3454 qemu  20  0 2086m 1.3g 13m S 13.3  8.0   2247:40 12 qemu-kvm
3452 qemu  20  0 2086m 1.3g 13m R 13.0  8.0   2241:46  4 qemu-kvm
3455 qemu  20  0 2086m 1.3g 13m S 13.0  8.0   2264:39  4 qemu-kvm
3456 qemu  20  0 2086m 1.3g 13m S 13.0  8.0   2270:37  4 qemu-kvm
3457 qemu  20  0 2086m 1.3g 13m S 12.0  8.0   2145:19  4 qemu-kvm
3428 qemu  20  0 2086m 1.3g 13m S  1.7  8.0 334:40.95  4 qemu-kvm
3458 qemu  20  0 2086m 1.3g 13m S  0.0  8.0   0:00.54 15 qemu-kvm
```

---

Can we get the dump too? What's the stack trace for the above thread?

---

Any news? Is it in work?

---

`trace-cmd record -e kvm -F 3428 -c` — will attach the stack trace later.

Created attachment 584610 [details]
trace-cmd record -e kvm -F 3428 -c
(In reply to comment #14)
> Created attachment 584610 [details]
> trace-cmd record -e kvm -F 3428 -c

My trace-cmd cannot interpret this file. Run trace-cmd report on it, and use the trace-cmd from upstream, not the one in RHEL6. The one in RHEL6 does not have the kvm plugin.

---

(In reply to comment #15)
> My trace-cmd cannot interpret this file. Run trace-cmd report on it, and use
> the trace-cmd from upstream, not the one in RHEL6. The one in RHEL6 does not
> have the kvm plugin.

How about this time? Running trace-cmd report on it outputs a lot of kvm-related logs.

Created attachment 584655 [details]
/usr/local/bin/trace-cmd record -e kvm -P 3428 -c
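As an editorial aside: decoding a recording like the one attached requires an upstream trace-cmd build, since (per comment 15) the RHEL6 package lacks the kvm plugin. A minimal sketch of the decode step, with the trace file path as a placeholder; the commands are only printed here, since they need a real recording to run against:

```shell
# Sketch: decode a recorded kvm event trace with an upstream trace-cmd build.
# TRACE is a placeholder path; /usr/local/bin/trace-cmd matches the upstream
# install location used elsewhere in this report.
TRACE=${TRACE:-trace.dat}
echo "/usr/local/bin/trace-cmd report $TRACE"
# Narrowing the report to interrupt-injection events (kvm_apic_accept_irq
# appears in the sample output below) can help spot a stuck IRQ:
echo "/usr/local/bin/trace-cmd report $TRACE | grep kvm_apic_accept_irq"
```

Dropping the leading `echo` runs the commands for real once a trace file exists.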
(In reply to comment #16)
> How about this time?
> Running trace-cmd report on it outputs a lot of kvm-related logs.

trace-cmd can parse this one, but you took the trace of only one (io) thread, so there is not much to see. Take the trace for all vcpu threads too. If there is only one VM on the host, just drop the -P flag.

---

(In reply to comment #18)
> trace-cmd can parse this one, but you took the trace of only one (io) thread,
> so there is not much to see. Take the trace for all vcpu threads too. If there
> is only one VM on the host, just drop the -P flag.

Sorry, but doesn't -c mean trace the children of -P? There are dozens of VMs, so I appended -P and -c, expecting to trace all threads. From the output of the report, I am not sure whether the id in square brackets means the cpu number:

```
...
qemu-kvm-3428 [011] 2435425.804414: kvm_pic_set_irq: chip 1 pin 0 (edge|masked)
qemu-kvm-3428 [011] 2435425.804415: kvm_apic_accept_irq: apicid 0 vec 209 (Fixed|edge)
```

Should I run `/usr/local/bin/trace-cmd record -e kvm -P 3428,3450,3451,3452,3453,3454,3455,3456,3457,3458 -o /opt/trace-cmd/trace.data` instead?
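As an editorial aside: one way to sidestep the uncertainty about whether -P accepts a comma-separated thread list is to enumerate the thread IDs from `/proc/<pid>/task` and start one recorder per thread. A hedged sketch, which only prints the commands it would run; QEMU_PID defaults to the current shell's PID purely so the loop has something to walk — substitute the real qemu-kvm PID:

```shell
#!/bin/sh
# Sketch: emit one trace-cmd invocation per thread of a process.
# QEMU_PID is a stand-in; /proc/<pid>/task lists one directory per thread.
QEMU_PID=${QEMU_PID:-$$}
for task in /proc/"$QEMU_PID"/task/*; do
    tid=$(basename "$task")
    # echo instead of exec: this is illustrative, not a verified workflow
    echo trace-cmd record -e kvm -P "$tid" -o "trace-$tid.dat"
done
```

Removing the `echo` would actually start the recorders (one per thread, each writing its own output file).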
Hi Gleb,

Can you please take a look at comment #19? If the log is not what you want, please point me to an exact command to collect it. I need to reinstall this host to try to reproduce this issue on the latest qemu-kvm, libvirt, and vdsm. Thanks.

---

(In reply to comment #20)
> Can you please take a look at comment #19? If the log is not what you want,
> please point me to an exact command to collect it.

-c traces all child processes, not threads. I am not sure that -P supports multiple threads like that; try it. The id in square brackets is the host cpu number, mostly irrelevant for us.

---

This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.

Created attachment 585924 [details]
Attaching trace-cmd with output of -P as well as thread stack printed by gdb.
Created attachment 585925 [details]
threads stack printed by gdb
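As an editorial aside: the per-thread stacks attached above can be gathered in one shot with gdb in batch mode. A sketch, assuming gdb is installed and ptrace attachment to the qemu-kvm process is permitted; the PID is a placeholder taken from the top output earlier in this report, and the command is only printed here:

```shell
# Sketch: print a backtrace of every thread of a running qemu-kvm process.
# QEMU_PID is a placeholder; 'thread apply all bt' iterates over all threads.
QEMU_PID=20941
CMD="gdb -p $QEMU_PID -batch -ex 'thread apply all bt'"
echo "$CMD"   # drop the echo to actually attach and dump the stacks
```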
Hi Gleb,

If it has no logs about threads, can you provide a way to trace the relevant threads, in case trace-cmd cannot trace threads?

---

(In reply to comment #23)
> Attaching trace-cmd with output of -P as well as thread stack printed by gdb.

trace-cmd again traced only one thread. The gdb stack is useless.

---

(In reply to comment #25)
> If it has no logs about threads, can you provide a way to trace the
> relevant threads, in case trace-cmd cannot trace threads?

Kill all other guests and trace system-wide, or run a separate trace-cmd for each thread.

---

Post the output of "info pci", please.

---

(In reply to comment #29)
> Post the output of "info pci", please.

Sorry, I have re-configured the environment with newer vdsm, libvirt, and qemu-kvm-rhev, so I cannot post the output of "info pci".

---

(In reply to comment #30)
> Sorry, I have re-configured the environment with newer vdsm, libvirt, and
> qemu-kvm-rhev, so I cannot post the output of "info pci".

If you still have the same guest, you can start it and run "info pci" after it boots.

---

(In reply to comment #31)
> If you still have the same guest, you can start it and run "info pci" after
> it boots.

I have reinstalled all guests. Will "info pci" of the newly installed windows 2k8r2 help?

---

(In reply to comment #32)
> I have reinstalled all guests. Will "info pci" of the newly installed windows
> 2k8r2 help?

If it has exactly the same HW config and virtio drivers, then yes.

---

(In reply to comment #33)
> If it has exactly the same HW config and virtio drivers, then yes.

```
# qemu-monitor-command win2k8r2-repro --hmp "info pci"
  Bus  0, device   0, function 0:
    Host bridge: PCI device 8086:1237
      id ""
  Bus  0, device   1, function 0:
    ISA bridge: PCI device 8086:7000
      id ""
  Bus  0, device   1, function 1:
    IDE controller: PCI device 8086:7010
      BAR4: I/O at 0xc000 [0xc00f].
      id ""
  Bus  0, device   1, function 2:
    USB controller: PCI device 8086:7020
      IRQ 5.
      BAR4: I/O at 0xc020 [0xc03f].
      id "usb"
  Bus  0, device   1, function 3:
    Bridge: PCI device 8086:7113
      IRQ 9.
      id ""
  Bus  0, device   2, function 0:
    VGA controller: PCI device 1b36:0100
      IRQ 10.
      BAR0: 32 bit memory at 0xf0000000 [0xf3ffffff].
      BAR1: 32 bit memory at 0xf8000000 [0xfbffffff].
      BAR2: 32 bit memory at 0xf4000000 [0xf4001fff].
      BAR3: I/O at 0xc040 [0xc05f].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0000fffe].
      id ""
  Bus  0, device   3, function 0:
    Ethernet controller: PCI device 1af4:1000
      IRQ 0.
      BAR0: I/O at 0xc060 [0xc07f].
      BAR1: 32 bit memory at 0xf4020000 [0xf4020fff].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0000fffe].
      id "net0"
  Bus  0, device   4, function 0:
    Class 0780: PCI device 1af4:1003
      IRQ 5.
      BAR0: I/O at 0xc080 [0xc09f].
      BAR1: 32 bit memory at 0xf4040000 [0xf4040fff].
      id "virtio-serial0"
  Bus  0, device   5, function 0:
    SCSI controller: PCI device 1af4:1001
      IRQ 0.
      BAR0: I/O at 0xc0c0 [0xc0ff].
      BAR1: 32 bit memory at 0xf4041000 [0xf4041fff].
      id "virtio-disk0"
```

---

(In reply to comment #34)
> Bus 0, device 4, function 0:
>   Class 0780: PCI device 1af4:1003
>     IRQ 5.
>     BAR0: I/O at 0xc080 [0xc09f].
>     BAR1: 32 bit memory at 0xf4040000 [0xf4040fff].
>     id "virtio-serial0"

The trace shows that the virtio-serial0 IRQ is stuck.

---

(In reply to comment #35)
> The trace shows that the virtio-serial0 IRQ is stuck.

And vcpu0 writes to VIRTIO_PCI_QUEUE_NOTIFY in a tight loop.

---

(In reply to comment #36)
> And vcpu0 writes to VIRTIO_PCI_QUEUE_NOTIFY in a tight loop.

Vadim says: it can happen that a vioserial port is trying to send a block of data while the send virtqueue is completely stuck. Vadim will bound the endless wait loop that spins while the queue is full, so the driver exits gracefully. This will only fix the guest hang; the underlying problem remains, since the content of the queue is not consumed.

---

Hi guys, could you please give a try to the latest driver, available at http://download.devel.redhat.com/brewroot/work/tasks/5605/4755605/virtio-win-prewhql-0.1.zip ?

Thank you, Vadim.

---

Hi Vadim, can I try the driver at https://brewweb.devel.redhat.com/buildinfo?buildID=245874 ?

---

(In reply to comment #43)
> Hi Vadim, can I try the driver at
> https://brewweb.devel.redhat.com/buildinfo?buildID=245874 ?

Yes, please. Thank you, Vadim.

---

Verified on virtio-win-prewhql-0.1-49 (the build in comment #44) in my longevity testing. All five windows 2k8r2sp1 guests with virtio serial driver 52.64.104.4900 installed have been running for over 12 days, and no hang was observed. So this issue has been fixed correctly.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0527.html