Bug 995931
Summary: | Qemu core dump (red_get_image: unknown type 184) when rebooting a RHEL.6.4-64 guest 25 times | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | CongLi <coli> |
Component: | qemu-kvm | Assignee: | Gerd Hoffmann <kraxel> |
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 6.5 | CC: | bsarathy, cfergeau, chayang, coli, dblechte, flang, hdegoede, jen, juzhang, lersek, marcandre.lureau, michen, mkenneth, qiguo, qzhang, rbalakri, scui, shuang, virt-maint, xutian, xwei |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | qemu-kvm-0.12.1.2-2.438.el6 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2014-10-14 06:49:56 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1054077, 1077076 |
Description
CongLi
2013-08-12 03:42:47 UTC
How exactly do you reboot the guest in comment #0 step 2? Thanks.

My guess is that spice might need a reset notifier. The spice protocol byte stream could be abruptly restarted in the middle of a command.

(In reply to Laszlo Ersek from comment #3)
> How exactly do you reboot the guest in comment #0 step 2? Thanks.

Log in to the guest via ssh, then run 'shutdown -r now' in the guest.

OK, I've looked into some recent spice-server patches and they convinced me that the spice guys will be able to narrow this down (and get the component right) much faster than I could. Moving to spice-server; feel free to bump it back if I'm wrong. The spice-server version is captured in comment #0. Thanks.

Thread 1: IO thread
Thread 2: AIO thread
Thread 3: QXL worker thread
Thread 4: VCPU thread

Thread 1, the IO thread, is running qxl_hard_reset(), the QXL reset notification handler, as part of the VM reset. This operation seems to send a message of type RED_WORKER_MESSAGE_ADD_MEMSLOT to thread 3, the QXL worker thread:

  qxl_hard_reset()
    qemu_spice_create_host_memslot()
      memslot.slot_group_id     = MEMSLOT_GROUP_HOST
      memslot.slot_id           = 0
      memslot.generation        = 0
      memslot.virt_start        = 0
      memslot.virt_end          = ~0UL
      memslot.addr_delta        = 0
      memslot.virt_qxl_ram_size = 0
      qemu_spice_add_memslot(memslot, ..., QXL_SYNC)
        qxl_worker_add_memslot()            -- via funcptr
          red_dispatcher_add_memslot()
            dispatcher_send_message(..., RED_WORKER_MESSAGE_ADD_MEMSLOT)
            -- sends message
            -- is waiting for ACK

Thread 3, the QXL worker thread, handles RED_WORKER_MESSAGE_ADD_MEMSLOT:

  handle_dev_add_memslot()
    red_memslot_info_add_slot()
    -- stores addr_delta (0) / virt_start (0) / virt_end (~0UL) /
       generation (0) in info->mem_slots[slot_group_id][slot_id],
       that is, info->mem_slots[MEMSLOT_GROUP_HOST][0]

Thread 3, the QXL worker thread, goes on to process a QXL_CMD_DRAW command:

  red_worker_main()
    red_process_commands()
      red_get_drawable()
        red_get_native_drawable()   -- QXL_COMMAND_FLAG_COMPAT is clear
          red_get_copy_ptr()        -- red->type == QXL_DRAW_COPY
          red_get_image()
            qxl = get_virt(...)
            red->descriptor.type = qxl->descriptor.type
            -- cannot handle this unknown type

Now, get_virt() tries to translate a (guest-)physical address to a host virtual address, and then work with the image found there. Unfortunately, the pointer it returns is garbage, pointing into garbage. This is because get_virt() implements the gpa->hva resolution by traversing the memmap that the RED_WORKER_MESSAGE_ADD_MEMSLOT command just destroyed.

So, my interpretation: when the QXL reset handler runs in the IO thread, it queues a synchronous RED_WORKER_MESSAGE_ADD_MEMSLOT message for the QXL worker thread. This message clears the memmap that the QXL worker thread is maintaining. Apparently, the command queue of the QXL worker is not flushed at this point (although it should be), because the QXL worker tries to process another message (queued earlier) of type QXL_CMD_DRAW. This message is unprocessable with the cleared memmap.

This could also be a synchronization problem between the IO thread and the QXL worker thread. Maybe the QXL work queue *is* flushed before sending the RED_WORKER_MESSAGE_ADD_MEMSLOT command, but (perhaps) the QXL_CMD_DRAW command is unexpectedly enqueued afterwards. The "synchronization problem" theory is supported by the fact that QE had to run the test 25 times to trigger the problem once. Maybe heavy graphics activity in the guest, combined with the more abrupt "system_reset" HMP command (which doesn't stop guest processes nicely, like "shutdown -r now" does), could trigger it at a higher rate.

This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.

*** Bug 952447 has been marked as a duplicate of this bug. ***
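To illustrate the translation step that blows up here: get_virt() resolves a QXL guest-physical address through the memslot table, and once the reset has zeroed a slot, any address from a stale queued command fails range validation. The following is a minimal sketch under simplified assumptions — the names (get_virt_sketch, add_memslot_sketch) and the field layout are illustrative, not the actual red_memslots.c code:

```c
/* Minimal sketch (NOT the real spice code) of memslot-based gpa->hva
 * resolution.  The real implementation lives in red_memslots.c. */
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t virt_start;  /* first valid host virtual address       */
    uint64_t virt_end;    /* one past the last valid address        */
    uint64_t addr_delta;  /* gpa -> hva offset for this slot        */
    int      generation;  /* bumped on every (re-)registration      */
} MemSlot;

#define NUM_SLOTS 4
static MemSlot slots[NUM_SLOTS];

/* (Re-)register a slot; a reset effectively re-adds it with an empty
 * range, which invalidates every previously translated address. */
static void add_memslot_sketch(int slot_id, uint64_t start, uint64_t end,
                               uint64_t delta)
{
    slots[slot_id].virt_start = start;
    slots[slot_id].virt_end   = end;
    slots[slot_id].addr_delta = delta;
    slots[slot_id].generation++;
}

/* Translate addr (a QXL gpa) through slot_id.  Returns 0 when the slot
 * no longer covers the translated range -- the "virtual address out of
 * range" case seen in the backtrace below. */
static uint64_t get_virt_sketch(int slot_id, uint64_t addr, uint32_t size)
{
    MemSlot *slot = &slots[slot_id];
    uint64_t virt = addr + slot->addr_delta;

    if (virt < slot->virt_start || virt + size > slot->virt_end)
        return 0;
    return virt;
}
```

A stale QXL_CMD_DRAW parsed after the reset hits exactly the failure branch: the slot's range has been collapsed, so the lookup that used to succeed now fails (or, without validation, returns a garbage pointer).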
Raising the severity a bit to highlight this bug, although it's not critical, since a crash while the guest reboots is somewhat "safe".

OK, it looks like RHEL is missing this fix:

  commit 75c70e37bc4a6bdc394b4d1b163fe730abb82c72
  Author: Gerd Hoffmann <kraxel>
  Date:   Mon Dec 9 16:03:49 2013 +0100

      spice: stop server for qxl hard reset

Gerd, why don't you propose that fix for RHEL too? I think it's a good time :)

Moving to POST to reflect that.

(In reply to Marc-Andre Lureau from comment #11)
> Ok, it looks like rhel is missing this fix:
> commit 75c70e37bc4a6bdc394b4d1b163fe730abb82c72
>     spice: stop server for qxl hard reset

(In reply to Marc-Andre Lureau from comment #12)
> moving to POST to reflect that

The patch should be backported - I don't see it in rhvirt-patches. We only move bugs to POST after the patch is posted downstream, not when it's upstream. Moving it back to ASSIGNED, owned by Gerd. Please correct me if I'm wrong.

Please try with this new scratch build:
https://brewweb.devel.redhat.com/taskinfo?taskID=7632466
Thanks.

There was a second patch needed:

  commit b50f3e42b9438e033074222671c0502ecfeba82c
  Author: Gerd Hoffmann <kraxel>
  Date:   Mon Dec 9 16:00:15 2013 +0100

      spice: move spice_server_vm_{start,stop} calls into qemu_spice_display_*()

Gerd, are you taking this over? Thanks.

(In reply to Marc-Andre Lureau from comment #17)
> please try with this new scratchbuild
> https://brewweb.devel.redhat.com/taskinfo?taskID=7632466

Tested 10 rounds, each round requiring 50 reboots; all passed.

Package versions:
kernel-2.6.32-431.22.1.el6.x86_64
qemu-kvm-0.12.1.2-2.415.el6_5.11.x86_64

Based on the above, I think this bug has been fixed in the above build. If there is anything wrong, feel free to correct me.

Thanks,
Cong

*** Bug 1077076 has been marked as a duplicate of this bug. ***
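The ordering that the "spice: stop server for qxl hard reset" fix enforces can be modeled abstractly: the hard reset must drain the worker's pending commands before tearing down and re-creating the memslots, so no stale draw command is ever parsed against a cleared slot table. The following single-threaded sketch is purely illustrative — hard_reset_buggy(), hard_reset_fixed(), and the queue model are hypothetical, not the actual qemu/spice code:

```c
/* Toy model of the reset/flush ordering.  "slots_valid == 0" stands in
 * for the window where the memslot table has been cleared. */
#include <assert.h>

#define QUEUE_MAX 16

static int queue[QUEUE_MAX];
static int queue_len;
static int slots_valid = 1;
static int stale_parses;  /* commands parsed while slots were invalid */

static void enqueue_draw(int cmd) { queue[queue_len++] = cmd; }

static void process_one(void)
{
    if (queue_len == 0)
        return;
    queue_len--;
    if (!slots_valid)
        stale_parses++;  /* this is the red_get_image() crash case */
}

static void flush_commands(void) { while (queue_len) process_one(); }

/* Buggy ordering: slots are cleared first; the worker then drains the
 * queue against an invalid slot table. */
static void hard_reset_buggy(void)
{
    slots_valid = 0;
    flush_commands();
    slots_valid = 1;
}

/* Fixed ordering: stop/flush the worker first, then reset the slots. */
static void hard_reset_fixed(void)
{
    flush_commands();
    slots_valid = 0;  /* destroy ... */
    slots_valid = 1;  /* ... and re-create the memslots */
}
```

In the buggy ordering every command still in flight at reset time is parsed against the cleared table; in the fixed ordering the queue is empty by the time the table is invalidated, which is why the crash stops reproducing.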
Thanks in advance, Gerd! :)

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=7657369

Patches posted.

Do we need a duplicate for RHEL 7?

(In reply to Marc-Andre Lureau from comment #25)
> do we need a duplicate for rhel7?

No, 1.5.0 has the patches.

Fix included in qemu-kvm-0.12.1.2-2.438.el6.

Tested this bug on the following versions; still cannot reproduce.

Host:
# uname -r
2.6.32-410.el6.x86_64
# rpm -q qemu-kvm-rhev
qemu-kvm-rhev-0.12.1.2-2.386.el6_5.test.x86_64

Guest: RHEL6.4

Steps:
1. Boot a RHEL6.4 guest.
2. Log in and reboot the guest, using an autotest script for the test:
   # python ConfigTest.py --guestname=RHEL.6.4 --testcase=reboot --nrepeat=10
   login (ssh) --> reboot (shutdown -r now)
3. Repeat step 2 for 25 * 10 = 250 times.

Actual results:
Guest works well.

Additional info:
Per comment #7, I also tested manually about 10 times, each time adding GUI stress (SPECviewperf) in the guest; still cannot reproduce.

Test on the latest version:

Host:
kernel-2.6.32-431.31.1.el6.x86_64
qemu-kvm-0.12.1.2-2.438.el6.x86_64

Guest: RHEL6.4-GA

Results:
Tested about 25 * 50 times; guest works well.

Reproduced on qemu-kvm-0.12.1.2-2.437.el6.x86_64 and spice-server-0.12.4-2.el6.x86_64.

Steps:
1. Start a RHEL7 guest with:

   /usr/libexec/qemu-kvm -name test -M rhel6.0.0 -enable-kvm -cpu Penryn \
     -m 2048 -smp 2,sockets=2,cores=1,threads=1 -nodefaults \
     -netdev tap,id=hostnet0 \
     -device e1000,netdev=hostnet0,id=net0,mac=00:1a:4a:42:76:36,bus=pci.0 \
     -k en-us -vga qxl \
     -spice port=7000,disable-ticketing,streaming-video=all,agent-mouse=on,playback-compression=on \
     -usb -monitor stdio -boot menu=on \
     -drive file=/home/RHEL-Server-7.0-64.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,werror=stop,rerror=stop,aio=native \
     -device ide-drive,bus=ide.0,unit=0,drive=drive-virtio-disk0,id=virtio-disk0

2. Log in to the guest.
3.
Run system_reset through HMP.

Actual result:

(qemu) system_reset
(qemu) id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0

(/usr/bin/gdb:28539): Spice-CRITICAL **: red_memslots.c:94:validate_virt: virtual address out of range
    virt=0x1000398+0xbf slot_id=1 group_id=1
    slot=0x0-0x0 delta=0x0

Detaching after fork from child process 28690.

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffe65fd700 (LWP 28555)]
0x00007ffff483e915 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff483e915 in raise () from /lib64/libc.so.6
#1  0x00007ffff48400f5 in abort () from /lib64/libc.so.6
#2  0x00007ffff50a0875 in spice_logv (log_domain=0x7ffff5117a06 "Spice", log_level=SPICE_LOG_LEVEL_CRITICAL, strloc=0x7ffff511c43a "red_memslots.c:94", function=0x7ffff511c51f "validate_virt", format=0x7ffff511c248 "virtual address out of range\n virt=0x%lx+0x%x slot_id=%d group_id=%d\n slot=0x%lx-0x%lx delta=0x%lx", args=0x7fffe65fc660) at log.c:109
#3  0x00007ffff50a09aa in spice_log (log_domain=<value optimized out>, log_level=<value optimized out>, strloc=<value optimized out>, function=<value optimized out>, format=<value optimized out>) at log.c:123
#4  0x00007ffff505df23 in validate_virt (info=<value optimized out>, virt=16778136, slot_id=1, add_size=191, group_id=1) at red_memslots.c:90
#5  0x00007ffff505e073 in get_virt (info=<value optimized out>, addr=<value optimized out>, add_size=<value optimized out>, group_id=1, error=0x7fffe65fc80c) at red_memslots.c:142
#6  0x00007ffff5060060 in red_get_native_drawable (slots=0x7fff501d5e58, group_id=1, red=0x7fff5045cf00, addr=<value optimized out>, flags=0) at red_parse_qxl.c:934
#7  red_get_drawable (slots=0x7fff501d5e58, group_id=1, red=0x7fff5045cf00, addr=<value optimized out>, flags=0) at red_parse_qxl.c:1105
#8  0x00007ffff507447b in red_process_commands (worker=0x7fff500008c0, ring_is_empty=0x7fffe65fca3c, max_pipe_size=50) at red_worker.c:5190
#9  0x00007ffff507755b in flush_display_commands
(worker=0x7fff500008c0) at red_worker.c:9712
#10 flush_all_qxl_commands (worker=0x7fff500008c0) at red_worker.c:9795
#11 0x00007ffff5078380 in dev_destroy_surfaces (opaque=<value optimized out>, payload=<value optimized out>) at red_worker.c:11270
#12 handle_dev_destroy_surfaces (opaque=<value optimized out>, payload=<value optimized out>) at red_worker.c:11299
#13 0x00007ffff505b607 in dispatcher_handle_single_read (dispatcher=0x7ffff88b0b68) at dispatcher.c:139
#14 dispatcher_handle_recv_read (dispatcher=0x7ffff88b0b68) at dispatcher.c:162
#15 0x00007ffff5077226 in red_worker_main (arg=<value optimized out>) at red_worker.c:12276
#16 0x00007ffff76e99d1 in start_thread () from /lib64/libpthread.so.0
#17 0x00007ffff48f4ccd in clone () from /lib64/libc.so.6

Verified pass with qemu-kvm-0.12.1.2-2.441.el6.x86_64: no qemu-kvm core dump anymore after system_reset in HMP. I also tested the latest spice-server (spice-server-0.12.4-11.el6.x86_64) with -M rhel6.0.0 as well as -M rhel6.6.0; the issue no longer reproduces.

As per the above, this issue has been fixed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1490.html