Description of problem:
For a reproducer, see https://bugzilla.redhat.com/show_bug.cgi?id=853369

Version-Release number of selected component (if applicable):
qemu-1.2.0-0.4.rc1.fc19.x86_64

How reproducible:
100%

Additional info:
Stack trace to follow.
Stack trace below. Note that it's thread #1 at the bottom which actually segfaults.

Thread 17 (Thread 0x7f16f17fa700 (LWP 23318)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f16f17f9bb0, cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x7f16f37fe700 (LWP 23314)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f16f37fdbb0, cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7f172f83c700 (LWP 23305)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f172f83bbb0, cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7f172c836700 (LWP 23311)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f172c835bb0, cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7f16f2ffd700 (LWP 23315)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f16f2ffcbb0, cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f172d037700 (LWP 23310)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f172d036bb0, cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f1730d46700 (LWP 23302)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f1730d45bb0, cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f172d838700 (LWP 23309)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f172d837bb0, cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f16f3fff700 (LWP 23313)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f16f3ffebb0, cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f16f1ffb700 (LWP 23317)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f16f1ffabb0, cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f172f03b700 (LWP 23306)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f172f03abb0, cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f172e83a700 (LWP 23307)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f172e839bb0, cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f172e039700 (LWP 23308)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f172e038bb0, cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f16f27fc700 (LWP 23316)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f16f27fbbb0, cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f17089fe700 (LWP 23312)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f17089fdbb0, cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f1730545700 (LWP 23303)):
#0  0x00007f173b35fe4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f173b35bca6 in _L_lock_840 () from /lib64/libpthread.so.0
#2  0x00007f173b35bba8 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f173d4a55d9 in qemu_mutex_lock (mutex=mutex@entry=0x7f173e1c34e0 <qemu_global_mutex>) at qemu-thread-posix.c:54
#4  0x00007f173d4e4df0 in qemu_mutex_lock_iothread () at /usr/src/debug/qemu-kvm-1.2.0-rc1/cpus.c:908
#5  0x00007f173d532e86 in kvm_cpu_exec (env=env@entry=0x7f17401a4090) at /usr/src/debug/qemu-kvm-1.2.0-rc1/kvm-all.c:1582
#6  0x00007f173d4e3981 in qemu_kvm_cpu_thread_fn (arg=0x7f17401a4090) at /usr/src/debug/qemu-kvm-1.2.0-rc1/cpus.c:757
#7  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f1730d4ea00 (LWP 23301)):
#0  0x00007f173792eba5 in raise () from /lib64/libc.so.6
#1  0x00007f1737930358 in abort () from /lib64/libc.so.6
#2  0x00007f173796e01b in __libc_message () from /lib64/libc.so.6
#3  0x00007f1737a02b97 in __fortify_fail () from /lib64/libc.so.6
#4  0x00007f1737a00d10 in __chk_fail () from /lib64/libc.so.6
#5  0x00007f1737a02b07 in __fdelt_warn () from /lib64/libc.so.6
#6  0x00007f173d3d373d in qemu_iohandler_poll (readfds=readfds@entry=0x7f173dd97b60 <rfds>, writefds=writefds@entry=0x7f173dd97be0 <wfds>, xfds=xfds@entry=0x7f173dd97c60 <xfds>, ret=ret@entry=1) at iohandler.c:156
#7  0x00007f173d477eb8 in main_loop_wait (nonblocking=<optimized out>) at main-loop.c:497
#8  0x00007f173d355b13 in main_loop () at /usr/src/debug/qemu-kvm-1.2.0-rc1/vl.c:1643
#9  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /usr/src/debug/qemu-kvm-1.2.0-rc1/vl.c:3790
(gdb) frame 6
#6  0x00007f173d3d373d in qemu_iohandler_poll (readfds=readfds@entry=0x7f173dd97b60 <rfds>, writefds=writefds@entry=0x7f173dd97be0 <wfds>, xfds=xfds@entry=0x7f173dd97c60 <xfds>, ret=ret@entry=1) at iohandler.c:156
156         if (!ioh->deleted && ioh->fd_read && FD_ISSET(ioh->fd, readfds)) {
(gdb) print *ioh
$1 = {
  fd_read_poll = 0x7f173d49a7f0 <tcp_chr_read_poll>,
  fd_read = 0x7f173d49be30 <tcp_chr_read>,
  fd_write = 0x0,
  opaque = 0x7f173fe3a380,
  next = {
    le_next = 0x7f1728003310,
    le_prev = 0x7f173dd96a70 <io_handlers>
  },
  fd = -1,
  deleted = false
}

Note that ioh->fd == -1 and ioh->deleted == false, so it's trying to do FD_ISSET(-1, readfds), which is failing.

Also note that tcp_chr_read indicates that this 'ioh' is associated with a qemu char device. libguestfs uses a console device and a virtio-serial device, so it might be either of these (or something else).
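As a side note on why this surfaces as an abort rather than silent memory corruption: glibc's fortified FD_ISSET rejects out-of-range descriptors, and Fedora builds qemu with fortification enabled. The following minimal sketch is added here purely for illustration (it is not part of the original report, and assumes a build with -D_FORTIFY_SOURCE=2); it reproduces the same __fdelt_warn/__chk_fail abort path seen in thread 1 of the backtrace:

  /* Minimal sketch (not from the report): build with optimization and
   * fortification, e.g.  gcc -O2 -D_FORTIFY_SOURCE=2 -o fdisset fdisset.c
   * glibc's __fdelt_chk/__fdelt_warn reject descriptors outside
   * [0, FD_SETSIZE) and call __chk_fail(), which aborts. */
  #include <sys/select.h>

  int main(int argc, char **argv)
  {
      (void)argv;                     /* unused */

      fd_set readfds;
      FD_ZERO(&readfds);

      int fd = argc - 2;              /* -1 when run with no arguments, like ioh->fd above */
      return FD_ISSET(fd, &readfds);  /* aborts via __fdelt_chk -> __chk_fail */
  }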
I cannot reproduce this problem with either

 * upstream qemu-kvm.git, nor with
 * qemu-kvm tag qemu-kvm-1.2.0-rc1, nor with
 * qemu.git

I wonder if it has something to do with all the chardev patches that we pile on top? (What are those patches even for??) I will try these patches next.
(In reply to comment #3)
> I cannot reproduce this problem with either
>  * upstream qemu-kvm.git, nor with
>  * qemu-kvm tag qemu-kvm-1.2.0-rc1, nor with
>  * qemu.git

Please ignore this comment completely. For some reason the problem is only happening when qemu runs via libvirt, and my testing did not cover this case.
Results of tests:

 * upstream qemu-kvm.git => CAN reproduce it
 * upstream qemu.git     => CANNOT reproduce it

Test procedure is:

  ./configure --with-default-attach-method=libvirt
  make
  export LIBGUESTFS_QEMU=~/d/qemu/qemu.wrapper
  ./run ./fish/guestfish -N fs exit
  echo $?
Back from holiday now! The bug still happens with qemu-1.2.0-2.fc18.x86_64. I will try to get a minimal standalone reproducer (i.e. without libguestfs) for this today.
Reproducer is proving hard to pin down. However, here are some more details from the core file:

(gdb) print *ioh
$1 = {
  fd_read_poll = 0x7f251d9d8e00 <tcp_chr_read_poll>,
  fd_read = 0x7f251d9da440 <tcp_chr_read>,
  fd_write = 0x0,
  opaque = 0x7f251fb9f400,
  next = {
    le_next = 0x7f250c003310,
    le_prev = 0x7f251e2d7dd0 <io_handlers>
  },
  fd = -1,
  deleted = false
}
(gdb) print *(CharDriverState*)ioh->opaque
$4 = {
  init = 0x0,
  chr_write = 0x7f251d9dc640 <tcp_chr_write>,
  chr_update_read_handler = 0x0,
  chr_ioctl = 0x0,
  get_msgfd = 0x7f251d9d8e50 <tcp_get_msgfd>,
  chr_add_client = 0x7f251d9db550 <tcp_chr_add_client>,
  chr_event = 0x7f251d993cc0 <serial_event>,
  chr_can_read = 0x7f251d994410 <serial_can_receive1>,
  chr_read = 0x7f251d993d30 <serial_receive1>,
  chr_write_unblocked = 0x0,
  chr_enable_write_fd_handler = 0x7f251d9d9a10 <tcp_enable_write_fd_handler>,
  chr_disable_write_fd_handler = 0x7f251d9d99d0 <tcp_disable_write_fd_handler>,
  handler_opaque = 0x7f251ff80858,
  chr_close = 0x7f251d9da780 <tcp_chr_close>,
  chr_accept_input = 0x0,
  chr_set_echo = 0x0,
  chr_guest_open = 0x0,
  chr_guest_close = 0x0,
  opaque = 0x7f251fb9f4e0,
  bh = 0x7f250c000b40,
  label = 0x7f251fb9f690 "charserial0",
  filename = 0x7f251fb9f580 "unix:/home/rjones/d/libguestfs/libguestfswfMneI/console.sock",
  opened = 0,
  avail_connections = 0,
  write_blocked = false,
  next = {
    tqe_next = 0x7f251fb9f6b0,
    tqe_prev = 0x7f251fb9f240
  }
}
(gdb) print *(TCPCharDriver*)$4->opaque
$5 = {
  fd = -1,
  listen_fd = -1,
  connected = 1,
  max_size = 8,
  do_telnetopt = 0,
  do_nodelay = 0,
  is_unix = 1,
  msgfd = -1
}

Some notes:

(1) console.sock is created via the following command line:

  -chardev socket,id=charserial0,path=/home/rjones/d/libguestfs/libguestfswfMneI/console.sock
  -device isa-serial,chardev=charserial0,id=serial0

(2) The crash happens on shutdown.

(3) libguestfs is reading from console.sock, and the guest is printing a lot of stuff both during boot up and shut down.
This should fix it, but it's a bit ugly that we do not know the cause:

diff --git a/qemu-char.c b/qemu-char.c
index 767da93..6e1b9e0 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -2334,8 +2334,10 @@ static void tcp_chr_connect(void *opaque)
     TCPCharDriver *s = chr->opaque;
     s->connected = 1;
-    qemu_set_fd_handler2(s->fd, tcp_chr_read_poll,
-                         tcp_chr_read, NULL, chr);
+    if (s->fd == -1) {
+        qemu_set_fd_handler2(s->fd, tcp_chr_read_poll,
+                             tcp_chr_read, NULL, chr);
+    }
     qemu_chr_generic_open(chr);
 }
That patch causes it to hang at start-up ...
So here's an interesting thing. I can "fix" this by changing the order of this code in libguestfs:

  /* BROKEN. */
  if (g->fd[0] >= 0)                                          // this is the console socket
    close (g->fd[0]);
  if (g->attach_ops->shutdown (g, check_for_errors) == -1)    // kills qemu
    ret = -1;

  /* WORKING. */
  if (g->attach_ops->shutdown (g, check_for_errors) == -1)
    ret = -1;
  if (g->fd[0] >= 0)
    close (g->fd[0]);

What would happen if qemu was receiving a lot of serial data from the guest and writing it to this socket, and the other end closed down unexpectedly? I'm guessing a write somewhere would get -EPIPE, and that qemu doesn't handle this situation well. I will try to strace qemu and see if I can verify this.
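For what it's worth, here is a tiny standalone sketch (added for illustration, not part of the original report) of the failure mode being guessed at above: with an AF_UNIX stream socket, once the peer has closed its end, a write from the other side fails with EPIPE (or raises SIGPIPE if that is not ignored):

  /* Sketch: the writer's view when the reading end of a connected
   * unix-domain socket goes away first -- roughly what qemu would see if
   * libguestfs closes the console socket while the guest is still
   * producing serial output. */
  #include <errno.h>
  #include <signal.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  int main(void)
  {
      int sv[2];

      signal(SIGPIPE, SIG_IGN);                 /* see EPIPE instead of dying on SIGPIPE */
      if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) {
          perror("socketpair");
          return 1;
      }

      close(sv[1]);                             /* the "libguestfs" end closes first */

      if (write(sv[0], "x", 1) == -1)           /* the "qemu" end keeps writing */
          printf("write failed: %s\n", strerror(errno));   /* prints: Broken pipe (EPIPE) */
      return 0;
  }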
The strace shows:

  # libguestfs closes the socket
  9330  close(8 <unfinished ...>

  # qemu immediately wakes up, receives an (8 byte?) message, and closes
  # the socket too
  9495  <... select resumed> ) = 1 (in [4])
  9495  recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"", 8}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 0
  9495  close(4 <unfinished ...>

  # a few moments after this, another qemu thread starts doing
  # Bad Stuff like this (this did not happen earlier in the strace)
  9497  write(-1, "r", 1) = -1 EBADF (Bad file descriptor)

So it appears as if (some struct) -> fd is being set to -1 because of the closed socket.
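To connect the dots in plain POSIX terms (again an illustration added here, not part of the report): recvmsg()/recv() returning 0 on a stream socket means the peer closed; if the reader then closes its descriptor and records it as -1, any later write through the stale value fails with EBADF, which matches the write(-1, "r", 1) = -1 EBADF lines above:

  /* Sketch of the sequence the strace shows: peer close -> recv() == 0 ->
   * local close, fd recorded as -1 -> subsequent write fails with EBADF. */
  #include <errno.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  int main(void)
  {
      int sv[2];
      char buf[8];

      if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) {
          perror("socketpair");
          return 1;
      }

      close(sv[1]);                                  /* peer (libguestfs) closes first */

      ssize_t n = recv(sv[0], buf, sizeof buf, 0);   /* returns 0: end of stream */
      printf("recv returned %zd\n", n);

      close(sv[0]);                                  /* we close too ...               */
      int fd = -1;                                   /* ... and remember the fd as -1  */

      if (write(fd, "r", 1) == -1)                   /* later code still writes to it  */
          printf("write failed: %s\n", strerror(errno));   /* prints: Bad file descriptor */
      return 0;
  }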
This is dumb. If the socket is closed, qemu "reconnects" it, or to be more precise, it sets the s->connected flag and carries on regardless.

Reverting this commit fixes it for me:
http://git.qemu.org/?p=qemu.git;a=commitdiff;h=c3767ed0eb5d0bb25fe409ae5dec06e3411ff1b6

Furthermore, I see that aliguori already found a related bug and fixed it in qemu 6 days ago:
http://git.qemu.org/?p=qemu.git;a=commitdiff;h=455aa1e0818653c41fd794435b982426ce21ba2f

HOWEVER, Anthony's patch does NOT fix this bug for me. Only a complete revert of c3767ed0eb5d0bb25 fixes it.
virt-viewer-0.5.3-5.fc18, spice-gtk-0.13.29-2.fc18, qemu-1.2.0-3.fc18, spice-0.11.3-1.fc18, spice-protocol-0.12.1-1.fc18, usbredir-0.5-1.fc18 has been submitted as an update for Fedora 18. https://admin.fedoraproject.org/updates/FEDORA-2012-13513/spice-0.11.3-1.fc18,spice-protocol-0.12.1-1.fc18,virt-viewer-0.5.3-5.fc18,spice-gtk-0.13.29-2.fc18,qemu-1.2.0-3.fc18,usbredir-0.5-1.fc18
Anthony's response is encouraging: https://lists.gnu.org/archive/html/qemu-devel/2012-09/msg01814.html although at time of writing this change has not gone upstream.
qemu-1.2.0-3.fc18, virt-viewer-0.5.3-6.fc18, spice-gtk-0.13.29-4.fc18, spice-0.11.3-1.fc18, spice-protocol-0.12.1-1.fc18, usbredir-0.5-1.fc18 has been pushed to the Fedora 18 stable repository. If problems still persist, please make note of it in this bug report.
Closing since this should be fixed in rawhide and F18 now.