Bug 853408 - qemu segfaults on shutdown when writing to a closed serial socket (not in server mode)
Summary: qemu segfaults on shutdown when writing to a closed serial socket (not in server mode)
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: qemu
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Paolo Bonzini
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-08-31 12:42 UTC by Richard W.M. Jones
Modified: 2013-01-09 12:07 UTC
CC List: 11 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2012-09-30 23:01:42 UTC
Type: Bug
Embargoed:



Description Richard W.M. Jones 2012-08-31 12:42:34 UTC
Description of problem:

For a reproducer, see https://bugzilla.redhat.com/show_bug.cgi?id=853369

Version-Release number of selected component (if applicable):

qemu-1.2.0-0.4.rc1.fc19.x86_64

How reproducible:

100%

Additional info:

Stack trace to follow.

Comment 1 Richard W.M. Jones 2012-08-31 12:53:05 UTC
Stack trace below.  Note that it's thread #1 at the bottom
which actually segfaults.

Thread 17 (Thread 0x7f16f17fa700 (LWP 23318)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f16f17f9bb0, 
    cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x7f16f37fe700 (LWP 23314)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f16f37fdbb0, 
    cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7f172f83c700 (LWP 23305)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f172f83bbb0, 
    cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7f172c836700 (LWP 23311)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f172c835bb0, 
    cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7f16f2ffd700 (LWP 23315)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f16f2ffcbb0, 
    cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f172d037700 (LWP 23310)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f172d036bb0, 
    cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f1730d46700 (LWP 23302)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f1730d45bb0, 
    cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f172d838700 (LWP 23309)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f172d837bb0, 
    cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f16f3fff700 (LWP 23313)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f16f3ffebb0, 
    cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f16f1ffb700 (LWP 23317)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f16f1ffabb0, 
    cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f172f03b700 (LWP 23306)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f172f03abb0, 
    cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f172e83a700 (LWP 23307)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f172e839bb0, 
    cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f172e039700 (LWP 23308)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f172e038bb0, 
    cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f16f27fc700 (LWP 23316)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f16f27fbbb0, 
    cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f17089fe700 (LWP 23312)):
#0  0x00007f173b35d952 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f173d486b19 in cond_timedwait (ts=0x7f17089fdbb0, 
    cond=<optimized out>, mutex=<optimized out>) at posix-aio-compat.c:107
#2  aio_thread (unused=<optimized out>) at posix-aio-compat.c:337
#3  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f1730545700 (LWP 23303)):
#0  0x00007f173b35fe4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f173b35bca6 in _L_lock_840 () from /lib64/libpthread.so.0
#2  0x00007f173b35bba8 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f173d4a55d9 in qemu_mutex_lock (mutex=mutex@entry=
    0x7f173e1c34e0 <qemu_global_mutex>) at qemu-thread-posix.c:54
#4  0x00007f173d4e4df0 in qemu_mutex_lock_iothread ()
    at /usr/src/debug/qemu-kvm-1.2.0-rc1/cpus.c:908
#5  0x00007f173d532e86 in kvm_cpu_exec (env=env@entry=0x7f17401a4090)
    at /usr/src/debug/qemu-kvm-1.2.0-rc1/kvm-all.c:1582
#6  0x00007f173d4e3981 in qemu_kvm_cpu_thread_fn (arg=0x7f17401a4090)
    at /usr/src/debug/qemu-kvm-1.2.0-rc1/cpus.c:757
#7  0x00007f173b359d15 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f17379ea96d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f1730d4ea00 (LWP 23301)):
#0  0x00007f173792eba5 in raise () from /lib64/libc.so.6
#1  0x00007f1737930358 in abort () from /lib64/libc.so.6
#2  0x00007f173796e01b in __libc_message () from /lib64/libc.so.6
#3  0x00007f1737a02b97 in __fortify_fail () from /lib64/libc.so.6
#4  0x00007f1737a00d10 in __chk_fail () from /lib64/libc.so.6
#5  0x00007f1737a02b07 in __fdelt_warn () from /lib64/libc.so.6
#6  0x00007f173d3d373d in qemu_iohandler_poll (readfds=readfds@entry=
    0x7f173dd97b60 <rfds>, writefds=writefds@entry=0x7f173dd97be0 <wfds>, 
    xfds=xfds@entry=0x7f173dd97c60 <xfds>, ret=ret@entry=1) at iohandler.c:156
#7  0x00007f173d477eb8 in main_loop_wait (nonblocking=<optimized out>)
    at main-loop.c:497
#8  0x00007f173d355b13 in main_loop ()
    at /usr/src/debug/qemu-kvm-1.2.0-rc1/vl.c:1643
#9  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
    at /usr/src/debug/qemu-kvm-1.2.0-rc1/vl.c:3790

Comment 2 Richard W.M. Jones 2012-08-31 12:55:52 UTC
(gdb) frame 6
#6  0x00007f173d3d373d in qemu_iohandler_poll (readfds=readfds@entry=
    0x7f173dd97b60 <rfds>, writefds=writefds@entry=0x7f173dd97be0 <wfds>, 
    xfds=xfds@entry=0x7f173dd97c60 <xfds>, ret=ret@entry=1) at iohandler.c:156
156	            if (!ioh->deleted && ioh->fd_read && FD_ISSET(ioh->fd, readfds)) {
(gdb) print *ioh
$1 = {
  fd_read_poll = 0x7f173d49a7f0 <tcp_chr_read_poll>, 
  fd_read = 0x7f173d49be30 <tcp_chr_read>, 
  fd_write = 0x0, 
  opaque = 0x7f173fe3a380, 
  next = {
    le_next = 0x7f1728003310, 
    le_prev = 0x7f173dd96a70 <io_handlers>
  }, 
  fd = -1, 
  deleted = false
}

Note that ioh->fd == -1 and ioh->deleted == false, so it ends up
calling FD_ISSET (-1, readfds), which trips the glibc
_FORTIFY_SOURCE bounds check (__fdelt_warn -> __chk_fail) and
aborts -- exactly the bottom of the thread 1 backtrace above.
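
For reference, here is a tiny standalone program (my illustration, not
qemu or libguestfs code) that hits the same abort path, assuming glibc
and the usual Fedora build flags (-O2 -D_FORTIFY_SOURCE=2):

/* fdisset-neg.c -- illustration only.  FD_ISSET() on a negative fd goes
 * through glibc's __fdelt bounds check and calls __chk_fail() -> abort(),
 * matching frames #0-#5 of thread 1 in comment 1.
 *
 *   gcc -O2 -D_FORTIFY_SOURCE=2 fdisset-neg.c
 */
#include <sys/select.h>

int main(void)
{
    fd_set readfds;
    int fd = -1;                    /* a stale fd, like ioh->fd above */

    FD_ZERO(&readfds);
    return FD_ISSET(fd, &readfds) ? 1 : 0;   /* aborts here */
}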

Also note that tcp_chr_read indicates that this 'ioh' is
associated with a qemu char device.  libguestfs uses a
console device and a virtio-serial device so it might be
either of these (or something else).

Comment 3 Richard W.M. Jones 2012-09-05 12:09:10 UTC
I cannot reproduce this problem with any of:
 * upstream qemu-kvm.git,
 * the qemu-kvm tag qemu-kvm-1.2.0-rc1, or
 * qemu.git.

I wonder if it has something to do with all the chardev
patches that we pile on top?  (What are those patches
even for??)  I will try these patches next.

Comment 4 Richard W.M. Jones 2012-09-05 13:02:05 UTC
(In reply to comment #3)
> I cannot reproduce this problem with any of:
>  * upstream qemu-kvm.git,
>  * the qemu-kvm tag qemu-kvm-1.2.0-rc1, or
>  * qemu.git.

Please ignore this comment completely.  For some
reason the problem is only happening when qemu runs
via libvirt, and my testing did not cover this case.

Comment 5 Richard W.M. Jones 2012-09-05 13:06:17 UTC
Results of tests:

 * upstream qemu-kvm.git => CAN reproduce it
 * upstream qemu.git => CANNOT reproduce it

Test procedure is:

./configure --with-default-attach-method=libvirt
make
export LIBGUESTFS_QEMU=~/d/qemu/qemu.wrapper
./run ./fish/guestfish -N fs exit
echo $?

Comment 6 Richard W.M. Jones 2012-09-12 12:28:47 UTC
Back from holiday now!  The bug still happens with
qemu-1.2.0-2.fc18.x86_64.  I will try to get a minimal
standalone reproducer (i.e. without libguestfs) for this today.

Comment 7 Richard W.M. Jones 2012-09-12 15:56:35 UTC
A reproducer is proving hard to pin down.  However, here are some
more details from the core file:

(gdb) print *ioh
$1 = {
  fd_read_poll = 0x7f251d9d8e00 <tcp_chr_read_poll>, 
  fd_read = 0x7f251d9da440 <tcp_chr_read>, 
  fd_write = 0x0, 
  opaque = 0x7f251fb9f400, 
  next = {
    le_next = 0x7f250c003310, 
    le_prev = 0x7f251e2d7dd0 <io_handlers>
  }, 
  fd = -1, 
  deleted = false
}

(gdb) print *(CharDriverState*)ioh->opaque
$4 = {
  init = 0x0, 
  chr_write = 0x7f251d9dc640 <tcp_chr_write>, 
  chr_update_read_handler = 0x0, 
  chr_ioctl = 0x0, 
  get_msgfd = 0x7f251d9d8e50 <tcp_get_msgfd>, 
  chr_add_client = 0x7f251d9db550 <tcp_chr_add_client>, 
  chr_event = 0x7f251d993cc0 <serial_event>, 
  chr_can_read = 0x7f251d994410 <serial_can_receive1>, 
  chr_read = 0x7f251d993d30 <serial_receive1>, 
  chr_write_unblocked = 0x0, 
  chr_enable_write_fd_handler = 0x7f251d9d9a10 <tcp_enable_write_fd_handler>, 
  chr_disable_write_fd_handler = 
    0x7f251d9d99d0 <tcp_disable_write_fd_handler>, 
  handler_opaque = 0x7f251ff80858, 
  chr_close = 0x7f251d9da780 <tcp_chr_close>, 
  chr_accept_input = 0x0, 
  chr_set_echo = 0x0, 
  chr_guest_open = 0x0, 
  chr_guest_close = 0x0, 
  opaque = 0x7f251fb9f4e0, 
  bh = 0x7f250c000b40, 
  label = 0x7f251fb9f690 "charserial0", 
  filename = 
    0x7f251fb9f580 "unix:/home/rjones/d/libguestfs/libguestfswfMneI/console.sock", 
  opened = 0, 
  avail_connections = 0, 
  write_blocked = false, 
  next = {
    tqe_next = 0x7f251fb9f6b0, 
    tqe_prev = 0x7f251fb9f240
  }
}

(gdb) print *(TCPCharDriver*)$4->opaque
$5 = {
  fd = -1, 
  listen_fd = -1, 
  connected = 1, 
  max_size = 8, 
  do_telnetopt = 0, 
  do_nodelay = 0, 
  is_unix = 1, 
  msgfd = -1
}

Some notes:

(1) console.sock is created via the following command line:

-chardev socket,id=charserial0,path=/home/rjones/d/libguestfs/libguestfswfMneI/console.sock -device isa-serial,chardev=charserial0,id=serial0

(2) the crash happens on shutdown

(3) libguestfs is reading from console.sock, and the guest is
printing a lot of output during both boot-up and shutdown (a rough
sketch of this setup is below)
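
To make the setup a bit more concrete, here is a rough sketch of the
listening side (my own illustration of roughly what libguestfs does
here, not libguestfs code; the path and the amount read are
placeholders):

/* console-listener.c -- illustration only.  Create the unix socket,
 * accept qemu's connection (start qemu separately with the -chardev /
 * -device options from note (1)), read some console output, then close
 * the socket while the guest is most likely still writing.
 */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/console.sock";   /* placeholder path */
    struct sockaddr_un addr;
    char buf[4096];
    ssize_t n;
    size_t total = 0;

    int lfd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (lfd == -1) { perror("socket"); return 1; }

    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof addr.sun_path - 1);
    unlink(path);
    if (bind(lfd, (struct sockaddr *) &addr, sizeof addr) == -1 ||
        listen(lfd, 1) == -1) {
        perror("bind/listen");
        return 1;
    }

    int fd = accept(lfd, NULL, NULL);          /* qemu connects here */
    if (fd == -1) { perror("accept"); return 1; }

    /* Drain a bit of boot-time console output ... */
    while (total < 64 * 1024 && (n = read(fd, buf, sizeof buf)) > 0)
        total += (size_t) n;

    /* ... then drop the connection while the guest keeps printing,
     * i.e. close the console socket before qemu itself has shut down. */
    close(fd);
    close(lfd);
    return 0;
}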

Comment 8 Paolo Bonzini 2012-09-12 16:28:10 UTC
This should fix it, but it's a bit ugly that we do not know the cause:

diff --git a/qemu-char.c b/qemu-char.c
index 767da93..6e1b9e0 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -2334,8 +2334,10 @@ static void tcp_chr_connect(void *opaque)
     TCPCharDriver *s = chr->opaque;
 
     s->connected = 1;
-    qemu_set_fd_handler2(s->fd, tcp_chr_read_poll,
-                         tcp_chr_read, NULL, chr);
+    if (s->fd == -1) {
+        qemu_set_fd_handler2(s->fd, tcp_chr_read_poll,
+                             tcp_chr_read, NULL, chr);
+    }
     qemu_chr_generic_open(chr);
 }

Comment 9 Richard W.M. Jones 2012-09-12 17:31:26 UTC
That patch causes it to hang at start-up ...

Comment 10 Richard W.M. Jones 2012-09-12 17:48:56 UTC
So here's an interesting thing.

I can "fix" this by changing the order of this code in libguestfs:

  /* BROKEN. */
  if (g->fd[0] >= 0)  // this is the console socket
    close (g->fd[0]);
  if (g->attach_ops->shutdown (g, check_for_errors) == -1) // kills qemu
    ret = -1;

  /* WORKING. */
  if (g->attach_ops->shutdown (g, check_for_errors) == -1)
    ret = -1;
  if (g->fd[0] >= 0)
    close (g->fd[0]);

What would happen if qemu was receiving a lot of serial data
from the guest and writing it to this socket, and the
other end closed down unexpectedly?  I'm guessing a write
somewhere would get -EPIPE, and that qemu doesn't handle
this situation well.
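
To illustrate that guess (my own sketch, not qemu code): once the
reading end of a unix stream socket has gone away, the writer gets
SIGPIPE, or -1/EPIPE if SIGPIPE is ignored, which as far as I can tell
qemu does.

/* epipe-demo.c -- illustration only. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    ssize_t n;

    signal(SIGPIPE, SIG_IGN);    /* so write() returns instead of killing us */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return 1;

    close(sv[1]);                /* the reader goes away ... */
    n = write(sv[0], "r", 1);    /* ... and the writer gets: */
    printf("write returned %zd, errno = %s\n", n, strerror(errno));
    /* prints: write returned -1, errno = Broken pipe */
    return 0;
}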

I will try to strace qemu and see if I can verify this.

Comment 11 Richard W.M. Jones 2012-09-12 18:11:46 UTC
The strace shows:

# libguestfs closes the socket
9330  close(8 <unfinished ...>

# qemu immediately wakes up, receives an (8 byte?) message, and closes
# the socket too
9495  <... select resumed> )            = 1 (in [4])
9495  recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"", 8}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 0
9495  close(4 <unfinished ...>

# a few moments after this, another qemu thread starts doing
# Bad Stuff like this (this did not happen earlier in the strace)
9497  write(-1, "r", 1)                 = -1 EBADF (Bad file descriptor)


So it appears as if (some struct) -> fd is being set to -1
because of the closed socket.

Comment 12 Richard W.M. Jones 2012-09-12 18:27:36 UTC
This is dumb.  If the socket is closed, qemu "reconnects"
it, or to be more precise, it sets the s->connected flag
and carries on regardless.

Reverting this commit fixes it for me:

http://git.qemu.org/?p=qemu.git;a=commitdiff;h=c3767ed0eb5d0bb25fe409ae5dec06e3411ff1b6

Furthermore, I see that aliguori already found a
related bug and fixed it in qemu 6 days ago:

http://git.qemu.org/?p=qemu.git;a=commitdiff;h=455aa1e0818653c41fd794435b982426ce21ba2f

HOWEVER Anthony's patch does NOT fix this bug for
me.  Only a complete revert of c3767ed0eb5d0bb25 fixes it.

Comment 13 Fedora Update System 2012-09-12 20:15:54 UTC
virt-viewer-0.5.3-5.fc18, spice-gtk-0.13.29-2.fc18, qemu-1.2.0-3.fc18, spice-0.11.3-1.fc18, spice-protocol-0.12.1-1.fc18, usbredir-0.5-1.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/FEDORA-2012-13513/spice-0.11.3-1.fc18,spice-protocol-0.12.1-1.fc18,virt-viewer-0.5.3-5.fc18,spice-gtk-0.13.29-2.fc18,qemu-1.2.0-3.fc18,usbredir-0.5-1.fc18

Comment 14 Richard W.M. Jones 2012-09-17 13:06:08 UTC
Anthony's response is encouraging:
https://lists.gnu.org/archive/html/qemu-devel/2012-09/msg01814.html
although at time of writing this change has not gone upstream.

Comment 15 Fedora Update System 2012-09-17 22:50:00 UTC
qemu-1.2.0-3.fc18, virt-viewer-0.5.3-6.fc18, spice-gtk-0.13.29-4.fc18, spice-0.11.3-1.fc18, spice-protocol-0.12.1-1.fc18, usbredir-0.5-1.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 16 Cole Robinson 2012-09-30 23:01:42 UTC
Closing since this should be fixed in rawhide and F18 now.

