Bug 240406

Summary: xen-vncfb processes remain around for long-dead domains
Product: Fedora                  Reporter: Richard W.M. Jones <rjones>
Component: xen                   Assignee: Markus Armbruster <armbru>
Status: CLOSED RAWHIDE           CC: katzj, xen-maint
Severity: medium                 Priority: medium
Version: rawhide                 Hardware: x86_64
OS: Linux                        Doc Type: Bug Fix
Last Closed: 2007-06-21 11:09:02 EDT

Attachments:
Parts of xen/FC-6/xen-pvfb-terminate.patch lost in rebase

Description Richard W.M. Jones 2007-05-17 07:30:17 EDT
Description of problem:

After a morning of stress-testing Xen using four Fedora Core 6 PV guests, there
are xen-vncfb processes hanging around even when no domains are running:

# /usr/sbin/xm list
Name                                      ID   Mem VCPUs      State   Time(s)
Domain-0                                   0  2984     4     r-----  12725.0
centos5                                        256     1                 0.2
fc6                                            256     1                25.5
fc6-2                                          256     1                23.9
fc6-3                                          256     1                24.0
fc6-4                                          256     1                21.0
freebsd32                                      256     1                 0.0

# ps axww | grep xen-vncfb
 5648 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen
127.0.0.1 -k en-us --domid 44 --title fc6-3
 6528 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen
127.0.0.1 -k en-us --domid 133 --title fc6-3
11368 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen
127.0.0.1 -k en-us --domid 50 --title fc6-3
15325 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen
127.0.0.1 -k en-us --domid 156 --title fc6-2
16758 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen
127.0.0.1 -k en-us --domid 215 --title fc6-3
18614 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen
127.0.0.1 -k en-us --domid 217 --title fc6-3
21397 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen
127.0.0.1 -k en-us --domid 176 --title fc6-3
22396 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen
127.0.0.1 -k en-us --domid 233 --title fc6-4
23209 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen
127.0.0.1 -k en-us --domid 235 --title fc6-4
24376 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen
127.0.0.1 -k en-us --domid 238 --title fc6-4
25080 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen
127.0.0.1 -k en-us --domid 30 --title fc6-4
27493 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen
127.0.0.1 -k en-us --domid 180 --title fc6
27820 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen
127.0.0.1 -k en-us --domid 35 --title fc6-2
28767 pts/10   S+     0:00 grep xen-vncfb
29970 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen
127.0.0.1 -k en-us --domid 39 --title fc6-2
30632 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen
127.0.0.1 -k en-us --domid 184 --title fc6-4

The stack traces look strange:

# gdb /usr/lib/debug/usr/lib64/xen/bin/xen-vncfb.debug 30632 
GNU gdb Red Hat Linux (6.6-8.fc7rh)
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
Using host libthread_db library "/lib64/libthread_db.so.1".
Attaching to program: /usr/lib/debug/usr/lib64/xen/bin/xen-vncfb.debug, process
30632

warning: .dynamic section for "/usr/lib64/libxenctrl.so.3.0" is not at the
expected address (wrong library or version mismatch?)

warning: .dynamic section for "/usr/lib64/libxenstore.so.3.0" is not at the
expected address (wrong library or version mismatch?)
Reading symbols from /lib64/libnsl.so.1...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /lib64/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 46912500671552 (LWP 30632)]
[New Thread 1084229952 (LWP 30633)]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libz.so.1...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /usr/lib64/libxenctrl.so.3.0...Reading symbols from
/usr/lib/debug/usr/lib64/libxenctrl.so.3.0.0.debug...done.
done.
Loaded symbols for /usr/lib64/libxenctrl.so.3.0
Reading symbols from /usr/lib64/libxenstore.so.3.0...Reading symbols from
/usr/lib/debug/usr/lib64/libxenstore.so.3.0.0.debug...done.
done.
Loaded symbols for /usr/lib64/libxenstore.so.3.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2

warning: Unable to find dynamic linker breakpoint function.
GDB will be unable to debug shared library initializers
and track explicitly loaded dynamic code.
0x000000303100a2c6 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
(gdb) bt
#0  0x000000303100a2c6 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00002aaaaacfcc53 in ?? ()
#2  0x0000000000000000 in ?? ()
(gdb) info threads
  2 Thread 1084229952 (LWP 30633)  0x000000303100cabb in read ()
   from /lib64/libpthread.so.0
  1 Thread 46912500671552 (LWP 30632)  0x000000303100a2c6 in
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) thread 2
[Switching to thread 2 (Thread 1084229952 (LWP 30633))]#0  0x000000303100cabb in
read () from /lib64/libpthread.so.0
(gdb) bt
#0  0x000000303100cabb in read () from /lib64/libpthread.so.0
#1  0x00002aaaaacfb97a in ?? ()
#2  0x0000000000645a60 in ?? ()
#3  0x0000000000645d00 in ?? ()
#4  0x0000000000000000 in ?? ()

Restarting xend does not make them go away.

Version-Release number of selected component (if applicable):

Up-to-date Fedora 7.

Additional info:

Methodology of the stress tests: http://et.redhat.com/~rjones/xen-stress-tests/
Comment 1 Daniel Berrange 2007-05-17 15:04:53 EDT
This appears to be a problem with the backward-compatibility code. If the guest
startup process is aborted during device hotplug, the vnc process gets stuck
waiting for the frontend to initialize and never quits.

To reproduce:

 - Create a managed guest, e.g. 'xm new foo'
 - Edit /etc/xen/scripts/vif-bridge and add 'exit 1' as the second line
 - Run 'xm start foo'
 - Wait for it to time out and fail
 - Run 'xm destroy foo'

You should now have a vnc process stuck in xenfb_using_old_protocol waiting on
some watch event that I guess will never come.
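
To make the failure mode concrete, here is a minimal sketch of that wait
pattern, using only public libxenstore calls (xs_daemon_open, xs_watch,
xs_read_watch, xs_read); the xenstore path, watch token and state value are
illustrative and not copied from the real xenfb compat code:

/*
 * Minimal sketch of the blocking wait described above; NOT the actual
 * xenfb compat code.  Path, token and state value are illustrative.
 */
#include <stdio.h>
#include <stdlib.h>
#include <xs.h>                         /* libxenstore */

static int wait_for_frontend(int domid)
{
    struct xs_handle *xsh = xs_daemon_open();
    char path[64];
    unsigned int num, len;

    if (!xsh)
        return -1;

    /* Illustrative frontend state node for the vfb device. */
    snprintf(path, sizeof(path), "/local/domain/%d/device/vfb/0/state", domid);
    if (!xs_watch(xsh, path, "fe-state")) {
        xs_daemon_close(xsh);
        return -1;
    }

    for (;;) {
        char **vec = xs_read_watch(xsh, &num);  /* blocks, no timeout */
        char *val;

        if (!vec)
            break;
        free(vec);

        val = xs_read(xsh, XBT_NULL, path, &len);
        if (val && atoi(val) == 3) {            /* XenbusStateInitialised */
            free(val);
            xs_daemon_close(xsh);
            return 0;
        }
        /* If the domain has already been destroyed, the node never reaches
         * the wanted state (xs_read simply returns NULL), so we loop straight
         * back into xs_read_watch() and block forever: the hang above. */
        free(val);
    }
    xs_daemon_close(xsh);
    return -1;
}

The real wait is in xenfb.c's compat path; see the xenfb_attach_dom frames in
the backtraces in the next comment.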
Comment 2 Richard W.M. Jones 2007-05-17 15:26:22 EDT
I'm having trouble getting a correct core dump of these processes (they're
running, but neither sending them SIGQUIT nor using gcore(1) works).
However, here is a collection of stack traces taken from hanging processes
whose binaries were rebuilt with -O0 -g.



This one is very common:
(gdb) bt
#0  0x000000303100a2c6 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00002aaaaacfcc53 in xs_read_watch (h=0x651a60, num=0x7fff30d35f28)
    at xs.c:587
#2  0x000000000040518f in xenfb_attach_dom (xenfb_pub=0x651880, domid=733)
    at xenfb.c:333
#3  0x0000000000403848 in main (argc=<value optimized out>, 
    argv=0x7fff30d36708) at vncfb.c:335
#4  0x000000303041da54 in __libc_start_main () from /lib64/libc.so.6
#5  0x0000000000402df9 in _start ()
(gdb) frame 0
#0  0x000000303100a2c6 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
(gdb) frame 1
#1  0x00002aaaaacfcc53 in xs_read_watch (h=0x651a60, num=0x7fff30d35f28)
    at xs.c:587
587                     pthread_cond_wait(&h->watch_condvar, &h->watch_mutex);
(gdb) print *h
$1 = {fd = 5, read_thr = 1084229952, read_thr_exists = 1, watch_list = {
    next = 0x651a78, prev = 0x651a78}, watch_mutex = {__data = {__lock = 0, 
      __count = 0, __owner = 0, __nusers = 1, __kind = 0, __spins = 0, 
      __list = {__prev = 0x0, __next = 0x0}}, 
    __size = '\0' <repeats 12 times>, "\001", '\0' <repeats 26 times>, 
    __align = 0}, watch_condvar = {__data = {__lock = 0, __futex = 3, 
      __total_seq = 2, __wakeup_seq = 1, __woken_seq = 1, __mutex = 0x651a88, 
      __nwaiters = 2, __broadcast_seq = 0}, 
    __size =
"\000\000\000\000\003\000\000\000\002\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\210\032e\000\000\000\000\000\002\000\000\000\000\000\000",
__align = 12884901888}, 
  watch_pipe = {-1, -1}, reply_list = {next = 0x651ae8, prev = 0x651ae8}, 
  reply_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, 
      __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, 
    __size = '\0' <repeats 39 times>, __align = 0}, reply_condvar = {__data = {
      __lock = 0, __futex = 74, __total_seq = 37, __wakeup_seq = 37, 
      __woken_seq = 37, __mutex = 0x651af8, __nwaiters = 0, 
      __broadcast_seq = 0}, 
    __size =
"\000\000\000\000J\000\000\000%\000\000\000\000\000\000\000%\000\000\000\000\000\000\000%\000\000\000\000\000\000\000�\032e",
'\0' <repeats 12 times>, __align = 317827579904}, request_mutex = {__data =
{__lock = 0, 
      __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, 
      __list = {__prev = 0x0, __next = 0x0}}, 
    __size = '\0' <repeats 39 times>, __align = 0}}
(gdb) frame 2
#2  0x000000000040518f in xenfb_attach_dom (xenfb_pub=0x651880, domid=733)
    at xenfb.c:333
333                     vec = xs_read_watch(xsh, &dummy);
(gdb) print xsh
$2 = (struct xs_handle *) 0x651a60
(gdb) print dummy
No symbol "dummy" in current context.
(gdb) frame 3
#3  0x0000000000403848 in main (argc=<value optimized out>, 
    argv=0x7fff30d36708) at vncfb.c:335
335             ret = xenfb_attach_dom(xenfb, domid);
(gdb) print xenfb
$4 = (struct xenfb *) 0x651880
(gdb) print *xenfb
$5 = {pixels = 0x0, row_stride = 0, depth = 0, width = 0, height = 0, 
  abs_pointer_wanted = 0, user_data = 0x0, update = 0}
(gdb) frame 4
#4  0x000000303041da54 in __libc_start_main () from /lib64/libc.so.6
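
For what it's worth, the xs_handle dump above also shows why this thread can
only be woken by the handle's reader thread: watch_list is empty (next ==
prev == 0x651a78) and watch_condvar has been waited on more often than it has
been signalled (__total_seq = 2, __wakeup_seq = 1), i.e. the thread was
presumably woken once by the initial watch firing and is now waiting for a
second event that never arrives.  The wait at xs.c:587 has roughly this shape
(a simplified, compilable paraphrase, not the real xs.c):

/*
 * Simplified paraphrase of the wait at xs.c:587 in frame #1 above; not the
 * real xs.c.  The reader thread (thread 2, seen blocked in read() in the
 * gdb session in the description) is the only thing that ever queues a
 * watch event and signals the condvar.  Until it does, the main thread
 * sleeps in pthread_cond_wait(), which is exactly where every hung
 * xen-vncfb process is sitting.
 */
#include <pthread.h>
#include <stdbool.h>

struct handle {
    pthread_mutex_t watch_mutex;
    pthread_cond_t  watch_condvar;
    bool            watch_pending;      /* stands in for watch_list above */
};

/* Block until the reader thread has queued a watch event. */
static void wait_for_watch_event(struct handle *h)
{
    pthread_mutex_lock(&h->watch_mutex);
    while (!h->watch_pending)                   /* list still empty */
        pthread_cond_wait(&h->watch_condvar, &h->watch_mutex);  /* frame #0 */
    h->watch_pending = false;
    pthread_mutex_unlock(&h->watch_mutex);
}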



I've only seen this one three or four times:
#0  0x000000303100a2c6 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x0000000000408696 in rfbClientConnectionGone (cl=0x653100)
    at rfbserver.c:496
#2  0x000000000040747a in rfbScreenCleanup (screen=0x652220) at main.c:988
#3  0x0000000000403c06 in main (argc=<value optimized out>, 
    argv=<value optimized out>) at vncfb.c:427
#4  0x000000303041da54 in __libc_start_main () from /lib64/libc.so.6
#5  0x0000000000402df9 in _start ()


Comment 3 Markus Armbruster 2007-05-17 16:29:18 EDT
Re last backtrace in comment #2: this is almost certainly another symptom of bug
240012.
Comment 4 Markus Armbruster 2007-05-17 17:39:39 EDT
The vfb backend bug that kept it from terminating when the frontend vanishes
before the connection is established was fixed upstream in cset
14208:0d5d7d472024.  That fix doesn't cover our compatibility code.  Our
complete fix went into FC-[56]/xen-pvfb-terminate.patch.  When we rebased to
3.1.0, we replaced our complete fix with upstream's fix, and thus regressed.
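
For illustration only (this is not the attached patch), the shape of the lost
check is roughly: while waiting for the frontend, treat removal of its
xenstore node as a terminal event instead of looping back into the blocking
xs_read_watch().  A sketch, reusing the illustrative path and token from the
sketch in comment 1:

/*
 * Sketch of the missing termination check; NOT the attached patch.
 * Compared with the sketch in comment 1, the only behavioural change is
 * that a vanished frontend node now ends the wait instead of sending us
 * back into xs_read_watch() forever.
 */
#include <stdlib.h>
#include <xs.h>

/* Returns 0 once the node reaches 'wanted', -1 if it disappears. */
static int wait_or_terminate(struct xs_handle *xsh, const char *path, int wanted)
{
    unsigned int num, len;

    if (!xs_watch(xsh, path, "fe-state"))
        return -1;

    for (;;) {
        char **vec = xs_read_watch(xsh, &num);  /* wakes on change or removal */
        char *val;

        if (!vec)
            break;
        free(vec);

        val = xs_read(xsh, XBT_NULL, path, &len);
        if (!val)
            break;          /* node gone: domain destroyed, so give up */
        if (atoi(val) == wanted) {
            free(val);
            xs_unwatch(xsh, path, "fe-state");
            return 0;
        }
        free(val);
    }
    xs_unwatch(xsh, path, "fe-state");
    return -1;
}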

Comment 5 Markus Armbruster 2007-05-17 17:41:57 EDT
Created attachment 154958 [details]
Parts of xen/FC-6/xen-pvfb-terminate.patch lost in rebase
Comment 6 Richard W.M. Jones 2007-05-18 08:54:43 EDT
Updated patch from Markus:

http://pastebin.ca/494800

I still need to test it.
Comment 7 Markus Armbruster 2007-05-18 09:26:37 EDT
The patch on pastebin is the wrong one; ignore it.  The patch attached here is
still up-to-date.  Testing is still appreciated.
Comment 8 Richard W.M. Jones 2007-06-12 04:58:02 EDT
Testing is still underway (my F7 machine is dead), but just a note to say that
I could also try the patch from bug 230634.