Description of problem:

After a morning of stress-testing Xen using four Fedora Core 6 PV guests, there are xen-vncfb processes hanging around even when no domains are running:

# /usr/sbin/xm list
Name                                      ID   Mem VCPUs      State   Time(s)
Domain-0                                   0  2984     4     r-----  12725.0
centos5                                          256     1                0.2
fc6                                              256     1               25.5
fc6-2                                            256     1               23.9
fc6-3                                            256     1               24.0
fc6-4                                            256     1               21.0
freebsd32                                        256     1                0.0

# ps axww | grep xen-vncfb
 5648 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen 127.0.0.1 -k en-us --domid 44 --title fc6-3
 6528 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen 127.0.0.1 -k en-us --domid 133 --title fc6-3
11368 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen 127.0.0.1 -k en-us --domid 50 --title fc6-3
15325 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen 127.0.0.1 -k en-us --domid 156 --title fc6-2
16758 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen 127.0.0.1 -k en-us --domid 215 --title fc6-3
18614 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen 127.0.0.1 -k en-us --domid 217 --title fc6-3
21397 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen 127.0.0.1 -k en-us --domid 176 --title fc6-3
22396 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen 127.0.0.1 -k en-us --domid 233 --title fc6-4
23209 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen 127.0.0.1 -k en-us --domid 235 --title fc6-4
24376 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen 127.0.0.1 -k en-us --domid 238 --title fc6-4
25080 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen 127.0.0.1 -k en-us --domid 30 --title fc6-4
27493 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen 127.0.0.1 -k en-us --domid 180 --title fc6
27820 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen 127.0.0.1 -k en-us --domid 35 --title fc6-2
28767 pts/10   S+     0:00 grep xen-vncfb
29970 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen 127.0.0.1 -k en-us --domid 39 --title fc6-2
30632 ?        Sl     0:00 /usr/lib64/xen/bin/xen-vncfb --unused --listen 127.0.0.1 -k en-us --domid 184 --title fc6-4

The stack traces look strange:

# gdb /usr/lib/debug/usr/lib64/xen/bin/xen-vncfb.debug 30632
GNU gdb Red Hat Linux (6.6-8.fc7rh)
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
Using host libthread_db library "/lib64/libthread_db.so.1".
Attaching to program: /usr/lib/debug/usr/lib64/xen/bin/xen-vncfb.debug, process 30632
warning: .dynamic section for "/usr/lib64/libxenctrl.so.3.0" is not at the expected address (wrong library or version mismatch?)
warning: .dynamic section for "/usr/lib64/libxenstore.so.3.0" is not at the expected address (wrong library or version mismatch?)
Reading symbols from /lib64/libnsl.so.1...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /lib64/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 46912500671552 (LWP 30632)]
[New Thread 1084229952 (LWP 30633)]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libz.so.1...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /usr/lib64/libxenctrl.so.3.0...Reading symbols from /usr/lib/debug/usr/lib64/libxenctrl.so.3.0.0.debug...done.
done.
Loaded symbols for /usr/lib64/libxenctrl.so.3.0
Reading symbols from /usr/lib64/libxenstore.so.3.0...Reading symbols from /usr/lib/debug/usr/lib64/libxenstore.so.3.0.0.debug...done.
done.
Loaded symbols for /usr/lib64/libxenstore.so.3.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2

warning: Unable to find dynamic linker breakpoint function.
GDB will be unable to debug shared library initializers
and track explicitly loaded dynamic code.
0x000000303100a2c6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x000000303100a2c6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002aaaaacfcc53 in ?? ()
#2  0x0000000000000000 in ?? ()
(gdb) info threads
  2 Thread 1084229952 (LWP 30633)  0x000000303100cabb in read () from /lib64/libpthread.so.0
  1 Thread 46912500671552 (LWP 30632)  0x000000303100a2c6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) thread 2
[Switching to thread 2 (Thread 1084229952 (LWP 30633))]#0  0x000000303100cabb in read () from /lib64/libpthread.so.0
(gdb) bt
#0  0x000000303100cabb in read () from /lib64/libpthread.so.0
#1  0x00002aaaaacfb97a in ?? ()
#2  0x0000000000645a60 in ?? ()
#3  0x0000000000645d00 in ?? ()
#4  0x0000000000000000 in ?? ()

Restarting xend does not make them go away.

Version-Release number of selected component (if applicable):
Up-to-date Fedora 7.

Additional info:
Methodology of the stress tests:
http://et.redhat.com/~rjones/xen-stress-tests/
This appears to be a problem with the back-compat code. If a guest's startup gets aborted during device hotplug, the vnc process ends up stuck waiting for the frontend to initialize and never quits.

To reproduce:

 - Create a managed guest, e.g. 'xm new foo'
 - Edit /etc/xen/scripts/vif-bridge and add 'exit 1' as the 2nd line
 - Run 'xm start foo'
 - Wait for it to time out and fail
 - Run 'xm destroy foo'

You should now have a vnc process stuck in xenfb_using_old_protocol, waiting on a watch event that I guess will never come.
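The failure mode is a blocking wait with no timeout: the back-compat path sits waiting for a xenstore watch event, and once the domain has been destroyed that event can never arrive. As a generic illustration only (not the actual xenfb code; the helper names below are made up), a select(2)-based wait with a deadline lets a process notice that its peer is gone instead of sleeping forever:

```c
#include <sys/select.h>
#include <unistd.h>

/* Wait until fd becomes readable, but give up after timeout_ms
 * milliseconds.  Returns >0 if readable, 0 on timeout, -1 on error. */
static int wait_readable(int fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    return select(fd + 1, &rfds, NULL, NULL, &tv);
}

/* Simulate the hang: nobody ever writes to the pipe, so a plain
 * blocking read() would sleep forever -- just like waiting on a watch
 * event from a destroyed domain.  With a deadline we detect the dead
 * peer (select returns 0) and can bail out instead. */
static int demo_dead_peer(void)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;
    int r = wait_readable(p[0], 100);   /* no writer: times out */
    close(p[0]);
    close(p[1]);
    return r;                           /* 0 == timed out */
}
```

With this shape, the backend could poll the xenstore connection fd (libxenstore exposes it via xs_fileno()) and periodically re-check whether the frontend still exists, rather than blocking unconditionally.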
I'm having trouble getting a correct core dump for these processes (they're running, and neither sending them SIGQUIT nor using gcore(1) is working). However here is a collection of stack traces from the hanging processes when compiled with -O0 -g.

This one is very common:

(gdb) bt
#0  0x000000303100a2c6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002aaaaacfcc53 in xs_read_watch (h=0x651a60, num=0x7fff30d35f28) at xs.c:587
#2  0x000000000040518f in xenfb_attach_dom (xenfb_pub=0x651880, domid=733) at xenfb.c:333
#3  0x0000000000403848 in main (argc=<value optimized out>, argv=0x7fff30d36708) at vncfb.c:335
#4  0x000000303041da54 in __libc_start_main () from /lib64/libc.so.6
#5  0x0000000000402df9 in _start ()
(gdb) frame 0
#0  0x000000303100a2c6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) frame 1
#1  0x00002aaaaacfcc53 in xs_read_watch (h=0x651a60, num=0x7fff30d35f28) at xs.c:587
587             pthread_cond_wait(&h->watch_condvar, &h->watch_mutex);
(gdb) print *h
$1 = {fd = 5, read_thr = 1084229952, read_thr_exists = 1,
  watch_list = {next = 0x651a78, prev = 0x651a78},
  watch_mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
      __nusers = 1, __kind = 0, __spins = 0,
      __list = {__prev = 0x0, __next = 0x0}},
    __size = '\0' <repeats 12 times>, "\001", '\0' <repeats 26 times>,
    __align = 0},
  watch_condvar = {__data = {__lock = 0, __futex = 3, __total_seq = 2,
      __wakeup_seq = 1, __woken_seq = 1, __mutex = 0x651a88,
      __nwaiters = 2, __broadcast_seq = 0},
    __size = "\000\000\000\000\003\000\000\000\002\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\210\032e\000\000\000\000\000\002\000\000\000\000\000\000",
    __align = 12884901888},
  watch_pipe = {-1, -1},
  reply_list = {next = 0x651ae8, prev = 0x651ae8},
  reply_mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
      __nusers = 0, __kind = 0, __spins = 0,
      __list = {__prev = 0x0, __next = 0x0}},
    __size = '\0' <repeats 39 times>, __align = 0},
  reply_condvar = {__data = {__lock = 0, __futex = 74, __total_seq = 37,
      __wakeup_seq = 37, __woken_seq = 37, __mutex = 0x651af8,
      __nwaiters = 0, __broadcast_seq = 0},
    __size = "\000\000\000\000J\000\000\000%\000\000\000\000\000\000\000%\000\000\000\000\000\000\000%\000\000\000\000\000\000\000\370\032e", '\0' <repeats 12 times>,
    __align = 317827579904},
  request_mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
      __nusers = 0, __kind = 0, __spins = 0,
      __list = {__prev = 0x0, __next = 0x0}},
    __size = '\0' <repeats 39 times>, __align = 0}}
(gdb) frame 2
#2  0x000000000040518f in xenfb_attach_dom (xenfb_pub=0x651880, domid=733) at xenfb.c:333
333         vec = xs_read_watch(xsh, &dummy);
(gdb) print xsh
$2 = (struct xs_handle *) 0x651a60
(gdb) print dummy
No symbol "dummy" in current context.
(gdb) frame 3
#3  0x0000000000403848 in main (argc=<value optimized out>, argv=0x7fff30d36708) at vncfb.c:335
335         ret = xenfb_attach_dom(xenfb, domid);
(gdb) print xenfb
$4 = (struct xenfb *) 0x651880
(gdb) print *xenfb
$5 = {pixels = 0x0, row_stride = 0, depth = 0, width = 0, height = 0,
  abs_pointer_wanted = 0, user_data = 0x0, update = 0}
(gdb) frame 4
#4  0x000000303041da54 in __libc_start_main () from /lib64/libc.so.6

Only seen this one three or four times:

#0  0x000000303100a2c6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000408696 in rfbClientConnectionGone (cl=0x653100) at rfbserver.c:496
#2  0x000000000040747a in rfbScreenCleanup (screen=0x652220) at main.c:988
#3  0x0000000000403c06 in main (argc=<value optimized out>, argv=<value optimized out>) at vncfb.c:427
#4  0x000000303041da54 in __libc_start_main () from /lib64/libc.so.6
#5  0x0000000000402df9 in _start ()
Re last backtrace in comment #2: this is almost certainly another symptom of bug 240012.
The vfb backend bug that caused it not to terminate when the frontend vanishes before the connection is established was fixed upstream in cset 14208:0d5d7d472024. That fix doesn't cover our compatibility code; our complete fix went into FC-[56]/xen-pvfb-terminate.patch. When we rebased to 3.1.0, we replaced our complete fix with upstream's fix, and thus regressed.
Created attachment 154958 [details] Parts of xen/FC-6/xen-pvfb-terminate.patch lost in rebase
Updated patch from Markus: http://pastebin.ca/494800. I still need to test it.
The patch on pastebin is the wrong one; ignore it. The patch attached here is still up-to-date. Testing is still appreciated.
Testing is still underway (my F7 machine is dead), but just a note to say that I could also try the patch in bug 230634.