Bug 240406
| Summary: | xen-vncfb processes remain around for long-dead domains | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Richard W.M. Jones <rjones> |
| Component: | xen | Assignee: | Markus Armbruster <armbru> |
| Status: | CLOSED RAWHIDE | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | rawhide | CC: | katzj, xen-maint |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2007-06-21 15:09:02 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description

Richard W.M. Jones 2007-05-17 11:30:17 UTC

This appears to be a problem with the back-compat code. If a guest startup process gets aborted during device hotplug, the VNC process will be stuck waiting for the frontend to initialize and never quit.

To reproduce:

- Create a managed guest, e.g. 'xm new foo'
- Edit /etc/xen/scripts/vif-bridge and add 'exit 1' on the second line
- Run 'xm start foo'
- Wait for it to time out and fail
- Run 'xm destroy foo'

You should now have a vnc process stuck in xenfb_using_old_protocol, waiting on some watch event that I guess will never come.

I'm having trouble getting a correct core dump for these processes (they're running, but neither sending them SIGQUIT nor using gcore(1) works). However, here is a collection of stack traces from the hanging processes, compiled with -O0 -g.
This one is very common:
```
(gdb) bt
#0  0x000000303100a2c6 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00002aaaaacfcc53 in xs_read_watch (h=0x651a60, num=0x7fff30d35f28)
    at xs.c:587
#2  0x000000000040518f in xenfb_attach_dom (xenfb_pub=0x651880, domid=733)
    at xenfb.c:333
#3  0x0000000000403848 in main (argc=<value optimized out>,
    argv=0x7fff30d36708) at vncfb.c:335
#4  0x000000303041da54 in __libc_start_main () from /lib64/libc.so.6
#5  0x0000000000402df9 in _start ()
(gdb) frame 0
#0  0x000000303100a2c6 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
(gdb) frame 1
#1  0x00002aaaaacfcc53 in xs_read_watch (h=0x651a60, num=0x7fff30d35f28)
    at xs.c:587
587             pthread_cond_wait(&h->watch_condvar, &h->watch_mutex);
(gdb) print *h
$1 = {fd = 5, read_thr = 1084229952, read_thr_exists = 1, watch_list = {
    next = 0x651a78, prev = 0x651a78}, watch_mutex = {__data = {__lock = 0,
      __count = 0, __owner = 0, __nusers = 1, __kind = 0, __spins = 0,
      __list = {__prev = 0x0, __next = 0x0}},
    __size = '\0' <repeats 12 times>, "\001", '\0' <repeats 26 times>,
    __align = 0}, watch_condvar = {__data = {__lock = 0, __futex = 3,
      __total_seq = 2, __wakeup_seq = 1, __woken_seq = 1, __mutex = 0x651a88,
      __nwaiters = 2, __broadcast_seq = 0},
    __size =
"\000\000\000\000\003\000\000\000\002\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\210\032e\000\000\000\000\000\002\000\000\000\000\000\000",
    __align = 12884901888},
  watch_pipe = {-1, -1}, reply_list = {next = 0x651ae8, prev = 0x651ae8},
  reply_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0,
      __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
    __size = '\0' <repeats 39 times>, __align = 0}, reply_condvar = {__data = {
      __lock = 0, __futex = 74, __total_seq = 37, __wakeup_seq = 37,
      __woken_seq = 37, __mutex = 0x651af8, __nwaiters = 0,
      __broadcast_seq = 0},
    __size =
"\000\000\000\000J\000\000\000%\000\000\000\000\000\000\000%\000\000\000\000\000\000\000%\000\000\000\000\000\000\000�\032e",
    '\0' <repeats 12 times>, __align = 317827579904}, request_mutex = {__data =
{__lock = 0,
      __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0,
      __list = {__prev = 0x0, __next = 0x0}},
    __size = '\0' <repeats 39 times>, __align = 0}}
(gdb) frame 2
#2  0x000000000040518f in xenfb_attach_dom (xenfb_pub=0x651880, domid=733)
    at xenfb.c:333
333                     vec = xs_read_watch(xsh, &dummy);
(gdb) print xsh
$2 = (struct xs_handle *) 0x651a60
(gdb) print dummy
No symbol "dummy" in current context.
(gdb) frame 3
#3  0x0000000000403848 in main (argc=<value optimized out>,
    argv=0x7fff30d36708) at vncfb.c:335
335             ret = xenfb_attach_dom(xenfb, domid);
(gdb) print xenfb
$4 = (struct xenfb *) 0x651880
(gdb) print *xenfb
$5 = {pixels = 0x0, row_stride = 0, depth = 0, width = 0, height = 0,
  abs_pointer_wanted = 0, user_data = 0x0, update = 0}
(gdb) frame 4
#4  0x000000303041da54 in __libc_start_main () from /lib64/libc.so.6
```
Only seen this one three or four times:
```
#0  0x000000303100a2c6 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x0000000000408696 in rfbClientConnectionGone (cl=0x653100)
    at rfbserver.c:496
#2  0x000000000040747a in rfbScreenCleanup (screen=0x652220) at main.c:988
#3  0x0000000000403c06 in main (argc=<value optimized out>,
    argv=<value optimized out>) at vncfb.c:427
#4  0x000000303041da54 in __libc_start_main () from /lib64/libc.so.6
#5  0x0000000000402df9 in _start ()
```
Re the last backtrace in comment #2: this is almost certainly another symptom of bug 240012.

The vfb backend bug that caused it not to terminate when the frontend vanishes before the connection is established was fixed upstream in cset 14208:0d5d7d472024. That fix doesn't cover our compatibility code. Our complete fix went into FC-[56]/xen-pvfb-terminate.patch. When we rebased to 3.1.0, we replaced our complete fix with upstream's fix, and thus regressed.

Created attachment 154958 [details]
Parts of xen/FC-6/xen-pvfb-terminate.patch lost in rebase

Updated patch from Markus: http://pastebin.ca/494800 Needs me to test.

The patch on pastebin is the wrong one; ignore it. The patch attached here is still up to date. Testing still appreciated.

Testing still underway (my F7 machine is dead), but just a note to say that I could also try the patch in bug 230634.