Bug 624895
| Summary: | libvirt-0.8.1-26.el6 dies during basic operations (such as connecting to localhost QEMU) from virt-manager | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 6 | Reporter: | xhu |
| Component: | libvirt | Assignee: | Daniel Veillard <veillard> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.0 | CC: | berrange, borgan, eblake, eli, llim, syeghiay, xen-maint, yoyzhang |
| Target Milestone: | rc | Keywords: | TestBlocker |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | libvirt-0.8.1-27.el6 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2010-11-10 21:05:42 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (xhu, 2010-08-18 02:47:54 UTC)
I could reproduce this easily. One needs to run virt-manager to talk to libvirtd; at connection time virt-manager goes through a number of libvirt calls to set up the environment. A number succeed, but an apparently trivial one (NumOfDevices) leads to a crash:

```
10:05:34.775: warning : lxcStartup:1884 : Unable to create cgroup for driver: No such device or address
Detaching after fork from child process 26722.
Detaching after fork from child process 26723.
Detaching after fork from child process 26724.
Detaching after fork from child process 26725.
10:05:40.501: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '46' as XDR
10:05:40.506: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '84' as XDR
10:05:40.510: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '84' as XDR
10:05:40.511: error : virStoragePoolObjIsDuplicate:1689 : operation failed: pool 'default' already exists with uuid 5a4b94b3-c55a-ba32-72be-0679d919d2e3
libvir: Storage error : operation failed: pool 'default' already exists with uuid 5a4b94b3-c55a-ba32-72be-0679d919d2e3
10:05:40.513: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '84' as XDR
10:05:40.514: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '84' as XDR
10:05:40.524: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '128' as XDR
10:05:40.527: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '128' as XDR
```

Stepping through the dispatch in GDB:

```
[Switching to Thread 0x7fffeffff710 (LWP 26671)]

Breakpoint 3, remoteDispatchNodeNumOfDevices (server=0x707640, client=0x7ffff0001270,
    conn=0x7fffe0000960, hdr=0x7ffff0081430, rerr=0x7fffefffeb80, args=0x7fffefffec10,
    ret=0x7fffefffec10) at remote.c:4853
4853    {
(gdb) n
4854        CHECK_CONN(client);
(gdb)
4857                                           args->cap ? *args->cap : NULL,
(gdb)
4856        ret->num = virNodeNumOfDevices (conn,
(gdb)
4859        if (ret->num == -1) {
(gdb)
4856        ret->num = virNodeNumOfDevices (conn,
(gdb)
4859        if (ret->num == -1) {
(gdb)
4865    }
(gdb) p ret->num
$20 = 69
(gdb) p ret
$21 = (remote_node_num_of_devices_ret *) 0x7fffefffec10
(gdb) p *ret
$22 = {num = 69}
(gdb) n
remoteDispatchClientCall (server=0x707640, client=0x7ffff0001270, msg=0x7ffff0041420)
    at dispatch.c:510
510         virMutexLock(&server->lock);
1: *data = {fn = 0x421830 <remoteDispatchNodeNumOfDevices>, args_filter = 0x42bd30 <xdr_remote_node_num_of_devices_args>, ret_filter = 0x42a8c0 <xdr_remote_node_num_of_devices_ret>}
(gdb) n
511         virMutexLock(&client->lock);
1: *data = {fn = 0x421830 <remoteDispatchNodeNumOfDevices>, args_filter = 0x42bd30 <xdr_remote_node_num_of_devices_args>, ret_filter = 0x42a8c0 <xdr_remote_node_num_of_devices_ret>}
(gdb) n
512         virMutexUnlock(&server->lock);
1: *data = {fn = 0x421830 <remoteDispatchNodeNumOfDevices>, args_filter = 0x42bd30 <xdr_remote_node_num_of_devices_args>, ret_filter = 0x42a8c0 <xdr_remote_node_num_of_devices_ret>}
(gdb) n
514         xdr_free (data->args_filter, (char*)&args);
1: *data = {fn = 0x421830 <remoteDispatchNodeNumOfDevices>, args_filter = 0x42bd30 <xdr_remote_node_num_of_devices_args>, ret_filter = 0x42a8c0 <xdr_remote_node_num_of_devices_ret>}
(gdb) p *data
$23 = {fn = 0x421830 <remoteDispatchNodeNumOfDevices>, args_filter = 0x42bd30 <xdr_remote_node_num_of_devices_args>, ret_filter = 0x42a8c0 <xdr_remote_node_num_of_devices_ret>}
(gdb) p *data->args_filter
$24 = {bool_t (XDR *, void *, ...)} 0x42bd30 <xdr_remote_node_num_of_devices_args>
(gdb) n

Program received signal SIGSEGV, Segmentation fault.
0x00000034b4910933 in xdr_string_internal () from /lib64/libc.so.6
(gdb) up
#1  0x000000000042c0be in xdr_remote_nonnull_string (xdrs=<value optimized out>,
    objp=<value optimized out>) at ../src/remote/remote_protocol.c:24
24          if (!xdr_string (xdrs, objp, REMOTE_STRING_MAX))
(gdb)
#2  0x00000034b4911fc0 in xdr_reference_internal () from /lib64/libc.so.6
(gdb)
#3  0x00000034b4911f81 in xdr_pointer () from /lib64/libc.so.6
(gdb)
#4  0x000000000042bc93 in xdr_remote_string (xdrs=<value optimized out>,
    objp=<value optimized out>) at ../src/remote/remote_protocol.c:33
33          if (!xdr_pointer (xdrs, (char **)objp, sizeof (remote_nonnull_string), (xdrproc_t) xdr_remote_nonnull_string))
(gdb)
#5  0x000000000042bd49 in xdr_remote_node_num_of_devices_args (xdrs=0x7fffefffeb10,
    objp=0x7fffefffec10) at ../src/remote/remote_protocol.c:2554
2554        if (!xdr_remote_string (xdrs, &objp->cap))
(gdb)
#6  0x00000034b4910185 in xdr_free () from /lib64/libc.so.6
(gdb) up
#7  0x000000000042944f in remoteDispatchClientCall (server=0x707640, client=0x7ffff0001270,
    msg=0x7ffff0041420) at dispatch.c:514
514         xdr_free (data->args_filter, (char*)&args);
```

A priori it seems the patch leads to this, even though the xdr_free() in remoteDispatchClientCall is not the one changed by the patch, as far as I can tell:

    libvirt-0.8.1-improve-error-messages-when-RPC-reply-cannot-be-sent.patch

There are also a number of cases where libvirtd emits the new "failed to serialize" message. I'm continuing investigations.

Daniel

Valgrind output on a rebuild without the last 6 patches:

```
[root@test ~]# valgrind /usr/sbin/libvirtd
==7496== Memcheck, a memory error detector
==7496== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==7496== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==7496== Command: /usr/sbin/libvirtd
==7496==
==7496== Conditional jump or move depends on uninitialised value(s)
==7496==    at 0x434518: virUUIDParse (in /usr/sbin/libvirtd)
==7496==    by 0x434935: virSetHostUUIDStr (in /usr/sbin/libvirtd)
==7496==    by 0x41C979: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x3B9C81EC5C: (below main) (in /lib64/libc-2.12.so)
==7496==
19:09:52.431: error : virRunWithHook:933 : internal error '/sbin/iptables --table filter --delete INPUT --in-interface virbr0 --protocol udp --destination-port 69 --jump ACCEPT' exited with non-zero status 1 and signal 0: iptables: Bad rule (does a matching rule exist in that chain?).
19:10:21.050: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '46' as XDR
19:10:21.112: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '84' as XDR
19:10:21.132: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '84' as XDR
19:10:21.155: error : virStoragePoolObjIsDuplicate:1689 : operation failed: pool 'default' already exists with uuid 109ec413-e3dc-7da8-1494-aa8be9efd0b9
libvir: Storage error : operation failed: pool 'default' already exists with uuid 109ec413-e3dc-7da8-1494-aa8be9efd0b9
19:10:21.198: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '84' as XDR
19:10:21.202: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '84' as XDR
19:10:21.544: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '128' as XDR
19:10:21.578: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '128' as XDR
19:10:21.614: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '128' as XDR
==7496== Thread 3:
==7496== Invalid read of size 8
==7496==    at 0x3B9C910933: xdr_string (in /lib64/libc-2.12.so)
==7496==    by 0x42C0BD: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x3B9C911FBF: xdr_reference (in /lib64/libc-2.12.so)
==7496==    by 0x3B9C911F80: xdr_pointer (in /lib64/libc-2.12.so)
==7496==    by 0x42BC92: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x42BD48: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x3B9C910184: xdr_free (in /lib64/libc-2.12.so)
==7496==    by 0x42944E: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x429882: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x418FB7: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x3B9D0077E0: start_thread (in /lib64/libpthread-2.12.so)
==7496==    by 0x3B9C8E151C: clone (in /lib64/libc-2.12.so)
==7496==  Address 0x38 is not stack'd, malloc'd or (recently) free'd
==7496==
==7496==
==7496== Process terminating with default action of signal 11 (SIGSEGV)
==7496==  Access not within mapped region at address 0x38
==7496==    at 0x3B9C910933: xdr_string (in /lib64/libc-2.12.so)
==7496==    by 0x42C0BD: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x3B9C911FBF: xdr_reference (in /lib64/libc-2.12.so)
==7496==    by 0x3B9C911F80: xdr_pointer (in /lib64/libc-2.12.so)
==7496==    by 0x42BC92: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x42BD48: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x3B9C910184: xdr_free (in /lib64/libc-2.12.so)
==7496==    by 0x42944E: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x429882: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x418FB7: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x3B9D0077E0: start_thread (in /lib64/libpthread-2.12.so)
==7496==    by 0x3B9C8E151C: clone (in /lib64/libc-2.12.so)
==7496==  If you believe this happened as a result of a stack
==7496==  overflow in your program's main thread (unlikely but
==7496==  possible), you can try to increase the size of the
==7496==  main thread stack using the --main-stacksize= flag.
==7496==  The main thread stack size used in this run was 10485760.
==7496==
==7496== HEAP SUMMARY:
==7496==     in use at exit: 2,398,899 bytes in 18,054 blocks
==7496==   total heap usage: 615,290 allocs, 597,236 frees, 410,306,343 bytes allocated
==7496==
==7496== LEAK SUMMARY:
==7496==    definitely lost: 111 bytes in 13 blocks
==7496==    indirectly lost: 50 bytes in 3 blocks
==7496==      possibly lost: 526,791 bytes in 5,833 blocks
==7496==    still reachable: 1,871,947 bytes in 12,205 blocks
==7496==         suppressed: 0 bytes in 0 blocks
==7496== Rerun with --leak-check=full to see details of leaked memory
==7496==
==7496== For counts of detected and suppressed errors, rerun with: -v
==7496== Use --track-origins=yes to see where uninitialised values come from
==7496== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 29 from 9)
Killed
[root@test ~]#
```

Actually comment #5 was a valgrind of the brew -26.el6 official build. Valgrinding a locally rebuilt version does not show the problem:

```
[root@test bugged]# !valgrind
valgrind /usr/sbin/libvirtd
==24931== Memcheck, a memory error detector
==24931== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==24931== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==24931== Command: /usr/sbin/libvirtd
==24931==
==24931== Conditional jump or move depends on uninitialised value(s)
==24931==    at 0x434528: virUUIDParse (uuid.c:181)
==24931==    by 0x434945: virSetHostUUIDStr (uuid.c:290)
==24931==    by 0x41C979: main (libvirtd.c:2852)
==24931==
20:10:35.732: error : virRunWithHook:933 : internal error '/sbin/iptables --table filter --delete INPUT --in-interface virbr0 --protocol udp --destination-port 69 --jump ACCEPT' exited with non-zero status 1 and signal 0: iptables: Bad rule (does a matching rule exist in that chain?).
```
```
20:11:05.583: error : qemudDomainGetVcpus:6325 : Requested operation is not valid: cannot list vcpu pinning for an inactive domain
libvir: QEMU error : Requested operation is not valid: cannot list vcpu pinning for an inactive domain
20:11:05.699: error : qemudDomainGetVcpus:6325 : Requested operation is not valid: cannot list vcpu pinning for an inactive domain
libvir: QEMU error : Requested operation is not valid: cannot list vcpu pinning for an inactive domain
20:11:05.708: error : qemudDomainGetVcpus:6325 : Requested operation is not valid: cannot list vcpu pinning for an inactive domain
libvir: QEMU error : Requested operation is not valid: cannot list vcpu pinning for an inactive domain
^C20:11:39.144: warning : qemudDispatchSignalEvent:392 : Shutting down on signal 2
==24931==
==24931== HEAP SUMMARY:
==24931==     in use at exit: 396,472 bytes in 5,824 blocks
==24931==   total heap usage: 827,958 allocs, 822,134 frees, 634,198,965 bytes allocated
==24931==
==24931== LEAK SUMMARY:
==24931==    definitely lost: 361 bytes in 18 blocks
==24931==    indirectly lost: 50 bytes in 3 blocks
==24931==      possibly lost: 349 bytes in 18 blocks
==24931==    still reachable: 395,712 bytes in 5,785 blocks
==24931==         suppressed: 0 bytes in 0 blocks
==24931== Rerun with --leak-check=full to see details of leaked memory
==24931==
==24931== For counts of detected and suppressed errors, rerun with: -v
==24931== Use --track-origins=yes to see where uninitialised values come from
==24931== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 29 from 9)
[root@test bugged]# rpm -qif /usr/sbin/libvirtd
Name        : libvirt                  Relocations: (not relocatable)
Version     : 0.8.1                    Vendor: (none)
Release     : 26.el6                   Build Date: Wed 18 Aug 2010 08:08:42 PM CEST
Install Date: Wed 18 Aug 2010 08:09:31 PM CEST    Build Host: test
Group       : Development/Libraries    Source RPM: libvirt-0.8.1-26.el6.src.rpm
```

This really does seem to be only the brew build exhibiting the behaviour.

Daniel

After some debugging and Dan Berrange's findings, this seems likely to be due to a broken parallel make, where remote_dispatch_args.h isn't finished generating when remote.c is being compiled, leading to an incomplete structure being allocated at run time and severe issues following.

As an immediate fix I removed the %{?_smp_mflags} from the spec file and regenerated libvirt-0.8.1-27.el6. This should be a safe temporary solution to the problem; the real fix, correcting the Makefiles, will be done upstream and for 6.1.

Daniel

The problem function was remoteDispatchClientCall. It declares two variables:

```
dispatch_args args;
dispatch_ret ret;
```

GDB shows that either of these occasionally had length 0 in certain builds:

```
(gdb) print sizeof(args)
$1 = 80
(gdb) print sizeof(ret)
$2 = 0
```

This appears to be because remote_dispatch_{args,ret}.h were empty, which in turn was caused by a parallel-makefile dependency bug.

*** Bug 602160 has been marked as a duplicate of this bug. ***

I have verified it with RHEL6.0-20100818.0, including libvirt-0.8.1-27.el6, kernel-2.6.32-66.el6 and qemu-kvm-0.12.1.2-2.112.el6, on x86_64 and i386.

For x86_64, the test steps are as follows:

1. Start the libvirtd service:

   ```
   # service libvirtd start
   Starting libvirtd daemon:                [  OK  ]
   ```

2. Start virt-manager, connect to localhost (QEMU), and start a guest successfully.

For i386, the test steps are as follows:

1. Start the libvirtd service:

   ```
   # service libvirtd start
   Starting libvirtd daemon:                [  OK  ]
   ```

2. Execute `virsh nodeinfo` and `virsh list --all` 5 times; libvirt is OK.

Red Hat Enterprise Linux 6.0 is now available and should resolve the problem described in this bug report. This report is therefore being closed with a resolution of CURRENTRELEASE. You may reopen this bug report if the solution does not work for you.
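The Makefile-level fix mentioned for upstream and 6.1 amounts to declaring the generated headers as real prerequisites so `make -jN` cannot compile an includer before the generator finishes. A sketch in automake terms, with the generator rule invented for illustration (the actual libvirt rules may differ):

```make
# Hypothetical automake fragment. BUILT_SOURCES forces the generated
# headers to be built before any object file, even under parallel make.
BUILT_SOURCES = remote_dispatch_args.h remote_dispatch_ret.h

# Generator rule (script name assumed, not the real libvirt one):
remote_dispatch_args.h remote_dispatch_ret.h: remote_protocol.x
	perl gen_dispatch.pl remote_protocol.x
```

The temporary spec-file workaround (dropping `%{?_smp_mflags}`) achieves the same ordering by serializing the whole build instead.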