Bug 624895

Summary:          libvirtd from libvirt-0.8.1-26.el6 dies when you perform basic operations (such as connecting to localhost QEMU) through virt-manager
Product:          Red Hat Enterprise Linux 6
Component:        libvirt
Version:          6.0
Reporter:         xhu
Assignee:         Daniel Veillard <veillard>
QA Contact:       Virtualization Bugs <virt-bugs>
Status:           CLOSED CURRENTRELEASE
Severity:         urgent
Priority:         urgent
Keywords:         TestBlocker
Target Milestone: rc
Hardware:         All
OS:               Linux
CC:               berrange, borgan, eblake, eli, llim, syeghiay, xen-maint, yoyzhang
Fixed In Version: libvirt-0.8.1-27.el6
Doc Type:         Bug Fix
Last Closed:      2010-11-10 21:05:42 UTC

Description xhu 2010-08-18 02:47:54 UTC
Description of problem:
libvirtd from libvirt-0.8.1-26.el6 dies when you perform basic operations (such as connecting to localhost QEMU) through virt-manager.

Version-Release number of selected component (if applicable):
kernel-2.6.32-63.el6.x86_64
libvirt-0.8.1-25.el6.x86_64
qemu-kvm-0.12.1.2-2.111.el6.x86_64
virt-manager-0.8.4-8.el6.noarch

How reproducible:
Every time

Steps to Reproduce:
1. Start the libvirtd service:
# service libvirtd start
Starting libvirtd daemon:                                  [  OK  ]
2. Start virt-manager and connect to localhost (QEMU)
  
Actual results:
After step 2, libvirtd dies. When the libvirtd service is then started again, the following error occurs:
# service libvirtd start
Starting libvirtd daemon: error: Unable to obtain pidfile. Check /var/log/messages or run without --daemon for more info.
                                                           [FAILED]


Expected results:
After step 2, virt-manager should connect to localhost (QEMU) successfully, with no error.

Additional info:

Comment 4 Daniel Veillard 2010-08-18 08:33:02 UTC
I could reproduce this easily ... one needs to run virt-manager to talk to libvirtd; on connection, virt-manager goes through a number of libvirt
calls to set up the environment. A number of them succeed, but an apparently trivial
one, NodeNumOfDevices, leads to a crash:

10:05:34.775: warning : lxcStartup:1884 : Unable to create cgroup for driver: No such device or address
Detaching after fork from child process 26722.
Detaching after fork from child process 26723.
Detaching after fork from child process 26724.
Detaching after fork from child process 26725.
10:05:40.501: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '46' as XDR
10:05:40.506: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '84' as XDR
10:05:40.510: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '84' as XDR
10:05:40.511: error : virStoragePoolObjIsDuplicate:1689 : operation failed: pool 'default' already exists with uuid 5a4b94b3-c55a-ba32-72be-0679d919d2e3
libvir: Storage error : operation failed: pool 'default' already exists with uuid 5a4b94b3-c55a-ba32-72be-0679d919d2e3
10:05:40.513: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '84' as XDR
10:05:40.514: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '84' as XDR
10:05:40.524: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '128' as XDR
10:05:40.527: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '128' as XDR
[Switching to Thread 0x7fffeffff710 (LWP 26671)]

Breakpoint 3, remoteDispatchNodeNumOfDevices (server=0x707640, 
    client=0x7ffff0001270, conn=0x7fffe0000960, hdr=0x7ffff0081430, 
    rerr=0x7fffefffeb80, args=0x7fffefffec10, ret=0x7fffefffec10)
    at remote.c:4853
4853	{
(gdb) n
4854	    CHECK_CONN(client);
(gdb) 
4857	                                    args->cap ? *args->cap : NULL,
(gdb) 
4856	    ret->num = virNodeNumOfDevices (conn,
(gdb) 
4859	    if (ret->num == -1) {
(gdb) 
4856	    ret->num = virNodeNumOfDevices (conn,
(gdb) 
4859	    if (ret->num == -1) {
(gdb) 
4865	}
(gdb) p ret->num
$20 = 69
(gdb) p ret
$21 = (remote_node_num_of_devices_ret *) 0x7fffefffec10
(gdb) p *ret
$22 = {num = 69}
(gdb) n
remoteDispatchClientCall (server=0x707640, client=0x7ffff0001270, 
    msg=0x7ffff0041420) at dispatch.c:510
510	    virMutexLock(&server->lock);
1: *data = {fn = 0x421830 <remoteDispatchNodeNumOfDevices>, 
  args_filter = 0x42bd30 <xdr_remote_node_num_of_devices_args>, 
  ret_filter = 0x42a8c0 <xdr_remote_node_num_of_devices_ret>}
(gdb) n
511	    virMutexLock(&client->lock);
1: *data = {fn = 0x421830 <remoteDispatchNodeNumOfDevices>, 
  args_filter = 0x42bd30 <xdr_remote_node_num_of_devices_args>, 
  ret_filter = 0x42a8c0 <xdr_remote_node_num_of_devices_ret>}
(gdb) n
512	    virMutexUnlock(&server->lock);
1: *data = {fn = 0x421830 <remoteDispatchNodeNumOfDevices>, 
  args_filter = 0x42bd30 <xdr_remote_node_num_of_devices_args>, 
  ret_filter = 0x42a8c0 <xdr_remote_node_num_of_devices_ret>}
(gdb) n
514	    xdr_free (data->args_filter, (char*)&args);
1: *data = {fn = 0x421830 <remoteDispatchNodeNumOfDevices>, 
  args_filter = 0x42bd30 <xdr_remote_node_num_of_devices_args>, 
  ret_filter = 0x42a8c0 <xdr_remote_node_num_of_devices_ret>}
(gdb) p *data
$23 = {fn = 0x421830 <remoteDispatchNodeNumOfDevices>, 
  args_filter = 0x42bd30 <xdr_remote_node_num_of_devices_args>, 
  ret_filter = 0x42a8c0 <xdr_remote_node_num_of_devices_ret>}
(gdb) p *data->args_filter
$24 = {bool_t (XDR *, void *, 
    ...)} 0x42bd30 <xdr_remote_node_num_of_devices_args>
(gdb) n

Program received signal SIGSEGV, Segmentation fault.
0x00000034b4910933 in xdr_string_internal () from /lib64/libc.so.6
(gdb) up
#1  0x000000000042c0be in xdr_remote_nonnull_string (
    xdrs=<value optimized out>, objp=<value optimized out>)
    at ../src/remote/remote_protocol.c:24
24	         if (!xdr_string (xdrs, objp, REMOTE_STRING_MAX))
(gdb) 
#2  0x00000034b4911fc0 in xdr_reference_internal () from /lib64/libc.so.6
(gdb) 
#3  0x00000034b4911f81 in xdr_pointer () from /lib64/libc.so.6
(gdb) 
#4  0x000000000042bc93 in xdr_remote_string (xdrs=<value optimized out>, 
    objp=<value optimized out>) at ../src/remote/remote_protocol.c:33
33	         if (!xdr_pointer (xdrs, (char **)objp, sizeof (remote_nonnull_string), (xdrproc_t) xdr_remote_nonnull_string))
(gdb) 
#5  0x000000000042bd49 in xdr_remote_node_num_of_devices_args (
    xdrs=0x7fffefffeb10, objp=0x7fffefffec10)
    at ../src/remote/remote_protocol.c:2554
2554	         if (!xdr_remote_string (xdrs, &objp->cap))
(gdb) 
#6  0x00000034b4910185 in xdr_free () from /lib64/libc.so.6
(gdb) up
#7  0x000000000042944f in remoteDispatchClientCall (server=0x707640, 
    client=0x7ffff0001270, msg=0x7ffff0041420) at dispatch.c:514
514	    xdr_free (data->args_filter, (char*)&args);
(gdb) 

A priori it seems this patch leads to the crash, even though the xdr_free() in
remoteDispatchClientCall is not the one changed by the patch, as far as
I can tell:
libvirt-0.8.1-improve-error-messages-when-RPC-reply-cannot-be-sent.patch

There are also a number of cases where libvirtd emits the new "Failed to
serialize" warning.

I'm continuing the investigation.

Daniel

Comment 5 Daniel Veillard 2010-08-18 09:21:52 UTC
Valgrind output on a rebuild without the last 6 patches ... 

[root@test ~]# valgrind /usr/sbin/libvirtd 
==7496== Memcheck, a memory error detector
==7496== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==7496== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==7496== Command: /usr/sbin/libvirtd
==7496== 
==7496== Conditional jump or move depends on uninitialised value(s)
==7496==    at 0x434518: virUUIDParse (in /usr/sbin/libvirtd)
==7496==    by 0x434935: virSetHostUUIDStr (in /usr/sbin/libvirtd)
==7496==    by 0x41C979: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x3B9C81EC5C: (below main) (in /lib64/libc-2.12.so)
==7496== 
19:09:52.431: error : virRunWithHook:933 : internal error '/sbin/iptables --table filter --delete INPUT --in-interface virbr0 --protocol udp --destination-port 69 --jump ACCEPT' exited with non-zero status 1 and signal 0: iptables: Bad rule (does a matching rule exist in that chain?).

19:10:21.050: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '46' as XDR
19:10:21.112: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '84' as XDR
19:10:21.132: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '84' as XDR
19:10:21.155: error : virStoragePoolObjIsDuplicate:1689 : operation failed: pool 'default' already exists with uuid 109ec413-e3dc-7da8-1494-aa8be9efd0b9
libvir: Storage error : operation failed: pool 'default' already exists with uuid 109ec413-e3dc-7da8-1494-aa8be9efd0b9
19:10:21.198: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '84' as XDR
19:10:21.202: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '84' as XDR
19:10:21.544: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '128' as XDR
19:10:21.578: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '128' as XDR
19:10:21.614: warning : remoteDispatchClientCall:588 : Failed to serialize reply for program '536903814' proc '128' as XDR
==7496== Thread 3:
==7496== Invalid read of size 8
==7496==    at 0x3B9C910933: xdr_string (in /lib64/libc-2.12.so)
==7496==    by 0x42C0BD: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x3B9C911FBF: xdr_reference (in /lib64/libc-2.12.so)
==7496==    by 0x3B9C911F80: xdr_pointer (in /lib64/libc-2.12.so)
==7496==    by 0x42BC92: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x42BD48: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x3B9C910184: xdr_free (in /lib64/libc-2.12.so)
==7496==    by 0x42944E: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x429882: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x418FB7: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x3B9D0077E0: start_thread (in /lib64/libpthread-2.12.so)
==7496==    by 0x3B9C8E151C: clone (in /lib64/libc-2.12.so)
==7496==  Address 0x38 is not stack'd, malloc'd or (recently) free'd
==7496== 
==7496== 
==7496== Process terminating with default action of signal 11 (SIGSEGV)
==7496==  Access not within mapped region at address 0x38
==7496==    at 0x3B9C910933: xdr_string (in /lib64/libc-2.12.so)
==7496==    by 0x42C0BD: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x3B9C911FBF: xdr_reference (in /lib64/libc-2.12.so)
==7496==    by 0x3B9C911F80: xdr_pointer (in /lib64/libc-2.12.so)
==7496==    by 0x42BC92: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x42BD48: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x3B9C910184: xdr_free (in /lib64/libc-2.12.so)
==7496==    by 0x42944E: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x429882: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x418FB7: ??? (in /usr/sbin/libvirtd)
==7496==    by 0x3B9D0077E0: start_thread (in /lib64/libpthread-2.12.so)
==7496==    by 0x3B9C8E151C: clone (in /lib64/libc-2.12.so)
==7496==  If you believe this happened as a result of a stack
==7496==  overflow in your program's main thread (unlikely but
==7496==  possible), you can try to increase the size of the
==7496==  main thread stack using the --main-stacksize= flag.
==7496==  The main thread stack size used in this run was 10485760.
==7496== 
==7496== HEAP SUMMARY:
==7496==     in use at exit: 2,398,899 bytes in 18,054 blocks
==7496==   total heap usage: 615,290 allocs, 597,236 frees, 410,306,343 bytes allocated
==7496== 
==7496== LEAK SUMMARY:
==7496==    definitely lost: 111 bytes in 13 blocks
==7496==    indirectly lost: 50 bytes in 3 blocks
==7496==      possibly lost: 526,791 bytes in 5,833 blocks
==7496==    still reachable: 1,871,947 bytes in 12,205 blocks
==7496==         suppressed: 0 bytes in 0 blocks
==7496== Rerun with --leak-check=full to see details of leaked memory
==7496== 
==7496== For counts of detected and suppressed errors, rerun with: -v
==7496== Use --track-origins=yes to see where uninitialised values come from
==7496== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 29 from 9)
Killed
[root@test ~]#

Comment 6 Daniel Veillard 2010-08-18 10:14:49 UTC
Actually, comment #5 was a valgrind run of the official brew -26.el6 build.
Valgrinding a locally rebuilt version does not show the problem:

[root@test bugged]# !valgrind
valgrind /usr/sbin/libvirtd 
==24931== Memcheck, a memory error detector
==24931== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==24931== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==24931== Command: /usr/sbin/libvirtd
==24931== 
==24931== Conditional jump or move depends on uninitialised value(s)
==24931==    at 0x434528: virUUIDParse (uuid.c:181)
==24931==    by 0x434945: virSetHostUUIDStr (uuid.c:290)
==24931==    by 0x41C979: main (libvirtd.c:2852)
==24931== 
20:10:35.732: error : virRunWithHook:933 : internal error '/sbin/iptables --table filter --delete INPUT --in-interface virbr0 --protocol udp --destination-port 69 --jump ACCEPT' exited with non-zero status 1 and signal 0: iptables: Bad rule (does a matching rule exist in that chain?).

20:11:05.583: error : qemudDomainGetVcpus:6325 : Requested operation is not valid: cannot list vcpu pinning for an inactive domain
libvir: QEMU error : Requested operation is not valid: cannot list vcpu pinning for an inactive domain
20:11:05.699: error : qemudDomainGetVcpus:6325 : Requested operation is not valid: cannot list vcpu pinning for an inactive domain
libvir: QEMU error : Requested operation is not valid: cannot list vcpu pinning for an inactive domain
20:11:05.708: error : qemudDomainGetVcpus:6325 : Requested operation is not valid: cannot list vcpu pinning for an inactive domain
libvir: QEMU error : Requested operation is not valid: cannot list vcpu pinning for an inactive domain
^C20:11:39.144: warning : qemudDispatchSignalEvent:392 : Shutting down on signal 2
==24931== 
==24931== HEAP SUMMARY:
==24931==     in use at exit: 396,472 bytes in 5,824 blocks
==24931==   total heap usage: 827,958 allocs, 822,134 frees, 634,198,965 bytes allocated
==24931== 
==24931== LEAK SUMMARY:
==24931==    definitely lost: 361 bytes in 18 blocks
==24931==    indirectly lost: 50 bytes in 3 blocks
==24931==      possibly lost: 349 bytes in 18 blocks
==24931==    still reachable: 395,712 bytes in 5,785 blocks
==24931==         suppressed: 0 bytes in 0 blocks
==24931== Rerun with --leak-check=full to see details of leaked memory
==24931== 
==24931== For counts of detected and suppressed errors, rerun with: -v
==24931== Use --track-origins=yes to see where uninitialised values come from
==24931== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 29 from 9)
[root@test bugged]# rpm -qif /usr/sbin/libvirtd
Name        : libvirt                      Relocations: (not relocatable)
Version     : 0.8.1                             Vendor: (none)
Release     : 26.el6                        Build Date: Wed 18 Aug 2010 08:08:42 PM CEST
Install Date: Wed 18 Aug 2010 08:09:31 PM CEST      Build Host: test
Group       : Development/Libraries         Source RPM: libvirt-0.8.1-26.el6.src.rpm

  So it really does seem that only the brew build exhibits this behaviour.

Daniel

Comment 7 Daniel Veillard 2010-08-18 12:26:47 UTC
After some debugging, and thanks to Dan Berrange's findings, this seems likely to be
due to a broken parallel make, where remote_dispatch_args.h hasn't
finished generating by the time remote.c is compiled, leading to
an incomplete structure being allocated at run time, with severe
issues following.
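
For context, the dispatch code builds these argument/return types by including the generated headers directly inside union definitions (a rough sketch; the exact layout is assumed), which is why an empty or half-written header compiles cleanly instead of breaking the build:

    /* If remote_dispatch_args.h exists but is still empty when this is
     * compiled, the union silently collapses to size 0 (a GCC extension)
     * and every stack variable of this type is too small for the XDR
     * filters that later walk it. */
    typedef union {
    #include "remote_dispatch_args.h"
    } dispatch_args;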
As an immediate fix I removed %{?_smp_mflags} from the spec file
and regenerated libvirt-0.8.1-27.el6.

This should be a safe temporary solution to the problem; the
real fix, correcting the Makefiles, will be done upstream
and for 6.1.

Daniel

Comment 9 Daniel Berrangé 2010-08-18 12:52:57 UTC
The problematic function was remoteDispatchClientCall, which declares two variables:

    dispatch_args args;
    dispatch_ret ret;

GDB shows that, in certain builds, one or the other of these occasionally had zero size:

(gdb) print sizeof(args)
$1 = 80
(gdb) print sizeof(ret)
$2 = 0


This appears to be because remote_dispatch_{args,ret}.h were empty, which in turn was caused by a parallel-make dependency bug.
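
A minimal standalone illustration of that failure mode (hypothetical and gcc-specific; the empty union body stands in for the empty generated header):

    #include <stdio.h>

    /* On a racy build, remote_dispatch_ret.h is empty, so the union that
     * includes it has no members. GCC accepts this as an extension and
     * gives the type size 0, so `ret` occupies no stack space; in the
     * real daemon, the dispatch function's write to ret->num then
     * clobbers adjacent stack data, and xdr_free() later walks garbage. */
    typedef union {
        /* contents of remote_dispatch_ret.h would go here */
    } dispatch_ret;

    int main(void)
    {
        dispatch_ret ret;
        printf("sizeof(ret) = %zu\n", sizeof(ret));  /* prints 0 under gcc */
        return 0;
    }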

Comment 10 Daniel Berrangé 2010-08-18 12:53:28 UTC
*** Bug 602160 has been marked as a duplicate of this bug. ***

Comment 12 xhu 2010-08-20 03:41:48 UTC
I have verified this with RHEL6.0-20100818.0, including libvirt-0.8.1-27.el6, kernel-2.6.32-66.el6 and qemu-kvm-0.12.1.2-2.112.el6, on x86_64 and i386.
For x86_64:
The test steps are as follows:
1. Start the libvirtd service:
# service libvirtd start
Starting libvirtd daemon:                                  [  OK  ]
2. Start virt-manager, connect to localhost (QEMU), and start a guest; all steps succeed.
For i386:
The test steps are as follows:
1. Start the libvirtd service:
# service libvirtd start
Starting libvirtd daemon:                                  [  OK  ]
2. Execute `virsh nodeinfo` and `virsh list --all` five times each; libvirt stays OK. (A minimal client-side check is sketched below.)
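
For completeness, a minimal C client exercising the RPC that crashed the -26 daemon (NodeNumOfDevices, per comment 4) makes a handy extra check. This is a sketch, assuming libvirt-devel is installed; compile with `gcc -o nodedevs nodedevs.c -lvirt` (file name hypothetical):

    #include <stdio.h>
    #include <libvirt/libvirt.h>

    int main(void)
    {
        /* Connect to the local QEMU driver, as virt-manager does. */
        virConnectPtr conn = virConnectOpen("qemu:///system");
        if (conn == NULL) {
            fprintf(stderr, "failed to connect to qemu:///system\n");
            return 1;
        }

        /* The call behind the crashing RPC: on the broken build this
         * killed libvirtd; on -27 it should simply return a count. */
        int num = virNodeNumOfDevices(conn, NULL, 0);
        printf("virNodeNumOfDevices: %d\n", num);

        virConnectClose(conn);
        return num < 0 ? 1 : 0;
    }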

Comment 13 releng-rhel@redhat.com 2010-11-10 21:05:42 UTC
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.