Bug 1039991

Summary: vdsm-tool segfaults during vdsmd start
Product: Red Hat Enterprise Linux 7 Reporter: Jiri Denemark <jdenemar>
Component: libvirtAssignee: Jiri Denemark <jdenemar>
Status: CLOSED CURRENTRELEASE QA Contact: Virtualization Bugs <virt-bugs>
Severity: unspecified Docs Contact:
Priority: urgent    
Version: 7.0CC: acathrow, asegurap, bazulay, berrange, bili, clalancette, dallan, danken, didi, dyuan, iheim, jdenemar, jforbes, laine, libvirt-maint, mgoldboi, mzhan, sbonazzo, shyu, smizrahi, veillard, ybronhei, yeylon, zhwang
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: infra
Fixed In Version: libvirt-1.1.1-15.el7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1034312 Environment:
Last Closed: 2014-06-13 13:29:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Jiri Denemark 2013-12-10 13:23:44 UTC
+++ This bug was initially created as a clone of Bug #1034312 +++

Description of problem:

During hosted-engine --deploy, '/bin/systemctl start vdsmd.service' fails.

vdsm-4.13.0-196.gitb871832.fc19.x86_64
libvirt-1.0.5.7-2.fc19.x86_64

Reproduced by command line:

 # /usr/libexec/vdsm/vdsmd_init_common.sh --pre-start
 vdsm: Running mkdirs
 vdsm: Running configure_coredump
 vdsm: Running run_init_hooks
 vdsm: Running gencerts
 vdsm: Running check_is_configured
 libvirt is already configured for vdsm
 sanlock service is already configured
 vdsm: Running validate_configuration
 SUCCESS: ssl configured to true. No conflicts
 vdsm: Running prepare_transient_repository
 vdsm: Running syslog_available
 vdsm: Running nwfilter
 /usr/libexec/vdsm/vdsmd_init_common.sh: line 70:  8708 Errore di segmentazione "${VDSM_TOOL}" nwfilter
 vdsm: failed to execute nwfilter, error code 139

dmesg:
[ 2099.578442] python[11612]: segfault at 340b38aa40 ip 000000340b38aa40 sp 00007fb5c0e6bf38 error 15 in libpython2.7.so.1.0[340b375000+3e000]

gdb:

 [Thread debugging using libthread_db enabled]
 Using host libthread_db library "/lib64/libthread_db.so.1".
 Core was generated by `/usr/bin/python /usr/bin/vdsm-tool nwfilter'.
 Program terminated with signal 11, Segmentation fault.
 #0  0x0000000000000007 in ?? ()
 (gdb) py-bt
 #13 Frame 0x7f72680010b0, for file /usr/lib64/python2.7/site-packages/libvirt.py, line 299, in virEventRunDefaultImpl ()
    ret = libvirtmod.virEventRunDefaultImpl()

(gdb) bt
 #0  0x0000000000000007 in ?? ()
 #1  0x000000312480c2ce in _sasl_log (conn=<optimized out>, level=5, fmt=0x7f7267398218 "DIGEST-MD5 client mech dispose") at common.c:1985
 #2  0x00007f7267391d38 in digestmd5_client_mech_dispose (conn_context=0x143c5b0, utils=0x10ddec0) at digestmd5.c:4580
 #3  0x00000031248084c4 in client_dispose (pconn=0x10dcac0) at client.c:332
 #4  0x000000312480b4fb in sasl_dispose (pconn=0xe63338) at common.c:851
 #5  0x0000003411885a3b in virObjectUnref (anyobj=<optimized out>) at util/virobject.c:264
 #6  0x0000003411970908 in virNetSocketDispose (obj=0xe5dc50) at rpc/virnetsocket.c:1026
 #7  0x0000003411885a3b in virObjectUnref (anyobj=<optimized out>) at util/virobject.c:264
 #8  0x000000341186d182 in virEventPollCleanupHandles () at util/vireventpoll.c:582
 #9  0x000000341186e3cf in virEventPollRunOnce () at util/vireventpoll.c:651
 #10 0x000000341186cd0d in virEventRunDefaultImpl () at util/virevent.c:273
 #11 0x00007f726ce6cd06 in libvirt_virEventRunDefaultImpl (self=<optimized out>, args=<optimized out>) at libvirt.c:3076
 #12 0x000000340b0ddcee in call_function (oparg=<optimized out>, pp_stack=0x7f726ca02390) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4098
 #13 PyEval_EvalFrameEx (f=f@entry=Frame 0x7f72680010b0, for file /usr/lib64/python2.7/site-packages/libvirt.py, line 299, in virEventRunDefaultImpl (), throwflag=throwflag@entry=0)
    at /usr/src/debug/Python-2.7.5/Python/ceval.c:2740

(gdb) t 2
 [Switching to thread 2 (Thread 0x7f72771b5740 (LWP 3295))]
 #0  _PyType_Lookup (type=type@entry=0x340b3875c0 <PyInt_Type>, name=name@entry='__class__') at /usr/src/debug/Python-2.7.5/Objects/typeobject.c:2518
2518	            dict = ((PyTypeObject *)base)->tp_dict;
 (gdb) py-bt
 #10 Frame 0x14b5e30, for file /usr/lib64/python2.7/site-packages/vdsm/libvirtconnection.py, line 78, in wrapper (args=(), kwargs={}, ret=0)
    if isinstance(ret, libvirt.virDomain):
 #14 Frame 0x13f73e0, for file /usr/lib64/python2.7/site-packages/vdsm/tool/nwfilter.py, line 35, in main (conn=<virConnect(restore=<function at remote 0x140a8c0>, nodeDeviceLookupSCSIHostByWWN=<function at remote 0x140a0c8>, listAllSecrets=<function at remote 0x1406398>, saveImageGetXMLDesc=<function at remote 0x140aa28>, virConnGetLastError=<function at remote 0x1433230>, listDomainsID=<function at remote 0x14066e0>, storagePoolLookupByUUIDString=<function at remote 0x140af50>, getCPUMap=<function at remote 0x13b56e0>, setKeepAlive=<function at remote 0x140ac80>, pingLibvirt=<instancemethod at remote 0xf60a00>, changeBegin=<function at remote 0x1432ed8>, dispatchDomainEventBlockPullCallback=<function at remote 0x13b5320>, _o=None, saveImageDefineXML=<function at remote 0x140a9b0>, listDefinedDomains=<function at remote 0x1406488>, getCapabilities=<function at remote 0x13b57d0>, getFreeMemory=<function at remote 0x13b58c0>, numOfDomains=<function at remote 0x140a398>, networkCreateXML=<function at remote 0x1406c8...(truncated)
    conn.close()
 #19 Frame 0xe62d90, for file /usr/bin/vdsm-tool, line 142, in main (opts=[], args=['nwfilter'], cmd='nwfilter')
    return tool_command[cmd]["command"](*args[1:])
 #22 Frame 0xe3d530, for file /usr/bin/vdsm-tool, line 145, in <module> ()
    sys.exit(main())

 bt
 #0  _PyType_Lookup (type=type@entry=0x340b3875c0 <PyInt_Type>, name=name@entry='__class__') at /usr/src/debug/Python-2.7.5/Objects/typeobject.c:2518
 #1  0x000000340b0847a5 in _PyObject_GenericGetAttrWithDict (obj=0, name='__class__', dict=0x0) at /usr/src/debug/Python-2.7.5/Objects/object.c:1389
 #2  0x000000340b046a2e in recursive_isinstance (inst=0, cls=<type at remote 0x1363850>) at /usr/src/debug/Python-2.7.5/Objects/abstract.c:2892
 #3  0x000000340b04af55 in _PyObject_RealIsInstance (inst=<optimized out>, cls=<optimized out>) at /usr/src/debug/Python-2.7.5/Objects/abstract.c:3060
 #4  0x000000340b09caf2 in type___instancecheck__ (type=<optimized out>, inst=<optimized out>) at /usr/src/debug/Python-2.7.5/Objects/typeobject.c:595
 #5  0x000000340b049dd3 in PyObject_Call (func=func@entry=<built-in method __instancecheck__ of type object at remote 0x1363850>, arg=arg@entry=(0,), kw=kw@entry=0x0)
    at /usr/src/debug/Python-2.7.5/Objects/abstract.c:2529
 #6  0x000000340b04a65f in PyObject_CallFunctionObjArgs (callable=callable@entry=<built-in method __instancecheck__ of type object at remote 0x1363850>)
    at /usr/src/debug/Python-2.7.5/Objects/abstract.c:2760
 #7  0x000000340b04ac5f in PyObject_IsInstance (inst=0, cls=<type at remote 0x1363850>) at /usr/src/debug/Python-2.7.5/Objects/abstract.c:2963
 #8  0x000000340b0d50c9 in builtin_isinstance (self=<optimized out>, args=<optimized out>) at /usr/src/debug/Python-2.7.5/Python/bltinmodule.c:2452
 #9  0x000000340b0ddcee in call_function (oparg=<optimized out>, pp_stack=0x7fffaed4b120) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4098
 #10 PyEval_EvalFrameEx (f=f@entry=Frame 0x14b5e30, for file /usr/lib64/python2.7/site-packages/vdsm/libvirtconnection.py, line 78, in wrapper (args=(), kwargs={}, ret=0), 
    throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:2740

--- Additional comment from Jiri Denemark on 2013-12-02 13:13:49 UTC ---

The backtrace looks very similar to the following commit (v1.2.0-rc1-3-g13fdc6d):

commit 13fdc6d63ef64f8e231a087e1dab7d90145c3c10
Author: Christophe Fergeau <cfergeau>
Date:   Fri Nov 22 17:27:21 2013 +0100

    Tie SASL callbacks lifecycle to virNetSessionSASLContext
    
    The array of sasl_callback_t callbacks which is passed to sasl_client_new()
    must be kept alive as long as the created sasl_conn_t object is alive as
    cyrus-sasl uses this structure internally for things like logging, so
    the memory used for callbacks must only be freed after sasl_dispose() has
    been called.
    
    During testing of successful SASL logins with
    virsh -c qemu+tls:///system list --all
    I've been getting invalid read reports from valgrind
    
    ==9237== Invalid read of size 8
    ==9237==    at 0x6E93B6F: _sasl_getcallback (common.c:1745)
    ==9237==    by 0x6E95430: _sasl_log (common.c:1850)
    ==9237==    by 0x16593D87: digestmd5_client_mech_dispose (digestmd5.c:4580)
    ==9237==    by 0x6E91653: client_dispose (client.c:332)
    ==9237==    by 0x6E9476A: sasl_dispose (common.c:851)
    ==9237==    by 0x4E225A1: virNetSASLSessionDispose (virnetsaslcontext.c:678)
    ==9237==    by 0x4CBC551: virObjectUnref (virobject.c:262)
    ==9237==    by 0x4E254D1: virNetSocketDispose (virnetsocket.c:1042)
    ==9237==    by 0x4CBC551: virObjectUnref (virobject.c:262)
    ==9237==    by 0x4E2701C: virNetSocketEventFree (virnetsocket.c:1794)
    ==9237==    by 0x4C965D3: virEventPollCleanupHandles (vireventpoll.c:583)
    ==9237==    by 0x4C96987: virEventPollRunOnce (vireventpoll.c:652)
    ==9237==    by 0x4C94730: virEventRunDefaultImpl (virevent.c:274)
    ==9237==    by 0x12C7BA: vshEventLoop (virsh.c:2407)
    ==9237==    by 0x4CD3D04: virThreadHelper (virthreadpthread.c:161)
    ==9237==    by 0x7DAEF32: start_thread (pthread_create.c:309)
    ==9237==    by 0x8C86EAC: clone (clone.S:111)
    ==9237==  Address 0xe2d61b0 is 0 bytes inside a block of size 168 free'd
    ==9237==    at 0x4A07577: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==9237==    by 0x4C73827: virFree (viralloc.c:580)
    ==9237==    by 0x4DE4BC7: remoteAuthSASL (remote_driver.c:4219)
    ==9237==    by 0x4DE33D0: remoteAuthenticate (remote_driver.c:3639)
    ==9237==    by 0x4DDBFAA: doRemoteOpen (remote_driver.c:832)
    ==9237==    by 0x4DDC8DC: remoteConnectOpen (remote_driver.c:1031)
    ==9237==    by 0x4D8595F: do_open (libvirt.c:1239)
    ==9237==    by 0x4D863F3: virConnectOpenAuth (libvirt.c:1481)
    ==9237==    by 0x12762B: vshReconnect (virsh.c:337)
    ==9237==    by 0x12C9B0: vshInit (virsh.c:2470)
    ==9237==    by 0x12E9A5: main (virsh.c:3338)
    
    This commit changes virNetSASLSessionNewClient() to take ownership of the SASL
    callbacks. Then we can free them in virNetSASLSessionDispose() after the corresponding
    sasl_conn_t has been freed.

--- Additional comment from Sandro Bonazzola on 2013-12-02 14:18:13 UTC ---

(In reply to Yaniv Bronhaim from comment #9)
> Dan, this relates to nwfilter which uses libvirt. This looks like the
> segfault happens internally in libvirt code. doesn't it?
> 
> Sandro, reproducible or just happened?

reproducible quite often especially just after the boot.

Comment 3 zhenfeng wang 2013-12-16 11:35:57 UTC
Hi jiri
Since there was a dependency issue for vdsm installing in rhel7, so I installed it omit the dependent paket ,however, after i finished installing the vdsm packet, i didn't find the vdsmd_init_common.sh  script in /usr/libexec/vdsm/ , so i can reproduce this bug as the comment 0 description, also I'm not sure which steps i missed, can you give me some advise? thanks

pkg info
libvirt-1.1.1-15.el7.x86_64
qemu-kvm-rhev-1.5.3-21.el7.x86_64
kernel-3.10.0-61.el7.x86_64
qemu-img-1.5.3-21.el7.x86_64

steps
1. download the following four packet from the brew
# ll
-rw-r--r--. 1 root root 652140 Dec 16 18:56 vdsm-4.10.3-17.el7.x86_64.rpm
-rw-r--r--. 1 root root  58184 Dec 16 18:56 vdsm-cli-4.10.3-17.el7.noarch.rpm
-rw-r--r--. 1 root root  77032 Dec 16 18:56 vdsm-python-4.10.3-17.el7.x86_64.rpm
-rw-r--r--. 1 root root  26924 Dec 16 18:56 vdsm-xmlrpc-4.10.3-17.el7.noarch.rpm

2.install the vdsm with the rpm command(I have install the qemu-img packet and other dependent packet except the qemu-kvm-img) , we can see it depends on the qemu-kvm-img packet
# rpm -ivh vdsm-*
error: Failed dependencies:
	qemu-kvm-img is needed by vdsm-4.10.3-17.el7.x86_64

# rpm -ivh vdsm-* --nodeps --force 
Preparing...                          ################################# [100%]
Updating / installing...
   1:vdsm-python-4.10.3-17.el7        ################################# [ 25%]
   2:vdsm-xmlrpc-4.10.3-17.el7        ################################# [ 50%]
   3:vdsm-4.10.3-17.el7               ################################# [ 75%]
   4:vdsm-cli-4.10.3-17.el7           ################################# [100%]

3.After installing, check the /usr/libexec/vdsm/ , find there wasn't vdsmd_init_common.sh script over there, so i can't reproduce this bug as the comment0 description
# ll /usr/libexec/vdsm/
total 56
drwxr-xr-x. 28 root root  4096 Dec 16 19:08 hooks
-rwxr-xr-x.  1 root root   692 Jun 11  2013 ovirt_functions.sh
-rwxr-xr-x.  1 root root   475 Jun 11  2013 persist-vdsm-hooks
-rwxr-xr-x.  1 root root 24176 Jun 11  2013 safelease
-rwxr-xr-x.  1 root root  6812 Jun 11  2013 spmprotect.sh
-rwxr-xr-x.  1 root root  1954 Jun 11  2013 spmstop.sh
-rwxr-xr-x.  1 root root   696 Jun 11  2013 unpersist-vdsm-hook
-rwxr-xr-x.  1 root root  2424 Jun 11  2013 vdsm-gencerts.sh

Comment 4 zhenfeng wang 2013-12-17 08:45:33 UTC
hi jiri
I just re-check the patch, then do some testing about it ,can you help me have a look aboat my steps, were they enough to reproduce this bug? thanks

pkg info
libvirt-1.1.1-14.el7.x86_64
kernel-3.10.0-60.el7.x86_64
qemu-kvm-rhev-1.5.3-21.el7.x86_64

steps
1.Configure the tls environment
2.Connect to the hypervisor running on host using TLS without SASL via ipv4
after connected, i could get some invalid read reports from valgrind

# valgrind -v --leak-check=full virsh -c qemu+tls://zhwang71/system list --all
Please enter your authentication name: redhat
Please enter your password: 
--15980-- REDIR: 0x80526c0 (__stpcpy_sse2_unaligned) redirected to 0x4a0b260 (stpcpy)
--15980-- REDIR: 0x8053ed0 (__strcat_sse2_unaligned) redirected to 0x4a08fa0 (strcat)
 Id    Name                           State
----------------------------------------------------
 -     rhel                           shut off
 -     rhel6                          shut off
 -     rhel7                          shut off
 -     rhel7com                       shut off
 -     rhel7qcow2                     shut off
 -     rheltest2                      shut off
 -     test                           shut off
 -     win7                           shut off

==15980== Thread 2:
==15980== Invalid read of size 8
==15980==    at 0x6D4F95F: _sasl_getcallback (in /usr/lib64/libsasl2.so.3.0.0)
==15980==    by 0x6D511B0: _sasl_log (in /usr/lib64/libsasl2.so.3.0.0)
==15980==    by 0x14D9CD37: ??? (in /usr/lib64/sasl2/libdigestmd5.so.3.0.0)
==15980==    by 0x6D4D4C3: ??? (in /usr/lib64/libsasl2.so.3.0.0)
==15980==    by 0x6D504FA: sasl_dispose (in /usr/lib64/libsasl2.so.3.0.0)
==15980==    by 0x4CA892A: virObjectUnref (virobject.c:262)
==15980==    by 0x4D99513: virNetSocketDispose (virnetsocket.c:1024)
==15980==    by 0x4CA892A: virObjectUnref (virobject.c:262)
==15980==    by 0x4C8C8B9: virEventPollCleanupHandles (vireventpoll.c:580)
==15980==    by 0x4C8DB27: virEventPollRunOnce (vireventpoll.c:649)
==15980==    by 0x4C8C44C: virEventRunDefaultImpl (virevent.c:273)
==15980==    by 0x129374: vshEventLoop (virsh.c:2290)
==15980==  Address 0xd94c800 is 0 bytes inside a block of size 168 free'd
==15980==    at 0x4A07577: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==15980==    by 0x4C753D9: virFree (viralloc.c:566)
==15980==    by 0x4D816F0: doRemoteOpen (remote_driver.c:4206)
==15980==    by 0x4D81F75: remoteConnectOpen (remote_driver.c:1027)
==15980==    by 0x4D27FCC: do_open (libvirt.c:1239)
==15980==    by 0x4D2A838: virConnectOpenAuth (libvirt.c:1481)
==15980==    by 0x12BB99: vshReconnect (virsh.c:336)
==15980==    by 0x126DBA: main (virsh.c:2353)
==15980== 
--
==16522== HEAP SUMMARY:
==16522==     in use at exit: 571,462 bytes in 2,303 blocks
==16522==   total heap usage: 18,606 allocs, 16,303 frees, 5,733,185 bytes allocated
==16522== 
==16522== Searching for pointers to 2,303 not-freed blocks
==16522== Checked 1,598,904 bytes
==16522== 
==16522== LEAK SUMMARY:
==16522==    definitely lost: 0 bytes in 0 blocks
==16522==    indirectly lost: 0 bytes in 0 blocks
==16522==      possibly lost: 0 bytes in 0 blocks
==16522==    still reachable: 571,462 bytes in 2,303 blocks
==16522==         suppressed: 0 bytes in 0 blocks

you can reference my attachment for more info

3.After i update the libvirt to the libvirt-1.1.1-15, the invalid read reports from valgrind in step 2 gone.

Comment 5 Jiri Denemark 2013-12-17 10:37:04 UTC
Yes, I believe that's the ideal way of verifying this bug. Good job.

Comment 6 zhenfeng wang 2013-12-17 11:05:06 UTC
hi jiri
thanks for you quick response, however, i found other invalid read report from valgrind while i update the libvirt to the libvirt-1.1.1-15, not sure whether they are harmfull or harmful or harmless, sorry to miss it in the comment4. thanks

# valgrind -v --leak-check=full virsh -c qemu+tls://zhwang71/system list --all
Please enter your authentication name: redhat
Please enter your password: 

 =3545== Invalid read of size 1
==3545==    at 0x8004EBB: vfprintf (in /usr/lib64/libc-2.17.so)
==3545==    by 0x80C9414: __vasprintf_chk (in /usr/lib64/libc-2.17.so)
==3545==    by 0x4CB5075: virVasprintfInternal (stdio2.h:210)
==3545==    by 0x4C9A745: virLogVMessage (virlog.c:842)
==3545==    by 0x4C9AC26: virLogMessage (virlog.c:778)
==3545==    by 0x4D98BC4: virNetSASLSessionClientStep (virnetsaslcontext.c:460)
==3545==    by 0x4D819D5: doRemoteOpen (remote_driver.c:4131)
==3545==    by 0x4D81FE5: remoteConnectOpen (remote_driver.c:1027)
==3545==    by 0x4D2802C: do_open (libvirt.c:1239)
==3545==    by 0x4D2A898: virConnectOpenAuth (libvirt.c:1481)
==3545==    by 0x12BB99: vshReconnect (virsh.c:336)
==3545==    by 0x126DBA: main (virsh.c:2353)
==3545==  Address 0xd9a184e is 0 bytes after a block of size 126 alloc'd
==3545==    at 0x4A081D4: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==3545==    by 0x80E45A4: xdr_array (in /usr/lib64/libc-2.17.so)
==3545==    by 0x4D86251: xdr_remote_auth_sasl_start_ret (remote_protocol.c:3134)
==3545==    by 0x4D9729F: virNetMessageDecodePayload (virnetmessage.c:405)
==3545==    by 0x4D8DC6B: virNetClientProgramCall (virnetclientprogram.c:377)
==3545==    by 0x4D69391: callFull.isra.2 (remote_driver.c:5727)
==3545==    by 0x4D8190A: doRemoteOpen (remote_driver.c:5749)
==3545==    by 0x4D81FE5: remoteConnectOpen (remote_driver.c:1027)
==3545==    by 0x4D2802C: do_open (libvirt.c:1239)
==3545==    by 0x4D2A898: virConnectOpenAuth (libvirt.c:1481)
==3545==    by 0x12BB99: vshReconnect (virsh.c:336)
==3545==    by 0x126DBA: main (virsh.c:2353)

Comment 7 Jiri Denemark 2013-12-17 11:17:03 UTC
Invalid reads are rarely harmless but we don't try to fix all of them in this bug. Please, file a new bug for it.

Comment 8 zhenfeng wang 2013-12-17 11:23:04 UTC
According to comment7, mark this bug verified

Comment 9 Ludek Smid 2014-06-13 13:29:04 UTC
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.