Bug 1277781

Summary: Libvirtd segfaults when creating and destroying an fc_host pool with a short pause
Product: Red Hat Enterprise Linux 7
Reporter: Han Han <hhan>
Component: libvirt
Assignee: John Ferlan <jferlan>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium
Priority: medium
Version: 7.2
CC: dyuan, jferlan, rbalakri, xuzhang, yanyang, yisun
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Fixed In Version: libvirt-1.3.1-1.el7
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2016-11-03 18:29:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---

Description Han Han 2015-11-04 03:59:31 UTC
Description of problem:
As the summary says: libvirtd crashes with a segmentation fault when an fc_host pool is created and then destroyed after a short pause.

Version-Release number of selected component (if applicable):
libvirt-1.2.17-13.el7.x86_64
qemu-kvm-rhev-2.3.0-31.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare an fc_host pool XML, saved as test.xml:
<pool type="scsi">
<name>p1</name>
<source>
<adapter type='fc_host' wwnn='2101001b32a9da4e' wwpn='2101001b32a90001' managed='yes'/>
</source>
<target>
<path>/dev/disk/by-path</path>
</target>
</pool>

2. Run the following command:
# for i in 1 2 3; do virsh pool-create test.xml; sleep 1; virsh pool-destroy p1; sleep 1; done
Pool p1 created from test.xml

Pool p1 destroyed

Pool p1 created from test.xml

error: failed to connect to the hypervisor
error: no valid connection
error: Cannot recv data: Connection reset by peer

error: Failed to create pool from test.xml
error: operation failed: pool 'p1' already exists with uuid ec55efe5-2f6e-49c5-98c2-330d4b0b4e8f

Pool p1 destroyed

3. Check the abrt reports:
# abrt-cli list --since 1446450941
id 38617217512ee894c089bd641f937e10711b0815
reason:         libvirtd killed by SIGSEGV
time:           Mon 02 Nov 2015 04:12:42 PM CST
cmdline:        /usr/sbin/libvirtd
package:        libvirt-daemon-1.2.17-13.el7
uid:            0 (root)
Directory:      /var/spool/abrt/ccpp-2015-11-02-16:12:42-28812
Run 'abrt-cli report /var/spool/abrt/ccpp-2015-11-02-16:12:42-28812' for creating a case in Red Hat Customer Portal

id f1894e50fc0e46d0bd8fe3109b4769e5236b9a2c
reason:         libvirtd killed by SIGSEGV
time:           Mon 02 Nov 2015 03:39:26 PM CST
cmdline:        /usr/sbin/libvirtd
package:        libvirt-daemon-1.2.17-13.el7
uid:            0 (root)
count:          6
Directory:      /var/spool/abrt/ccpp-2015-11-02-15:39:26-1494
Run 'abrt-cli report /var/spool/abrt/ccpp-2015-11-02-15:39:26-1494' for creating a case in Red Hat Customer Portal

id 9155fb4aee1b0001caf3e3c6c97837fc68eb6a80
reason:         libvirtd killed by SIGSEGV
time:           Mon 02 Nov 2015 03:40:57 PM CST
cmdline:        /usr/sbin/libvirtd
package:        libvirt-daemon-1.2.17-13.el7
uid:            0 (root)
count:          3
Directory:      /var/spool/abrt/ccpp-2015-11-02-15:40:57-3654
Run 'abrt-cli report /var/spool/abrt/ccpp-2015-11-02-15:40:57-3654' for creating a case in Red Hat Customer Portal

Actual results:
As above.

Expected results:
No segmentation fault.

Additional info:
The bug is not reproduced without the sleep, or with a sleep of more than 2 seconds.
The gdb backtrace:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f81202af700 (LWP 18853)]
0x00007f812c9d5aad in malloc_consolidate () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f812c9d5aad in malloc_consolidate () from /lib64/libc.so.6
#1  0x00007f812c9d7e35 in _int_malloc () from /lib64/libc.so.6
#2  0x00007f812c9d987c in malloc () from /lib64/libc.so.6
#3  0x00007f812ca13381 in __alloc_dir () from /lib64/libc.so.6
#4  0x00007f812f6c0cda in virGetFCHostNameByWWN (sysfs_prefix=sysfs_prefix@entry=0x0, wwnn=wwnn@entry=0x7f8110006f30 "2101001b32a9da4e", 
    wwpn=wwpn@entry=0x7f8110001260 "2101001b32a90005") at util/virutil.c:2113
#5  0x00007f81160fd353 in deleteVport (conn=<optimized out>, adapter=...) at storage/storage_backend_scsi.c:841
#6  virStorageBackendSCSIStopPool (conn=0x7f81080039d0, pool=<optimized out>) at storage/storage_backend_scsi.c:959
#7  0x00007f81160ed487 in storagePoolDestroy (obj=0x7f81180d6270) at storage/storage_driver.c:976
#8  0x00007f812f779738 in virStoragePoolDestroy (pool=pool@entry=0x7f81180d6270) at libvirt-storage.c:736
#9  0x00007f81303a29cc in remoteDispatchStoragePoolDestroy (server=0x7f8132121f30, msg=0x7f81321b6f70, args=<optimized out>, rerr=0x7f81202aec30, client=0x7f81321b7f00)
    at remote_dispatch.h:14607
#10 remoteDispatchStoragePoolDestroyHelper (server=0x7f8132121f30, client=0x7f81321b7f00, msg=0x7f81321b6f70, rerr=0x7f81202aec30, args=<optimized out>, 
    ret=0x7f81180d5e10) at remote_dispatch.h:14583
#11 0x00007f812f7c2342 in virNetServerProgramDispatchCall (msg=0x7f81321b6f70, client=0x7f81321b7f00, server=0x7f8132121f30, prog=0x7f813213e180)
    at rpc/virnetserverprogram.c:437
#12 virNetServerProgramDispatch (prog=0x7f813213e180, server=server@entry=0x7f8132121f30, client=0x7f81321b7f00, msg=0x7f81321b6f70) at rpc/virnetserverprogram.c:307
#13 0x00007f812f7bd5bd in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x7f8132121f30) at rpc/virnetserver.c:135
#14 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x7f8132121f30) at rpc/virnetserver.c:156
#15 0x00007f812f6b84c5 in virThreadPoolWorker (opaque=opaque@entry=0x7f8132121a70) at util/virthreadpool.c:145
#16 0x00007f812f6b79e8 in virThreadHelper (data=<optimized out>) at util/virthread.c:206
#17 0x00007f812cd22dc5 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f812ca501cd in clone () from /lib64/libc.so.6
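
For reference, a backtrace like this can be captured by attaching gdb to the running libvirtd before kicking off the reproducer loop; a minimal sketch, assuming the matching debuginfo packages are installed:

# gdb -p "$(pidof libvirtd)"
(gdb) continue
    ... run the pool-create/pool-destroy loop from step 2 in another shell
        until the SIGSEGV is hit ...
(gdb) bt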

Comment 3 John Ferlan 2015-11-04 11:27:46 UTC
While you're correct that it's not a normal scenario, it does point out a
flaw that could happen at other times. As you note, if you wait 3 seconds
the bug doesn't happen, and if you wait 5 seconds there'd be even less
chance (if any).

Long story short, FC/NPIV/SCSI depends on udev to create the necessary
infrastructure. That happens asynchronously, so rather than "wait" for it
to finish, we create a thread to handle it, which runs once a second for
up to 5 seconds until done. See the following:

https://bugzilla.redhat.com/show_bug.cgi?id=1152382

and upstream commit message:

http://www.redhat.com/archives/libvir-list/2014-November/msg00695.html
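
As a rough illustration of that asynchronous window, the vHBA entries udev
creates can be watched appearing and disappearing while the reproducer loop
runs. This is only a sketch (the fc_host/scsi_host numbers it lists will
vary per machine):

# watch -n 1 'ls /sys/class/fc_host /sys/class/scsi_host'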

I do have a couple of patches ready to post; I just wanted to test that
they worked before sending them. I didn't want to interrupt anything you
were doing, though.

Comment 6 John Ferlan 2015-11-04 22:05:44 UTC
Patches posted upstream:

http://www.redhat.com/archives/libvir-list/2015-November/msg00139.html

Comment 7 John Ferlan 2015-11-13 14:39:19 UTC
Patches pushed.

$ git describe d3fa510a759b180a2a87b11d9ed57e437d1914e1
v1.2.21-62-gd3fa510
$

Comment 9 yisun 2016-03-16 04:36:20 UTC
Verified on libvirt-1.3.2-1.el7.x86_64 and PASSED

scenario 1, using pool name.

# cat pool.xml
<pool type="scsi">
<name>p1</name>
<source>
<adapter type='fc_host' wwnn='2101001b32a9da4e' wwpn='2101001b32a90001' managed='yes'/>
</source>
<target>
<path>/dev/disk/by-path</path>
</target>
</pool>

============ do not sleep ===============
# date +%s
1458101818

# for i in {1..100}; do virsh pool-create pool.xml; virsh pool-list | grep p1;  virsh pool-destroy p1; virsh pool-list --all | grep p1; done
Pool p1 created from pool.xml

 p1                   active     no        
Pool p1 destroyed

Pool p1 created from pool.xml

 p1                   active     no        
Pool p1 destroyed

Pool p1 created from pool.xml

 p1                   active     no        
Pool p1 destroyed

...

# abrt-cli list --since 1458101818
// no output



============= sleep for a short time (0.1 sec) =================
# date +%s
1458101534

# for i in {1..100}; do virsh pool-create pool.xml; virsh pool-list | grep p1; sleep 0.1; virsh pool-destroy p1; virsh pool-list --all | grep p1; done
Pool p1 created from pool.xml

 p1                   active     no        
Pool p1 destroyed

Pool p1 created from pool.xml

 p1                   active     no        
Pool p1 destroyed

Pool p1 created from pool.xml

 p1                   active     no        
Pool p1 destroyed

Pool p1 created from pool.xml

 p1                   active     no        
Pool p1 destroyed
...


# abrt-cli list --since 1458101534
// no output


============= sleep for a longer time (3 sec) =================

# date +%s
1458101908

# for i in {1..100}; do virsh pool-create pool.xml; virsh pool-list | grep p1; sleep 3; virsh pool-destroy p1; virsh pool-list --all | grep p1; done
Pool p1 created from pool.xml

 p1                   active     no        
Pool p1 destroyed

Pool p1 created from pool.xml

 p1                   active     no        
Pool p1 destroyed

Pool p1 created from pool.xml

 p1                   active     no        
Pool p1 destroyed

Pool p1 created from pool.xml

 p1                   active     no        
Pool p1 destroyed

Pool p1 created from pool.xml

 p1                   active     no        
Pool p1 destroyed

...


# abrt-cli list --since 1458101908
// no output



=======================

scenario 2, using pool uuid (testing just the no-sleep case is enough)

# cat pool_uuid.xml 
<pool type='scsi'>
  <name>p1</name>
  <uuid>823de2fd-2e24-4eea-a1ca-888888888888</uuid>
  <source>
    <adapter type='fc_host' managed='yes' wwnn='2101001b32a9da4e' wwpn='2101001b32a90001'/>
  </source>
  <target>
    <path>/dev/disk/by-path</path>
  </target>
</pool>


# date +%s
1458102689

# for i in {1..100}; do virsh pool-create pool_uuid.xml; virsh pool-list | grep p1;  virsh pool-destroy 823de2fd-2e24-4eea-a1ca-888888888888; virsh pool-list --all | grep p1; done
Pool p1 created from pool_uuid.xml

 p1                   active     no        
Pool 823de2fd-2e24-4eea-a1ca-888888888888 destroyed

Pool p1 created from pool_uuid.xml

 p1                   active     no        
Pool 823de2fd-2e24-4eea-a1ca-888888888888 destroyed

Pool p1 created from pool_uuid.xml

 p1                   active     no        
Pool 823de2fd-2e24-4eea-a1ca-888888888888 destroyed

Pool p1 created from pool_uuid.xml

 p1                   active     no        
Pool 823de2fd-2e24-4eea-a1ca-888888888888 destroyed



# abrt-cli list --since 1458102689
// no output

Comment 11 errata-xmlrpc 2016-11-03 18:29:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2577.html