Bug 918798

Summary: qemu segmentation fault on hotplug/hotunplug when the vm is paused
Product: Red Hat Enterprise Linux 6
Component: qemu-kvm
Version: 6.4
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Reporter: Dafna Ron <dron>
Assignee: Markus Armbruster <armbru>
QA Contact: Haim <hateya>
Docs Contact:
CC: abaron, acathrow, bsarathy, fsimonce, iheim, jkt, juzhang, kwolf, lpeer, mkenneth, Rhev-m-bugs, sluo, stefanha, virt-maint, yeylon
Target Milestone: rc
Target Release: ---
Whiteboard: storage
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-08-09 10:50:08 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  logs
  logs

Description Dafna Ron 2013-03-06 21:29:08 UTC
Created attachment 706270 [details]
logs

Description of problem:

I created a vm with 2 disks, detached one, installed an OS on the vm, and after the install finished I moved its disk.
During the disk move I attached the second disk, then deactivated and activated it again, and the vm crashed.
Looking at the engine log, it seems the engine is sending a destroy to the vm.

Version-Release number of selected component (if applicable):

sf9
vdsm-4.10.2-10.0.el6ev.x86_64
libvirt-0.10.2-18.el6_4.eblake.2.x86_64

How reproducible:

50%

Steps to Reproduce:
1. create a vm with two disks and detach the second disk
2. install RHEL -> after install finishes -> move the disk
3. after the live storage migration snapshot is created -> attach the second disk
4. deactivate the second disk -> activate the second disk
  
Actual results:

The vm appears to crash, but looking at the engine log it seems the engine sent a destroy vm, which also fails the live storage migration.

Expected results:

1. either block attaching a disk to a vm that is live-migrating a disk
2. or fix the crash

Additional info: logs


2013-03-06 11:00:06,749 INFO  [org.ovirt.engine.core.bll.HotPlugDiskToVmCommand] (pool-3-thread-41) [491b1d94] Running command: HotPlugDiskToVmCommand internal: false. Entities affected :  ID: 53c25ad5-184e-49fe-b3d9-09b46db4d4df Type: VM


2013-03-06 11:00:21,801 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (QuartzScheduler_Worker-93) [2348d228] START, DestroyVDSCommand(HostName = gold-vdsc, HostId = 2982e993-2ca5-42bb-86ed-8db10986c47e, vmId=53c25ad5-184e-49fe-b3d9-09b46db4d4df, force=false, secondsToWait=0, gracefully=false), log id: 23289ee1
2013-03-06 11:00:21,826 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (QuartzScheduler_Worker-93) [2348d228] FINISH, DestroyVDSCommand, log id: 23289ee1
2013-03-06 11:00:21,837 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-93) [2348d228] vm LALA running in db and not running in vds - add to rerun treatment. vds gold-vdsc


2013-03-06 11:06:22,741 INFO  [org.ovirt.engine.core.bll.EntityAsyncTask] (QuartzScheduler_Worker-9) [36703988] EntityAsyncTask::EndAction: Ending action for 1 tasks (entity ID: 01acfc7f-2dec-4e2c-a950-73c53c2f4e11): calling EndAction for action type LiveMigrateDisk.
2013-03-06 11:06:22,742 INFO  [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-3-thread-41) [36703988] EntityAsyncTask::EndCommandAction [within thread] context: Attempting to EndAction LiveMigrateDisk, executionIndex: 1
2013-03-06 11:06:22,753 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-3-thread-41) [61534099] Ending command with failure: org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand
2013-03-06 11:06:22,753 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-3-thread-41) [61534099] Reverting task deleteImage, handler: org.ovirt.engine.core.bll.lsm.CreateImagePlaceholderTaskHandler
2013-03-06 11:06:22,757 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-3-thread-41) [61534099] START, DeleteImageGroupVDSCommand( storagePoolId = 696adcda-73f3-46b3-ae74-a188c459dfcc, ignoreFailoverLimit = false, compatabilityVersion = null, storageDomainId = 964a332f-64e4-43cf-9c04-d581a21f0ad2, imageGroupId = 5f286c92-6d1d-4827-a9c0-5a29284a96ed, postZeros = false, forceDelete = false), log id: 33dc8b6e


vdsm: 

libvirtEventLoop::ERROR::2013-03-06 10:46:13,494::guestIF::243::vm.Vm::(desktopLock) vmId=`53c25ad5-184e-49fe-b3d9-09b46db4d4df`::desktopLock failed
Traceback (most recent call last):
  File "/usr/share/vdsm/guestIF.py", line 241, in desktopLock
    self._forward("lock-screen")
  File "/usr/share/vdsm/guestIF.py", line 135, in _forward
    self._sock.send(message)
  File "/usr/lib64/python2.6/socket.py", line 167, in _dummy
    raise error(EBADF, 'Bad file descriptor')
error: [Errno 9] Bad file descriptor
libvirtEventLoop::DEBUG::2013-03-06 10:46:13,495::libvirtconnection::80::vm.Vm::(__eventCallback) vmId=`53c25ad5-184e-49fe-b3d9-09b46db4d4df`::graphics event phase 2 localAddr {'node': '10.35.115.21', 'service': '', 'family': 0} remoteAddr {'node': '10.35.201.83', 'service': '', 'family': 0} authScheme spice subject []
libvirtEventLoop::DEBUG::2013-03-06 10:46:13,496::guestIF::240::vm.Vm::(desktopLock) vmId=`53c25ad5-184e-49fe-b3d9-09b46db4d4df`::desktopLock called
libvirtEventLoop::ERROR::2013-03-06 10:46:13,496::guestIF::243::vm.Vm::(desktopLock) vmId=`53c25ad5-184e-49fe-b3d9-09b46db4d4df`::desktopLock failed
Traceback (most recent call last):
  File "/usr/share/vdsm/guestIF.py", line 241, in desktopLock
    self._forward("lock-screen")
  File "/usr/share/vdsm/guestIF.py", line 135, in _forward
    self._sock.send(message)
  File "/usr/lib64/python2.6/socket.py", line 167, in _dummy
    raise error(EBADF, 'Bad file descriptor')
error: [Errno 9] Bad file descriptor


21a891f3-fc35-4d1e-afa1-e5ca8b1b7791::ERROR::2013-03-06 11:00:19,880::task::833::TaskManager.Task::(_setError) Task=`e377557f-a05f-4dfd-9abf-824989b4fa66`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 840, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/vm.py", line 907, in __afterVolumeExtension
    apparentSize, trueSize = self.__verifyVolumeExtension(volInfo)
  File "/usr/share/vdsm/vm.py", line 857, in __verifyVolumeExtension
    self.__refreshDriveVolume(volInfo)
  File "/usr/share/vdsm/vm.py", line 848, in __refreshDriveVolume
    self.cont(guestCpuLocked=True)
  File "/usr/share/vdsm/vm.py", line 948, in cont
    self.updateGuestCpuRunning()
  File "/usr/share/vdsm/libvirtvm.py", line 1370, in updateGuestCpuRunning
    self._guestCpuRunning = (self._dom.info()[0] ==
  File "/usr/share/vdsm/libvirtvm.py", line 529, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 104, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1855, in info
    if ret is None: raise libvirtError ('virDomainGetInfo() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid '53c25ad5-184e-49fe-b3d9-09b46db4d4df'
21a891f3-fc35-4d1e-afa1-e5ca8b1b7791::DEBUG::2013-03-06 11:00:19,885::task::852::TaskManager.Task::(_run) Task=`e377557f-a05f-4dfd-9abf-824989b4fa66`::Task._run: e377557f-a05f-4dfd-9abf-824989b4fa66 ({'newSize': 3072, 'domainID': '52e4917b-10bc-4b03-9b82-a2e15732c15b', 'name': u'vda', 'volumeID': '0d842363-d484-4fd5-983b-db443e1ec0cc', 'imageID': '5f286c92-6d1d-4827-a9c0-5a29284a96ed', 'poolID': '696adcda-73f3-46b3-ae74-a188c459dfcc'},) {} failed - stopping task
libvirtEventLoop::ERROR::2013-03-06 11:00:19,881::libvirtconnection::91::vds::(__eventCallback) Error running VM callback
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 67, in __eventCallback
    v._onLibvirtLifecycleEvent(event, detail, None)
  File "/usr/share/vdsm/libvirtvm.py", line 3033, in _onLibvirtLifecycleEvent
    hooks.after_vm_cont(self._dom.XMLDesc(0), self.conf)
  File "/usr/share/vdsm/libvirtvm.py", line 529, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 104, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 381, in XMLDesc
    if ret is None: raise libvirtError ('virDomainGetXMLDesc() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid '53c25ad5-184e-49fe-b3d9-09b46db4d4df'

Comment 1 Liron Aravot 2013-03-10 10:14:27 UTC
Dafna, the vm is reported as Down by vdsm after the connection to the qemu process was lost, and then the engine sends the destroy.

libvirtEventLoop::DEBUG::2013-03-06 11:00:20,028::vm::1047::vm.Vm::(setDownStatus) vmId=`53c25ad5-184e-49fe-b3d9-09b46db4d4df`::Changed state to Down: Lost connection with qemu process

Thread-46827::DEBUG::2013-03-06 11:00:21,342::BindingXMLRPC::913::vds::(wrapper) client [10.35.161.49]::call vmGetStats with ('53c25ad5-184e-49fe-b3d9-09b46db4d4df',) {}
Thread-46827::DEBUG::2013-03-06 11:00:21,342::BindingXMLRPC::920::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Down', 'hash': '-6607417337149127009', 'exitMessage': 'Lost connection with qemu process', 'vmId': '53c25ad5-184e-49fe-b3d9-09b46db4d4df', 'timeOffset': '-43200', 'exitCode': 1}]}

Thread-46828::DEBUG::2013-03-06 11:00:21,370::libvirtvm::2588::vm.Vm::(destroy) vmId=`53c25ad5-184e-49fe-b3d9-09b46db4d4df`::destroy Called


Can you elaborate on whether there's an issue here?

Comment 2 Dafna Ron 2013-03-12 08:53:00 UTC
There was no issue - reproduction was as written in the steps.
Please talk to the vdsm, kvm, or libvirt people if you think this issue is not related to the engine component.

Comment 3 Liron Aravot 2013-03-12 09:15:35 UTC
Dafna, please attach all the other relevant logs (libvirt/qemu process) so it's possible to see what happened to the vm and why it went down. If you can, please also update the header of this bug, as the sent destroy is not a bug.

Comment 5 Dafna Ron 2013-03-12 11:07:06 UTC
I just got the same issue when live migrating a preallocated disk twice.
Attaching the logs from it.

Comment 6 Dafna Ron 2013-03-12 11:07:50 UTC
Created attachment 708912 [details]
logs

Comment 7 Liron Aravot 2013-03-12 12:02:06 UTC
Federico, please take a look - vdsm reports that the vm is down.

Comment 8 Federico Simoncelli 2013-03-12 15:20:08 UTC
Yes, the qemu process crashed for some reason. There's no trace of the reason on Dafna's host, but I have been able to reproduce the issue by looping a pause/hot(un)plug/resume:

while true; do
  virsh suspend vm1
  virsh attach-disk vm1 /tmp/hotplug-disk.qcow vdb
  sleep 5
  virsh resume vm1
  sleep 5
  virsh suspend vm1
  virsh detach-disk vm1 /tmp/hotplug-disk.qcow vdb
  sleep 5
  virsh resume vm1
  sleep 5
done

After a while the qemu process crashed with:

Program received signal SIGSEGV, Segmentation fault.
qemu_bh_delete (bh=0x0) at async.c:119
119	    bh->scheduled = 0;

#0  qemu_bh_delete (bh=0x0) at async.c:119
#1  0x00007f0917b3fe80 in virtio_blk_dma_restart_bh (opaque=0x7f0919136c00) at /usr/src/debug/qemu-kvm-1.2.0/hw/virtio-blk.c:440
#2  0x00007f0917971026 in qemu_bh_poll () at async.c:71
#3  0x00007f0917a8e7f3 in main_loop_wait (nonblocking=<optimized out>) at main-loop.c:506
#4  0x00007f091796ad63 in main_loop () at /usr/src/debug/qemu-kvm-1.2.0/vl.c:1643
#5  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /usr/src/debug/qemu-kvm-1.2.0/vl.c:3792

# rpm -q qemu libvirt
qemu-1.2.2-6.fc18.x86_64
libvirt-0.10.2.3-1.fc18.x86_64

We don't have any real proof that it's exactly the same issue, even though I feel there's a high chance that it is.
Qemu probably needs to fix this anyway.

Comment 9 Ayal Baron 2013-03-12 20:25:00 UTC
(In reply to comment #8)
> [...]
> We don't have any real proof that it's exactly the same issue even though I
> feel that there's an high chance that it is.
> Qemu probably needs to fix this anyway.

So why didn't we move the bug to qemu?

Comment 10 juzhang 2013-04-01 09:06:50 UTC

Comment 12 Sibiao Luo 2013-04-12 02:44:12 UTC
I tried the scenario from comment #8 more than 10 times, but did not hit this issue.

host info:
kernel-2.6.32-369.el6.x86_64
qemu-kvm-0.12.1.2-2.360.el6.x86_64
guest info:
kernel-2.6.32-369.el6.x86_64

Steps:
1. boot a guest with '-global PIIX4_PM.disable_s3=0'.
e.g:# /usr/libexec/qemu-kvm -S -M rhel6.4.0 -cpu host -enable-kvm -m 2048 -smp 2,sockets=2,cores=1,threads=1 -no-kvm-pit-reinjection -name sluo-test -uuid b18fdd6c-a213-4022-9ca4-5d07225e40b0 -rtc base=localtime,clock=host,driftfix=slew -device virtio-serial-pci,id=virtio-serial0,max_ports=16,vectors=0,bus=pci.0,addr=0x3 -chardev socket,id=channel1,path=/tmp/helloworld1,server,nowait -device virtserialport,chardev=channel1,name=com.redhat.rhevm.vdsm,bus=virtio-serial0.0,id=port1 -chardev socket,id=channel2,path=/tmp/helloworld2,server,nowait -device virtserialport,chardev=channel2,name=com.redhat.rhevm.vdsm,bus=virtio-serial0.0,id=port2 -drive file=/home/RHEL-Server-6.4-64.qcow2,if=none,id=drive-system-disk,format=qcow2,cache=none,aio=native,werror=stop,rerror=stop,serial=QEMU-DISK1 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-system-disk,id=system-disk,bootindex=1 -netdev tap,id=hostnet0,vhost=off,script=/etc/qemu-ifup -device virtio-net-pci,netdev=hostnet0,id=virtio-net-pci0,mac=08:2e:5f:0a:1d:b1,bus=pci.0,addr=0x5,bootindex=2 -device virtio-balloon-pci,id=ballooning,bus=pci.0,addr=0x6 -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 -serial unix:/tmp/ttyS0,server,nowait -qmp tcp:0:4444,server,nowait -k en-us -boot menu=on -vnc :1 -spice disable-ticketing,port=5931 -vga qxl -monitor stdio
2. suspend the guest.
guest] # pm-suspend
3. hot-plug a virtio-blk data disk.
(qemu) __com.redhat_drive_add file=/home/my-data-disk.raw,id=drive-data-disk,format=raw,cache=none,aio=native
(qemu) device_add virtio-blk-pci,bus=pci.0,addr=0x7,drive=drive-data-disk,id=data-disk
4. wait a moment, then resume the guest.
(qemu) system_wakeup
5. wait a moment, then do S3 again.
guest] # pm-suspend
6. hot-remove the data disk.
(qemu) device_del data-disk
7. resume the guest after a while.
(qemu) system_wakeup
8. repeat steps 2 to 7 in a loop.

Results:
Tested it 10 times but did not hit any problem; qemu, guest, and host all work well.

Comment 13 Sibiao Luo 2013-04-12 02:55:39 UTC
Tried another scenario that can hit it: qemu core dumps.

host info:
kernel-2.6.32-369.el6.x86_64
qemu-kvm-0.12.1.2-2.360.el6.x86_64
guest info:
kernel-2.6.32-369.el6.x86_64

Steps:
1. boot a guest with '-global PIIX4_PM.disable_s3=0'.
e.g:# /usr/libexec/qemu-kvm -S -M rhel6.4.0 -cpu host -enable-kvm -m 2048 -smp 2,sockets=2,cores=1,threads=1 -no-kvm-pit-reinjection -name sluo-test -uuid b18fdd6c-a213-4022-9ca4-5d07225e40b0 -rtc base=localtime,clock=host,driftfix=slew -device virtio-serial-pci,id=virtio-serial0,max_ports=16,vectors=0,bus=pci.0,addr=0x3 -chardev socket,id=channel1,path=/tmp/helloworld1,server,nowait -device virtserialport,chardev=channel1,name=com.redhat.rhevm.vdsm,bus=virtio-serial0.0,id=port1 -chardev socket,id=channel2,path=/tmp/helloworld2,server,nowait -device virtserialport,chardev=channel2,name=com.redhat.rhevm.vdsm,bus=virtio-serial0.0,id=port2 -drive file=/home/RHEL-Server-6.4-64.qcow2,if=none,id=drive-system-disk,format=qcow2,cache=none,aio=native,werror=stop,rerror=stop,serial=QEMU-DISK1 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-system-disk,id=system-disk,bootindex=1 -netdev tap,id=hostnet0,vhost=off,script=/etc/qemu-ifup -device virtio-net-pci,netdev=hostnet0,id=virtio-net-pci0,mac=08:2e:5f:0a:1d:b1,bus=pci.0,addr=0x5,bootindex=2 -device virtio-balloon-pci,id=ballooning,bus=pci.0,addr=0x6 -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 -serial unix:/tmp/ttyS0,server,nowait -qmp tcp:0:4444,server,nowait -k en-us -boot menu=on -vnc :1 -spice disable-ticketing,port=5931 -vga qxl -monitor stdio
2. stop the guest.
(qemu) stop
3. hot-plug a virtio-blk data disk.
(qemu) __com.redhat_drive_add file=/home/my-data-disk.raw,id=drive-data-disk,format=raw,cache=none,aio=native
(qemu) device_add virtio-blk-pci,bus=pci.0,addr=0x7,drive=drive-data-disk,id=data-disk
4. wait a moment, then resume the guest.
(qemu) cont
5. wait a moment, then stop it again.
(qemu) stop
6. hot-remove the data disk.
(qemu) device_del data-disk
7. resume the guest after a while.
(qemu) cont
8. repeat steps 2 to 7 in a loop.

Results:
After some repeated testing, qemu-kvm will core dump.
(qemu) Segmentation fault (core dumped)

Program terminated with signal 11, Segmentation fault.
#0  virtio_blk_handle_request (req=0x21, mrb=0x7fff44bea580) at /usr/src/debug/qemu-kvm-0.12.1.2/hw/virtio-blk.c:379
379	    if (req->elem.out_num < 1 || req->elem.in_num < 1) {
(gdb) bt
#0  virtio_blk_handle_request (req=0x21, mrb=0x7fff44bea580) at /usr/src/debug/qemu-kvm-0.12.1.2/hw/virtio-blk.c:379
#1  0x00007f8972acfe4b in virtio_blk_dma_restart_bh (opaque=0x7f89747eedf0)
    at /usr/src/debug/qemu-kvm-0.12.1.2/hw/virtio-blk.c:471
#2  0x00007f8972af2161 in qemu_bh_poll () at /usr/src/debug/qemu-kvm-0.12.1.2/async.c:70
#3  0x00007f8972abb479 in main_loop_wait (timeout=1000) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:4017
#4  0x00007f8972add9ea in kvm_main_loop () at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:2244
#5  0x00007f8972abe078 in main_loop (argc=58, argv=<value optimized out>, envp=<value optimized out>)
    at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:4187
#6  main (argc=58, argv=<value optimized out>, envp=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:6526
(gdb) bt full
#0  virtio_blk_handle_request (req=0x21, mrb=0x7fff44bea580) at /usr/src/debug/qemu-kvm-0.12.1.2/hw/virtio-blk.c:379
No locals.
#1  0x00007f8972acfe4b in virtio_blk_dma_restart_bh (opaque=0x7f89747eedf0)
    at /usr/src/debug/qemu-kvm-0.12.1.2/hw/virtio-blk.c:471
        s = 0x7f89747eedf0
        req = 0x21
        mrb = {blkreq = {{sector = 0, nb_sectors = 0, qiov = 0x0, cb = 0, opaque = 0x0, error = 0} <repeats 32 times>}, 
          num_writes = 0, old_bs = 0x0}
#2  0x00007f8972af2161 in qemu_bh_poll () at /usr/src/debug/qemu-kvm-0.12.1.2/async.c:70
        bh = <value optimized out>
        bhp = <value optimized out>
        next = 0x7f89747eefd0
        ret = 1
        nesting = 1
#3  0x00007f8972abb479 in main_loop_wait (timeout=1000) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:4017
        ioh = <value optimized out>
        rfds = {fds_bits = {1, 0 <repeats 15 times>}}
        wfds = {fds_bits = {0 <repeats 16 times>}}
        xfds = {fds_bits = {0 <repeats 16 times>}}
        ret = 1
        nfds = 38
        tv = {tv_sec = 0, tv_usec = 999996}
#4  0x00007f8972add9ea in kvm_main_loop () at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:2244
        fds = {26, 27}
        mask = {__val = {268443712, 0 <repeats 15 times>}}
        sigfd = 28
---Type <return> to continue, or q <return> to quit---
#5  0x00007f8972abe078 in main_loop (argc=58, argv=<value optimized out>, envp=<value optimized out>)
    at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:4187
        r = <value optimized out>
#6  main (argc=58, argv=<value optimized out>, envp=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:6526
        gdbstub_dev = 0x0
        i = <value optimized out>
        snapshot = 0
        linux_boot = 0
        initrd_filename = 0x0
        kernel_filename = 0x0
        kernel_cmdline = 0x7f8972c703cf ""
        boot_devices = "cad", '\000' <repeats 29 times>
        ds = <value optimized out>
        dcl = <value optimized out>
        cyls = 0
        heads = 0
        secs = 0
        translation = 0
        hda_opts = 0x7f8972fca630
        opts = <value optimized out>
        olist = <value optimized out>
        optind = 58
        optarg = 0x7fff44bee7ed "stdio"
        loadvm = 0x0
        machine = 0x7f8972fc4020
        cpu_model = 0x7fff44bee2f0 "host"
---Type <return> to continue, or q <return> to quit---
        fds = {1923317760, 32649}
        tb_size = 0
        pid_file = 0x0
        incoming = 0x0
        fd = 0
        pwd = 0x0
        chroot_dir = 0x0
        run_as = 0x0
        env = <value optimized out>
        show_vnc_port = 0
        defconfig = <value optimized out>
        defconfig_verbose = <value optimized out>
(gdb)

Comment 16 Sibiao Luo 2013-04-12 04:59:51 UTC
Also tried the same testing with the virtio-scsi device, but did not hit this issue. So this issue only happens with the virtio-blk device.

Comment 17 Markus Armbruster 2013-05-22 13:14:08 UTC
Reproduced with

$ for ((i=0; i<100; i++)); do echo stop; echo __com.redhat_drive_add file=tmp.qcow2,id=hotdrive$i; echo device_add virtio-blk-pci,drive=hotdrive$i,id=hotdisk$i,addr=7; sleep 3; echo c; sleep 3; echo stop; echo device_del hotdisk$i; sleep 3; echo c; sleep 3; done | socat STDIO UNIX:test-hmp

where test-hmp is a human monitor socket.

I get varying backtraces, after a varying number of iterations.

MALLOC_PERTURB_ doesn't seem to make it less random.

Also reproduced under valgrind.  Which complains a lot.  Maybe I can
find a clue there.

I tried to reproduce with upstream QEMU, but couldn't.  Maybe I can
find a clue in the list of upstream commits we don't have.

Comment 18 Stefan Hajnoczi 2013-08-08 09:21:24 UTC
(In reply to Markus Armbruster from comment #17)
> Reproduced with
> 
> $ for ((i=0; i<100; i++)); do echo stop; echo __com.redhat_drive_add
> file=tmp.qcow2,id=hotdrive$i; echo device_add
> virtio-blk-pci,drive=hotdrive$i,id=hotdisk$i,addr=7; sleep 3; echo c; sleep
> 3; echo stop; echo device_del hotdisk$i; sleep 3; echo c; sleep 3; done |
> socat STDIO UNIX:test-hmp
> 
> where test-hmp is a human monitor.
> 
> I get varying backtraces, after a varying number of iterations.
> 
> MALLOC_PERTURB_ doesn't seem to make it less random.
> 
> Also reproduced under valgrind.  Which complains a lot.  Maybe I can
> find a clue there.
> 
> I tried to reproduce with upstream QEMU, but couldn't.  Maybe I can
> find a clue in the list of upstream commits we don't have.

This looks like the missing qemu_del_vm_change_state_handler(s->change) call in virtio_blk_exit().  It was fixed upstream in:

  commit 69b302b2044a9a0f6d157d25b39a91ff7124c61f
  Author: Christian Borntraeger <borntraeger.com>
  Date:   Fri Feb 22 14:37:10 2013 +0100

      virtio-blk: fix unplug + virsh reboot

Laszlo backported this commit into RHEL 6.x qemu-kvm:

  commit 44d9306ede26ea2881e6c1f104b901be647022a8
  Author: Laszlo Ersek <lersek>
  Date:   Thu May 9 10:27:50 2013 +0200

    virtio-blk: fix unplug + virsh reboot
    
    RH-Author: Laszlo Ersek <lersek>
    Message-id: <1368095270-21681-1-git-send-email-lersek>
    Patchwork-id: 51208
    O-Subject: [RHEL-6.5 qemu-kvm 869586 PATCH] virtio-blk: fix unplug + virsh reboot
    Bugzilla: 869586

Comment 19 Markus Armbruster 2013-08-09 10:50:08 UTC
I can reproduce the bug with a qemu-kvm built from commit 44d9306^, but not with one built from commit 44d9306.  Closing as duplicate.  Stefan, thanks a lot for the pointer!

*** This bug has been marked as a duplicate of bug 869586 ***