Bug 918798 - qemu segmentation fault on hotplug/hotunplug when the vm is paused
Summary: qemu segmentation fault on hotplug/hotunplug when the vm is paused
Keywords:
Status: CLOSED DUPLICATE of bug 869586
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: qemu-kvm
Version: 6.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Assignee: Markus Armbruster
QA Contact: Haim
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2013-03-06 21:29 UTC by Dafna Ron
Modified: 2013-08-09 10:50 UTC (History)
15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-08-09 10:50:08 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
logs (681.66 KB, application/x-gzip)
2013-03-06 21:29 UTC, Dafna Ron
no flags Details
logs (1.33 MB, application/x-gzip)
2013-03-12 11:07 UTC, Dafna Ron
no flags Details

Description Dafna Ron 2013-03-06 21:29:08 UTC
Created attachment 706270 [details]
logs

Description of problem:

I created a vm with 2 disks, detached one, installed the OS on the vm, and after the install finished I moved its disk.
During the disk move I attached the second disk -> deactivated and activated it again, and the vm crashed.
Looking at the engine log, it seems as if the engine is sending a destroy to the vm.

Version-Release number of selected component (if applicable):

sf9
vdsm-4.10.2-10.0.el6ev.x86_64
libvirt-0.10.2-18.el6_4.eblake.2.x86_64

How reproducible:

50%

Steps to Reproduce:
1. create a vm with two disks and detach the second disk
2. install rhel -> after the install finishes -> move the disk
3. after the live storage migration snapshot is created -> attach the 2nd disk
4. deactivate the second disk -> activate the second disk
  
Actual results:

The vm appears to crash, but looking at the engine log it seems that the engine sent a destroy vm, which also fails the live storage migration.

Expected results:

1. either block attaching a disk to a vm that is live-migrating a disk
2. or fix the crash

Additional info: logs


2013-03-06 11:00:06,749 INFO  [org.ovirt.engine.core.bll.HotPlugDiskToVmCommand] (pool-3-thread-41) [491b1d94] Running command: HotPlugDiskToVmCommand internal: false. Entities affected :  ID: 53c25ad5-184e-49fe-b3d9-09b46db4d4df Type: VM


2013-03-06 11:00:21,801 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (QuartzScheduler_Worker-93) [2348d228] START, DestroyVDSCommand(HostName = gold-vdsc, HostId = 2982e993-2ca5-42bb-86ed-8db10986c47e, vmId=53c25ad5-184e-49fe-b3d9-09b46db4d4df, force=false, secondsToWait=0, gracefully=false), log id: 23289ee1
2013-03-06 11:00:21,826 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (QuartzScheduler_Worker-93) [2348d228] FINISH, DestroyVDSCommand, log id: 23289ee1
2013-03-06 11:00:21,837 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-93) [2348d228] vm LALA running in db and not running in vds - add to rerun treatment. vds gold-vdsc


2013-03-06 11:06:22,741 INFO  [org.ovirt.engine.core.bll.EntityAsyncTask] (QuartzScheduler_Worker-9) [36703988] EntityAsyncTask::EndAction: Ending action for 1 tasks (entity ID
: 01acfc7f-2dec-4e2c-a950-73c53c2f4e11): calling EndAction for action type LiveMigrateDisk.
2013-03-06 11:06:22,742 INFO  [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-3-thread-41) [36703988] EntityAsyncTask::EndCommandAction [within thread] context: Attempting to
 EndAction LiveMigrateDisk, executionIndex: 1
2013-03-06 11:06:22,753 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-3-thread-41) [61534099] Ending command with failure: org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand
2013-03-06 11:06:22,753 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-3-thread-41) [61534099] Reverting task deleteImage, handler: org.ovirt.engine.core.bll.lsm.CreateImagePlaceholderTaskHandler
2013-03-06 11:06:22,757 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-3-thread-41) [61534099] START, DeleteImageGroupVDSCommand( storagePoolId = 696adcda-73f3-46b3-ae74-a188c459dfcc, ignoreFailoverLimit = false, compatabilityVersion = null, storageDomainId = 964a332f-64e4-43cf-9c04-d581a21f0ad2, imageGroupId = 5f286c92-6d1d-4827-a9c0-5a29284a96ed, postZeros = false, forceDelete = false), log id: 33dc8b6e


vdsm: 

libvirtEventLoop::ERROR::2013-03-06 10:46:13,494::guestIF::243::vm.Vm::(desktopLock) vmId=`53c25ad5-184e-49fe-b3d9-09b46db4d4df`::desktopLock failed
Traceback (most recent call last):
  File "/usr/share/vdsm/guestIF.py", line 241, in desktopLock
    self._forward("lock-screen")
  File "/usr/share/vdsm/guestIF.py", line 135, in _forward
    self._sock.send(message)
  File "/usr/lib64/python2.6/socket.py", line 167, in _dummy
    raise error(EBADF, 'Bad file descriptor')
error: [Errno 9] Bad file descriptor
libvirtEventLoop::DEBUG::2013-03-06 10:46:13,495::libvirtconnection::80::vm.Vm::(__eventCallback) vmId=`53c25ad5-184e-49fe-b3d9-09b46db4d4df`::graphics event phase 2 localAddr 
{'node': '10.35.115.21', 'service': '', 'family': 0} remoteAddr {'node': '10.35.201.83', 'service': '', 'family': 0}authScheme spice subject []
libvirtEventLoop::DEBUG::2013-03-06 10:46:13,496::guestIF::240::vm.Vm::(desktopLock) vmId=`53c25ad5-184e-49fe-b3d9-09b46db4d4df`::desktopLock called
libvirtEventLoop::ERROR::2013-03-06 10:46:13,496::guestIF::243::vm.Vm::(desktopLock) vmId=`53c25ad5-184e-49fe-b3d9-09b46db4d4df`::desktopLock failed
Traceback (most recent call last):
  File "/usr/share/vdsm/guestIF.py", line 241, in desktopLock
    self._forward("lock-screen")
  File "/usr/share/vdsm/guestIF.py", line 135, in _forward
    self._sock.send(message)
  File "/usr/lib64/python2.6/socket.py", line 167, in _dummy
    raise error(EBADF, 'Bad file descriptor')
error: [Errno 9] Bad file descriptor


21a891f3-fc35-4d1e-afa1-e5ca8b1b7791::ERROR::2013-03-06 11:00:19,880::task::833::TaskManager.Task::(_setError) Task=`e377557f-a05f-4dfd-9abf-824989b4fa66`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 840, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/vm.py", line 907, in __afterVolumeExtension
    apparentSize, trueSize = self.__verifyVolumeExtension(volInfo)
  File "/usr/share/vdsm/vm.py", line 857, in __verifyVolumeExtension
    self.__refreshDriveVolume(volInfo)
  File "/usr/share/vdsm/vm.py", line 848, in __refreshDriveVolume
    self.cont(guestCpuLocked=True)
  File "/usr/share/vdsm/vm.py", line 948, in cont
    self.updateGuestCpuRunning()
  File "/usr/share/vdsm/libvirtvm.py", line 1370, in updateGuestCpuRunning
    self._guestCpuRunning = (self._dom.info()[0] ==
  File "/usr/share/vdsm/libvirtvm.py", line 529, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 104, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1855, in info
    if ret is None: raise libvirtError ('virDomainGetInfo() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid '53c25ad5-184e-49fe-b3d9-09b46db4d4df'
21a891f3-fc35-4d1e-afa1-e5ca8b1b7791::DEBUG::2013-03-06 11:00:19,885::task::852::TaskManager.Task::(_run) Task=`e377557f-a05f-4dfd-9abf-824989b4fa66`::Task._run: e377557f-a05f-
4dfd-9abf-824989b4fa66 ({'newSize': 3072, 'domainID': '52e4917b-10bc-4b03-9b82-a2e15732c15b', 'name': u'vda', 'volumeID': '0d842363-d484-4fd5-983b-db443e1ec0cc', 'imageID': '5f
286c92-6d1d-4827-a9c0-5a29284a96ed', 'poolID': '696adcda-73f3-46b3-ae74-a188c459dfcc'},) {} failed - stopping task
libvirtEventLoop::ERROR::2013-03-06 11:00:19,881::libvirtconnection::91::vds::(__eventCallback) Error running VM callback
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 67, in __eventCallback
    v._onLibvirtLifecycleEvent(event, detail, None)
  File "/usr/share/vdsm/libvirtvm.py", line 3033, in _onLibvirtLifecycleEvent
    hooks.after_vm_cont(self._dom.XMLDesc(0), self.conf)
  File "/usr/share/vdsm/libvirtvm.py", line 529, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 104, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 381, in XMLDesc
    if ret is None: raise libvirtError ('virDomainGetXMLDesc() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid '53c25ad5-184e-49fe-b3d9-09b46db4d4df'

Comment 1 Liron Aravot 2013-03-10 10:14:27 UTC
Dafna, the vm is reported as Down by vdsm after the connection with the qemu process was lost, and then the engine sends the destroy.

libvirtEventLoop::DEBUG::2013-03-06 11:00:20,028::vm::1047::vm.Vm::(setDownStatus) vmId=`53c25ad5-184e-49fe-b3d9-09b46db4d4df`::Changed state to Down: Lost connection with qemu process

Thread-46827::DEBUG::2013-03-06 11:00:21,342::BindingXMLRPC::913::vds::(wrapper) client [10.35.161.49]::call vmGetStats with ('53c25ad5-184e-49fe-b3d9-09b46db4d4df',) {}
Thread-46827::DEBUG::2013-03-06 11:00:21,342::BindingXMLRPC::920::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': '
Down', 'hash': '-6607417337149127009', 'exitMessage': 'Lost connection with qemu process', 'vmId': '53c25ad5-184e-49fe-b3d9-09b46db4d4df', 'timeOffset': '-43200', 'exitCode'
: 1}]}

Thread-46828::DEBUG::2013-03-06 11:00:21,370::libvirtvm::2588::vm.Vm::(destroy) vmId=`53c25ad5-184e-49fe-b3d9-09b46db4d4df`::destroy Called


Can you elaborate on whether there's an issue here?

Comment 2 Dafna Ron 2013-03-12 08:53:00 UTC
There was no issue - reproduction was as written in the steps.
Please talk to the vdsm, kvm, or libvirt people if you think this issue is not related to the engine component.

Comment 3 Liron Aravot 2013-03-12 09:15:35 UTC
Dafna, please attach all the other relevant logs (libvirt/qemu process) so it's possible to see what happened to the vm and why it went down. If you can, please also update the header of the current bug, as the destroy that was sent is not a bug.

Comment 5 Dafna Ron 2013-03-12 11:07:06 UTC
I just hit the same issue when live-migrating a preallocated disk twice.
Attaching the logs from it.

Comment 6 Dafna Ron 2013-03-12 11:07:50 UTC
Created attachment 708912 [details]
logs

Comment 7 Liron Aravot 2013-03-12 12:02:06 UTC
Federico, please take a look - vdsm reports that the vm is down.

Comment 8 Federico Simoncelli 2013-03-12 15:20:08 UTC
Yes the qemu process crashed for some reason. There's no trace of the reason on Dafna's host but I have been able to reproduce the issue looping a pause/hot(un)plug/resume:

while true; do
  virsh suspend vm1
  virsh attach-disk vm1 /tmp/hotplug-disk.qcow vdb
  sleep 5
  virsh resume vm1
  sleep 5
  virsh suspend vm1
  virsh detach-disk vm1 /tmp/hotplug-disk.qcow vdb
  sleep 5
  virsh resume vm1
  sleep 5
done

After a while the qemu process crashed with:

Program received signal SIGSEGV, Segmentation fault.
qemu_bh_delete (bh=0x0) at async.c:119
119	    bh->scheduled = 0;

#0  qemu_bh_delete (bh=0x0) at async.c:119
#1  0x00007f0917b3fe80 in virtio_blk_dma_restart_bh (opaque=0x7f0919136c00) at /usr/src/debug/qemu-kvm-1.2.0/hw/virtio-blk.c:440
#2  0x00007f0917971026 in qemu_bh_poll () at async.c:71
#3  0x00007f0917a8e7f3 in main_loop_wait (nonblocking=<optimized out>) at main-loop.c:506
#4  0x00007f091796ad63 in main_loop () at /usr/src/debug/qemu-kvm-1.2.0/vl.c:1643
#5  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /usr/src/debug/qemu-kvm-1.2.0/vl.c:3792

# rpm -q qemu libvirt
qemu-1.2.2-6.fc18.x86_64
libvirt-0.10.2.3-1.fc18.x86_64

We don't have any real proof that it's exactly the same issue, even though I feel there's a high chance that it is.
Qemu probably needs to fix this anyway.

Comment 9 Ayal Baron 2013-03-12 20:25:00 UTC
(In reply to comment #8)
> Yes the qemu process crashed for some reason. There's no trace of the reason
> on Dafna's host but I have been able to reproduce the issue looping a
> pause/hot(un)plug/resume:
> 
> while true; do
>   virsh suspend vm1
>   virsh attach-disk vm1 /tmp/hotplug-disk.qcow vdb
>   sleep 5
>   virsh resume vm1
>   sleep 5
>   virsh suspend vm1
>   virsh detach-disk vm1 /tmp/hotplug-disk.qcow vdb
>   sleep 5
>   virsh resume vm1
>   sleep 5
> done
> 
> After a while the qemu process crashed with:
> 
> Program received signal SIGSEGV, Segmentation fault.
> qemu_bh_delete (bh=0x0) at async.c:119
> 119	    bh->scheduled = 0;
> 
> #0  qemu_bh_delete (bh=0x0) at async.c:119
> #1  0x00007f0917b3fe80 in virtio_blk_dma_restart_bh (opaque=0x7f0919136c00)
> at /usr/src/debug/qemu-kvm-1.2.0/hw/virtio-blk.c:440
> #2  0x00007f0917971026 in qemu_bh_poll () at async.c:71
> #3  0x00007f0917a8e7f3 in main_loop_wait (nonblocking=<optimized out>) at
> main-loop.c:506
> #4  0x00007f091796ad63 in main_loop () at
> /usr/src/debug/qemu-kvm-1.2.0/vl.c:1643
> #5  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
> at /usr/src/debug/qemu-kvm-1.2.0/vl.c:3792
> 
> # rpm -q qemu libvirt
> qemu-1.2.2-6.fc18.x86_64
> libvirt-0.10.2.3-1.fc18.x86_64
> 
> We don't have any real proof that it's exactly the same issue, even though I
> feel there's a high chance that it is.
> Qemu probably needs to fix this anyway.

So why didn't we move the bug to qemu?

Comment 12 Sibiao Luo 2013-04-12 02:44:12 UTC
I tried the scenario from comment #8 more than 10 times, but did not hit this issue.

host info:
kernel-2.6.32-369.el6.x86_64
qemu-kvm-0.12.1.2-2.360.el6.x86_64
guest info:
kernel-2.6.32-369.el6.x86_64

Steps:
1.boot a guest with '-global PIIX4_PM.disable_s3=0'.
e.g:# /usr/libexec/qemu-kvm -S -M rhel6.4.0 -cpu host -enable-kvm -m 2048 -smp 2,sockets=2,cores=1,threads=1 -no-kvm-pit-reinjection -name sluo-test -uuid b18fdd6c-a213-4022-9ca4-5d07225e40b0 -rtc base=localtime,clock=host,driftfix=slew -device virtio-serial-pci,id=virtio-serial0,max_ports=16,vectors=0,bus=pci.0,addr=0x3 -chardev socket,id=channel1,path=/tmp/helloworld1,server,nowait -device virtserialport,chardev=channel1,name=com.redhat.rhevm.vdsm,bus=virtio-serial0.0,id=port1 -chardev socket,id=channel2,path=/tmp/helloworld2,server,nowait -device virtserialport,chardev=channel2,name=com.redhat.rhevm.vdsm,bus=virtio-serial0.0,id=port2 -drive file=/home/RHEL-Server-6.4-64.qcow2,if=none,id=drive-system-disk,format=qcow2,cache=none,aio=native,werror=stop,rerror=stop,serial=QEMU-DISK1 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-system-disk,id=system-disk,bootindex=1 -netdev tap,id=hostnet0,vhost=off,script=/etc/qemu-ifup -device virtio-net-pci,netdev=hostnet0,id=virtio-net-pci0,mac=08:2e:5f:0a:1d:b1,bus=pci.0,addr=0x5,bootindex=2 -device virtio-balloon-pci,id=ballooning,bus=pci.0,addr=0x6 -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 -serial unix:/tmp/ttyS0,server,nowait -qmp tcp:0:4444,server,nowait -k en-us -boot menu=on -vnc :1 -spice disable-ticketing,port=5931 -vga qxl -monitor stdio
2.suspend the guest.
guest] # pm-suspend
3.do hot-plug a virtio-blk data disk.
(qemu) __com.redhat_drive_add file=/home/my-data-disk.raw,id=drive-data-disk,format=raw,cache=none,aio=native
(qemu) device_add virtio-blk-pci,bus=pci.0,addr=0x7,drive=drive-data-disk,id=data-disk
4.wait a moment then resume the guest.
(qemu) system_wakeup
5.wait a moment then do S3 again.
guest] # pm-suspend
6.hot-remove the data disk.
(qemu) device_del data-disk
7.resume the guest after a while.
(qemu) system_wakeup
8.repeat step 2 to step 7 in a loop.

Results:
Tested it 10 times but did not hit any problem; qemu, guest, and host all work well.

Comment 13 Sibiao Luo 2013-04-12 02:55:39 UTC
Tried it with another scenario that can hit it; qemu core dumps.

host info:
kernel-2.6.32-369.el6.x86_64
qemu-kvm-0.12.1.2-2.360.el6.x86_64
guest info:
kernel-2.6.32-369.el6.x86_64

Steps:
1.boot a guest with '-global PIIX4_PM.disable_s3=0'.
e.g:# /usr/libexec/qemu-kvm -S -M rhel6.4.0 -cpu host -enable-kvm -m 2048 -smp 2,sockets=2,cores=1,threads=1 -no-kvm-pit-reinjection -name sluo-test -uuid b18fdd6c-a213-4022-9ca4-5d07225e40b0 -rtc base=localtime,clock=host,driftfix=slew -device virtio-serial-pci,id=virtio-serial0,max_ports=16,vectors=0,bus=pci.0,addr=0x3 -chardev socket,id=channel1,path=/tmp/helloworld1,server,nowait -device virtserialport,chardev=channel1,name=com.redhat.rhevm.vdsm,bus=virtio-serial0.0,id=port1 -chardev socket,id=channel2,path=/tmp/helloworld2,server,nowait -device virtserialport,chardev=channel2,name=com.redhat.rhevm.vdsm,bus=virtio-serial0.0,id=port2 -drive file=/home/RHEL-Server-6.4-64.qcow2,if=none,id=drive-system-disk,format=qcow2,cache=none,aio=native,werror=stop,rerror=stop,serial=QEMU-DISK1 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-system-disk,id=system-disk,bootindex=1 -netdev tap,id=hostnet0,vhost=off,script=/etc/qemu-ifup -device virtio-net-pci,netdev=hostnet0,id=virtio-net-pci0,mac=08:2e:5f:0a:1d:b1,bus=pci.0,addr=0x5,bootindex=2 -device virtio-balloon-pci,id=ballooning,bus=pci.0,addr=0x6 -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 -serial unix:/tmp/ttyS0,server,nowait -qmp tcp:0:4444,server,nowait -k en-us -boot menu=on -vnc :1 -spice disable-ticketing,port=5931 -vga qxl -monitor stdio
2.stop the guest.
(qemu) stop
3.do hot-plug a virtio-blk data disk.
(qemu) __com.redhat_drive_add file=/home/my-data-disk.raw,id=drive-data-disk,format=raw,cache=none,aio=native
(qemu) device_add virtio-blk-pci,bus=pci.0,addr=0x7,drive=drive-data-disk,id=data-disk
4.wait a moment then resume the guest.
(qemu) cont
5.wait a moment then stop it again.
(qemu) stop
6.hot-remove the data disk.
(qemu) device_del data-disk
7.resume the guest after a while.
(qemu) cont
8.repeat step 2 to step 7 in a loop.

Results:
After repeating the test several times, qemu-kvm will core dump.
(qemu) Segmentation fault (core dumped)

Program terminated with signal 11, Segmentation fault.
#0  virtio_blk_handle_request (req=0x21, mrb=0x7fff44bea580) at /usr/src/debug/qemu-kvm-0.12.1.2/hw/virtio-blk.c:379
379	    if (req->elem.out_num < 1 || req->elem.in_num < 1) {
(gdb) bt
#0  virtio_blk_handle_request (req=0x21, mrb=0x7fff44bea580) at /usr/src/debug/qemu-kvm-0.12.1.2/hw/virtio-blk.c:379
#1  0x00007f8972acfe4b in virtio_blk_dma_restart_bh (opaque=0x7f89747eedf0)
    at /usr/src/debug/qemu-kvm-0.12.1.2/hw/virtio-blk.c:471
#2  0x00007f8972af2161 in qemu_bh_poll () at /usr/src/debug/qemu-kvm-0.12.1.2/async.c:70
#3  0x00007f8972abb479 in main_loop_wait (timeout=1000) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:4017
#4  0x00007f8972add9ea in kvm_main_loop () at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:2244
#5  0x00007f8972abe078 in main_loop (argc=58, argv=<value optimized out>, envp=<value optimized out>)
    at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:4187
#6  main (argc=58, argv=<value optimized out>, envp=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:6526
(gdb) bt full
#0  virtio_blk_handle_request (req=0x21, mrb=0x7fff44bea580) at /usr/src/debug/qemu-kvm-0.12.1.2/hw/virtio-blk.c:379
No locals.
#1  0x00007f8972acfe4b in virtio_blk_dma_restart_bh (opaque=0x7f89747eedf0)
    at /usr/src/debug/qemu-kvm-0.12.1.2/hw/virtio-blk.c:471
        s = 0x7f89747eedf0
        req = 0x21
        mrb = {blkreq = {{sector = 0, nb_sectors = 0, qiov = 0x0, cb = 0, opaque = 0x0, error = 0} <repeats 32 times>}, 
          num_writes = 0, old_bs = 0x0}
#2  0x00007f8972af2161 in qemu_bh_poll () at /usr/src/debug/qemu-kvm-0.12.1.2/async.c:70
        bh = <value optimized out>
        bhp = <value optimized out>
        next = 0x7f89747eefd0
        ret = 1
        nesting = 1
#3  0x00007f8972abb479 in main_loop_wait (timeout=1000) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:4017
        ioh = <value optimized out>
        rfds = {fds_bits = {1, 0 <repeats 15 times>}}
        wfds = {fds_bits = {0 <repeats 16 times>}}
        xfds = {fds_bits = {0 <repeats 16 times>}}
        ret = 1
        nfds = 38
        tv = {tv_sec = 0, tv_usec = 999996}
#4  0x00007f8972add9ea in kvm_main_loop () at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:2244
        fds = {26, 27}
        mask = {__val = {268443712, 0 <repeats 15 times>}}
        sigfd = 28
---Type <return> to continue, or q <return> to quit---
#5  0x00007f8972abe078 in main_loop (argc=58, argv=<value optimized out>, envp=<value optimized out>)
    at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:4187
        r = <value optimized out>
#6  main (argc=58, argv=<value optimized out>, envp=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:6526
        gdbstub_dev = 0x0
        i = <value optimized out>
        snapshot = 0
        linux_boot = 0
        initrd_filename = 0x0
        kernel_filename = 0x0
        kernel_cmdline = 0x7f8972c703cf ""
        boot_devices = "cad", '\000' <repeats 29 times>
        ds = <value optimized out>
        dcl = <value optimized out>
        cyls = 0
        heads = 0
        secs = 0
        translation = 0
        hda_opts = 0x7f8972fca630
        opts = <value optimized out>
        olist = <value optimized out>
        optind = 58
        optarg = 0x7fff44bee7ed "stdio"
        loadvm = 0x0
        machine = 0x7f8972fc4020
        cpu_model = 0x7fff44bee2f0 "host"
---Type <return> to continue, or q <return> to quit---
        fds = {1923317760, 32649}
        tb_size = 0
        pid_file = 0x0
        incoming = 0x0
        fd = 0
        pwd = 0x0
        chroot_dir = 0x0
        run_as = 0x0
        env = <value optimized out>
        show_vnc_port = 0
        defconfig = <value optimized out>
        defconfig_verbose = <value optimized out>
(gdb)

Comment 16 Sibiao Luo 2013-04-12 04:59:51 UTC
Also tried the same testing with a virtio-scsi device, but did not hit this issue, so this issue only happens with the virtio-blk device.

Comment 17 Markus Armbruster 2013-05-22 13:14:08 UTC
Reproduced with

$ for ((i=0; i<100; i++)); do echo stop; echo __com.redhat_drive_add file=tmp.qcow2,id=hotdrive$i; echo device_add virtio-blk-pci,drive=hotdrive$i,id=hotdisk$i,addr=7; sleep 3; echo c; sleep 3; echo stop; echo device_del hotdisk$i; sleep 3; echo c; sleep 3; done | socat STDIO UNIX:test-hmp

where test-hmp is a human monitor.

I get varying backtraces, after a varying number of iterations.

MALLOC_PERTURB_ doesn't seem to make it less random.

Also reproduced under valgrind.  Which complains a lot.  Maybe I can
find a clue there.

I tried to reproduce with upstream QEMU, but couldn't.  Maybe I can
find a clue in the list of upstream commits we don't have.

Comment 18 Stefan Hajnoczi 2013-08-08 09:21:24 UTC
(In reply to Markus Armbruster from comment #17)
> Reproduced with
> 
> $ for ((i=0; i<100; i++)); do echo stop; echo __com.redhat_drive_add
> file=tmp.qcow2,id=hotdrive$i; echo device_add
> virtio-blk-pci,drive=hotdrive$i,id=hotdisk$i,addr=7; sleep 3; echo c; sleep
> 3; echo stop; echo device_del hotdisk$i; sleep 3; echo c; sleep 3; done |
> socat STDIO UNIX:test-hmp
> 
> where test-hmp is a human monitor.
> 
> I get varying backtraces, after a varying number of iterations.
> 
> MALLOC_PERTURB_ doesn't seem to make it less random.
> 
> Also reproduced under valgrind.  Which complains a lot.  Maybe I can
> find a clue there.
> 
> I tried to reproduce with upstream QEMU, but couldn't.  Maybe I can
> find a clue in the list of upstream commits we don't have.

This looks like the missing qemu_del_vm_change_state_handler(s->change) call in virtio_blk_exit().  It was fixed upstream in:

  commit 69b302b2044a9a0f6d157d25b39a91ff7124c61f
  Author: Christian Borntraeger <borntraeger.com>
  Date:   Fri Feb 22 14:37:10 2013 +0100

      virtio-blk: fix unplug + virsh reboot

Laszlo backported this commit into RHEL 6.x qemu-kvm:

  commit 44d9306ede26ea2881e6c1f104b901be647022a8
  Author: Laszlo Ersek <lersek>
  Date:   Thu May 9 10:27:50 2013 +0200

    virtio-blk: fix unplug + virsh reboot
    
    RH-Author: Laszlo Ersek <lersek>
    Message-id: <1368095270-21681-1-git-send-email-lersek>
    Patchwork-id: 51208
    O-Subject: [RHEL-6.5 qemu-kvm 869586 PATCH] virtio-blk: fix unplug + virsh reboot
    Bugzilla: 869586
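
For context, the failure mode described above works roughly like this: virtio-blk registers a VM change-state callback that, when the guest resumes, schedules a bottom half to restart in-flight requests. If the device is hot-unplugged while the VM is paused and that callback is never unregistered, the next resume invokes it with a pointer to the already-freed device state, which is consistent with the varying backtraces seen above. A simplified sketch of the pattern (names follow the upstream commit cited above; this is not the literal RHEL 6 qemu-kvm source):

/* Change-state callback registered when the virtio-blk device is created;
 * 'opaque' points at the device's VirtIOBlock state. */
static void virtio_blk_dma_restart_cb(void *opaque, int running,
                                      RunState state)
{
    VirtIOBlock *s = opaque;

    if (!running) {
        return;
    }
    /* Defer restarting any queued requests to a bottom half. */
    if (!s->bh) {
        s->bh = qemu_bh_new(virtio_blk_dma_restart_bh, s);
        qemu_bh_schedule(s->bh);
    }
}

static void virtio_blk_exit(VirtIODevice *vdev)
{
    VirtIOBlock *s = to_virtio_blk(vdev);

    /* The call named in comment 18: without it, the callback above stays
     * registered after hot-unplug and fires on the next resume with a
     * dangling 's', so the bottom half runs on freed memory. */
    qemu_del_vm_change_state_handler(s->change);

    /* ... existing device teardown continues here ... */
}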

Comment 19 Markus Armbruster 2013-08-09 10:50:08 UTC
I can reproduce the bug with a qemu-kvm built from commit 44d9306^, but not with one built from commit 44d9306.  Closing as duplicate.  Stefan, thanks a lot for the pointer!

*** This bug has been marked as a duplicate of bug 869586 ***

