Bug 1548053
| Summary: | Repeated call to virDomainDetachDeviceFlags fails with internal error: unable to execute QEMU command 'device_del' | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | David Vallee Delisle <dvd> |
| Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
| Status: | CLOSED DUPLICATE | QA Contact: | Han Han <hhan> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.4 | CC: | berrange, chhu, coli, dasmith, dvd, dyuan, eglynn, hhan, jdenemar, jiyan, kchamart, lmen, marjones, mbooth, mschuppe, nova-maint, rbryant, sbauza, sgordon, srevivo, vromanso, xuzhang |
| Target Milestone: | pre-dev-freeze | | |
| Target Release: | 7.4 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-04-25 09:23:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
David Vallee Delisle
2018-02-22 15:42:32 UTC
As above, it seems libvirt is returning an internal error where it formerly returned an operation-failed error.

There's no regression in the error libvirt reports; you're just hitting a different path. Looking at the referenced Nova code around https://github.com/openstack/nova/blob/master/nova/virt/libvirt/guest.py#L404 I see Nova apparently calls virDomainDetachDeviceFlags repeatedly until the device is gone. The first call to virDomainDetachDeviceFlags succeeded, which means libvirt successfully told QEMU to remove the device. However, doing so requires the guest OS to cooperate, which means it's asynchronous. In other words, the device will be removed only after the guest OS allows it to be removed, which QEMU reports via the DEVICE_DELETED event. Until then the device is still present in libvirt's domain definition, so a repeated call to virDomainDetachDeviceFlags will not report the expected VIR_ERR_OPERATION_FAILED "disk %s not found" error and will ask QEMU to remove the device again. In this case, the device is somehow already removed (perhaps partially) in QEMU and it reports an error. So either the device is not quite removed (yet) and thus QEMU didn't send us the event, or we somehow ignored it. Would you be able to provide debug logs from libvirtd (see https://wiki.libvirt.org/page/DebugLogs) starting before the first successful call to virDomainDetachDeviceFlags?

The attached log file is /var/log/libvirt/qemu/instance-00000206.log, which is not really useful in this case. I asked for debug logs from libvirtd. Please consult https://wiki.libvirt.org/page/DebugLogs for how to get them. It's the /var/log/libvirt/libvirtd.log file I'm interested in.

Hi Jiri, I'll ask for the additional log files from the customer. Thanks, Mark

As I explained in comment 4, the device is still present in the domain XML until we process the DEVICE_DELETED event. Thus a repeated virDomainDetachDeviceFlags may think the device is still present while it was already removed by QEMU, and thus the monitor command fails.
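For illustration, here is a minimal sketch of the retry pattern described above, using the libvirt-python bindings. The helper name, trimmed disk XML, and timing values are placeholders for this report, not Nova's actual code.

```python
# Minimal sketch of a "retry until the device is gone" detach loop,
# assuming libvirt-python; domain name and disk XML are placeholders.
import time
import libvirt

DISK_XML = """<disk type='block' device='disk'>
  <target bus='virtio' dev='vdf'/>
</disk>"""

def detach_with_retry(dom, xml, attempts=8, interval=2):
    flags = (libvirt.VIR_DOMAIN_AFFECT_LIVE |
             libvirt.VIR_DOMAIN_AFFECT_CONFIG)   # 0x3, as seen in the log
    for _ in range(attempts):
        try:
            # Success only means the request was sent to QEMU; the guest
            # still has to release the device asynchronously.
            dom.detachDeviceFlags(xml, flags)
        except libvirt.libvirtError as e:
            if e.get_error_code() == libvirt.VIR_ERR_OPERATION_FAILED:
                # libvirt no longer finds the device: detach completed.
                return True
            raise
        time.sleep(interval)
    return False

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-0000002c')
detach_with_retry(dom, DISK_XML)
```

As the log walkthrough below shows, such a retry can also race with the DEVICE_DELETED event and fail with VIR_ERR_INTERNAL_ERROR instead of the expected VIR_ERR_OPERATION_FAILED.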
The most recent logs finally shed some light on what's happening here:

    2018-06-25 14:42:15.059+0000: 800875: debug : virThreadJobSet:96 : Thread 800875 (virNetServerHandleJob) is now running job remoteDispatchDomainDetachDeviceFlags
    2018-06-25 14:42:15.059+0000: 800875: debug : virDomainDetachDeviceFlags:8270 : dom=0x7f0858bd9b20, (VM: name=instance-0000002c, uuid=c148ab90-d61f-41c4-8c50-015e18edb819), xml=<disk type="block" device="disk"> <driver name="qemu" type="raw" cache="none" io="native"/> <source dev="/dev/dm-0"/> <target bus="virtio" dev="vdf"/> <serial>f1d3409d-c1d2-43d3-8d73-c7694abecb5b</serial> <address type="pci" domain="0x0000" bus="0x00" slot="0x08" function="0x0"/> </disk> , flags=0x3
    2018-06-25 14:42:15.059+0000: 800875: debug : qemuDomainObjBeginJobInternal:4625 : Starting job: modify (vm=0x7f0854bca070 name=instance-0000002c, current job=none async=none)
    2018-06-25 14:42:15.059+0000: 800875: debug : qemuDomainObjBeginJobInternal:4666 : Started job: modify (async=none vm=0x7f0854bca070 name=instance-0000002c)
    2018-06-25 14:42:15.071+0000: 800875: debug : qemuDomainObjEnterMonitorInternal:4895 : Entering monitor (mon=0x7f0850bd7b60 vm=0x7f0854bca070 name=instance-0000002c)
    2018-06-25 14:42:15.071+0000: 800875: info : qemuMonitorSend:1076 : QEMU_MONITOR_SEND_MSG: mon=0x7f0850bd7b60 msg={"execute":"device_del","arguments":{"id":"virtio-disk5"},"id":"libvirt-89"} fd=-1
    2018-06-25 14:42:15.072+0000: 800814: info : qemuMonitorJSONIOProcessLine:213 : QEMU_MONITOR_RECV_REPLY: mon=0x7f0850bd7b60 reply={"return": {}, "id": "libvirt-89"}
    2018-06-25 14:42:15.072+0000: 800875: debug : qemuMonitorJSONCommandWithFd:303 : Receive command reply ret=0 rxObject=0x557d80d9d0a0
    2018-06-25 14:42:15.072+0000: 800875: debug : qemuDomainObjExitMonitorInternal:4918 : Exited monitor (mon=0x7f0850bd7b60 vm=0x7f0854bca070 name=instance-0000002c)

At this point virDomainDetachDeviceFlags is waiting for 5 seconds for the DEVICE_DELETED event so that the device is already gone once the API finishes.

    2018-06-25 14:42:16.749+0000: 800876: debug : virThreadJobSet:96 : Thread 800876 (virNetServerHandleJob) is now running job remoteDispatchDomainDetachDeviceFlags
    2018-06-25 14:42:16.749+0000: 800876: debug : virDomainDetachDeviceFlags:8270 : dom=0x7f084800b860, (VM: name=instance-0000002c, uuid=c148ab90-d61f-41c4-8c50-015e18edb819), xml=<disk type="block" device="disk"> <driver name="qemu" type="raw" cache="none" io="native"/> <source dev="/dev/dm-1"/> <target bus="virtio" dev="vdc"/> <serial>3ad91f4c-683a-4d2f-adf6-aca57c173de6</serial> <address type="pci" domain="0x0000" bus="0x00" slot="0x09" function="0x0"/> </disk> , flags=0x3
    2018-06-25 14:42:16.749+0000: 800876: debug : qemuDomainObjBeginJobInternal:4625 : Starting job: modify (vm=0x7f0854bca070 name=instance-0000002c, current job=modify async=none)
    2018-06-25 14:42:16.749+0000: 800876: debug : qemuDomainObjBeginJobInternal:4648 : Waiting for job (vm=0x7f0854bca070 name=instance-0000002c)

A virDomainDetachDeviceFlags call to remove another disk (virtio-disk2) from instance-0000002c is started and has to wait for the first one to finish.

    2018-06-25 14:42:20.075+0000: 800875: debug : qemuDomainObjEndJob:4826 : Stopping job: modify (async=none vm=0x7f0854bca070 name=instance-0000002c)
    2018-06-25 14:42:20.075+0000: 800875: debug : virThreadJobClear:121 : Thread 800875 (virNetServerHandleJob) finished job remoteDispatchDomainDetachDeviceFlags with ret=0

The virDomainDetachDeviceFlags call removing virtio-disk5 finished.
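The 5-second wait exists because device unplug is asynchronous. As an alternative to blind retries, a client can subscribe to libvirt's device-removed event and wait for the alias to disappear. This is a sketch only, assuming libvirt-python and its default event loop; it is not what Nova does, and the domain name, disk XML, and alias are placeholders.

```python
# Sketch: wait for libvirt's device-removed event instead of retrying.
import libvirt

DISK_XML = "<disk type='block' device='disk'><target bus='virtio' dev='vdf'/></disk>"

libvirt.virEventRegisterDefaultImpl()          # register the default event loop
conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-0000002c')   # placeholder domain name

removed = set()

def on_device_removed(conn, dom, dev_alias, opaque):
    # dev_alias is the device alias reported by QEMU, e.g. 'virtio-disk5'
    removed.add(dev_alias)

conn.domainEventRegisterAny(dom, libvirt.VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED,
                            on_device_removed, None)

dom.detachDeviceFlags(DISK_XML, libvirt.VIR_DOMAIN_AFFECT_LIVE)
while 'virtio-disk5' not in removed:
    # production code would run the event loop in a dedicated thread
    libvirt.virEventRunDefaultImpl()
```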
    2018-06-25 14:42:20.075+0000: 800876: debug : qemuDomainObjBeginJobInternal:4666 : Started job: modify (async=none vm=0x7f0854bca070 name=instance-0000002c)

The virDomainDetachDeviceFlags call removing virtio-disk2 got woken up and acquired the job.

    2018-06-25 14:42:20.090+0000: 800877: debug : virThreadJobSet:96 : Thread 800877 (virNetServerHandleJob) is now running job remoteDispatchDomainDetachDeviceFlags
    2018-06-25 14:42:20.090+0000: 800877: debug : virDomainDetachDeviceFlags:8270 : dom=0x7f0840bcddd0, (VM: name=instance-0000002c, uuid=c148ab90-d61f-41c4-8c50-015e18edb819), xml=<disk type="block" device="disk"> <driver name="qemu" type="raw" cache="none" io="native"/> <source dev="/dev/dm-0"/> <target bus="virtio" dev="vdf"/> <serial>f1d3409d-c1d2-43d3-8d73-c7694abecb5b</serial> <address type="pci" domain="0x0000" bus="0x00" slot="0x08" function="0x0"/> </disk> , flags=0x1

15 ms after the first call to virDomainDetachDeviceFlags removing virtio-disk5 finished, Nova is trying again...

    2018-06-25 14:42:20.090+0000: 800877: debug : qemuDomainObjBeginJobInternal:4625 : Starting job: modify (vm=0x7f0854bca070 name=instance-0000002c, current job=modify async=none)
    2018-06-25 14:42:20.090+0000: 800877: debug : qemuDomainObjBeginJobInternal:4648 : Waiting for job (vm=0x7f0854bca070 name=instance-0000002c)

An attempt to detach another disk (virtio-disk2) is still running at this point, which means we need to wait for it to finish before we can proceed.

    2018-06-25 14:42:23.901+0000: 800814: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0850bd7b60 event={"timestamp": {"seconds": 1529937743, "microseconds": 900746}, "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/virtio-disk5/virtio-backend"}}
    2018-06-25 14:42:23.907+0000: 800814: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0850bd7b60 event={"timestamp": {"seconds": 1529937743, "microseconds": 906895}, "event": "DEVICE_DELETED", "data": {"device": "virtio-disk5", "path": "/machine/peripheral/virtio-disk5"}}
    2018-06-25 14:42:23.907+0000: 803610: debug : qemuProcessEventHandler:4818 : vm=0x7f0854bca070, event=2
    2018-06-25 14:42:23.907+0000: 803610: debug : processDeviceDeletedEvent:4321 : Removing device virtio-disk5 from domain 0x7f0854bca070 instance-0000002c
    2018-06-25 14:42:23.907+0000: 803610: debug : qemuDomainObjBeginJobInternal:4625 : Starting job: modify (vm=0x7f0854bca070 name=instance-0000002c, current job=modify async=none)
    2018-06-25 14:42:23.907+0000: 803610: debug : qemuDomainObjBeginJobInternal:4648 : Waiting for job (vm=0x7f0854bca070 name=instance-0000002c)

The virtio-disk5 disk is removed by QEMU, but another API is still running on instance-0000002c, which means even processing of the DEVICE_DELETED event has to wait.

    2018-06-25 14:42:24.872+0000: 800814: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0850bd7b60 event={"timestamp": {"seconds": 1529937744, "microseconds": 871753}, "event": "DEVICE_DELETED", "data": {"device": "virtio-disk2", "path": "/machine/peripheral/virtio-disk2"}}
    2018-06-25 14:42:24.872+0000: 800814: debug : qemuProcessHandleDeviceDeleted:1382 : Device virtio-disk2 removed from domain 0x7f0854bca070 instance-0000002c
    2018-06-25 14:42:24.872+0000: 800814: debug : qemuDomainSignalDeviceRemoval:4493 : Removal of device 'virtio-disk2' continues in waiting thread

The other disk is removed by QEMU while its virDomainDetachDeviceFlags call was still running, and the removal is handled within the API.
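Nova's retry is driven by the device still showing up in the domain XML. A hypothetical helper illustrating that check follows, assuming libvirt-python; as the log above shows, the disk can still appear in the live XML for a short window after QEMU has already deleted it, which is exactly the race being hit.

```python
# Hypothetical "is the disk still in the live XML?" check a retry loop
# might rely on; the XML can lag behind QEMU until DEVICE_DELETED is processed.
import xml.etree.ElementTree as ET

def disk_present(dom, target_dev):
    root = ET.fromstring(dom.XMLDesc(0))          # live domain definition
    for target in root.findall("./devices/disk/target"):
        if target.get("dev") == target_dev:
            return True
    return False
```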
    2018-06-25 14:42:24.872+0000: 800876: debug : qemuDomainRemoveDiskDevice:3641 : Removing disk virtio-disk2 from domain 0x7f0854bca070 instance-0000002c
    2018-06-25 14:42:24.886+0000: 800876: debug : qemuDomainObjEndJob:4826 : Stopping job: modify (async=none vm=0x7f0854bca070 name=instance-0000002c)
    2018-06-25 14:42:24.886+0000: 800876: debug : virThreadJobClear:121 : Thread 800876 (virNetServerHandleJob) finished job remoteDispatchDomainDetachDeviceFlags with ret=0

The removal of virtio-disk2 finished and one of the threads waiting for the job will be woken up.

    2018-06-25 14:42:24.886+0000: 800877: debug : qemuDomainObjBeginJobInternal:4666 : Started job: modify (async=none vm=0x7f0854bca070 name=instance-0000002c)

The second attempt to remove virtio-disk5 was the lucky thread, which means the DEVICE_DELETED event from QEMU signalling that virtio-disk5 has been successfully removed is still waiting to be processed.

    2018-06-25 14:42:24.889+0000: 800877: debug : qemuDomainObjEnterMonitorInternal:4895 : Entering monitor (mon=0x7f0850bd7b60 vm=0x7f0854bca070 name=instance-0000002c)
    2018-06-25 14:42:24.889+0000: 800877: debug : qemuMonitorJSONCommandWithFd:298 : Send command '{"execute":"device_del","arguments":{"id":"virtio-disk5"},"id":"libvirt-94"}' for write with FD -1
    2018-06-25 14:42:24.889+0000: 800877: info : qemuMonitorSend:1076 : QEMU_MONITOR_SEND_MSG: mon=0x7f0850bd7b60 msg={"execute":"device_del","arguments":{"id":"virtio-disk5"},"id":"libvirt-94"} fd=-1
    2018-06-25 14:42:24.890+0000: 800814: info : qemuMonitorIOWrite:552 : QEMU_MONITOR_IO_WRITE: mon=0x7f0850bd7b60 buf={"execute":"device_del","arguments":{"id":"virtio-disk5"},"id":"libvirt-94"} len=78 ret=78 errno=0
    2018-06-25 14:42:24.890+0000: 800814: info : qemuMonitorJSONIOProcessLine:213 : QEMU_MONITOR_RECV_REPLY: mon=0x7f0850bd7b60 reply={"id": "libvirt-94", "error": {"class": "DeviceNotFound", "desc": "Device 'virtio-disk5' not found"}}
    2018-06-25 14:42:24.891+0000: 800877: error : qemuMonitorJSONCheckError:389 : internal error: unable to execute QEMU command 'device_del': Device 'virtio-disk5' not found
    2018-06-25 14:42:24.891+0000: 800877: debug : qemuDomainObjExitMonitorInternal:4918 : Exited monitor (mon=0x7f0850bd7b60 vm=0x7f0854bca070 name=instance-0000002c)
    2018-06-25 14:42:24.891+0000: 800877: debug : qemuDomainObjEndJob:4826 : Stopping job: modify (async=none vm=0x7f0854bca070 name=instance-0000002c)
    2018-06-25 14:42:24.891+0000: 800877: debug : virThreadJobClear:121 : Thread 800877 (virNetServerHandleJob) finished job remoteDispatchDomainDetachDeviceFlags with ret=-1

Since the DEVICE_DELETED event was not processed yet, libvirt thought the device was still present and tried to detach it. This resulted in an error from QEMU.
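Until the error reporting is unified (see the conclusion below), a caller that wants to treat this race as "already detached" has to recognize both failure shapes. A hedged, caller-side sketch assuming libvirt-python, not Nova's actual handling:

```python
# Sketch: classify a failed detachDeviceFlags() as "device already gone".
import libvirt

def device_already_gone(exc):
    """Return True if a failed detach means the device no longer exists."""
    code = exc.get_error_code()
    if code == libvirt.VIR_ERR_OPERATION_FAILED:
        return True   # libvirt itself no longer finds the device
    if (code == libvirt.VIR_ERR_INTERNAL_ERROR and
            'not found' in (exc.get_error_message() or '')):
        return True   # QEMU replied with DeviceNotFound, as in the log above
    return False
```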
    2018-06-25 14:42:24.891+0000: 803610: debug : qemuDomainObjBeginJobInternal:4666 : Started job: modify (async=none vm=0x7f0854bca070 name=instance-0000002c)
    2018-06-25 14:42:24.891+0000: 803610: debug : qemuDomainRemoveDiskDevice:3641 : Removing disk virtio-disk5 from domain 0x7f0854bca070 instance-0000002c
    2018-06-25 14:42:24.899+0000: 803610: debug : qemuDomainObjEndJob:4826 : Stopping job: modify (async=none vm=0x7f0854bca070 name=instance-0000002c)

The thread which is supposed to process the DEVICE_DELETED event finally acquires the job and removes the device from the domain definition.

So there's nothing really wrong here from libvirt's point of view. It's just a race condition between the actual removal of the device and Nova retrying the removal. It's not normally required or useful to call virDomainDetachDeviceFlags several times for the same device, but it can help when the guest OS is not fully booted yet and is thus completely ignoring detach requests, rather than ignoring them just because the device is still in use.

I'll see if libvirt can be changed so that it reports the same error when a device is not found, no matter where (in libvirt or QEMU) the check was done. This should solve the race since Nova knows how to handle the case when libvirt reports the device cannot be found.

The XFS errors in the guest OS are a separate issue, though. It looks like the guest OS happily allowed QEMU to remove the disk even though it contains a mounted filesystem. I'd suggest filing a separate bugzilla for the qemu-kvm-rhev component so that they can check what's going on.

Jiri, this bug seems to be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1623389. They are both about a device detach race condition. Could you confirm it?

ping

(In reply to Han Han from comment #22)
> Jiri, this bug seems to be a duplicate of
> https://bugzilla.redhat.com/show_bug.cgi?id=1623389. They are both about
> a device detach race condition. Could you confirm it?

Yeah, I agree.

*** This bug has been marked as a duplicate of bug 1623389 ***

Jiri, I think it is better not to close it. Because BZ1623389 is for the RHEL 8 release, we need a bug to track the issue on RHEL 7. BTW, please confirm whether a zstream fix is needed.