Bug 1548053
| Summary: | Repeated call to virDomainDetachDeviceFlags fails with internal error: unable to execute QEMU command 'device_del' | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | David Vallee Delisle <dvd> |
| Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
| Status: | CLOSED DUPLICATE | QA Contact: | Han Han <hhan> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.4 | CC: | berrange, chhu, coli, dasmith, dvd, dyuan, eglynn, hhan, jdenemar, jiyan, kchamart, lmen, marjones, mbooth, mschuppe, nova-maint, rbryant, sbauza, sgordon, srevivo, vromanso, xuzhang |
| Target Milestone: | pre-dev-freeze | | |
| Target Release: | 7.4 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-04-25 09:23:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
David Vallee Delisle
2018-02-22 15:42:32 UTC
As above, it seems libvirt is returning an internal error where it formerly returned an operation-failed error.

There's no regression in the error libvirt reports; you're just hitting a different path. Looking at the referenced Nova code around https://github.com/openstack/nova/blob/master/nova/virt/libvirt/guest.py#L404 I see Nova apparently calls virDomainDetachDeviceFlags repeatedly until the device is gone. The first call to virDomainDetachDeviceFlags succeeded, which means libvirt successfully told QEMU to remove the device. However, doing so requires the guest OS to cooperate, which means it's asynchronous. In other words, the device will be removed only after the guest OS allows it to be removed, which QEMU reports via the DEVICE_DELETED event. Until then the device is still present in libvirt's domain definition, so a repeated call to virDomainDetachDeviceFlags will not report the expected VIR_ERR_OPERATION_FAILED "disk %s not found" error and will ask QEMU to remove the device again. In this case, the device is somehow already removed (perhaps partially) in QEMU and it reports an error. So either the device is not quite removed (yet) and thus QEMU didn't send us the event, or we somehow ignored it. Would you be able to provide debug logs from libvirtd (see https://wiki.libvirt.org/page/DebugLogs) starting before the first successful call to virDomainDetachDeviceFlags?

The attached log file is /var/log/libvirt/qemu/instance-00000206.log, which is not really useful in this case. I asked for debug logs from libvirtd. Please consult https://wiki.libvirt.org/page/DebugLogs for how to get them. It's the /var/log/libvirt/libvirtd.log file I'm interested in.

Hi Jiri, I'll ask for the additional log files from the customer. Thanks, Mark

As I explained in comment 4, the device is still present in the domain XML until we process the DEVICE_DELETED event. Thus a repeated virDomainDetachDeviceFlags may think the device is still present while it was already removed by QEMU, and thus the monitor command fails.
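For illustration, here is a minimal sketch of the retry pattern described above, using the libvirt-python bindings. The helper name, trimmed disk XML, and timing values are placeholders for this report, not Nova's actual code.

```python
# Minimal sketch of a "retry until the device is gone" detach loop,
# assuming libvirt-python; domain name and disk XML are placeholders.
import time
import libvirt

DISK_XML = """<disk type='block' device='disk'>
  <target bus='virtio' dev='vdf'/>
</disk>"""

def detach_with_retry(dom, xml, attempts=8, interval=2):
    flags = (libvirt.VIR_DOMAIN_AFFECT_LIVE |
             libvirt.VIR_DOMAIN_AFFECT_CONFIG)   # 0x3, as seen in the log
    for _ in range(attempts):
        try:
            # Success only means the request was sent to QEMU; the guest
            # still has to release the device asynchronously.
            dom.detachDeviceFlags(xml, flags)
        except libvirt.libvirtError as e:
            if e.get_error_code() == libvirt.VIR_ERR_OPERATION_FAILED:
                # libvirt no longer finds the device: detach completed.
                return True
            raise
        time.sleep(interval)
    return False

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-0000002c')
detach_with_retry(dom, DISK_XML)
```

As the log walkthrough below shows, such a retry can also race with the DEVICE_DELETED event and fail with VIR_ERR_INTERNAL_ERROR instead of the expected VIR_ERR_OPERATION_FAILED.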
The most recent logs finally shed some light on what's happening here:

    2018-06-25 14:42:15.059+0000: 800875: debug : virThreadJobSet:96 : Thread 800875 (virNetServerHandleJob) is now running job remoteDispatchDomainDetachDeviceFlags
    2018-06-25 14:42:15.059+0000: 800875: debug : virDomainDetachDeviceFlags:8270 : dom=0x7f0858bd9b20, (VM: name=instance-0000002c, uuid=c148ab90-d61f-41c4-8c50-015e18edb819), xml=<disk type="block" device="disk"> <driver name="qemu" type="raw" cache="none" io="native"/> <source dev="/dev/dm-0"/> <target bus="virtio" dev="vdf"/> <serial>f1d3409d-c1d2-43d3-8d73-c7694abecb5b</serial> <address type="pci" domain="0x0000" bus="0x00" slot="0x08" function="0x0"/> </disk> , flags=0x3
    2018-06-25 14:42:15.059+0000: 800875: debug : qemuDomainObjBeginJobInternal:4625 : Starting job: modify (vm=0x7f0854bca070 name=instance-0000002c, current job=none async=none)
    2018-06-25 14:42:15.059+0000: 800875: debug : qemuDomainObjBeginJobInternal:4666 : Started job: modify (async=none vm=0x7f0854bca070 name=instance-0000002c)
    2018-06-25 14:42:15.071+0000: 800875: debug : qemuDomainObjEnterMonitorInternal:4895 : Entering monitor (mon=0x7f0850bd7b60 vm=0x7f0854bca070 name=instance-0000002c)
    2018-06-25 14:42:15.071+0000: 800875: info : qemuMonitorSend:1076 : QEMU_MONITOR_SEND_MSG: mon=0x7f0850bd7b60 msg={"execute":"device_del","arguments":{"id":"virtio-disk5"},"id":"libvirt-89"} fd=-1
    2018-06-25 14:42:15.072+0000: 800814: info : qemuMonitorJSONIOProcessLine:213 : QEMU_MONITOR_RECV_REPLY: mon=0x7f0850bd7b60 reply={"return": {}, "id": "libvirt-89"}
    2018-06-25 14:42:15.072+0000: 800875: debug : qemuMonitorJSONCommandWithFd:303 : Receive command reply ret=0 rxObject=0x557d80d9d0a0
    2018-06-25 14:42:15.072+0000: 800875: debug : qemuDomainObjExitMonitorInternal:4918 : Exited monitor (mon=0x7f0850bd7b60 vm=0x7f0854bca070 name=instance-0000002c)

At this point virDomainDetachDeviceFlags is waiting for 5 seconds for the DEVICE_DELETED event so that the device is already gone once the API finishes.

    2018-06-25 14:42:16.749+0000: 800876: debug : virThreadJobSet:96 : Thread 800876 (virNetServerHandleJob) is now running job remoteDispatchDomainDetachDeviceFlags
    2018-06-25 14:42:16.749+0000: 800876: debug : virDomainDetachDeviceFlags:8270 : dom=0x7f084800b860, (VM: name=instance-0000002c, uuid=c148ab90-d61f-41c4-8c50-015e18edb819), xml=<disk type="block" device="disk"> <driver name="qemu" type="raw" cache="none" io="native"/> <source dev="/dev/dm-1"/> <target bus="virtio" dev="vdc"/> <serial>3ad91f4c-683a-4d2f-adf6-aca57c173de6</serial> <address type="pci" domain="0x0000" bus="0x00" slot="0x09" function="0x0"/> </disk> , flags=0x3
    2018-06-25 14:42:16.749+0000: 800876: debug : qemuDomainObjBeginJobInternal:4625 : Starting job: modify (vm=0x7f0854bca070 name=instance-0000002c, current job=modify async=none)
    2018-06-25 14:42:16.749+0000: 800876: debug : qemuDomainObjBeginJobInternal:4648 : Waiting for job (vm=0x7f0854bca070 name=instance-0000002c)

A virDomainDetachDeviceFlags call to remove another disk (virtio-disk2) from instance-0000002c is started and has to wait for the first one to finish.

    2018-06-25 14:42:20.075+0000: 800875: debug : qemuDomainObjEndJob:4826 : Stopping job: modify (async=none vm=0x7f0854bca070 name=instance-0000002c)
    2018-06-25 14:42:20.075+0000: 800875: debug : virThreadJobClear:121 : Thread 800875 (virNetServerHandleJob) finished job remoteDispatchDomainDetachDeviceFlags with ret=0

The virDomainDetachDeviceFlags call removing virtio-disk5 finished.
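The 5-second wait exists because device unplug is asynchronous. As an alternative to blind retries, a client can subscribe to libvirt's device-removed event and wait for the alias to disappear. This is a sketch only, assuming libvirt-python and its default event loop; it is not what Nova does, and the domain name, disk XML, and alias are placeholders.

```python
# Sketch: wait for libvirt's device-removed event instead of retrying.
import libvirt

DISK_XML = "<disk type='block' device='disk'><target bus='virtio' dev='vdf'/></disk>"

libvirt.virEventRegisterDefaultImpl()          # register the default event loop
conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-0000002c')   # placeholder domain name

removed = set()

def on_device_removed(conn, dom, dev_alias, opaque):
    # dev_alias is the device alias reported by QEMU, e.g. 'virtio-disk5'
    removed.add(dev_alias)

conn.domainEventRegisterAny(dom, libvirt.VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED,
                            on_device_removed, None)

dom.detachDeviceFlags(DISK_XML, libvirt.VIR_DOMAIN_AFFECT_LIVE)
while 'virtio-disk5' not in removed:
    # production code would run the event loop in a dedicated thread
    libvirt.virEventRunDefaultImpl()
```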
    2018-06-25 14:42:20.075+0000: 800876: debug : qemuDomainObjBeginJobInternal:4666 : Started job: modify (async=none vm=0x7f0854bca070 name=instance-0000002c)

The virDomainDetachDeviceFlags call removing virtio-disk2 got woken up and acquired the job.

    2018-06-25 14:42:20.090+0000: 800877: debug : virThreadJobSet:96 : Thread 800877 (virNetServerHandleJob) is now running job remoteDispatchDomainDetachDeviceFlags
    2018-06-25 14:42:20.090+0000: 800877: debug : virDomainDetachDeviceFlags:8270 : dom=0x7f0840bcddd0, (VM: name=instance-0000002c, uuid=c148ab90-d61f-41c4-8c50-015e18edb819), xml=<disk type="block" device="disk"> <driver name="qemu" type="raw" cache="none" io="native"/> <source dev="/dev/dm-0"/> <target bus="virtio" dev="vdf"/> <serial>f1d3409d-c1d2-43d3-8d73-c7694abecb5b</serial> <address type="pci" domain="0x0000" bus="0x00" slot="0x08" function="0x0"/> </disk> , flags=0x1

15 ms after the first call to virDomainDetachDeviceFlags removing virtio-disk5 finished, Nova is trying again...

    2018-06-25 14:42:20.090+0000: 800877: debug : qemuDomainObjBeginJobInternal:4625 : Starting job: modify (vm=0x7f0854bca070 name=instance-0000002c, current job=modify async=none)
    2018-06-25 14:42:20.090+0000: 800877: debug : qemuDomainObjBeginJobInternal:4648 : Waiting for job (vm=0x7f0854bca070 name=instance-0000002c)

An attempt to detach another disk (virtio-disk2) is still running at this point, which means we need to wait for it to finish before we can proceed.

    2018-06-25 14:42:23.901+0000: 800814: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0850bd7b60 event={"timestamp": {"seconds": 1529937743, "microseconds": 900746}, "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/virtio-disk5/virtio-backend"}}
    2018-06-25 14:42:23.907+0000: 800814: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0850bd7b60 event={"timestamp": {"seconds": 1529937743, "microseconds": 906895}, "event": "DEVICE_DELETED", "data": {"device": "virtio-disk5", "path": "/machine/peripheral/virtio-disk5"}}
    2018-06-25 14:42:23.907+0000: 803610: debug : qemuProcessEventHandler:4818 : vm=0x7f0854bca070, event=2
    2018-06-25 14:42:23.907+0000: 803610: debug : processDeviceDeletedEvent:4321 : Removing device virtio-disk5 from domain 0x7f0854bca070 instance-0000002c
    2018-06-25 14:42:23.907+0000: 803610: debug : qemuDomainObjBeginJobInternal:4625 : Starting job: modify (vm=0x7f0854bca070 name=instance-0000002c, current job=modify async=none)
    2018-06-25 14:42:23.907+0000: 803610: debug : qemuDomainObjBeginJobInternal:4648 : Waiting for job (vm=0x7f0854bca070 name=instance-0000002c)

The virtio-disk5 disk is removed by QEMU, but another API is still running on instance-0000002c, which means even processing of the DEVICE_DELETED event has to wait.

    2018-06-25 14:42:24.872+0000: 800814: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0850bd7b60 event={"timestamp": {"seconds": 1529937744, "microseconds": 871753}, "event": "DEVICE_DELETED", "data": {"device": "virtio-disk2", "path": "/machine/peripheral/virtio-disk2"}}
    2018-06-25 14:42:24.872+0000: 800814: debug : qemuProcessHandleDeviceDeleted:1382 : Device virtio-disk2 removed from domain 0x7f0854bca070 instance-0000002c
    2018-06-25 14:42:24.872+0000: 800814: debug : qemuDomainSignalDeviceRemoval:4493 : Removal of device 'virtio-disk2' continues in waiting thread

The other disk is removed by QEMU while its virDomainDetachDeviceFlags call was still running, and the removal is handled within the API.
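Nova's retry is driven by the device still showing up in the domain XML. A hypothetical helper illustrating that check follows, assuming libvirt-python; as the log above shows, the disk can still appear in the live XML for a short window after QEMU has already deleted it, which is exactly the race being hit.

```python
# Hypothetical "is the disk still in the live XML?" check a retry loop
# might rely on; the XML can lag behind QEMU until DEVICE_DELETED is processed.
import xml.etree.ElementTree as ET

def disk_present(dom, target_dev):
    root = ET.fromstring(dom.XMLDesc(0))          # live domain definition
    for target in root.findall("./devices/disk/target"):
        if target.get("dev") == target_dev:
            return True
    return False
```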
    2018-06-25 14:42:24.872+0000: 800876: debug : qemuDomainRemoveDiskDevice:3641 : Removing disk virtio-disk2 from domain 0x7f0854bca070 instance-0000002c
    2018-06-25 14:42:24.886+0000: 800876: debug : qemuDomainObjEndJob:4826 : Stopping job: modify (async=none vm=0x7f0854bca070 name=instance-0000002c)
    2018-06-25 14:42:24.886+0000: 800876: debug : virThreadJobClear:121 : Thread 800876 (virNetServerHandleJob) finished job remoteDispatchDomainDetachDeviceFlags with ret=0

The removal of virtio-disk2 finished and one of the threads waiting for the job will be woken up.

    2018-06-25 14:42:24.886+0000: 800877: debug : qemuDomainObjBeginJobInternal:4666 : Started job: modify (async=none vm=0x7f0854bca070 name=instance-0000002c)

The second attempt to remove virtio-disk5 was the lucky thread, which means the DEVICE_DELETED event from QEMU signalling that virtio-disk5 has been successfully removed is still waiting to be processed.

    2018-06-25 14:42:24.889+0000: 800877: debug : qemuDomainObjEnterMonitorInternal:4895 : Entering monitor (mon=0x7f0850bd7b60 vm=0x7f0854bca070 name=instance-0000002c)
    2018-06-25 14:42:24.889+0000: 800877: debug : qemuMonitorJSONCommandWithFd:298 : Send command '{"execute":"device_del","arguments":{"id":"virtio-disk5"},"id":"libvirt-94"}' for write with FD -1
    2018-06-25 14:42:24.889+0000: 800877: info : qemuMonitorSend:1076 : QEMU_MONITOR_SEND_MSG: mon=0x7f0850bd7b60 msg={"execute":"device_del","arguments":{"id":"virtio-disk5"},"id":"libvirt-94"} fd=-1
    2018-06-25 14:42:24.890+0000: 800814: info : qemuMonitorIOWrite:552 : QEMU_MONITOR_IO_WRITE: mon=0x7f0850bd7b60 buf={"execute":"device_del","arguments":{"id":"virtio-disk5"},"id":"libvirt-94"} len=78 ret=78 errno=0
    2018-06-25 14:42:24.890+0000: 800814: info : qemuMonitorJSONIOProcessLine:213 : QEMU_MONITOR_RECV_REPLY: mon=0x7f0850bd7b60 reply={"id": "libvirt-94", "error": {"class": "DeviceNotFound", "desc": "Device 'virtio-disk5' not found"}}
    2018-06-25 14:42:24.891+0000: 800877: error : qemuMonitorJSONCheckError:389 : internal error: unable to execute QEMU command 'device_del': Device 'virtio-disk5' not found
    2018-06-25 14:42:24.891+0000: 800877: debug : qemuDomainObjExitMonitorInternal:4918 : Exited monitor (mon=0x7f0850bd7b60 vm=0x7f0854bca070 name=instance-0000002c)
    2018-06-25 14:42:24.891+0000: 800877: debug : qemuDomainObjEndJob:4826 : Stopping job: modify (async=none vm=0x7f0854bca070 name=instance-0000002c)
    2018-06-25 14:42:24.891+0000: 800877: debug : virThreadJobClear:121 : Thread 800877 (virNetServerHandleJob) finished job remoteDispatchDomainDetachDeviceFlags with ret=-1

Since the DEVICE_DELETED event was not processed yet, libvirt thought the device was still present and tried to detach it. This resulted in an error from QEMU.
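Until the error reporting is unified (see the conclusion below), a caller that wants to treat this race as "already detached" has to recognize both failure shapes. A hedged, caller-side sketch assuming libvirt-python, not Nova's actual handling:

```python
# Sketch: classify a failed detachDeviceFlags() as "device already gone".
import libvirt

def device_already_gone(exc):
    """Return True if a failed detach means the device no longer exists."""
    code = exc.get_error_code()
    if code == libvirt.VIR_ERR_OPERATION_FAILED:
        return True   # libvirt itself no longer finds the device
    if (code == libvirt.VIR_ERR_INTERNAL_ERROR and
            'not found' in (exc.get_error_message() or '')):
        return True   # QEMU replied with DeviceNotFound, as in the log above
    return False
```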
    2018-06-25 14:42:24.891+0000: 803610: debug : qemuDomainObjBeginJobInternal:4666 : Started job: modify (async=none vm=0x7f0854bca070 name=instance-0000002c)
    2018-06-25 14:42:24.891+0000: 803610: debug : qemuDomainRemoveDiskDevice:3641 : Removing disk virtio-disk5 from domain 0x7f0854bca070 instance-0000002c
    2018-06-25 14:42:24.899+0000: 803610: debug : qemuDomainObjEndJob:4826 : Stopping job: modify (async=none vm=0x7f0854bca070 name=instance-0000002c)

The thread which is supposed to process the DEVICE_DELETED event finally acquires the job and removes the device from the domain definition.

So there's nothing really wrong here from libvirt's point of view. It's just a race condition between the actual removal of the device and Nova retrying the removal. It's not normally required or useful to call virDomainDetachDeviceFlags several times for the same device, but it can help when the guest OS is not fully booted yet and is thus completely ignoring detach requests, rather than ignoring them just because the device is still in use.

I'll see if libvirt can be changed so that it reports the same error when a device is not found, no matter where (in libvirt or QEMU) the check was done. This should solve the race since Nova knows how to handle the case when libvirt reports the device cannot be found.

The XFS errors in the guest OS are a separate issue, though. It looks like the guest OS happily allowed QEMU to remove the disk even though it contains a mounted filesystem. I'd suggest filing a separate bugzilla for the qemu-kvm-rhev component so that they can check what's going on.

Jiri, this bug seems to be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1623389. They are both about a device detach race condition. Could you confirm it?

ping

(In reply to Han Han from comment #22)
> Jiri, this bug seems to be a duplicate of
> https://bugzilla.redhat.com/show_bug.cgi?id=1623389. They are both about
> a device detach race condition. Could you confirm it?

Yeah, I agree.

*** This bug has been marked as a duplicate of bug 1623389 ***

Jiri, I think it is better not to close it. Because BZ1623389 is for the RHEL 8 release, we need a bug to track the issue on RHEL 7. BTW, please confirm whether a zstream fix is needed.