Bug 1718707
Summary: Restoring a VM failed with the error "Cannot open log file: '/var/log/libvirt/qemu/77.log': Device or resource busy"

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Advanced Virtualization | Reporter: | jiyan <jiyan> |
| Component: | libvirt | Assignee: | Michal Privoznik <mprivozn> |
| Status: | CLOSED ERRATA | QA Contact: | Yanqiu Zhang <yanqzhan> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | low | | |
| Version: | 8.0 | CC: | chhu, dyuan, fjin, jdenemar, jiyan, jsuchane, lmen, xuzhang, yalzhang |
| Target Milestone: | rc | Keywords: | Upstream |
| Target Release: | 8.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | libvirt-6.0.0-2.el8 | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-05 09:46:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Doc Text:

Cause:
When restoring a domain, libvirt spawns the qemu process, lets it read the incoming migration stream, and only after that resumes its vCPUs. If the domain is configured with a disk that is also in use by another running domain, resuming the vCPUs fails (rightfully so), because the disk cannot be locked for writing.

Consequence:
libvirt reacted inadequately to this error. It removed the restored domain from its internal list of domains (so, for instance, virsh list would not show it anymore), but instead of killing the qemu process it left it behind. Because configuring qemu to send its error and warning messages to virtlogd happens earlier than resuming the vCPUs, the leftover qemu still held a client connection to the domain's log, so a second attempt to restore the domain failed with a different error ("Cannot open log file: ... Device or resource busy").

Fix:
Make sure the qemu process is killed on any failed resume.

Result:
A restore that cannot acquire the disk's write lock still fails, but with the lock error only; no qemu process is left behind, and repeated restore attempts report the same error instead of failing on the busy log file.
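To make the cleanup pattern behind the fix concrete, here is a minimal, self-contained C sketch. It is not libvirt code: the always-failing resume_vcpus() helper and the use of sleep as a stand-in for qemu are illustrative assumptions. The point is the error path: spawn the process, attempt to resume, and on any failure terminate and reap the process instead of merely dropping the bookkeeping.

```c
/* Illustration of "kill the spawned process on any failed resume".
 * NOT the actual libvirt patch: "sleep 300" stands in for qemu, and
 * resume_vcpus() is a hypothetical step that always fails, as the real
 * resume does when another domain already holds the disk's write lock. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical resume step: simulate the 'cont' command failing. */
static int resume_vcpus(pid_t qemu_pid)
{
    (void)qemu_pid;
    fprintf(stderr, "error: unable to execute QEMU command 'cont': "
                    "Failed to get \"write\" lock\n");
    return -1;
}

int main(void)
{
    pid_t child = fork();
    if (child < 0) {
        perror("fork");
        return EXIT_FAILURE;
    }
    if (child == 0) {
        /* Stand-in for the spawned qemu process. */
        execlp("sleep", "sleep", "300", (char *)NULL);
        _exit(127);
    }

    printf("spawned stand-in qemu, pid %ld\n", (long)child);

    if (resume_vcpus(child) < 0) {
        /* The fix: do not leave the process (and its open log file
         * descriptor) behind; terminate and reap it before giving up. */
        kill(child, SIGTERM);
        waitpid(child, NULL, 0);
        fprintf(stderr, "restore failed; stand-in qemu killed and reaped\n");
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}
```

The upstream fix ("qemu: Stop domain on failed restore") applies the same principle inside libvirt's restore path: whichever step fails after qemu has been started, the process is stopped before the domain is dropped from libvirt's list.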
Description
jiyan, 2019-06-10 02:07:20 UTC
Just for the record, an error is expected here because the disk on the NFS mount is still in use, so the restore MUST fail. However, it must fail with a different error, and libvirt certainly must not leave a qemu process behind. That is why you see the "Device or resource busy" error: there is already a qemu process running that libvirt does not know about (well, forgot about). During the first restore attempt qemu was started and then left behind with the log file still open, so the second attempt fails with EBUSY.

Anyway, patches posted upstream:
https://www.redhat.com/archives/libvir-list/2020-January/msg00584.html

I've merged the patches upstream:

4c581527d4 qemu: Stop domain on failed restore
3203ad6cfd qemu: Use g_autoptr() for qemuDomainSaveCookie
82e127e343 qemuDomainSaveImageStartVM: Use g_autoptr() for virCommand
1c16f261d0 qemuDomainSaveImageStartVM: Use VIR_AUTOCLOSE for @intermediatefd

Verified this bug on:
libvirt-daemon-6.0.0-4.module+el8.2.0+5642+838f3513.x86_64
qemu-kvm-4.2.0-8.module+el8.2.0+5607+dc756904.x86_64

Steps:

S1. Mount the NFS share on two hosts.

1. Save a running guest with the NFS-backed image on [hosta]:
# virsh start vm2-yqz
Domain vm2-yqz started
# virsh save vm2-yqz vm2-yqz.save
Domain vm2-yqz saved to vm2-yqz.save

2. scp the saved file to [hostb]:
# scp lenovo-***:/root/vm2-yqz.save .

3. Start the guest on [hosta] again:
# virsh start vm2-yqz
Domain vm2-yqz started

4. Try to restore the guest from the saved file on [hostb]:
# virsh restore vm2-yqz.save
error: Failed to restore domain from vm2-yqz.save
error: internal error: unable to execute QEMU command 'cont': Failed to get "write" lock
# ps aux|grep qemu|grep vm2-yqz
(no output; no qemu process is running after the failed restore)
# virsh list --all|grep vm2-yqz
(no output)

Trying four more times gives the same result:
# virsh restore vm2-yqz.save
error: Failed to restore domain from vm2-yqz.save
error: internal error: unable to execute QEMU command 'cont': Failed to get "write" lock
# ps aux|grep qemu|grep vm2-yqz
# virsh list --all|grep vm2-yqz

Check the qemu log:
2020-02-06 08:15:26.386+0000: starting up libvirt version: 6.0.0...
...
2020-02-06T08:15:27.346353Z qemu-kvm: Failed to get "write" lock
Is another process using the image [/s3-qe-team/yanqzhan/RHEL-8.2.0-20191219.0-x86_64.qcow2]?
2020-02-06 08:15:27.358+0000: shutting down, reason=failed
2020-02-06T08:15:27.359107Z qemu-kvm: terminating on signal 15 from pid 28129 (<unknown process>)
# ps aux|grep 28129
root 28129 0.0 0.7 2279660 59084 ? Ssl 01:44 0:01 /usr/sbin/libvirtd --timeout 120

S2. On the same host, start another guest using the same image, then try to restore the first guest:

1. # virsh start avocado-vt-vm1
Domain avocado-vt-vm1 started

2. # virsh restore vm2-yqz.save
error: Failed to restore domain from vm2-yqz.save
error: internal error: unable to execute QEMU command 'cont': Failed to get "write" lock
# ps aux|grep qemu|grep vm2-yqz
(no output)
# virsh list --all|grep vm2-yqz
 -     vm2-yqz     shut off

Trying four more times gives the same result.

S3. Additional try for the managedsave + start situation:
# virsh managedsave avocado-vt-vm1
Domain avocado-vt-vm1 state saved by libvirt
# virsh start vm2-yqz
Domain vm2-yqz started
# virsh start avocado-vt-vm1
error: Failed to start domain avocado-vt-vm1
error: internal error: unable to execute QEMU command 'cont': Failed to get "write" lock
# ps aux|grep qemu|grep avocado-vt-vm1
# virsh list --all|grep avocado-vt-vm1
 -     avocado-vt-vm1     shut off

No qemu process is left behind (it is killed by libvirtd) after the failed restore.
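As a complement to the virsh-based steps above, the same check can be driven through libvirt's public C API. The sketch below is a hedged example, not part of the bug's formal verification: the connection URI, save-file path, and domain name are assumptions taken from the steps above; adjust them to your setup. Build with: gcc restore_check.c -o restore_check -lvirt

```c
/* restore_check.c -- hedged sketch: attempt the failed restore via the
 * public libvirt API and confirm no stale domain is left behind.
 * "qemu:///system", the save-file path, and the domain name are
 * assumptions matching the steps above. */
#include <stdio.h>
#include <libvirt/libvirt.h>
#include <libvirt/virterror.h>

int main(void)
{
    const char *savefile = "/root/vm2-yqz.save";   /* assumed path */
    const char *domname  = "vm2-yqz";              /* domain from the steps */

    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn) {
        fprintf(stderr, "failed to connect to libvirtd\n");
        return 1;
    }

    /* Expected to fail while another running domain holds the disk's
     * write lock; the interesting part is what state is left behind. */
    if (virDomainRestore(conn, savefile) < 0) {
        virErrorPtr err = virGetLastError();
        fprintf(stderr, "restore failed as expected: %s\n",
                err ? err->message : "unknown error");
    } else {
        printf("restore unexpectedly succeeded\n");
    }

    /* With the fix, the failed restore must not leave the domain behind:
     * it should either be absent or reported as inactive (shut off). */
    virDomainPtr dom = virDomainLookupByName(conn, domname);
    if (!dom) {
        printf("OK: domain %s is not present after the failed restore\n",
               domname);
    } else {
        int active = virDomainIsActive(dom);
        printf("domain %s is %s\n", domname,
               active == 1 ? "still active (unexpected)" : "inactive");
        virDomainFree(dom);
    }

    virConnectClose(conn);
    return 0;
}
```

virsh restore maps to the same virDomainRestore() call used here, so the expected outcome matches the transcript above: the restore fails with the write-lock error, and the domain is either absent or left cleanly shut off with no leftover qemu process.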
Since the result is as expected, marking this bug as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2017