Bug 1718707
Summary: Restoring a VM failed with the error "Cannot open log file: '/var/log/libvirt/qemu/77.log': Device or resource busy"

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Advanced Virtualization | Reporter: | jiyan <jiyan> |
| Component: | libvirt | Assignee: | Michal Privoznik <mprivozn> |
| Status: | CLOSED ERRATA | QA Contact: | Yanqiu Zhang <yanqzhan> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | low | | |
| Version: | 8.0 | CC: | chhu, dyuan, fjin, jdenemar, jiyan, jsuchane, lmen, xuzhang, yalzhang |
| Target Milestone: | rc | Keywords: | Upstream |
| Target Release: | 8.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | libvirt-6.0.0-2.el8 | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-05 09:46:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Doc Text:

Cause:
When restoring a domain, libvirt spawns the qemu process, lets it read the incoming migration stream, and only after that resumes its vCPUs. If the domain is configured with a disk that is also in use by another running domain, resuming the vCPUs fails (rightfully so), because the disk cannot be locked for writing.

Consequence:
libvirt reacted inadequately to this error. It removed the restored domain from its internal list of domains (so, for instance, virsh list would not show it anymore), but instead of killing the qemu process it left it behind. Because configuring qemu to send its error and warning messages to virtlogd happens earlier than resuming the vCPUs, the leftover qemu still held a client connection to the domain's log, so a second attempt to restore the domain failed with a different error ("Cannot open log file: ... Device or resource busy").

Fix:
Make sure the qemu process is killed on any failed resume.

Result:
A restore that cannot acquire the disk's write lock still fails, but with the lock error only; no qemu process is left behind, and repeated restore attempts report the same error instead of failing on the busy log file.
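To make the cleanup pattern behind the fix concrete, here is a minimal, self-contained C sketch. It is not libvirt code: the always-failing resume_vcpus() helper and the use of sleep as a stand-in for qemu are illustrative assumptions. The point is the error path: spawn the process, attempt to resume, and on any failure terminate and reap the process instead of merely dropping the bookkeeping.

```c
/* Illustration of "kill the spawned process on any failed resume".
 * NOT the actual libvirt patch: "sleep 300" stands in for qemu, and
 * resume_vcpus() is a hypothetical step that always fails, as the real
 * resume does when another domain already holds the disk's write lock. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical resume step: simulate the 'cont' command failing. */
static int resume_vcpus(pid_t qemu_pid)
{
    (void)qemu_pid;
    fprintf(stderr, "error: unable to execute QEMU command 'cont': "
                    "Failed to get \"write\" lock\n");
    return -1;
}

int main(void)
{
    pid_t child = fork();
    if (child < 0) {
        perror("fork");
        return EXIT_FAILURE;
    }
    if (child == 0) {
        /* Stand-in for the spawned qemu process. */
        execlp("sleep", "sleep", "300", (char *)NULL);
        _exit(127);
    }

    printf("spawned stand-in qemu, pid %ld\n", (long)child);

    if (resume_vcpus(child) < 0) {
        /* The fix: do not leave the process (and its open log file
         * descriptor) behind; terminate and reap it before giving up. */
        kill(child, SIGTERM);
        waitpid(child, NULL, 0);
        fprintf(stderr, "restore failed; stand-in qemu killed and reaped\n");
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}
```

The upstream fix ("qemu: Stop domain on failed restore") applies the same principle inside libvirt's restore path: whichever step fails after qemu has been started, the process is stopped before the domain is dropped from libvirt's list.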
Description
jiyan, 2019-06-10 02:07:20 UTC
Just for the record, an error is expected here because the disk on the NFS mount is still in use, so the restore MUST fail. However, it must fail with a different error, and libvirt certainly must not leave a qemu process behind. That is why you see the "Device or resource busy" error: there is already a qemu process running that libvirt does not know about (well, forgot about). During the first restore attempt qemu was started and then left behind with the log file still open, so the second attempt fails with EBUSY.

Anyway, patches posted upstream:
https://www.redhat.com/archives/libvir-list/2020-January/msg00584.html

I've merged the patches upstream:

4c581527d4 qemu: Stop domain on failed restore
3203ad6cfd qemu: Use g_autoptr() for qemuDomainSaveCookie
82e127e343 qemuDomainSaveImageStartVM: Use g_autoptr() for virCommand
1c16f261d0 qemuDomainSaveImageStartVM: Use VIR_AUTOCLOSE for @intermediatefd

Verified this bug on:
libvirt-daemon-6.0.0-4.module+el8.2.0+5642+838f3513.x86_64
qemu-kvm-4.2.0-8.module+el8.2.0+5607+dc756904.x86_64

Steps:

S1. Mount the NFS share on two hosts.

1. Save a running guest with the NFS-backed image on [hosta]:
# virsh start vm2-yqz
Domain vm2-yqz started
# virsh save vm2-yqz vm2-yqz.save
Domain vm2-yqz saved to vm2-yqz.save

2. scp the saved file to [hostb]:
# scp lenovo-***:/root/vm2-yqz.save .

3. Start the guest on [hosta] again:
# virsh start vm2-yqz
Domain vm2-yqz started

4. Try to restore the guest from the saved file on [hostb]:
# virsh restore vm2-yqz.save
error: Failed to restore domain from vm2-yqz.save
error: internal error: unable to execute QEMU command 'cont': Failed to get "write" lock
# ps aux|grep qemu|grep vm2-yqz
(no output; no qemu process is running after the failed restore)
# virsh list --all|grep vm2-yqz
(no output)

Trying four more times gives the same result:
# virsh restore vm2-yqz.save
error: Failed to restore domain from vm2-yqz.save
error: internal error: unable to execute QEMU command 'cont': Failed to get "write" lock
# ps aux|grep qemu|grep vm2-yqz
# virsh list --all|grep vm2-yqz

Check the qemu log:
2020-02-06 08:15:26.386+0000: starting up libvirt version: 6.0.0...
...
2020-02-06T08:15:27.346353Z qemu-kvm: Failed to get "write" lock
Is another process using the image [/s3-qe-team/yanqzhan/RHEL-8.2.0-20191219.0-x86_64.qcow2]?
2020-02-06 08:15:27.358+0000: shutting down, reason=failed
2020-02-06T08:15:27.359107Z qemu-kvm: terminating on signal 15 from pid 28129 (<unknown process>)
# ps aux|grep 28129
root 28129 0.0 0.7 2279660 59084 ? Ssl 01:44 0:01 /usr/sbin/libvirtd --timeout 120

S2. On the same host, start another guest using the same image, then try to restore the first guest:

1. # virsh start avocado-vt-vm1
Domain avocado-vt-vm1 started

2. # virsh restore vm2-yqz.save
error: Failed to restore domain from vm2-yqz.save
error: internal error: unable to execute QEMU command 'cont': Failed to get "write" lock
# ps aux|grep qemu|grep vm2-yqz
(no output)
# virsh list --all|grep vm2-yqz
 -     vm2-yqz     shut off

Trying four more times gives the same result.

S3. Additional try for the managedsave + start situation:
# virsh managedsave avocado-vt-vm1
Domain avocado-vt-vm1 state saved by libvirt
# virsh start vm2-yqz
Domain vm2-yqz started
# virsh start avocado-vt-vm1
error: Failed to start domain avocado-vt-vm1
error: internal error: unable to execute QEMU command 'cont': Failed to get "write" lock
# ps aux|grep qemu|grep avocado-vt-vm1
# virsh list --all|grep avocado-vt-vm1
 -     avocado-vt-vm1     shut off

No qemu process is left behind (it is killed by libvirtd) after the failed restore.
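As a complement to the virsh-based steps above, the same check can be driven through libvirt's public C API. The sketch below is a hedged example, not part of the bug's formal verification: the connection URI, save-file path, and domain name are assumptions taken from the steps above; adjust them to your setup. Build with: gcc restore_check.c -o restore_check -lvirt

```c
/* restore_check.c -- hedged sketch: attempt the failed restore via the
 * public libvirt API and confirm no stale domain is left behind.
 * "qemu:///system", the save-file path, and the domain name are
 * assumptions matching the steps above. */
#include <stdio.h>
#include <libvirt/libvirt.h>
#include <libvirt/virterror.h>

int main(void)
{
    const char *savefile = "/root/vm2-yqz.save";   /* assumed path */
    const char *domname  = "vm2-yqz";              /* domain from the steps */

    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn) {
        fprintf(stderr, "failed to connect to libvirtd\n");
        return 1;
    }

    /* Expected to fail while another running domain holds the disk's
     * write lock; the interesting part is what state is left behind. */
    if (virDomainRestore(conn, savefile) < 0) {
        virErrorPtr err = virGetLastError();
        fprintf(stderr, "restore failed as expected: %s\n",
                err ? err->message : "unknown error");
    } else {
        printf("restore unexpectedly succeeded\n");
    }

    /* With the fix, the failed restore must not leave the domain behind:
     * it should either be absent or reported as inactive (shut off). */
    virDomainPtr dom = virDomainLookupByName(conn, domname);
    if (!dom) {
        printf("OK: domain %s is not present after the failed restore\n",
               domname);
    } else {
        int active = virDomainIsActive(dom);
        printf("domain %s is %s\n", domname,
               active == 1 ? "still active (unexpected)" : "inactive");
        virDomainFree(dom);
    }

    virConnectClose(conn);
    return 0;
}
```

virsh restore maps to the same virDomainRestore() call used here, so the expected outcome matches the transcript above: the restore fails with the write-lock error, and the domain is either absent or left cleanly shut off with no leftover qemu process.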
Since the result is as expected, marking this bug as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2017