Created attachment 1470788 [details]
libvirtd log

Description of problem:
Guest disk image ownership is changed to root:root after a second round of migration with the src qemu killed at the Finish phase

Version-Release number of selected component (if applicable):
libvirt-4.5.0-4.el7.x86_64
qemu-kvm-rhev-2.12.0-8.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a guest on the source host with the guest disk image located on NFS
2. Attach gdb to libvirtd on the destination host and set a breakpoint on qemuMigrationDstFinish
3. Migrate the guest to the target with or without --p2p
# virsh migrate rhel7-min qemu+ssh://$target/system --live --verbose
4. Wait until gdb hits the breakpoint
5. Kill the QEMU process on the source host
6. Run the "continue" command in gdb
7. After migration succeeds, migrate the guest back to the source host with or without --p2p:
# virsh migrate rhel7-min qemu+ssh://10.66.5.190/system --live --verbose --migrateuri tcp://10.66.5.190
8. Redo steps 3-6
9. After migration succeeds, check the guest image ownership:
# ll /nfs/RHEL-7.5-x86_64-latest.qcow2
-rw-r--r--. 1 root root 1403650048 Jul 26 09:46 RHEL-7.5-x86_64-latest.qcow2

Actual results:
As in step 9, the guest image ownership is changed to root:root.

Expected results:
The guest image ownership should be restored during migration.

Additional info:
In libvirtd.log, I see migrated=0, which is not correct:

2018-07-26 03:49:41.020+0000: 16709: debug : virSecurityDACRestoreAllLabel:1560 : Restoring security label on rhel7-min migrated=0
2018-07-26 03:49:41.020+0000: 16709: info : virSecurityDACRestoreFileLabelInternal:665 : Restoring DAC user and group on '/nfs/RHEL-7.5-x86_64-latest.qcow2'
2018-07-26 03:49:41.020+0000: 16709: info : virSecurityDACSetOwnershipInternal:567 : Setting DAC user and group on '/nfs/RHEL-7.5-x86_64-latest.qcow2' to '0:0'
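For context on why migrated=0 matters above: the DAC security driver only skips restoring ownership of images on shared storage when the caller passes migrated=1. A paraphrased sketch of the relevant check (based on virSecurityDACRestoreImageLabelInt() in src/security/security_dac.c; the exact code differs between releases):

/* Paraphrased from libvirt's DAC driver; not the verbatim code. With
 * migrated == true, an image on a shared FS such as NFS is skipped,
 * because the destination host has taken over its labels. With
 * migrated == false, as in the log above, libvirt falls through and
 * chowns the image back to root:root (0:0). */
if (migrated) {
    int rc = virFileIsSharedFS(src->path);
    if (rc < 0)
        return -1;
    if (rc == 1) {
        VIR_DEBUG("Skipping image label restore on %s because FS is shared",
                  src->path);
        return 0;
    }
}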
I'm not quite sure why this would happen only after the second migration, but the problem is that the monitor EOF handler, which is called when you kill the QEMU process on the source, does not take the migration into account and just resets everything. This is usually fine because a killed domain results in a failed migration most of the time, but it doesn't work in this corner case where the QEMU process gets killed just after the migration actually finished (i.e., at the point libvirtd itself would kill the process). We could perhaps check the current phase of the migration in the EOF handler so that it can pass migrated=1 when appropriate.
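A minimal sketch of that idea, relying on the qemu driver's existing job bookkeeping (the phase and flag identifiers below are taken from libvirt's sources, but this is an illustration, not an actual patch):

/* Hypothetical check in the monitor EOF handler (e.g. in
 * processMonitorEOFEvent() in src/qemu/qemu_driver.c). If the source
 * QEMU died after an outgoing migration already completed the Perform
 * phase, pass VIR_QEMU_PROCESS_STOP_MIGRATED so that qemuProcessStop()
 * restores security labels with migrated=1 and leaves the disk images,
 * now owned by the destination, untouched. */
qemuDomainObjPrivatePtr priv = vm->privateData;
unsigned int stopFlags = 0;

if (priv->job.asyncJob == QEMU_ASYNC_JOB_MIGRATION_OUT &&
    priv->job.phase == QEMU_MIGRATION_PHASE_PERFORM3_DONE)
    stopFlags |= VIR_QEMU_PROCESS_STOP_MIGRATED;

qemuProcessStop(driver, vm, VIR_DOMAIN_SHUTOFF_DESTROYED,
                QEMU_ASYNC_JOB_NONE, stopFlags);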
It might be caused by the filesystem being different on each host, i.e. a local ext4 FS on one host, then exported as NFS to the second host.
(In reply to Daniel Berrange from comment #3)
> It might be caused by the filesystem being different on each host, i.e. a
> local ext4 FS on one host, then exported as NFS to the second host.

Hi Daniel,

I checked my test env; both hosts are configured correctly:

Target:
# mount | grep nfs
10.66.4.124:/nfs on /nfs type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=10,retrans=2,sec=sys,clientaddr=10.73.131.69,local_lock=none,addr=10.66.4.124)

Source:
10.66.4.124:/nfs on /nfs type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,port=0,timeo=15,retrans=1,sec=sys,clientaddr=10.66.5.190,local_lock=none,addr=10.66.4.124)

If it were "a local ext4 FS on one host, then exported as NFS to the second host", the guest image ownership would be changed to root:root after the FIRST migration.
Test with libvirt-4.5.0-8.virtcov.el7.x86_64

Now only one round of migration is needed; the migration FAILS after running "continue" in gdb.

1. Start a guest on the source host with the guest disk image located on NFS
# ll /mnt/nfs/lizhu/images/rhel7.6-GUI.img
-rw-------. 1 qemu qemu 10739318784 Sep 4 05:31 /mnt/nfs/lizhu/images/rhel7.6-GUI.img
2. Attach gdb to libvirtd on the destination host and set a breakpoint on qemuMigrationDstFinish
3. Migrate the guest to the target
# virsh migrate avocado-vt-vm1 qemu+ssh://10.73.73.112/system --verbose --live
4. Wait until gdb hits the breakpoint
5. Kill the QEMU process on the source host
6. Run the "continue" command in gdb
7. Check the migration result
# virsh migrate avocado-vt-vm1 qemu+ssh://10.73.73.112/system --verbose --live
Migration: [100 %]2018-09-04 09:32:46.794+0000: 14588: info : libvirt version: 4.5.0, package: 8.virtcov.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2018-09-03-10:48:21, x86-034.build.eng.bos.redhat.com)
2018-09-04 09:32:46.794+0000: 14588: info : hostname: ***
2018-09-04 09:32:46.794+0000: 14588: warning : virDomainMigrateVersion3Full:3249 : Guest avocado-vt-vm1 probably left in 'paused' state on source
error: internal error: unable to execute QEMU command 'cont': Could not reopen qcow2 layer: Could not read qcow2 header: Permission denied
8. Check the guest image ownership
# ll /mnt/nfs/lizhu/images/rhel7.6-GUI.img
-rw-------. 1 root root 10739318784 Sep 4 05:32 /mnt/nfs/lizhu/images/rhel7.6-GUI.img
This matches what I described in comment #2. We need to enhance the monitor EOF handler a bit.
This bug is going to be addressed in the next major release.
Can you please try to reproduce it with the current rhel-av build? Thanks.
Test with libvirt-6.0.0-6.virtcov.el8.x86_64 and qemu-kvm-4.2.0-11.module

Same steps as comment 5; it can still be reproduced:

# virsh -k0 migrate rhev qemu+ssh://xxxxx/system --live --verbose
Migration: [100 %]error: internal error: unable to execute QEMU command 'cont': Could not reopen file: Permission denied
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.