Bug 1242904
Summary: | migration: Cancelling triggers guest IO errors | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Dr. David Alan Gilbert <dgilbert> |
Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 7.2 | CC: | crobinso, dgilbert, dyuan, fjin, hhuang, huding, jdenemar, juzhang, knoel, michen, mzhan, rbalakri, stefanha, virt-maint, xfu, zpeng |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | libvirt-1.2.17-4.el7 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2015-11-19 06:48:39 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Dr. David Alan Gilbert
2015-07-14 11:28:42 UTC
Hi Shu,

Could you give a test and update it in the bz?

Best Regards,
Junyi

(In reply to juzhang from comment #3)
> Hi Shu,
>
> Could you give a test and update it in the bz?
>
> Best Regards,
> Junyi

Test with stress and stressapptest more than 10 rounds, can not hit the problem using qemu-kvm directly.

David, could you provide virt-manager version and stressapptest cmd you use?

Bests,
Shaolong

(In reply to Shaolong Hu from comment #4)
> (In reply to juzhang from comment #3)
> > Hi Shu,
> >
> > Could you give a test and update it in the bz?
> >
> > Best Regards,
> > Junyi
>
> Test with stress and stressapptest more than 10 rounds, can not hit the
> problem using qemu-kvm directly.
>
> David, could you provide virt-manager version and stressapptest cmd you use?
>
>
> Bests,
> Shaolong

I've just repeated it to check:

qemu-kvm-rhev-2.3.0-13.el7.x86_64 (on both source and destination now)
libvirt-daemon-1.2.15-2.el7.x86_64
virt-manager-1.2.1-1.fc22.noarch

but I doubt it's anything to do with the virt-manager version.

In the guest I run:

./stressapptest -s 100

then after I cancel the migration, I ctrl-c the stressapptest, then do a dmesg and see the IO errors.

some more testing:
1) I could only reliably repeat this with virtio disk, not scsi or ide (although IDE did give some errors during shutdown)
2) I couldn't repeat this using 'virsh migrate' - so I'm not sure what virt-manager is doing differently.

virt-manager likely isn't using different libvirt APIs here, but it is doing polling in separate threads with libvirt APIs DomainGetInfo and DomainJobInfo and possibly others. Those APIs hit qemu monitor commands, which maybe in some roundabout way is tickling corruption. With plain virsh invocation there's just less libvirt interaction going on.

(In reply to Dr. David Alan Gilbert from comment #6)
> some more testing:
> 1) I could only reliably repeat this with virtio disk, not scsi or ide
> (although IDE did give some errors during shutdown)

Please put a breakpoint on virtio_blk_handle_rw_error() and pretty-print req and error. A backtrace of all threads would also be useful.

(In reply to Stefan Hajnoczi from comment #8)
> (In reply to Dr. David Alan Gilbert from comment #6)
> > some more testing:
> > 1) I could only reliably repeat this with virtio disk, not scsi or ide
> > (although IDE did give some errors during shutdown)
>
> Please put a breakpoint on virtio_blk_handle_rw_error() and pretty-print req
> and error. A backtrace of all threads would also be useful.

Given the error=13 I saw in that backtrace, and that 13 is EACCES, I decided to check the permissions.

On hitting cancel I'm seeing the permission on the file change:

-rw-------. qemu qemu system_u:object_r:nfs_t:s0 /home/vms/f20.qcow2
-rw-------. qemu qemu system_u:object_r:nfs_t:s0 /home/vms/f20.qcow2
-rw-------. root root system_u:object_r:nfs_t:s0 /home/vms/f20.qcow2
-rw-------. root root system_u:object_r:nfs_t:s0 /home/vms/f20.qcow2

I'm guessing that's probably libvirt? I've got libvirt-1.2.15-2.el7.x86_64 on both sides.

(In reply to Dr. David Alan Gilbert from comment #10)
> On hitting cancel I'm seeing the permission on the file change:
>
> -rw-------. qemu qemu system_u:object_r:nfs_t:s0 /home/vms/f20.qcow2
> -rw-------. qemu qemu system_u:object_r:nfs_t:s0 /home/vms/f20.qcow2
> -rw-------. root root system_u:object_r:nfs_t:s0 /home/vms/f20.qcow2
> -rw-------. root root system_u:object_r:nfs_t:s0 /home/vms/f20.qcow2
>
> I'm guessing that's probably libvirt? I've got libvirt-1.2.15-2.el7.x86_64
> on both sides.

Yes, QEMU does not invoke chown(2). It must be libvirt.

Created attachment 1057564 [details]
libvirtd log from the source
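
For anyone retracing the diagnosis above (error=13 in the backtrace, then the owner of the image flipping from qemu to root), here is a minimal Python sketch of the two checks. The helper names `errno_name` and `owner_of` are made up for illustration and are not part of QEMU or libvirt; the sketch only uses the standard `errno`, `os`, `pwd` and `grp` modules and assumes a Linux host.

```python
import errno
import grp
import os
import pwd
from typing import Tuple


def errno_name(code: int) -> str:
    """Map a numeric errno to its symbolic name (hypothetical helper)."""
    return errno.errorcode.get(code, "UNKNOWN({})".format(code))


def owner_of(path: str) -> Tuple[str, str]:
    """Return (user, group) owning path, falling back to numeric ids
    when the id has no entry in the passwd/group databases."""
    st = os.stat(path)
    try:
        user = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        user = str(st.st_uid)
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)
    return user, group
```

Calling `errno_name(13)` on Linux gives `EACCES`, and calling `owner_of()` on the image path before and after cancelling the migration would show the same qemu to root change as the `ls -Z` output quoted above.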
Thanks for the logs, they confirm this is actually a libvirt bug.

When QEMU dies at the destination host during migration, it is either intercepted by a monitor API called from the Prepare or Finish steps, in which case we properly call qemuProcessStop with the VIR_QEMU_PROCESS_STOP_MIGRATED flag, or we end up in the qemuProcessHandleMonitorEOF callback. This callback does not set the VIR_QEMU_PROCESS_STOP_MIGRATED flag and thus all security labels are restored.

Since this is a race between qemuMigrationFinish and qemuProcessHandleMonitorEOF, cancelling migration may or may not hit the bug. A more reliable reproducer is killing the qemu-kvm process on the destination host. Another option, which hits a similar bug but in a different path, is running virDomainDestroy (virsh destroy) on the destination host.

Another path which needs testing is restarting libvirtd on destination host during migration.

Fixed upstream by v1.2.18-rc2-3-ge8d0166:

commit e8d0166e1d27c18aacea4b1316760fad4106e1c7
Author: Jiri Denemark <jdenemar>
Date:   Thu Jul 30 16:42:43 2015 +0200

    qemu: Do not reset labels when migration fails

    When stopping a domain on the destination host after a failed
    migration, we need to avoid reseting security labels since the domain is
    still running on the source host. While we were correctly doing so in
    some cases, there were still some paths which did this wrong.

    https://bugzilla.redhat.com/show_bug.cgi?id=1242904

    Signed-off-by: Jiri Denemark <jdenemar>

Reproduced on build libvirt-1.2.15-2.el7.x86_64

Steps:
0. Prepare a source host and a target host.
1. Prepare a shared image on the nfs server: /90121/fjin/r71-3.qcow2
2. Start a guest with the shared image on the source host.
3. Check the owner of the image file:
# ll /90121/fjin/r71-3.qcow2 -Z
-rw-r--r--. qemu qemu system_u:object_r:nfs_t:s0 /90121/fjin/r71-3.qcow2
4. Migrate the guest to the target host; before the migration finishes, kill the qemu-kvm process on the target (pkill qemu-kvm).
# virsh migrate --live mig2 qemu+ssh://10.66.4.141/system
error: unable to connect to server at 'fjin-4-141.englab.nay.redhat.com:49152': Connection refused
5. Check the owner of the image file:
# ll /90121/fjin/r71-3.qcow2 -Z
-rw-r--r--. root root system_u:object_r:nfs_t:s0 /90121/fjin/r71-3.qcow2

The file owner is changed to 'root'.

(In reply to Jiri Denemark from comment #16)
> Another path which needs testing is restarting libvirtd on destination host
> during migration.

Reproduced on build libvirt-1.2.15-2.el7.x86_64

Steps:
0. Prepare a source host and a target host.
1. Prepare a shared image on the nfs server: /90121/fjin/r71-2.qcow2
2. Start a guest with the shared image on the source host.
3. Check the owner of the image file:
# ll /90121/fjin/r71-2.qcow2 -Z
-rw-r--r--. qemu qemu system_u:object_r:nfs_t:s0 /90121/fjin/r71-2.qcow2
4. Migrate the guest to the target host; before the migration finishes, restart libvirtd on the target host.
# virsh migrate --live mig1 qemu+ssh://10.66.4.141/system --verbose
Migration: [ 5 %]error: operation failed: migration job: unexpectedly failed
5. Check the owner of the image file:
# ll /90121/fjin/r71-2.qcow2 -Z
-rw-r--r--. root root system_u:object_r:nfs_t:s0 /90121/fjin/r71-2.qcow2

The file owner is changed to 'root'.

Verified this bug on build libvirt-1.2.17-6.el7.x86_64

Steps:
0. Prepare a source host and a target host.
1. Prepare a shared image on the nfs server: /90121/fjin/rhel6.6-GUI.img
2. Start a guest with the shared image on the source host.
3. Check the owner of the image file:
# ll /90121/fjin/rhel6.6-GUI.img -Z
-rw-r--r--. qemu qemu system_u:object_r:nfs_t:s0 /90121/fjin/rhel6.6-GUI.img
4. Migrate the guest to the target host; before the migration finishes, restart libvirtd on the target host.
# virsh migrate rhel6.6-GUI qemu+ssh://10.66.4.141/system --verbose --live
Migration: [ 4 %]error: operation failed: migration job: unexpectedly failed

Or kill the qemu-kvm process on the target host during migration (by 'pkill qemu-kvm'):
# virsh migrate rhel6.6-GUI qemu+ssh://10.66.4.141/system --verbose --live
Migration: [ 1 %]error: internal error: early end of file from monitor: possible problem:
RHEL-6 compat: ich9-usb-uhci1: irq_pin = 3
RHEL-6 compat: ich9-usb-uhci2: irq_pin = 3
RHEL-6 compat: ich9-usb-uhci3: irq_pin = 3
qemu: terminating on signal 15 from pid 30930
5. Check the owner of the image file:
# ll /90121/fjin/rhel6.6-GUI.img -Z
-rw-r--r--. qemu qemu system_u:object_r:nfs_t:s0 /90121/fjin/rhel6.6-GUI.img

The owner of the shared image file is qemu, and the guest works well on the source host.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2202.html
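
As a footnote to the root-cause analysis earlier in this report (a stop path that omits VIR_QEMU_PROCESS_STOP_MIGRATED ends up restoring the security labels), the following is a toy Python model of that behaviour. It is not libvirt code: `StopFlags`, `Image`, `stop_domain` and `on_monitor_eof` are hypothetical names, chown-to-root stands in for "labels restored", and the `fixed` branch is only one plausible reading of the intent of the commit message, so the real patch may well be structured differently.

```python
from dataclasses import dataclass
from enum import Flag, auto


class StopFlags(Flag):
    """Hypothetical stand-in for the flags passed when stopping a domain."""
    NONE = 0
    MIGRATED = auto()  # models VIR_QEMU_PROCESS_STOP_MIGRATED


@dataclass
class Image:
    """Shared disk image; 'owner' models the security label on the file."""
    owner: str = "qemu"


def stop_domain(image: Image, flags: StopFlags) -> None:
    """Model of stopping the destination domain: labels are restored
    (owner goes back to root) unless the MIGRATED flag is set."""
    if not (flags & StopFlags.MIGRATED):
        image.owner = "root"


def on_monitor_eof(image: Image, incoming_migration: bool, fixed: bool) -> None:
    """Model of the monitor EOF path. Without the fix it never sets
    MIGRATED; with the (assumed) fix it does so for incoming migrations."""
    flags = StopFlags.NONE
    if fixed and incoming_migration:
        flags |= StopFlags.MIGRATED
    stop_domain(image, flags)
```

In this model, `on_monitor_eof(img, incoming_migration=True, fixed=False)` leaves `img.owner` as `root` while the guest is still running on the source, which corresponds to the EACCES errors seen in the guest; with `fixed=True` the owner stays `qemu`, matching the verification results above.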