Bug 1242904 - migration: Cancelling triggers guest IO errors
Summary: migration: Cancelling triggers guest IO errors
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Assignee: Jiri Denemark
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-07-14 11:28 UTC by Dr. David Alan Gilbert
Modified: 2015-11-19 06:48 UTC
CC: 16 users

Fixed In Version: libvirt-1.2.17-4.el7
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-11-19 06:48:39 UTC
Target Upstream Version:


Attachments: libvirtd log from the source (attachment 1057564)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2015:2202 0 normal SHIPPED_LIVE libvirt bug fix and enhancement update 2015-11-19 08:17:58 UTC

Description Dr. David Alan Gilbert 2015-07-14 11:28:42 UTC
Description of problem:
Cancelling a migration in progress causes fatal IO errors on a virtual disk in the guest.

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.3.0-7.el7.x86_64

How reproducible:
100% ?

Steps to Reproduce:
1. Take an f20 guest
2. Start stressapptest in it
3. Start a migration to another host (in this case to a 7.1 host)
4. Cancel the migration

(I'm using virt-manager for this, hitting the migrate button on its menu and using the cancel button there.)

Actual results:
Seen in the guest console:
end_request: I/O error, dev vda, sector 39297024

Expected results:
No errors

Additional info:
The VM image is on NFS storage served by the destination.

/usr/libexec/qemu-kvm -name f20-414 -S -machine pc-i440fx-rhel7.1.0,accel=kvm,usb=off -cpu SandyBridge -m 81920 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid eb42c9ed-4c2b-496d-b415-e1e1cc2a917e -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/f20-414.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot order=c,menu=on,strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x5.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x5 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x5.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x5.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -device usb-ccid,id=ccid0 -drive file=/home/vms/f20.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0 -drive if=none,id=drive-fdc0-0-0,format=raw -global isa-fdc.driveA=drive-fdc0-0-0 -drive if=none,id=drive-ide0-0-0,readonly=on,format=raw -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,fd=23,id=hostnet0,vhost=on,vhostfd=24 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:bc:51:1a,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charserial1,host=127.0.0.1,port=4555,telnet,server,nowait -device isa-serial,chardev=charserial1,id=serial1 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/f20-414.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel1,name=vdagent -device 
virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0 -device usb-tablet,id=input0 -vnc 127.0.0.1:0 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on

Comment 3 juzhang 2015-07-16 04:46:17 UTC
Hi Shu,

Could you run a test and update the bz?

Best Regards,
Junyi

Comment 4 Shaolong Hu 2015-07-17 03:17:39 UTC
(In reply to juzhang from comment #3)
> Hi Shu,
> 
> Could you run a test and update the bz?
> 
> Best Regards,
> Junyi

Tested with stress and stressapptest for more than 10 rounds; cannot hit the problem using qemu-kvm directly.

David, could you provide the virt-manager version and the stressapptest command you used?


Bests,
Shaolong

Comment 5 Dr. David Alan Gilbert 2015-07-28 09:58:38 UTC
(In reply to Shaolong Hu from comment #4)
> (In reply to juzhang from comment #3)
> > Hi Shu,
> > 
> > Could you run a test and update the bz?
> > 
> > Best Regards,
> > Junyi
> 
> Tested with stress and stressapptest for more than 10 rounds; cannot hit the
> problem using qemu-kvm directly.
> 
> David, could you provide virt-manager version and stressapptest cmd you use?
> 
> 
> Bests,
> Shaolong

I've just repeated it to check:
   qemu-kvm-rhev-2.3.0-13.el7.x86_64  (on both source and destination now)
   libvirt-daemon-1.2.15-2.el7.x86_64
   virt-manager-1.2.1-1.fc22.noarch

but I doubt it has anything to do with the virt-manager version.

   In the guest I run:
   ./stressapptest -s 100

   Then, after I cancel the migration, I Ctrl-C stressapptest, run dmesg,
and see the IO errors.

Comment 6 Dr. David Alan Gilbert 2015-07-28 10:46:49 UTC
Some more testing:
  1) I could only reliably repeat this with a virtio disk, not scsi or ide (although IDE did give some errors during shutdown)
  2) I couldn't repeat this using 'virsh migrate', so I'm not sure what virt-manager is doing differently.

Comment 7 Cole Robinson 2015-07-28 14:50:47 UTC
virt-manager likely isn't using different libvirt APIs here, but it does poll in separate threads with the libvirt APIs DomainGetInfo and DomainJobInfo, and possibly others. Those APIs hit QEMU monitor commands, which may in some roundabout way be tickling the corruption. With a plain virsh invocation there's just less libvirt interaction going on.

Comment 8 Stefan Hajnoczi 2015-07-29 14:58:37 UTC
(In reply to Dr. David Alan Gilbert from comment #6)
> some more testing:
>   1) I could only reliably repeat this with virtio disk, not scsi or ide
> (although IDE did give some errors during shutdown)

Please put a breakpoint on virtio_blk_handle_rw_error() and pretty-print req and error.  A backtrace of all threads would also be useful.

Comment 10 Dr. David Alan Gilbert 2015-07-29 19:00:00 UTC
(In reply to Stefan Hajnoczi from comment #8)
> (In reply to Dr. David Alan Gilbert from comment #6)
> > some more testing:
> >   1) I could only reliably repeat this with virtio disk, not scsi or ide
> > (although IDE did give some errors during shutdown)
> 
> Please put a breakpoint on virtio_blk_handle_rw_error() and pretty-print req
> and error.  A backtrace of all threads would also be useful.

Given the error=13 I saw in that backtrace, and that 13 is EACCES, I decided to check the permissions.

On hitting cancel I'm seeing the permissions on the file change:

-rw-------. qemu qemu system_u:object_r:nfs_t:s0       /home/vms/f20.qcow2
-rw-------. qemu qemu system_u:object_r:nfs_t:s0       /home/vms/f20.qcow2
-rw-------. root root system_u:object_r:nfs_t:s0       /home/vms/f20.qcow2
-rw-------. root root system_u:object_r:nfs_t:s0       /home/vms/f20.qcow2

I'm guessing that's probably libvirt?  I've got libvirt-1.2.15-2.el7.x86_64 on both sides.
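One way to catch the ownership flip described above is to poll the image file with stat while hitting cancel. A minimal sketch (IMG here is a stand-in temp file so the loop runs anywhere; the report used /home/vms/f20.qcow2 on NFS):

```shell
#!/bin/sh
# Poll owner:group and mode of the image file during the migration cancel.
# IMG is a stand-in temp file for this sketch; point it at the real image.
IMG=$(mktemp)
for i in 1 2 3; do
    owner=$(stat -c '%U:%G %a' "$IMG")  # e.g. "qemu:qemu 600" flipping to "root:root 600"
    echo "$owner"
    sleep 1
done
rm -f "$IMG"
```

With the real image path, the flip from qemu:qemu to root:root should show up in the output the moment the cancel lands.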

Comment 11 Stefan Hajnoczi 2015-07-30 07:13:04 UTC
(In reply to Dr. David Alan Gilbert from comment #10)
> On hitting cancel I'm seeing the permission on the file change:
> 
> -rw-------. qemu qemu system_u:object_r:nfs_t:s0       /home/vms/f20.qcow2
> -rw-------. qemu qemu system_u:object_r:nfs_t:s0       /home/vms/f20.qcow2
> -rw-------. root root system_u:object_r:nfs_t:s0       /home/vms/f20.qcow2
> -rw-------. root root system_u:object_r:nfs_t:s0       /home/vms/f20.qcow2
> 
> I'm guessing that's probably libvirt?  I've got libvirt-1.2.15-2.el7.x86_64
> on both sides.

Yes, QEMU does not invoke chown(2).  It must be libvirt.
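The error=13 fits NFS semantics: the server re-validates permissions on each request, so after the chown to root:root the qemu process (running as qemu) gets EACCES on in-flight IO, whereas on a local filesystem an already-open file descriptor keeps working because permissions are checked at open() time. A minimal local sketch of that open-time behaviour (assumption: any local POSIX filesystem):

```shell
#!/bin/sh
# Show that locally, permissions are checked at open() time: a write through
# an fd opened before chmod 000 still succeeds. An NFS server re-checks per
# request, which is why the running guest saw EACCES after the chown.
f=$(mktemp)
exec 3<>"$f"        # open a read-write fd, like qemu holding the disk image open
chmod 000 "$f"      # revoke all permissions after the fd exists
echo ok >&3         # still succeeds locally via the already-open fd
chmod 600 "$f"      # restore so the file can be read back
result=$(cat "$f")
echo "$result"
exec 3>&-
rm -f "$f"
```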

Comment 12 Dr. David Alan Gilbert 2015-07-30 09:03:28 UTC
Created attachment 1057564 [details]
libvirtd log from the source

Comment 14 Jiri Denemark 2015-07-30 11:20:41 UTC
Thanks for the logs; they confirm this is actually a libvirt bug. When QEMU dies on the destination host during migration, the failure is either intercepted by a monitor API called from the Prepare or Finish steps, in which case we properly call qemuProcessStop with the VIR_QEMU_PROCESS_STOP_MIGRATED flag, or we end up in the qemuProcessHandleMonitorEOF callback. This callback does not set the VIR_QEMU_PROCESS_STOP_MIGRATED flag, and thus all security labels are restored.

Comment 15 Jiri Denemark 2015-07-31 12:09:19 UTC
Since this is a race between qemuMigrationFinish and qemuProcessHandleMonitorEOF, cancelling migration may or may not hit the bug. A more reliable reproducer is killing the qemu-kvm process on the destination host. Another option, which hits a similar bug via a different path, is running virDomainDestroy (virsh destroy) on the destination host.
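The kill-based reproducer can be sketched in shell. The domain name, destination URI, and image path below are placeholders taken from the QA comments; the script is a dry run unless explicitly armed, since it kills a live qemu-kvm:

```shell
#!/bin/sh
# Sketch of the reliable reproducer: kill qemu-kvm on the destination
# mid-migration. GUEST, DEST, and the image path are placeholders from the
# QA environment in this bug; set RUN_REPRODUCER=1 on real hosts to arm it.
GUEST=mig2
DEST=qemu+ssh://dest.example.com/system
if [ "${RUN_REPRODUCER:-0}" = "1" ]; then
    virsh migrate --live "$GUEST" "$DEST" &  # start a live migration
    sleep 3                                  # let it get under way
    ssh dest.example.com pkill qemu-kvm      # kill qemu-kvm on the destination
    wait                                     # migration fails
    ls -Z /90121/fjin/r71-3.qcow2            # should stay qemu:qemu once fixed
else
    echo "dry run: set RUN_REPRODUCER=1 on real hosts"
fi
```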

Comment 16 Jiri Denemark 2015-07-31 13:17:58 UTC
Another path which needs testing is restarting libvirtd on the destination host during migration.

Comment 17 Jiri Denemark 2015-07-31 13:18:37 UTC
Fixed upstream by v1.2.18-rc2-3-ge8d0166:

commit e8d0166e1d27c18aacea4b1316760fad4106e1c7
Author: Jiri Denemark <jdenemar@redhat.com>
Date:   Thu Jul 30 16:42:43 2015 +0200

    qemu: Do not reset labels when migration fails
    
    When stopping a domain on the destination host after a failed migration,
    we need to avoid resetting security labels since the domain is still
    running on the source host. While we were correctly doing so in some
    cases, there were still some paths which did this wrong.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1242904
    
    Signed-off-by: Jiri Denemark <jdenemar@redhat.com>

Comment 19 Fangge Jin 2015-08-05 09:57:08 UTC
Reproduced on build libvirt-1.2.15-2.el7.x86_64

Steps:
0. Prepare a source host and a target host.

1. Prepare a shared image on the NFS server: /90121/fjin/r71-3.qcow2

2. Start a guest with the shared image on the source host.

3. Check the owner of the image file:
# ll /90121/fjin/r71-3.qcow2 -Z
-rw-r--r--. qemu qemu system_u:object_r:nfs_t:s0       /90121/fjin/r71-3.qcow2

4. Migrate the guest to the target host; before migration finishes, kill the qemu-kvm process on the target (pkill qemu-kvm).
# virsh migrate --live mig2 qemu+ssh://10.66.4.141/system
error: unable to connect to server at 'fjin-4-141.englab.nay.redhat.com:49152': Connection refused

5. Check the owner of the image file:
# ll /90121/fjin/r71-3.qcow2 -Z
-rw-r--r--. root root system_u:object_r:nfs_t:s0       /90121/fjin/r71-3.qcow2

The file owner is changed to 'root'.

Comment 21 Fangge Jin 2015-08-10 06:04:45 UTC
(In reply to Jiri Denemark from comment #16)
> Another path which needs testing is restarting libvirtd on destination host
> during migration.


Reproduced on build libvirt-1.2.15-2.el7.x86_64

Steps:
0. Prepare a source host and a target host.

1. Prepare a shared image on the NFS server: /90121/fjin/r71-2.qcow2

2. Start a guest with the shared image on the source host.

3. Check the owner of the image file:
# ll /90121/fjin/r71-2.qcow2 -Z
-rw-r--r--. qemu qemu system_u:object_r:nfs_t:s0       /90121/fjin/r71-2.qcow2

4. Migrate the guest to the target host; before migration finishes, restart libvirtd on the target host.
# virsh migrate --live mig1 qemu+ssh://10.66.4.141/system --verbose
Migration: [  5 %]error: operation failed: migration job: unexpectedly failed

5. Check the owner of the image file:
# ll /90121/fjin/r71-2.qcow2 -Z
-rw-r--r--. root root system_u:object_r:nfs_t:s0       /90121/fjin/r71-2.qcow2

The file owner is changed to 'root'.

Comment 22 Fangge Jin 2015-08-31 08:39:03 UTC
Verified this bug on build libvirt-1.2.17-6.el7.x86_64

Steps:
0. Prepare a source host and a target host.

1. Prepare a shared image on the NFS server: /90121/fjin/rhel6.6-GUI.img

2. Start a guest with the shared image on the source host.

3. Check the owner of the image file:
# ll /90121/fjin/rhel6.6-GUI.img  -Z
-rw-r--r--. qemu qemu system_u:object_r:nfs_t:s0       /90121/fjin/rhel6.6-GUI.img

4. Migrate the guest to the target host; before migration finishes, restart libvirtd on the target host.
# virsh migrate rhel6.6-GUI qemu+ssh://10.66.4.141/system --verbose --live
Migration: [  4 %]error: operation failed: migration job: unexpectedly failed

Or
Kill the qemu-kvm process on the target host during migration (pkill qemu-kvm):
# virsh migrate rhel6.6-GUI qemu+ssh://10.66.4.141/system --verbose --live
Migration: [  1 %]error: internal error: early end of file from monitor: possible problem:
RHEL-6 compat: ich9-usb-uhci1: irq_pin = 3
RHEL-6 compat: ich9-usb-uhci2: irq_pin = 3
RHEL-6 compat: ich9-usb-uhci3: irq_pin = 3
qemu: terminating on signal 15 from pid 30930


5. Check the owner of the image file:
# ll /90121/fjin/rhel6.6-GUI.img  -Z
-rw-r--r--. qemu qemu system_u:object_r:nfs_t:s0       /90121/fjin/rhel6.6-GUI.img

The owner of the shared image file is qemu, and the guest works well on the source host.

Comment 24 errata-xmlrpc 2015-11-19 06:48:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2202.html

