972675 – Fail migration when VM get paused due to EIO

Bug 972675 - Fail migration when VM get paused due to EIO

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	libvirt
Sub Component:
Version:	6.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Peter Krempa
QA Contact:	Virtualization Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	886416 961154 1045833

Reported:	2013-06-10 11:51 UTC by Michal Skrivanek
Modified:	2013-12-22 12:25 UTC
CC List:	9 users
Fixed In Version:	libvirt-0.10.2-20.el6
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1045833
Environment:
Last Closed:	2013-11-21 09:02:39 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2013:1581	0	normal	SHIPPED_LIVE	libvirt bug fix and enhancement update	2013-11-21 01:11:35 UTC

Description Michal Skrivanek 2013-06-10 11:51:38 UTC

We should not support migration of VMs paused due to EIO. According to QEMU, a VM stuck in EIO is not safe to migrate and is at risk of data corruption.

This seems to be the most feasible way to solve the issues caused by losing storage connectivity in RHEV, where we try to migrate all the VMs off the affected host. In reality, because of a race between the VM status as seen by vdsm and engine and the actual VM state, we do try to migrate VMs in EIO (they can also enter EIO during migration). This problem is only solvable at a lower layer.
If libvirt fails the migration, at least at the end of it, we can keep the paused VM on the original host and resume it once the storage is reconnected.

Comment 2 Daniel Berrangé 2013-06-18 16:34:36 UTC

Sounds like it would be best for QEMU itself to refuse to accept the 'migrate' command if it is paused in EIO, and have it fail an ongoing migration if EIO occurs. Doing it in libvirt is somewhat racy since event notifications are asynchronous.

Comment 3 Peter Krempa 2013-07-03 13:00:34 UTC

The support for canceling ongoing migration was committed upstream a while ago in commits:

commit 5379bb0f33f1529f530a40958a10e8f02eb868bb
Author: Peter Krempa <pkrempa>
Date:   Wed Jun 12 16:11:22 2013 +0200

    migration: Don't propagate VIR_MIGRATE_ABORT_ON_ERROR
    
    This flag is meant for errors happening on the source of the migration
    and isn't used on the destination. To allow better migration
    compatibility, don't propagate it to the destination.

commit cf6d56ac433273b7e4e087bb861ebced0680cec3
Author: Peter Krempa <pkrempa>
Date:   Wed Jun 12 16:11:21 2013 +0200

    migration: Make erroring out on I/O error controllable by flag
    
    Paolo Bonzini pointed out that it's actually possible to migrate a qemu
    instance that was paused due to I/O error and it will be able to work on
    the destination if the storage is accessible.
    
    This patch introduces flag VIR_MIGRATE_ABORT_ON_ERROR that cancels the
    migration in case an I/O error happens while it's being performed and
    allows migration without this flag. This flag can be possibly used for
    other error reasons that may be introduced in the future.

commit 5f719f217ebf89668ca3c404e4b8288179c26c92
Author: Peter Krempa <pkrempa>
Date:   Mon Jun 10 16:30:48 2013 +0200

    qemu: Forbid migration of machines with I/O errors
    
    Such a machine can't be successfully migrated unless the I/O error has
    recovered and might lead to data corruption. Forbid this kind of
    migration.

commit caa467db626c8691d993e8e15d2cbb0bb043312c
Author: Peter Krempa <pkrempa>
Date:   Mon Jun 10 16:05:45 2013 +0200

    qemu: Cancel migration if guest encounters I/O error while migrating
    
    During a live migration the guest may receive a disk access I/O error.
    In this state the guest is unable to continue running on a remote host
    after migration as some state may be present in the kernel and not
    migrated.
    
    With this patch, the migration is canceled in such case so it can either
    continue on the source if the I/O issues are recovered or has to be
    destroyed anyways.
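
To make the combined effect of these four commits easier to follow, here is a minimal Python sketch of the decision logic as the commit messages describe it. It is not libvirt's code. The constants are local copies that are assumed to match the libvirt constants of the same names, and the helper functions (`paused_on_io_error`, `must_fail_migration`, `destination_flags`) are hypothetical names made up for illustration. It also assumes that the up-front check for an already paused guest is gated by the same flag, which is how the second commit message reads.

```python
# Local copies of libvirt constants. The values are assumptions made for this
# sketch; check them against your libvirt headers before relying on them.
VIR_DOMAIN_RUNNING = 1
VIR_DOMAIN_PAUSED = 3
VIR_DOMAIN_PAUSED_IOERROR = 5
VIR_MIGRATE_LIVE = 1 << 0
VIR_MIGRATE_ABORT_ON_ERROR = 1 << 12


def paused_on_io_error(state, reason):
    """True if the domain is paused and the pause reason is an I/O error."""
    return state == VIR_DOMAIN_PAUSED and reason == VIR_DOMAIN_PAUSED_IOERROR


def must_fail_migration(state, reason, flags):
    """Hypothetical helper: fail (or cancel) the migration only when the
    caller asked for VIR_MIGRATE_ABORT_ON_ERROR and the guest is in EIO.
    The same test applies before the migration starts and while it runs."""
    abort_requested = bool(flags & VIR_MIGRATE_ABORT_ON_ERROR)
    return abort_requested and paused_on_io_error(state, reason)


def destination_flags(flags):
    """Hypothetical helper: drop the source-only flag before the flags are
    passed on to the destination host."""
    return flags & ~VIR_MIGRATE_ABORT_ON_ERROR
```

With `VIR_MIGRATE_LIVE | VIR_MIGRATE_ABORT_ON_ERROR`, a guest paused with an I/O error makes `must_fail_migration` return true. Without the flag it returns false, which matches the note that such a guest can still be migrated if the destination can reach the storage.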

Comment 6 zhe peng 2013-07-23 05:44:13 UTC

Verified with build:
libvirt-0.10.2-21.el6.x86_64

Steps:
 1. Prepare two machines, source and target.
 2. Create a guest on the source with shared NFS storage.
 
 Test one: migrate after the guest hits EIO.
 Pause the guest with an I/O error:
 # virsh domstate $guest --reason
 paused (I/O error)

 # virsh migrate --live spice qemu+ssh://10.66.106.31/system --verbose --unsafe
root@10.66.106.31's password: 
error: cannot open file '/var/lib/libvirt/migrate/xuzhang-Graph.img': Input/output error
 Check the guest state:
 # virsh domstate $guest --reason
 paused (I/O error)
 Check the target: no guest exists there.

 Test two: the guest receives EIO during migration.
 # virsh domstate spice
 running
 Start the migration:
 # virsh migrate --live spice qemu+ssh://10.66.106.31/system --verbose --unsafe
root@10.66.106.31's password: 
 Before the migration finishes, stop the NFS server:
Migration: [ 96 %]error: Unable to read from monitor: Connection reset by peer
 The job stopped; check the guest state:
# virsh domstate spice --reason
paused (I/O error)
 Check the target: no guest exists there. Works as expected; moving to VERIFIED.
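
For anyone who wants to script the state checks above, here is a small Python sketch that splits a line in the `state (reason)` shape printed by `virsh domstate --reason` into its two parts. `parse_domstate` is a hypothetical helper written for this note, not part of libvirt or virsh, and it assumes the output has exactly the shape shown in the steps above.

```python
import re


def parse_domstate(line):
    """Split a 'state (reason)' line into (state, reason).

    reason is None when the line carries no parenthesised reason,
    for example when --reason was not passed.
    """
    match = re.fullmatch(r"\s*(.+?)(?:\s+\((.*)\))?\s*", line)
    if match is None:
        raise ValueError("unrecognised domstate output: %r" % line)
    return match.group(1), match.group(2)


def is_paused_on_io_error(line):
    """True for the state both tests expect to see on the source host."""
    return parse_domstate(line) == ("paused", "I/O error")
```

Feeding it the captured output from either test, `is_paused_on_io_error("paused (I/O error)")` returns true, while the plain `running` line from test two returns false.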

Comment 8 errata-xmlrpc 2013-11-21 09:02:39 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1581.html
