Bug 1494454 - RFE: add sanity checks for shared storage when migrating without block copy
Summary: RFE: add sanity checks for shared storage when migrating without block copy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.5
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: rc
: ---
Assignee: Michal Privoznik
QA Contact: Fangge Jin
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-09-22 10:02 UTC by Daniel Berrangé
Modified: 2018-10-30 09:51 UTC (History)
9 users (show)

Fixed In Version: libvirt-4.3.0-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-30 09:50:00 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:3113 None None None 2018-10-30 09:51:23 UTC

Description Daniel Berrangé 2017-09-22 10:02:46 UTC
Description of problem:
When triggering a migration across hosts, libvirt does not do any significant
validation of the storage hosting the VM's disk images. As long as the same
disk paths are visible on both hosts, it'll let the migration run.

This can lead to trouble. For example, if someone with non-shared storage
pre-creates the disk image on the target host with the right size but then
triggers a migration *without* block copy enabled, libvirt will happily
let the migration run.  When the CPUs start on the target host and the
guest does I/O, it'll find its disk image is all zeros, which usually
leads to unhappiness!  What's particularly bad is that the user who
triggered the migration won't realize the mistake they made until the
migration has already finished and the source guest has been destroyed, so
the only option to recover is to rebuild the guest from scratch.

Validating a correct storage setup in a way that covers all scenarios
is quite hard. For example, with non-shared storage it is possible the user
has manually copied the disk image across ahead of time - this is fine
if the disk image is read-only, since there's no I/O to synchronize.
Identifying whether block-device-backed storage is shared or not is
non-trivial.

We could, however, do some simple checks to identify common mistakes
with filesystem-based disk backends. I.e., if the virtual disk is
writable, and is stored on an ext[2,3,4] / xfs / fat filesystem,
then there's likely no way migration can do the right thing if
block copy is not requested. We should report an error here rather
than let users shoot themselves in the foot. This would be
particularly useful for virt-manager users, since virt-manager
always does migration without block copy, no matter what.

Comment 5 Michal Privoznik 2018-02-23 14:50:50 UTC
While looking at the code, I noticed we already report an error:

static bool
qemuMigrationSrcIsSafe(virDomainDefPtr def,
                       size_t nmigrate_disks,
                       const char **migrate_disks,
                       unsigned int flags)

{
    bool storagemigration = flags & (VIR_MIGRATE_NON_SHARED_DISK |
                                     VIR_MIGRATE_NON_SHARED_INC);
    size_t i;
    int rc;

    for (i = 0; i < def->ndisks; i++) {
        virDomainDiskDefPtr disk = def->disks[i];
        const char *src = virDomainDiskGetSource(disk);

        /* Our code elsewhere guarantees shared disks are either readonly
         * (in which case cache mode doesn't matter) or used with
         * cache=none or cache=directsync */
        if (virStorageSourceIsEmpty(disk->src) ||
            disk->src->readonly ||
            disk->src->shared ||
            disk->cachemode == VIR_DOMAIN_DISK_CACHE_DISABLE ||
            disk->cachemode == VIR_DOMAIN_DISK_CACHE_DIRECTSYNC)
            continue;

        /* disks which are migrated by qemu are safe too */
        if (storagemigration &&
            qemuMigrationAnyCopyDisk(disk, nmigrate_disks, migrate_disks))
            continue;

        if (virDomainDiskGetType(disk) == VIR_STORAGE_TYPE_FILE) {
            if ((rc = virFileIsSharedFS(src)) < 0)
                return false;
            else if (rc == 0)
                continue;
            if ((rc = virStorageFileIsClusterFS(src)) < 0)
                return false;
            else if (rc == 1)
                continue;
        } else if (disk->src->type == VIR_STORAGE_TYPE_NETWORK &&
                   disk->src->protocol == VIR_STORAGE_NET_PROTOCOL_RBD) {
            continue;
        }

        virReportError(VIR_ERR_MIGRATE_UNSAFE, "%s",
                       _("Migration may lead to data corruption if disks"
                         " use cache != none or cache != directsync"));
        return false;
    }

    return true;
}


Of course this can be overridden by --unsafe flag. Isn't this enough?

Comment 6 Daniel Berrangé 2018-02-23 14:56:58 UTC
Ah, that's more than I thought we did, but there are a couple of issues there:

 - If virFileIsSharedFS() returns 0 (i.e. a local ext3 FS), then we jump to the next loop iteration. We should check whether block migration is requested here, and raise an error if it is not.

 - If the disk is marked <shareable/> then we don't even bother with the IsSharedFS() check, which is again a problem if that detects a local FS with no block migration enabled.

Comment 7 Michal Privoznik 2018-02-26 10:06:27 UTC
Okay, patch posted upstream:

https://www.redhat.com/archives/libvir-list/2018-February/msg01154.html

Comment 8 Michal Privoznik 2018-02-26 10:45:27 UTC
I've just pushed the patch upstream:

commit ed11e9cd95bd9cae6cdfab14ae0936930bbb63e6
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Mon Feb 26 09:35:25 2018 +0100
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Mon Feb 26 11:32:05 2018 +0100

    qemuMigrationSrcIsSafe: Check local storage more thoroughly
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1494454
    
    If a domain disk is stored on local filesystem (e.g. ext4) but is
    not being migrated it is very likely that domain is not able to
    run on destination. Regardless of share/cache mode.
    
    Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
    Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

v4.1.0-rc1-1-ged11e9cd9

Comment 12 errata-xmlrpc 2018-10-30 09:50:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3113

