Bug 1494454 - RFE: add sanity checks for shared storage when migrating without block copy
Summary: RFE: add sanity checks for shared storage when migrating without block copy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.5
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: rc
: ---
Assignee: Michal Privoznik
QA Contact: Fangge Jin
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-09-22 10:02 UTC by Daniel Berrangé
Modified: 2018-10-30 09:51 UTC (History)
9 users (show)

Fixed In Version: libvirt-4.3.0-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-30 09:50:00 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:3113 None None None 2018-10-30 09:51:23 UTC

Description Daniel Berrangé 2017-09-22 10:02:46 UTC
Description of problem:
When triggering a migration across hosts, libvirt does not do any significant
validation of the storage hosting the VM's disk images. As long as the same
disk paths are visible on both hosts, it'll let the migration run.

This can lead to trouble. For example, if someone with non-shared storage
pre-creates the disk image on the target host with the right size but then
triggers a migration *without* block copy enabled, libvirt will happily
let the migration run.  When the CPUs start on the target host and the
guest does I/O, it'll find its disk image is all zeros, which usually
leads to unhappiness!  What's particularly bad is that the user who
triggered the migration won't realize the mistake they made until the
migration has already finished and the source guest has been destroyed, so
the only option to recover is to rebuild the guest from scratch.

Validating a correct storage setup in a way that covers all scenarios
is quite hard. For example, with non-shared storage it is possible the user
has manually copied the disk image across ahead of time - this is fine
if the disk image is read-only, since there's no I/O to synchronize.
Identifying whether block-device-backed storage is shared or not is
non-trivial.

We could, however, do some simple checks to identify common mistakes
with filesystem-based disk backends. I.e., if the virtual disk is
writable, and is stored on an ext[2,3,4] / xfs / fat filesystem,
then there's likely no way migration can do the right thing if
block copy is not requested. We should report an error here rather
than let users shoot themselves in the foot. This would be
particularly useful for virt-manager users, since virt-manager
always does migration without block copy, no matter what.

Comment 5 Michal Privoznik 2018-02-23 14:50:50 UTC
While looking at the code, I noticed we already report an error:

static bool
qemuMigrationSrcIsSafe(virDomainDefPtr def,
                       size_t nmigrate_disks,
                       const char **migrate_disks,
                       unsigned int flags)

{
    bool storagemigration = flags & (VIR_MIGRATE_NON_SHARED_DISK |
                                     VIR_MIGRATE_NON_SHARED_INC);
    size_t i;
    int rc;

    for (i = 0; i < def->ndisks; i++) {
        virDomainDiskDefPtr disk = def->disks[i];
        const char *src = virDomainDiskGetSource(disk);

        /* Our code elsewhere guarantees shared disks are either readonly
         * (in which case cache mode doesn't matter) or used with
         * cache=none or cache=directsync */
        if (virStorageSourceIsEmpty(disk->src) ||
            disk->src->readonly ||
            disk->src->shared ||
            disk->cachemode == VIR_DOMAIN_DISK_CACHE_DISABLE ||
            disk->cachemode == VIR_DOMAIN_DISK_CACHE_DIRECTSYNC)
            continue;

        /* disks which are migrated by qemu are safe too */
        if (storagemigration &&
            qemuMigrationAnyCopyDisk(disk, nmigrate_disks, migrate_disks))
            continue;

        if (virDomainDiskGetType(disk) == VIR_STORAGE_TYPE_FILE) {
            if ((rc = virFileIsSharedFS(src)) < 0)
                return false;
            else if (rc == 0)
                continue;
            if ((rc = virStorageFileIsClusterFS(src)) < 0)
                return false;
            else if (rc == 1)
                continue;
        } else if (disk->src->type == VIR_STORAGE_TYPE_NETWORK &&
                   disk->src->protocol == VIR_STORAGE_NET_PROTOCOL_RBD) {
            continue;
        }

        virReportError(VIR_ERR_MIGRATE_UNSAFE, "%s",
                       _("Migration may lead to data corruption if disks"
                         " use cache != none or cache != directsync"));
        return false;
    }

    return true;
}


Of course this can be overridden by --unsafe flag. Isn't this enough?

Comment 6 Daniel Berrangé 2018-02-23 14:56:58 UTC
Ah, that's more than I thought we did, but there are a couple of issues there:

 - If virFileIsSharedFS() returns 0 (i.e. a local ext3 FS), then we jump to the next loop iteration. We should check whether block migration is requested here, and raise an error if it is not.

 - If the disk is marked <shareable/> then we don't even bother with the IsSharedFS() check, which is again a problem if that detects a local FS with no block migration enabled.

Comment 7 Michal Privoznik 2018-02-26 10:06:27 UTC
Okay, patch posted upstream:

https://www.redhat.com/archives/libvir-list/2018-February/msg01154.html

Comment 8 Michal Privoznik 2018-02-26 10:45:27 UTC
I've just pushed the patch upstream:

commit ed11e9cd95bd9cae6cdfab14ae0936930bbb63e6
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Mon Feb 26 09:35:25 2018 +0100
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Mon Feb 26 11:32:05 2018 +0100

    qemuMigrationSrcIsSafe: Check local storage more thoroughly
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1494454
    
    If a domain disk is stored on local filesystem (e.g. ext4) but is
    not being migrated it is very likely that domain is not able to
    run on destination. Regardless of share/cache mode.
    
    Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
    Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

v4.1.0-rc1-1-ged11e9cd9

Comment 12 errata-xmlrpc 2018-10-30 09:50:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3113

