Red Hat Bugzilla – Bug 751631
Default block cache mode for migration
Last modified: 2017-10-26 17:32:38 EDT
Description of problem:
It is unsafe to allow live migration if cache != none. Otherwise the source host might cache data blocks while the destination host still sees an old version of them, leading to corruption. So libvirt needs to only allow live migration if cache=none. In other cases it can try to flush the page cache in order to sync it to the disk. Notice that this is not just a black & white approach: if anyone is using GFS2, GPFS or any other coherent clustered filesystem, it is OK to run with any cache=foo mode.
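For illustration, a disk stanza that stays safe for migration would use cache='none' in the domain XML; the image path and target device below are placeholders, not taken from this report:

    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source file='/path/to/guest.img'/>
      <target dev='vda' bus='virtio'/>
    </disk>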
What would the matrix look like? I presume:
  for block devices: only allow if cache=none
  for file systems: we can guess at a few that should work - gfs2, gpfs, gluster
But can we define a definite list now, or should this be a config option for a whitelist of filesystems? Also, this is flagged for 6.2 but it seems too big a change for that release.
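As a rough sketch only (not necessarily how libvirt would implement such a check), the host can report the filesystem type backing an image, and a whitelist could match it against known coherent cluster filesystems; the path here is a placeholder:

    # stat -f -c %T /path/to/guest.img
    (prints the filesystem type in human-readable form, e.g. gfs2, nfs, xfs)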
I guess we will need to be careful and provide a way to bypass the check, because existing users for whom this "worked fine until now" will get rightfully annoyed if their guests are stuck the day they need to migrate. But we need a way to raise awareness of the risk. Maybe provide a force flag to bypass the check, possibly set cache='none' on the migrated guest automatically if they force the migration to avoid the issue, and of course notify the user then. A priori, if a guest has been migrated once it is likely to be migrated again in the future. Ideally though we should not destroy performance for a relatively rare event; the best would be to be able to switch off caching dynamically, flush on the host and then proceed with the migration. I assume that is not possible now, but it would be the right way to go.

Daniel
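As a sketch of the manual flush mentioned above (assuming a standard Linux host), one could force dirty pages out and drop the host page cache just before migrating; this only narrows the window, it is not a substitute for cache='none', since the cache can be repopulated while the guest keeps running:

    # sync                               # write back dirty host pages
    # echo 3 > /proc/sys/vm/drop_caches  # drop clean page cache and slab caches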
I do think we need to protect against it, no matter how rare the issue might be; that's exactly the nature of data integrity issues. cache=none actually is the preferred way for performance too, so it won't restrict users. It is possible to make sure we flush all of the host page cache, but it would need qemu involvement to make sure it gets triggered exactly during the downtime period of the migration, and it might cause long IO stalls in case the cache was big. So at the end of the day, my recommendation is not to allow it at all and potentially add some override flag for dummies.
In POST: http://post-office.corp.redhat.com/archives/rhvirt-patches/2012-February/msg01312.html
Verify pass on non-cluster filesystem:
libvirt-0.9.10-3.el6.x86_64
qemu-kvm-0.12.1.2-2.232.el6.x86_64
kernel-2.6.32-225.el6.x86_64

Start a guest with cache=writeback:
...
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source file='/mnt/kvm-rhel6u2-x86_64-new.img'>
        <seclabel relabel='no'/>
      </source>
      <target dev='hda' bus='ide'/>
    </disk>
...

Then do the migration; libvirt reports an error:
# virsh migrate --live mig qemu+ssh://{target ip}/system
error: Unsafe migration: Migration may lead to data corruption if disks use cache != none

Migration succeeds with the --unsafe option and cache=writeback:
# virsh migrate --live mig qemu+ssh://{target ip}/system --unsafe
succeeds without error

Migration succeeds with cache=none without --unsafe:
...
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source file='/mnt/kvm-rhel6u2-x86_64-new.img'>
        <seclabel relabel='no'/>
      </source>
      <target dev='hda' bus='ide'/>
    </disk>
...
# virsh migrate --live mig qemu+ssh://{target ip}/system
succeeds without error

For a cluster filesystem, how can I build the environment and test it?
Retest on
qemu-kvm-0.12.1.2-2.232.el6.x86_64
kernel-2.6.32-225.el6.x86_64
libvirt-0.9.10-3.el6.x86_64
with 2 disks, one of them a cdrom in read-only mode:

    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/mnt/qcow2.img'>
        <seclabel relabel='no'/>
      </source>
      <target dev='hda' bus='ide'/>
      <alias name='ide0-0-0'/>
      <address type='drive' controller='0' bus='0' unit='0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/cdrom.img' startupPolicy='optional'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
      <alias name='ide0-1-0'/>
      <address type='drive' controller='0' bus='1' unit='0'/>
    </disk>

Migration reports an error:
error: Unsafe migration: Migration may lead to data corruption if disks use cache != none

So I think there is still a bug here.
The following additional patch fixes the above issue: http://post-office.corp.redhat.com/archives/rhvirt-patches/2012-March/msg00261.html
Verify pass on the comment 11 scenario; it can now migrate successfully without error. Still left is the test on a cluster filesystem; will wait for Dor's reply on how to test it.
Verify pass. According to Dor's suggestion, we only need to test on a non-cluster filesystem.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2012-0748.html