Bug 1386359 - Post-copy migration of non-shared storage - active mirror block job (qemu)
Summary: Post-copy migration of non-shared storage - active mirror block job (qemu)
Keywords: FutureFeature
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Hanna Czenczek
QA Contact: aihua liang
URL:
Whiteboard:
Depends On:
Blocks: 1324566 1644988
 
Reported: 2016-10-18 18:12 UTC by Ademar Reis
Modified: 2019-02-01 17:29 UTC
CC List: 18 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of: 1324566
Clones: 1644988
Environment:
Last Closed: 2019-02-01 17:29:06 UTC
Target Upstream Version:


Attachments

Description Ademar Reis 2016-10-18 18:12:20 UTC
+++ This bug was initially created as a clone of Bug #1324566 +++

Description of problem:

Post-copy migration is supposed to always converge, but when migrating a domain with non-shared storage, post-copy migration starts only once all disks have been copied to the destination host. However, storage migration does not use a post-copy approach, which means migrating a domain may never finish even though post-copy was requested.

Both storage and memory need to be migrated in a post-copy way to ensure migration always converges.
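For context, a rough sketch of how the two halves of such a migration are driven over QMP today (an illustration, not what libvirt emits verbatim): RAM migration can be flipped to post-copy via the postcopy-ram capability and migrate-start-postcopy, while non-shared disks are carried by a separate drive-mirror job into an NBD export on the destination, and that job has no post-copy mode. The monitor socket paths, host name, port and drive name below are placeholders, and the waits for BLOCK_JOB_READY and migration progress are omitted.

import json
import socket

def qmp(sock_path, *commands):
    # Minimal synchronous QMP helper: connect to the monitor socket,
    # negotiate capabilities, run each command and collect the replies.
    s = socket.socket(socket.AF_UNIX)
    s.connect(sock_path)
    f = s.makefile("rw")
    json.loads(f.readline())                               # greeting banner
    f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
    f.flush()
    json.loads(f.readline())
    replies = []
    for cmd in commands:
        f.write(json.dumps(cmd) + "\n")
        f.flush()
        while True:
            msg = json.loads(f.readline())
            if "return" in msg or "error" in msg:          # skip async events
                replies.append(msg)
                break
    return replies

SRC_QMP = "/var/lib/libvirt/qemu/source-vm.monitor"        # placeholder paths
DST_QMP = "/var/lib/libvirt/qemu/dest-vm.monitor"

# Destination: export the pre-created disk over NBD as the mirror target.
qmp(DST_QMP,
    {"execute": "nbd-server-start",
     "arguments": {"addr": {"type": "inet",
                            "data": {"host": "0.0.0.0", "port": "10809"}}}},
    {"execute": "nbd-server-add",
     "arguments": {"device": "drive-virtio-disk0", "writable": True}})

# Source: mirror the disk into that export (pre-copy only), then migrate RAM
# with the post-copy capability.  Only the RAM phase can switch to post-copy;
# the drive-mirror job has to chase dirty blocks until it converges by itself.
qmp(SRC_QMP,
    {"execute": "drive-mirror",
     "arguments": {"device": "drive-virtio-disk0", "sync": "full",
                   "target": "nbd:dest-host:10809:exportname=drive-virtio-disk0",
                   "format": "raw", "mode": "existing"}},
    {"execute": "migrate-set-capabilities",
     "arguments": {"capabilities": [{"capability": "postcopy-ram",
                                     "state": True}]}},
    {"execute": "migrate", "arguments": {"uri": "tcp:dest-host:49152"}},
    {"execute": "migrate-start-postcopy"})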

Version-Release number of selected component (if applicable):

libvirt-1.3.3-1.el7

--- Additional comment from Jiri Denemark on 2016-04-06 12:52:03 BRT ---

Paolo, you seemed to have an idea about which block jobs libvirt should use to implement post-copy storage migration. Could you describe your idea in detail?

--- Additional comment from Paolo Bonzini on 2016-04-07 10:23:34 BRT ---

Sure! It's the opposite of the current NBD flow, which runs the NBD server on the destination and drive-mirror on the source.

Here, the NBD server runs on the source, the qcow2 image on the destination is created with the NBD server as its backing file, and block-stream is used on the destination to do post-copy migration. Unfortunately you cannot switch from pre-copy to post-copy; you have to start the post-copy phase before doing "cont" on the destination, with no previous copy.

I think this makes it less desirable than it is for RAM.
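A rough sketch of what this flow could look like in terms of QMP commands and qemu-img, reusing the qmp() helper and monitor paths from the sketch in the description; the host names, port, drive name and overlay path are placeholders, and waiting for the source vCPUs to stop or for the block job to finish is omitted.

import subprocess

# Source (vCPUs already stopped): export the disk read-only over NBD.
qmp(SRC_QMP,
    {"execute": "nbd-server-start",
     "arguments": {"addr": {"type": "inet",
                            "data": {"host": "0.0.0.0", "port": "10809"}}}},
    {"execute": "nbd-server-add",
     "arguments": {"device": "drive-virtio-disk0", "writable": False}})

# Destination: create a qcow2 overlay whose backing file is the source's NBD
# export, so reads of clusters that have not been copied yet are satisfied
# over the network.  The destination QEMU is then started with this overlay.
subprocess.check_call(
    ["qemu-img", "create", "-f", "qcow2",
     "-o", "backing_file=nbd:source-host:10809:exportname=drive-virtio-disk0,"
           "backing_fmt=raw",
     "/var/lib/libvirt/images/dest-disk0.qcow2"])

# Destination: block-stream pulls everything from the backing (NBD) layer into
# the local overlay and drops the backing link when done; the guest is resumed
# only after the job has been started.  There is no pre-copy pass at all.
qmp(DST_QMP,
    {"execute": "block-stream",
     "arguments": {"device": "drive-virtio-disk0"}},
    {"execute": "cont"})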

--- Additional comment from Dr. David Alan Gilbert on 2016-04-07 10:52:00 BRT ---

(In reply to Paolo Bonzini from comment #3)
> Sure! It's the opposite of the current NBD flow, which runs the NBD server
> on the destination and drive-mirror on the source.
> 
> Here, the NBD server runs on the source, the qcow2 image on the destination
> is created with the NBD server as its backing file, and block-stream is used
> on the destination to do post-copy migration. Unfortunately you cannot
> switch from pre-copy to post-copy; you have to start the post-copy phase
> before doing "cont" on the destination, with no previous copy.
> 
> I think this makes it less desirable than it is for RAM.

With this I don't understand how you know when you can start running the destination; I also don't understand why you can't do a pre-copy phase as for RAM.

Anyway, isn't the easier story here just to use the existing block write throttling to set the block write bandwidth lower than the network bandwidth? Unlike RAM, we've already got a throttle that should be able to limit the dirtying rate to whatever we need.
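For reference, that throttle is reachable over QMP via block_set_io_throttle; a minimal sketch, reusing the qmp() helper from the description, with a made-up 30 MB/s limit and drive name:

# Cap guest writes on the mirrored drive below the migration link's bandwidth
# so the drive-mirror job can always catch up.  All six throttle fields are
# mandatory in this command; 0 means "no limit".
qmp(SRC_QMP,
    {"execute": "block_set_io_throttle",
     "arguments": {"device": "drive-virtio-disk0",
                   "bps": 0,
                   "bps_rd": 0,
                   "bps_wr": 30 * 1024 * 1024,
                   "iops": 0,
                   "iops_rd": 0,
                   "iops_wr": 0}})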

--- Additional comment from Jiri Denemark on 2016-04-07 10:56:26 BRT ---

Apparently (confirmed with Paolo on IRC), we'd have to start block-stream jobs after stopping vCPUs on the source and before starting the destination.

--- Additional comment from Jiri Denemark on 2016-04-07 10:57:59 BRT ---

That said, I think throttling disk I/O is really a better idea.

What do the OpenStack guys think about it, Daniel?

--- Additional comment from Daniel Berrange on 2016-04-07 11:21:46 BRT ---

The desirable thing about post-copy is that it guarantees completion in a finite amount of time, without having to do any kind of calculations wrt the guest dirtying rate vs constantly changing network bandwidth availability. If you use pre-copy with bandwidth throttling, you have the problem of figuring out what level of throttling you need to apply in order to ensure the guest completes in a finite, predictable time. This is pretty non-trivial unless you are super conservative and apply a very strict bandwidth limit, which in turn has the problem that you're probably slowing the guest down more than it actually needs to be. This is the prime reason OpenStack is much more enthusiastic about using post-copy than about throttling guest CPUs with cgroups or the auto-converge feature of QEMU.


So from the POV of ease of management, having disks able to support post-copy in the same way as RAM is really very desirable for OpenStack.

--- Additional comment from Daniel Berrange on 2016-04-07 11:39:02 BRT ---

I'm thinking about the proposals on qemu-devel wrt extending the NBD server to support live backups, and how the NBD server would expose a fake block allocation bitmap to represent dirty blocks. I think that functionality could be usable as the foundation for doing combined switchable pre+post-copy for disk too.

First, have the NBD server always run on the source host. During the initial pre-copy phase, the NBD client on the target host would do a one-pass copy of the whole dataset. Thereafter it would loop, querying the fake "block allocation bitmap" from the source to get the list of blocks which have been dirtied since the initial copy, and copying those.

When switching to post-copy, it would continue to fetch all remaining dirty blocks, but at any point it can make a request to fetch a specific block being accessed right now by the VM.
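None of this exists in QEMU or the NBD protocol today, so the following is only a sketch of the destination-side loop being proposed; read_block, write_block, query_dirty_blocks, postcopy_requested and demand_fetches (e.g. a queue.Queue fed by the VM's block layer) are hypothetical stand-ins for the NBD operations described above.

def pull_disk(read_block, write_block, query_dirty_blocks, total_blocks,
              postcopy_requested, demand_fetches):
    # Initial pre-copy pass over the whole dataset.
    for blk in range(total_blocks):
        write_block(blk, read_block(blk))

    # Pre-copy loop: re-copy whatever was dirtied since the previous pass.
    while not postcopy_requested():
        for blk in query_dirty_blocks():
            write_block(blk, read_block(blk))

    # Post-copy: the VM now runs on the destination.  Blocks it touches are
    # queued on demand_fetches and pulled first; the remaining dirty blocks
    # are fetched in the background until nothing is left.
    remaining = set(query_dirty_blocks())
    while remaining:
        if not demand_fetches.empty():
            blk = demand_fetches.get()
            if blk not in remaining:
                continue                     # already copied
        else:
            blk = next(iter(remaining))
        write_block(blk, read_block(blk))
        remaining.discard(blk)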

--- Additional comment from Jiri Denemark on 2016-04-07 11:52:19 BRT ---

So it seems you agree that implementing storage migration using the currently available block-stream job is not something you'd want from libvirt. In that case we need to clone this BZ against QEMU to request the new functionality.

--- Additional comment from Daniel Berrange on 2016-04-07 15:12:31 BRT ---

It would be nice if we could implement it using existing functionality, but from what Paolo describes it doesn't sound like that is possible: you can do pre-copy or post-copy, but you can't switch from pre to post on the fly, which is what we'd need to be able to do to match the RAM handling. So it seems we likely need new QEMU functionality.

Comment 1 xianwang 2017-03-13 03:20:10 UTC
Hi, Kevin,
As shown, this bug is about post-copy migration and storage VM migration. Because it has the "FutureFeature" keyword, if we need to add test cases for it, which test plan should own this case: "migration" or "storage vm migration"?

