Bug 1401173
Summary: Target host in nova DB got updated to new compute while migration failed and qemu-kvm process was still running on source host.

Product: Red Hat Enterprise Linux 7
Component: libvirt
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: rc
Target Release: ---
Keywords: ZStream
Fixed In Version: libvirt-3.9.0-1.el7
Type: Bug
Last Closed: 2018-04-10 10:39:40 UTC

Reporter: Martin Schuppert <mschuppe>
Assignee: Jiri Denemark <jdenemar>
QA Contact: zhe peng <zpeng>
CC: berrange, dasmith, dgilbert, dyuan, eglynn, fjin, jdenemar, jsuchane, kchamart, libvirt-maint, mas-hatada, mfuruta, mkalinin, mschuppe, mtessun, pbarta, pmorey, rbalakri, rbryant, rhel-osp-bz, sbauza, sferdjao, sgordon, smykhail, srevivo, vromanso, xuzhang, yafu

Bug Blocks: 1530130 (view as bug list)
Comment 4 | Jiri Denemark | 2016-12-05 19:43:41 UTC
(In reply to Jiri Denemark from comment #4)
> Well, the migration code in Nova is just wrong. Calling virDomainGetJobInfo
> or virDomainGetJobStats (which is newer and better) is fine for getting data
> about migration progress, and optionally using the data to tune some
> migration parameters in case the migration is not converging. But using
> migration statistics to deduce the result of a migration is very fragile and
> should never be done.
>
> Only the migration API provides a clear indication of a successful or failed
> migration.

Would it be possible to get some pointers on this migration API? I did not find much information on it in https://libvirt.org/devguide.html

Apparently Nova uses virDomainMigrateToURI*, as can be seen at

https://code.engineering.redhat.com/gerrit/gitweb?p=nova.git;a=blob;f=nova/virt/libvirt/driver.py;h=3d422f1fd61faab0f7a4261761571e7fe96371c7;hb=refs/heads/rhos-8.0-patches#l5853
https://code.engineering.redhat.com/gerrit/gitweb?p=nova.git;a=blob;f=nova/virt/libvirt/driver.py;h=3d422f1fd61faab0f7a4261761571e7fe96371c7;hb=refs/heads/rhos-8.0-patches#l5871
https://code.engineering.redhat.com/gerrit/gitweb?p=nova.git;a=blob;f=nova/virt/libvirt/driver.py;h=3d422f1fd61faab0f7a4261761571e7fe96371c7;hb=refs/heads/rhos-8.0-patches#l5876

and the caller does not check the return values at all. Anyway, our API documentation can be found at http://libvirt.org/html/index.html; specifically, the migration API Nova is using is described at http://libvirt.org/html/libvirt-libvirt-domain.html#virDomainMigrateToURI3
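For reference, the distinction can be made concrete with a minimal libvirt-python sketch (purely illustrative, not Nova's code; the domain name, destination URI, and flags are placeholders): the Python binding surfaces the migration API's verdict directly by raising libvirtError when the migration fails, so nothing needs to be inferred from job statistics.

~~~
# Illustrative sketch only -- not the Nova implementation.
# Assumes a running domain named "guest" and a reachable destination libvirtd.
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('guest')

flags = libvirt.VIR_MIGRATE_LIVE | libvirt.VIR_MIGRATE_PEER2PEER
try:
    # virDomainMigrateToURI3: the result of the migration API itself is
    # the authoritative success/failure indication.
    dom.migrateToURI3('qemu+ssh://dest/system', {}, flags)
except libvirt.libvirtError as exc:
    # The migration failed; do not update any records to claim the
    # instance now lives on the destination.
    print('migration failed: %s' % exc)
~~~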
(In reply to Dr. David Alan Gilbert from comment #2)
> The original migration failure is
> 2016-11-22T17:07:22.131885Z qemu-kvm: VQ 2 size 0x80 < last_avail_idx 0x1 - used_idx 0x2
> and is:
> https://bugzilla.redhat.com/show_bug.cgi?id=1388465
>
> which is already fixed and working its way through Z streaming;
>
> However, that doesn't explain why NOVA thought it succeeded.

Thanks David for the pointer to that issue.

There are still some computes with instances running the qemu-kvm-rhev from RHEL 7.2, and we want to live-migrate them to a compute running the qemu-kvm-rhev from RHEL 7.3, where the issue is fixed.

Is there a way to ensure we can live-migrate instances without triggering "migration failed with: VQ 2 size 0x80 < last_avail_idx 0x1 - used_idx 0x2"?

(In reply to Martin Schuppert from comment #7)
> Is there a way to ensure we can live-migrate instances without triggering
> "migration failed with: VQ 2 size 0x80 < last_avail_idx 0x1 - used_idx 0x2"?

Not that I know of; let's ask lprosek to be sure. Note that it generally only affects Windows guests that have the balloon device enabled. Linux guests should be OK.

(In reply to Dr. David Alan Gilbert from comment #8)
> Not that I know of; let's ask lprosek to be sure.

What David said. Windows guests with the balloon device enabled and the balloon virtio-win driver installed, but without the balloon service running (blnsvr.exe in the virtio-win ISO), are susceptible to this issue. You have to migrate twice to hit it.

Disabling the guest balloon driver before starting the migration should get around it (Device Manager -> System Devices -> VirtIO Balloon Driver -> Disable). You can then re-enable it on the destination. If the balloon is absolutely needed and can't be disabled, try starting the balloon service (blnsvr -i or blnsvr -r), although I suspect that this would help only on the first migration - the one that would have succeeded anyway. So that may be worth doing only if you happen to have to do a 7.2 -> 7.2 migration.

I think we should re-prioritize this issue for Nova at a lower priority. This bug is happening because a lower-level component was in an unpredictable situation. However, it is clear that the way Nova acknowledges the live-migration process is not sufficient and should be improved. I have started a series of patches for Nova. It is still WIP, but I hope to make something robust available soon.

(In reply to Sahid Ferdjaoui from comment #10)
> I think we should re-prioritize this issue for Nova at a lower priority.

IMHO this shouldn't be deprioritised. While qemu screwed up in this case, running two qemu processes at the same time is the worst-case failure mode and is what causes the corruption - this should NEVER happen!

(In reply to Dr. David Alan Gilbert from comment #11)
> IMHO this shouldn't be deprioritised.

I agree that the reason the live migration failed was at the lower level, and that this was fixed, but nova should never have treated the migration as successful. The main concern is that we will see something similar again (with an unknown end result) if other lower-level issues occur and we do not change nova to fail as well in such a situation. Therefore I think we should work on improving nova with priority.
Perfect, thx, I have seen that.

This bugzilla has been removed from the release and needs to be reviewed and triaged for another target release.

(In reply to Jiri Denemark from comment #4)
> Only the migration API provides a clear indication of a successful or failed
> migration.

Unfortunately that is not correct. The virDomainMigrate return status cannot be trusted. If there is a network failure at a certain point, it can return failure despite the fact that the VM is now successfully running on the target. OpenStack hit this exact problem, which is what prompted the rewrite to use virDomainJobInfo. Now that API has some tricky points too, but resolving those problem scenarios was easier than resolving what the correct action is when virDomainMigrate returns an error. So I don't think the Nova code as it stands is wrong, and going back to virDomainMigrate would reintroduce the flaws we previously fixed.

Could you describe the solution with the virDomainMigrate return status versus virDomainJobInfo in more detail? I'm curious to see it, since I can't see how virDomainJobInfo can give you reliable info when a split brain occurs.

Neither can give you reliable info when a split brain occurs. No matter which we pick, we needed extra logic in Nova to figure out the state of the system when an error occurs. Given that we already needed to use virDomainGetJobInfo to monitor the progress of the migration, it was easier to deal with resolving the problems when virDomainGetJobInfo errors than when virDomainMigrate errors. IOW, neither API is satisfactory on its own, so using virDomainGetJobInfo is not wrong or worse, just different and in our case easier to deal with.

Hmm, I don't know how either of these ties up to QEMU's view of success/failure. From QEMU's point of view, I'd say you need the migration completed event/status to be sure.

Hi, I also got a similar report from a Japanese customer, as they're about to upgrade from OSP7 to 8, 9 and up to 10. They are in a hurry and thinking of upgrading from OSP7 to 10 directly, and had already upgraded some of their computes from OSP7 to 8. This time they carried out:

1) LIVE migration: qemu-kvm-rhev-2.3.0-31.el7_2.13 (OSP7) -> qemu-kvm-rhev-2.6.0-28.el7_3.9 (OSP8).

And they also tried:

2) COLD migration: OSP7 -> OSP8.

I have two questions.

After 1) LIVE migration, they found the live migration failed with "VQ 2 size 0x80...", "CPU feature arat not found", and "custom memory allocation vtable not supported"; that's expected. But they also found that libvirtd was gone and no longer responding to virsh and nova.

Q1. Is that also expected behavior for this BZ, or should I report it as another issue in a new BZ?

After 2) COLD migration, they saw the following two issues:

2-1)
~~~
[stack@wbc-director01p-prd-p ~]$ nova migrate --poll 5bc29f8d-859f-4be3-8a3c-bb4092a95b46

Server migrating... 0% complete
Error migrating server
ERROR (InstanceInErrorState): Object action create failed because: cannot create a Migration object without a migration_type set
~~~

2-2) On another attempt, a cold-migration failure caused file system corruption on the instance. (I'm asking the customer for details now.)
Q2. Is this also expected, and will this issue basically affect COLD migration too? If not, I'll go and file a new BZ for the cold-migration issue.

The customer definitely needs to upgrade to OSP10 soon, since their product is on OSP10 and its deadline is approaching. I really appreciate your help!

I think we should close this issue as WONTFIX, since the root cause is not related to Nova but to QEMU bug 1388465, and it seems impossible to do better than what we already have to handle migration errors (comment #18). If there is no objection, I'm going to close it tomorrow.

This doesn't seem right to me; getting confused about the state of the migration is really bad; it has made the qemu failure much worse in this case, with the possibility of a failed migration turning into a failed VM and possible corruption.

Please work with Dan and Jiri to figure out the best way to get libvirt to give you what you need.

(In reply to Dr. David Alan Gilbert from comment #25)
> Please work with Dan and Jiri to figure out the best way to get libvirt to
> give you what you need.

Yes, you are probably right. Daniel, I proposed upstream a solution which merges the result of the migration API and the domain job info [0], so both have to report a good result for the migration to be considered successful. Can you please have a look and let me know whether that is a direction we could consider? If not, then perhaps, as suggested by David, we should open a BZ against libvirt to work on a solution that ensures the migration status.

[0] https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bug/1653718

I don't think that proposed change to Nova is right or even needed. The domain job info should accurately reflect the status, and if there's a case where it does not (of which I've seen no evidence so far), then that needs fixing in libvirt.

So it seems that domjobinfo returned VIR_DOMAIN_JOB_NONE, and in such circumstances we make a best effort by checking whether or not the domain is running on the source host. If it is not, we consider the migration a success. We could update Nova to simply consider the migration a failure whenever domjobinfo returns VIR_DOMAIN_JOB_NONE, or try to understand in libvirt how such a result is possible.

You can't assume JOB_NONE means failed - very often it will mean success. That's why we check whether the guest is still running or not.

Is this something we want to investigate in libvirt - why JOB_NONE can happen even on success? If so, we should move this issue to libvirt. If not, I would refer to my comment #24 and close this issue as WONTFIX.

The virDomainGetJobInfo and virDomainGetJobStats APIs report statistics about a (migration) job while it is running. Thus when there is no job running, either because none was started or because the job finished (successfully or not), the API reports VIR_DOMAIN_JOB_NONE, indicating there is no job running. You may be lucky and get a COMPLETED or FAILED job type during the short window between the end of the migration and the end of the migration API, which cleans the job statistics, but most of the time NONE will be returned. Seeing VIR_DOMAIN_JOB_NONE just means the migration is no longer running. If the domain is still running on the source at that point, the migration obviously failed.
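To illustrate the semantics described above, here is a minimal sketch (hypothetical, not the Nova code; the helper name and polling interval are made up) of what a job monitor on the source host has to do: treat JOB_NONE only as "no job is running any more" and derive the outcome from whether the domain is still active on the source.

~~~
# Hypothetical sketch of the logic described above -- not Nova's code.
import time
import libvirt

def migration_succeeded(dom):
    """Poll the job on the source host; decide the outcome once it is gone."""
    while True:
        try:
            info = dom.jobInfo()  # info[0] is the job type
        except libvirt.libvirtError:
            break  # the domain was already cleaned up on the source
        if info[0] == libvirt.VIR_DOMAIN_JOB_NONE:
            # No job running any more; NONE alone says nothing about
            # success or failure.
            break
        time.sleep(0.5)
    try:
        # A domain still active on the source means the migration failed.
        return not dom.isActive()
    except libvirt.libvirtError:
        return True  # the domain no longer exists on the source at all
~~~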
I looked at the code carefully and didn't see a way NONE could be returned after a successful migration before the domain was killed, so that could not cause any confusion to Nova. However, this is not what happened here: in this bug, libvirt reported a failed migration and the domain was apparently still running on the source, yet Nova thought the migration was successful. According to the Nova code, this means it got VIR_DOMAIN_JOB_COMPLETED (getting NONE would result in a check for the domain running on the source and transforming the result to FAILED). This happened at 18:07:22.195, while Nova reported a libvirt error originating from the destination host at 18:07:22.372. That is, the error happened at the very end of the migration, likely after the source QEMU had already sent all data to the destination.

Due to a bug in libvirt, we used to report the COMPLETED job type immediately when the source QEMU finished migration, which unfortunately did not mean the destination was able to load the data and start vCPUs. When the QEMU process died on the destination host, or the vCPUs could not be started, libvirt would report a migration failure. The value returned by libvirt's migration API was correct, but if Nova managed to check the job status in the meantime, it would see the VIR_DOMAIN_JOB_COMPLETED job type.

This bug should already be fixed upstream by:

commit 3f2d6d829eb8de0348fcbd58d654b29d5c5bebc2
Refs: v3.7.0-29-g3f2d6d829
Author:     Nikolay Shirokovskiy <nshirokovskiy>
AuthorDate: Fri Sep 1 09:49:31 2017 +0300
Commit:     Jiri Denemark <jdenemar>
CommitDate: Thu Sep 7 12:52:36 2017 +0200

    qemu: migration: don't expose incomplete job as complete

    In case of real migration (not migrating to file on save, dump etc)
    migration info is not complete at time qemu finishes migration in
    normal (non postcopy) mode. We need to update disks stats, downtime
    info etc. Thus let's not expose this job status as completed. To
    archive this let's set status to 'qemu completed' after qemu reports
    migration is finished. It is not visible as complete job to clients.
    Cookie code on confirm phase will finally turn job into completed.
    As we don't need more things to do when migrating to file status is
    set to 'completed' as before in this case.

    Signed-off-by: Jiri Denemark <jdenemar>

Steps to reproduce:

1. On the destination host attach gdb to libvirtd, set a breakpoint on qemuMigrationFinish, and let the daemon continue:

   # gdb -p $(pidof libvirtd)
   (gdb) br qemuMigrationFinish
   (gdb) c

2. Migrate a domain to the destination host.

3. Once gdb stops at the breakpoint, check 'virsh domjobinfo DOM' on the source host.

4. Kill the qemu-kvm process on the destination host.

5. Let gdb continue executing libvirtd (this will likely need to be done twice, since gdb may stop at SIGPIPE after the first one):

   (gdb) c

6. Check that the migration failed and that the domain is still running on the source.

With an unfixed version, you'll see "Job type: Completed" in step 3 even though the migration will fail in the end. A fixed libvirt will report an ongoing migration in step 3: "Job type: Unbounded".
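Steps 1 and 5 can presumably also be scripted in one command (assuming a gdb new enough to support -ex, and installed libvirt debug symbols):

~~~
# Attach, set the breakpoint, and continue in a single invocation;
# gdb drops to its prompt once qemuMigrationFinish is hit.
gdb -p $(pidof libvirtd) -ex 'break qemuMigrationFinish' -ex 'continue'
~~~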
Thanks Jiri for providing the steps to reproduce.

I can reproduce this bug with build libvirt-2.0.0-10.el7_3.9.x86_64, following the same steps as comment 35. After step 3 I get:

# virsh domjobinfo rhel7
Job type:         Completed

Verified with build libvirt-3.9.0-5.el7.x86_64. After step 3 I get:

# virsh domjobinfo rhel7.3
Job type:         Unbounded
Operation:        Outgoing migration
Time elapsed:     2790 ms
Data processed:   284.269 MiB
Data remaining:   553.668 MiB
Data total:       1.102 GiB
Memory processed: 284.269 MiB
Memory remaining: 553.668 MiB
Memory total:     1.102 GiB
Memory bandwidth: 112.759 MiB/s
Dirty rate:       0 pages/s
Page size:        4096 bytes
Iteration:        1
Constant pages:   74773
Normal pages:     72467
Normal data:      283.074 MiB
Expected downtime: 300 ms
Setup time:       7 ms

After step 6, on the source host:

# virsh migrate rhel7.3 --live qemu+ssh://$target_host/system --verbose
Migration: [100 %]error: internal error: qemu unexpectedly closed the monitor: red_qxl_loadvm_commands:

# virsh list --all
 Id    Name      State
----------------------------------------------------
 5     rhel7.3   running

Moving to verified.

Since the problem described in this bug report should be resolved by a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0704