Bug 1401173
Summary: Target host in nova DB got updated to new compute while migration failed and qemu-kvm process was still running on source host.

Product: Red Hat Enterprise Linux 7
Component: libvirt
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: rc
Target Release: ---
Keywords: ZStream
Fixed In Version: libvirt-3.9.0-1.el7
Type: Bug
Last Closed: 2018-04-10 10:39:40 UTC

Reporter: Martin Schuppert <mschuppe>
Assignee: Jiri Denemark <jdenemar>
QA Contact: zhe peng <zpeng>
CC: berrange, dasmith, dgilbert, dyuan, eglynn, fjin, jdenemar, jsuchane, kchamart, libvirt-maint, mas-hatada, mfuruta, mkalinin, mschuppe, mtessun, pbarta, pmorey, rbalakri, rbryant, rhel-osp-bz, sbauza, sferdjao, sgordon, smykhail, srevivo, vromanso, xuzhang, yafu

Bug Blocks: 1530130 (view as bug list)
Comment 4 | Jiri Denemark | 2016-12-05 19:43:41 UTC
(In reply to Jiri Denemark from comment #4)
> Well, the migration code in Nova is just wrong. Calling virDomainGetJobInfo
> or virDomainGetJobStats (which is newer and better) is fine for getting data
> about migration progress, and optionally using the data to tune some
> migration parameters in case the migration is not converging. But using
> migration statistics to deduce the result of a migration is very fragile and
> should never be done.
>
> Only the migration API provides a clear indication of a successful or failed
> migration.

Would it be possible to get some pointers on this migration API? I did not find much information on it in https://libvirt.org/devguide.html

Apparently Nova uses virDomainMigrateToURI*, as can be seen at

https://code.engineering.redhat.com/gerrit/gitweb?p=nova.git;a=blob;f=nova/virt/libvirt/driver.py;h=3d422f1fd61faab0f7a4261761571e7fe96371c7;hb=refs/heads/rhos-8.0-patches#l5853
https://code.engineering.redhat.com/gerrit/gitweb?p=nova.git;a=blob;f=nova/virt/libvirt/driver.py;h=3d422f1fd61faab0f7a4261761571e7fe96371c7;hb=refs/heads/rhos-8.0-patches#l5871
https://code.engineering.redhat.com/gerrit/gitweb?p=nova.git;a=blob;f=nova/virt/libvirt/driver.py;h=3d422f1fd61faab0f7a4261761571e7fe96371c7;hb=refs/heads/rhos-8.0-patches#l5876

and the caller does not check the return values at all. Anyway, our API documentation can be found at http://libvirt.org/html/index.html; specifically, the migration API Nova is using is described at http://libvirt.org/html/libvirt-libvirt-domain.html#virDomainMigrateToURI3
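For reference, the distinction can be made concrete with a minimal libvirt-python sketch (purely illustrative, not Nova's code; the domain name, destination URI, and flags are placeholders): the Python binding surfaces the migration API's verdict directly by raising libvirtError when the migration fails, so nothing needs to be inferred from job statistics.

~~~
# Illustrative sketch only -- not the Nova implementation.
# Assumes a running domain named "guest" and a reachable destination libvirtd.
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('guest')

flags = libvirt.VIR_MIGRATE_LIVE | libvirt.VIR_MIGRATE_PEER2PEER
try:
    # virDomainMigrateToURI3: the result of the migration API itself is
    # the authoritative success/failure indication.
    dom.migrateToURI3('qemu+ssh://dest/system', {}, flags)
except libvirt.libvirtError as exc:
    # The migration failed; do not update any records to claim the
    # instance now lives on the destination.
    print('migration failed: %s' % exc)
~~~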
(In reply to Dr. David Alan Gilbert from comment #2)
> The original migration failure is
> 2016-11-22T17:07:22.131885Z qemu-kvm: VQ 2 size 0x80 < last_avail_idx 0x1 - used_idx 0x2
> and is:
> https://bugzilla.redhat.com/show_bug.cgi?id=1388465
>
> which is already fixed and working its way through Z streaming;
>
> However, that doesn't explain why NOVA thought it succeeded.

Thanks David for the pointer to that issue.

There are still some computes with instances running the qemu-kvm-rhev from RHEL 7.2, and we want to live-migrate them to a compute running the qemu-kvm-rhev from RHEL 7.3, where the issue is fixed.

Is there a way to ensure we can live-migrate instances without triggering "migration failed with: VQ 2 size 0x80 < last_avail_idx 0x1 - used_idx 0x2"?

(In reply to Martin Schuppert from comment #7)
> Is there a way to ensure we can live-migrate instances without triggering
> "migration failed with: VQ 2 size 0x80 < last_avail_idx 0x1 - used_idx 0x2"?

Not that I know of; let's ask lprosek to be sure. Note that it generally only affects Windows guests that have the balloon device enabled. Linux guests should be OK.

(In reply to Dr. David Alan Gilbert from comment #8)
> Not that I know of; let's ask lprosek to be sure.

What David said. Windows guests with the balloon device enabled and the balloon virtio-win driver installed, but without the balloon service running (blnsvr.exe in the virtio-win ISO), are susceptible to this issue. You have to migrate twice to hit it.

Disabling the guest balloon driver before starting the migration should get around it (Device Manager -> System Devices -> VirtIO Balloon Driver -> Disable). You can then re-enable it on the destination. If the balloon is absolutely needed and can't be disabled, try starting the balloon service (blnsvr -i or blnsvr -r), although I suspect that this would help only on the first migration - the one that would have succeeded anyway. So that may be worth doing only if you happen to have to do a 7.2 -> 7.2 migration.

I think we should re-prioritize this issue for Nova at a lower priority. This bug is happening because a lower-level component was in an unpredictable situation. However, it is clear that the way Nova acknowledges the live-migration process is not sufficient and should be improved. I have started a series of patches for Nova. It is still WIP, but I hope to make something robust available soon.

(In reply to Sahid Ferdjaoui from comment #10)
> I think we should re-prioritize this issue for Nova at a lower priority.

IMHO this shouldn't be deprioritised. While qemu screwed up in this case, running two qemu processes at the same time is the worst-case failure mode and is what causes the corruption - this should NEVER happen!

(In reply to Dr. David Alan Gilbert from comment #11)
> IMHO this shouldn't be deprioritised.

I agree that the reason the live migration failed was at the lower level, and that this was fixed, but nova should never have treated the migration as successful. The main concern is that we will see something similar again (with an unknown end result) if other lower-level issues occur and we do not change nova to fail as well in such a situation. Therefore I think we should work on improving nova with priority.
Perfect, thx, I have seen that.

This bugzilla has been removed from the release and needs to be reviewed and triaged for another target release.

(In reply to Jiri Denemark from comment #4)
> Only the migration API provides a clear indication of a successful or failed
> migration.

Unfortunately that is not correct. The virDomainMigrate return status cannot be trusted. If there is a network failure at a certain point, it can return failure despite the fact that the VM is now successfully running on the target. OpenStack hit this exact problem, which is what prompted the rewrite to use virDomainJobInfo. Now that API has some tricky points too, but resolving those problem scenarios was easier than resolving what the correct action is when virDomainMigrate returns an error. So I don't think the Nova code as it stands is wrong, and going back to virDomainMigrate would reintroduce the flaws we previously fixed.

Could you describe the solution with the virDomainMigrate return status versus virDomainJobInfo in more detail? I'm curious to see it, since I can't see how virDomainJobInfo can give you reliable info when a split brain occurs.

Neither can give you reliable info when a split brain occurs. No matter which we pick, we needed extra logic in Nova to figure out the state of the system when an error occurs. Given that we already needed to use virDomainGetJobInfo to monitor the progress of the migration, it was easier to deal with resolving the problems when virDomainGetJobInfo errors than when virDomainMigrate errors. IOW, neither API is satisfactory on its own, so using virDomainGetJobInfo is not wrong or worse, just different and in our case easier to deal with.

Hmm, I don't know how either of these ties up to QEMU's view of success/failure. From QEMU's point of view, I'd say you need the migration completed event/status to be sure.

Hi, I also got a similar report from a Japanese customer, as they're about to upgrade from OSP7 to 8, 9 and up to 10. They are in a hurry and thinking of upgrading from OSP7 to 10 directly, and had already upgraded some of their computes from OSP7 to 8. This time they carried out:

1) LIVE migration: qemu-kvm-rhev-2.3.0-31.el7_2.13 (OSP7) -> qemu-kvm-rhev-2.6.0-28.el7_3.9 (OSP8).

And they also tried:

2) COLD migration: OSP7 -> OSP8.

I have two questions.

After 1) LIVE migration, they found the live migration failed with "VQ 2 size 0x80...", "CPU feature arat not found", and "custom memory allocation vtable not supported"; that's expected. But they also found that libvirtd was gone and no longer responding to virsh and nova.

Q1. Is that also expected behavior for this BZ, or should I report it as another issue in a new BZ?

After 2) COLD migration, they saw the following two issues:

2-1)
~~~
[stack@wbc-director01p-prd-p ~]$ nova migrate --poll 5bc29f8d-859f-4be3-8a3c-bb4092a95b46

Server migrating... 0% complete
Error migrating server
ERROR (InstanceInErrorState): Object action create failed because: cannot create a Migration object without a migration_type set
~~~

2-2) On another attempt, a cold-migration failure caused file system corruption on the instance. (I'm asking the customer for details now.)
Q2. Is this also expected, and will this issue basically affect COLD migration too? If not, I'll go and file a new BZ for the cold-migration issue.

The customer definitely needs to upgrade to OSP10 soon, since their product is on OSP10 and its deadline is approaching. I really appreciate your help!

I think we should close this issue as WONTFIX, since the root cause is not related to Nova but to QEMU bug 1388465, and it seems impossible to do better than what we already have to handle migration errors (comment #18). If there is no objection, I'm going to close it tomorrow.

This doesn't seem right to me; getting confused about the state of the migration is really bad; it has made the qemu failure much worse in this case, with the possibility of a failed migration turning into a failed VM and possible corruption.

Please work with Dan and Jiri to figure out the best way to get libvirt to give you what you need.

(In reply to Dr. David Alan Gilbert from comment #25)
> Please work with Dan and Jiri to figure out the best way to get libvirt to
> give you what you need.

Yes, you are probably right. Daniel, I proposed upstream a solution which merges the result of the migration API and the domain job info [0], so both have to report a good result for the migration to be considered successful. Can you please have a look and let me know whether that is a direction we could consider? If not, then perhaps, as suggested by David, we should open a BZ against libvirt to work on a solution that ensures the migration status.

[0] https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bug/1653718

I don't think that proposed change to Nova is right or even needed. The domain job info should accurately reflect the status, and if there's a case where it does not (of which I've seen no evidence so far), then that needs fixing in libvirt.

So it seems that domjobinfo returned VIR_DOMAIN_JOB_NONE, and in such circumstances we make a best effort by checking whether or not the domain is running on the source host. If it is not, we consider the migration a success. We could update Nova to simply consider the migration a failure whenever domjobinfo returns VIR_DOMAIN_JOB_NONE, or try to understand in libvirt how such a result is possible.

You can't assume JOB_NONE means failed - very often it will mean success. That's why we check whether the guest is still running or not.

Is this something we want to investigate in libvirt - why JOB_NONE can happen even on success? If so, we should move this issue to libvirt. If not, I would refer to my comment #24 and close this issue as WONTFIX.

The virDomainGetJobInfo and virDomainGetJobStats APIs report statistics about a (migration) job while it is running. Thus when there is no job running, either because none was started or because the job finished (successfully or not), the API reports VIR_DOMAIN_JOB_NONE, indicating there is no job running. You may be lucky and get a COMPLETED or FAILED job type during the short window between the end of the migration and the end of the migration API, which cleans the job statistics, but most of the time NONE will be returned. Seeing VIR_DOMAIN_JOB_NONE just means the migration is no longer running. If the domain is still running on the source at that point, the migration obviously failed.
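To illustrate the semantics described above, here is a minimal sketch (hypothetical, not the Nova code; the helper name and polling interval are made up) of what a job monitor on the source host has to do: treat JOB_NONE only as "no job is running any more" and derive the outcome from whether the domain is still active on the source.

~~~
# Hypothetical sketch of the logic described above -- not Nova's code.
import time
import libvirt

def migration_succeeded(dom):
    """Poll the job on the source host; decide the outcome once it is gone."""
    while True:
        try:
            info = dom.jobInfo()  # info[0] is the job type
        except libvirt.libvirtError:
            break  # the domain was already cleaned up on the source
        if info[0] == libvirt.VIR_DOMAIN_JOB_NONE:
            # No job running any more; NONE alone says nothing about
            # success or failure.
            break
        time.sleep(0.5)
    try:
        # A domain still active on the source means the migration failed.
        return not dom.isActive()
    except libvirt.libvirtError:
        return True  # the domain no longer exists on the source at all
~~~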
I looked at the code carefully and didn't see a way NONE could be returned after a successful migration before the domain was killed, so that could not cause any confusion to Nova. However, this is not what happened here: in this bug, libvirt reported a failed migration and the domain was apparently still running on the source, yet Nova thought the migration was successful. According to the Nova code, this means it got VIR_DOMAIN_JOB_COMPLETED (getting NONE would result in a check for the domain running on the source and transforming the result to FAILED). This happened at 18:07:22.195, while Nova reported a libvirt error originating from the destination host at 18:07:22.372. That is, the error happened at the very end of the migration, likely after the source QEMU had already sent all data to the destination.

Due to a bug in libvirt, we used to report the COMPLETED job type immediately when the source QEMU finished migration, which unfortunately did not mean the destination was able to load the data and start vCPUs. When the QEMU process died on the destination host, or the vCPUs could not be started, libvirt would report a migration failure. The value returned by libvirt's migration API was correct, but if Nova managed to check the job status in the meantime, it would see the VIR_DOMAIN_JOB_COMPLETED job type.

This bug should already be fixed upstream by:

commit 3f2d6d829eb8de0348fcbd58d654b29d5c5bebc2
Refs: v3.7.0-29-g3f2d6d829
Author:     Nikolay Shirokovskiy <nshirokovskiy>
AuthorDate: Fri Sep 1 09:49:31 2017 +0300
Commit:     Jiri Denemark <jdenemar>
CommitDate: Thu Sep 7 12:52:36 2017 +0200

    qemu: migration: don't expose incomplete job as complete

    In case of real migration (not migrating to file on save, dump etc)
    migration info is not complete at time qemu finishes migration in
    normal (non postcopy) mode. We need to update disks stats, downtime
    info etc. Thus let's not expose this job status as completed. To
    archive this let's set status to 'qemu completed' after qemu reports
    migration is finished. It is not visible as complete job to clients.
    Cookie code on confirm phase will finally turn job into completed.
    As we don't need more things to do when migrating to file status is
    set to 'completed' as before in this case.

    Signed-off-by: Jiri Denemark <jdenemar>

Steps to reproduce:

1. On the destination host attach gdb to libvirtd, set a breakpoint on qemuMigrationFinish, and let the daemon continue:

   # gdb -p $(pidof libvirtd)
   (gdb) br qemuMigrationFinish
   (gdb) c

2. Migrate a domain to the destination host.

3. Once gdb stops at the breakpoint, check 'virsh domjobinfo DOM' on the source host.

4. Kill the qemu-kvm process on the destination host.

5. Let gdb continue executing libvirtd (this will likely need to be done twice, since gdb may stop at SIGPIPE after the first one):

   (gdb) c

6. Check that the migration failed and that the domain is still running on the source.

With an unfixed version, you'll see "Job type: Completed" in step 3 even though the migration will fail in the end. A fixed libvirt will report an ongoing migration in step 3: "Job type: Unbounded".
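Steps 1 and 5 can presumably also be scripted in one command (assuming a gdb new enough to support -ex, and installed libvirt debug symbols):

~~~
# Attach, set the breakpoint, and continue in a single invocation;
# gdb drops to its prompt once qemuMigrationFinish is hit.
gdb -p $(pidof libvirtd) -ex 'break qemuMigrationFinish' -ex 'continue'
~~~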
Thanks Jiri for providing the steps to reproduce.

I can reproduce this bug with build libvirt-2.0.0-10.el7_3.9.x86_64, following the same steps as comment 35. After step 3 I get:

# virsh domjobinfo rhel7
Job type:         Completed

Verified with build libvirt-3.9.0-5.el7.x86_64. After step 3 I get:

# virsh domjobinfo rhel7.3
Job type:         Unbounded
Operation:        Outgoing migration
Time elapsed:     2790 ms
Data processed:   284.269 MiB
Data remaining:   553.668 MiB
Data total:       1.102 GiB
Memory processed: 284.269 MiB
Memory remaining: 553.668 MiB
Memory total:     1.102 GiB
Memory bandwidth: 112.759 MiB/s
Dirty rate:       0 pages/s
Page size:        4096 bytes
Iteration:        1
Constant pages:   74773
Normal pages:     72467
Normal data:      283.074 MiB
Expected downtime: 300 ms
Setup time:       7 ms

After step 6, on the source host:

# virsh migrate rhel7.3 --live qemu+ssh://$target_host/system --verbose
Migration: [100 %]error: internal error: qemu unexpectedly closed the monitor: red_qxl_loadvm_commands:

# virsh list --all
 Id    Name      State
----------------------------------------------------
 5     rhel7.3   running

Moving to verified.

Since the problem described in this bug report should be resolved by a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0704