Bug 1970337
Summary: | Fail to get migration failure immediately if yank under multifd migration | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 9 | Reporter: | Li Xiaohui <xiaohli> |
Component: | qemu-kvm | Assignee: | Leonardo Bras <leobras> |
qemu-kvm sub component: | Live Migration | QA Contact: | Li Xiaohui <xiaohli> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | medium | ||
Priority: | medium | CC: | chayang, dgilbert, fjin, jferlan, jinzhao, juzhang, leobras, mdean, mrezanin, quintela, qzhang, virt-maint, yfu, zixchen |
Version: | 9.0 | Keywords: | Triaged |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | qemu-kvm-6.2.0-1.el9 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2022-05-17 12:23:27 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
Li Xiaohui
2021-06-10 09:43:16 UTC
This sounds like the 'yank' hasn't done its job and we're waiting for some type of TCP timeout.

Assigned to Amnon for initial triage per bz process and age of bug created or assigned to virt-maint without triage.

This bug reproduces upstream. So far, I can see that yank is correctly doing its job and calling migration_yank_iochannel(). On the other hand, migration_yank_iochannel() calls qio_channel_shutdown(), which is not enough to abort the migration in the multifd case. Currently I am trying to understand what I should call to abort in the multifd case. (As a test, I called qio_channel_shutdown() on every multifd iochannel and yank worked just fine, but I could not retry migration, because it was still 'ongoing'.)

v1 posted upstream: http://patchwork.ozlabs.org/project/qemu-devel/patch/20210730074043.54260-1-leobras@redhat.com/

Move RHEL-AV bugs to RHEL9. If it is necessary to resolve this in RHEL8, then clone to the current RHEL8 release.

Update: Lukas Straub implemented a fix based on my previous patchset: http://patchwork.ozlabs.org/project/qemu-devel/patch/20210901175857.0858efe1@gecko.fritz.box/

Waiting for the upstream merge in order to start the backporting process.

Looks like the above was included in qemu-6.2, which was recently used to rebase into RHEL 9.0.

I think this can then be moved along in the process, but I want to make sure before actually doing it. I did set the ITR just to ensure it stays on the radar.

Be sure the Devel Whiteboard contains "Fixed in upstream qemu-6.2 commit <commit-id>"... May need to get Mirek to help move this to ON_QA, since this bug wouldn't have been included when he did the update, and we'll need QA to set the ITM.

(In reply to John Ferlan from comment #7)
> Looks like the above was included in qemu-6.2 which was recently used to
> rebase into RHEL 9.0

Yes, that seems correct. The commit-id for this change is 20171ea8950c619f00dc5cfa6136fd489998ffc5, which is the same for the upstream and rhel-9 branches.
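The comment above explains the core of the bug: yanking only shuts down the main migration channel, while the multifd receive threads stay blocked reading their own channels. The following is a hedged, self-contained Python model of that fix idea (not QEMU's actual C code): each "channel" is a plain socket, and the yank callback must shut down every one of them so that every blocked reader wakes up with EOF.

```python
import socket
import threading

def yank_all(channels):
    """Model of the multifd fix: shut down *every* registered channel,
    not just the main migration channel, so all blocked readers wake up."""
    for ch in channels:
        ch.shutdown(socket.SHUT_RDWR)

# A few socketpairs stand in for the main channel plus the multifd channels.
# The sender side simply goes silent, like a cut network cable.
pairs = [socket.socketpair() for _ in range(3)]
receivers = [b for a, b in pairs]

results = []

def reader(sock):
    # Blocks until data arrives or the channel is shut down;
    # after shutdown(), recv() returns b'' (EOF) and the thread exits.
    results.append(sock.recv(16))

threads = [threading.Thread(target=reader, args=(r,)) for r in receivers]
for t in threads:
    t.start()

# Yanking only one channel would leave the other readers blocked;
# yanking all of them releases every reader.
yank_all(receivers)
for t in threads:
    t.join(timeout=5)

print(results)  # every blocked reader saw EOF: [b'', b'', b'']
```

This mirrors the behavior described in the test above ("I called qio_channel_shutdown() in every multifd iochannel and yank worked just fine"); the remaining problem, cleanly failing the migration state so it can be retried, is what the upstream patch addressed.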
> 
> I think this then can be moved along in the process, but want to make sure
> before actually doing it. I did set the ITR just to ensure it stays on the
> radar.
> 
> Be sure the Devel Whiteboard contains "Fixed in upstream qemu-6.2 commit
> <commit-id>"...

Could you please point me to the Devel Whiteboard? (I don't quite recall where to find it)

(In reply to Leonardo Bras from comment #8)
> (In reply to John Ferlan from comment #7)
> > Looks like the above was included in qemu-6.2 which was recently used to
> > rebase into RHEL 9.0
> 
> Yes, it seems correct.
> The commit-id for this change is 20171ea8950c619f00dc5cfa6136fd489998ffc5,
> which is the same for upstream and rhel-9 branches.
> 
> > I think this then can be moved along in the process, but want to make sure
> > before actually doing it. I did set the ITR just to ensure it stays on the
> > radar.
> > 
> > Be sure the Devel Whiteboard contains "Fixed in upstream qemu-6.2 commit
> > <commit-id>"...
> 
> Could you please point the Devel Whiteboard?
> (I don't quite recall where to find it)

It's at the top (search in the bz window)... I updated the bz, moved it to POST, set devel_ack+, and I set DTM=20 (hopefully avoiding the noisy bots)... We will need the ITM to be set in order to get release+ (it's all process stuff). The reality is that it could already be tested with the rebase; we just have to follow the process to get there...

Leaving a needinfo for Mirek, since this was fixed by the qemu-6.2 rebase, but since we've already gone through the errata processes, some systems may need more massaging before moving to ON_QA.

Gating test with qemu-kvm-6.2.0-1.el9 passes; adding Verified:Tested, SanityOnly.

Verified the bug on RHEL 9.0.0 (kernel-5.14.0-39.el9.x86_64 & qemu-kvm-6.2.0-1.el9.x86_64): multifd migration fails immediately when a firewall drop rule is injected on the dst host, and the migration continues and succeeds after firewalld is closed.
I have one question about what the status of the destination qemu should be when migration fails: won't the qemu process on the dst host quit automatically when migration fails? When I test plain and multifd migration, I get two different results:

1. For plain migration, migration is always active on the dst host; luckily I can stop the qemu process with { "execute": "quit" }. QMP and HMP still work well.
2. For multifd migration, QMP and HMP hang when I inject a firewall rule on the dst host, even if I enable the QMP OOB capability with '{"execute":"qmp_capabilities", "arguments":{"enable":["oob"]}}'. The only way I can stop the dst qemu process is to kill it with "kill -9 $dst_qemu_pid".

I'm not sure whether this is a bug; could someone help?

This *might* be OK; adding Juan to see if he can see where the destination is blocked.

You're doing a 'yank' on the source; you should be able to do a 'yank' on the dest to kill off any hung connections and recover quickly.

To do this you'll need a second QMP connection, and use the "exec-oob" command to execute the yank rather than "execute"; see sections 2.3 and 2.3.1 of https://github.com/qemu/qemu/blob/master/docs/interop/qmp-spec.txt#L89

That should then recover your main monitor for you. If it doesn't, then we have a bug with the recovery on the destination side.

Even if that does work, let's check with Juan whether the current hang you're seeing is avoidable.

(In reply to Dr. David Alan Gilbert from comment #15)
> This *might* be OK, adding Juan to see if he can see where the destination
> is blocked.

Found that QMP & HMP on the dst host sometimes hang, but sometimes stay active as in plain migration. Maybe it's a problem.

> You're doing a 'yank' on the source; you should be able to do a 'yank' on
> the dest
> to kill off any hung connections and recover quickly.
> 
> To do this you'll need a second qmp connection, and use the "exec-oob"
> command to execute the
> yank rather than "execute",

Thanks for the heads up.
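The exec-oob flow described above can be sketched as a tiny QMP client: open a second connection, enable the "oob" capability, then send the yank with the "exec-oob" key instead of "execute" so it is processed even while the main monitor is wedged. This is a hedged model: the stub below stands in for QEMU's QMP server (its greeting is simplified) so the exchange runs self-contained; against a real QEMU you would connect to its second -qmp unix socket instead.

```python
import json
import socket
import threading

def qmp_stub(sock):
    """Minimal stand-in for QEMU's QMP server (hypothetical, simplified)."""
    f = sock.makefile("rwb")
    # QEMU sends a greeting first, advertising its capabilities.
    f.write(json.dumps({"QMP": {"capabilities": ["oob"]}}).encode() + b"\n")
    f.flush()
    for _ in range(2):  # capability negotiation, then the yank
        json.loads(f.readline())
        f.write(b'{"return": {}}\n')
        f.flush()

def oob_yank(sock):
    """Negotiate OOB on a second QMP connection and yank the migration."""
    f = sock.makefile("rwb")
    greeting = json.loads(f.readline())
    assert "oob" in greeting["QMP"]["capabilities"]
    for cmd in (
        {"execute": "qmp_capabilities", "arguments": {"enable": ["oob"]}},
        # "exec-oob" instead of "execute": runs out of band, bypassing
        # the (possibly blocked) main command queue.
        {"exec-oob": "yank",
         "arguments": {"instances": [{"type": "migration"}]}},
    ):
        f.write(json.dumps(cmd).encode() + b"\n")
        f.flush()
        reply = json.loads(f.readline())
        assert reply == {"return": {}}
    return True

client, server = socket.socketpair()
t = threading.Thread(target=qmp_stub, args=(server,))
t.start()
ok = oob_yank(client)
t.join()
print(ok)  # True
```

The two JSON commands sent here are exactly the ones shown in the transcript below; only the transport plumbing around them is illustrative.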
When QMP on the dst hangs, I tried to execute yank under OOB, but it still hangs. What shall I do for the next steps?

{"execute":"qmp_capabilities", "arguments":{"enable":["oob"]}}
{"return": {}}
{ "exec-oob": "query-yank" }
{"return": [{"type": "chardev", "id": "qmp_id_qmpmonitor1"}, {"type": "chardev", "id": "qmp_id_catch_monitor"}, {"type": "chardev", "id": "compat_monitor0"}, {"type": "chardev", "id": "compat_monitor1"}, {"type": "chardev", "id": "compat_monitor2"}, {"type": "chardev", "id": "serial0"}, {"type": "migration"}]}
{"exec-oob":"yank","arguments":{"instances":[{"type":"migration"}]}}
{"return": {}}

> see section 2.3 and 2.3.1 of
> https://github.com/qemu/qemu/blob/master/docs/interop/qmp-spec.txt#L89
> 
> that should then recover your main monitor for you.
> If it doesn't, then we have a bug with the recovery on the destination side.
> 
> Even if that does work, lets check with Juan if the current hang you're
> seeing is avoidable.

Yeah, if it's still hanging after a yank, I think it's one for Juan to check in the multifd code.

OK, I would mark this bug verified per Comment 14 & Comment 15. We could go on to track the hang of qemu on the dst host during multifd migration separately (I think it's not the same issue as this bug).

Juan, could you check according to Comment 14, Comment 15 and Comment 16?

(In reply to Dr. David Alan Gilbert from comment #15)
> This *might* be OK, adding Juan to see if he can see where the destination
> is blocked.
> You're doing a 'yank' on the source; you should be able to do a 'yank' on
> the dest
> to kill off any hung connections and recover quickly.
> 
> To do this you'll need a second qmp connection, and use the "exec-oob"
> command to execute the
> yank rather than "execute", see section 2.3 and 2.3.1 of
> https://github.com/qemu/qemu/blob/master/docs/interop/qmp-spec.txt#L89
> 
> that should then recover your main monitor for you.
> If it doesn't, then we have a bug with the recovery on the destination side.
> 
> Even if that does work, lets check with Juan if the current hang you're
> seeing is avoidable.

You need to do the yank on the destination by hand. Look at what happened:

- we start migration on both sides
- we cut the network cable
- we yank the source, and life is good there
- but the destination is still waiting for more data

I can't see how it can detect that there has been a network cut when it is just reading. It needs to wait for whatever timeout applies. So I would say that things are OK.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (new packages: qemu-kvm), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2307
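Juan's point, that a side which is only reading cannot distinguish a cut cable from a slow sender, can be illustrated with a short, hedged Python sketch (a socket model, not QEMU code): dropped packets look to recv() exactly like "no data yet", so the destination blocks until a timeout fires or someone explicitly yanks (shuts down) its channel by hand.

```python
import socket
import threading
import time

# src plays the migration source, dst the destination's receive side.
src, dst = socket.socketpair()

state = {}

def destination():
    t0 = time.monotonic()
    # Blocks: the "cable" is cut by simply never sending, so no data
    # and no FIN ever arrive to wake this read up on their own.
    state["data"] = dst.recv(16)
    state["waited"] = time.monotonic() - t0

t = threading.Thread(target=destination)
t.start()

# Simulate the network cut: the source just goes silent.
time.sleep(0.5)
assert t.is_alive()  # destination is still blocked after the "cut"

# The manual yank on the destination: shut the channel down by hand,
# which finally releases the blocked read with EOF.
dst.shutdown(socket.SHUT_RDWR)
t.join()

print(state["data"], state["waited"] >= 0.4)  # b'' True
```

This matches the observed behavior in the thread: the destination's hang is not itself a failure to detect the cut, it is the expected consequence of a blocking read with nothing on the wire to end it.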