Bug 1956897

Summary: RFE: Allow killing stuck migration connection
Product: Red Hat Enterprise Linux Advanced Virtualization
Reporter: Michal Privoznik <mprivozn>
Component: qemu-kvm
Assignee: Virtualization Maintenance <virt-maint>
qemu-kvm sub component: Live Migration
QA Contact: Li Xiaohui <xiaohli>
Status: VERIFIED ---
Docs Contact:
Severity: unspecified
Priority: unspecified
CC: aliang, chayang, coli, ddepaula, dgilbert, ehadley, fdeutsch, jdenemar, jinzhao, juzhang, nanliu, pkrempa, virt-bugs, virt-maint, zixchen
Version: ---
Keywords: RFE, TestOnly
Target Milestone: rc
Target Release: 8.4
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: qemu-kvm-6.0.0-17.module+el8.5.0+11173+c9fce0bb
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1955195
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1955195

Description Michal Privoznik 2021-05-04 15:42:34 UTC
+++ This bug was initially created as a clone of Bug #1955195 +++

Description of problem:

Libvirt is planning to adopt the 'yank' command (see bug 1955195), which was implemented in upstream QEMU in commit v6.0.0-rc0~150^2~6 (and related commits).

This is an RFE to do whatever is needed to get the command into RHEL-AV.

Comment 2 Michal Privoznik 2021-05-04 15:46:38 UTC
One way to test the 'yank' command is to start a migration and then inject firewall rules that drop packets silently (that is, DROP instead of REJECT, so that the source doesn't get notified).

Comment 4 Li Xiaohui 2021-05-12 13:03:41 UTC
Tested 'yank' on rhelav-8.5.0 (kernel-4.18.0-304.3.el8.x86_64 & qemu-img-6.0.0-16.module+el8.5.0+10848+2dccc46d.x86_64)


Test scenarios: 
1.Inject a firewall DROP rule on the dst host while migration is active; once the migration hangs, use yank to fail the migration.


Test steps:
1.Boot a guest on src host;
2.Boot a guest on dst host with '-incoming defer';
3.Set migration incoming on dst host via qmp cmd;
{"execute":"migrate-incoming","arguments":{"uri":"tcp:[::]:1234"}}
4.Start migration on src host via qmp cmd;
{"execute": "migrate","arguments":{"uri": "tcp:10.73.130.69:1234"}}
5.While migration is active, inject a firewall DROP rule on dst host:
# iptables -A INPUT -p tcp --dport 1234 -j DROP
6.After the migration hangs (in query-migrate output, only the total time of migration keeps increasing while the other migration params stay unchanged), use the yank cmd to fail the migration on src host:
{ "execute": "query-yank" }
{"return": [{"type": "chardev", "id": "qmp_id_qmpmonitor1"}, {"type": "chardev", "id": "qmp_id_catch_monitor"}, {"type": "chardev", "id": "compat_monitor0"}, {"type": "chardev", "id": "serial0"}, {"type": "migration"}]}
{"execute":"yank","arguments":{"instances":[{"type":"migration"}]}}


Actual result:
After step 6, querying the migration status via qmp cmd on the src host returns the failed status, which I think is the expected result:
{"execute":"query-migrate"}
{"return": {"blocked": false, "status": "failed", "error-desc": "Unable to write to socket: Broken pipe"}}

Next, we need to quit the qemu process on the dst host manually and remove the firewall rule on the dst host; then we can start the migration again, and the VM works well on the dst host after migration.
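
For anyone automating step 6, here is a minimal Python sketch of the same command sequence (query-yank, yank, query-migrate). The transport is deliberately left out: `execute` is a hypothetical callable, not part of any QEMU library, that sends one QMP command dict and returns the reply dict. The function names are also illustrative only.

```python
def has_migration_instance(instances):
    """Return True if a query-yank result lists a migration yank instance."""
    return any(inst.get("type") == "migration" for inst in instances)


def yank_migration_command():
    """Build the same yank command that was sent by hand in step 6."""
    return {"execute": "yank",
            "arguments": {"instances": [{"type": "migration"}]}}


def fail_stuck_migration(execute):
    """Yank the migration connection and report the resulting status.

    `execute` is an assumed helper: it takes a QMP command dict and
    returns the reply dict. Returns the migration status string, or
    None if no migration yank instance is registered.
    """
    reply = execute({"execute": "query-yank"})
    if not has_migration_instance(reply.get("return", [])):
        return None
    execute(yank_migration_command())
    status = execute({"execute": "query-migrate"})
    return status.get("return", {}).get("status")
```

With the replies shown above, `fail_stuck_migration` would be expected to return "failed" rather than "cancelled", which matches the difference from migrate_cancel described in the QAPI comment quoted below.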



BTW, could someone help answer the following two questions:
Question 1) Do we need to test 'yank' with a network failure scenario (a failure of the migration network while migration is active)?
I think the network failure scenario is closer to what 'yank' is meant for, as it would hit the qemu hang issue, whereas injecting a firewall rule only causes a migration hang (qemu and qmp still work well):
*******************************************************************************
+# A yank instance can be yanked with the @yank qmp command to recover from a hanging QEMU.
+#
+# Currently implemented yank instances:
+#  - nbd block device:
+#    Yanking it will shut down the connection to the nbd server without
+#    attempting to reconnect.
+#  - socket chardev:
+#    Yanking it will shut down the connected socket.
+#  - migration:
+#    Yanking it will shut down all migration connections. Unlike
+#    @migrate_cancel, it will not notify the migration process, so migration
+#    will go into @failed state, instead of @cancelled state. @yank should be
+#    used to recover from hangs.

Question 2) Should qemu on the dst host quit automatically after the 'yank' command is executed, since the migration will have failed?

Comment 5 Li Xiaohui 2021-05-12 13:12:48 UTC
Besides migration, yank also applies to nbd block devices and chardevs (see the end of Comment 4, or the details in downstream qemu-kvm-6.0 commit 50186051f425da3ace2425371c5271d0b64e7122).

Comment 6 Dr. David Alan Gilbert 2021-05-12 13:58:34 UTC
Thanks, that's a good test.  It would be better to use the "oob" capability (see https://github.com/qemu/qemu/blob/master/docs/interop/qmp-spec.txt#L116 )
that way even if there is a currently executing QMP command that's blocked, the 'yank' command should still execute.
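
A rough Python sketch of what using the "oob" capability involves on the wire, based on my reading of the QMP spec linked above: the client enables "oob" in qmp_capabilities if the greeting offers it, then sends the command with the "exec-oob" key instead of "execute". The function names and the "yank-1" id are made up for illustration; only the message shapes are meant to follow the spec.

```python
def capabilities_command(greeting):
    """Build qmp_capabilities, enabling "oob" only if the greeting offers it."""
    offered = greeting.get("QMP", {}).get("capabilities", [])
    cmd = {"execute": "qmp_capabilities"}
    if "oob" in offered:
        cmd["arguments"] = {"enable": ["oob"]}
    return cmd


def oob_command(name, arguments=None, cmd_id=None):
    """Build a command for out-of-band execution ("exec-oob" key)."""
    cmd = {"exec-oob": name}
    if arguments is not None:
        cmd["arguments"] = arguments
    if cmd_id is not None:
        # An id helps match the reply, since OOB replies can overtake
        # replies to commands that were sent earlier.
        cmd["id"] = cmd_id
    return cmd


def oob_yank_migration(cmd_id="yank-1"):
    """The yank command for the migration instance, sent out-of-band."""
    return oob_command("yank", {"instances": [{"type": "migration"}]}, cmd_id)
```

This only builds the messages; whether a given monitor actually offers "oob" depends on how it was set up, so the greeting check should not be skipped.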

Comment 7 John Ferlan 2021-05-14 14:34:01 UTC
Moving this to POST since this was included in qemu-6.0 and as noted in comment 4 is already testable.

I set ITM=14 mainly to get the release+ - feel free to use a later one for completion of "new" tests though
I did not set DTM, theoretically it could be 10 as that's about when the code was built, but that'll probably anger the dev missed bot since 10 already passed.

Danilo - I'll let you do the rest of the magic to move to ON_QA

Comment 8 Li Xiaohui 2021-05-17 08:26:05 UTC
(In reply to Dr. David Alan Gilbert from comment #6)
> Thanks, that's a good test.  It would be better to use the "oob" capability
> (see https://github.com/qemu/qemu/blob/master/docs/interop/qmp-spec.txt#L116
> )
> that way even if there is a currently executing QMP command that's blocked,
> the 'yank' command should still execute.

Thanks for the reminder. 

I will test the network failure scenario when the machines are available.

Comment 9 Li Xiaohui 2021-05-25 08:20:22 UTC
Hi David,
I hit a qemu core dump when doing tls migration on the latest rhelav-8.5.0 (qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d.x86_64). Could you help have a look, as it gives an error about yank:
(qemu) qemu-kvm: ../util/yank.c:107: yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
Aborted (core dumped)

Test steps:
1.Generate CA files as setup;
2.Boot a vm as tls server on dst host;
3.Boot a vm as tls client on src host;
4.In dst host:
(qemu) migrate_set_parameter tls-creds tls0
(qemu) migrate_incoming tcp:$dst_host_ip:5801
In src host:
(qemu) migrate_set_parameter tls-creds tls0
(qemu) migrate -d tcp:$dst_host_short_name:5801


During migration, qemu on both the src and dst hosts hits a core dump:
(qemu) qemu-kvm: ../util/yank.c:107: yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
Aborted (core dumped)


And almost all tls cases are blocked by the error.

Comment 10 Dr. David Alan Gilbert 2021-05-25 12:12:33 UTC
Thanks for filing https://bugzilla.redhat.com/show_bug.cgi?id=1964326 for that crash.

Comment 11 Danilo Cesar Lemes de Paula 2021-06-08 00:28:00 UTC
Upstream feature already present in qemu-6.0.
Marked as TestOnly and moved directly to ON_QA

Comment 12 Li Xiaohui 2021-06-09 07:03:20 UTC
Tested the 'yank' command for scenario 2) (migration network failure) on the latest RHELAV-8.5.0 (); the result also passes.

Test environment:
hosts: kernel-4.18.0-310.el8.x86_64 & qemu-kvm-6.0.0-18.module+el8.5.0+11243+5269aaa1.x86_64

Test scenarios: 
2) The migration network hits a failure on the dst host while migration is active; once the migration hangs, use yank to fail the migration.

Test steps:
1.Boot a guest on src host;
2.Boot a guest on dst host with '-incoming defer';
3.Set migration incoming on dst host via qmp cmd;
{"execute":"migrate-incoming","arguments":{"uri":"tcp:[::]:1234"}}
4.Start migration on src host via qmp cmd;
{"execute": "migrate","arguments":{"uri": "tcp:$dst_migration_ip:1234"}}
5.While migration is active, bring down the migration network on dst host:
# nmcli con down ens1f1
6.After the migration hangs (in query-migrate output, only the total time of migration keeps increasing while the other migration params stay unchanged), use the yank cmd to fail the migration on src host:
{ "execute": "query-yank" }
{"return": [{"type": "chardev", "id": "qmp_id_qmpmonitor1"}, {"type": "chardev", "id": "qmp_id_catch_monitor"}, {"type": "chardev", "id": "compat_monitor0"}, {"type": "chardev", "id": "serial0"}, {"type": "migration"}]}
{"execute":"yank","arguments":{"instances":[{"type":"migration"}]}}

Actual result:
After step 6, querying the migration status via qmp cmd on the src host returns the failed status, which I think is the expected result:
{"execute":"query-migrate"}
{"return": {"blocked": false, "status": "failed", "error-desc": "Unable to write to socket: Broken pipe"}}

Next, quit the qemu process on the dst host manually, then restore the migration network on the dst host; we can then start the migration again, and the VM works well on the dst host after migration.


According to Comment 4 && Comment 12, I have tested yank under two scenarios, and both pass:
1) Inject a firewall DROP rule on the dst host while migration is active; once the migration hangs, use yank to fail the migration.
2) The migration network hits a failure on the dst host while migration is active; once the migration hangs, use yank to fail the migration.
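
The hang check used in step 6 of both scenarios (total time keeps growing while the other migration numbers stay the same) could be scripted roughly as below. This is only a sketch: it compares two query-migrate "return" payloads taken some seconds apart, and it assumes the "total-time" field and the "ram" sub-fields "transferred" and "remaining" are present in the output. Which fields to compare is my choice, not something fixed by this bug.

```python
# Fields under "ram" that are expected to change while data is flowing.
PROGRESS_FIELDS = ("transferred", "remaining")


def looks_hung(before, after):
    """Compare two query-migrate "return" payloads taken at different times.

    Returns True if the migration is still active, total-time advanced,
    and none of the progress fields moved between the two snapshots.
    """
    if before.get("status") != "active" or after.get("status") != "active":
        return False
    time_advanced = after.get("total-time", 0) > before.get("total-time", 0)
    ram_before = before.get("ram", {})
    ram_after = after.get("ram", {})
    no_progress = all(ram_before.get(f) == ram_after.get(f)
                      for f in PROGRESS_FIELDS)
    return time_advanced and no_progress
```

A single pair of snapshots can give a false positive on a slow link, so a test case would probably want to require several consecutive "hung" comparisons before sending yank.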



Question 1:
Do we support yank under multifd migration?
I see that some yank code was added to the multifd file [1], but when testing multifd migration under scenario 2), I don't get the expected result as with precopy. I will add a new comment 13 to describe the test with yank and multifd.

Question 2:
I think Scenario 2) is a better test scenario, so I will add a case 'use yank to fail migration when migration network hit a failure' for precopy migration. Is that ok? (BTW, if we support multifd with yank, I will also add a corresponding case.)


[1]
$ git show b5eea99ec2f5cf6fa0ac12a757c8873b1d2a73a4
...
diff --git a/migration/multifd.c b/migration/multifd.c
index 45c690aa11..1a1e589064 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -25,6 +25,9 @@
 #include "trace.h"
 #include "multifd.h"
 
+#include "qemu/yank.h"
+#include "io/channel-socket.h"
...

Comment 13 Li Xiaohui 2021-06-09 07:21:18 UTC
Tested multifd with yank using steps similar to Comment 12, but with the multifd capability enabled:
1) After the network goes down, execute 'yank' to fail the migration:
Result: the migration is reported as failed only after about 10 mins:
{"execute":"query-migrate"}
{"return": {"blocked": false, "status": "failed", "error-desc": "Unable to write to socket: No route to host"}}
2) After the network goes down, wait several seconds, recover the network, then execute 'yank' to fail the migration:
Result: the migration is reported as failed at once:
{"execute":"query-migrate"}
{"return": {"blocked": false, "status": "failed", "error-desc": "Unable to write to socket: Broken pipe"}}


I'm confused: if we support yank with multifd, why doesn't the migration fail at once after yank, and why is the error-desc different? And why does the migration fail at once after yank when the network has been recovered?


Note: all qmp commands in Comment 12 & 13 were sent with the 'OOB' capability enabled.

Comment 14 Peter Krempa 2021-06-09 08:53:15 UTC
Please note, that the hanging migration which this bug was cloned from actually hangs on a 'blockdev-del' of a NBD used for copying over disks/storage, so in this context.

Also note that the exact steps to make the connection stuck probably aren't known yet. It's not as simple as disabling an interface: in the case where 'blockdev-del' hung, a proxy was involved and probably misbehaved.

Comment 15 Li Xiaohui 2021-06-09 09:20:02 UTC
(In reply to Peter Krempa from comment #14)
> Please note, that the hanging migration which this bug was cloned from
> actually hangs on a 'blockdev-del' of a NBD used for copying over
> disks/storage, so in this context.
> 

Thanks. 
Zixi, we need your help verifying yank on NBD here.


> Also note that the exact steps to make the connection stuck aren't probably
> known yet. It's not as simple as disabling an interface as in the case when
> 'blockdev-del' hung a proxy was involved and probably misbehaved.

For now, I can only think of the above two scenarios to test yank from the migration perspective.
I can update my test scenarios if we get exact, better steps to make the connection stuck.
If we need more test steps for NBD, maybe Zixi could help with that.

Comment 16 Dr. David Alan Gilbert 2021-06-09 10:27:33 UTC
(In reply to Li Xiaohui from comment #12)
> Test 'yank' command about scenario 2) migration network hit a failure on the
> latest RHELAV-8.5.0 (), the result also pass.
> 
> Test environment:
> hosts: kernel-4.18.0-310.el8.x86_64 &
> qemu-kvm-6.0.0-18.module+el8.5.0+11243+5269aaa1.x86_64
> 
> Test scenarios: 
> 2)migration network hit a failure on dst host when migration is active, then
> migration hang, use yank to fail migration. 
> 
> Test steps:
> 1.Boot a guest on src host;
> 2.Boot a guest on dst host with '-incoming defer';
> 3.Set migration incoming on dst host via qmp cmd;
> {"execute":"migrate-incoming","arguments":{"uri":"tcp:[::]:1234"}}
> 4.Start migration on src host via qmp cmd;
> {"execute": "migrate","arguments":{"uri": "tcp:$dst_migration_ip:1234"}}
> 5.During migration is active, down migration network on dst host:
> # nmcli con down ens1f1
> 6.After migration hang(query migrate, only the total time of migration is
> increasing, other migration params stay unchanged), use yank cmd to fail
> migration on src host:
> { "execute": "query-yank" }
> {"return": [{"type": "chardev", "id": "qmp_id_qmpmonitor1"}, {"type":
> "chardev", "id": "qmp_id_catch_monitor"}, {"type": "chardev", "id":
> "compat_monitor0"}, {"type": "chardev", "id": "serial0"}, {"type":
> "migration"}]}
> {"execute":"yank","arguments":{"instances":[{"type":"migration"}]}}
> 
> Actual result:
> After step 6, query migration status, get failed status on src host via qmp
> cmd, I think the result is expected:
> {"execute":"query-migrate"}
> {"return": {"blocked": false, "status": "failed", "error-desc": "Unable to
> write to socket: Broken pipe"}}
> 
> Next quit qemu process on dst host by manually, then restore migration
> network on dst host, we can start migration again and vm works well on dst
> host after migration.
> 
> 
> According to Comment 4 && Comment 12, I have tested yank under two
> scenarios, all they pass:
> 1) inject firewall via drop on dst host when migration is active, then
> migration hang, use yank to fail migration. 
> 2) migration network hit a failure on dst host when migration is active,
> then migration hang, use yank to fail migration. 
> 
> 
> 
> Question 1:
> Do we support yank under multifd migration?
> I see multifd [1] file is added some yank codes. but when test multifd
> migration under scenario 2), doesn't get expected result like precopy. I
> will add a new comment 13 to describe the test with yank and multifd. 

If it's taking 10 minutes to report the failure in multifd that sounds like a bug in multifd's yank code.
Please take that as a separate bz just for multifd.

> Question 2:
> I think Scenario 2) is a better test scenario, so I will add a case 'use
> yank to fail migration when migration network hit a failure' for precopy
> migration, is it ok? (BTW, if we support multifd with yank, I will also add
> one corresponding case.)

I think the firewalling mechanism is probably good - as long as you tell the destination to 'drop' packets

> 
> [1]
> $ git show b5eea99ec2f5cf6fa0ac12a757c8873b1d2a73a4
> ...
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 45c690aa11..1a1e589064 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -25,6 +25,9 @@
>  #include "trace.h"
>  #include "multifd.h"
>  
> +#include "qemu/yank.h"
> +#include "io/channel-socket.h"
> ...

Comment 17 Li Xiaohui 2021-06-10 09:47:50 UTC
(In reply to Dr. David Alan Gilbert from comment #16)
> (In reply to Li Xiaohui from comment #12)
> > Test 'yank' command about scenario 2) migration network hit a failure on the
> > latest RHELAV-8.5.0 (), the result also pass.
> > 
> > Question 1:
> > Do we support yank under multifd migration?
> > I see multifd [1] file is added some yank codes. but when test multifd
> > migration under scenario 2), doesn't get expected result like precopy. I
> > will add a new comment 13 to describe the test with yank and multifd. 
> 
> If it's taking 10 minutes to report the failure in multifd that sounds like
> a bug in multifd's yank code.
> Please take that as a separate bz just for multifd.

Filed a new bz:
Bug 1970337 - Fail to get migration failure immediately if yank under multifd migration


> 
> > Question 2:
> > I think Scenario 2) is a better test scenario, so I will add a case 'use
> > yank to fail migration when migration network hit a failure' for precopy
> > migration, is it ok? (BTW, if we support multifd with yank, I will also add
> > one corresponding case.)
> 
> I think the firewalling mechanism is probably good - as long as you tell the
> destination to 'drop' packets
> 

Got it, thank you.

Comment 18 zixchen 2021-06-11 01:26:12 UTC
Verified from NBD.

The yank command can abort the blocked migration when a network issue happens between the migration destination host and the nbd server.

Version:
qemu-kvm-6.0.0-18.module+el8.5.0+11243+5269aaa1.x86_64
kernel-4.18.0-310.el8.x86_64

Steps and results:
1.Boot a guest on src host;
2.Boot a guest on dst host with '-incoming defer';
3.Set migration incoming on dst host via qmp cmd;
{"execute":"migrate-incoming","arguments":{"uri":"tcp:[::]:1234"}}
4.Start migration on src host via qmp cmd;
{"execute": "migrate","arguments":{"uri": "tcp:$ip:1234"}}
5.While migration is active, inject a firewall DROP rule on the nbd server:
# iptables -A INPUT -p tcp --dport 10820 -j DROP
6.After the migration hangs, use yank to fail it:
{ "execute": "query-yank" }
{"return": [{"type": "chardev", "id": "qmp_id_qmpmonitor1"}, {"type": "chardev", "id": "qmp_id_catch_monitor"}, {"type": "chardev", "id": "compat_monitor0"}, {"node-name": "nbd_image1", "type": "block-node"}, {"type": "chardev", "id": "serial0"}, {"type": "migration"}]}
{"execute":"yank","arguments":{"instances":[{"type":"migration"}]}}
{"return": {}}
{"execute":"query-migrate"}
{"return": {"blocked": false, "status": "failed", "error-desc": "Unable to write to socket: Broken pipe"}}

After dropping the iptables rule and quitting the qemu process, the migration can be started over and succeeds.
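
Note that the query-yank output in step 6 also lists a block-node instance for the NBD node (nbd_image1), while the test above yanked only the migration instance. As an illustration only, and not something that was run here, the entries returned by query-yank could be filtered by type and passed back to yank like this (the helper names are made up):

```python
def select_instances(query_yank_return, wanted_types):
    """Pick yank instances of the given types from a query-yank result."""
    return [inst for inst in query_yank_return
            if inst.get("type") in wanted_types]


def yank_command(instances):
    """Build a yank command for the selected instances."""
    if not instances:
        raise ValueError("no yank instances selected")
    return {"execute": "yank", "arguments": {"instances": instances}}
```

For example, `yank_command(select_instances(result, {"block-node"}))` on the output above would target only the nbd_image1 node, and passing `{"migration", "block-node"}` would yank both in a single command.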

Comment 19 Li Xiaohui 2021-06-11 03:17:15 UTC
Thank you, Zixi, for helping with testing.
I will mark this bz verified per Comment 4 & 12 & 13 & 16 & 17.

About the NBD issue in Bug 1945532 - VM migration halts occasionally:
1.It isn't clear yet how to reproduce that bz, so it shouldn't block this one;
2.'yank' command support is a separate request, not a solution to bz 1945532, as stated in the comment: https://bugzilla.redhat.com/show_bug.cgi?id=1945532#c69
3.We can continue to track bz 1945532; NBD QE zixchen and Storage vm migration QE aliang will keep helping with testing if needed (I have cc'ed them, and they will track the issue).