Description of problem:
QEMU dumps core when doing TLS migration between two hosts:
(qemu) qemu-kvm: ../util/yank.c:107: yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
Aborted (core dumped)

Version-Release number of selected component (if applicable):
hosts info: kernel-4.18.0-305.6.el8.x86_64 & qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d.x86_64
guest info: kernel-4.18.0-305.8.el8.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Generate the CA files as in the attachment
2. Boot a VM as the TLS server on the dst host:
    -object tls-creds-x509,id=tls0,endpoint=server,dir=//mnt/nfs/tls \
    -incoming defer \
3. Boot a VM as the TLS client on the src host:
    -object tls-creds-x509,id=tls0,endpoint=client,dir=//mnt/nfs/tls \
4. On dst host:
{"execute": "migrate-set-parameters", "arguments": {"tls-creds": "tls0"}, "id": "wd2lS2kr"}
{"execute": "migrate-incoming", "arguments": {"uri": "tcp:10.73.130.67:4000"}, "id": "iyPg3lJW"}
On src host:
{"execute": "migrate-set-parameters", "arguments": {"tls-creds": "tls0"}, "id": "SrFNLZBe"}
{"execute": "migrate", "arguments": {"uri": "tcp:hp-dl385g10-13:4000"}, "id": "2FCViNK3"}

Actual results:
During migration, QEMU on both the src and dst hosts dumps core:
(qemu) qemu-kvm: ../util/yank.c:107: yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
Aborted (core dumped)

Expected results:
Migration succeeds and the VM works well after migration.

Additional info:
Leo: This looks like the one you tripped over a couple of days ago.
Created attachment 1786771 [details] tls_cert.sh
When do TLS encryption migration via exec, case passed.
I had this issue a while ago, and yesterday I took some time to try to understand it.

I first reverted to a pre-yank commit, and after some small fixes, it worked just fine.

Then I read the cover letter for the last yank patchset, and tried to monitor the usage of the yank interface:
- No-TLS migration did:
  1 - register_instance()
  2 - register_function()
  3 - unregister_function()
  4 - unregister_instance()
- When I try TLS migration, (3) does not happen at all, and this causes (4) to abort, because there are still valid functions registered.

Looking closely, this happens because in migration_channel_connect(), if migration happens over TLS, to_dst_file is not assigned, so (3) does not happen in migrate_fd_cleanup() because qemu_fclose() is not run.

Now I am trying to understand exactly where (3) is supposed to happen when TLS is used.

(In reply to Li Xiaohui from comment #3)
> When do TLS encryption migration via exec, case passed.

I could not get exec to work with TLS.
Could you please show the cmdline used for receiving and sending ends?
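For illustration, the register/unregister sequence described above can be modelled with a small toy registry. This is only a sketch in Python, not QEMU's C code: YankRegistry, run_migration, shutdown_channel and the use_tls flag are hypothetical names made up for this example, and the only thing taken from the report is the rule that an instance cannot be unregistered while it still has functions attached.

```python
class YankRegistry:
    """Toy model of the instance/function bookkeeping (not QEMU's code)."""

    def __init__(self):
        # Maps an instance name to the list of functions registered on it.
        self.instances = {}

    def register_instance(self, name):
        if name in self.instances:
            raise ValueError(f"instance {name!r} already registered")
        self.instances[name] = []

    def register_function(self, name, func):
        self.instances[name].append(func)

    def unregister_function(self, name, func):
        self.instances[name].remove(func)

    def unregister_instance(self, name):
        # Models the failing check: no functions may remain on the instance.
        if self.instances[name]:
            raise AssertionError("QLIST_EMPTY(&entry->yankfns) failed")
        del self.instances[name]


def run_migration(registry, use_tls):
    """Replay steps (1)-(4); with use_tls=True step (3) is skipped (assumed)."""

    def shutdown_channel():
        # Placeholder for whatever the registered function would do.
        pass

    registry.register_instance("migration")                        # (1)
    registry.register_function("migration", shutdown_channel)      # (2)
    if not use_tls:
        registry.unregister_function("migration", shutdown_channel)  # (3)
    registry.unregister_instance("migration")                      # (4)
```

In this model, run_migration(YankRegistry(), use_tls=False) completes normally, while use_tls=True raises AssertionError at step (4), which corresponds to the abort seen on both hosts.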
> (In reply to Li Xiaohui from comment #3)
> > When do TLS encryption migration via exec, case passed.
>
> I could not get exec to work with TLS.
> Could you please show the cmdline used for receiving and sending ends?

Test steps for exec:
1. Steps 1~3 are the same as in Comment 0;
2. On dst host:
(qemu) migrate_set_parameter tls-creds tls0
(qemu) migrate_incoming "exec:socat - TCP4-LISTEN:9002"
On src host:
(qemu) migrate_set_parameter tls-creds tls0
(qemu) migrate_set_parameter tls-hostname $dst_short_host_name
(qemu) migrate "exec:socat - TCP4:$dst_short_host_name:9002"
(In reply to Li Xiaohui from comment #5)
> > (In reply to Li Xiaohui from comment #3)
> > > When do TLS encryption migration via exec, case passed.
> >
> > I could not get exec to work with TLS.
> > Could you please show the cmdline used for receiving and sending ends?
>
> Test steps about exec:
> 1. Step1~3 are same with Comment 0;
> 2. In dst host:
> (qemu) migrate_set_parameter tls-creds tls0
> (qemu) migrate_incoming "exec:socat - TCP4-LISTEN:9002"
> In src host:
> (qemu) migrate_set_parameter tls-creds tls0
> (qemu) migrate_set_parameter tls-hostname $dst_short_host_name
> (qemu) migrate "exec:socat - TCP4:$dst_short_host_name:9002"

Thanks Xi, I will try that too!

I have sent a v1 & v2 of a fix for this bug:
http://patchwork.ozlabs.org/project/qemu-devel/patch/20210526221615.1093506-1-leobras.c@gmail.com/

Peter Xu recommended refactoring yank in migration so that we place yank in channel-{tls,socket}, which seems to make more sense here.
I posted a v3 earlier, which was reviewed by Peter Xu and Lukas Straub:
http://patchwork.ozlabs.org/project/qemu-devel/patch/20210601054030.1153249-1-leobras.c@gmail.com/

In all the tests I ran, this fixes the issue.
Patch got accepted upstream at: https://gitlab.com/qemu-project/qemu/-/commit/7de2e8565335c13fb3516cddbe2e40e366cce273 (master)
Passed after testing on RHEL-AV 8.5.0 (kernel-4.18.0-312.el8.x86_64 & qemu-kvm-6.0.0-20.module+el8.5.0+11499+199527ef.x86_64).

Ran the automation required for TLS migration testing with rhel8.5.0 and win2022 guests; all cases passed. For logs, see the following link:
http://fileshare.englab.nay.redhat.com/pub/logs/xiaohli/bz1964326/
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.
Mark bz as verified per Comment 17, and remove SanityOnly because we have scenarios to cover this bz.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4684