Bug 1964326 - Qemu core dump when do tls migration via tcp protocol
Summary: Qemu core dump when do tls migration via tcp protocol
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.5
Assignee: Leonardo Bras
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-05-25 08:44 UTC by Li Xiaohui
Modified: 2021-11-16 08:40 UTC (History)

Fixed In Version: qemu-kvm-6.0.0-20.module+el8.5.0+11499+199527ef
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-16 07:53:34 UTC
Type: ---
Target Upstream Version:
Embargoed:


Attachments
tls_cert.sh (1.67 KB, application/x-shellscript)
2021-05-25 08:52 UTC, Li Xiaohui


Links
System ID Private Priority Status Summary Last Updated
Gitlab redhat/centos-stream/src qemu-kvm merge_requests 10 0 None None None 2021-06-18 07:31:07 UTC
Red Hat Product Errata RHBA-2021:4684 0 None None None 2021-11-16 07:54:08 UTC

Description Li Xiaohui 2021-05-25 08:44:52 UTC
Description of problem:
QEMU dumps core during TLS migration between two hosts:
(qemu) qemu-kvm: ../util/yank.c:107: yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
Aborted (core dumped)


Version-Release number of selected component (if applicable):
hosts info: kernel-4.18.0-305.6.el8.x86_64 & 
qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d.x86_64
guest info: kernel-4.18.0-305.8.el8.x86_64


How reproducible:
100%


Steps to Reproduce:
1. Generate the CA files with the attached script (tls_cert.sh).
2. Boot a VM as the TLS server on the dst host:
-object tls-creds-x509,id=tls0,endpoint=server,dir=//mnt/nfs/tls \
-incoming defer \
3. Boot a VM as the TLS client on the src host:
-object tls-creds-x509,id=tls0,endpoint=client,dir=//mnt/nfs/tls \
4. On dst host:
{"execute": "migrate-set-parameters", "arguments": {"tls-creds": "tls0"}, "id": "wd2lS2kr"}
{"execute": "migrate-incoming", "arguments": {"uri": "tcp:10.73.130.67:4000"}, "id": "iyPg3lJW"}
On src host:
{"execute": "migrate-set-parameters", "arguments": {"tls-creds": "tls0"}, "id": "SrFNLZBe"}
{"execute": "migrate", "arguments": {"uri": "tcp:hp-dl385g10-13:4000"}, "id": "2FCViNK3"}
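
The QMP commands in step 4 can also be generated from a script. The following is only a minimal sketch: the helper names `qmp_command` and `tls_migration_commands` are hypothetical, the `id` field is optional and omitted by default, and the sketch only builds the JSON lines. A real QMP session would additionally need the `qmp_capabilities` handshake before any of these commands are accepted.

```python
import json


def qmp_command(execute, arguments=None, cmd_id=None):
    """Serialize one QMP command as a single JSON line (hypothetical helper)."""
    msg = {"execute": execute}
    if arguments is not None:
        msg["arguments"] = arguments
    if cmd_id is not None:
        msg["id"] = cmd_id
    return json.dumps(msg)


def tls_migration_commands(creds_id, listen_uri, connect_uri):
    """Return (dst_cmds, src_cmds) for a TLS migration over tcp.

    creds_id must match the id of the tls-creds-x509 object given on the
    QEMU command line (tls0 in this report).
    """
    set_tls = qmp_command("migrate-set-parameters", {"tls-creds": creds_id})
    dst_cmds = [set_tls, qmp_command("migrate-incoming", {"uri": listen_uri})]
    src_cmds = [set_tls, qmp_command("migrate", {"uri": connect_uri})]
    return dst_cmds, src_cmds


if __name__ == "__main__":
    dst, src = tls_migration_commands(
        "tls0", "tcp:10.73.130.67:4000", "tcp:hp-dl385g10-13:4000"
    )
    print("dst host:")
    for line in dst:
        print(line)
    print("src host:")
    for line in src:
        print(line)
```

The destination must receive its two commands before the source issues `migrate`, as in the steps above.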


Actual results:
During migration, QEMU on both the src and dst hosts dumps core:
(qemu) qemu-kvm: ../util/yank.c:107: yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
Aborted (core dumped)


Expected results:
Migration succeeds and the VM works well after migration.


Additional info:

Comment 1 Dr. David Alan Gilbert 2021-05-25 08:50:54 UTC
Leo: This looks like the one you tripped over a couple of days ago.

Comment 2 Li Xiaohui 2021-05-25 08:52:25 UTC
Created attachment 1786771
tls_cert.sh

Comment 3 Li Xiaohui 2021-05-25 11:25:22 UTC
When doing TLS-encrypted migration via exec, the case passes.

Comment 4 Leonardo Bras 2021-05-25 18:13:21 UTC
I had this issue a while ago, and yesterday I took some time to try and understand this.

I first reverted to a pre-yank commit, and after some small fixes, it worked just fine.

Then I read the cover letter for the last yank patchset and monitored how the yank interface is used:
- No-TLS migration did:
  1 - register_instance()
  2 - register_function()
  3 - unregister_function()
  4 - unregister_instance().

- When I try TLS migration, (3) doesn't happen at all, and this causes (4) to abort, because there are still functions registered.
Looking closely, this happens because in migration_channel_connect(), if migration happens over TLS, to_dst_file is not assigned, so qemu_fclose() is not run in migrate_fd_cleanup() and (3) never happens.

Now I am trying to understand exactly where (3) is supposed to happen when TLS is used.
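
The call sequence above can be illustrated with a toy model. This is only a sketch in Python, not QEMU's C implementation in util/yank.c: the class `YankRegistry`, the function `run_migration`, and the `cleanup_closes_channel` flag are hypothetical names, and a `RuntimeError` stands in for the failed `QLIST_EMPTY(&entry->yankfns)` assertion.

```python
class YankRegistry:
    """Toy model of the yank instance/function bookkeeping (hypothetical)."""

    def __init__(self):
        # instance name -> list of functions currently registered for it
        self._instances = {}

    def register_instance(self, name):
        if name in self._instances:
            raise RuntimeError("instance already registered")
        self._instances[name] = []

    def register_function(self, name, func):
        self._instances[name].append(func)

    def unregister_function(self, name, func):
        self._instances[name].remove(func)

    def unregister_instance(self, name):
        # Stand-in for the failing assertion: the list must be empty here.
        if self._instances[name]:
            raise RuntimeError("yank functions still registered")
        del self._instances[name]


def run_migration(registry, cleanup_closes_channel):
    """Walk through steps (1) to (4); step (3) only runs if the channel is closed."""

    def shutdown_channel():
        pass  # placeholder for the function that would shut down the channel

    registry.register_instance("migration")                           # (1)
    registry.register_function("migration", shutdown_channel)         # (2)
    if cleanup_closes_channel:
        registry.unregister_function("migration", shutdown_channel)   # (3)
    registry.unregister_instance("migration")                         # (4)


if __name__ == "__main__":
    run_migration(YankRegistry(), cleanup_closes_channel=True)  # non-TLS path
    try:
        run_migration(YankRegistry(), cleanup_closes_channel=False)  # TLS path
    except RuntimeError as err:
        print("aborted:", err)
```

In the model, skipping step (3) is the only difference between the two runs, which matches the observation that the abort only occurs on the TLS path.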

(In reply to Li Xiaohui from comment #3)
> When doing TLS-encrypted migration via exec, the case passes.

I could not get exec to work with TLS. 
Could you please show the cmdline used for receiving and sending ends?

Comment 5 Li Xiaohui 2021-05-26 06:44:16 UTC
> 
> (In reply to Li Xiaohui from comment #3)
> > When doing TLS-encrypted migration via exec, the case passes.
> 
> I could not get exec to work with TLS. 
> Could you please show the cmdline used for receiving and sending ends?

Test steps for exec:
1. Steps 1-3 are the same as in Comment 0;
2. On dst host:
(qemu) migrate_set_parameter tls-creds tls0
(qemu) migrate_incoming "exec:socat - TCP4-LISTEN:9002"
On src host:
(qemu) migrate_set_parameter tls-creds tls0
(qemu) migrate_set_parameter tls-hostname $dst_short_host_name
(qemu) migrate "exec:socat - TCP4:$dst_short_host_name:9002"

Comment 6 Leonardo Bras 2021-05-27 04:21:36 UTC
(In reply to Li Xiaohui from comment #5)
> > 
> > (In reply to Li Xiaohui from comment #3)
> > > When doing TLS-encrypted migration via exec, the case passes.
> > 
> > I could not get exec to work with TLS. 
> > Could you please show the cmdline used for receiving and sending ends?
> 
> Test steps for exec:
> 1. Steps 1-3 are the same as in Comment 0;
> 2. On dst host:
> (qemu) migrate_set_parameter tls-creds tls0
> (qemu) migrate_incoming "exec:socat - TCP4-LISTEN:9002"
> On src host:
> (qemu) migrate_set_parameter tls-creds tls0
> (qemu) migrate_set_parameter tls-hostname $dst_short_host_name
> (qemu) migrate "exec:socat - TCP4:$dst_short_host_name:9002"

Thanks Xi, I will try that too!

I have sent v1 and v2 of a fix for this bug:
http://patchwork.ozlabs.org/project/qemu-devel/patch/20210526221615.1093506-1-leobras.c@gmail.com/

Peter Xu recommended refactoring the yank handling in migration so that yank is placed in channel-{tls,socket}, which seems to make more sense here.

Comment 7 Leonardo Bras 2021-06-01 21:35:28 UTC
I posted a v3 earlier, which was reviewed by Peter Xu and Lukas Straub:
http://patchwork.ozlabs.org/project/qemu-devel/patch/20210601054030.1153249-1-leobras.c@gmail.com/

In all the tests I ran, this fixes the issue.

Comment 9 Leonardo Bras 2021-06-10 06:00:55 UTC
Patch got accepted upstream at:
https://gitlab.com/qemu-project/qemu/-/commit/7de2e8565335c13fb3516cddbe2e40e366cce273 (master)

Comment 17 Li Xiaohui 2021-06-22 07:11:43 UTC
Passed after testing on RHEL-AV 8.5.0 (kernel-4.18.0-312.el8.x86_64 & qemu-kvm-6.0.0-20.module+el8.5.0+11499+199527ef.x86_64)

Ran the automation required for TLS migration testing with rhel8.5.0 and win2022 guests; all cases passed. For logs, see the following link:
http://fileshare.englab.nay.redhat.com/pub/logs/xiaohli/bz1964326/

Comment 18 Yanan Fu 2021-06-22 12:05:48 UTC
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 19 Li Xiaohui 2021-06-22 13:07:52 UTC
Marking the bz as verified per Comment 17 and removing SanityOnly because we have scenarios that cover this bz.

Comment 21 errata-xmlrpc 2021-11-16 07:53:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4684

