Bug 2106726
| Summary: | Qemu on destination host crashed if migrate with postcopy and multifd enabled | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Li Xiaohui <xiaohli> |
| Component: | qemu-kvm | Assignee: | Leonardo Bras <leobras> |
| qemu-kvm sub component: | Live Migration | QA Contact: | Li Xiaohui <xiaohli> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | chayang, coli, fjin, jferlan, jinzhao, juzhang, lcheng, leobras, lijin, mdeng, mrezanin, mzamazal, nilal, peterx, quintela, virt-maint |
| Version: | 9.2 | Keywords: | Triaged |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | 9.3 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | qemu-kvm-8.0.0-1.el9 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2169733 (view as bug list) | Environment: | |
| Last Closed: | 2023-11-07 08:26:38 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2180898 | | |
| Bug Blocks: | 2169733 | | |
Description (Li Xiaohui, 2022-07-13 11:38:57 UTC)
This bug should also happen on RHEL 8.7.0; if we plan to fix it for RHEL 8.7.0, I will clone one.

---

(Leonardo Bras, comment #2)

Hello Li Xiaohui,

While I am unaware of the support status for this scenario, I would like to better understand the issue from a technical viewpoint.

Could you please share the commands you used to reproduce this?

---

(Li Xiaohui)

(In reply to Leonardo Bras from comment #2)
> Could you please share the commands you used to reproduce this?

The QEMU command line is shown at [1] below.

1. Boot a guest with the QEMU command line [1] on the source host.
2. Boot a guest on the destination host with the same QEMU command line, appending `-incoming defer`.
3. Enable the multifd and postcopy capabilities on both the source and destination hosts:

```
{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"multifd","state":true}]}}
{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"postcopy-ram","state":true}]}}
```

4. While the migration is active, switch to postcopy mode:

```
{"execute":"migrate-start-postcopy"}
```

After the migration completes, QEMU on the destination host crashes. I will attach the QEMU core dump file later.

QEMU command line [1]:

```
/usr/libexec/qemu-kvm \
    -name "mouse-vm" \
    -sandbox on \
    -machine q35,memory-backend=pc.ram \
    -cpu EPYC-IBPB,x2apic=on,tsc-deadline=on,hypervisor=on,tsc-adjust=on,arch-capabilities=on,xsaves=on,cmp-legacy=on,perfctr-core=on,clzero=on,xsaveerptr=on,virt-ssbd=on,npt=off,nrip-save=off,svme-addr-chk=off,rdctl-no=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,monitor=off \
    -nodefaults \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1,server=on,wait=off \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor,server=on,wait=off \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pcie-root-port,port=0x10,chassis=1,id=root0,bus=pcie.0,multifunction=on,addr=0x2 \
    -device pcie-root-port,port=0x11,chassis=2,id=root1,bus=pcie.0,addr=0x2.0x1 \
    -device pcie-root-port,port=0x12,chassis=3,id=root2,bus=pcie.0,addr=0x2.0x2 \
    -device pcie-root-port,port=0x13,chassis=4,id=root3,bus=pcie.0,addr=0x2.0x3 \
    -device pcie-root-port,port=0x14,chassis=5,id=root4,bus=pcie.0,addr=0x2.0x4 \
    -device pcie-root-port,port=0x15,chassis=6,id=root5,bus=pcie.0,addr=0x2.0x5 \
    -device pcie-root-port,port=0x16,chassis=7,id=root6,bus=pcie.0,addr=0x2.0x6 \
    -device pcie-root-port,port=0x17,chassis=8,id=root7,bus=pcie.0,addr=0x2.0x7 \
    -device pcie-root-port,port=0x20,chassis=21,id=extra_root0,bus=pcie.0,multifunction=on,addr=0x3 \
    -device pcie-root-port,port=0x21,chassis=22,id=extra_root1,bus=pcie.0,addr=0x3.0x1 \
    -device pcie-root-port,port=0x22,chassis=23,id=extra_root2,bus=pcie.0,addr=0x3.0x2 \
    -device nec-usb-xhci,id=usb1,bus=root0,addr=0x0 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=root1,addr=0x0 \
    -device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0,write-cache=on \
    -device virtio-net-pci,mac=9a:8a:8b:8c:8d:8e,id=net0,netdev=tap0,bus=root2,addr=0x0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -device virtio-balloon-pci,id=balloon0,bus=root3,addr=0x0 \
    -device VGA,id=video0,vgamem_mb=16,bus=pcie.0,addr=0x1 \
    -blockdev driver=file,auto-read-only=on,discard=unmap,aio=threads,cache.direct=on,cache.no-flush=off,filename=/mnt/xiaohli/rhel910-64-virtio-scsi.qcow2,node-name=drive_sys1 \
    -blockdev driver=qcow2,node-name=drive_image1,read-only=off,cache.direct=on,cache.no-flush=off,file=drive_sys1 \
    -netdev tap,id=tap0,vhost=on \
    -m 24576 \
    -object memory-backend-ram,id=pc.ram,size=24576M \
    -smp 28,maxcpus=32,cores=8,threads=2,sockets=2 \
    -vnc :10 \
    -rtc base=utc,clock=host,driftfix=slew \
    -boot menu=off,strict=off,order=cdn,once=c \
    -enable-kvm \
    -qmp tcp:0:3333,server=on,wait=off \
    -serial tcp:0:4444,server=on,wait=off \
    -monitor stdio \
    -msg timestamp=on
```
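Editorial note: the reproduction steps above do not show the step that actually starts the migration. With `-incoming defer`, the destination must issue `migrate-incoming` and the source must then issue `migrate`, after the capabilities in step 3 are set on both sides. A minimal sketch of that step follows; the TCP port and destination address are illustrative assumptions, not values from the report:

```
# Destination QMP: with -incoming defer, arm the incoming side first
{"execute":"migrate-incoming","arguments":{"uri":"tcp:0:5555"}}

# Source QMP: start the live migration toward the destination
{"execute":"migrate","arguments":{"uri":"tcp:192.168.0.2:5555"}}
```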
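For the destination-side crash itself, a typical way to pull backtraces out of the core dump would be something like the following sketch; it assumes the dump was caught by systemd-coredump on the destination host:

```
# Open the most recent qemu-kvm core dump in gdb
coredumpctl gdb /usr/libexec/qemu-kvm

# Inside gdb: backtrace of the crashing thread, then of all threads
# (multifd and postcopy use several worker threads)
(gdb) bt
(gdb) thread apply all bt
```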
---

(Leonardo Bras)

It *looks like* a yank issue. I will try to reproduce it and see what I can do.

---

(Leonardo Bras, comment #6)

It looks like it was a yank issue in multifd + postcopy: the multifd channels were not being unregistered, which caused yank to crash.

I just sent a v1 for review:
https://patchwork.kernel.org/project/qemu-devel/patch/20221109055629.789795-1-leobras@redhat.com/

Once it gets merged, I will proceed with the backporting (it should affect versions since RHEL 8.6 at least).

---

(Li Xiaohui, comment #7)

(In reply to Leonardo Bras from comment #6)
> Once it gets merged, I will proceed with the backporting (it should affect
> versions since RHEL 8.6 at least).

Which RHEL 8 version do you plan to fix this in? Maybe we only need to fix the latest, RHEL 8.8?

---

(Leonardo Bras, comment #8)

(In reply to Li Xiaohui from comment #7)
> Which RHEL 8 version do you plan to fix this in? Maybe we only need to fix
> the latest, RHEL 8.8?

That's a good question.

It's a bugfix, so IIUC we should provide the fix to every affected version. On the other hand, is multifd + postcopy supported by Red Hat in any product?

Anyway, whatever is decided, backporting should be no problem.

---

(Li Xiaohui)

(In reply to Leonardo Bras from comment #8)
> It's a bugfix, so IIUC we should provide the fix to every affected version.
> On the other hand, is multifd + postcopy supported by Red Hat in any product?

I can't answer that question, but I think a zstream backport needs a strong justification. I don't think this bug needs to be backported to the RHEL 8 zstreams, as I have never seen similar bugs reported by customers before. I would clone one for RHEL 8.8 first.
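(Editorial aside on the yank issue from comment #6: QEMU's yank framework registers a yank instance per migration channel so that blocked I/O can be forcibly shut down, and an instance that is never unregistered lingers after the migration finishes. Over QMP, lingering instances can be listed with `query-yank`; the reply shown below is a simplified assumption of what a leftover migration instance would look like.)

```
-> {"execute":"query-yank"}
<- {"return": [{"type": "migration"}]}
```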
---

(Leonardo Bras)

Thank you!

"V2" here:
https://patchwork.kernel.org/project/qemu-devel/list/?series=720556&state=%2A&archive=both
(Not actually sent as a v2, but it also fixes the issue.)

It was already approved and merged upstream under commit id cfc3bcf373218fb8757b0ff1ce2017b9b6ad4bff.

Merge request created for CentOS Stream 9:
https://gitlab.com/redhat/centos-stream/src/qemu-kvm/-/merge_requests/151

---

QE bot (pre-verify): Set 'Verified:Tested,SanityOnly' as the gating/tier1 tests pass.

---

(Li Xiaohui)

Verified this bug through the tests below; only one issue was hit, when testing postcopy recovery with multifd enabled. That is a known bug (https://bugzilla.redhat.com/show_bug.cgi?id=2107817#c1); let's track the issue there.

```
**********************************************************************************************
RESULTS [VIRT-49060-X86-Q35-BLOCKDEV]:
==>TOTAL   : 14
==>PASS    : 13
   1: BASE-TEST-POSTCOPY-Migration basic precopy test without setting downtime and speed (6 min 28 sec)
   2: VIRT-49062-[postcopy] Migration finishes only with postcopy under high stress (rhel only) (15 min 29 sec)
   3: VIRT-58670-[postcopy] Cancel migration during the precopy phase (1 min 36 sec)
   4: VIRT-58672-[postcopy] Source should recovers when fail the destination during the precopy phase (1 min 32 sec)
   5: VIRT-85702-[postcopy] Post-copy migration with XBZRLE compression (3 min 24 sec)
   6: VIRT-294886-[migration] Postcopy migration recover after migrate-pause (2 min 28 sec)
   7: RHEL-150076-[postcopy] Set postcopy migration speed(max-postcopy-bandwidth) (4 min 48 sec)
   8: RHEL-186017-[postcopy] Basic postcopy migration (3 min 20 sec)
   9: RHEL-189930-[postcopy] Post-copy migration with enabling auto-converge (3 min 28 sec)
  10: POSTCOPY-MULTIFD-[postcopy] postcopy + multifd migration (3 min 12 sec)
  11: VIRT-86251-[postcopy] live migration post-copy support file-backed memory (3 min 52 sec)
  12: VIRT-93722-[postcopy] Postcopy migration with Numa pinned and Hugepage pinned guest--file backend (3 min 32 sec)
  13: POSTCOPY-MULTIFD-MEMORY-TEST-[postcopy] Postcopy + multifd migration with Numa pinned and Hugepage pinned guest--file backend (3 min 36 sec)
==>ERROR   : 1
   1: POSTCOPY-MULTIFD-PAUSE-TEST-[migration] Postcopy + multifd migration recover after migrate-pause (21 min 41 sec)
==>FAIL    : 0
==>CANCEL  : 0
==>SKIP    : 0
==>WARN    : 0
==>RUN TIME: 74 min 47 sec
==>TEST LOG: /home/ipa/test_logs/VIRT_49060_x86_q35_blockdev-2023-04-25-05:44:49
**********************************************************************************************
RESULTS [RHEL-175691-X86-Q35-BLOCKDEV]:
==>TOTAL   : 6
==>PASS    : 6
   1: VIRT-109869-[Multiple-fds] Live migration with multifd on (13 min 4 sec)
   2: RHEL-186122-[Multiple-fds] Multifd migration cancel test (13 min 32 sec)
   3: RHEL-199218-[Multiple-fds] TLS encryption migration via ipv4 addr with multifd enabled (3 min 32 sec)
   4: POSTCOPY-MULTIFD-TLS-[Multiple-fds] TLS encryption migration via ipv4 addr with postcopy and multifd enabled (3 min 32 sec)
   5: POSTCOPY-MULTIFD-THREAD-TEST-[Multiple-fds] Postcopy + multifd migration with setting multifd threads (3 min 32 sec)
   6: RHEL-186019-[Multiple-fds] Multifd migration with Numa pinned and Hugepage pinned guest (3 min 40 sec)
==>ERROR   : 0
==>FAIL    : 0
==>CANCEL  : 0
==>SKIP    : 0
==>WARN    : 0
==>RUN TIME: 41 min 5 sec
==>TEST LOG: /home/ipa/test_logs/RHEL_175691_x86_q35_blockdev-2023-04-25-06:59:37
**********************************************************************************************
```

So I would mark this bug verified per the above test results.
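Editorial note: the one erroring case above, POSTCOPY-MULTIFD-PAUSE-TEST, exercises postcopy recovery after `migrate-pause`. For reference, that flow over QMP looks roughly like the sketch below; the port and destination address are illustrative assumptions:

```
# Source QMP: pause the postcopy migration
{"execute":"migrate-pause"}

# Destination QMP: listen on a fresh URI for the recovery
{"execute":"migrate-recover","arguments":{"uri":"tcp:0:5556"}}

# Source QMP: resume the migration against the new URI
{"execute":"migrate","arguments":{"uri":"tcp:192.168.0.2:5556","resume":true}}
```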
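Likewise, the "setting multifd threads" case in the second suite corresponds to the `multifd-channels` migration parameter, set on both sides before the migration starts; the channel count here is an arbitrary example:

```
{"execute":"migrate-set-parameters","arguments":{"multifd-channels":8}}
```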
---

As we don't plan to support postcopy + multifd scenarios on RHEL 9.3.0, I marked qe_test_coverage- for this bug.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: qemu-kvm security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6368