Bug 2106726
| Summary: | Qemu on destination host crashed if migrate with postcopy and multifd enabled | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Li Xiaohui <xiaohli> |
| Component: | qemu-kvm | Assignee: | Leonardo Bras <leobras> |
| qemu-kvm sub component: | Live Migration | QA Contact: | Li Xiaohui <xiaohli> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | chayang, coli, fjin, jferlan, jinzhao, juzhang, lcheng, leobras, lijin, mdeng, mrezanin, mzamazal, nilal, peterx, quintela, virt-maint |
| Version: | 9.2 | Keywords: | Triaged |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | 9.3 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | qemu-kvm-8.0.0-1.el9 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2169733 (view as bug list) | Environment: | |
| Last Closed: | 2023-11-07 08:26:38 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2180898 | | |
| Bug Blocks: | 2169733 | | |
Description (Li Xiaohui, 2022-07-13 11:38:57 UTC)
This bug should also happen on RHEL 8.7.0; if we plan to fix it for RHEL 8.7.0, I will clone one.

---

(Leonardo Bras, comment #2)

Hello Li Xiaohui,

While I am unaware of the support status for this scenario, I would like to better understand the issue from a technical viewpoint.

Could you please share the commands you used to reproduce this?

---

(Li Xiaohui)

(In reply to Leonardo Bras from comment #2)
> Could you please share the commands you used to reproduce this?

The QEMU command line is shown at [1] below.

1. Boot a guest with the QEMU command line [1] on the source host.
2. Boot a guest on the destination host with the same QEMU command line, appending `-incoming defer`.
3. Enable the multifd and postcopy capabilities on both the source and destination hosts:

```
{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"multifd","state":true}]}}
{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"postcopy-ram","state":true}]}}
```

4. While the migration is active, switch to postcopy mode:

```
{"execute":"migrate-start-postcopy"}
```

After the migration completes, QEMU on the destination host crashes. I will attach the QEMU core dump file later.

QEMU command line [1]:

```
/usr/libexec/qemu-kvm \
    -name "mouse-vm" \
    -sandbox on \
    -machine q35,memory-backend=pc.ram \
    -cpu EPYC-IBPB,x2apic=on,tsc-deadline=on,hypervisor=on,tsc-adjust=on,arch-capabilities=on,xsaves=on,cmp-legacy=on,perfctr-core=on,clzero=on,xsaveerptr=on,virt-ssbd=on,npt=off,nrip-save=off,svme-addr-chk=off,rdctl-no=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,monitor=off \
    -nodefaults \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1,server=on,wait=off \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor,server=on,wait=off \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pcie-root-port,port=0x10,chassis=1,id=root0,bus=pcie.0,multifunction=on,addr=0x2 \
    -device pcie-root-port,port=0x11,chassis=2,id=root1,bus=pcie.0,addr=0x2.0x1 \
    -device pcie-root-port,port=0x12,chassis=3,id=root2,bus=pcie.0,addr=0x2.0x2 \
    -device pcie-root-port,port=0x13,chassis=4,id=root3,bus=pcie.0,addr=0x2.0x3 \
    -device pcie-root-port,port=0x14,chassis=5,id=root4,bus=pcie.0,addr=0x2.0x4 \
    -device pcie-root-port,port=0x15,chassis=6,id=root5,bus=pcie.0,addr=0x2.0x5 \
    -device pcie-root-port,port=0x16,chassis=7,id=root6,bus=pcie.0,addr=0x2.0x6 \
    -device pcie-root-port,port=0x17,chassis=8,id=root7,bus=pcie.0,addr=0x2.0x7 \
    -device pcie-root-port,port=0x20,chassis=21,id=extra_root0,bus=pcie.0,multifunction=on,addr=0x3 \
    -device pcie-root-port,port=0x21,chassis=22,id=extra_root1,bus=pcie.0,addr=0x3.0x1 \
    -device pcie-root-port,port=0x22,chassis=23,id=extra_root2,bus=pcie.0,addr=0x3.0x2 \
    -device nec-usb-xhci,id=usb1,bus=root0,addr=0x0 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=root1,addr=0x0 \
    -device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0,write-cache=on \
    -device virtio-net-pci,mac=9a:8a:8b:8c:8d:8e,id=net0,netdev=tap0,bus=root2,addr=0x0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -device virtio-balloon-pci,id=balloon0,bus=root3,addr=0x0 \
    -device VGA,id=video0,vgamem_mb=16,bus=pcie.0,addr=0x1 \
    -blockdev driver=file,auto-read-only=on,discard=unmap,aio=threads,cache.direct=on,cache.no-flush=off,filename=/mnt/xiaohli/rhel910-64-virtio-scsi.qcow2,node-name=drive_sys1 \
    -blockdev driver=qcow2,node-name=drive_image1,read-only=off,cache.direct=on,cache.no-flush=off,file=drive_sys1 \
    -netdev tap,id=tap0,vhost=on \
    -m 24576 \
    -object memory-backend-ram,id=pc.ram,size=24576M \
    -smp 28,maxcpus=32,cores=8,threads=2,sockets=2 \
    -vnc :10 \
    -rtc base=utc,clock=host,driftfix=slew \
    -boot menu=off,strict=off,order=cdn,once=c \
    -enable-kvm \
    -qmp tcp:0:3333,server=on,wait=off \
    -serial tcp:0:4444,server=on,wait=off \
    -monitor stdio \
    -msg timestamp=on
```
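Editorial note: the reproduction steps above do not show the step that actually starts the migration. With `-incoming defer`, the destination must issue `migrate-incoming` and the source must then issue `migrate`, after the capabilities in step 3 are set on both sides. A minimal sketch of that step follows; the TCP port and destination address are illustrative assumptions, not values from the report:

```
# Destination QMP: with -incoming defer, arm the incoming side first
{"execute":"migrate-incoming","arguments":{"uri":"tcp:0:5555"}}

# Source QMP: start the live migration toward the destination
{"execute":"migrate","arguments":{"uri":"tcp:192.168.0.2:5555"}}
```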
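For the destination-side crash itself, a typical way to pull backtraces out of the core dump would be something like the following sketch; it assumes the dump was caught by systemd-coredump on the destination host:

```
# Open the most recent qemu-kvm core dump in gdb
coredumpctl gdb /usr/libexec/qemu-kvm

# Inside gdb: backtrace of the crashing thread, then of all threads
# (multifd and postcopy use several worker threads)
(gdb) bt
(gdb) thread apply all bt
```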
---

(Leonardo Bras)

It *looks like* a yank issue. I will try to reproduce it and see what I can do.

---

(Leonardo Bras, comment #6)

It looks like it was a yank issue in multifd + postcopy: the multifd channels were not being unregistered, which caused yank to crash.

I just sent a v1 for review:
https://patchwork.kernel.org/project/qemu-devel/patch/20221109055629.789795-1-leobras@redhat.com/

Once it gets merged, I will proceed with the backporting (it should affect versions since RHEL 8.6 at least).

---

(Li Xiaohui, comment #7)

(In reply to Leonardo Bras from comment #6)
> Once it gets merged, I will proceed with the backporting (it should affect
> versions since RHEL 8.6 at least).

Which RHEL 8 version do you plan to fix this in? Maybe we only need to fix the latest, RHEL 8.8?

---

(Leonardo Bras, comment #8)

(In reply to Li Xiaohui from comment #7)
> Which RHEL 8 version do you plan to fix this in? Maybe we only need to fix
> the latest, RHEL 8.8?

That's a good question.

It's a bugfix, so IIUC we should provide the fix to every affected version. On the other hand, is multifd + postcopy supported by Red Hat in any product?

Anyway, whatever is decided, backporting should be no problem.

---

(Li Xiaohui)

(In reply to Leonardo Bras from comment #8)
> It's a bugfix, so IIUC we should provide the fix to every affected version.
> On the other hand, is multifd + postcopy supported by Red Hat in any product?

I can't answer that question, but I think a zstream backport needs a strong justification. I don't think this bug needs to be backported to the RHEL 8 zstreams, as I have never seen similar bugs reported by customers before. I would clone one for RHEL 8.8 first.
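(Editorial aside on the yank issue from comment #6: QEMU's yank framework registers a yank instance per migration channel so that blocked I/O can be forcibly shut down, and an instance that is never unregistered lingers after the migration finishes. Over QMP, lingering instances can be listed with `query-yank`; the reply shown below is a simplified assumption of what a leftover migration instance would look like.)

```
-> {"execute":"query-yank"}
<- {"return": [{"type": "migration"}]}
```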
---

(Leonardo Bras)

Thank you!

"V2" here:
https://patchwork.kernel.org/project/qemu-devel/list/?series=720556&state=%2A&archive=both
(Not actually sent as a v2, but it also fixes the issue.)

It was already approved and merged upstream under commit id cfc3bcf373218fb8757b0ff1ce2017b9b6ad4bff.

Merge request created for CentOS Stream 9:
https://gitlab.com/redhat/centos-stream/src/qemu-kvm/-/merge_requests/151

---

QE bot (pre-verify): Set 'Verified:Tested,SanityOnly' as the gating/tier1 tests pass.

---

(Li Xiaohui)

Verified this bug through the tests below; only one issue was hit, when testing postcopy recovery with multifd enabled. That is a known bug (https://bugzilla.redhat.com/show_bug.cgi?id=2107817#c1); let's track the issue there.

```
**********************************************************************************************
RESULTS [VIRT-49060-X86-Q35-BLOCKDEV]:
==>TOTAL   : 14
==>PASS    : 13
   1: BASE-TEST-POSTCOPY-Migration basic precopy test without setting downtime and speed (6 min 28 sec)
   2: VIRT-49062-[postcopy] Migration finishes only with postcopy under high stress (rhel only) (15 min 29 sec)
   3: VIRT-58670-[postcopy] Cancel migration during the precopy phase (1 min 36 sec)
   4: VIRT-58672-[postcopy] Source should recovers when fail the destination during the precopy phase (1 min 32 sec)
   5: VIRT-85702-[postcopy] Post-copy migration with XBZRLE compression (3 min 24 sec)
   6: VIRT-294886-[migration] Postcopy migration recover after migrate-pause (2 min 28 sec)
   7: RHEL-150076-[postcopy] Set postcopy migration speed(max-postcopy-bandwidth) (4 min 48 sec)
   8: RHEL-186017-[postcopy] Basic postcopy migration (3 min 20 sec)
   9: RHEL-189930-[postcopy] Post-copy migration with enabling auto-converge (3 min 28 sec)
  10: POSTCOPY-MULTIFD-[postcopy] postcopy + multifd migration (3 min 12 sec)
  11: VIRT-86251-[postcopy] live migration post-copy support file-backed memory (3 min 52 sec)
  12: VIRT-93722-[postcopy] Postcopy migration with Numa pinned and Hugepage pinned guest--file backend (3 min 32 sec)
  13: POSTCOPY-MULTIFD-MEMORY-TEST-[postcopy] Postcopy + multifd migration with Numa pinned and Hugepage pinned guest--file backend (3 min 36 sec)
==>ERROR   : 1
   1: POSTCOPY-MULTIFD-PAUSE-TEST-[migration] Postcopy + multifd migration recover after migrate-pause (21 min 41 sec)
==>FAIL    : 0
==>CANCEL  : 0
==>SKIP    : 0
==>WARN    : 0
==>RUN TIME: 74 min 47 sec
==>TEST LOG: /home/ipa/test_logs/VIRT_49060_x86_q35_blockdev-2023-04-25-05:44:49
**********************************************************************************************
RESULTS [RHEL-175691-X86-Q35-BLOCKDEV]:
==>TOTAL   : 6
==>PASS    : 6
   1: VIRT-109869-[Multiple-fds] Live migration with multifd on (13 min 4 sec)
   2: RHEL-186122-[Multiple-fds] Multifd migration cancel test (13 min 32 sec)
   3: RHEL-199218-[Multiple-fds] TLS encryption migration via ipv4 addr with multifd enabled (3 min 32 sec)
   4: POSTCOPY-MULTIFD-TLS-[Multiple-fds] TLS encryption migration via ipv4 addr with postcopy and multifd enabled (3 min 32 sec)
   5: POSTCOPY-MULTIFD-THREAD-TEST-[Multiple-fds] Postcopy + multifd migration with setting multifd threads (3 min 32 sec)
   6: RHEL-186019-[Multiple-fds] Multifd migration with Numa pinned and Hugepage pinned guest (3 min 40 sec)
==>ERROR   : 0
==>FAIL    : 0
==>CANCEL  : 0
==>SKIP    : 0
==>WARN    : 0
==>RUN TIME: 41 min 5 sec
==>TEST LOG: /home/ipa/test_logs/RHEL_175691_x86_q35_blockdev-2023-04-25-06:59:37
**********************************************************************************************
```

So I would mark this bug verified per the above test results.
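Editorial note: the one erroring case above, POSTCOPY-MULTIFD-PAUSE-TEST, exercises postcopy recovery after `migrate-pause`. For reference, that flow over QMP looks roughly like the sketch below; the port and destination address are illustrative assumptions:

```
# Source QMP: pause the postcopy migration
{"execute":"migrate-pause"}

# Destination QMP: listen on a fresh URI for the recovery
{"execute":"migrate-recover","arguments":{"uri":"tcp:0:5556"}}

# Source QMP: resume the migration against the new URI
{"execute":"migrate","arguments":{"uri":"tcp:192.168.0.2:5556","resume":true}}
```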
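Likewise, the "setting multifd threads" case in the second suite corresponds to the `multifd-channels` migration parameter, set on both sides before the migration starts; the channel count here is an arbitrary example:

```
{"execute":"migrate-set-parameters","arguments":{"multifd-channels":8}}
```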
---

As we don't plan to support postcopy + multifd scenarios on RHEL 9.3.0, I marked qe_test_coverage- for this bug.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: qemu-kvm security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6368