Bug 1996609
Summary: Qemu hit core dump when dump guest memory during live migration

| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Li Xiaohui <xiaohli> |
| Component: | qemu-kvm | Assignee: | Peter Xu <peterx> |
| qemu-kvm sub component: | Live Migration | QA Contact: | Li Xiaohui <xiaohli> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | high | CC: | chayang, dgilbert, jinzhao, lcapitulino, leidwang, lijin, marcandre.lureau, mdean, peterx, virt-maint |
| Version: | 9.0 | Keywords: | Triaged |
| Target Milestone: | rc | | |
| Target Release: | 9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | qemu-kvm-6.1.0-7.el9 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-05-17 12:23:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Li Xiaohui, 2021-08-23 10:16:25 UTC)
qemu cli[1]:

```shell
/usr/libexec/qemu-kvm \
    -name "mouse-vm" \
    -sandbox off \
    -machine q35 \
    -cpu Skylake-Client \
    -nodefaults \
    -vga std \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1,server=on,wait=off \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor,server=on,wait=off \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pcie-root-port,port=0x10,chassis=1,id=root0,bus=pcie.0,multifunction=on,addr=0x2 \
    -device pcie-root-port,port=0x11,chassis=2,id=root1,bus=pcie.0,addr=0x2.0x1 \
    -device pcie-root-port,port=0x12,chassis=3,id=root2,bus=pcie.0,addr=0x2.0x2 \
    -device pcie-root-port,port=0x13,chassis=4,id=root3,bus=pcie.0,addr=0x2.0x3 \
    -device pcie-root-port,port=0x14,chassis=5,id=root4,bus=pcie.0,addr=0x2.0x4 \
    -device pcie-root-port,port=0x15,chassis=6,id=root5,bus=pcie.0,addr=0x2.0x5 \
    -device pcie-root-port,port=0x16,chassis=7,id=root6,bus=pcie.0,addr=0x2.0x6 \
    -device pcie-root-port,port=0x17,chassis=8,id=root7,bus=pcie.0,addr=0x2.0x7 \
    -device nec-usb-xhci,id=usb1,bus=root0 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=root1 \
    -device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 \
    -device virtio-net-pci,mac=9a:8a:8b:8c:8d:8e,id=net0,vectors=4,netdev=tap0,bus=root2 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -device vmcoreinfo \
    -blockdev driver=file,cache.direct=on,cache.no-flush=off,filename=/home/test/rhel900-64-virtio-scsi.qcow2,node-name=drive_sys1 \
    -blockdev driver=qcow2,node-name=drive_image1,file=drive_sys1 \
    -netdev tap,id=tap0,vhost=on \
    -m 4096 \
    -smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
    -vnc :10 \
    -rtc base=utc,clock=host \
    -boot menu=off,strict=off,order=cdn,once=c \
    -enable-kvm \
    -qmp tcp:0:3333,server=on,wait=off \
    -qmp tcp:0:9999,server=on,wait=off \
    -qmp tcp:0:9888,server=on,wait=off \
    -serial tcp:0:4444,server=on,wait=off \
    -monitor stdio
```

Stack trace of thread 10573:
```
#4  0x00007f90a8a5f3a6 __assert_fail (libc.so.6 + 0x3b3a6)
#5  0x000055d85ef82a24 virtio_net_pre_save (qemu-kvm + 0x5e5a24)
#6  0x000055d85ecef406 vmstate_save_state_v (qemu-kvm + 0x352406)
#7  0x000055d85ee95ea0 qemu_savevm_state_complete_precopy_non_iterable (qe>
#8  0x000055d85ee967bc qemu_savevm_state_complete_precopy (qemu-kvm + 0x4f>
#9  0x000055d85ed4d6a2 migration_thread (qemu-kvm + 0x3b06a2)
```

Comment (Dr. David Alan Gilbert)

OK, so it's an assert failure in the migration thread; I think the assert is this one:

```c
static int virtio_net_pre_save(void *opaque)
{
    VirtIONet *n = opaque;

    /* At this point, backend must be stopped, otherwise
     * it might keep writing to memory. */
    assert(!n->vhost_started);
    return 0;
}
```

So for some reason the VM is apparently running during the device save, when it should have been stopped. I think this is a fight between migration stopping the VM just before the device save, and dump restarting the VM after it has finished the dump.

dump_init() does:

```c
if (runstate_is_running()) {
    vm_stop(RUN_STATE_SAVE_VM);
    s->resume = true;
} else {
    s->resume = false;
}
```

and dump_cleanup() does:

```c
if (s->resume) {
    if (s->detached) {
        qemu_mutex_lock_iothread();
    }
    vm_start();
    if (s->detached) {
        qemu_mutex_unlock_iothread();
    }
}
```

I'm not quite sure what's going on with the locking, but I suspect we get:

a) Migration starts.
b) Dump starts.
c) Dump pauses the VM but notes that it was running.
d) Dump takes the big lock, perhaps (?), but migration keeps running its main part.
e) Migration, just before completion, stops the VM.
f) Dump finishes and restarts the VM.
g) Migration calls virtio_net's pre-save code, which gets upset that the VM is running.

But hmm, I thought they both took the iothread lock; e.g. migration_completion takes the lock, stops the VM, and then calls qemu_savevm_state_complete_precopy, so it should be safe. Hmm. We could worry about debugging that locking, or we could just ban dumping from starting during migration.

Comment

(In reply to Dr. David Alan Gilbert from comment #3)
> we could worry about debugging that locking; or we could just ban dumping
> from starting during migration.

Maybe also vice versa? We should also disable live migration during a guest dump. Fundamentally, the VM state cached by either guest-dump or live-migration could be racy if either one starts while the other is in progress... migrate_add_blocker() from guest-dump looks like the right thing to me.

QE bot (pre-verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment (verification)

Dumping guest memory during migration now fails with the QMP error below on qemu-kvm-6.2.0-1.el9.x86_64, and migration can succeed:

```
{"execute": "dump-guest-memory", "arguments": {"paging": false, "protocol": "file:/home/dump.elf", "format": "elf"}, "id": "VKVBB3mz"}
{"id": "VKVBB3mz", "error": {"class": "GenericError", "desc": "disallowing migration blocker (migration/snapshot in progress) for: Live migration disabled: dump-guest-memory in progress"}}
```

AFAIK we can't start migration while a dump is in progress, so there is no need to test that scenario in this bz. Also found that this bz has been fixed on RHEL 8.6.0 under qemu-kvm-6.2.0-1.module+el8.6.0+13725+61ae1949.x86_64. Marking the bz verified per the above tests.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (new packages: qemu-kvm), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2307
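The mutual exclusion described above (dump-guest-memory registers a migration blocker, and a blocker cannot be registered once migration is already in flight) can be sketched in plain C. This is a minimal, self-contained simulation of the idea only; the `sketch_*` names and the global flags are hypothetical and do not reflect QEMU's real migrate_add_blocker() API or data structures.

```c
#include <stddef.h>

/* Hypothetical simplified state: a single blocker slot plus a
 * migration-in-flight flag. QEMU's real code keeps a list of
 * blockers with Error objects describing each reason. */
static const char *blocker_reason = NULL;  /* non-NULL while a dump is running */
static int migration_in_flight = 0;

/* Register a blocker; fails if migration is already running, which is
 * what produces the "disallowing migration blocker" error in the bug. */
int sketch_add_blocker(const char *reason)
{
    if (migration_in_flight) {
        return -1;  /* caller must fail the dump-guest-memory request */
    }
    blocker_reason = reason;
    return 0;
}

void sketch_del_blocker(void)
{
    blocker_reason = NULL;
}

/* Starting migration fails while any blocker (e.g. a dump) is active. */
int sketch_migrate_start(void)
{
    if (blocker_reason != NULL) {
        return -1;  /* "Live migration disabled: dump-guest-memory in progress" */
    }
    migration_in_flight = 1;
    return 0;
}

void sketch_migrate_finish(void)
{
    migration_in_flight = 0;
}
```

Either ordering of the race is closed: a dump started first blocks migration until it deletes its blocker, and a migration started first makes the dump's blocker registration (and hence the dump) fail, matching the QMP error seen during verification.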