Bug 1809380
Summary: | guest hang during reboot process after migration from RHEL7.8 to RHEL8.2.0.
---|---|---|---
Product: | Red Hat Enterprise Linux Advanced Virtualization | Reporter: | menli <menli>
Component: | qemu-kvm | Assignee: | Dr. David Alan Gilbert <dgilbert>
qemu-kvm sub component: | Live Migration | QA Contact: | jingzhao <jinzhao>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | medium
Priority: | high | CC: | ailan, ddepaula, dgilbert, jinzhao, juzhang, lijin, virt-maint, xiagao, xianwang, ymankad, yvugenfi
Version: | 8.2 | Keywords: | Regression, Triaged
Target Milestone: | rc
Target Release: | 8.0
Hardware: | Unspecified
OS: | Unspecified
Whiteboard: |
Fixed In Version: | qemu-kvm-4.2.0-15.module+el8.2.0+6029+618ef2ec | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-05-05 09:57:51 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: |
Attachments: |

The guest can reboot normally after migration from RHEL7.8 to RHEL8.1.1.

host (RHEL8.1.1):
qemu-kvm-4.1.0-23.module+el8.1.1+5748+5fcc84a8.1.x86_64
kernel-4.18.0-147.5.1.el8_1.x86_64
seabios-bin-1.12.0-5.module+el8.1.1+5309+6d656f05.noarch

It is about live migration, thanks.

Hi,
a) Can you confirm the command line you're using, please? '-machine pc' shouldn't work between versions; you normally have to specify a machine type, e.g. -machine pc-i440fx-rhel7.6.0
b) Please confirm the CPU hardware on source and dest.
c) Does either qemu give any error messages or warnings? Please provide logs.

Hi,
a) Yes, I have specified a machine type, -machine pc-i440fx-rhel7.6.0, on both source and dest.
b) source: model name : Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
   dest: model name : Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz
c) Neither qemu gives any error messages or warnings.

OK, that should work.

Created attachment 1668902 [details]
linux image hang

OK, failure with a Linux guest as comment 8 is much easier for me - I'll take a look.

Reproduced here on my rhel7->rhel8 boxes using a rhel7 guest. Info registers after it's hung:

    XMM06=00000000000000000000000000000000 XMM07=00000000000000000000000000000000
    (qemu) info registers
    EAX=00002000 EBX=00000000 ECX=000f4200 EDX=00000604
    ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000fc0
    EIP=000f12ad EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=1
    ES =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
    CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
    SS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
    DS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
    FS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
    GS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
    LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
    TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
    GDT= 000f6980 00000037
    IDT= 000f69be 00000000
    CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
    DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
    DR6=00000000ffff0ff0 DR7=0000000000000400
    EFER=0000000000000000
    FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80
    FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
    FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
    FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
    FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
    XMM00=00000000000000000000000000000000 XMM01=00000000000000000000000000000000
    XMM02=00000000000000000000000000000000 XMM03=00000000000000000000000000000000
    XMM04=00000000000000000000000000000000 XMM05=00000000000000000000000000000000
    XMM06=00000000000000000000000000000000 XMM07=00000000000000000000000000000000

Booting the source using the bios from rhel8 ( seabios-bin-1.12.0-5.module+el8.1.0+4022+29a53beb.noarch ) works.

However, using the rhel7 bios on rhel8 also works - i.e. rhel8->rhel8 migrate+reboot is fine with a rhel7 bios.

Simplified the commandline down to:

    /usr/libexec/qemu-kvm \
    -nodefaults \
    -monitor stdio \
    -rtc base=utc,clock=host \
    -m 4096 \
    -drive file=/home/vms-402/rhel-guest-image-7.7-261.x86_64.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,discard=unmap,werror=stop,rerror=stop,aio=threads \
    -machine pc-q35-rhel7.6.0 \
    -sandbox off \
    -enable-kvm \
    -chardev file,path=/home/seabios.log,id=seabios -device isa-debugcon,chardev=seabios,iobase=0x402 \
    -device virtio-scsi-pci,id=scsi0 \
    -device scsi-hd,bus=scsi0.0,lun=0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
    -serial tcp:0:4445,server,wait \
    -smp 8,cores=1,threads=1,sockets=8 \
    -cpu SandyBridge \
    -name "mouse-vm" \
    -incoming tcp::8888

so far, and it's still failing. Not managed to get it to fail on upstream yet.

Had it reproduce on a rhel8 host running upstream 2.12.1->upstream 4.2.0 with:

    ./x86_64-softmmu/qemu-system-x86_64 \
    -nodefaults \
    -monitor stdio \
    -rtc base=utc,clock=host \
    -m 4096 \
    -drive file=/home/vms-402/rhel-guest-image-7.7-261.x86_64.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,discard=unmap,werror=stop,rerror=stop,aio=threads \
    -machine pc-q35-2.12 \
    -sandbox off \
    -enable-kvm \
    -chardev file,path=/home/seabios.log,id=seabios -device isa-debugcon,chardev=seabios,iobase=0x402 \
    -device virtio-scsi-pci,id=scsi0 \
    -device scsi-hd,bus=scsi0.0,lun=0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
    -serial tcp:0:4445,server,wait \
    -smp 8,cores=1,threads=1,sockets=8 \
    -cpu SandyBridge \
    -name "mouse-vm" \
    -incoming tcp::8888

and it doesn't reproduce on 2.12.1->upstream 4.1.1; a good starting place for a bisect tomorrow.

(In reply to Dr. David Alan Gilbert from comment #19)
> and it doesn't reproduce on 2.12.1->upstream 4.1.1; a good starting place
> for a bisect tomorrow.

It looks like the offending commit is this one:
355477f8c73e9c6b60704c57472c71393ff39bca (migration: do not rom_reset() during incoming migration)

(In reply to Yan Vugenfirer from comment #20)
> It looks like that the offending commit is this one:
> 355477f8c73e9c6b60704c57472c71393ff39bca (migration: do not rom_reset()
> during incoming migration)

That does make sense; but why is the failure so specific? Wouldn't we expect it to fail all migration->reboots on 4.2?

(In reply to Dr. David Alan Gilbert from comment #23)
> That does make sense; but why is the failure so specific, wouldn't we expect
> it to fail all migration->reboots on 4.2 ?

Hi Dave,
I have tried this scenario on ppc (rhel7.8->rhelav8.2.0) but I did not hit this issue, so this bug is x86 only. The following are the test steps.

Host A:
3.10.0-1127.5.el7.ppc64le
qemu-kvm-rhev-2.12.0-44.el7_8.1.ppc64le
SLOF-20171214-3.gitfa98132.el7.noarch

Host B:
4.18.0-186.2.el8.ppc64le
qemu-kvm-4.2.0-13.module+el8.2.0+5898+fb4bceae.ppc64le
SLOF-20191022-3.git899d9883.module+el8.2.0+5449+efc036dd.noarch

1. Boot a guest as follows:

    /usr/libexec/qemu-kvm \
    -rtc base=utc,clock=host \
    -qmp tcp:0:3333,server,nowait \
    -qmp tcp:0:9999,server,nowait \
    -enable-kvm \
    -watchdog i6300esb \
    -monitor stdio \
    -boot order=cdn,once=c,menu=on,strict=on \
    -vga std \
    -smp 8,cores=4,threads=1,sockets=2 \
    -machine pseries-rhel7.6.0 \
    -vnc :10 \
    -device spapr-pci-host-bridge,index=1 \
    -device virtio-scsi-pci,bus=pci.1,id=scsi0,addr=0x3 \
    -device spapr-pci-host-bridge,index=2 \
    -device virtio-scsi-pci,id=scsi1,multifunction=on,bus=pci.2,addr=0x03.1 \
    -device virtio-serial-pci,id=virtio-serial0,max_ports=32 \
    -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 \
    -device nec-usb-xhci,id=usb1,bus=pci.0,addr=0x05 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x07 \
    -device virtio-scsi-pci,id=virtio_scsi_pci1,bus=pci.0,addr=0x08 \
    -device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 \
    -device spapr-vscsi,reg=0x71000001,id=scsi3 \
    -device virtio-net-pci,mac=9a:4f:50:51:52:53,id=id9HRc5V,netdev=idjlQN53,vectors=10,mq=on,status=on,bus=pci.0,addr=0xa \
    -device spapr-vlan,mac=9a:4f:50:51:52:54,netdev=hostnet0,id=net0,reg=0x71000002 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -device usb-mouse,id=input1,bus=usb1.0,port=2 \
    -device usb-kbd,id=input2,bus=usb1.0,port=3 \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0xb \
    -nodefaults \
    -name "mouse-vm" \
    -netdev tap,id=idjlQN53,vhost=on,queues=4,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
    -netdev tap,id=hostnet0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
    -m 4096,slots=256,maxmem=32G \
    -drive file=/home/xianwang/rhel78-ppc64le-virtio-scsi.qcow2,if=none,id=drive_image1,snapshot=off,aio=threads,cache=none,format=qcow2 \
    -sandbox off \
    -chardev socket,id=serial_id_serial0,path=/tmp/console0,server,nowait \

2. Boot an incoming guest on the dst host.
3. Migrate from hostA to hostB.
4. Reboot the guest.

Result:
The guest works well after migration and reboots successfully.

(In reply to xianwang from comment #24)
> I have tried this scenario on ppc (rhel7.8->rhelav8.2.0)but I did not hit
> this issue, so, this bug is x86 only.

Thanks for checking; x86's reboot process is quite odd, so it doesn't surprise me that it's x86 only.

4.2.0->4.2.0 works; however, a 4.2.0 source using the older 2.12 bios->4.2.0 breaks.

(In reply to Yan Vugenfirer from comment #20)
> It looks like that the offending commit is this one:
> 355477f8c73e9c6b60704c57472c71393ff39bca (migration: do not rom_reset()
> during incoming migration)

I've confirmed reverting it does fix it - now I'd just like to spend more time understanding why.
If we need a release-time hack then we can just revert it.

I'm guessing there's something different about how the aliased bios rom ends up because of the reset, which I hadn't expected.

(In reply to Dr. David Alan Gilbert from comment #27)
> I've confirmed reverting it does fix it - now I'd just like to spend more
> time understanding why.
> If we need a release-time hack then we can just revert it.
>
> I'm guessing there's something different about how the aliased bios rom ends
> up because of the reset, which I hadn't expected.

The commit message mentioned that the commit was needed for arm64 VMs. Maybe, as a quick fix, execute the offending code based on the platform or machine type?

+if (runstate_check(RUN_STATE_INMIGRATE))
+    return;
+

Sure, as a quick hack - but I actually want to understand what state is different on x86. I mean, the theory is that the incoming migration will overwrite all of the ROM anyway, so resetting it shouldn't be necessary. But the fact the reset is failing makes me think that something involving the rom banking is different.

As expected, this fails if I just limit the inmigrate check to the bios-256k.bin

OK, I see the problem. rom_reset gets called:
a) At initialisation
b) On a reboot

The problem is that the inmigrate check happens on (a) and skips the rom_reset, but this also leaves the rom data allocated; rom_reset normally frees the rom data. When (b) happens later on, because it's actually still got rom data, it does the memcpy and overwrites the rom. It's the rom_free that needs to happen even in our skip case.

Posted upstream: exec/rom_reset: Free rom data during inmigrate skip

set blocker? - this bug causes a hang at 'reboot' of the guest after a migration; this ends up being latent, i.e. you do a migrate and some time later try to do the reboot and it will hang.

Please test: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27247232
It seems to work for me.

Please test this one instead: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27249921
for the v2 I've just posted upstream.

Setting ITR=8.2.0 as this patch was sent for 8.2.0.
But we have to consider moving to 8.2.1 if exception+ is not granted.

QA_ACK, please?

Hi Dave,
Used qemu-kvm-4.2.0-14.el8.bz1809380b.x86_64 to test; didn't reproduce it, thanks.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:2017

Created attachment 1667103 [details]
guest hang during reboot process

Description of problem:
The guest hangs during the reboot process after migration from RHEL7.8 to RHEL8.2.0.

Version-Release number of selected component (if applicable):
Host(RHEL7.8):
kernel-3.10.0-1127.el7.x86_64
qemu-kvm-rhev-2.12.0-44.el7.x86_64
Host(RHEL8.2.0):
qemu-kvm-tests-4.2.0-12.module+el8.2.0+5858+afd073bc.x86_64
kernel-4.18.0-180.el8.x86_64
Guest: windows 2019

How reproducible:
3/3

Steps to Reproduce:
1. Boot the guest with the command line below.

    /usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm5' \
    -machine pc \
    -nodefaults \
    -device VGA,bus=pci.0 \
    -drive id=drive_cd1,if=none,snapshot=off,aio=threads,cache=none,file=Win2019.qcow2 \
    -device ide-hd,id=cd1,drive=drive_cd1,bus=ide.0 \
    -device virtio-net-pci,mac=9a:36:83:b6:3d:05,id=idJVpmsF,netdev=id23ZUK6,bus=pci.0 \
    -netdev tap,script=/etc/qemu-ifup,downscript=no,id=id23ZUK6,vhost=on \
    -m 6G \
    -smp 8,maxcpus=24,cores=12,threads=1,sockets=2 \
    -cpu 'SandyBridge' \
    -cdrom /home/kvm_autotest_root/iso/windows/virtio-win-1.9.10-3.el7.iso \
    -vnc :1 \
    -rtc base=localtime,clock=host,driftfix=slew \
    -boot order=cdn,once=c,menu=off,strict=off \
    -enable-kvm \
    -qmp tcp:0:1231,server,nowait \
    -monitor stdio \

2. Migrate from RHEL7.8 to RHEL8.2.0.
3. Reboot the guest once migration completes.

Actual results:
The guest hangs during the reboot process.

Expected results:
The guest can reboot normally.

Additional info:
The guest can reboot normally after migration from RHEL7.8 to RHEL8.1