Bug 1809380

Summary: guest hang during reboot process after migration from RHEL7.8 to RHEL8.2.0.
Product: Red Hat Enterprise Linux Advanced Virtualization
Component: qemu-kvm
Sub component: Live Migration
Reporter: menli <menli>
Assignee: Dr. David Alan Gilbert <dgilbert>
QA Contact: jingzhao <jinzhao>
Status: CLOSED ERRATA
Severity: medium
Priority: high
CC: ailan, ddepaula, dgilbert, jinzhao, juzhang, lijin, virt-maint, xiagao, xianwang, ymankad, yvugenfi
Version: 8.2
Keywords: Regression, Triaged
Target Milestone: rc
Target Release: 8.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: qemu-kvm-4.2.0-15.module+el8.2.0+6029+618ef2ec
Last Closed: 2020-05-05 09:57:51 UTC
Type: Bug
Attachments:
Description                         Flags
guest hang during reboot process    none
linux image hang                    none

Description menli@redhat.com 2020-03-03 01:01:32 UTC
Created attachment 1667103 [details]
guest hang during reboot process

Description of problem:

Guest hangs during the reboot process after migration from RHEL7.8 to RHEL8.2.0.


Version-Release number of selected component (if applicable):

Host(RHEL7.8):
kernel-3.10.0-1127.el7.x86_64
qemu-kvm-rhev-2.12.0-44.el7.x86_64

Host(RHEL8.2.0):
qemu-kvm-tests-4.2.0-12.module+el8.2.0+5858+afd073bc.x86_64
kernel-4.18.0-180.el8.x86_64

Guest:
Windows 2019


How reproducible:
3/3


Steps to Reproduce:

1. Boot the guest with the command line below.
 /usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm5' \
    -machine pc  \
    -nodefaults \
    -device VGA,bus=pci.0 \
    -drive id=drive_cd1,if=none,snapshot=off,aio=threads,cache=none,file=Win2019.qcow2 \
    -device ide-hd,id=cd1,drive=drive_cd1,bus=ide.0 \
    -device virtio-net-pci,mac=9a:36:83:b6:3d:05,id=idJVpmsF,netdev=id23ZUK6,bus=pci.0 \
    -netdev tap,script=/etc/qemu-ifup,downscript=no,id=id23ZUK6,vhost=on \
    -m 6G  \
    -smp 8,maxcpus=24,cores=12,threads=1,sockets=2  \
    -cpu 'SandyBridge'   \
    -cdrom /home/kvm_autotest_root/iso/windows/virtio-win-1.9.10-3.el7.iso  \
    -vnc :1 \
    -rtc base=localtime,clock=host,driftfix=slew  \
    -boot order=cdn,once=c,menu=off,strict=off \
    -enable-kvm \
    -qmp tcp:0:1231,server,nowait \
    -monitor stdio

2. Migrate the guest from the RHEL7.8 host to the RHEL8.2.0 host.

3. Reboot the guest once migration completes.
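
Steps 2 and 3 could be driven over the QMP socket that the command line in step 1 opens (-qmp tcp:0:1231). The following is only a minimal sketch, not the procedure the reporter used. It assumes the destination qemu was started with the same devices plus an -incoming tcp:... option; the host name, the port and the helper names are invented for illustration. The report does not say how the guest was rebooted, so using system_reset is an assumption as well.

```python
import json


def qmp_line(execute, **arguments):
    """Serialise one QMP command as a single JSON line."""
    msg = {"execute": execute}
    if arguments:
        msg["arguments"] = arguments
    return json.dumps(msg)


def source_commands(dest_host, port):
    """Commands for the source monitor: negotiate, start migration, poll status."""
    return [
        qmp_line("qmp_capabilities"),
        qmp_line("migrate", uri=f"tcp:{dest_host}:{port}"),
        qmp_line("query-migrate"),
    ]


def destination_commands():
    """Commands for the destination monitor once migration has completed."""
    return [
        qmp_line("qmp_capabilities"),
        qmp_line("system_reset"),  # assumed stand-in for "reboot guest"
    ]


if __name__ == "__main__":
    # "dst.example" and 8888 are placeholders, not values from the report.
    for line in source_commands("dst.example", 8888) + destination_commands():
        print(line)
```

Each printed line can be written to the corresponding QMP socket. query-migrate would need to be repeated until it reports a completed status before the reset is sent.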


Actual results:
The guest hangs during the reboot process.


Expected results:
The guest reboots normally.


Additional info:

The guest reboots normally after migration from RHEL7.8 to RHEL8.1.

Comment 1 menli@redhat.com 2020-03-03 01:55:49 UTC
The guest reboots normally after migration from RHEL7.8 to RHEL8.1.1.

Host(RHEL8.1.1):

qemu-kvm-4.1.0-23.module+el8.1.1+5748+5fcc84a8.1.x86_64
kernel-4.18.0-147.5.1.el8_1.x86_64
seabios-bin-1.12.0-5.module+el8.1.1+5309+6d656f05.noarch

Comment 3 menli@redhat.com 2020-03-09 01:10:14 UTC
It is about live migration, thanks.

Comment 5 Dr. David Alan Gilbert 2020-03-09 12:26:05 UTC
Hi,
  a) Can you confirm the command line you're using please; '-machine pc' shouldn't work between versions;
     you normally have to specify a machine type, e.g. -machine pc-i440fx-rhel7.6.0

  b) Please confirm the CPU hardware on source and dest
  c) Does either qemu give any error messages or warnings out? Please provide logs.

Comment 6 menli@redhat.com 2020-03-10 01:16:03 UTC
Hi,
  a) Yes, I specified a machine type, -machine pc-i440fx-rhel7.6.0, on both source and dest.

  b) source: model name : Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
     dest:   model name : Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz
  c) Neither qemu gives any error messages or warnings.

Comment 7 Dr. David Alan Gilbert 2020-03-10 09:56:33 UTC
OK, that should work

Comment 9 jingzhao 2020-03-10 10:34:27 UTC
Created attachment 1668902 [details]
linux image hang

Comment 12 Dr. David Alan Gilbert 2020-03-10 10:41:10 UTC
OK, failure with a Linux guest as comment 8 is much easier for me - I'll take a look

Comment 15 Dr. David Alan Gilbert 2020-03-10 14:41:33 UTC
Reproduced here on my rhel7->rhel8 boxes using a rhel7 guest.
Info registers after it's hung:

(qemu) info registers
EAX=00002000 EBX=00000000 ECX=000f4200 EDX=00000604
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000fc0
EIP=000f12ad EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=1
ES =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
FS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
GS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT=     000f6980 00000037
IDT=     000f69be 00000000
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
XMM00=00000000000000000000000000000000 XMM01=00000000000000000000000000000000
XMM02=00000000000000000000000000000000 XMM03=00000000000000000000000000000000
XMM04=00000000000000000000000000000000 XMM05=00000000000000000000000000000000
XMM06=00000000000000000000000000000000 XMM07=00000000000000000000000000000000
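
One way to read the dump above is to decode a few of its fields. The sketch below is illustrative only: the helper names are invented, and the bit positions are the standard x86 CR0 and EFLAGS definitions rather than anything taken from this report. It shows a halted CPU (HLT=1) in protected mode without paging, with interrupts disabled, stopped at an address inside the conventional 0xF0000-0xFFFFF BIOS area, which would be consistent with the guest stopping in firmware rather than in the OS.

```python
# Architectural CR0 flag positions (bit index -> name).
CR0_BITS = {
    0: "PE", 1: "MP", 2: "EM", 3: "TS", 4: "ET", 5: "NE",
    16: "WP", 18: "AM", 29: "NW", 30: "CD", 31: "PG",
}
EFLAGS_IF = 1 << 9  # interrupt-enable flag


def decode_cr0(value):
    """Return the names of the CR0 flags set in value, lowest bit first."""
    return [name for bit, name in sorted(CR0_BITS.items()) if value & (1 << bit)]


def in_legacy_bios_area(linear_addr):
    """True if the address lies in the top 64 KiB below 1 MiB."""
    return 0xF0000 <= linear_addr <= 0xFFFFF


def summarise(cr0, cs_base, eip, eflags):
    """Collect the decoded fields into one dictionary."""
    return {
        "cr0_flags": decode_cr0(cr0),
        "paging": bool(cr0 & (1 << 31)),
        "interrupts_enabled": bool(eflags & EFLAGS_IF),
        "in_bios_area": in_legacy_bios_area(cs_base + eip),
    }


if __name__ == "__main__":
    # Values copied from the register dump: CR0, CS base, EIP, EFL.
    print(summarise(cr0=0x00000011, cs_base=0x00000000, eip=0x000F12AD, eflags=0x00000002))
```

A CPU that executes HLT with the interrupt flag clear is not woken by ordinary interrupts, so this state on its own does not say why the firmware stopped.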

Comment 16 Dr. David Alan Gilbert 2020-03-10 16:51:14 UTC
booting the source using the bios from rhel8 ( seabios-bin-1.12.0-5.module+el8.1.0+4022+29a53beb.noarch )
works.
However using the rhel7 bios on rhel8 also works - i.e. rhel8->rhel8 migrate+reboot is fine with a rhel7 bios.

Comment 17 Dr. David Alan Gilbert 2020-03-10 18:30:58 UTC
Simplified the commandline down to:
/usr/libexec/qemu-kvm  \
-nodefaults  \
-monitor stdio \
-rtc base=utc,clock=host \
-m 4096 \
-drive file=/home/vms-402/rhel-guest-image-7.7-261.x86_64.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,discard=unmap,werror=stop,rerror=stop,aio=threads \
-machine pc-q35-rhel7.6.0 \
-sandbox off \
-enable-kvm  \
-chardev file,path=/home/seabios.log,id=seabios -device isa-debugcon,chardev=seabios,iobase=0x402 \
-device virtio-scsi-pci,id=scsi0 \
-device scsi-hd,bus=scsi0.0,lun=0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-serial tcp:0:4445,server,wait \
-smp 8,cores=1,threads=1,sockets=8 \
-cpu SandyBridge \
-name "mouse-vm" \
-incoming tcp::8888

so far, and it's still failing.  Not managed to get it to fail on upstream yet.

Comment 18 Dr. David Alan Gilbert 2020-03-10 19:48:52 UTC
Had it reproduce on a rhel8 host running upstream 2.12.1->upstream 4.2.0 with:
./x86_64-softmmu/qemu-system-x86_64  \
-nodefaults  \
-monitor stdio \
-rtc base=utc,clock=host \
-m 4096 \
-drive file=/home/vms-402/rhel-guest-image-7.7-261.x86_64.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,discard=unmap,werror=stop,rerror=stop,aio=threads \
-machine pc-q35-2.12 \
-sandbox off \
-enable-kvm  \
-chardev file,path=/home/seabios.log,id=seabios -device isa-debugcon,chardev=seabios,iobase=0x402 \
-device virtio-scsi-pci,id=scsi0 \
-device scsi-hd,bus=scsi0.0,lun=0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-serial tcp:0:4445,server,wait \
-smp 8,cores=1,threads=1,sockets=8 \
-cpu SandyBridge \
-name "mouse-vm" \
-incoming tcp::8888

Comment 19 Dr. David Alan Gilbert 2020-03-10 20:03:16 UTC
and it doesn't reproduce on 2.12.1->upstream 4.1.1;  a good starting place for a bisect tomorrow.

Comment 20 Yvugenfi@redhat.com 2020-03-10 22:28:21 UTC
(In reply to Dr. David Alan Gilbert from comment #19)
> and it doesn't reproduce on 2.12.1->upstream 4.1.1;  a good starting place
> for a bisect tomorrow.

It looks like the offending commit is this one: 355477f8c73e9c6b60704c57472c71393ff39bca (migration: do not rom_reset() during incoming migration)

Comment 23 Dr. David Alan Gilbert 2020-03-11 09:15:13 UTC
(In reply to Yan Vugenfirer from comment #20)
> (In reply to Dr. David Alan Gilbert from comment #19)
> > and it doesn't reproduce on 2.12.1->upstream 4.1.1;  a good starting place
> > for a bisect tomorrow.
> 
> It looks like that the offending commit is this one:
> 355477f8c73e9c6b60704c57472c71393ff39bca (migration: do not rom_reset()
> during incoming migration)

That does make sense; but why is the failure so specific, wouldn't we expect it to fail all migration->reboots on 4.2 ?

Comment 24 xianwang 2020-03-11 10:22:10 UTC
(In reply to Dr. David Alan Gilbert from comment #23)
> (In reply to Yan Vugenfirer from comment #20)
> > (In reply to Dr. David Alan Gilbert from comment #19)
> > > and it doesn't reproduce on 2.12.1->upstream 4.1.1;  a good starting place
> > > for a bisect tomorrow.
> > 
> > It looks like that the offending commit is this one:
> > 355477f8c73e9c6b60704c57472c71393ff39bca (migration: do not rom_reset()
> > during incoming migration)
> 
> That does make sense; but why is the failure so specific, wouldn't we expect
> it to fail all migration->reboots on 4.2 ?

Hi, Dave,
I tried this scenario on ppc (rhel7.8->rhelav8.2.0) but did not hit this issue, so this bug appears to be x86 only. The test steps follow.
Host A:
3.10.0-1127.5.el7.ppc64le
qemu-kvm-rhev-2.12.0-44.el7_8.1.ppc64le
SLOF-20171214-3.gitfa98132.el7.noarch

Host B:
4.18.0-186.2.el8.ppc64le
qemu-kvm-4.2.0-13.module+el8.2.0+5898+fb4bceae.ppc64le
SLOF-20191022-3.git899d9883.module+el8.2.0+5449+efc036dd.noarch

1. Boot a guest as follows:
/usr/libexec/qemu-kvm  \
-rtc base=utc,clock=host \
-qmp tcp:0:3333,server,nowait \
-qmp tcp:0:9999,server,nowait \
-enable-kvm  \
-watchdog i6300esb \
-monitor stdio \
-boot order=cdn,once=c,menu=on,strict=on \
-vga std \
-smp 8,cores=4,threads=1,sockets=2 \
-machine pseries-rhel7.6.0 \
-vnc :10 \
-device spapr-pci-host-bridge,index=1 \
-device virtio-scsi-pci,bus=pci.1,id=scsi0,addr=0x3 \
-device spapr-pci-host-bridge,index=2 \
-device virtio-scsi-pci,id=scsi1,multifunction=on,bus=pci.2,addr=0x03.1 \
-device virtio-serial-pci,id=virtio-serial0,max_ports=32 \
-device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 \
-device nec-usb-xhci,id=usb1,bus=pci.0,addr=0x05 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x07 \
-device virtio-scsi-pci,id=virtio_scsi_pci1,bus=pci.0,addr=0x08 \
-device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 \
-device spapr-vscsi,reg=0x71000001,id=scsi3 \
-device virtio-net-pci,mac=9a:4f:50:51:52:53,id=id9HRc5V,netdev=idjlQN53,vectors=10,mq=on,status=on,bus=pci.0,addr=0xa \
-device spapr-vlan,mac=9a:4f:50:51:52:54,netdev=hostnet0,id=net0,reg=0x71000002 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-device usb-mouse,id=input1,bus=usb1.0,port=2 \
-device usb-kbd,id=input2,bus=usb1.0,port=3 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0xb \
-nodefaults  \
-name "mouse-vm" \
-netdev tap,id=idjlQN53,vhost=on,queues=4,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
-netdev tap,id=hostnet0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
-m 4096,slots=256,maxmem=32G \
-drive file=/home/xianwang/rhel78-ppc64le-virtio-scsi.qcow2,if=none,id=drive_image1,snapshot=off,aio=threads,cache=none,format=qcow2 \
-sandbox off \
-chardev socket,id=serial_id_serial0,path=/tmp/console0,server,nowait

2. Boot an incoming guest on the dst host.
3. Migrate from hostA to hostB.
4. Reboot the guest.

Result:
The guest works well after migration and reboots successfully.

Comment 25 Dr. David Alan Gilbert 2020-03-11 10:32:34 UTC
(In reply to xianwang from comment #24)
> Hi, Dave,
> I have tried this scenario on ppc (rhel7.8->rhelav8.2.0)but I did not hit
> this issue, so, this bug is x86 only.

Thanks for checking; x86's reboot process is quite odd, so it doesn't surprise me that it's x86 only.

Comment 26 Dr. David Alan Gilbert 2020-03-11 13:20:09 UTC
4.2.0->4.2.0 works
however,
4.2.0 source using the older 2.12 bios->4.2.0 breaks

Comment 27 Dr. David Alan Gilbert 2020-03-11 15:29:10 UTC
(In reply to Yan Vugenfirer from comment #20)
> (In reply to Dr. David Alan Gilbert from comment #19)
> > and it doesn't reproduce on 2.12.1->upstream 4.1.1;  a good starting place
> > for a bisect tomorrow.
> 
> It looks like that the offending commit is this one:
> 355477f8c73e9c6b60704c57472c71393ff39bca (migration: do not rom_reset()
> during incoming migration)

I've confirmed reverting it does fix it - now I'd just like to spend more time understanding why.
If we need a release-time hack then we can just revert it.

I'm guessing there's something different about how the aliased bios rom ends up because of the reset, which I hadn't expected.

Comment 28 Yvugenfi@redhat.com 2020-03-11 16:14:34 UTC
(In reply to Dr. David Alan Gilbert from comment #27)
> (In reply to Yan Vugenfirer from comment #20)
> > (In reply to Dr. David Alan Gilbert from comment #19)
> > > and it doesn't reproduce on 2.12.1->upstream 4.1.1;  a good starting place
> > > for a bisect tomorrow.
> > 
> > It looks like that the offending commit is this one:
> > 355477f8c73e9c6b60704c57472c71393ff39bca (migration: do not rom_reset()
> > during incoming migration)
> 
> I've confirmed reverting it does fix it - now I'd just like to spend more
> time understanding why.
> If we need a release-time hack then we can just revert it.
> 
> I'm guessing there's something different about how the aliased bios rom ends
> up because of the reset, which I hadn't expected.

The commit message mentions that the commit was needed for arm64 VMs.
Maybe, as a quick fix, execute the offending code only for certain platforms or machine types?
+if (runstate_check(RUN_STATE_INMIGRATE))
+        return;
+

Comment 29 Dr. David Alan Gilbert 2020-03-11 16:48:46 UTC
Sure, as a quick hack - but I actually want to understand what state is different on x86.
I mean the theory is that the incoming migration will overwrite all of the ROM anyway, so resetting it shouldn't be necessary.
But the fact the reset is failing makes me think that something involving the rom banking is different.

Comment 30 Dr. David Alan Gilbert 2020-03-13 11:27:07 UTC
As expected, this fails if I just limit the inmigrate check to the bios-256k.bin

Comment 31 Dr. David Alan Gilbert 2020-03-13 12:10:19 UTC
OK, I see the problem.
rom_reset gets called:
  a) At initialisation
  b) On a reboot

The problem is the inmigrate check happens on (a) - and skips the rom_reset,
but also this leaves the rom data allocated.  rom_reset normally frees the rom data.

When (b) happens later on, because it's actually still got rom data, it does the memcpy
and overwrites the rom.

It's the rom_free that needs to happen even in our skip case.
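
The sequence described above can be illustrated with a small toy model. This is not QEMU's code: ToyRom, the free_on_skip switch and the byte strings are invented for illustration, and "freeing" is modelled by dropping the reference to the load-time blob. It only shows why a blob left allocated by the skipped reset in (a) ends up being copied over the migrated contents in (b).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyRom:
    """Hypothetical ROM: a load-time blob plus the memory the guest sees."""
    blob: Optional[bytes]  # load-time copy; None once it has been freed
    guest_mem: bytearray   # guest-visible contents


def rom_reset(rom, inmigrate, free_on_skip):
    """Toy reset handler; free_on_skip selects the broken or the fixed variant."""
    if inmigrate:
        # The incoming migration will supply the contents, so skip the copy.
        if free_on_skip:
            rom.blob = None  # fixed variant: free the blob even when skipping
        return
    if rom.blob is None:
        return  # nothing left to copy
    rom.guest_mem[:] = rom.blob  # stands in for the memcpy
    rom.blob = None              # a ROM only needs to be written once


def migrate_then_reboot(free_on_skip):
    """Run (a) init during inmigrate, the migration itself, then (b) a reboot."""
    dest_bios = b"NEW-BIOS"  # image loaded by the destination
    src_bios = b"OLD-BIOS"   # contents that arrive in the migration stream
    rom = ToyRom(blob=dest_bios, guest_mem=bytearray(len(dest_bios)))

    rom_reset(rom, inmigrate=True, free_on_skip=free_on_skip)   # (a)
    rom.guest_mem[:] = src_bios                                 # migration lands
    rom_reset(rom, inmigrate=False, free_on_skip=free_on_skip)  # (b)
    return bytes(rom.guest_mem)


if __name__ == "__main__":
    print("blob kept on skip: ", migrate_then_reboot(free_on_skip=False))
    print("blob freed on skip:", migrate_then_reboot(free_on_skip=True))
```

In the model, keeping the blob means the reboot replaces the migrated ROM contents with the destination's image, while freeing it leaves the migrated contents in place. That would also fit the observation in comment 26 that the failure needs a source BIOS that differs from the destination's.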

Comment 32 Dr. David Alan Gilbert 2020-03-13 12:34:49 UTC
Posted upstream:
exec/rom_reset: Free rom data during inmigrate skip

set blocker? - this bug causes a hang at 'reboot' of the guest after a migration;
this ends up being latent - i.e. you do a migrate and some time later try and do the
reboot and it will hang.

Comment 35 Dr. David Alan Gilbert 2020-03-13 12:54:10 UTC
Please test:
  https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27247232

seems to work for me.

Comment 36 Dr. David Alan Gilbert 2020-03-13 16:01:11 UTC
Please test this one instead:
   https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27249921

for the v2 I've just posted upstream.

Comment 38 Danilo de Paula 2020-03-13 17:37:54 UTC
Setting ITR=8.2.0 as this patch was sent for 8.2.0
But we have to consider moving to 8.2.1 if exception+ is not granted.

Comment 39 Danilo de Paula 2020-03-13 23:59:34 UTC
QA_ACK, please?

Comment 41 menli@redhat.com 2020-03-16 02:12:54 UTC
Hi Dave

Tested with qemu-kvm-4.2.0-14.el8.bz1809380b.x86_64; did not reproduce it, thanks.

Comment 47 errata-xmlrpc 2020-05-05 09:57:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2017