Bug 1411275

Summary:	Migration from RHEL7-2-z to RHEL7-3-z on amd host, guest cannot be reboot normally
Product:	Red Hat Enterprise Linux 7	Reporter:	huiqingding <huding>
Component:	seabios	Assignee:	Virtualization Maintenance <virt-maint>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Virtualization Bugs <virt-bugs>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	7.4	CC:	amit.shah, chayang, dgilbert, hhuang, huding, jinzhao, knoel, virt-maint, xfu
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	All
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:	A boot (or more commonly reboot) would hang at a point the BIOS was waiting for user input, the BIOS should have timed out and continued. An error in initialisation order meant that under some circumstances (complex guests typically) the timer could time out before the interrupt could be delivered. The initialisation order has now been changed so that can't happen.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-04-04 17:01:21 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description huiqingding 2017-01-09 10:26:06 UTC

Description of problem:
Migration a RHEL6.8 guest from RHEL7-2-z to RHEL7-3-z on an amd host, inside guest, do reboot twice, the second time, the guest cannot reboot normally, then do system_reboot, guest can reboot.

Version-Release number of selected component (if applicable):
RHEL7-2-z host:
qemu-kvm-rhev-2.3.0-31.el7_2.25.x86_64
3.10.0-327.44.2.el7.x86_64

RHEL7-3-z host:
qemu-kvm-rhev-2.6.0-28.el7_3.3.x86_64
3.10.0-514.7.1.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. boot rhel6.8 guest in src rhel7.2.z host:
/usr/libexec/qemu-kvm \
-name rhel7 \
-S -machine rhel6.6.0,accel=kvm,usb=off \
-realtime mlock=off \
-cpu Opteron_G4,check \
-sandbox off \
-m 7680   \
-smp 8,cores=1,threads=1,sockets=8  \
-uuid 49a3438a-70a3-4ba8-92ce-3a05e0934608 \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/rhel7.monitor,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc,driftfix=slew \
-global kvm-pit.lost_tick_policy=discard \
-no-hpet \
-no-shutdown \
-global PIIX4_PM.disable_s3=1 \
-global PIIX4_PM.disable_s4=1 \
-boot order=c,menu=on,splash-time=3000,strict=on \
-device ich9-usb-ehci1,id=usb0,bus=pci.0,addr=0x5.0x7 \
-device ich9-usb-uhci1,masterbus=usb0.0,firstport=0,bus=pci.0,multifunction=on,addr=0x5 \
-device ich9-usb-uhci2,masterbus=usb0.0,firstport=2,bus=pci.0,addr=0x5.0x1 \
-device ich9-usb-uhci3,masterbus=usb0.0,firstport=4,bus=pci.0,addr=0x5.0x2 \
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x7 \
-drive file=/mnt/en_windows_10_enterprise_x64_dvd_6851151.iso,if=none,id=drive-ide0-1-0,readonly=on,format=raw,serial= \
-device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 \
-drive file=/mnt/rhel6.8.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,serial=f65effa5-90a6-47f2-8487-a9f64c95d4f5,cache=none,discard=unmap,werror=stop,rerror=stop,aio=threads \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-drive file=/mnt/virtio-blk-disk,if=none,id=drive-scsi-disk,format=qcow2,cache=none,werror=stop,rerror=stop \
-device virtio-scsi-pci,id=scsi1,addr=0x13 \
-device scsi-hd,drive=drive-scsi-disk,bus=scsi1.0,id=data-disk2 \
-netdev tap,id=hostnet0,vhost=on,id=hostnet0,script=/etc/qemu-ifup \
-device virtio-net-pci,netdev=hostnet0,id=virtio-net-pci0,mac=00:52:4c:23:4d:80,bus=pci.0,addr=0x14 \
-netdev tap,id=hostnet2,vhost=on,script=/etc/qemu-ifup \
-device rtl8139,netdev=hostnet2,id=virtio-net-pci2,mac=00:52:4c:23:4d:84,bus=pci.0,addr=0x16,multifunction=off \
-monitor stdio \
-qmp tcp:0:4467,server,nowait -serial unix:/tmp/ttym,server,nowait \
-spice port=5910,addr=0.0.0.0,disable-ticketing,seamless-migration=on \
-k en-us \
-device qxl-vga,id=video0,ram_size=134217728,vram_size=67108864,vgamem_mb=16,bus=pci.0,addr=0x2 \
2.boot the guest in dst rhel7.3.z host with "-incoming tcp:0:5800"
3.do migration
4. inside guest
# reboot
5. do system_reboot

Actual results:
After step4, guest cannot reboot normally, the console shows the following message and Keyboard doesn't work.

SeaBIOS (version seabios-1.7.5-11.el7)
Machine UUID 49a3438a-70a3-4ba8-92ce-3a05e0934608

gPXE(http://etherboot.org) - 00:14.0 C900 PCI2.10 Pnp BBS PMMDFE0@10 C900
Press Ctrl-B to configure gPXE (PCI 00:14.0)

After step5, 
The guest can reboot.

Expected results:
Guest can reboot normally.

Additional info:

Comment 1 Dr. David Alan Gilbert 2017-01-09 11:04:05 UTC

Hi,
  Some questions:
     a) Does 7.2.z->7.2.z hit this bug?
     b) Does 7.3.z->7.3.z hit this bug?
     c) Does 7.2 -> 7.3 (not z) hit this bug?
     d) Does it only happen on opteron?

Dave

Comment 7 Dr. David Alan Gilbert 2017-01-10 18:38:44 UTC

I can reproduce this 7.3.z->7.3.z;
  boot a 6.9 nightly image
  migrate it to the same host
  inside the guest reboot
  inside the guest reboot a 2nd time
  <black screen>

Comment 8 Dr. David Alan Gilbert 2017-01-10 19:36:00 UTC

Hi Huiqing,
  Can you try a simpler test for me;  on the 7.3.z host, just try rebooting the guest 4 or 5 times (log in to the guest and type reboot) - no migration involved.
I have a case here where the reboot hangs even without the migration.

Comment 11 Dr. David Alan Gilbert 2017-01-11 11:40:38 UTC

using a rhel7 machine type seems to make this go away, and the ACPI DSDT= and Copying ACPI RSDP lines go away in the debug.

Comment 12 Dr. David Alan Gilbert 2017-01-12 11:40:50 UTC

just doing 7.3.z- reboots without migrate, the failures I get seems to vary a bit with different seabios builds.
Originally I was using 1.5.1-5.el7 and I was seeing just hangs after that 'Copying ACPI RSDP' line.
Then with 1.5.1-5.el7_3.1 I occasionally saw a stream of 'CPU Mhz=....' lines, but the change between those versions seems to be irrelevant and I don't see how we can get that message multiple times anyway.

Just rebuilding seabios with a different level of debug gets me a different failure mode. (A reliable KVM error - probably because it's a bad PC)

Comment 13 Dr. David Alan Gilbert 2017-01-13 16:37:17 UTC

I'm not yet sure what's going on, but it feels very much like a memory corruption - the very smallest changes to the BIOS change the behaviour.

Turning upto debug  level 9 isn't possible with the smaller rom sizes for rhel6, and so I have a build at debug level 4, where I've added CONFIG_CDROM_BOOT=n,CONFIG_CDROM_EMU=n,CONFIG_MOUSE=n to the config.base 
and added the following debug to it:

diff -urN a/src/fw/biostables.c b/src/fw/biostables.c
--- a/src/fw/biostables.c       2017-01-11 12:32:00.069748380 -0500
+++ b/src/fw/biostables.c       2016-01-18 05:13:20.000000000 -0500
@@ -116,7 +116,7 @@
         warn_noalloc();
         return;
     }
-    dprintf(1, "Copying ACPI RSDP from %p to %p\n", pos, newpos);
+    dprintf(1, "Copying ACPI RSDP from %p to %p (len=0x%x)\n", pos, newpos, length);
     memcpy(newpos, pos, length);
     RsdpAddr = newpos;
 }
diff -urN a/src/hw/timer.c b/src/hw/timer.c
--- a/src/hw/timer.c    2017-01-11 14:37:51.570684382 -0500
+++ b/src/hw/timer.c    2016-01-18 05:13:20.000000000 -0500
@@ -101,8 +101,10 @@
 void
 timer_setup(void)
 {
-    if (!CONFIG_TSC_TIMER || (CONFIG_PMTIMER && TimerPort != PORT_PIT_COUNTER0))
+    if (!CONFIG_TSC_TIMER || (CONFIG_PMTIMER && TimerPort != PORT_PIT_COUNTER0)) {
+        dprintf(1, "timer_setup early exit: TimerPort=0x%x\n", TimerPort);
         return;
+    }

     // Check if CPU has a timestamp counter
     u32 eax, ebx, ecx, edx, cpuid_features = 0;
@@ -120,7 +118,7 @@
 {
     if (!CONFIG_PMTIMER)
         return;
-    dprintf(1, "Using pmtimer, ioport 0x%x\n", ioport);
+    dprintf(1, "Using pmtimer, ioport 0x%x &TimerPort=%p\n", ioport, &TimerPort);
     TimerPort = ioport;
     TimerKHz = DIV_ROUND_UP(PMTIMER_HZ, 1000);
 }


and that fails almost every time on my test;
(I'd added that debug because the previous symptom was that tsctimer_setup  was printing debug even when my understanding was it was using pm timers)

from the logs it looked like it was failing just before printing the UUID, so I add:

diff -ur /root/src-withmymods/fw/biostables.c src/fw/biostables.c
--- a/src-withmymods/fw/biostables.c    2017-01-11 12:32:00.069748380 -0500
+++ b/src/fw/biostables.c       2017-01-13 11:19:06.923168822 -0500
@@ -297,9 +297,12 @@
 void
 display_uuid(void)
 {
+    dprintf(1, "display_uuid entry\n");
     struct smbios_type_1 *tbl = smbios_next(SMBiosAddr, NULL);
     int minlen = offsetof(struct smbios_type_1, uuid) + sizeof(tbl->uuid);
+    dprintf(1, "display_uuid preloop tbl=%p minlen=%d\n", tbl, minlen);
     for (; tbl; tbl = smbios_next(SMBiosAddr, tbl))
+        dprintf(1, "display_uuid loop top tbl=%p\n", tbl);
         if (tbl->header.type == 1 && tbl->header.length >= minlen) {
             u8 *uuid = tbl->uuid;
             u8 empty_uuid[sizeof(tbl->uuid)] = { 0 };


and it's now solid as a rock.

Comment 14 Dr. David Alan Gilbert 2017-01-13 18:48:46 UTC

Note my current minimal failure line is:

/usr/libexec/qemu-kvm \
-name rhel7 \
-S -machine rhel6.6.0,accel=kvm,usb=off,smm=off \
-realtime mlock=off \
-cpu Opteron_G4,check \
-sandbox off \
-m 7680   \
-smp 1, \
-uuid 49a3438a-70a3-4ba8-92ce-3a05e0934608 \
-no-user-config \
-nodefaults \
-rtc base=utc,driftfix=slew \
-global kvm-pit.lost_tick_policy=discard \
-no-hpet \
-no-shutdown \
-global PIIX4_PM.disable_s3=1 \
-global PIIX4_PM.disable_s4=1 \
-boot order=c,menu=on,splash-time=3000,strict=on \
-drive file=rhel-guest-image-6.9-28.x86_64.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,serial=f65effa5-90a6-47f2-8487-a9f64c95d4f5,cache=none,discard=unmap,werror=stop,rerror=stop,aio=threads \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-chardev stdio,mux=on,id=mon -mon chardev=mon,mode=readline --device isa-serial,chardev=mon \
-k en-us \
-device VGA \
-chardev file,id=bioslog,path=bioslog \
-device isa-debugcon,iobase=0x402,chardev=bioslog \
$*

Comment 15 Dr. David Alan Gilbert 2017-01-17 09:52:53 UTC

as pointed out by Laszlo, may be related to David's bug 1382906.

Comment 16 Dr. David Alan Gilbert 2017-01-17 18:24:20 UTC

There's lots of different hangs I get as I add different debug - but by far the commonest is it coming back to hanging in that display_uuid loop

Comment 17 Dr. David Alan Gilbert 2017-01-20 15:52:35 UTC

so still digging around display_uuid and I objdump the ROM, and I find this code:

from disassembling src/fw/biostables.o:

  73:   f3 ab                   rep stos %eax,%es:(%edi)
  75:   8d 43 08                lea    0x8(%ebx),%eax
  78:   b1 10                   mov    $0x10,%cl
  7a:   8d 54 24 44             lea    0x44(%esp),%edx
  7e:   e8 fc ff ff ff          call   7f <display_uuid+0x7f>
  83:   85 c0                   test   %eax,%eax
  85:   0f 84 33 01 00 00       je     1be <display_uuid+0x1be>
  8b:   8b 15 00 00 00 00       mov    0x0,%edx
  91:   80 7a 06 02             cmpb   $0x2,0x6(%edx)           <-- major version
  95:   0f b6 43 17             movzbl 0x17(%ebx),%eax
  99:   77 0c                   ja     a7 <display_uuid+0xa7>
  9b:   0f 85 80 00 00 00       jne    121 <display_uuid+0x121>
  a1:   80 7a 07 05             cmpb   $0x5,0x7(%edx)           <-- minor version?
  a5:   76 7a                   jbe    121 <display_uuid+0x121> <--- old format display

now working up as it gets linked, here is from code32flat.o:
    10be:       f3 ab                   rep stos %eax,%es:(%edi)
    10c0:       8d 43 08                lea    0x8(%ebx),%eax
    10c3:       b1 10                   mov    $0x10,%cl
    10c5:       8d 54 24 74             lea    0x74(%esp),%edx
    10c9:       e8 fc ff ff ff          call   10ca <maininit+0x10ca>
    10ce:       85 c0                   test   %eax,%eax
    10d0:       0f 84 33 01 00 00       je     1209 <maininit+0x1209>
    10d6:       8b 15 00 00 00 00       mov    0x0,%edx
    10dc:       80 7a 06 02             cmpb   $0x2,0x6(%edx)
    10e0:       0f b6 43 17             movzbl 0x17(%ebx),%eax
    10e4:       77 0c                   ja     10f2 <maininit+0x10f2>
    10e6:       0f 85 80 00 00 00       jne    116c <maininit+0x116c>
    10ec:       80 7a 07 05             cmpb   $0x5,0x7(%edx)
    10f0:       76 7a                   jbe    116c <maininit+0x116c>

ok, so it looks like it got combined with maininit - fine.

Now objdump from rom.o:
   ef79d:       f3 ab                   rep stos %eax,%es:(%edi)
   ef79f:       8d 43 08                lea    0x8(%ebx),%eax
   ef7a2:       b1 10                   mov    $0x10,%cl
   ef7a4:       8d 54 24 74             lea    0x74(%esp),%edx
   ef7a8:       e8 50 46 ff ff          call   e3dfd <memcmp>
   ef7ad:       85 c0                   test   %eax,%eax
   ef7af:       0f 84 33 01 00 00       je     ef8e8 <ExtraStack+0x110>
   ef7b5:       8b 15 0c 23 0f 00       mov    0xf230c,%edx
   ef7bb:       80 7a 06 02             cmpb   $0x2,0x6(%edx)
   ef7bf:       0f b6 43 17             movzbl 0x17(%ebx),%eax
   ef7c3:       77 0c                   ja     ef7d1 <StackPos+0x1>
   ef7c5:       0f 85 80 00 00 00       jne    ef84b <ExtraStack+0x73>
   ef7cb:       80 7a 07 05             cmpb   $0x5,0x7(%edx)
   ef7cf:       76 7a                   jbe    ef84b <ExtraStack+0x73>

Hang on - why are those symbols in 'ExtraStack'; jumps from just before that point are in maininit; but we seem to have run into the '"low" memory' segment??

Comment 18 Dr. David Alan Gilbert 2017-01-20 18:56:46 UTC

OK, backup a bit.
Turning on debug on the seabios build can cause a completely different bug which is what I think I've been chasing for the last few days - I can end up with ExtraStack/data overlaying the code in maininit and that's why I get crashes in display_uuid;  so that's different from the original bug I think where the layout looks more sane.

<goes back a few days to figure out what happened>

Comment 26 Dr. David Alan Gilbert 2017-02-07 16:08:12 UTC

I added some tracing of reboot types (on upstream); it's a bit odd.
The kernel normally reboots via the keyboard controller.
That lands in the bios and that relatively quickly reboots again with the keyboard controller.

That pair happens irrespective of if I tell the guest kernel to do a cold or hot reboot.

Sometimes, we also get KVM exiting with KVM_EXIT_SHUTDOWN  which I think means we hit a triple fault - it's mostly random when we hit it, except that is it's pretty reliable that the first few reboots follows the sequence of

keyboard controller
keyboard controller
KVM_EXIT_SHUTDOWN
keyboard controller

so I think that means the first time the guest does a reboot it ends up causing a triple fault after the bios does it's reboot.

Normally we recover from the triple fault and carry on; so maybe the triple fault happens more often that we really know and we survivie it.

Comment 27 Dr. David Alan Gilbert 2017-02-15 11:22:55 UTC

The problem with the reboot seems to be understood now.

QEMU doesn't reset the PAM registers on reset, this means the ROM mapping at reset is wrong, and ends up running out of RAM instead.  SeaBIOS has a workaround that detects this and reboots again having recopied the image, but the workaround ends up corrupting 3 bytes of memory (TimerLast) and overwriting part of the new copy - if you're unlucky with the memory layout and timer value then it might hang.

I have a fix for qemu's reset and Kevin is doing a seabios fix for the workaround.

Comment 34 Dr. David Alan Gilbert 2017-03-02 12:16:09 UTC

Note I've split this into two bugs; bz 1428347 is now the 'reboot' on rhel6 machine types failing occasionally without the migration.

Comment 36 Dr. David Alan Gilbert 2017-03-06 17:58:39 UTC

I can trigger this without the migration, but by using the 7.2 bios on 7.3 qemu.
e.g. taking a copy of /usr/share/qemu-kvm/bios-256k.bin onto my 7.3 box and using 
firmware=/root/bios-256k-copied-from.72.bin   it hangs at the ESC in the same way as a VM migrated from 7.2;  so the bug isn't actually the migration process, it's just the 7.2 bios doesn't like running on 7.3
Similarly if I copy the rhel6 rtl8139 rom and use that as the 8139 ROM with a 7.3 qemu, it hangs at the 'Press Ctrl-B' point.

Comment 37 Dr. David Alan Gilbert 2017-03-07 19:56:40 UTC

Running with the old bios is 'fun'
I've already compiled in a fix for the TimerLast issue, but even so there's a repeatable hang - but at the point it's waiting for input.
It's very hard to bisect, because it seems to survive for quite a lot of reboots; initially I thought it failed after 1-2 reboots, but I've got cases of failure after 29.
The earliest failure I've seen so far is on 2.4.0, but 2.3.0 I've not seen it fail (after a pair of runs, each with ~30 reboots).

Comment 38 Dr. David Alan Gilbert 2017-03-24 20:05:23 UTC

The behaviour is basically non-bisectable on qemu version - whenever you think you've got one that fails it decides to work again.
I've got a current setup where running with the -debug build of the host kernel it hangs at the same place at boot - not at reboot.
What I think is happening there is that it gets the first PIT interrupt at a point when the APIC is masked, so it never delivers it.  However the PIT still has the IRR=1, so I think any future PIT interrupt doesn't decide to get it delivered either.

I can get it to work using no-kernel-irqchip (sp?) but who knows if that's just changing the timing.

What I've not figured out yet is if there's a state where it should be delivering the interrupt and the kernel isn't, or if there is something that the BIOS has setup wrong/hasn't cleared.

Comment 39 Dr. David Alan Gilbert 2017-04-04 17:01:21 UTC

ok, I believe I understand this (with some help from Paolo), and it's a bug in the 7.2 bios that's fixed in later bioses.


In seabios prior to ~1.8.0 - i.e. prior to upstream change bbb3fbac the initialisation order was approximately:
     timers
     PCI probe
     APIC setup (in smp_scan)

if the PCI probe was long enough (i.e you had lots of PCI devices), then the main PIT timer could fire before the APIC setup code, that interrupt was never delivered and since it never got to the PIT handler it never got cleared and so no new interrupts were ever delivered.

After that change the order is:

      PCI probe
      APIC setup
      timers

so that there's no risk of the timer firing early.

7.2 had 1.7.5-11.el7
7.3 had 1.9.1-5.el7

So I'll mark this as 'fixed in current release' - 7.3 and newer should be fine.

Comment 50 Dr. David Alan Gilbert 2018-04-27 10:54:36 UTC

*** Bug 1571171 has been marked as a duplicate of this bug. ***