1377087 – shutdown rhel 5.11 guest failed and stop at "system halted"

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	qemu-kvm
Sub Component:
Version:	7.3
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Laszlo Ersek
QA Contact:	Yiqian Wei
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1394095
Depends On:
Blocks:	1382443 1392027

Reported:	2016-09-18 10:08 UTC by Yanan Fu
Modified:	2017-08-01 17:46 UTC
CC List:	17 users
Fixed In Version:	qemu-kvm-1.5.3-127.el7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1392027
Environment:
Last Closed:	2017-08-01 17:46:48 UTC
Target Upstream Version:
Embargoed:

Attachments
screendump of the guest when system halted. (19.46 KB, image/png) 2016-09-18 10:08 UTC, Yanan Fu
serial log for guest (25.57 KB, text/plain) 2016-09-18 10:10 UTC, Yanan Fu

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1379288	1	None	None	None	2021-01-20 06:05:38 UTC
Red Hat Product Errata	RHSA-2017:1856	0	normal	SHIPPED_LIVE	Moderate: qemu-kvm security, bug fix, and enhancement update	2017-08-01 18:03:36 UTC

Internal Links: 1379288

Description Yanan Fu 2016-09-18 10:08:50 UTC

Created attachment 1202146
screendump of the guest when system halted.

Description of problem:
Boot a RHEL 5.11 guest, then execute "shutdown -h now" in the guest. The guest fails to shut down and stops at "system halted".

Version-Release number of selected component (if applicable):
qemu: qemu-kvm-1.5.3-125.el7.x86_64
host kernel: kernel-3.10.0-506.el7.x86_64 
guest kernel: kernel-2.6.18-398.el5PAE

How reproducible:
100%

Steps to Reproduce:
1. Boot one RHEL 5.11 guest.
2. Log in to the guest and execute "shutdown -h now".
3. The guest fails to shut down and stops at "system halted".

At this point, QEMU still reports the guest status as "running":
(qemu) info status 
VM status: running

Actual results:
The guest fails to shut down and stops at "system halted".

Expected results:
The guest should shut down successfully.

Additional info:
This is a regression introduced in "qemu-kvm-1.5.3-125.el7.x86_64".
With qemu-kvm-1.5.3-124.el7.x86_64, the issue does not occur.

CLI:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -machine pc  \
    -nodefaults  \
    -vga qxl \
    -device nec-usb-xhci,id=usb1,bus=pci.0,addr=05 \
    -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/usr/share/avocado/data/avocado-vt/images/RHEL-Server-5.11-32-virtio.qcow2 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=06 \
    -device virtio-net-pci,mac=9a:0e:0f:10:11:12,id=idqsknof,vectors=4,netdev=idwergXL,bus=pci.0,addr=07  \
    -netdev tap,id=idwergXL,vhost=on  \
    -m 4096  \
    -smp 8,maxcpus=8,cores=4,threads=1,sockets=2  \
    -cpu 'Opteron_G3' \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1  \
    -vnc :0 \
    -boot menu=on \
    -enable-kvm \
    -monitor stdio

Comment 1 Yanan Fu 2016-09-18 10:10:58 UTC

Created attachment 1202147
serial log for guest

Comment 3 Yanan Fu 2016-09-18 11:37:24 UTC

It seems only RHEL 5.11 guests hit this issue, both 32-bit and 64-bit.

RHEL 6.8 and RHEL 7.3 guests are OK in my tests.

Comment 4 juzhang 2016-09-19 01:28:26 UTC

> Additional info:
> This is a regression bug since "qemu-kvm-1.5.3-125.el7.x86_64". 
> With qemu-kvm-1.5.3-124.el7.x86_64, failed to hit this issue.
> 


It seems the qemu-kvm-1.5.3-125.el7 build fixed only one BZ:


Bug 1285453 - An NBD client can cause QEMU main loop to block when connecting to built-in NBD server

Hi Fam,

Could you have a look?

Best Regards,
Junyi

Comment 7 Fam Zheng 2016-09-22 10:35:36 UTC

Looking at the dmesg, the ACPI errors while booting are what is new in qemu-kvm-1.5.3-125.el7. But as Junyi said, that build only included a highly unrelated change compared to the previous build, qemu-kvm-1.5.3-124.el7.

The ACPI errors are not seen on the previous build. For completeness, here is the full diff between good and bad boots:

# diff qemu-kvm-1.5.3-124.dmesg.log qemu-kvm-1.5.3-125.dmesg.log 
7,8c7,8
<  BIOS-e820: 0000000000100000 - 00000000bfffd000 (usable)
<  BIOS-e820: 00000000bfffd000 - 00000000c0000000 (reserved)
---
>  BIOS-e820: 0000000000100000 - 00000000bfffb000 (usable)
>  BIOS-e820: 00000000bfffb000 - 00000000c0000000 (reserved)
33c33
< Nosave address range: 00000000bfffd000 - 00000000c0000000
---
> Nosave address range: 00000000bfffb000 - 00000000c0000000
41c41
< Built 1 zonelists.  Total pages: 1029031
---
> Built 1 zonelists.  Total pages: 1029029
53c53
< Memory: 3908972k/5242880k available (2630k kernel code, 284868k reserved, 1679k data, 224k init)
---
> Memory: 3908964k/5242880k available (2630k kernel code, 284868k reserved, 1679k data, 224k init)
64a65,88
> ACPI Error (psloop-0196): Found unknown opcode 15 at AML address ffffc20000004068 offset 4, ignoring [20060707]
> ACPI Error (psloop-0196): Found unknown opcode 15 at AML address ffffc2000000406f offset B, ignoring [20060707]
> ACPI Error (psloop-0196): Found unknown opcode 15 at AML address ffffc20000004076 offset 12, ignoring [20060707]
> ACPI Error (psloop-0196): Found unknown opcode 15 at AML address ffffc2000000407d offset 19, ignoring [20060707]
> ACPI Error (psloop-0196): Found unknown opcode 3 at AML address ffffc20000004082 offset 1E, ignoring [20060707]
> ACPI Error (psloop-0196): Found unknown opcode 15 at AML address ffffc20000004084 offset 20, ignoring [20060707]
> ACPI Error (psloop-0196): Found unknown opcode 3 at AML address ffffc20000004089 offset 25, ignoring [20060707]
> ACPI Error (psloop-0196): Found unknown opcode 15 at AML address ffffc2000000408b offset 27, ignoring [20060707]
> ACPI Error (psloop-0196): Found unknown opcode 3 at AML address ffffc20000004090 offset 2C, ignoring [20060707]
> ACPI Error (psloop-0196): Found unknown opcode 15 at AML address ffffc20000004092 offset 2E, ignoring [20060707]
> ACPI Error (psloop-0196): Found unknown opcode 15 at AML address ffffc20000004068 offset 4, ignoring [20060707]
> ACPI Error (psloop-0196): Found unknown opcode 15 at AML address ffffc2000000406f offset B, ignoring [20060707]
> ACPI Error (psloop-0196): Found unknown opcode 15 at AML address ffffc20000004076 offset 12, ignoring [20060707]
> ACPI Error (psloop-0196): Found unknown opcode 15 at AML address ffffc2000000407d offset 19, ignoring [20060707]
> ACPI Error (psloop-0196): Found unknown opcode 3 at AML address ffffc20000004082 offset 1E, ignoring [20060707]
> ACPI Error (psloop-0196): Found unknown opcode 15 at AML address ffffc20000004084 offset 20, ignoring [20060707]
> ACPI Error (psloop-0196): Found unknown opcode 3 at AML address ffffc20000004089 offset 25, ignoring [20060707]
> ACPI Error (psloop-0196): Found unknown opcode 15 at AML address ffffc2000000408b offset 27, ignoring [20060707]
> ACPI Error (psloop-0196): Found unknown opcode 3 at AML address ffffc20000004090 offset 2C, ignoring [20060707]
> ACPI Error (psloop-0196): Found unknown opcode 15 at AML address ffffc20000004092 offset 2E, ignoring [20060707]
> ACPI Error (dsobject-0134): [ON] Namespace lookup failure, AE_NOT_FOUND
> ACPI Exception (tbxface-0113): AE_NOT_FOUND, Could not load namespace [20060707]
> ACPI Exception (tbxface-0120): AE_NOT_FOUND, Could not load tables [20060707]
> ACPI: Unable to load the System Description Tables
73d96
< ACPI: bus type pci registered
75,85c98
< ACPI: Interpreter enabled
< ACPI: Using IOAPIC for interrupt routing
< ACPI: No dock devices found.
< ACPI: PCI Root Bridge [PCI0] (0000:00)
< PCI quirk: region 0600-063f claimed by PIIX4 ACPI
< PCI quirk: region 0700-070f claimed by PIIX4 SMB
< ACPI: PCI Interrupt Link [LNKA] (IRQs 5 *10 11)
< ACPI: PCI Interrupt Link [LNKB] (IRQs 5 *10 11)
< ACPI: PCI Interrupt Link [LNKC] (IRQs 5 10 *11)
< ACPI: PCI Interrupt Link [LNKD] (IRQs 5 10 *11)
< ACPI: PCI Interrupt Link [LNKS] (IRQs *9)
---
> ACPI: Interpreter disabled.
87,88c100
< pnp: PnP ACPI init
< pnp: PnP ACPI: found 6 devices
---
> pnp: PnP ACPI: disabled
91,92c103,106
< PCI: Using ACPI for IRQ routing
< PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
---
> PCI: Probing PCI hardware
> PCI quirk: region 0600-063f claimed by PIIX4 ACPI
> PCI quirk: region 0700-070f claimed by PIIX4 SMB
> pci 0000:00:01.0: PIIX/ICH IRQ router [8086/7000]
106c120
< type=2000 audit(1474539822.549:1): initialized
---
> type=2000 audit(1474540036.509:1): initialized
124d137
< ACPI: Invalid PBLK length [0]
130d142
< 00:05: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
143c155
< PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 0x60,0x64 irq 1,12
---
> PNP: No PS/2 controller found. Probing ports directly.
153,154d164
< input: AT Translated Set 2 keyboard as /class/input/input0
< ACPI: (supports S5)
163a174
> input: AT Translated Set 2 keyboard as /class/input/input0
176,177d186
< ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 10
< ACPI: PCI Interrupt 0000:00:06.0[A] -> Link [LNKB] -> GSI 10 (level, high) -> IRQ 10

Comment 9 Fam Zheng 2016-09-22 10:42:03 UTC

BTW, the failure to initialize ACPI is the reason why the guest cannot shut down properly and instead falls back to the "halt" state seen in the report.
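
As a triage aid, the new "unknown opcode" errors in the -125 dmesg can be tallied per opcode. The Python sketch below is illustrative only: the sample lines are copied from the diff in comment 7, the regular expression assumes the message format shown there, and the helper name tally_unknown_opcodes is hypothetical. The opcode values are treated as hexadecimal, as the kernel prints them.

```python
import re
from collections import Counter

# Assumed message format, taken from the RHEL-5 dmesg lines quoted in comment 7.
UNKNOWN_OPCODE = re.compile(
    r"ACPI Error \(psloop-\d+\): Found unknown opcode ([0-9A-Fa-f]+) "
    r"at AML address [0-9a-f]+ offset ([0-9A-Fa-f]+)"
)

def tally_unknown_opcodes(dmesg_text):
    """Count 'unknown opcode' errors per opcode value (parsed as hex)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = UNKNOWN_OPCODE.search(line)
        if match:
            counts[int(match.group(1), 16)] += 1
    return counts

# Three error lines and one unrelated line, copied from the diff above.
SAMPLE = """\
ACPI Error (psloop-0196): Found unknown opcode 15 at AML address ffffc20000004068 offset 4, ignoring [20060707]
ACPI Error (psloop-0196): Found unknown opcode 15 at AML address ffffc2000000406f offset B, ignoring [20060707]
ACPI Error (psloop-0196): Found unknown opcode 3 at AML address ffffc20000004082 offset 1E, ignoring [20060707]
ACPI: Unable to load the System Description Tables
"""

if __name__ == "__main__":
    for opcode, count in sorted(tally_unknown_opcodes(SAMPLE).items()):
        print(f"opcode 0x{opcode:02x}: {count} occurrence(s)")
```

Running the same function over a full dmesg capture would show which opcodes the guest's AML interpreter rejects before it gives up loading the tables.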

Comment 11 Laszlo Ersek 2016-09-23 04:26:16 UTC

(Adding Marcel.)

This is a very interesting bug, and I think the regression is actually caused by a SeaBIOS change, not a qemu-kvm change.

Like everyone else, I investigated the difference between "qemu-kvm-1.5.3-124.el7" and "qemu-kvm-1.5.3-125.el7", and that

  qemu_set_nonblock(client->sock);

call in nbd_co_client_start() is really completely irrelevant.

However, look at the timestamps! This bug was reported on 2016-Sep-18, and for our downstream SeaBIOS package, the only change in a very long time, since 2016-May-11 specifically, has been this one:

* Thu Sep 15 2016 Miroslav Rezanina <mrezanin> - 1.9.1-5.el7
- seabios-pci-don-t-map-virtio-1.0-storage-devices-above-4G.patch [bz#1373154]
- Resolves: bz#1373154
  (Guest fails boot up with ivshmem-plain and virtio-pci device)

That is, the first new SeaBIOS build became available, since May, just three days before this regression was reported. Thus I'm inclined to think that the qemu-kvm update to 1.5.3-125.el7 on QE's side *coincided* with the SeaBIOS update to 1.9.1-5.el7, and then the regression got mis-attributed to qemu-kvm-1.5.3-125.el7.

Now, the only difference between seabios-1.9.1-4.el7 and seabios-1.9.1-5.el7 is:

commit 01549028733315a513b1b5fcc1951fd271e8a531
Author: Marcel Apfelbaum <marcel>
Date:   Tue Sep 13 13:20:45 2016 +0200

    pci: don't map virtio 1.0 storage devices above 4G
    
    RH-Author: Marcel Apfelbaum <marcel>
    Message-id: <1473772845-913-1-git-send-email-marcel>
    Patchwork-id: 72292
    O-Subject: [RHEL-7.3 seabios PATCH V2] pci: don't map virtio 1.0 storage devices above 4G
    Bugzilla: 1373154
    RH-Acked-by: Maxime Coquelin <maxime.coquelin>
    RH-Acked-by: Gerd Hoffmann <kraxel>
    RH-Acked-by: Laszlo Ersek <lersek>
    RH-Acked-by: Michael S. Tsirkin <mst>
    
    v1->v2:
      - add the note to the commit message (Gerd)
    
    BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1373154
    Brew: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11741451
    Upstream: Fixed upstream by commit: 0e21548b15 (virtio: pci cfg access)
    Tests: Checked the virtio BARs are placed in the 32-bit range and
           the guest boots successfully.
    
    Otherwise SeaBIOS can't access virtio's modern BAR.
    
    Note: It works in the master branch but can't be merged easily
    into 1.9 branch, so use this as an interim solution
    until we'll rebase to 1.10.
    
    Signed-off-by: Marcel Apfelbaum <marcel>
    Signed-off-by: Miroslav Rezanina <mrezanin>

What likely happens here is that SeaBIOS now allocates the MMIO BAR of the sole virtio-blk device -- see comment #0 -- under 4GB. And then you end up with two more reserved pages -- and consequently two fewer free pages -- under 4GB:

(In reply to Fam Zheng from comment #7)
> # diff qemu-kvm-1.5.3-124.dmesg.log qemu-kvm-1.5.3-125.dmesg.log 
> 7,8c7,8
> <  BIOS-e820: 0000000000100000 - 00000000bfffd000 (usable)
> <  BIOS-e820: 00000000bfffd000 - 00000000c0000000 (reserved)
> ---
> >  BIOS-e820: 0000000000100000 - 00000000bfffb000 (usable)
> >  BIOS-e820: 00000000bfffb000 - 00000000c0000000 (reserved)
> [...]
> 41c41
> < Built 1 zonelists.  Total pages: 1029031
> ---
> > Built 1 zonelists.  Total pages: 1029029

In turn, this change in the 32-bit memory map probably displaces the ACPI payload generated by QEMU and installed by SeaBIOS from its original location to a different address. That should be no problem, of course, as long as the memory is not corrupted in some way.
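
The two-page difference in the memory map quoted above can be checked with simple arithmetic. This is only a sketch: it assumes 4 KiB pages, takes the addresses and counters from the dmesg diff in comment 7, and uses hypothetical names such as pages_between.

```python
PAGE_SIZE = 4096  # assumed x86 base page size, in bytes

def pages_between(start, end):
    """Number of whole 4 KiB pages in the half-open range [start, end)."""
    return (end - start) // PAGE_SIZE

TOP = 0xC0000000                  # end of the reserved e820 range in both boots
RESERVED_START_124 = 0xBFFFD000   # boot with qemu-kvm-1.5.3-124
RESERVED_START_125 = 0xBFFFB000   # boot with qemu-kvm-1.5.3-125

# Growth of the reserved range just below 3 GiB, in pages.
extra_reserved = (pages_between(RESERVED_START_125, TOP)
                  - pages_between(RESERVED_START_124, TOP))

# Cross-checks against the other counters in the same dmesg diff.
fewer_total_pages = 1029031 - 1029029            # "Total pages" lines
fewer_available_kib = 3908972 - 3908964          # "Memory: ...k available" lines

if __name__ == "__main__":
    print(extra_reserved, fewer_total_pages,
          fewer_available_kib * 1024 // PAGE_SIZE)
```

All three values come out as 2, so the e820 boundary, the zonelist page count, and the available-memory figure agree with each other.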

Now, given that the rhel6.8 and rhel7.3 guests seem fine with the change (according to comment 3), I can't promise that the SeaBIOS patch actually regressed SeaBIOS -- it might just tickle something brittle in the RHEL-5.11 guest kernel's ACPI interpreter.

So:

(1) Please try to reproduce the issue with:
    - the same RHEL-5.11 guest,
    - qemu-kvm-1.5.3-125.el7,
    - seabios-1.9.1-4.el7 (i.e., SeaBIOS should be downgraded).

This should confirm whether the qemu-kvm or the seabios update triggers the issue.

(2) Please re-verify:
    - the same RHEL-6.8 guest and the same RHEL-7.3 guest
      (using an otherwise identical QEMU command line to the RHEL-5.11 case),
    - with qemu-kvm-1.5.3-125.el7,
    - and seabios-1.9.1-5.el7.

If these more modern guests are okay with the SeaBIOS change, then we might have to patch the RHEL-5 guest kernel. (Theoretically it's possible that the SeaBIOS patch causes genuine corruption in the ACPI tables, but that would be visible to the RHEL-6.8 and RHEL-7.3 guests as well.)

Thanks!

Comment 12 Fam Zheng 2016-09-23 05:19:06 UTC

When I experimented with "qemu-kvm-1.5.3-124.el7" and "qemu-kvm-1.5.3-125.el7" on my machine yesterday, the seabios package was left unchanged the whole time:

    # rpm -q seabios-bin
    seabios-bin-1.9.1-5.el7.noarch

Maybe I'm making a stupid mistake, but my testing just now shows that downgrading seabios to "seabios-1.9.1-4.el7" doesn't help at all. The dmesg is exactly the same before and after the downgrade, and the guest system still "halts".

Laszlo, let me know if I can do any other quick tests.

Comment 13 Laszlo Ersek 2016-09-23 05:25:11 UTC

Thanks Fam -- can you please confirm that you downgraded the "seabios-bin" package as well? (That's the one that supplies the firmware binary actually.)

Comment 14 Yanan Fu 2016-09-23 06:09:46 UTC

The versions I tested when I hit this bug were:
qemu: qemu-kvm-1.5.3-125.el7.x86_64
seavgabios-bin: seavgabios-bin-1.9.1-4.el7.noarch
seabios-bin: seabios-bin-1.9.1-4.el7.noarch

After downgrading only qemu to "qemu-kvm-1.5.3-124.el7.x86_64", the guest shuts down correctly.

Next, I tested with the latest seabios-bin version:
  qemu: qemu-kvm-1.5.3-126.el7.x86_64
  seavgabios-bin: seavgabios-bin-1.9.1-5.el7.noarch
  seabios-bin: seabios-bin-1.9.1-5.el7.noarch
The bug still reproduces.

After downgrading qemu to "qemu-kvm-1.5.3-124.el7.x86_64", the guest shuts down correctly.



     qemu-kvm       |   seabios-bin/seavgabios-bin  | result  |
--------------------|-------------------------------|---------|
qemu-kvm-1.5.3-124  |             1.9.1-5           |   OK    |
                    |             1.9.1-4           |   OK    |
--------------------|-------------------------------|---------|
qemu-kvm-1.5.3-125  |             1.9.1-5           |   NG    |
                    |             1.9.1-4           |   NG    |

Comment 15 Fam Zheng 2016-09-23 06:38:32 UTC

(In reply to Laszlo Ersek from comment #13)
> Thanks Fam -- can you please confirm that you downgraded the "seabios-bin"
> package as well? (That's the one that supplies the firmware binary actually.)

Yes, that is what I did.

Comment 16 Laszlo Ersek 2016-09-23 09:23:18 UTC

Thanks guys, your data proves that my hypothesis about SeaBIOS was incorrect. I'll try to reproduce the issue locally and see if I can get more insight.

Comment 17 Laszlo Ersek 2016-09-23 12:59:01 UTC

My test environment consists of SeaBIOS 1.9.1-4 (invariably), RHEL-5.11 GA
(invariably), and qemu-kvm-1.5.3-124 vs. qemu-kvm-1.5.3-125.

I launched the RHEL-5.11 guest under -124, and dumped the guest memory with

  virsh dump seabios.rhel5 seabios.rhel5.124.core \
      --memory-only --format kdump-snappy

Then I shut off the guest, upgraded qemu-kvm to -125, booted it, and
repeated the same:

  virsh dump seabios.rhel5 seabios.rhel5.125.core \
      --memory-only --format kdump-snappy

I shut off the guest (well, forced it off ultimately).

I also captured the dmesg in the guest (after booting with the
"ignore_loglevel" kernel param), for both qemu-kvm versions. As it's already
known from Fam's investigations, we have inexplicable differences in the
ACPI table addresses like:

> --- with-124/dmesg      2016-09-23 13:43:13.610686990 +0200
> +++ with-125/dmesg      2016-09-23 13:43:25.578544705 +0200
> @@ -13,10 +13,10 @@
>  DMI: Red Hat KVM, BIOS 0.5.1 01/01/2011
>  kvm-clock: cpu 0, msr 7eff:804ab401, boot clock
>  ACPI: RSDP (v000 BOCHS                                 ) @ 0x00000000000f7350
> -ACPI: RSDT (v001 BOCHS  BXPCRSDT 0x00000001 BXPC 0x00000001) @ 0x00000000bffffaba
> -ACPI: FADT (v001 BOCHS  BXPCFACP 0x00000001 BXPC 0x00000001) @ 0x00000000bfffeeb7
> -ACPI: SSDT (v001 BOCHS  BXPCSSDT 0x00000001 BXPC 0x00000001) @ 0x00000000bfffef2b
> -ACPI: MADT (v001 BOCHS  BXPCAPIC 0x00000001 BXPC 0x00000001) @ 0x00000000bffffa0a
> +ACPI: RSDT (v001 BOCHS  BXPCRSDT 0x00000001 BXPC 0x00000001) @ 0x00000000bffffb30
> +ACPI: FADT (v001 BOCHS  BXPCFACP 0x00000001 BXPC 0x00000001) @ 0x00000000bfffef0b
> +ACPI: SSDT (v001 BOCHS  BXPCSSDT 0x00000001 BXPC 0x00000001) @ 0x00000000bfffef7f
> +ACPI: MADT (v001 BOCHS  BXPCAPIC 0x00000001 BXPC 0x00000001) @ 0x00000000bffffa80
>  ACPI: DSDT (v001 BOCHS  BXPCDSDT 0x00000001 BXPC 0x00000001) @ 0x(null)
>  No NUMA configuration found
>  Faking a node at 0000000000000000-0000000140000000

This difference is already unfathomable, but I wanted to see the contents of
those tables; most importantly, the SSDT, because that's what contains the
_S5 package description for powering off the machine.
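
To see how the tables moved, the addresses from the two dmesg captures can be subtracted per table. The following is a minimal sketch; the addresses are copied from the diff above, while the dictionary and function names are hypothetical.

```python
# Physical table addresses from the dmesg diff above (-124 boot vs. -125 boot).
TABLES_124 = {"RSDT": 0xBFFFFABA, "FADT": 0xBFFFEEB7,
              "SSDT": 0xBFFFEF2B, "MADT": 0xBFFFFA0A}
TABLES_125 = {"RSDT": 0xBFFFFB30, "FADT": 0xBFFFEF0B,
              "SSDT": 0xBFFFEF7F, "MADT": 0xBFFFFA80}

def address_shifts(before, after):
    """Return {table: after - before} in bytes for tables present in both."""
    return {name: after[name] - before[name] for name in before if name in after}

if __name__ == "__main__":
    shifts = address_shifts(TABLES_124, TABLES_125)
    # Print in ascending address order of the -124 boot.
    for name in sorted(shifts, key=TABLES_124.get):
        print(f"{name}: +{shifts[name]} bytes (0x{shifts[name]:x})")
```

The shift is not uniform: FADT and SSDT move up by 84 bytes, whereas MADT and RSDT, which sit above the SSDT, move up by 118 bytes. That extra 34 bytes is consistent with the SSDT itself having grown, rather than with the whole payload merely being relocated.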

So, I installed the "crash" utility on my RHEL-7 laptop, plus the following
two debuginfo RPMs, matching the RHEL-5.11 GA kernel that ran in the guest:

  kernel-debuginfo-2.6.18-398.el5.x86_64
  kernel-debuginfo-common-2.6.18-398.el5.x86_64

(Let me repeat -- you can install any kernel debuginfo package on your
laptop or workstation, it doesn't have to match your running kernel --
instead it has to match the dumped vmcore that you want to analyze with
"crash".)

So, here's what "crash" has to say about the contents of the SSDT, when the
guest is booted with -124, using the physical start address from the dmesg:

> crash> rd -p -8 0x00000000bfffef2b 100
>         bfffef2b:  53 53 44 54 df 0a 00 00 01 0d 42 4f 43 48 53 20   SSDT......BOCHS
>         bfffef3b:  42 58 50 43 53 53 44 54 01 00 00 00 42 58 50 43   BXPCSSDT....BXPC
>         bfffef4b:  01 00 00 00 10 42 05 5c 00 08 50 30 53 5f 0c 00   .....B.\..P0S_..
>         bfffef5b:  00 00 c0 08 50 30 45 5f 0c ff ff bf fe 08 50 31   ....P0E_......P1
>         bfffef6b:  56 5f 0a 00 08 50 31 53 5f 11 0b 0a 08 00 00 00   V_...P1S_.......
>         bfffef7b:  00 00 00 00 00 08 50 31 45 5f 11 0b 0a 08 00 00   ......P1E_......
>         bfffef8b:  00 00 00 00                                       ....

Okay. Let's see the same for -125 (using the right address again from the
-125 dmesg):

> crash> rd -p -8 0x00000000bfffef7f 100
>         bfffef7f:  53 53 44 54 01 0b 00 00 01 e1 42 4f 43 48 53 20   SSDT......BOCHS
>         bfffef8f:  42 58 50 43 53 53 44 54 01 00 00 00 42 58 50 43   BXPCSSDT....BXPC
>         bfffef9f:  01 00 00 00 a0 21 00 15 5c 2e 5f 53 42 5f 50 43   .....!..\._SB_PC
>         bfffefaf:  49 30 06 00 15 5c 2f 03 5f 53 42 5f 50 43 49 30   I0...\/._SB_PCI0
>         bfffefbf:  49 53 41 5f 06 00 10 42 05 5c 00 08 50 30 53 5f   ISA_...B.\..P0S_
>         bfffefcf:  0c 00 00 00 c0 08 50 30 45 5f 0c ff ff bf fe 08   ......P0E_......
>         bfffefdf:  50 31 56 5f                                       P1V_

(Side note: I had to use the "crash" utility and memory dumps for this
because RHEL-5 doesn't ship "acpidump". No RHEL-5 package provides it, and
when I built it from source, in the guest, it failed to dump anything at
all.)

What the heck??? The addresses of the tables differ because their sizes and
their contents differ too! These differences are obviously impossible to
correlate with the fix for bug 1285453, however.
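
The size difference can be read straight from the table headers in the two dumps. The sketch below decodes the first 36 bytes of each SSDT using the standard ACPI table header layout; the bytes are copied from the "crash" output above, and names such as parse_header are hypothetical.

```python
import struct

# First 36 bytes of each SSDT, copied from the two "crash" dumps above.
SSDT_124 = bytes.fromhex(
    "53534454df0a0000010d424f43485320"
    "42585043535344540100000042585043"
    "01000000"
)
SSDT_125 = bytes.fromhex(
    "53534454010b000001e1424f43485320"
    "42585043535344540100000042585043"
    "01000000"
)

# Common ACPI table header: signature, length, revision, checksum, OEM ID,
# OEM table ID, OEM revision, creator ID, creator revision (little-endian).
HEADER = struct.Struct("<4sIBB6s8sI4sI")

def parse_header(raw):
    """Decode the 36-byte ACPI table header at the start of raw."""
    (sig, length, revision, checksum, oem_id,
     oem_table_id, oem_rev, creator_id, creator_rev) = HEADER.unpack(raw[:HEADER.size])
    return {
        "signature": sig.decode("ascii"),
        "length": length,
        "revision": revision,
        "checksum": checksum,
        "oem_id": oem_id.decode("ascii").rstrip(),
        "oem_table_id": oem_table_id.decode("ascii").rstrip(),
        "oem_revision": oem_rev,
        "creator_id": creator_id.decode("ascii"),
        "creator_revision": creator_rev,
    }

if __name__ == "__main__":
    old, new = parse_header(SSDT_124), parse_header(SSDT_125)
    print(old["signature"], old["length"], new["length"],
          new["length"] - old["length"])
```

With these inputs the length field decodes to 2783 bytes for -124 and 2817 bytes for -125, a growth of 34 bytes.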

So I opened the build pages in Brew, for both -124 [1] and -125 [2] -- see
the URLs in the next, private, comment --, downloaded the build log for each
[3] [4], and compared the "iasl" build messages.

(In the qemu-kvm version that we ship in base RHEL (forked from upstream
1.5.3), we still build the ACPI payload from DSL template files. The _S5
package, which controls ACPI power-off, is in "hw/i386/ssdt-misc.dsl".)

In the -124 build, iasl emitted the following messages:

> iasl -Pn -vs -l -tc -p ssdt-misc ssdt-misc.dsl.i   2>&1
> ASL Input:     ssdt-misc.dsl.i - 102 lines, 2567 bytes, 35 keywords
> AML Output:    ssdt-misc.aml - 354 bytes, 24 named objects, 11 executable opcodes
> Listing File:  ssdt-misc.lst - 7590 bytes
> Hex Dump:      ssdt-misc.hex - 3686 bytes
> Compilation complete. 0 Errors, 0 Warnings, 0 Remarks, 2 Optimizations

Whereas in the -125 build, iasl emitted:

> iasl -Pn -vs -l -tc -p ssdt-misc ssdt-misc.dsl.i   2>&1
> ASL Input:     ssdt-misc.dsl.i - 102 lines, 2567 bytes, 35 keywords
> AML Output:    ssdt-misc.aml - 388 bytes, 24 named objects, 11 executable opcodes
> Listing File:  ssdt-misc.lst - 7874 bytes
> Hex Dump:      ssdt-misc.hex - 3986 bytes
> Compilation complete. 0 Errors, 0 Warnings, 0 Remarks, 2 Optimizations

Note that the "AML Output" lines differ.

Given that the patch for bug 1285453 doesn't touch "ssdt-misc.dsl", this
difference can only be explained by a change in *iasl itself*. So, after the
build logs, I also downloaded the "root logs" (= the buildroot setup logs)
for both the -124 and -125 builds [5] [6], and compared them. Here we go:
for -124, we got

> DEBUG util.py:257:   --> acpica-tools-20150619-3.el7.x86_64

while for -125, we got

> DEBUG util.py:257:   --> acpica-tools-20160527-1.el7.x86_64

That is, the "rhel-7.3-candidate" Brew build root saw an upgrade for
"acpica-tools", from 20150619-3.el7 to 20160527-1.el7, unbeknownst to us.
This caused "iasl" (which is part of acpica-tools) to compile
"ssdt-misc.dsl" into a different AML byte-stream. The new AML can be
digested by the AML interpreters in the RHEL-6 and RHEL-7 guest kernels; the
RHEL-5 guest kernel chokes on the new AML however.
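
The root log comparison described above can be automated. The sketch below assumes that installed packages appear in the root log as "--> name-version-release.arch", as in the two quoted lines; the sample logs contain only those two lines, and all function names are hypothetical.

```python
import re

# Assumed root log line format, based on the two excerpts quoted above.
INSTALLED = re.compile(r"-->\s+(\S+)\s*$")

def split_nvra(nvra):
    """Split 'name-version-release.arch' into (name, 'version-release')."""
    stem, _, _arch = nvra.rpartition(".")
    name, version, release = stem.rsplit("-", 2)
    return name, f"{version}-{release}"

def buildroot_packages(log_text):
    """Map package name -> version-release for every '-->' line in the log."""
    packages = {}
    for line in log_text.splitlines():
        match = INSTALLED.search(line)
        if not match:
            continue
        try:
            name, version_release = split_nvra(match.group(1))
        except ValueError:
            continue  # not in name-version-release.arch form
        packages[name] = version_release
    return packages

def changed_packages(old_log, new_log):
    """Return {name: (old, new)} for packages whose version-release differs."""
    old, new = buildroot_packages(old_log), buildroot_packages(new_log)
    return {name: (old[name], new[name])
            for name in old.keys() & new.keys() if old[name] != new[name]}

ROOT_LOG_124 = "DEBUG util.py:257:   --> acpica-tools-20150619-3.el7.x86_64\n"
ROOT_LOG_125 = "DEBUG util.py:257:   --> acpica-tools-20160527-1.el7.x86_64\n"

if __name__ == "__main__":
    for name, (old, new) in changed_packages(ROOT_LOG_124, ROOT_LOG_125).items():
        print(f"{name}: {old} -> {new}")
```

Comparing build roots this way between a known-good and a known-bad build is a quick way to spot toolchain changes that the package's own changelog cannot explain.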

This BZ is definitely a blocker.

There are three approaches to fix the bug.

First, we could try to convince the new iasl, with various command line
options, to emit AML that the RHEL-5 guest kernel can digest.

Second, we could modify the spec file for qemu-kvm so that it BuildRequires
the known-good, exact version of iasl. That is,
"acpica-tools-20150619-3.el7.x86_64".

Third, the qemu-kvm build system supports the inclusion of pre-generated AML
(which is actually checked into the git tree), should the "iasl" utility be
unavailable on the build host. In RHEL-7 downstream we don't use this
fallback, but we could -- we could remove the "BuildRequires: iasl" RPM
macro completely, and make sure that the pre-generated AML is the right one.

My preference is option #2. Option #1 is a moving target; every new iasl
version could mess up stuff for us in a different way. And option #3 is not
too safe either; even if we don't require "iasl", the build root could
include it at some point independently, and then the safe fallback wouldn't
be used at build.

There's option #3/b as well: we could modify the qemu-kvm build system to
*only* consider the pre-generated AML, and never use "iasl", even when it's
available. I think #3/b would also be viable, but it's more intrusive than
option #2, so I prefer to try option #2 first.

Comment 19 Laszlo Ersek 2016-09-23 13:02:33 UTC

Note that the same bug shouldn't affect qemu-kvm-rhev: in qemu-kvm-rhev, we have no template DSL files, and "iasl" does not partake in the build process. The complete ACPI payload is generated by qemu-kvm-rhev at runtime, implemented in C.

Comment 20 Laszlo Ersek 2016-09-23 13:22:34 UTC

Option #2 is a no-go. I tried to build qemu-kvm with the following patch in
place:

> diff --git a/redhat/qemu-kvm.spec.template b/redhat/qemu-kvm.spec.template
> index c82642de3614..b478eaa54544 100644
> --- a/redhat/qemu-kvm.spec.template
> +++ b/redhat/qemu-kvm.spec.template
> @@ -228,7 +228,7 @@ BuildRequires: librdmacm-devel
>  # iasl and cpp for acpi generation (not a hard requirement as we can use
>  # pre-compiled files, but it's better to use this)
>  %ifarch %{ix86} x86_64
> -BuildRequires: iasl
> +BuildRequires: acpica-tools = 20150619-3.el7
>  BuildRequires: cpp
>  %endif
>  %if 0%{!?build_only_sub:1}

But then Brew said,

> Error: No Package found for acpica-tools = 20150619-3.el7

So, the next choice is option #3/b.

Comment 22 Michael S. Tsirkin 2016-09-23 15:54:21 UTC

Do we think it's a bug in iasl? That means upstream qemu builds on rhel will produce a broken binary (not latest qemu, that does not use iasl anymore).
Why do we want to work around this rather than fix iasl?

Comment 25 Laszlo Ersek 2016-09-23 18:30:05 UTC

(In reply to Michael S. Tsirkin from comment #22)
> Do we think it's a bug in iasl? That means upstream qemu builds on rhel will
> produce a broken binary (not latest qemu, that does not use iasl anymore).
> why do we want to work around and not fix iasl?

It's not a bug in iasl; the AML emitted by the new iasl is consumed by RHEL-6.8 and RHEL-7.3 guests just fine (see comment 3).

Instead, it's an ACPI compat bug in the AML interpreter of RHEL-5.11. And, even if we fixed that bug in RHEL-5.11.z, the 5.11 GA installer ISO would no longer work; a new installer ISO (= 5.12) would be necessary, which I don't think will happen (certainly not just for qemu's / iasl's sake).

It's practically the same thing as with old Windows guests: in upstream QEMU we've been careful lately not to generate otherwise valid AML that is known to break old Windows guests. The RHEL-5.11 guest is now in the same category.

Comment 26 Michael S. Tsirkin 2016-09-23 18:34:01 UTC

Do you know what the change is?
It does not prove a lot that some OSPMs can
consume it; iasl should generate code that
is compatible with the claimed version of
the spec, which is ACPI 1 in our case.

Comment 27 Laszlo Ersek 2016-09-23 19:41:29 UTC

More info: the specific opcode that trips up the RHEL-5 guest is 0x15 ("ExternalOp"). This opcode was added in ACPI 6.0, and its sole purpose is to help the disassembler determine the prototype of external methods. At execution time the opcode should be ignored, because it is embedded in an If(0) {} block. The following is an excerpt from the upstream ACPI CA git tree, file documents/changes.txt, at current master (git commit 0c1666287140, 2016-Sep-23):

> ----------------------------------------
> 12 February 2016. Summary of changes for version 20160212:
> 
> [...]
> 
> 2) iASL Compiler/Disassembler and Tools:
> 
> Completed full support for the ACPI 6.0 External() AML opcode. The
> compiler emits an external AML opcode for each ASL External statement.
> This opcode is used by the disassembler to assist with the disassembly of
> external control methods by specifying the required number of arguments
> for the method. AML interpreters do not use this opcode. To ensure that
> interpreters do not even see the opcode, a block of one or more external
> opcodes is surrounded by an "If(0)" construct. As this feature becomes
> commonly deployed in BIOS code, the ability of disassemblers to correctly
> disassemble AML code will be greatly improved. David Box.

The If(0) trick works with the RHEL-6.8 and RHEL-7.3 guests, but it does not prevent the RHEL-5.11 guest from seeing the 0x15 (ExternalOp) opcode, and unfortunately RHEL-5.11 chokes on it.

The iasl utility doesn't seem to support a command line option that turns off this feature.
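
The If(0) wrapper is visible in the -125 SSDT dump from comment 17, right after the 36-byte table header. The sketch below walks that block and lists the External names. It assumes the AML encodings as I understand them from the ACPI specification (IfOp 0xA0, PkgLength, ZeroOp 0x00 as predicate, then ExternalOp 0x15 followed by NameString, ObjectType and ArgumentCount); the byte strings are copied from the dumps, and all function names are hypothetical.

```python
# AML bytes following the 36-byte header, copied from the dumps in comment 17.
AML_125 = bytes.fromhex(
    "a0 21 00"
    "15 5c 2e 5f 53 42 5f 50 43 49 30 06 00"
    "15 5c 2f 03 5f 53 42 5f 50 43 49 30 49 53 41 5f 06 00"
    "10 42 05"
)
AML_124 = bytes.fromhex("10 42 05 5c 00 08 50 30 53 5f")

IF_OP, ZERO_OP, EXTERNAL_OP = 0xA0, 0x00, 0x15

def decode_pkg_length(buf, pos):
    """Decode an AML PkgLength at buf[pos]; return (length, bytes_consumed)."""
    lead = buf[pos]
    extra = lead >> 6                    # number of follow-up bytes (0..3)
    if extra == 0:
        return lead & 0x3F, 1
    length = lead & 0x0F
    for i in range(extra):
        length |= buf[pos + 1 + i] << (4 + 8 * i)
    return length, 1 + extra

def parse_name_string(buf, pos):
    """Decode an AML NameString; return (text, position after it)."""
    prefix = ""
    while buf[pos] in (0x5C, 0x5E):      # root '\' or parent '^' prefix
        prefix += chr(buf[pos])
        pos += 1
    if buf[pos] == 0x00:                 # NullName
        return prefix, pos + 1
    if buf[pos] == 0x2E:                 # DualNamePrefix: two segments
        count, pos = 2, pos + 1
    elif buf[pos] == 0x2F:               # MultiNamePrefix: explicit count
        count, pos = buf[pos + 1], pos + 2
    else:
        count = 1
    segs = [buf[pos + 4 * i:pos + 4 * i + 4].decode("ascii") for i in range(count)]
    return prefix + ".".join(segs), pos + 4 * count

def externals_in_if0(aml):
    """If aml starts with If(0) { External... }, return (block_size, names)."""
    if aml[0] != IF_OP:
        return 0, []
    pkg_len, used = decode_pkg_length(aml, 1)
    end = 1 + pkg_len                    # PkgLength counts its own bytes
    pos = 1 + used
    if aml[pos] != ZERO_OP:
        return 0, []
    pos += 1
    names = []
    while pos < end and aml[pos] == EXTERNAL_OP:
        name, pos = parse_name_string(aml, pos + 1)
        pos += 2                         # skip ObjectType and ArgumentCount
        names.append(name)
    return end, names

if __name__ == "__main__":
    print(externals_in_if0(AML_125))
    print(externals_in_if0(AML_124))
```

For the -125 bytes this finds a 34-byte If(0) block declaring \_SB_.PCI0 and \_SB_.PCI0.ISA_, and for the -124 bytes it finds nothing. The 34 bytes match the difference between the 388-byte and 354-byte "AML Output" sizes that iasl reported for ssdt-misc in the two builds.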

Comment 28 Laszlo Ersek 2016-09-23 19:49:07 UTC

I'm also in the process of checking whether current upstream iasl behaves any differently. Namely, I built the iasl binary at upstream commit 0c1666287140, and embedded it in the SRPM by adding it to EXTRA_SOURCES in "redhat/Makefile.common", adding it to the spec file as Source21, and passing it to configure with --iasl=%{SOURCE21}. It's currently brewing. Once done, I'll check the build log (to make sure it was indeed used to build the tables) and then I'll repeat the RHEL-5.11 test.

Comment 30 Laszlo Ersek 2016-09-23 20:02:04 UTC

From the build log (note the pathname of the iasl binary):

> /builddir/build/SOURCES/iasl-0c1666287140 -Pn -vs -l -tc -p ssdt-misc ssdt-misc.dsl.i   2>&1
> ASL Input:     ssdt-misc.dsl.i - 102 lines, 2567 bytes, 35 keywords
> AML Output:    ssdt-misc.aml - 388 bytes, 24 named objects, 11 executable opcodes
> Listing File:  ssdt-misc.lst - 7874 bytes
> Hex Dump:      ssdt-misc.hex - 3986 bytes
> Compilation complete. 0 Errors, 0 Warnings, 0 Remarks, 2 Optimizations

The "AML Output" line matches that under "-125 build" in comment 17, that
is, the flawed build.

After launching the RHEL-5.11 guest with the qemu-kvm binary built like
this, I get the same "Found unknown opcode 15 at AML address ..." error
messages as originally reported. Thus, I confirm that current upstream iasl
presents the same reported problem for the RHEL-5.11 guest.

Comment 36 Laszlo Ersek 2016-11-11 11:45:23 UTC

*** Bug 1394095 has been marked as a duplicate of this bug. ***

Comment 37 Danilo de Paula 2016-11-16 17:18:23 UTC

Fix included in qemu-kvm-1.5.3-127.el7

Comment 39 Yanan Fu 2017-03-13 08:58:40 UTC

Verified this BZ with the latest qemu-kvm build available at this time.

Test version:
kernel: kernel-3.10.0-591.el7.x86_64
qemu: qemu-kvm-1.5.3-133.el7.x86_64
seabios: seavgabios-bin-1.10.1-2.el7.noarch
         seabios-bin-1.10.1-2.el7.noarch

This test is covered by the acceptance tests.
Tested with both 32-bit and 64-bit RHEL 5.11 guests; all pass.


020-smp_8.8192m.repeat1.Host_RHEL.m7.u4.spice.qcow2.virtio_blk.up.virtio_net.RHEL.5.11.x86_64.io-github-autotest-qemu.shutdown  PASS
022-smp_8.8192m.repeat1.Host_RHEL.m7.u4.spice.qcow2.virtio_blk.up.virtio_net.Win2012.x86_64.r2.io-github-autotest-qemu.shutdown  PASS

Manual testing also passes.

Based on the test results above, moving to VERIFIED.

Comment 40 errata-xmlrpc 2017-08-01 17:46:48 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1856
