Bug 2043382 - rawhide kernel 5.17.0-0.rc0.20220112gitdaadb3bd0e8d.63.fc36.x86_64 fails to boot on qemu in detect_thinkpad_privacy_screen
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: TRACKER-bugs-affecting-libguestfs 2045693 2046225 2047772
Reported: 2022-01-21 06:50 UTC by Han Han
Modified: 2022-02-08 21:17 UTC (History)

Fixed In Version: kernel-5.17.0-0.rc3.89.fc36
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-02-08 21:17:06 UTC
Type: Bug
Embargoed:


Attachments
The log of libguestfs (34.91 KB, text/plain)
2022-01-21 06:50 UTC, Han Han

Description Han Han 2022-01-21 06:50:59 UTC
Created attachment 1852425
The log of libguestfs

Description of problem:
As subject

Version-Release number of selected component (if applicable):
libguestfs-1.47.2-2.fc36.x86_64
qemu-6.2.0-2.fc36.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Download image
➜  ~ wget https://dl.fedoraproject.org/pub/fedora/linux/releases/35/Cloud/x86_64/images/Fedora-Cloud-Base-35-1.2.x86_64.qcow2 -O /var/lib/libvirt/images/fedora.qcow2

2. Execute virt-customize
➜  ~ LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 virt-customize -a /var/lib/libvirt/images/fedora.qcow2 --uninstall cloud-init --install qemu-guest-agent --install NetworkManager --install libselinux-python3 --network 2>&1 |tee /tmp/log

Actual results:
[    2.155549] Code: 10 00 00 00 4c 89 44 24 08 4c 89 4c 24 10 e8 05 bd 00 00 48 83 fb ff 75 07 48 8b 1d 57 12 cd 03 e8 c4 bc 00 00 44 8b 7c 24 18 <4c> 8b 73 18 45 31 e4 c7 04 24 00 00 00 00 bd 01 00 00 00 41 83 e7
[    2.155549] RSP: 0000:ffffb62f8000bd40 EFLAGS: 00010293
[    2.155549] RAX: ffffffffac4ec9c7 RBX: 0000000000000000 RCX: 0000000000000010
[    2.155549] RDX: ffffffffac4ec9c7 RSI: ffffffffac4ec9b0 RDI: 00000000000000a6
[    2.155549] RBP: 0000000000000000 R08: ffffffffab8782de R09: 0000000000000000
[    2.155549] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[    2.155549] R13: ffffffffac8fd657 R14: 0000000000000000 R15: 0000000000000001
[    2.155549] FS:  0000000000000000(0000) GS:ffff9b764e200000(0000) knlGS:0000000000000000
[    2.155549] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033         
[    2.155549] CR2: 0000000000000018 CR3: 000000004b028001 CR4: 0000000000770ef0
[    2.155549] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    2.155549] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    2.155549] PKRU: 55555554
[    2.155549] Call Trace:                                                                     
[    2.155549]  <TASK>      
[    2.155549]  ? acpi_walk_namespace+0x13e/0x13e    
[    2.155549]  acpi_get_devices+0xd3/0x110
[    2.155549]  ? drm_core_init+0xd4/0xd4                                                      
[    2.155549]  ? drm_kms_helper_init+0xa/0xa
[    2.155549]  detect_thinkpad_privacy_screen+0x51/0x8d
[    2.155549]  drm_privacy_screen_lookup_init+0xa/0x43
[    2.155549]  drm_core_init+0xac/0xd4
[    2.155549]  do_one_initcall+0x67/0x350
[    2.155549]  ? kernel_init_freeable+0x273/0x2cf
[    2.155549]  kernel_init_freeable+0x283/0x2cf
[    2.155549]  ? rest_init+0x260/0x260
[    2.155549]  kernel_init+0x16/0x130
[    2.155549]  ret_from_fork+0x22/0x30
[    2.155549]  </TASK>
[    2.155549] Modules linked in:
[    2.155549] CR2: 0000000000000018
[    2.155549] ---[ end trace 02928d238129499d ]---
[    2.155549] RIP: 0010:acpi_ns_walk_namespace+0x60/0x27f
[    2.155549] Code: 10 00 00 00 4c 89 44 24 08 4c 89 4c 24 10 e8 05 bd 00 00 48 83 fb ff 75 07 48 8b 1d 57 12 cd 03 e8 c4 bc 00 00 44 8b 7c 24 18 <4c> 8b 73 18 45 31 e4 c7 04 24 00 00 00 00 bd 01 00 00 00 41 83 e7

libguestfs: child_cleanup: 0x556254c5c9d0: child process died
libguestfs: trace: launch = -1 (error)
virt-customize: error: libguestfs error: guestfs_launch failed, see earlier error messages

If reporting bugs, run virt-customize with debugging enabled and include the complete output:

Expected results:
No kernel panic

Additional info:
See the full log in the attachment
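
For anyone triaging a similar log, here is a minimal Python sketch of how the oops above can be summarised. It is not part of libguestfs or any kernel tool. The function name `summarise_oops` and the regular expressions are assumptions made for illustration, and the sample is an excerpt of the log in this report. It relies on two common conventions: frames prefixed with `?` are unreliable stack entries, and a CR2 value below one page (4096) usually means a NULL pointer plus a field offset.

```python
import re

# Excerpt of the oops from this report (trailing whitespace removed).
SAMPLE = """\
[    2.155549] CR2: 0000000000000018 CR3: 000000004b028001 CR4: 0000000000770ef0
[    2.155549] Call Trace:
[    2.155549]  <TASK>
[    2.155549]  ? acpi_walk_namespace+0x13e/0x13e
[    2.155549]  acpi_get_devices+0xd3/0x110
[    2.155549]  ? drm_core_init+0xd4/0xd4
[    2.155549]  detect_thinkpad_privacy_screen+0x51/0x8d
[    2.155549]  drm_privacy_screen_lookup_init+0xa/0x43
[    2.155549]  drm_core_init+0xac/0xd4
[    2.155549] RIP: 0010:acpi_ns_walk_namespace+0x60/0x27f
"""

# A reliable frame is "symbol+0xoff/0xlen" directly after the timestamp;
# frames marked with "?" do not match because "?" is not a symbol character.
FRAME = re.compile(r"\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")
RIP = re.compile(r"RIP: [0-9a-f]{4}:([\w.]+)\+0x")
CR2 = re.compile(r"CR2: ([0-9a-f]{16})")


def summarise_oops(log):
    """Return the faulting symbol, fault address and reliable call chain."""
    rip = RIP.search(log)
    cr2 = CR2.search(log)
    fault = int(cr2.group(1), 16) if cr2 else None
    return {
        "rip": rip.group(1) if rip else None,
        "fault_address": fault,
        # Heuristic only: a fault inside the first page suggests NULL + offset.
        "looks_like_null_deref": fault is not None and fault < 4096,
        "frames": FRAME.findall(log),
    }


if __name__ == "__main__":
    for key, value in summarise_oops(SAMPLE).items():
        print(key, value)
```

On this log the reliable chain runs from drm_core_init through detect_thinkpad_privacy_screen into acpi_get_devices, with the fault at address 0x18 inside acpi_ns_walk_namespace.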

Comment 1 Richard W.M. Jones 2022-01-21 08:54:18 UTC
This is likely a rawhide kernel bug.  You might want to try upgrading
to one of the newer kernels:

https://koji.fedoraproject.org/koji/packageinfo?packageID=8

or use a non-rawhide kernel.  But basically it's a kernel bug.

Comment 2 Richard W.M. Jones 2022-01-21 09:05:02 UTC
This bug also stopped a package rebuilding in the Fedora 36 mass rebuild yesterday:

https://koji.fedoraproject.org/koji/taskinfo?taskID=81519983

Comment 3 Richard W.M. Jones 2022-01-21 11:07:40 UTC
And it breaks libguestfs builds on x86-64:

https://koji.fedoraproject.org/koji/taskinfo?taskID=81593116

Comment 4 Richard W.M. Jones 2022-01-25 20:37:12 UTC
Still happening with kernel-5.17.0-0.rc0.20220112gitdaadb3bd0e8d.63.fc36.x86_64

Comment 5 Justin M. Forbes 2022-01-26 00:50:37 UTC
Not really surprised that it hasn't been fixed in the original kernel that the error was reported on. I haven't been able to build a kernel since then due to gcc 12. No, I do not know when I will have kernels building again; every time I get past one error, it exposes another.

Comment 6 Richard W.M. Jones 2022-01-26 10:24:35 UTC
I bisected this to:

f809891ee51b706e1a2a42998d8766c120660796 is the first bad commit
commit f809891ee51b706e1a2a42998d8766c120660796
Author: Hans de Goede <hdegoede>
Date:   Tue Oct 5 22:23:20 2021 +0200

    platform/x86: thinkpad_acpi: Register a privacy-screen device
    
    Register a privacy-screen device on laptops with a privacy-screen,
    this exports the PrivacyGuard features to user-space using a
    standardized vendor-agnostic sysfs interface. Note the sysfs interface
    is read-only.
    
    Registering a privacy-screen device with the new privacy-screen class
    code will also allow the GPU driver to get a handle to it and export
    the privacy-screen setting as a property on the DRM connector object
    for the LCD panel. This DRM connector property is a new standardized
    interface which all user-space code should use to query and control
    the privacy-screen.
    
    Reviewed-by: Emil Velikov <emil.l.velikov>
    Reviewed-by: Lyude Paul <lyude>
    Reviewed-by: Mark Pearson <markpearson>
    Signed-off-by: Hans de Goede <hdegoede>
    Link: https://patchwork.freedesktop.org/patch/msgid/20211005202322.700909-9-hdegoede@redhat.com

 drivers/platform/x86/Kconfig         |  2 +
 drivers/platform/x86/thinkpad_acpi.c | 97 ++++++++++++++++++++++++++----------
 2 files changed, 74 insertions(+), 25 deletions(-)

Comment 7 Richard W.M. Jones 2022-01-26 10:27:10 UTC
(In reply to Justin M. Forbes from comment #5)
> Not really surprised that it hasn't been fixed in the original kernel that
> the error was reported on. I haven't been able to build a kernel since then
> due to gcc 12. No, I do not know when I will have kernels building again,
> every time I get past one error, it exposes another.

I'm also unable to compile the kernel with GCC 12.  To bisect
this bug I reverted my Rawhide machine back to GCC 11.

Comment 8 Richard W.M. Jones 2022-01-26 10:49:41 UTC
Confirmed that reverting f809891ee51b70 (on top of current kernel head)
fixes the problem.  I don't really understand why though.

Comment 9 Hans de Goede 2022-02-01 11:56:37 UTC
So the issue here is that acpi_get_devices() (called from detect_thinkpad_privacy_screen, see the call trace) does not like being called on systems where ACPI has not been initialized, and the qemu model being used by guestfs is so old that ACPI fails to initialize:

[    0.013339] ACPI BIOS Error (bug): A valid RSDP was not found (20211217/tbxfroot-210)

A kernel-fix for this has already landed in 5.17-rc2
https://cgit.freedesktop.org/drm-misc/commit/?h=drm-misc-fixes&id=7fde14d705985dd933a3d916d39daa72b1668098

And a pull-req has been submitted to Linus:
https://lore.kernel.org/dri-devel/CAPM=9tweQ-RgLm5oewCYqVzRuiQ6cSQrb2yzVYP_537V67pdDQ@mail.gmail.com/

Note the fix talks about using acpi=off on the kernel commandline, but the acpi_disabled bool for which a check is added also gets set on systems where parsing the ACPI tables fails, so the patch should also fix this bug.
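
To make the reasoning above concrete, here is a toy Python model of the control flow. It is not kernel code. `ToyKernel`, `boot`, `walk_devices` and `detect_privacy_screen` are hypothetical names invented for this sketch; only the `acpi_disabled` flag, the `acpi=off` option and the missing-RSDP failure come from this report. The point it illustrates is that a single check on `acpi_disabled` covers both the `acpi=off` case and the case where parsing the ACPI tables fails.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyKernel:
    """Hypothetical stand-in for the relevant kernel state."""
    acpi_disabled: bool = False
    # None models an ACPI namespace that was never set up.
    namespace_root: Optional[List[str]] = None


def boot(cmdline: str, rsdp_found: bool) -> ToyKernel:
    """Model the two ways acpi_disabled ends up set, per the comment above."""
    kernel = ToyKernel()
    if "acpi=off" in cmdline.split() or not rsdp_found:
        kernel.acpi_disabled = True
    else:
        kernel.namespace_root = []  # initialised, no devices in this toy
    return kernel


def walk_devices(kernel: ToyKernel) -> List[str]:
    """Stand-in for the ACPI device walk; fails without a namespace."""
    if kernel.namespace_root is None:
        raise RuntimeError("NULL pointer dereference (toy model of the oops)")
    return kernel.namespace_root


def detect_privacy_screen(kernel: ToyKernel, patched: bool) -> bool:
    """With patched=True, bail out early when ACPI is disabled."""
    if patched and kernel.acpi_disabled:
        return False
    return len(walk_devices(kernel)) > 0


if __name__ == "__main__":
    # No acpi=off on the command line, but no valid RSDP either.
    guest = boot("panic=1 console=ttyS0", rsdp_found=False)
    print("patched:", detect_privacy_screen(guest, patched=True))
    try:
        detect_privacy_screen(guest, patched=False)
    except RuntimeError as err:
        print("unpatched:", err)
```

In the model, the guest from this bug takes the "RSDP not found" path, so the guarded version returns early while the unguarded one fails in the walk.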

Comment 10 Hans de Goede 2022-02-01 12:00:26 UTC
Ugh, I somehow ended up submitting my comment while I was still editing it. I meant to drop the:

> And a pull-req has been submitted to Linus:
> https://lore.kernel.org/dri-devel/CAPM=9tweQ-RgLm5oewCYqVzRuiQ6cSQrb2yzVYP_537V67pdDQ@mail.gmail.com/

since, as mentioned above, I noticed that Linus has already pulled the fix:
https://cgit.freedesktop.org/drm-misc/commit/?h=drm-misc-fixes&id=7fde14d705985dd933a3d916d39daa72b1668098

into 5.17-rc2. So this should be fixed as soon as we are able to build kernels in rawhide again.

Richard, can you confirm that 5.17-rc2 fixes this by testing a 5.17-rc2 build with GCC 11?

Note a possible (temporary) workaround might be to use a newer machine model in qemu which does actually support ACPI.

Comment 11 Richard W.M. Jones 2022-02-01 12:02:08 UTC
(In reply to Hans de Goede from comment #9)
> So the issue here is that acpi_get_devices() does not like being called on
> systems where ACPI has not been initialized, and the qemu model being used by
> guestfs is so old that ACPI fails to initialize:

I don't think we set any model?  Does libvirt / qemu pick some default model?

The libvirt XML was:

<?xml version="1.0"?>
<domain type="qemu" xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0">
  <name>guestfs-2slh17gcjz8370pb</name>
  <memory unit="MiB">1280</memory>
  <currentMemory unit="MiB">1280</currentMemory>
  <cpu mode="maximum"/>
  <vcpu>1</vcpu>
  <clock offset="utc">
    <timer name="rtc" tickpolicy="catchup"/>
    <timer name="pit" tickpolicy="delay"/>
    <timer name="hpet" present="no"/>
  </clock>
  <os>
    <type>hvm</type>
    <kernel>/builddir/build/BUILD/guestfs-tools-1.47.3/tmp/.guestfs-1000/appliance.d/kernel</kernel>
    <initrd>/builddir/build/BUILD/guestfs-tools-1.47.3/tmp/.guestfs-1000/appliance.d/initrd</initrd>
    <cmdline>panic=1 console=ttyS0 edd=off udevtimeout=6000 udev.event-timeout=6000 no_timer_check printk.time=1 cgroup_disable=memory usbcore.nousb cryptomgr.notests tsc=reliable 8250.nr_uarts=1 root=UUID=d51c4ea7-05ba-48c9-a149-708edb152d66 selinux=0 guestfs_verbose=1 TERM=vt100</cmdline>
    <bios useserial="yes"/>
  </os>
  <seclabel type="none"/>
  <on_reboot>destroy</on_reboot>
  <devices>
    <emulator>/usr/bin/qemu-kvm</emulator>
    <rng model="virtio">
      <backend model="random">/dev/urandom</backend>
    </rng>
    <controller type="scsi" index="0" model="virtio-scsi"/>
    <disk device="disk" type="file">
      <source file="/builddir/build/BUILD/guestfs-tools-1.47.3/tmp/libguestfslYmzSB/devnull1.img"/>
      <target dev="sda" bus="scsi"/>
      <driver name="qemu" type="raw" cache="writeback"/>
      <address type="drive" controller="0" bus="0" target="0" unit="0"/>
    </disk>
    <disk type="file" device="disk">
      <source file="/builddir/build/BUILD/guestfs-tools-1.47.3/tmp/libguestfslYmzSB/overlay2.qcow2"/>
      <target dev="sdb" bus="scsi"/>
      <driver name="qemu" type="qcow2" cache="unsafe"/>
      <address type="drive" controller="0" bus="0" target="1" unit="0"/>
    </disk>
    <serial type="unix">
      <source mode="connect" path="/tmp/libguestfsidUYqU/console.sock"/>
      <target port="0"/>
    </serial>
    <channel type="unix">
      <source mode="connect" path="/tmp/libguestfsidUYqU/guestfsd.sock"/>
      <target type="virtio" name="org.libguestfs.channel.0"/>
    </channel>
    <controller type="usb" model="none"/>
    <memballoon model="none"/>
  </devices>
  <qemu:commandline>
    <qemu:env name="TMPDIR" value="/builddir/build/BUILD/guestfs-tools-1.47.3/tmp"/>
  </qemu:commandline>
</domain>

Comment 12 Hans de Goede 2022-02-01 12:13:03 UTC
A better workaround might actually be to add modprobe.blacklist=<module-name> to the kernel commandline to stop the GPU driver from loading. Loading the GPU driver causes drm.ko to get loaded as a dependency, and drm.ko is the module that has the bug.

To figure out the <module-name>, run the following on a VM using the above config:

lsmod | grep drm

Then see which module(s) depend on drm. There might be several, but the others which depend on drm are likely only being loaded because the driver for the emulated gfx-card also depends on some helper-libs. Just stopping the gfx-card driver itself from loading should be enough.

I expect the gfx-card driver to be one of "cirrus", "qxl", "vmwgfx" or "virtio-gpu". Since no specific card is specified I guess it will be "cirrus".
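
As a rough illustration of the lookup described above, the following Python sketch reads `lsmod` output, takes the "Used by" column of the `drm` line, and builds the `modprobe.blacklist=` argument for any known GPU driver found there. The helper names and the sample `lsmod` text (including the module sizes) are made up for the example and are not output from the appliance in this bug. The driver list mirrors the candidates named above, spelled the way `lsmod` prints them, with underscores.

```python
# Candidate emulated gfx-card drivers, as lsmod spells them.
KNOWN_GPU_DRIVERS = ("cirrus", "qxl", "vmwgfx", "virtio_gpu")

# Hypothetical lsmod output; sizes and use counts are invented.
SAMPLE_LSMOD = """\
Module                  Size  Used by
cirrus                 16384  0
drm_kms_helper        200704  1 cirrus
drm                   634880  2 drm_kms_helper,cirrus
"""


def drm_users(lsmod_output):
    """Return the modules listed in the 'Used by' column of the drm line."""
    for line in lsmod_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "drm":
            if len(fields) < 4:
                return []  # drm is loaded but nothing depends on it
            return [name for name in fields[3].split(",") if name]
    return []  # drm is not loaded as a module


def blacklist_arg(lsmod_output):
    """Build a kernel command line argument, or None if no GPU driver is found."""
    gpu = [name for name in drm_users(lsmod_output) if name in KNOWN_GPU_DRIVERS]
    if not gpu:
        return None
    return "modprobe.blacklist=" + ",".join(gpu)


if __name__ == "__main__":
    print(blacklist_arg(SAMPLE_LSMOD))
```

Helper libraries such as drm_kms_helper are deliberately left out of the result, following the suggestion that blocking the gfx-card driver itself should be enough.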

Comment 13 Richard W.M. Jones 2022-02-01 12:27:20 UTC
> Richard, can you confirm that 5.17-rc2 fixes this by testing a 5.17-rc2 build with gcc11 ?

Yes, I built 5.17-rc2 from git (using GCC 11) and can confirm that the
bug has been fixed.

I'll leave this bug open until we get it into Fedora.

Comment 14 Richard W.M. Jones 2022-02-08 21:17:06 UTC
This is now fixed in Fedora.

