2357044 – Rawhide kernel-6.15 randomly shutting down system

Summary: Rawhide kernel-6.15 randomly shutting down system

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2025-04-02 22:50 UTC by Ian Laurie
Modified:	2025-04-14 21:05 UTC
CC List:	17 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2025-04-14 21:05:23 UTC
Type:	---
Embargoed:
Dependent Products:
Flags:	mario.limonciello: needinfo- mario.limonciello: needinfo? (yijun_shen) mario.limonciello: needinfo-

Attachments
acpidump (887.71 KB, text/plain) 2025-04-04 00:56 UTC, Ian Laurie

Description Ian Laurie 2025-04-02 22:50:02 UTC

1. Please describe the problem:
The system shuts down at random, anywhere from 10 minutes to two hours after boot.

2. What is the Version-Release number of the kernel:
kernel-6.15.0-0.rc0.20250327git1a9239bb4253.5.fc43.x86_64
kernel-6.15.0-0.rc0.20250401git08733088b566.8.fc43.x86_64

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
First noticed on kernel-6.15.0-0.rc0.20250327git1a9239bb4253.5.fc43.x86_64.
Previous kernel-6.14.0-0.rc7.20250321gitb3ee1e460951.60.fc43.x86_64 works as expected.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:
100% reproducible, simply boot any currently available 6.15 kernel.

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:
As of this writing, yes.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
No, not on this box.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.
The system "believes" the power button is being pressed.  The important log entries from 2 episodes follow:

Apr 01 12:01:00 zorac CROND[1494]: (root) CMD (run-parts /etc/cron.hourly)
Apr 01 12:01:00 zorac run-parts[1497]: (/etc/cron.hourly) starting 0anacron
Apr 01 12:01:00 zorac run-parts[1503]: (/etc/cron.hourly) finished 0anacron
Apr 01 12:01:00 zorac CROND[1493]: (root) CMDEND (run-parts /etc/cron.hourly)
Apr 01 12:22:28 zorac systemd-logind[821]: Power key pressed short.
Apr 01 12:22:28 zorac systemd-logind[821]: Powering off...
Apr 01 12:22:28 zorac systemd-logind[821]: System is powering down.

****

Apr 01 15:30:21 zorac audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-tmpfiles-clean comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Apr 01 15:30:21 zorac audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-tmpfiles-clean comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Apr 01 15:31:37 zorac systemd-logind[847]: Power key pressed short.
Apr 01 15:31:37 zorac systemd-logind[847]: Powering off...
Apr 01 15:31:37 zorac systemd-logind[847]: System is powering down.

I installed 'evtest' as advised on the test mailing list and determined that my power button is event2. However, the events shutting down the system arrive on event1, as shown when grabbing event1:

zorac$ sudo evtest --grab /dev/input/event1
[sudo] password for admin: 
Input driver version is 1.0.1
Input device ID: bus 0x19 vendor 0x0 product 0x1 version 0x0
Input device name: "Power Button"
Supported events:
  Event type 0 (EV_SYN)
  Event type 1 (EV_KEY)
    Event code 116 (KEY_POWER)
    Event code 143 (KEY_WAKEUP)
Properties:
Testing ... (interrupt to exit)
Event: time 1743497614.264130, type 1 (EV_KEY), code 116 (KEY_POWER), value 1
Event: time 1743497614.264130, -------------- SYN_REPORT ------------
Event: time 1743497614.264135, type 1 (EV_KEY), code 116 (KEY_POWER), value 0
Event: time 1743497614.264135, -------------- SYN_REPORT ------------
Event: time 1743500523.593170, type 1 (EV_KEY), code 116 (KEY_POWER), value 1
Event: time 1743500523.593170, -------------- SYN_REPORT ------------
Event: time 1743500523.593175, type 1 (EV_KEY), code 116 (KEY_POWER), value 0
Event: time 1743500523.593175, -------------- SYN_REPORT ------------
Event: time 1743502114.807090, type 1 (EV_KEY), code 116 (KEY_POWER), value 1
Event: time 1743502114.807090, -------------- SYN_REPORT ------------
Event: time 1743502114.807095, type 1 (EV_KEY), code 116 (KEY_POWER), value 0
Event: time 1743502114.807095, -------------- SYN_REPORT ------------
Event: time 1743507242.211034, type 1 (EV_KEY), code 116 (KEY_POWER), value 1
Event: time 1743507242.211034, -------------- SYN_REPORT ------------
Event: time 1743507242.211039, type 1 (EV_KEY), code 116 (KEY_POWER), value 0
Event: time 1743507242.211039, -------------- SYN_REPORT ------------
Event: time 1743540057.620123, type 1 (EV_KEY), code 116 (KEY_POWER), value 1
Event: time 1743540057.620123, -------------- SYN_REPORT ------------
Event: time 1743540057.620128, type 1 (EV_KEY), code 116 (KEY_POWER), value 0
Event: time 1743540057.620128, -------------- SYN_REPORT ------------
Event: time 1743541608.688139, type 1 (EV_KEY), code 116 (KEY_POWER), value 1
Event: time 1743541608.688139, -------------- SYN_REPORT ------------
Event: time 1743541608.688144, type 1 (EV_KEY), code 116 (KEY_POWER), value 0
Event: time 1743541608.688144, -------------- SYN_REPORT ------------
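
The spacing between these spurious presses can be worked out from the evtest timestamps. The following is a minimal sketch, assuming the evtest output format shown above; the function names power_press_times and gaps_in_minutes are hypothetical helpers, not part of evtest.

```python
import re

# Matches evtest lines such as:
# Event: time 1743497614.264130, type 1 (EV_KEY), code 116 (KEY_POWER), value 1
EVENT_RE = re.compile(
    r"^Event: time (?P<time>\d+\.\d+), type \d+ \(EV_KEY\), "
    r"code \d+ \(KEY_POWER\), value (?P<value>\d+)$"
)


def power_press_times(lines):
    """Return the timestamps (seconds since the epoch) of KEY_POWER presses."""
    times = []
    for line in lines:
        match = EVENT_RE.match(line.strip())
        # value 1 is the key-down half of each press; value 0 is the release.
        if match and match.group("value") == "1":
            times.append(float(match.group("time")))
    return times


def gaps_in_minutes(times):
    """Return the gap between consecutive presses, rounded to 0.1 minute."""
    return [round((later - earlier) / 60, 1)
            for earlier, later in zip(times, times[1:])]
```

Feeding the captured evtest lines to power_press_times() and passing the result to gaps_in_minutes() gives the interval between each pair of presses. For the six presses above the gaps should come out at roughly 48, 27, 85, 547 and 26 minutes.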

My event list looks as follows:

Available devices:
/dev/input/event0:	Sleep Button
/dev/input/event1:	Power Button
/dev/input/event10:	HDA Intel PCH Headphone Mic
/dev/input/event11:	HDA Intel PCH Front Line Out
/dev/input/event12:	HDA Intel PCH HDMI/DP,pcm=3
/dev/input/event13:	HDA Intel PCH HDMI/DP,pcm=7
/dev/input/event14:	HDA Intel PCH HDMI/DP,pcm=8
/dev/input/event2:	Power Button
/dev/input/event3:	PixArt Dell MS116 USB Optical Mouse
/dev/input/event4:	Dell KB216 Wired Keyboard
/dev/input/event5:	Dell KB216 Wired Keyboard System Control
/dev/input/event6:	Dell KB216 Wired Keyboard Consumer Control
/dev/input/event7:	Video Bus
/dev/input/event8:	PC Speaker
/dev/input/event9:	Dell WMI hotkeys

Please note the real power button is event2.
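
Since two devices share the name "Power Button", it may help to tell them apart by something other than the name. This is a minimal sketch, assuming the usual /proc/bus/input/devices layout (blank-line separated blocks with "N: Name=", "P: Phys=" and "H: Handlers=" lines); parse_input_devices and power_buttons are hypothetical helper names.

```python
def parse_input_devices(text):
    """Parse /proc/bus/input/devices text into a list of dicts.

    Each dict has 'name', 'phys' and 'events' (a list of eventN handlers).
    """
    devices = []
    for block in text.strip().split("\n\n"):
        if not block.strip():
            continue  # skip empty input
        dev = {"name": "", "phys": "", "events": []}
        for line in block.splitlines():
            if line.startswith("N: Name="):
                dev["name"] = line[len("N: Name="):].strip().strip('"')
            elif line.startswith("P: Phys="):
                dev["phys"] = line[len("P: Phys="):].strip()
            elif line.startswith("H: Handlers="):
                handlers = line[len("H: Handlers="):].split()
                dev["events"] = [h for h in handlers if h.startswith("event")]
        devices.append(dev)
    return devices


def power_buttons(text):
    """Return only the devices whose name is 'Power Button'."""
    return [d for d in parse_input_devices(text) if d["name"] == "Power Button"]
```

Calling power_buttons(open("/proc/bus/input/devices").read()) should list both power button devices. The 'phys' field is the part most likely to differ between them, which would show which eventN node belongs to which underlying device.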

As a final test I booted the latest available 6.14 kernel while grabbing event1. It has now run for more than 20 hours (and is still running) without a single event being reported on event1.  I therefore believe this is a 6.15 issue.

My Hardware:
Dell Optiplex 3040 1 x 6th Gen Intel(R) Core(TM) i3-6100T CPU @ 3.20GHz
Intel Corporation HD Graphics 530 (rev 06)
16G RAM

For reference the associated thread on Fedora Test List:
https://lists.fedoraproject.org/archives/list/test@lists.fedoraproject.org/thread/SYNFSBCLQ7VUSGWIULVWUDXJM5JHYNH3/

Reproducible: Always

Comment 1 Mario Limonciello 2025-04-03 14:34:29 UTC

As mentioned in https://lore.kernel.org/linux-acpi/CAJZ5v0hbA6bqxHupTh4NZR-GVSb9M5RL7JSb2yQgvYYJg+z2aQ@mail.gmail.com/T/#t

I'd like to understand how this is actually happening to decide what we 
should do about it.

1) Could you please add an acpidump into the bug report?
2) Can you please use acpica tracing to determine what is happening when 
this notify event comes in?  The basic way to do it:

echo 0x00000004 | sudo tee /sys/module/acpi/parameters/trace_debug_layer
echo 0x00000004 | sudo tee /sys/module/acpi/parameters/trace_debug_level
echo enable | sudo tee /sys/module/acpi/parameters/trace_state

This should then save to the journal the associated event info.

Comment 2 Ian Laurie 2025-04-04 00:49:03 UTC

Is there something I need to do before activating acpica tracing?  I'm getting the following error:

zorac$ echo 0x00000004 | sudo tee /sys/module/acpi/parameters/trace_debug_layer
tee: /sys/module/acpi/parameters/trace_debug_layer: Permission denied
0x00000004

I get this even if I 'su' to root.

Comment 3 Ian Laurie 2025-04-04 00:56:44 UTC

Created attachment 2083357
acpidump

This is after booting into:

kernel-6.15.0-0.rc0.20250401git08733088b566.8.fc43.x86_64

Comment 4 Ian Laurie 2025-04-04 01:47:58 UTC

I cannot write into /sys/module/acpi/parameters/ as root, even if I change the permissions on /sys to 755.

Comment 5 Mario Limonciello 2025-04-04 02:47:09 UTC

Is your kernel built with CONFIG_ACPI_DEBUG?  If not, you might need to build with that for it to work.
Here is more information on it:
https://www.kernel.org/doc/html/v6.14-rc7/firmware-guide/acpi/method-tracing.html

From your acpidump, am I right that your ACPI power button is \_SB_.PWRB?  You can confirm it with this:
# cat /sys/bus/acpi/drivers/button/PNP0C0C:00/path

I see a few ways that this is notified.
* Level triggered GPE 6D:
  Notify (\_SB.PWRB, 0x02) // Device Wake
* PME for the root port an XHCI controller is attached to (when PME is enabled for that root port)
 Notify (PWRB, 0x02) // Device Wake
 Notify (XHC, 0x02) // Device Wake
* System _WAK which notifies Super IO (PNP0C02) via \_SB.PCI0.LPCB.SIO1.SIOW
                        If ((PMS1 & 0x08))
                        {
                            Notify (PS2K, 0x02) // Device Wake
                            Notify (PWRB, 0x02) // Device Wake
                        }

                        If ((PMS1 & 0x10))
                        {
                            Notify (PS2M, 0x02) // Device Wake
                            Notify (PWRB, 0x02) // Device Wake
                        }
* System _WAK which notifies RWAK

Some other questions for you that might help me understand how this is happening.

1) What was your system doing when this happened?  Did you by chance plug something into your USB controller?  Or remove something?  Did you do suspend/resume near then?
2) Is it possible for you to capture /sys/firmware/acpi/interrupts/gpe6D both at bootup and, if it normally doesn't increment, right after the issue happens?  You might need to configure logind to ignore power button events for now to make sure your system doesn't turn off when it happens.
3) Would it be possible for you to try reverting the suspected patch to see if this issue goes away?
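
For question 2, the counter can be sampled with a few lines of Python. This is a minimal sketch, assuming the GPE counters live under /sys/firmware/acpi/interrupts and that each file starts with the event count followed by status flags; parse_gpe_count and read_gpe_count are hypothetical helper names. To keep the machine up while sampling, logind can be told to ignore the key by setting HandlePowerKey=ignore in the [Login] section of logind.conf and restarting systemd-logind.

```python
from pathlib import Path

# Assumed location of the ACPI interrupt counters in sysfs.
GPE_DIR = Path("/sys/firmware/acpi/interrupts")


def parse_gpe_count(text):
    """Return the event count, assumed to be the first whitespace-separated field."""
    return int(text.split()[0])


def read_gpe_count(name="gpe6D", base=GPE_DIR):
    """Read the current count for one GPE, e.g. gpe6D."""
    return parse_gpe_count((Path(base) / name).read_text())
```

Calling read_gpe_count() once right after boot and again after a spurious power event, then comparing the two numbers, would show whether GPE 6D fired in between.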

Comment 6 Mario Limonciello 2025-04-04 02:51:52 UTC

One more thing.  Assuming that the root cause is this patch, can you test if this patch helps?

diff --git a/drivers/acpi/button.c b/drivers/acpi/button.c
index 90b09840536dd..740c80cb17033 100644
--- a/drivers/acpi/button.c
+++ b/drivers/acpi/button.c
@@ -444,10 +444,14 @@ static void acpi_button_notify(acpi_handle handle, u32 event, void *data)
        struct input_dev *input;
        int keycode;

+       button = acpi_driver_data(device);
+
        switch (event) {
        case ACPI_BUTTON_NOTIFY_STATUS:
                break;
        case ACPI_BUTTON_NOTIFY_WAKE:
+               if (!button->suspended)
+                       return;
                break;
        default:
                acpi_handle_debug(device->handle, "Unsupported event [0x%x]\n",
@@ -457,7 +461,6 @@ static void acpi_button_notify(acpi_handle handle, u32 event, void *data)

        acpi_pm_wakeup_event(&device->dev);

-       button = acpi_driver_data(device);
        if (button->suspended)
                return;
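
To make the intent of the diff easier to follow, here is a simplified Python model of the notify handler's decision logic. It is not kernel code. The function name handle_notify, the action strings, the early return for unsupported events, and the value 0x80 for ACPI_BUTTON_NOTIFY_STATUS are assumptions for illustration; 0x02 is the "Device Wake" value seen in the Notify() calls quoted in comment 5.

```python
ACPI_BUTTON_NOTIFY_WAKE = 0x02    # "Device Wake", as in Notify (PWRB, 0x02)
ACPI_BUTTON_NOTIFY_STATUS = 0x80  # regular button press (assumed value)


def handle_notify(event, suspended, patched):
    """Return the actions one notification triggers in this simplified model.

    Possible actions are "wakeup_event" and "report_key".
    """
    if event == ACPI_BUTTON_NOTIFY_WAKE:
        if patched and not suspended:
            return []  # patched: ignore wake notifications while running
    elif event != ACPI_BUTTON_NOTIFY_STATUS:
        return []  # unsupported event (assumed to return early)
    actions = ["wakeup_event"]
    if not suspended:
        # A reported key is what logind sees as "Power key pressed short."
        actions.append("report_key")
    return actions
```

In this model, handle_notify(0x02, suspended=False, patched=False) yields a reported key press, which matches the spurious shutdowns, while the same call with patched=True yields no action at all. A wake notification that arrives while suspended still produces a wakeup event in both cases.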

@Yijun:

Can you check if the original reason for https://git.kernel.org/torvalds/c/a7e23ec17feec still works with that patch?

Comment 7 Ian Laurie 2025-04-04 03:41:28 UTC

In config-6.15.0-0.rc0.20250401git08733088b566.8.fc43.x86_64 I can see this:

# CONFIG_ACPI_DEBUG is not set
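
The same check can be scripted. This is a minimal sketch, assuming the usual kernel config file syntax ("OPTION=value" for set options and "# OPTION is not set" for disabled ones); kconfig_state is a hypothetical helper name.

```python
import re


def kconfig_state(config_text, option):
    """Return the option's value ('y', 'm', ...), 'not set', or None if absent."""
    set_re = re.compile(rf"^{re.escape(option)}=(.*)$")
    unset_line = f"# {option} is not set"
    for line in config_text.splitlines():
        line = line.strip()
        if line == unset_line:
            return "not set"
        match = set_re.match(line)
        if match:
            return match.group(1)
    return None
```

On Fedora the text would typically come from the config file in /boot that matches the running kernel, for example open("/boot/config-" + platform.release()).read() after importing platform. A result of "not set" for CONFIG_ACPI_DEBUG matches what is shown above.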

Comment 8 Ian Laurie 2025-04-04 03:42:30 UTC

zorac$ cat /sys/bus/acpi/drivers/button/PNP0C0C:00/path
\_SB_.PWRB

Comment 9 Ian Laurie 2025-04-04 03:51:07 UTC

> 1) What was your system doing when this happened?  Did you by chance plug something into your USB controller?  Or remove something?  Did you do suspend/resume near then?

No to all.  Suspend/Resume is disabled.  The system was just sitting idle. Normally this system is headless and I ssh/Xrdp into it as needed. I now have console access, which I set up while trying the steps in comment 1 (I thought being remote might have been the issue).

> 2) Is it possible for you to capture /sys/firmware/acpi/interrupts/gpe6D both at bootup and, if it normally doesn't increment, right after the issue happens?  You might need to configure logind to ignore power button events for now to make sure your system doesn't turn off when it happens.

I am willing to do anything, but no clue how to do this.

> 3) Would it be possible for you to try reverting the suspected patch to see if this issue goes away?

I don't think I have the space on this box to compile a kernel, and honestly this is uncharted territory for me. I need local Fedora help to test this.

Comment 10 Mario Limonciello 2025-04-04 14:55:09 UTC

Here's a patch that I think should help your issue and still work for Yijun.

https://lore.kernel.org/linux-acpi/20250404145034.2608574-1-superm1@kernel.org/T/#u

Hopefully some Fedora guys can make you a test kernel.  I'll needinfo Hans, maybe he can.

Comment 11 Mario Limonciello 2025-04-04 14:55:45 UTC

Or maybe Justin.  Whoever does, please clear the needinfos for the other Fedora guys when you post it.

Comment 12 Ian Laurie 2025-04-05 00:33:02 UTC

This could be a "red herring" because we're dealing with an element of randomness, but is there any chance that having acpica-tools installed (or not) could influence this problem?

I was running a 6.15 kernel overnight for maybe 6 hours and I didn't see any shutdown events.  This morning I uninstalled acpica-tools and across about 3 hours I've seen 3 shutdown events.  It could easily be a coincidence but it seems suspicious to me.

Comment 13 Mario Limonciello 2025-04-05 02:15:51 UTC

I don't see any reason to believe those two are linked.  That package doesn't install any daemons, the tools inside it are launched on demand.

Comment 14 Justin M. Forbes 2025-04-05 19:58:58 UTC

(In reply to Mario Limonciello from comment #10)
> Here's a patch that I think should help your issue and still work for Yijun.
> 
> https://lore.kernel.org/linux-acpi/20250404145034.2608574-1-superm1@kernel.org/T/#u
> 
> Hopefully some Fedora guys can make you a test kernel.  I'll needinfo Hans,
> maybe he can.

https://koji.fedoraproject.org/koji/taskinfo?taskID=131147672 should be done soon for testing.

Comment 15 Ian Laurie 2025-04-05 22:57:24 UTC

> https://koji.fedoraproject.org/koji/taskinfo?taskID=131147672 should be done
> soon for testing.

Thanks Justin, running it now and trapping event type 1.

Comment 16 Ian Laurie 2025-04-06 05:31:11 UTC

More than 6 hours later still no bogus events.  The acid test is overnight though.  But it's looking really good so far.

Comment 17 Ian Laurie 2025-04-06 23:16:10 UTC

Still no bogus events after 24 hours.

Comment 18 Mario Limonciello 2025-04-07 00:43:25 UTC

Sounds like the correct root cause.  If you wouldn't mind, please leave a Tested-by tag [1] on the v2 patch submission.

[1] https://www.kernel.org/doc/html/latest/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes

Comment 19 Ian Laurie 2025-04-07 02:24:46 UTC

(In reply to Mario Limonciello from comment #18)
> Sounds like the correct root cause.  If you wouldn't mind, please leave a
> Tested-by tag [1] on the v2 patch submission.

Hopefully I did that correctly.

Comment 20 Ian Laurie 2025-04-07 09:45:28 UTC

I ran updates on my Rawhide box and allowed the kernel to update to:

kernel-6.15.0-0.rc0.20250404gite48e99b6edf4.11.fc43.x86_64

This build (from 2025-04-05) presumably does not yet have the patch that fixes the issue, and I got my first bogus event1 in under 30 minutes.

Comment 21 Ian Laurie 2025-04-14 21:05:23 UTC

If I'm not mistaken rc2 upstream has the fix for this, and rc2 is available in Fedora now.

I suspect the patch made it into at least one of the later Fedora rc1 kernels as well because:

    kernel-6.15.0-0.rc1.20250413git7cdabafc0012.21.fc43 

tested OK.  I'll close this as fixed.
