Bug 2357044
Summary: | Rawhide kernel-6.15 randomly shutting down system
---|---
Product: | [Fedora] Fedora
Component: | kernel
Version: | rawhide
Hardware: | x86_64
OS: | Linux
Status: | CLOSED CURRENTRELEASE
Severity: | urgent
Priority: | unspecified
Reporter: | Ian Laurie <nixuser>
Assignee: | Kernel Maintainer List <kernel-maint>
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
CC: | acaringi, adscvr, airlied, bskeggs, hdegoede, hpa, jforbes, josef, kernel-maint, linville, mario.limonciello, masami256, mchehab, ptalbert, steved, suraj.ghimire7, yijun_shen
Flags: | mario.limonciello: needinfo-, mario.limonciello: needinfo? (yijun_shen), mario.limonciello: needinfo-
Last Closed: | 2025-04-14 21:05:23 UTC

Description
Ian Laurie
2025-04-02 22:50:02 UTC
As mentioned in https://lore.kernel.org/linux-acpi/CAJZ5v0hbA6bqxHupTh4NZR-GVSb9M5RL7JSb2yQgvYYJg+z2aQ@mail.gmail.com/T/#t I'd like to understand how this is actually happening to decide what we should do about it.

1) Could you please add an acpidump into the bug report?
2) Can you please use acpica tracing to determine what is happening when this notify event comes in?

The basic way to do it:

    echo 0x00000004 | sudo tee /sys/module/acpi/parameters/trace_debug_layer
    echo 0x00000004 | sudo tee /sys/module/acpi/parameters/trace_debug_level
    echo enable | sudo tee /sys/module/acpi/parameters/trace_state

This should then save to the journal the associated event info.

Is there something I need to do before activating acpica tracing? I'm getting the following error:

    zorac$ echo 0x00000004 | sudo tee /sys/module/acpi/parameters/trace_debug_layer
    tee: /sys/module/acpi/parameters/trace_debug_layer: Permission denied
    0x00000004

I get this even if I 'su' to root.

Created attachment 2083357 (acpidump)
This is after booting into:
kernel-6.15.0-0.rc0.20250401git08733088b566.8.fc43.x86_64
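
The "Permission denied" error from tee is consistent with the tracing parameters not existing in the running kernel, since sysfs does not let even root create new files. A minimal sketch for checking whether the kernel was built with CONFIG_ACPI_DEBUG before trying to enable tracing might look like the following. It assumes the distribution installs the kernel config as /boot/config-$(uname -r), as Fedora does; the function name acpi_debug_status is hypothetical and not part of any existing tool.

```shell
#!/bin/sh
# Report whether a kernel config file enables CONFIG_ACPI_DEBUG.
# Usage: acpi_debug_status [CONFIG_FILE]
# Prints "enabled", "disabled", or "unknown" (config file not readable).
# Assumption: the config lives at /boot/config-<release> when no file is given.
acpi_debug_status() {
    cfg="${1:-/boot/config-$(uname -r)}"
    if [ ! -r "$cfg" ]; then
        echo "unknown"
    elif grep -q '^CONFIG_ACPI_DEBUG=y' "$cfg"; then
        echo "enabled"
    else
        echo "disabled"
    fi
}

# Example: only attempt to enable tracing when the option is built in.
# if [ "$(acpi_debug_status)" = "enabled" ]; then
#     echo enable | sudo tee /sys/module/acpi/parameters/trace_state
# fi
```

If this prints "disabled", the trace_* files are most likely absent, and changing permissions on /sys will not help.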

I cannot write into /sys/module/acpi/parameters/ as root, even if I change the permissions on /sys to 755.

Is your kernel built with CONFIG_ACPI_DEBUG? If not, you might need to build with that for it to work. Here is more information on it: https://www.kernel.org/doc/html/v6.14-rc7/firmware-guide/acpi/method-tracing.html

From your acpidump, am I right that your ACPI power button is \_SB_.PWRB? You can confirm it with this:

    # cat /sys/bus/acpi/drivers/button/PNP0C0C:00/path

I see a few ways that this is notified.

* Level triggered GPE 6D:
      Notify (\_SB.PWRB, 0x02) // Device Wake
* PME for the root port an XHCI controller is attached to (when PME is enabled for that root port)
      Notify (PWRB, 0x02) // Device Wake
      Notify (XHC, 0x02) // Device Wake
* System _WAK which notifies Super IO (PNP0C02) via \_SB.PCI0.LPCB.SIO1.SIOW
      If ((PMS1 & 0x08))
      {
          Notify (PS2K, 0x02) // Device Wake
          Notify (PWRB, 0x02) // Device Wake
      }
      If ((PMS1 & 0x10))
      {
          Notify (PS2M, 0x02) // Device Wake
          Notify (PWRB, 0x02) // Device Wake
      }
* System _WAK which notifies RWAK

Some other questions for you that might help me understand how this is happening.

1) What was your system doing when this happened? Did you by chance plug something into your USB controller? Or remove something? Did you do suspend/resume near then?
2) Is it possible for you to capture /sys/firmware/acpi/interrupts/gpe6D both at bootup and, if it normally doesn't increment, right after the issue happens? You might need to configure logind to ignore power button events for now to make sure your system doesn't turn off when it happens.
3) Would it be possible for you to try to revert the suspected patch to see if this issue goes away?

One more thing. Assuming that the root cause is this patch, can you test if this patch helps?

diff --git a/drivers/acpi/button.c b/drivers/acpi/button.c
index 90b09840536dd..740c80cb17033 100644
--- a/drivers/acpi/button.c
+++ b/drivers/acpi/button.c
@@ -444,10 +444,14 @@ static void acpi_button_notify(acpi_handle handle, u32 event, void *data)
 	struct input_dev *input;
 	int keycode;
 
+	button = acpi_driver_data(device);
+
 	switch (event) {
 	case ACPI_BUTTON_NOTIFY_STATUS:
 		break;
 	case ACPI_BUTTON_NOTIFY_WAKE:
+		if (!button->suspended)
+			return;
 		break;
 	default:
 		acpi_handle_debug(device->handle, "Unsupported event [0x%x]\n",
@@ -457,7 +461,6 @@ static void acpi_button_notify(acpi_handle handle, u32 event, void *data)
 
 	acpi_pm_wakeup_event(&device->dev);
 
-	button = acpi_driver_data(device);
 	if (button->suspended)
 		return;
 

@Yijun: Can you check if the original reason for https://git.kernel.org/torvalds/c/a7e23ec17feec still works with that patch?

In config-6.15.0-0.rc0.20250401git08733088b566.8.fc43.x86_64 I can see this:

    # CONFIG_ACPI_DEBUG is not set

    zorac$ cat /sys/bus/acpi/drivers/button/PNP0C0C:00/path
    \_SB_.PWRB

> 1) What was your system doing when this happened? Did you by chance plug something into your USB controller? Or remove something? Did you do suspend/resume near then?

No to all. Suspend/Resume is disabled. System was just sitting doing nothing. Normally this system is headless, I ssh/Xrdp into it as needed. I can now access console because in trying to do Comment#1 I now have a console (I thought being remote may have been the issue).

> 2) Is it possible for you to capture /sys/firmware/acpi/interrupts/gpe6D both at bootup and, if it normally doesn't increment, right after the issue happens? You might need to configure logind to ignore power button events for now to make sure your system doesn't turn off when it happens.

I am willing to do anything, but no clue how to do this.

> 3) Would it be possible for you to try to revert the suspected patch to see if this issue goes away?

I don't think I have the space on this box to compile a kernel, and honestly this is uncharted territory for me. I need local Fedora help to test this.

Here's a patch that I think should help your issue and still work for Yijun.

https://lore.kernel.org/linux-acpi/20250404145034.2608574-1-superm1@kernel.org/T/#u

Hopefully some Fedora guys can make you a test kernel. I'll needinfo Hans, maybe he can. Or maybe Justin. Whoever does, please clear the needinfos for other Fedora guys when you post it.

This could be a "red herring" because we're dealing with an element of randomness, but is there any chance that having acpica-tools installed (or not) could influence this problem? I was running a 6.15 kernel overnight for maybe 6 hours and I didn't see any shutdown events. This morning I uninstalled acpica-tools and across about 3 hours I've seen 3 shutdown events. It could easily be a coincidence but it seems suspicious to me.

I don't see any reason to believe those two are linked. That package doesn't install any daemons, the tools inside it are launched on demand.

(In reply to Mario Limonciello from comment #10)
> Here's a patch that I think should help your issue and still work for Yijun.
>
> https://lore.kernel.org/linux-acpi/20250404145034.2608574-1-superm1@kernel.org/T/#u
>
> Hopefully some Fedora guys can make you a test kernel. I'll needinfo Hans,
> maybe he can.

https://koji.fedoraproject.org/koji/taskinfo?taskID=131147672 should be done soon for testing.

> https://koji.fedoraproject.org/koji/taskinfo?taskID=131147672 should be done
> soon for testing.
Thanks Justin, running it now and trapping event type 1.
More than 6 hours later still no bogus events. The acid test is overnight though. But it's looking really good so far.

Still no bogus events after 24 hours.

Sounds like the correct root cause. If you wouldn't mind, please leave a Tested-by tag [1] on the v2 patch submission.

[1] https://www.kernel.org/doc/html/latest/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes

(In reply to Mario Limonciello from comment #18)
> Sounds like the correct root cause. If you wouldn't mind, please leave a
> Tested-by tag [1] on the v2 patch submission.

Hopefully I did that correctly.

I ran updates on my Rawhide box and allowed the kernel to update to:

kernel-6.15.0-0.rc0.20250404gite48e99b6edf4.11.fc43.x86_64

Which (from 2025-04-05) I'm guessing would not yet have the patch to fix the issue, and I got my first bogus event1 in under 30 minutes.

If I'm not mistaken rc2 upstream has the fix for this, and rc2 is available in Fedora now.

I suspect the patch made it into at least one of the later Fedora rc1 kernels as well because:

kernel-6.15.0-0.rc1.20250413git7cdabafc0012.21.fc43

tested OK. I'll close this as fixed.
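
For anyone repeating the earlier diagnostic request (capturing the gpe6D counter while logind ignores the power button), a minimal sketch of one way to do it follows. It assumes the counter is exposed at /sys/firmware/acpi/interrupts/gpe6D with the event count as the first field, and that logind reads drop-ins from /etc/systemd/logind.conf.d. The function names and the drop-in file name 90-ignore-power-key.conf are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Print the event count (first field) from an ACPI GPE counter file.
# Usage: gpe_count [FILE]
gpe_count() {
    awk '{ print $1; exit }' "${1:-/sys/firmware/acpi/interrupts/gpe6D}"
}

# Print how many events occurred between two readings.
# Usage: gpe_delta BEFORE AFTER
gpe_delta() {
    echo $(( $2 - $1 ))
}

# Write a logind drop-in telling logind to ignore the power key.
# Usage: write_powerkey_ignore DIR  (normally /etc/systemd/logind.conf.d)
# The drop-in file name is an arbitrary, hypothetical choice.
write_powerkey_ignore() {
    mkdir -p "$1" &&
    printf '[Login]\nHandlePowerKey=ignore\n' > "$1/90-ignore-power-key.conf"
}

# Example (as root):
#   write_powerkey_ignore /etc/systemd/logind.conf.d
#   (restart systemd-logind or reboot for the setting to take effect)
#   before=$(gpe_count)                    # right after boot
#   ...wait until a bogus power button event is seen...
#   gpe_delta "$before" "$(gpe_count)"     # non-zero suggests GPE 6D fired
```

A delta of zero after a bogus event would point away from GPE 6D and toward one of the other notify paths listed earlier in the thread.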