Bug 2355276 - Fedora boot failure on Dell XPS 9640 after BIOS update to 1.13.0
Summary: Fedora boot failure on Dell XPS 9640 after BIOS update to 1.13.0
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 41
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Hans de Goede
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2025-03-27 09:36 UTC by a-team
Modified: 2025-06-19 08:58 UTC (History)
32 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2025-06-18 12:44:15 UTC
Type: ---
Embargoed:
hdegoede: mirror+


Attachments (Terms of Use)
Dell Support help (1.27 MB, application/pdf)
2025-05-01 04:32 UTC, Christopher Patrick
Screenshot of boot attempt (3.35 MB, image/jpeg)
2025-05-08 16:49 UTC, Tom "spot" Callaway
photo of attempt to boot jflory's kernel (3.53 MB, image/jpeg)
2025-05-17 12:04 UTC, Tom "spot" Callaway
Error message with custom ISO (2.23 MB, image/jpeg)
2025-05-26 03:56 UTC, Christopher Patrick
"acpidump" output after the Dell XPS 9640 1.13.0 BIOS update (4.26 MB, text/plain)
2025-05-31 19:56 UTC, Peter Williams
XPS 9640 BIOS version 1.12.0 SSDT changes which workaround Fedora kernels not booting (16.36 KB, patch)
2025-06-05 16:10 UTC, Hans de Goede


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FC-1712 0 None None None 2025-06-02 13:13:04 UTC

Description a-team 2025-03-27 09:36:24 UTC
1. Please describe the problem:

After the latest BIOS update to version 1.12.0, it's impossible to boot the machine. GRUB is displayed and I can choose between 3 kernel versions or the rescue option, but none of the 4 will boot. The screen remains black with a white cursor or the message "Booting Fedora ...". The only way to shut down the laptop is to hold down the power button.

Note that I also tried to boot from a USB stick with F40 and F42, without success.

2. What is the Version-Release number of the kernel:

6.13.8-200.fc41.x86_64
6.13.7-200.fc41.x86_64
6.13.6-200.fc41.x86_64

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

6.13.8-200.fc41.x86_64


7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Unfortunately, I can't access the command line to run journalctl


Reproducible: Always

Comment 1 d3d9 2025-04-07 21:06:01 UTC
Same here on a Dell Inspiron 16 Plus 7640, in this case with BIOS version 1.13.0. Downgrading to 1.12.0 fixed the issue.
It occurred on various kernels from 6.12.6 to 6.13.7, and booting from USB didn't help either.
journalctl shows nothing at all between the shutdown after my last successful boot a week ago and my first successful boot after the downgrade.

According to a post on the Dell forums, it also occurred with a Linux Mint USB stick as well as on Fedora, and on yet another model / BIOS variant that received an update around the same time for the same reason (the critical update for CVE-2024-38796): https://www.dell.com/community/en/conversations/inspiron/dell-bios-update-breaks-linux-installs/67d09c66c5ead74c2bf65cd5

Comment 2 Christopher Patrick 2025-05-01 04:21:31 UTC
Same here on an Alienware m16 R2. I upgraded to version 1.1.10+. I cannot downgrade back to 1.9.0, which is the last version that worked. I cannot boot openSUSE TW, Linux Mint or any Fedora ISO via Fedora USB Installer or Ventoy.

Comment 3 Christopher Patrick 2025-05-01 04:32:00 UTC
Created attachment 2087940 [details]
Dell Support help

I contacted Dell via Facebook Messenger, and the following is what they had me try, with no success.

Comment 4 Peter Williams 2025-05-01 20:51:15 UTC
I don't know anything about kernel debugging, so I suspect that this won't be helpful at all. But, I booted the Fedora 42 installer with parameters "acpi=off earlyprintk=efi earlycon=efifb nosmp nowatchdog console=" and was able to get some diagnostic output on a new XPS 9640 with the 1.12.0 BIOS. The issue happens around here:

```
Booting paravirtualized kernel on bare hardware
BUG: unable to handle page fault for address: ffffffffff5fc330
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
```

The key call trace lines look like they are:

```
? asm_exc_page_fault+0x26/0x30
? native_apic_mem_read+0x6/0x20
? intel_thermal_supported+0x5/0x30
? therm_lvt_init+0x23/0x30
```

But here's a report from someone else that looks pretty different: https://lore.kernel.org/lkml/Z-aD1ughy6fd8Ask@archimedes.dunstkreis.ch/T/#u . I think they didn't use "acpi=off", which might be the cause of the difference?
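
For anyone reading these oops headers: the `error_code(0x0000)` value is the x86 page-fault error code, and the two `#PF:` lines are just its low bits spelled out. A minimal sketch of that decoding, assuming the standard x86 bit layout; the helper name `decode_pf_error_code` is made up for illustration and is not a kernel function:

```python
# Decode the low bits of an x86 page-fault error code, as printed in
# "#PF: error_code(0x....)" lines of an oops.
# decode_pf_error_code is a hypothetical helper name, not kernel code.

def decode_pf_error_code(code: int) -> dict:
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if code & 0x1 else "not-present page",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if code & 0x2 else "read",
        # bit 2: 0 = supervisor (kernel) mode, 1 = user mode
        "mode": "user" if code & 0x4 else "supervisor",
        # bit 4: 1 = the fault was an instruction fetch
        "instruction_fetch": bool(code & 0x10),
    }


if __name__ == "__main__":
    # The value from the log above
    print(decode_pf_error_code(0x0000))
```

For the value in the log this gives a supervisor-mode read of a not-present page, which matches the two `#PF:` lines.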

Comment 5 Orion Leidl Wilson 2025-05-08 12:57:24 UTC
I am having this issue too, with the m16 R2, BIOS 1.1.10.

Comment 6 Tom "spot" Callaway 2025-05-08 16:44:48 UTC
(In reply to Peter Williams from comment #4)
> The issue happens around here:
> 
> ```
> Booting paravirtualized kernel on bare hardware
> BUG: unable to handle page fault for address: ffffffffff5fc330
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> ```


I have the exact same error on my Alienware M16 R2 (1.11.0 firmware) with 6.14.4-300.fc42, booting with those options appended.

My first call trace:

? therm_lvt_init+0x23/0x30
? setup_arch+0x87c/0x8c0
? start_kernel+0x64/0x490
? x86_64_start_reservations+0x24/0x30
? x86_64_start_kernel+0xed/0xf0
? common_startup_64+0x13e/0x141

I'll take a photo and attach it too.

Comment 7 Tom "spot" Callaway 2025-05-08 16:49:46 UTC
Created attachment 2089090 [details]
Screenshot of boot attempt

Comment 9 Peter Williams 2025-05-09 13:32:43 UTC
Also: the people experiencing this issue are reporting that Ubuntu and OpenSUSE kernels can boot successfully, FWIW. So it's not an issue that's universal to *all* Linux kernel builds.

Comment 10 Albert Amadeo 2025-05-11 14:23:15 UTC
I have a Dell Inspiron Plus 7640 with the same issue trying to install first Rocky Linux and then Fedora 42.
In the GRUB screen I added a line with "set debug=all" after the initrd line and I got this output:

script/lexer.c:336:lexer token 259 text []
script/lexer.c:336:lexer token 0 text []
(same 2 lines once more)
loader/efi/linux.c:236:linux kernel_address: 0x10000000 handover_offset: 0x1015e70 params: 0x5251e000
loader/efi/linux.c:252:nx: Setting attributes for 0x10000000-0x5adefff to r-x
loader/efi/linux.c:252:nx: permissions for 0x10000000 are ---
loader/efi/linux.c:252:nx: Setting attributes for stack at 0xNumberA-0xNumberB to rw-
loader/efi/linux.c:252:nx: permissions for 0xNumberA are ---

*NumberA and NumberB are placeholders; I simplified the numbers because I'm copying them manually.

I hope this helps

Comment 11 Justin M. Forbes 2025-05-13 17:20:37 UTC
I suppose the real task is to figure out why OpenSUSE kernel works and ours does not.

Comment 12 Christopher Patrick 2025-05-13 20:28:13 UTC
(In reply to Justin M. Forbes from comment #11)
> I suppose the real task is to figure out why OpenSUSE kernel works and ours
> does not.

I couldn't get openSUSE TW to boot a few weeks ago. Might need to figure out why Ubuntu will boot.

Comment 13 tts26 2025-05-13 21:06:33 UTC
Dell's release notes for BIOS update 1.10.0 (available at https://www.dell.com/support/kbdoc/en-us/000270384/dsa-2025-044) indicate that this update addresses security vulnerability DSA-2025-044. This vulnerability is further detailed in the Tianocore EDK2 security advisory: https://github.com/tianocore/edk2/security/advisories/GHSA-xpcr-7hjq-m6qm.

It appears that the fix implemented in BIOS version 1.10.0 (for Alienware M16 R2), while addressing the security vulnerability, has introduced a regression that prevents the operating system kernel from booting.

Comment 14 tts26 2025-05-13 21:08:34 UTC
The above fix for https://www.dell.com/support/kbdoc/en-us/000270384/dsa-2025-044 was also pushed to the XPS 16 9640 in a BIOS update.

Comment 15 kasunt 2025-05-15 21:56:24 UTC
+1 on this issue

OpenSUSE stock kernel worked for me.

Where I got stuck next was trying to build the NVIDIA drivers for that kernel.

Comment 16 kasunt 2025-05-15 21:58:01 UTC
Forgot to say previously that I faced this issue on 9640

Comment 17 Justin M. Forbes 2025-05-17 00:31:20 UTC
Not a proper fix, but does someone want to try https://koji.fedoraproject.org/koji/taskinfo?taskID=132842118 and let me know if it works.  This is the SUSE config on our kernel source, so that at least tells me if it is a config option or a patch that they carry.

Comment 18 Christopher Patrick 2025-05-17 03:18:56 UTC
I would, but I cannot install Fedora at all. How would I add the src rpm to the default 42 ISO?

Comment 19 tts26 2025-05-17 03:29:09 UTC
You can unpack the ISO, replace its kernel with the given build, and pack it back up. I have done a similar thing to try "bootconfig" logging, but I have had no luck. It hangs even before it's initialized.

Comment 20 kasunt 2025-05-17 03:29:30 UTC
If you have a 41 build I can certainly test it.

Unfortunately, I have an MDM that's not supported on 42.

Comment 21 Tom "spot" Callaway 2025-05-17 12:03:28 UTC
(In reply to Justin M. Forbes from comment #17)
> Not a proper fix, but does someone want to try
> https://koji.fedoraproject.org/koji/taskinfo?taskID=132842118 and let me
> know if it works.  This is the SUSE config on our kernel source, so that at
> least tells me if it is a config option or a patch that they carry.

Installing, then rebooting into that kernel does change the behavior, but does not result in a booting system.

Upon booting, this time plymouth starts, but it hangs almost immediately as the "loading" circle spins forever. Rebooting with "rhgb quiet" removed gives us more output, showing systemd starting in the initrd, but the process stops, reporting "tpm tpm0: auth session is active", and freezing there. I repeated this boot several times with the same outcome. Rebooting back into the OpenSUSE kernel got me back to a working Fedora userspace.

I will attach a photo of the output.

Comment 22 Tom "spot" Callaway 2025-05-17 12:04:34 UTC
Created attachment 2090199 [details]
photo of attempt to boot jflory's kernel

Comment 23 Tom "spot" Callaway 2025-05-17 12:05:25 UTC
(In reply to Tom "spot" Callaway from comment #22)
> Created attachment 2090199 [details]
> photo of attempt to boot jflory's kernel

Whoops. Should be "jforbes". I'm awake, I promise.

Comment 24 Justin M. Forbes 2025-05-17 15:15:56 UTC
Interesting, so it isn't just their config.  Thanks for letting me know. I will dig a bit more and see what I can find.

Comment 25 Christopher Patrick 2025-05-18 05:46:39 UTC
Interesting: I tried the Nobara 42 NVIDIA GNOME version and it booted. Links to the main site and to a page listing their kernel modifications: https://nobaraproject.org/ https://wiki.nobaraproject.org/modifications/kernel

Comment 26 Christopher Patrick 2025-05-18 06:47:24 UTC
Update to my previous comment: I installed it, but it didn't have a boot entry, so I added one and it still didn't boot. It does boot into the live environment.

Comment 27 Justin M. Forbes 2025-05-19 19:35:09 UTC
https://koji.fedoraproject.org/koji/taskinfo?taskID=132972310 Can someone give this one a try and LMK.

Comment 28 Peter Williams 2025-05-20 23:17:42 UTC
(In reply to Justin M. Forbes from comment #27)
> https://koji.fedoraproject.org/koji/taskinfo?taskID=132972310 Can someone
> give this one a try and LMK.

I'm *pretty* sure that this one gave me the exact same crash as I reported above. Because my problematic laptop doesn't actually have Linux installed on it yet, I needed to build a custom installer ISO using that kernel to try to test, but I believe that I managed to do that correctly, based on the kernel version number reported during the boot process.

Comment 29 Peter Williams 2025-05-21 01:25:22 UTC
OK, here's a more detailed report after doing a bit more work.

First thing, it appears that the kernel arguments I was using earlier were too aggressive. I'm now booting my installer live images with the following arguments:

```
rd.live.image earlyprintk=efi earlycon=efifb console=
```

(That is, **not** using `acpi=off nosmp nowatchdog`, and no `rhgb quiet` either.) This allows my boots to progress farther than I was seeing before.

I get the best results from the first custom kernel build, "task 132842118" (6.14.6-300). If I create a custom installer ISO and then boot it on my XPS 9640 with the Dell 1.13.0 firmware and the shorter list of boot args, my system boots up to the point that systemd starts up. Eventually the boot fails with this error:

```
FATAL: iscsiroot requested but kernel/initrd does not support iscsi
```

This seems to plausibly be an issue with how I'm constructing my custom installer image, or maybe the OpenSUSE kernel config really is turning off iscsi? Importantly, it's not a kernel panic! So that's something.

If I boot the second custom kernel build, "task 132972310" (6.14.7-300) with the shorter list of boot args, I do get a kernel panic during boot. It appears to be the same one as reported in the email to the kernel mailing list that I linked before: https://lore.kernel.org/lkml/Z-aD1ughy6fd8Ask@archimedes.dunstkreis.ch/T/#u . I now think that the different panic I was seeing before was a result of bad boot args and should be ignored.

So my results are now somewhat consistent with what @spot reported, in that the "task 132842118" kernel gets farther and doesn't obviously suffer a panic on boot. The boot still fails for both of us but for possibly different reasons.

Comment 30 Peter Williams 2025-05-21 03:09:33 UTC
A final note for now. Based on the discussion in https://bugzilla.redhat.com/show_bug.cgi?id=2041094 , it seems at least possible that my boot failure with the "task 132842118" kernel is because the OpenSUSE kernel configuration includes:

```
CONFIG_LOCALVERSION="-default"
```

It appears that this kernel causes `uname -r` to look like "6.14.6-300.fc42.x86_64-default", which makes the `kmod-static-nodes.service` systemd unit (and possible other parts of the system?) want to search for files in `/lib/modules/6.14.6-300.fc42.x86_64-default`. But the built RPMs provided by jforbes place files in `/lib/modules/6.14.6-300.fc42.x86_64`, without the `-default` suffix. The iscsi error might just be the first symptom of a general issue locating kernel modules, possibly due to this.
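
The mismatch described above can be written down as a tiny sketch. It assumes that `CONFIG_LOCALVERSION` is simply appended to the release string and that modules are looked up under `/lib/modules/<uname -r>`; both helper names are hypothetical:

```python
# Sketch of the suspected mismatch: the CONFIG_LOCALVERSION suffix ends up in
# `uname -r`, and module lookups follow `uname -r`, while the RPMs install
# into a directory named without the suffix.
# kernel_release and modules_dir are hypothetical helper names.

def kernel_release(base_release: str, localversion: str = "") -> str:
    """Release string as `uname -r` would report it, assuming the
    CONFIG_LOCALVERSION suffix is appended verbatim."""
    return base_release + localversion


def modules_dir(release: str) -> str:
    """Directory searched for the modules of a given kernel release."""
    return f"/lib/modules/{release}"


if __name__ == "__main__":
    rpm_release = "6.14.6-300.fc42.x86_64"              # where the RPMs place files
    running = kernel_release(rpm_release, "-default")   # what `uname -r` reports
    print("searched: ", modules_dir(running))
    print("installed:", modules_dir(rpm_release))
    print("match:    ", modules_dir(running) == modules_dir(rpm_release))
```

If the two paths differ, anything that resolves modules via `uname -r` would come up empty, which would be consistent with the iscsi error being only the first visible symptom.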

Comment 31 Justin M. Forbes 2025-05-22 13:21:18 UTC
(In reply to Peter Williams from comment #30)
> A final note for now. Based on the discussion in
> https://bugzilla.redhat.com/show_bug.cgi?id=2041094 , it seems at least
> possible that my boot failure with the "task 132842118" kernel is because
> the OpenSUSE kernel configuration includes:
> 
> ```
> CONFIG_LOCALVERSION="-default"
> ```
> 
> It appears that this kernel causes `uname -r` to look like
> "6.14.6-300.fc42.x86_64-default", which makes the
> `kmod-static-nodes.service` systemd unit (and possible other parts of the
> system?) want to search for files in
> `/lib/modules/6.14.6-300.fc42.x86_64-default`. But the built RPMs provided
> by jforbes place files in `/lib/modules/6.14.6-300.fc42.x86_64`, without the
> `-default` suffix. The iscsi error might just be the first symptom of a
> general issue locating kernel modules, possibly due to this.

Thanks for the catch there, that was very helpful.  I have done another build with the suse config and LOCALVERSION set correctly.  So this may still come down to a config only change, which is good because I did not see any patch that should be relevant.

Please try https://koji.fedoraproject.org/koji/taskinfo?taskID=133058701

Comment 32 Peter Williams 2025-05-22 15:46:28 UTC
With the "task 133058701" kernel (6.14.7-301), I get much farther in the boot process, but still eventually get to an oops. Systemd gets to the point of trying to start up targets like `plymouth-start.service`, `cryptsetup.target`, and `systemd-battery-check.service`, but at some point the logs show:

```
hub 2-0:1:0: USB hub found
hub 2-0:1:0: 4 ports detected
BUG: kernel NULL pointer dereference, address: 000000000000000a
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[...]
RIP: 0010:acpi_ds_exec_end_control_op+0x69/0x3f0
[...]
 <TASK>
 acpi_ds_exec_end_op+0x4bd/0x890
```

So this once again seems to be the same crash, just occurring much later :-(

If I boot with the `rhgb quiet` args, I see the boot screen spinner going, but pressing Escape to look at the console messages reveals the same oops, and "BUG: workqueue lockup" messages. So the system hasn't ground completely to a halt but it's not healthy.

Comment 33 Gee Rr 2025-05-23 14:44:56 UTC
Same issue noticed with Centos stream 10 as well. https://issues.redhat.com/browse/RHEL-93391

Comment 34 Peter Williams 2025-05-23 18:52:34 UTC
Aha! I tried a few more times with the "task 133058701" kernel, and sometimes it does work! But not every time. Tentatively, it seems like it usually works when the laptop is plugged into AC power, and it's never worked if it isn't.

Comment 35 Hans de Goede 2025-05-23 22:12:54 UTC
(In reply to Peter Williams from comment #34)
> Aha! I tried a few more times with the "task 133058701" kernel, and
> sometimes it does work! But not every time. Tentatively, it seems like it
> usually works when the laptop is plugged into AC power, and it's never
> worked if it isn't.

Hmm, I wonder if this has something to do with the new dell-wmi-ddv battery extension driver. Can you try adding: "modprobe.blacklist=dell-wmi-ddv" to the kernel commandline ?

Comment 36 Peter Williams 2025-05-24 16:56:40 UTC
We might be getting somewhere! With the laptop *not* plugged into AC, I was able to boot four times in a row successfully with the following kernel args:

```
ro selinux=0 nomodeset modprobe.blacklist=dell-wmi-ddv
```

(The `ro selinux=0` are Fedora's defaults on my machine, and the `nomodeset` seems to avoid occasional graphics hangs that I deeply hope have nothing to do with the current bug.) In the course of other experiments I *never* got the oops with the blacklist enabled.

With everything identical except for the blacklist argument, I got the oops about 50% of the time in a sample of around 6 boots. 

This is all still running the "task 133058701" kernel that has the OpenSUSE configuration.

Comment 37 Peter Williams 2025-05-24 17:38:12 UTC
FWIW, if I attempt to boot a stock Fedora kernel (6.14.6-300) with the blacklist argument, I still get the oops early in the boot process (2 out of 2 attempts, AC not plugged in).

This is extremely handwavey, but I note that stock Fedora has CONFIG_ACPI_AC=y and CONFIG_ACPI_BATTERY=y, while OpenSUSE has them set to "m". Possibly relevant?? I may try building my own kernel packages to check. I don't see any relevant-looking differences among the CONFIG_DELL_* settings.
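
For anyone who wants to repeat this kind of comparison, here is a minimal sketch of diffing two kernel config files by option prefix. The helper names `parse_kconfig` and `diff_kconfig` are made up, and treating a missing option as "n" is a simplifying assumption (an option can also be absent because it does not exist in that kernel version):

```python
# Compare two kernel .config files and report options whose values differ.
# parse_kconfig and diff_kconfig are hypothetical helper names.

def parse_kconfig(text: str) -> dict:
    """Map option name -> value; "# CONFIG_FOO is not set" becomes "n"."""
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_kconfig(a: dict, b: dict, prefix: str = "CONFIG_") -> dict:
    """Options starting with `prefix` whose values differ between a and b.
    Assumption: an option missing from one side is treated as "n"."""
    names = sorted(n for n in set(a) | set(b) if n.startswith(prefix))
    return {n: (a.get(n, "n"), b.get(n, "n"))
            for n in names if a.get(n, "n") != b.get(n, "n")}


if __name__ == "__main__":
    # Only the two options mentioned above; real configs have thousands of lines.
    fedora = parse_kconfig("CONFIG_ACPI_AC=y\nCONFIG_ACPI_BATTERY=y\n")
    opensuse = parse_kconfig("CONFIG_ACPI_AC=m\nCONFIG_ACPI_BATTERY=m\n")
    for name, (left, right) in diff_kconfig(fedora, opensuse, "CONFIG_ACPI_").items():
        print(f"{name}: fedora={left} opensuse={right}")
```

Passing a prefix such as `CONFIG_DELL_` or `CONFIG_USB_XHCI_` narrows the output to one subsystem at a time.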

Comment 38 Hans de Goede 2025-05-25 14:05:39 UTC
Hmm, I forgot that the dell-laptop driver also includes a battery-extension part.

For those where only blacklisting the dell-wmi-ddv driver does not help, can you try adding:

modprobe.blacklist=dell-wmi-ddv,dell-laptop

to your kernel commandline ?
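
Since people in this thread are editing kernel command lines by hand, here is a small sketch of merging module names into a single `modprobe.blacklist=` argument. The function name `add_module_blacklist` is hypothetical, and the whitespace split is a simplification that does not handle quoted arguments:

```python
# Merge module names into one modprobe.blacklist= argument on a kernel
# command line. add_module_blacklist is a hypothetical helper name.
# Assumption: arguments are separated by whitespace and none are quoted.

def add_module_blacklist(cmdline: str, modules: list) -> str:
    key = "modprobe.blacklist="
    kept, listed = [], []
    for arg in cmdline.split():
        if arg.startswith(key):
            # Collect modules that are already blacklisted
            listed += [m for m in arg[len(key):].split(",") if m]
        else:
            kept.append(arg)
    for module in modules:
        if module not in listed:
            listed.append(module)
    if listed:
        kept.append(key + ",".join(listed))
    return " ".join(kept)


if __name__ == "__main__":
    current = "ro selinux=0 nomodeset modprobe.blacklist=dell-wmi-ddv"
    print(add_module_blacklist(current, ["dell-wmi-ddv", "dell-laptop"]))
```

The result can be pasted at the GRUB edit prompt for a one-off test boot.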

Comment 39 Peter Williams 2025-05-25 15:25:53 UTC
Adding dell-laptop yields no change on the stock Fedora kernels.

I've started playing around with building my own kernels and trying to see if I can identify specific configuration options that affect whether the boot succeeds. Unfortunately, no luck as of yet. I tried copying over all of the OpenSUSE ACPI-related setting changes into `kernel-local` and I still got the same results as the stock kernel. A few other tests with relevant-sounding option names also haven't yielded any leads.

Now that I can actually boot my laptop, is there some kind of ACPI bytecode decompiler/debugger that I can use to try to identify the broken code that is actually leading to the crash? It would be great to be able to identify the root cause here. I've tried inserting dell-wmi-ddv after a successful boot but doing so doesn't instantly lead to a panic or anything. I've also tried booting with super low level ACPI debugging flags turned on, but the spew slows things down so much that the system can't boot correctly because it thinks everything is timing out.

Comment 40 Christopher Patrick 2025-05-25 17:03:43 UTC
Could someone please upload an ISO with the working kernel so more people can test it?

Comment 41 Gee Rr 2025-05-25 21:26:29 UTC
Is this something that happens exclusively with Intel processors, or does it also happen with AMD?

Comment 42 Peter Williams 2025-05-25 21:56:49 UTC
Here's a live CD / installer ISO that I made with the "task 133058701" kernel:

https://drive.google.com/file/d/1kjAWgKlXhKqWKIi1X6jXRCwaIXqJ7Rf8/view?usp=sharing

File size 2,388,236,288 bytes, SHA256 digest 5a1a226518548d419a1a74feb6185fcf10539f515d89bbae5dc5d92c0478f7ea. On my XPS 9640 with the 1.13.0 BIOS, I have consistent success booting when I use the "e" (edit) key to edit the kernel boot parameters to end with:

```
rd.live.image nomodeset modprobe.blacklist=dell-wmi-ddv
```

(I followed these general instructions: https://fedoraproject.org/wiki/Livemedia-creator-_How_to_create_and_use_a_Live_CD, with a customized kickstart file and file:/// repo to provide and install the custom kernel RPMs.)
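
A minimal sketch for checking a downloaded copy against the size and SHA256 digest quoted above, using only the Python standard library. The function name `verify_download` is made up, and the path to the ISO is whatever you saved it as:

```python
# Verify a downloaded file against an expected size and SHA256 digest.
# verify_download is a hypothetical helper name; the two constants are the
# values quoted in this comment.
import hashlib
import os
import sys

EXPECTED_SIZE = 2_388_236_288
EXPECTED_SHA256 = "5a1a226518548d419a1a74feb6185fcf10539f515d89bbae5dc5d92c0478f7ea"


def verify_download(path: str, expected_size: int, expected_sha256: str) -> bool:
    # Cheap check first: a truncated download fails here without hashing
    if os.path.getsize(path) != expected_size:
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so a multi-GB ISO is not loaded into memory
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256.lower()


if __name__ == "__main__":
    if len(sys.argv) == 2:
        ok = verify_download(sys.argv[1], EXPECTED_SIZE, EXPECTED_SHA256)
        print("OK" if ok else "MISMATCH")
```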

Comment 43 tts26 2025-05-26 00:10:09 UTC
In the given ISO, CONFIG_DELL_WMI_DDV is set to m; in Ubuntu's case, it isn't. From the live environment I ran:
rm -f .../dell-wmi-ddv.ko.xz
sudo depmod -a
After that, it boots without any kernel parameters.

Comment 44 tts26 2025-05-26 00:19:01 UTC
I think I have a rough idea: after the BIOS update (to fix CVE-2024-38796), Dell may have changed the ACPI tables.

current behavior:

dell-wmi-ddv loads
runs wmi_has_guid() + acpi_evaluate_object() on some Dell WMI GUID
the ACPI method tries to dereference something that no longer exists or is misaligned
kernel dies with:
BUG: kernel NULL pointer dereference
RIP: acpi_ds_exec_end_control_op+0x69/0x3f0

further,
Older BIOS: the ACPI method wasn’t buggy yet.
Ubuntu: it never loaded dell-wmi-ddv.
Fedora: oops after buggy ACPI method is introduced.

I think this explains it.

Comment 45 Christopher Patrick 2025-05-26 01:39:03 UTC
If this solves it, could someone from Fedora let Dell know so they could fix it?

Comment 46 tts26 2025-05-26 01:41:47 UTC
And please post any updates here, if possible. Thanks.

Comment 47 Christopher Patrick 2025-05-26 03:54:47 UTC
(In reply to Peter Williams from comment #42)
> Here's a live CD / installer ISO that I made with the "task 133058701"
> kernel:
> 
> https://drive.google.com/file/d/1kjAWgKlXhKqWKIi1X6jXRCwaIXqJ7Rf8/
> view?usp=sharing
> 
> File size 2,388,236,288 bytes, SHA256 digest
> 5a1a226518548d419a1a74feb6185fcf10539f515d89bbae5dc5d92c0478f7ea. On my XPS
> 9640 with the 1.13.0 BIOS, I have consistent success booting when I use the
> "e" (edit) key to edit the kernel boot parameters to end with:
> 
> ```
> rd.live.image nomodeset modprobe.blacklist=dell-wmi-ddv
> ```
> 
> (I followed these general instructions:
> https://fedoraproject.org/wiki/Livemedia-creator-
> _How_to_create_and_use_a_Live_CD, with a customized kickstart file and
> file:/// repo to provide and install the custom kernel RPMs.)

The ISO boots with any changes to the kernel boot parameters. I tried to install it and it succeeded, but grub only shows Windows and UEFI Firmware Settings as options to boot to.

Comment 48 Christopher Patrick 2025-05-26 03:56:48 UTC
Created attachment 2091572 [details]
Error message with custom ISO

When I boot with the ISO provided in the Google Drive link, it shows the following error message and then boots.

Comment 49 Peter Williams 2025-05-26 19:04:20 UTC
Lots of people get emails whenever this bug is updated, so I propose that any discussion of the workaround ISO happen in this forum thread:

https://discussion.fedoraproject.org/t/cannot-boot-into-installation-media-or-installed-fedora-system-on-dell-xps-16-9640-after-bios-update/148548/10

I've reposted the ISO info there and can try to help people out with it without adding noise to this thread.

Comment 50 Peter Williams 2025-05-31 19:56:41 UTC
Created attachment 2092408 [details]
"acpidump" output after the Dell XPS 9640 1.13.0 BIOS update

I've been undertaking some printf (well, printk) debugging of the crash. It seems to be happening when the ACPI module attempts to evaluate the method `\_SB.PC00.XHCI.RHUB.HS01._PLD` during early scanning of the ACPI namespace.

Comparing kernels that panic early (stock Fedora) and ones that can boot (OpenSUSE), both of them appear to do the same ACPI scan and call this method. But for whatever reason, on stock Fedora kernels it leads to the null pointer dereference, while on the bootable kernels it doesn't.

The method is supposed to return information about the Physical Location of Device (PLD) of the ACPI device `\_SB.PC00.XHCI.RHUB.HS01`. On my system, that device has PCI address 0000:00:14.0 (from the contents of /sys/bus/acpi/devices/device:13/path), which in turn corresponds to:

```
0000:00:14.0 USB controller: Intel Corporation Meteor Lake-P USB 3.2 Gen 2x1 xHCI Host Controller (rev 20)
```

However, I don't see anything indicative of why the error might be associated with this particular device. Its _PLD method invokes another method called `\_SB.UBTC.RUCC` which in turn invokes `\_SB.UBTC.TPLD` (and `\_SB.UBTC.FPMN`).

I'm attaching the output of `acpidump` here which can be used to explore the detailed ACPI tables. As best I can tell the above methods are doing fairly vanilla things, so it remains completely unclear to me why the different kernel configuration parameters would lead to such different outcomes.
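
For others who want to map an ACPI namespace path to its sysfs node the way described above, here is a minimal sketch that scans the `path` attributes under `/sys/bus/acpi/devices`. The helper names are hypothetical, and the normalization step assumes sysfs reports 4-character, underscore-padded names (for example `_SB_` where the path is usually written `_SB`):

```python
# Find the sysfs ACPI device node(s) whose "path" attribute matches a given
# ACPI namespace path. Both helper names are hypothetical.
import os


def normalize_acpi_path(path: str) -> str:
    """Strip trailing '_' padding from each name segment.
    Assumption: sysfs pads names to 4 characters, e.g. "_SB_"."""
    root = "\\" if path.startswith("\\") else ""
    segments = path.lstrip("\\").split(".")
    return root + ".".join(seg.rstrip("_") or seg for seg in segments)


def find_acpi_devices(acpi_path: str,
                      sysfs_root: str = "/sys/bus/acpi/devices") -> list:
    wanted = normalize_acpi_path(acpi_path)
    matches = []
    for entry in sorted(os.listdir(sysfs_root)):
        path_file = os.path.join(sysfs_root, entry, "path")
        try:
            with open(path_file) as f:
                found = f.read().strip()
        except OSError:
            continue  # skip entries without a readable "path" attribute
        if normalize_acpi_path(found) == wanted:
            matches.append(entry)
    return matches


if __name__ == "__main__":
    if os.path.isdir("/sys/bus/acpi/devices"):
        print(find_acpi_devices(r"\_SB.PC00.XHCI.RHUB.HS01"))
```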

Comment 51 Hans de Goede 2025-06-01 13:27:30 UTC
Peter, thank you for diving into this. I should get a loaner XPS 9640 from someone I know locally coming Tuesday and then I plan to dive into this. Your observation will hopefully make a good starting point ...

Comment 52 Hans de Goede 2025-06-04 13:46:49 UTC
I have access to a Dell XPS 9640 myself now and I've been experimenting with this and I'm able to reproduce the issue.

The XPS initially had BIOS 1.11.1 and all was fine. I did an acpidump there as well as with 1.12.0 after upgrading; comparing the two shows various changes related to the _PLD and _UPC methods on the USB and Thunderbolt controller Type-C ports.

And disabling both the _PLD and _UPC calls like this:

diff --git a/drivers/acpi/utils.c b/drivers/acpi/utils.c
index 526563a0d188..4b1c7d2ca0a1 100644
--- a/drivers/acpi/utils.c
+++ b/drivers/acpi/utils.c
@@ -501,6 +501,8 @@ acpi_get_physical_device_location(acpi_handle handle, struct acpi_pld_info **pld
 	struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL };
 	union acpi_object *output;
 
+	return false;
+
 	status = acpi_evaluate_object(handle, "_PLD", NULL, &buffer);
 	if (ACPI_FAILURE(status))
 		return false;
diff --git a/drivers/usb/core/usb-acpi.c b/drivers/usb/core/usb-acpi.c
index 935c0efea0b6..9dd02c78a2b5 100644
--- a/drivers/usb/core/usb-acpi.c
+++ b/drivers/usb/core/usb-acpi.c
@@ -217,6 +217,9 @@ usb_acpi_get_connect_type(struct usb_port *port_dev, acpi_handle *handle)
 		port_dev->location = USB_ACPI_LOCATION_VALID |
 			pld->group_token << 8 | pld->group_position;
 
+	port_dev->connect_type = USB_PORT_CONNECT_TYPE_UNKNOWN;
+	return;
+
 	status = acpi_evaluate_object(handle, "_UPC", NULL, &buffer);
 	if (ACPI_FAILURE(status))
 		goto out;

Results in a working kernel.

It seems that the problematic code gets triggered when the XHCI driver loads, which explains why the OpenSuse kernel config works some of the time. The OpenSuse kernel config has the XHCI driver as a module. Booting with the OpenSuse config can likely be made reliable by blacklisting the xhci-pci module there.

Note the above is not a proper fix, I'll be investigating this further as time permits.

Comment 53 Hans de Goede 2025-06-05 16:08:43 UTC
Another way of fixing things is using ACPI table overrides in the initrd from: https://www.kernel.org/doc/html/latest/admin-guide/acpi/initrd_table_override.html

This brings the SSDT15 and SSDT17 tables partly back to their state before the 1.12.0 BIOS update. I'll attach a diff for those interested. Using an extra (prepended) initrd with these ACPI table overrides, the original 6.14.0 kernel from the F42 Workstation live CD works fine.

As you can see in the patch for the ssdt15.dsl / ssdt17.dsl files, the only change being undone is an extra parameter being passed to the \_SB.UBTC.RUCC() method. The extra parameter comes from a bunch of new fields in the GNVS region, but that is not the problem: if I replace the fields with just a direct "Zero" constant, the kernel still crashes.

At first I was thinking this might be a kernel stack-overflow but I changed the kernel stack size from 16kb to 32kb and that did not help.
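
For those who want to try the prepended-initrd approach: per the kernel documentation linked above, the override tables (compiled `.aml` files) go under `kernel/firmware/acpi/` in an uncompressed cpio archive, and that archive has to come before the regular initrd. A minimal sketch of the concatenation step only, assuming the override cpio has already been created as that document describes; the function name and the file names passed to it are hypothetical:

```python
# Concatenate an uncompressed "early" cpio archive (holding
# kernel/firmware/acpi/*.aml) in front of an existing initrd image.
# prepend_acpi_override is a hypothetical helper name; creating the cpio
# archive itself is not covered here.
import shutil
import sys


def prepend_acpi_override(override_cpio: str, original_initrd: str,
                          output: str) -> None:
    with open(output, "wb") as out:
        # Order matters: the override archive must come first
        for part in (override_cpio, original_initrd):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


if __name__ == "__main__":
    # Usage: script.py <override.cpio> <original initrd> <output initrd>
    if len(sys.argv) == 4:
        prepend_acpi_override(sys.argv[1], sys.argv[2], sys.argv[3])
```

Writing to a new output file keeps the original initrd untouched, so the stock boot entry remains available as a fallback.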

To be continued ...

Comment 54 Hans de Goede 2025-06-05 16:10:06 UTC
Created attachment 2093123 [details]
XPS 9640 BIOS version 1.12.0 SSDT changes which workaround Fedora kernels not booting

Comment 55 tts26 2025-06-05 16:14:16 UTC
have we notified Dell about this?

Comment 56 Peter Williams 2025-06-05 17:56:48 UTC
I'm glad that you were able to extract the ACPI data from the 1.11.0 BIOS and check out the diff!

More and more I'm feeling like the dell-wmi-ddv stuff was a red herring. I don't feel like I'm seeing any consistent behavior changes when I blacklist it or not, after many more boot attempts to experiment with different kernel tweaks.

Here's a patch I've been using to turn on detailed ACPI debugging inside the method that's most often associated with the kernel panics:

```
diff --git a/drivers/acpi/acpica/nseval.c b/drivers/acpi/acpica/nseval.c
index 63748ac699f7..696385133d98 100644
--- a/drivers/acpi/acpica/nseval.c
+++ b/drivers/acpi/acpica/nseval.c
@@ -105,6 +105,12 @@ acpi_status acpi_ns_evaluate(struct acpi_evaluate_info *info)
 			      &info->full_pathname[1],
 			      acpi_ut_get_type_name(info->node->type)));
 
+	if (strcmp(info->full_pathname, "\\_SB.PC00.XHCI.RHUB.HS01._PLD") == 0) {
+		printk("PKGW ACPI ******* nseval start debug!!!\n\n\n\n");
+		acpi_dbg_layer = ACPI_PARSER;
+		acpi_dbg_level = 0xFFFFFFFF;
+	}
+
 	/* Count the number of arguments being passed in */
 
 	info->param_count = 0;
@@ -290,6 +296,13 @@ acpi_status acpi_ns_evaluate(struct acpi_evaluate_info *info)
 			  info->relative_pathname));
 
 cleanup:
+
+	if (strcmp(info->full_pathname, "\\_SB.PC00.XHCI.RHUB.HS01._PLD") == 0) {
+		printk("\n\n\n\n\n\n\nPKGW ACPI ******* nseval end debug!!!\n\n\n");
+		acpi_dbg_layer = 0;
+		acpi_dbg_level = 0;
+	}
+
 	/* Optional object evaluation log */
 
 	ACPI_DEBUG_PRINT_RAW((ACPI_DB_EVALUATION,
```

The resulting logging output shows that the fatal error is only occurring at the very end of the process of evaluating the _PLD function, right as the final return value is being returned up the call chain.

Beyond the ACPI logging output, I also often get the following report as the method evaluation is wrapping up:

```
BUG: KFENCE: use-after-free read in acpi_ps_parse_loop+0xb9/0x700
```

(I guess turning on the debug flags turns on some kind of extra memory allocation debugging layer?) There's a bunch of additional KFENCE output reporting the stack trace and where the memory in question was originally allocated and freed, but I believe that the details are not actually super relevant because to my eye they're indicating that an invalid pointer is being dereferenced and that the apparent use-after-free is downstream of that. Also, the specific details will sometimes vary from one boot to the next. (But I can type up the info if desired.)

A final suspicious thing that I see in the debug output is the following right at the start of the tracing, as the _PLD method transfers control into "RUCC":

```
pswalk-0030 ps_delete_parse_tree :  root <pointer>
  -MethodCall- <pointer>
    -NamePath- <pointer> \/♥_    # <=== yes, a heart symbol!
    One <pointer>
    ByteConst <point>
```

The heart symbol in the -NamePath- output is incredibly suspicious to me. It makes me think that *maybe* some kind of memory corruption happens right at the beginning of the evaluation of the _PLD ACPI method, but that it only results in the crash as the ACPI call stack unwinds as the method evaluation completes.

I can think of two huge caveats to all of this analysis though:

1) Naively, the stuff that I'm seeing suggests that there's some kind of memory management bug in the kernel ACPI code. But while that code is awfully spaghetti-y, it feels incredibly unlikely to me that this particular Dell BIOS update exposes a bug that has simply never been seen before.

2) Even if it did, I still don't understand how some kernel configurations might surface the bug, while others don't.

On that note, I've tried building test kernels using the OpenSUSE XHCI options in `kernel-local`:

```
CONFIG_USB_ROLES_INTEL_XHCI=m
CONFIG_USB_XHCI_HCD=m
CONFIG_USB_XHCI_PCI=m
CONFIG_USB_XHCI_PCI_RENESAS=m
CONFIG_USB_XHCI_PLATFORM=m
```

Unfortunately they still crash ~immediately on boot. I suspect that the overall diagnosis is correct, and it's just that some other setting is causing the XHCI module to be loaded very early in the boot process; but I haven't been able to find a combination of settings tweaks that avoids the issue (as opposed to adopting the OpenSUSE config wholesale).

Comment 57 Hans de Goede 2025-06-06 11:03:49 UTC
Peter,

thank you for your continued digging into this. Yes I agree that this seems to be some sort of heap or stack (I suspect stack...) corruption. May I ask how you are getting all this debug info?

For me, when the system hangs I do not get any output on the screen at all. And since this happens while USB is being brought up, using a USB-to-UART dongle for a serial console also seems like it will not be helpful?

Comment 58 Hans de Goede 2025-06-06 12:07:04 UTC
I agree that the dell-wmi-ddv thing is a red herring. Even with that blacklisted, I still see the occasional crash when the openSUSE kernel loads the xhci-pci module, and the backtrace starts with code calling _PLD / _UPC and ends deep inside the ACPICA code.

Comment 59 Peter Williams 2025-06-06 13:51:28 UTC
I've been using the patch above and the following boot options:

```
earlyprintk=efi earlycon=efifb console= no-hash-pointers
```

(And no rhgb or quiet) With those, I get lots of output to the laptop screen with no need for any serial console tomfoolery. The patch adds *very* detailed ACPI tracing for the execution of the _PLD method - the EFI console prints are sufficiently slow that it usually takes about 5 minutes of wall-clock time just to fully evaluate that one method! The main downside is that if I want to preserve the output I need to record it on my phone and type it up manually.

Comment 60 Peter Williams 2025-06-06 16:58:57 UTC
Hans:

(1) Forgot to mention that I use `nomodeset` above too

(2) Potential lead!? In my ACPI disassembly, I see that `ssdt21.dsl` lists the RUCC method as taking two arguments and calls it as such, whereas ssdt15.dsl and its definition in ssdt17.dsl both call for three arguments. In my tracing output (with ACPI_DISPATCHER added to acpi_dbg_layer) I think I see some output suggesting that some parts of the code are expecting two args, while others are expecting three. This feels like exactly the kind of thing that could lead to memory errors, and coincides nicely with your mention that the new firmware adds an extra argument to RUCC. Do you see the same thing?

I haven't done the initrd table override thing yet -- if it's convenient, maybe see whether updating ssdt21 to use three args *also* makes the problem go away?

Comment 61 Peter Williams 2025-06-06 23:02:58 UTC
I haven't tested this extensively yet, but at first blush it looks like the following patch gets me booting!

```
diff --git a/drivers/acpi/acpica/dsmethod.c b/drivers/acpi/acpica/dsmethod.c
index e809c2aed78a..a81de3265472 100644
--- a/drivers/acpi/acpica/dsmethod.c
+++ b/drivers/acpi/acpica/dsmethod.c
@@ -509,6 +509,17 @@ acpi_ds_call_control_method(struct acpi_thread_state *thread,
 	 */
 	this_walk_state->operands[this_walk_state->num_operands] = NULL;
 
+	if (this_walk_state->num_operands != obj_desc->method.param_count) {
+		printk(
+			"PKGW ACPI method exec `%4.4s` no=%d pc=%d\n",
+			method_node->name.ascii,
+			this_walk_state->num_operands,
+			obj_desc->method.param_count
+		);
+		status = AE_AML_UNINITIALIZED_ARG;
+		goto pop_walk_state;
+	}
+
 	/*
 	 * Allocate and initialize the evaluation information block
 	 * TBD: this is somewhat inefficient, should change interface to
@@ -539,7 +550,7 @@ acpi_ds_call_control_method(struct acpi_thread_state *thread,
 	 * Delete the operands on the previous walkstate operand stack
 	 * (they were copied to new objects)
 	 */
-	for (i = 0; i < obj_desc->method.param_count; i++) {
+	for (i = 0; i < this_walk_state->num_operands; i++) {
 		acpi_ut_remove_reference(this_walk_state->operands[i]);
 		this_walk_state->operands[i] = NULL;
 	}
```

Is this a correct fix? I don't have the expertise to say ...
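
To make the idea behind the patch easier to follow, here is a minimal user-space sketch of the same guard. It is not ACPICA code: `struct operand`, `struct walk_state`, `struct method`, `call_method()` and the status values are simplified, hypothetical stand-ins. The sketch assumes only what the patch above shows: the caller records how many operands it pushed, the method records how many parameters it declares, and a mismatch should fail the call instead of letting the cleanup loop touch slots that were never filled.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OPERANDS 8

/* Hypothetical, simplified stand-ins for the ACPICA structures. */
struct operand {
	int refcount;
};

struct walk_state {
	struct operand *operands[MAX_OPERANDS];
	unsigned int num_operands;	/* what the caller actually pushed */
};

struct method {
	unsigned int param_count;	/* what the method declaration expects */
};

enum call_status { CALL_OK = 0, CALL_ARG_MISMATCH = 1 };

/*
 * Refuse the call when caller and callee disagree on the argument count,
 * and release only the operands that were really pushed.
 */
enum call_status call_method(struct walk_state *ws, const struct method *m)
{
	unsigned int i;

	assert(ws != NULL && m != NULL);
	assert(ws->num_operands < MAX_OPERANDS);

	/* Same comparison as in the patch: bail out before touching anything. */
	if (ws->num_operands != m->param_count)
		return CALL_ARG_MISMATCH;

	/* ... the method body would run here ... */

	/* Bound the cleanup by the pushed count, not the declared count. */
	for (i = 0; i < ws->num_operands; i++) {
		ws->operands[i]->refcount--;
		ws->operands[i] = NULL;
	}
	ws->num_operands = 0;
	return CALL_OK;
}
```

In this model, a walk state holding two operands is rejected by a method declaring three parameters and both reference counts stay untouched, while a method declaring two parameters runs and releases exactly those two operands.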

As for the origin of the config-dependence: I'm an ACPI amateur but I *think* that the behavior of the relevant AML code varies depending on various settings that are influenced by things that go on outside of the ACPI stack (the "GNVS region" referenced by Hans above?). I suspect that different kernel configurations manage to change the order in which various USB and ACPI bits get initialized, so that the AML behavior changes enough to avoid the crash.

As for the black heart icon, I believe that `acpi_ps_delete_parse_tree()` has a bug in its debugging output, surfaced by:

```
diff --git a/drivers/acpi/acpica/pswalk.c b/drivers/acpi/acpica/pswalk.c
index d92817c72b8d..f0b790c728c7 100644
--- a/drivers/acpi/acpica/pswalk.c
+++ b/drivers/acpi/acpica/pswalk.c
@@ -57,8 +57,15 @@ void acpi_ps_delete_parse_tree(union acpi_parse_object *subtree_root)
                                               op);
 
                                if (op->named.aml_opcode == AML_INT_NAMEPATH_OP) {
-                                       acpi_os_printf("  %4.4s",
-                                                      op->common.value.string);
+                                       // no means ought to call acpi_ex_get_name_string?
+                                       char *pkgwtmp = "bad-output";
+
+                                       if (op->common.flags & ACPI_PARSEOP_IN_STACK) {
+                                               pkgwtmp = "ok-output";
+                                       }
+
+                                       acpi_os_printf("  %4.4s PKGW:%s",
+                                                      op->common.value.string, pkgwtmp);
                                }
                                if (op->named.aml_opcode == AML_STRING_OP) {
                                        acpi_os_printf("  %s",

```

... but I don't think this affects anything beyond the debug output.
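
One possible reading of the heart symbol, offered as a guess and not a confirmed diagnosis: in AML a NamePath is stored in encoded form, where `\` (0x5C) is the root prefix, `/` (0x2F) is the MultiNamePrefix, the next byte is the segment count, and the 4-character name segments follow. Printing the first four raw bytes with `%4.4s` would then show `\`, `/`, a byte 0x03 (drawn as a heart in code page 437 style console fonts) and `_`. The sketch below decodes such a string into dotted text. The function name `aml_name_to_text()` is hypothetical, and the byte sequence in the usage note assumes that the call site encodes the absolute path `\_SB.UBTC.RUCC`, which has not been verified against the tables.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define AML_ROOT_PREFIX   0x5C	/* '\' */
#define AML_PARENT_PREFIX 0x5E	/* '^' */
#define AML_DUAL_NAME     0x2E	/* '.', followed by two segments */
#define AML_MULTI_NAME    0x2F	/* '/', followed by a segment count */

/*
 * Decode an AML-encoded NameString into dotted text.
 * Segments are copied as stored, so padding underscores are kept.
 * Returns the text length, or -1 if the output buffer is too small.
 */
int aml_name_to_text(const unsigned char *aml, char *out, size_t out_len)
{
	size_t pos = 0;
	unsigned int segs = 1, i;

	assert(aml != NULL && out != NULL);

	/* Copy a root prefix or parent prefixes verbatim. */
	while (*aml == AML_ROOT_PREFIX || *aml == AML_PARENT_PREFIX) {
		if (pos + 1 >= out_len)
			return -1;
		out[pos++] = (char)*aml++;
	}

	if (*aml == AML_DUAL_NAME) {
		segs = 2;
		aml++;
	} else if (*aml == AML_MULTI_NAME) {
		segs = aml[1];		/* the byte that shows up as a heart */
		aml += 2;
	} else if (*aml == 0x00) {
		segs = 0;		/* NullName */
	}

	for (i = 0; i < segs; i++) {
		if (i > 0) {
			if (pos + 1 >= out_len)
				return -1;
			out[pos++] = '.';
		}
		if (pos + 4 >= out_len)
			return -1;
		memcpy(out + pos, aml, 4);
		pos += 4;
		aml += 4;
	}

	if (pos >= out_len)
		return -1;
	out[pos] = '\0';
	return (int)pos;
}
```

Under the stated assumption, the encoded bytes would be `5C 2F 03` followed by `_SB_`, `UBTC` and `RUCC`; the decoder turns them into `\_SB_.UBTC.RUCC`, whereas a raw `%4.4s` print stops after the leading underscore of the first segment.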

Comment 62 Hans de Goede 2025-06-07 15:57:34 UTC
Peter,

Great detective work there. I think that you're on to something. I missed the RUCC() call with only 2 arguments in ssdt21 because, for some reason, the first time I ran "iasl -d ssdt21.dat" I got a 0-byte ssdt21.dsl file, so my "grep RUCC *.dsl" did not find it.

I think it is time to file a bug report with your findings and your proposed patch with the upstream ACPICA project and see what they have to say:

https://github.com/acpica/acpica/issues

Comment 63 Peter Williams 2025-06-07 16:56:27 UTC
Filed as https://github.com/acpica/acpica/issues/1027

It's not clear to me how fixes in the ACPICA upstream are propagated into the actual kernel tree, so maybe a separate bug ought to be filed on bugzilla.kernel.org, and/or reported to the linux-acpi list?

Comment 64 kasunt 2025-06-09 09:36:37 UTC
I've been following this thread closely, but it's a bit confusing why the openSUSE kernel isn't affected by this issue and works consistently?

If I'm understanding correctly the code path Peter has found only gets triggered by the 12/13 bios changes. Is that right ?

Comment 65 Hans de Goede 2025-06-10 12:51:01 UTC
(In reply to kasunt from comment #64)
> Ive been following this thread closely but its a bit confusing why Opensuse
> kernel isnt affected by this issue and working consistently ?

It is some kind of memory corruption, either some overflow or a use-after-free or double-free or something like that. Whether these bugs actually cause an issue is somewhat of a position-of-the-moon thing. Different compiler versions / different options can cause these to trigger or not trigger. IOW the openSUSE kernels likely have the same root issue, but the memory corruption happens to hit a less important piece of memory there...

> If I'm understanding correctly the code path Peter has found only gets
> triggered by the 12/13 bios changes. Is that right ?

That is correct.

Comment 66 Tom "spot" Callaway 2025-06-10 17:20:28 UTC
The first patch in Comment 61 works well for me, I'm finally booting into a (patched) Fedora kernel again for the first time since _MARCH_. :)

Comment 67 Christopher Patrick 2025-06-10 17:36:51 UTC
Do you know when an ISO will be available with this patch so everyone can use Fedora again? Also has Dell been notified of the problem and solution?

Comment 68 Gee Rr 2025-06-10 19:18:31 UTC
(In reply to Christopher Patrick from comment #67)
> Also has Dell been notified of the problem and solution?

The Dell tech support escalation team mentioned that they can fix it only if the BIOS update affects either Windows or Ubuntu (for a select few desktops/laptops). Both Windows and Ubuntu boot fine, and so the Dell tech support team won't help.

Comment 69 Peter Williams 2025-06-11 02:26:21 UTC
I've updated the discussions thread with a link to a new ISO built with a "pkfix1" kernel that contains my patch above. It's been working for me reliably:

https://discussion.fedoraproject.org/t/cannot-boot-into-installation-media-or-installed-fedora-system-on-dell-xps-16-9640-after-bios-update/148548/11

Hopefully the Red Hat folks have some kind of backchannel to the right Dell engineers to flag all of this to them. I would be truly stunned if standard Dell support was any help, alas.

Comment 70 kasunt 2025-06-11 02:45:17 UTC
Credit to you for the deep analysis on this Peter. Thank you.

Comment 71 tts26 2025-06-11 03:18:17 UTC
Thank you Peter, I appreciate the work. Dell should know better.

Comment 72 Hans de Goede 2025-06-11 18:40:29 UTC
Status update:

* Fixing the bug in SSDT21 where it calls RUCC() with 2 arguments instead of the expected 3 using an initrd ACPI table override fixes things, no surprise there.
* But Linux should not crash on such a trivial ACPI table bug, instead it should just error out of the current evaluate() call, so I've been digging deeper into this. But I'm not familiar enough with the ACPICA code to really get anywhere.
* In the end I've built a kernel with KASAN memory access checking and that has found a use-after-free (or more likely a wrong / too early free) bug in the ACPICA code. See the github issue for a detailed backtrace.

For more details see:
https://github.com/acpica/acpica/issues/1027

Comment 73 Karl Hastings 2025-06-13 16:20:24 UTC
For the XPS 16 9640 I received the following from my Dell contacts:


[The issue] Related to SB.UBTC.RUCC was fixed in BIOS version 1.14.2.
They can try to use BIOS v1.14.2(already WPCO) to solve the SB.UBTC.RUCC issue.

Since XPS 16 9460 supports Windows only, the URL provided below directs to support.dell.com.

https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=kkk3x&oscode=w2021&productcode=xps-16-9640-laptop

Comment 74 tts26 2025-06-13 19:42:00 UTC
I can confirm that after the BIOS update, I can boot Void, Fedora, Kali and other Linux distributions without any changes to the latest ISO. Dell seems to have pushed out a BIOS update for devices affected, which also includes my Alienware laptop.

Comment 75 kasunt 2025-06-13 22:00:24 UTC
Can confirm on XPS I can boot to latest kernel F41 using the v1.14.2 bios update.

Comment 76 kasunt 2025-06-13 22:00:39 UTC
Can confirm on XPS I can boot to latest kernel F41 using the v1.14.2 bios update.

Comment 77 Christopher Patrick 2025-06-14 01:59:21 UTC
Alienware released a BIOS update for the m16r2 and it fixes the problem with the default Fedora 42 workstation ISO.

Comment 78 Hans de Goede 2025-06-16 09:26:12 UTC
Thank you for confirming that the latest Dell BIOS update fixes this.

I'm going to keep my loaner XPS 9640 at BIOS 1.12.0 and I'll continue looking into this, since even with the BIOS ACPI table bug Linux should simply error out on the ACPI method invocation and not crash.

I'm also going to keep this bug open for now until the Linux crash is resolved upstream.

Comment 79 Hans de Goede 2025-06-18 12:35:17 UTC
While working on fixing Linux so that it will boot properly even with the broken BIOS versions, I noticed that in the 1.12.0 BIOS not only is the ACPI RUCC() method called with 2 instead of the expected 3 arguments, but there is also a second problem in the SSDT21 BIOS ACPI table, where it calls TUPC() with too few arguments.

I would like to verify that the fixed BIOS fixes both cases, so that I can report the second problem to Dell if necessary.

Can someone who has upgraded their machine to the fixed BIOS please collect a dump of the ACPI tables with the new BIOS? :

sudo dnf install acpica-tools
sudo acpidump -o acpidump.txt

and then attach the generated acpidump.txt file here ?

(as mentioned I'm keeping the loaner XPS 9640 at BIOS 1.12.0 for now)

Comment 80 Hans de Goede 2025-06-18 12:44:15 UTC
A patch fixing this on the Linux side, making Linux work even with the broken 1.12.0 BIOS has been submitted upstream now and will be merged soon:

https://lore.kernel.org/linux-acpi/5909446.DvuYhMxLoT@rjwysocki.net/

closing this as being handled upstream.

Comment 81 Justin M. Forbes 2025-06-18 14:45:16 UTC
(In reply to Hans de Goede from comment #80)
> A patch fixing this on the Linux side, making Linux work even with the
> broken 1.12.0 BIOS has been submitted upstream now and will be merged soon:
> 
> https://lore.kernel.org/linux-acpi/5909446.DvuYhMxLoT@rjwysocki.net/
> 
> closing this as being handled upstream.

Thanks for that.  I have pulled that patch into 6.15 so it should go out with the next kernel build (6.15.3)

Comment 82 Hans de Goede 2025-06-19 08:58:21 UTC
(In reply to Justin M. Forbes from comment #81)
> Thanks for that.  I have pulled that patch into 6.15 so it should go out
> with the next kernel build (6.15.3)

Thank you for adding the patch to the Fedora kernels.

