Bug 2224839 - 6.4.4 nvidia driver 535.54.03 doesn't work after kernel update
Summary: 6.4.4 nvidia driver 535.54.03 doesn't work after kernel update
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 38
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-07-23 13:05 UTC by aligoldenhat
Modified: 2023-08-24 22:19 UTC
CC List: 18 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2023-08-24 22:19:13 UTC
Type: ---
Embargoed:


Description aligoldenhat 2023-07-23 13:05:57 UTC
I am using a Lenovo Legion 5 Pro 16ITH6 with an NVIDIA RTX 3050 Mobile GPU.
Upgrading from 6.3.12 to 6.4.4 caused the NVIDIA driver (nvidia-powerd.service) to fail at boot; it still works fine when I boot into 6.3.12.

lspci output on the 6.3.12 kernel:
00:00.0 Host bridge: Intel Corporation 11th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
00:01.0 PCI bridge: Intel Corporation 11th Gen Core Processor PCIe Controller #1 (rev 05)
00:02.0 VGA compatible controller: Intel Corporation TigerLake-H GT1 [UHD Graphics] (rev 01)
00:04.0 Signal processing controller: Intel Corporation TigerLake-LP Dynamic Tuning Processor Participant (rev 05)
00:06.0 PCI bridge: Intel Corporation 11th Gen Core Processor PCIe Controller #0 (rev 05)
00:07.0 PCI bridge: Intel Corporation Tiger Lake-H Thunderbolt 4 PCI Express Root Port #0 (rev 05)
00:07.2 PCI bridge: Intel Corporation Tiger Lake-H Thunderbolt 4 PCI Express Root Port #2 (rev 05)
00:0a.0 Signal processing controller: Intel Corporation Tigerlake Telemetry Aggregator Driver (rev 01)
00:0d.0 USB controller: Intel Corporation Tiger Lake-H Thunderbolt 4 USB Controller (rev 05)
00:0d.2 USB controller: Intel Corporation Tiger Lake-H Thunderbolt 4 NHI #0 (rev 05)
00:0d.3 USB controller: Intel Corporation Tiger Lake-H Thunderbolt 4 NHI #1 (rev 05)
00:14.0 USB controller: Intel Corporation Tiger Lake-H USB 3.2 Gen 2x1 xHCI Host Controller (rev 11)
00:14.2 RAM memory: Intel Corporation Tiger Lake-H Shared SRAM (rev 11)
00:14.3 Network controller: Intel Corporation Tiger Lake PCH CNVi WiFi (rev 11)
00:15.0 Serial bus controller: Intel Corporation Tiger Lake-H Serial IO I2C Controller #0 (rev 11)
00:15.1 Serial bus controller: Intel Corporation Tiger Lake-H Serial IO I2C Controller #1 (rev 11)
00:15.2 Serial bus controller: Intel Corporation Device 43ea (rev 11)
00:16.0 Communication controller: Intel Corporation Tiger Lake-H Management Engine Interface (rev 11)
00:17.0 SATA controller: Intel Corporation Tiger Lake SATA AHCI Controller (rev 11)
00:1d.0 PCI bridge: Intel Corporation Tiger Lake-H PCI Express Root Port #9 (rev 11)
00:1d.6 PCI bridge: Intel Corporation Device 43b6 (rev 11)
00:1f.0 ISA bridge: Intel Corporation Tiger Lake-H LPC/eSPI Controller (rev 11)
00:1f.3 Audio device: Intel Corporation Tiger Lake-H HD Audio Controller (rev 11)
00:1f.4 SMBus: Intel Corporation Tiger Lake-H SMBus Controller (rev 11)
00:1f.5 Serial bus controller: Intel Corporation Tiger Lake-H SPI Controller (rev 11)
01:00.0 VGA compatible controller: NVIDIA Corporation GA107BM [GeForce RTX 3050 Mobile] (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 2291 (rev a1)
02:00.0 Non-Volatile memory controller: SK hynix Gold P31/BC711/PC711 NVMe Solid State Drive
5c:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)




Reproducible: Always

Steps to Reproduce:
1. upgrade to 6.4.4
2. reboot

Comment 1 aligoldenhat 2023-07-23 13:10:41 UTC
Here is the lspci output, having booted into 6.4.4: it is identical, line for line, to the 6.3.12 output in the description above, including the GeForce RTX 3050 Mobile at 01:00.0.

And the "systemctl status nvidia-powerd.service" output, having booted into 6.4.4:
× nvidia-powerd.service - nvidia-powerd service
     Loaded: loaded (/usr/lib/systemd/system/nvidia-powerd.service; enabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
     Active: failed (Result: exit-code) since Sun 2023-07-23 16:36:48 +0330; 1min 43s ago
    Process: 1138 ExecStart=/usr/bin/nvidia-powerd (code=exited, status=1/FAILURE)
   Main PID: 1138 (code=exited, status=1/FAILURE)
        CPU: 8ms

Jul 23 16:36:48 fedora systemd[1]: Starting nvidia-powerd.service - nvidia-powerd service...
Jul 23 16:36:48 fedora /usr/bin/nvidia-powerd[1138]: nvidia-powerd version:1.0(build 1)
Jul 23 16:36:48 fedora /usr/bin/nvidia-powerd[1138]: Allocate client failed 38
Jul 23 16:36:48 fedora /usr/bin/nvidia-powerd[1138]: Failed to initialize RM Client
Jul 23 16:36:48 fedora systemd[1]: nvidia-powerd.service: Main process exited, code=exited, status=1/FAILURE
Jul 23 16:36:48 fedora systemd[1]: nvidia-powerd.service: Failed with result 'exit-code'.
Jul 23 16:36:48 fedora systemd[1]: Failed to start nvidia-powerd.service - nvidia-powerd service.

nvidia-smi output in 6.4.4:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
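Both failures above ("Failed to initialize RM Client" and the nvidia-smi message) point at user-space tools being unable to talk to the kernel driver. A common first check is whether the nvidia kernel modules are loaded at all for the running kernel; lsmod reads the same data from /proc/modules. The sketch below is a hypothetical helper, not part of any NVIDIA or Fedora tool, and it assumes the standard /proc/modules layout, where the first whitespace-separated field of each line is the module name.

```python
import os
from pathlib import Path


def loaded_nvidia_modules(proc_modules_text: str) -> list[str]:
    """Return the NVIDIA kernel modules listed in /proc/modules-formatted text.

    Each /proc/modules line starts with the module name, followed by its
    size, use count, dependencies, state and load address.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # Module names use underscores here, e.g. nvidia_drm, nvidia_modeset.
        if fields and (fields[0] == "nvidia" or fields[0].startswith("nvidia_")):
            names.append(fields[0])
    return sorted(names)


if __name__ == "__main__":
    try:
        text = Path("/proc/modules").read_text()
    except OSError:
        text = None  # not Linux, or /proc is unavailable
    if text is not None:
        modules = loaded_nvidia_modules(text)
        print("running kernel:", os.uname().release)
        print("loaded nvidia modules:", ", ".join(modules) or "none")
        if "nvidia" not in modules:
            print("the core nvidia module is not loaded for this kernel")
```

If the list is empty on 6.4.4 but not on 6.3.12, that would suggest the module failed to build or load for the new kernel, and the kernel log (`journalctl -k`) would be the next place to look.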

Comment 2 Garrett Mitchener 2023-08-02 14:54:02 UTC
I have a 2019 Acer Predator Helios 300 laptop with an NVIDIA GTX 1660 Ti.
With NVIDIA's drivers from RPM Fusion, video works properly on kernel 6.3.13.
Specifically, I can plug an external monitor into the HDMI port, and it and the laptop's built-in display are both fully functional.
I'm using version 535.86.05 of the NVIDIA drivers, which is a bit newer than the one in the original report.
After upgrading to kernel 6.4.6 and with these packages installed

nvidia-gpu-firmware-20230625-151.fc38.noarch
kmod-nvidia-6.3.11-200.fc38.x86_64-535.54.03-1.fc38.x86_64
xorg-x11-drv-nvidia-cuda-libs-535.86.05-1.fc38.x86_64
xorg-x11-drv-nvidia-libs-535.86.05-1.fc38.x86_64
xorg-x11-drv-nvidia-kmodsrc-535.86.05-1.fc38.x86_64
akmod-nvidia-535.86.05-1.fc38.x86_64
nvidia-settings-535.86.05-1.fc38.x86_64
xorg-x11-drv-nvidia-power-535.86.05-1.fc38.x86_64
xorg-x11-drv-nvidia-535.86.05-1.fc38.x86_64
nvidia-persistenced-535.86.05-1.fc38.x86_64
xorg-x11-drv-nvidia-cuda-535.86.05-1.fc38.x86_64
kmod-nvidia-6.3.12-200.fc38.x86_64-535.86.05-1.fc38.x86_64
kmod-nvidia-6.4.6-200.fc38.x86_64-535.86.05-1.fc38.x86_64

the laptop's screen is always blank after I select the new kernel in grub.
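The kmod package names in the list above follow the pattern kmod-nvidia-<kernel release>-<driver version>-<package release>, so they can be checked against the running kernel (`uname -r`) and the user-space driver version. A minimal sketch under that assumption: the naming pattern is inferred from this list rather than from RPM Fusion documentation, and the function names are hypothetical.

```python
import re
from collections.abc import Iterable

# Pattern inferred from the package list above, e.g.
#   kmod-nvidia-6.4.6-200.fc38.x86_64-535.86.05-1.fc38.x86_64
# Kernel release: "6.4.6-200.fc38.x86_64", driver version: "535.86.05".
KMOD_RE = re.compile(
    r"^kmod-nvidia-(?P<kernel>\d[^-]*-[^-]+)-(?P<driver>\d+(?:\.\d+)+)-(?P<pkgrel>[^-]+)$"
)


def kmods_by_kernel(package_names: Iterable[str]) -> dict[str, str]:
    """Map each kernel release to the driver version its kmod was built from.

    Names that do not match (akmod-nvidia, xorg-x11-drv-nvidia-*, ...) are
    ignored; if two kmods target the same kernel, the last one wins.
    """
    result: dict[str, str] = {}
    for name in package_names:
        match = KMOD_RE.match(name.strip())
        if match:
            result[match["kernel"]] = match["driver"]
    return result


def check_kernel(package_names: Iterable[str], kernel_release: str, userspace_driver: str) -> str:
    """Describe whether a kmod exists for kernel_release and matches user space."""
    built = kmods_by_kernel(package_names).get(kernel_release)
    if built is None:
        return f"no kmod-nvidia package for {kernel_release}"
    if built != userspace_driver:
        return (f"kmod for {kernel_release} was built from {built}, "
                f"but the user-space driver is {userspace_driver}")
    return f"kmod for {kernel_release} matches user-space driver {userspace_driver}"
```

Applied to this list, it finds a kmod built from 535.86.05 for 6.4.6-200.fc38.x86_64, matching the 535.86.05 user-space packages, while the leftover kmod for 6.3.11 was built from 535.54.03.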

I can plug an external monitor into the laptop's HDMI port, and that works.
However, the laptop's built-in screen isn't recognized by GNOME or xrandr.

The kernel command line from /etc/default/grub is:

pci=noaer rd.driver.blacklist=nouveau modprobe.blacklist=nouveau resume=UUID=733ad07d-7906-4fd7-baff-e0e24f52c9f6 rhgb quiet
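/proc/cmdline holds the command line the running kernel was actually booted with, which can differ from what /etc/default/grub says. A minimal sketch (hypothetical helper names; it assumes no quoted values containing spaces) that parses such a line and checks the NVIDIA-related settings that appear in this bug: the nouveau blacklist above and the nvidia-drm.modeset=1 parameter tried in Comment 4.

```python
from pathlib import Path


def parse_cmdline(cmdline: str) -> dict[str, list[str]]:
    """Map each parameter name to the values it was given, in order.

    Assumes no quoted values containing spaces. Bare flags such as "quiet"
    are recorded with an empty-string value. Only the first "=" splits the
    token, so resume=UUID=... keeps "UUID=..." as its value.
    """
    params: dict[str, list[str]] = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params.setdefault(key, []).append(value)
    return params


def blacklisted_modules(params: dict[str, list[str]]) -> set[str]:
    """Collect module names from every modprobe.blacklist=a,b,... entry.

    rd.driver.blacklist (read by dracut in the initramfs) is not inspected.
    """
    modules: set[str] = set()
    for value in params.get("modprobe.blacklist", []):
        modules.update(name for name in value.split(",") if name)
    return modules


def nvidia_modeset_enabled(params: dict[str, list[str]]) -> bool:
    """True if any nvidia-drm.modeset or nvidia_drm.modeset entry equals "1"."""
    values = params.get("nvidia-drm.modeset", []) + params.get("nvidia_drm.modeset", [])
    return "1" in values


if __name__ == "__main__":
    try:
        params = parse_cmdline(Path("/proc/cmdline").read_text())
    except OSError:
        params = None  # not Linux, or /proc is unavailable
    if params is not None:
        print("blacklisted:", sorted(blacklisted_modules(params)))
        print("nvidia-drm.modeset=1:", nvidia_modeset_enabled(params))
```

For the command line above, blacklisted_modules returns {'nouveau'} and nvidia_modeset_enabled returns False.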


Here's the result of lspci:

00:00.0 Host bridge: Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 07)
00:02.0 VGA compatible controller: Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630]
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 07)
00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model
00:12.0 Signal processing controller: Intel Corporation Cannon Lake PCH Thermal Controller (rev 10)
00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)
00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)
00:14.3 Network controller: Intel Corporation Cannon Lake PCH CNVi WiFi (rev 10)
00:15.0 Serial bus controller: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 (rev 10)
00:15.1 Serial bus controller: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1 (rev 10)
00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)
00:17.0 SATA controller: Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller (rev 10)
00:1b.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #21 (rev f0)
00:1d.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 (rev f0)
00:1d.4 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #13 (rev f0)
00:1f.0 ISA bridge: Intel Corporation HM470 Chipset LPC/eSPI Controller (rev 10)
00:1f.3 Audio device: Intel Corporation Cannon Lake PCH cAVS (rev 10)
00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)
00:1f.5 Serial bus controller: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)
01:00.0 VGA compatible controller: NVIDIA Corporation TU116M [GeForce GTX 1660 Ti Mobile] (rev a1)
01:00.1 Audio device: NVIDIA Corporation TU116 High Definition Audio Controller (rev a1)
01:00.2 USB controller: NVIDIA Corporation TU116 USB 3.1 Host Controller (rev a1)
01:00.3 Serial bus controller: NVIDIA Corporation TU116 USB Type-C UCSI Controller (rev a1)
06:00.0 Non-Volatile memory controller: SK hynix BC501 NVMe Solid State Drive
07:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
08:00.0 Ethernet controller: Qualcomm Atheros Killer E2500 Gigabit Ethernet Controller (rev 10)

Comment 3 Garrett Mitchener 2023-08-02 15:04:00 UTC
I just tried kernel 6.4.7 from updates-testing.
It behaves just like 6.4.6: the built-in display is not recognized, and the external monitor works fine.

Comment 4 Garrett Mitchener 2023-08-02 15:53:28 UTC
I tried adding initcall_blacklist=simpledrm_platform_driver_init and nvidia-drm.modeset=1 to the kernel command line.
Neither improved anything.

I'm also seeing a secondary symptom: the other virtual consoles, which should be accessible with Ctrl+Alt+F4 etc., are not working under kernel 6.4.7.
They work under 6.3.12.

Further discussion here:

https://forums.developer.nvidia.com/t/fedora-38-nvidia-driver-530-41-03/255724

https://forums.developer.nvidia.com/t/nvidia-driver-isnt-compatible-with-simpledrm-so-boot-output-and-ttys-are-blank/238007

Comment 5 Garrett Mitchener 2023-08-02 15:56:31 UTC
Also related:

https://bugzilla.redhat.com/show_bug.cgi?id=2071209

Comment 6 Garrett Mitchener 2023-08-24 20:08:31 UTC
I've tried kernel versions 6.4.10 and 6.4.11.  The behavior is the same with those as well.

Comment 7 Justin M. Forbes 2023-08-24 22:19:13 UTC
This is not a kernel bug; this is a bug in NVIDIA's driver. It is closed source, and I have no ability to do anything with it. Based on feedback I have received, though, the NVIDIA drivers in RPM Fusion are working for several people. Perhaps the driver needs an update?

