Bug 1716138 - Direct firmware load for amdgpu/raven_gpu_info.bin failed with error -2
Status: CLOSED EOL
Product: Fedora
Classification: Fedora
Component: xorg-x11-drv-amdgpu
Version: 30
Hardware: x86_64
OS: Linux
Assignee: Christopher Atherton
QA Contact: Fedora Extras Quality Assurance
Reported: 2019-06-02 08:57 UTC by gobbledegeek
Modified: 2020-05-26 14:38 UTC

Last Closed: 2020-05-26 14:38:03 UTC
Type: Bug


Attachments
journalctl -ab-1 for 5.1.5-300.fc30.x86_64 boot fail (42.66 KB, application/gzip)
2019-06-02 08:57 UTC, gobbledegeek

Description gobbledegeek 2019-06-02 08:57:07 UTC
Created attachment 1576228
journalctl -ab-1 for 5.1.5-300.fc30.x86_64 boot fail

Description of problem: 

Jun 02 12:40:04 fowl kernel: fb0: switching to amdgpudrmfb from EFI VGA
Jun 02 12:40:04 fowl kernel: amdgpu 0000:08:00.0: vgaarb: deactivate vga console
Jun 02 12:40:04 fowl kernel: amdgpu 0000:08:00.0: enabling device (0006 -> 0007)
Jun 02 12:40:04 fowl kernel: [drm] initializing kernel modesetting (RAVEN 0x1002:0x15DD 0x1002:0x15DD 0xC6).
Jun 02 12:40:04 fowl kernel: [drm] register mmio base: 0xFCD00000
Jun 02 12:40:04 fowl kernel: [drm] register mmio size: 524288
Jun 02 12:40:04 fowl kernel: [drm] add ip block number 0 <soc15_common>
Jun 02 12:40:04 fowl kernel: [drm] add ip block number 1 <gmc_v9_0>
Jun 02 12:40:04 fowl kernel: [drm] add ip block number 2 <vega10_ih>
Jun 02 12:40:04 fowl kernel: [drm] add ip block number 3 <psp>
Jun 02 12:40:04 fowl kernel: [drm] add ip block number 4 <gfx_v9_0>
Jun 02 12:40:04 fowl kernel: [drm] add ip block number 5 <sdma_v4_0>
Jun 02 12:40:04 fowl kernel: [drm] add ip block number 6 <powerplay>
Jun 02 12:40:04 fowl kernel: [drm] add ip block number 7 <dm>
Jun 02 12:40:04 fowl kernel: [drm] add ip block number 8 <vcn_v1_0>
Jun 02 12:40:04 fowl kernel: amdgpu 0000:08:00.0: Direct firmware load for amdgpu/raven_gpu_info.bin failed with error -2
Jun 02 12:40:04 fowl kernel: amdgpu 0000:08:00.0: Failed to load gpu_info firmware "amdgpu/raven_gpu_info.bin"
Jun 02 12:40:04 fowl kernel: amdgpu 0000:08:00.0: Fatal error during GPU init
Jun 02 12:40:04 fowl kernel: [drm] amdgpu: finishing device.
Jun 02 12:40:04 fowl kernel: amdgpu: probe of 0000:08:00.0 failed with error -2


Version-Release number of selected component (if applicable): amdgpu-5.1.5-300.fc30.x86_64


How reproducible: After a kernel update to 5.1.5-300.fc30.x86_64, boot fails with a blank screen. Rebooting into the previous kernel, 5.0.17-300.fc30.x86_64, and checking the journal for the failed boot shows the errors above.


Actual results: Blank screen. Cannot log in because the amdgpu driver fails to load.


Expected results: Normal boot


Additional info: The file named in the error exists in the filesystem but fails to load.

# diff -y amdgpu-5.0.17-300.fc30.x86_64.conf amdgpu-5.1.5-300.fc30.x86_64.conf 
add_drivers+=" amdgpu"						add_drivers+=" amdgpu"
fw_dir+="/lib/firmware/5.0.17-300.fc30.x86_64"		      |	fw_dir+="/lib/firmware/5.1.5-300.fc30.x86_64"

ls -l /lib/firmware/5.1.5-300.fc30.x86_64/amdgpu/raven_gpu_info.bin 
-rw-r--r--. 1 root root 316 Apr  8 12:42 /lib/firmware/5.1.5-300.fc30.x86_64/amdgpu/raven_gpu_info.bin

ls -l /lib/firmware/amdgpu/raven_gpu_info.bin 
-rw-r--r--. 1 root root 316 May 15 02:17 /lib/firmware/amdgpu/raven_gpu_info.bin

ls -l /usr/lib/firmware/amdgpu/raven_gpu_info.bin 
-rw-r--r--. 1 root root 316 May 15 02:17 /usr/lib/firmware/amdgpu/raven_gpu_info.bin

ls  /usr/lib/firmware/5.1.5-300.fc30.x86_64/amdgpu/raven_gpu_info.bin -l
-rw-r--r--. 1 root root 316 Apr  8 12:42 /usr/lib/firmware/5.1.5-300.fc30.x86_64/amdgpu/raven_gpu_info.bin
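
For context, error -2 is the kernel's ENOENT (no such file or directory). When amdgpu is loaded from the initramfs, the firmware lookup happens inside the initramfs image, so a file that exists on the root filesystem can still be reported as missing, which would be consistent with the listings above. A minimal shell sketch for pulling the missing firmware names out of a kernel log; the function name missing_firmware is hypothetical, and the pattern assumes the message wording shown in this report.

```shell
# Print each firmware path named in a "Direct firmware load ... failed
# with error -2" kernel message read from stdin, de-duplicated.
# missing_firmware is a hypothetical helper, not part of any package.
missing_firmware() {
    sed -n 's/.*Direct firmware load for \([^ ]*\) failed with error -2.*/\1/p' | sort -u
}
```

It could be fed from the journal of the failed boot, for example: journalctl -k -b -1 | missing_firmware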

Comment 1 gobbledegeek 2019-06-03 07:08:32 UTC
As an aside: why are these firmware files installed in four different places without symlinks? (Counting "/lib/firmware/amdgpu/" and the previous kernel's path, that makes four locations.) I saw that "/usr/lib/firmware/5.1.5-300.fc30.x86_64/" had older files dated April, so I copied over the latest May firmware files from "/lib/firmware/amdgpu/", but that had no effect. That location was not updated even when I reinstalled linux-firmware.

G

Comment 2 gobbledegeek 2019-06-03 15:50:39 UTC
I ran dracut --force to rebuild the initramfs, and now even the 5.0.17-300 kernel will not load amdgpu. Luckily I had a custom 5.0.17-300 GRUB menu entry set up as a last-known-good boot option, for which I had copied the /boot/ files including the initramfs, so I am at least able to boot Fedora.

Searching around, I see this issue has come and gone since 2017, but it is not clear to me why the error happens in this instance even with the files present in the */firmware/* folders.

G

Comment 3 gobbledegeek 2019-06-03 16:12:44 UTC
I fixed the problem with this command:
# dracut --force --kver 5.1.5-300.fc30.x86_64 --install "/lib/firmware/amdgpu/*"
The system now boots 5.1.5-300.fc30.x86_64 #1 SMP Sat May 25 18:00:11 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Thanks for your help on this.

G

Comment 4 gobbledegeek 2019-06-06 06:23:55 UTC
Hi Folks
The problem came back with the next kernel update. 

Jun 06 11:37:39 fowl kernel: amdgpu 0000:08:00.0: Direct firmware load for amdgpu/raven_gpu_info.bin failed with error -2
Jun 06 11:37:39 fowl kernel: amdgpu 0000:08:00.0: Failed to load gpu_info firmware "amdgpu/raven_gpu_info.bin"
Jun 06 11:37:39 fowl kernel: amdgpu 0000:08:00.0: Fatal error during GPU init

Running 'lsinitrd -k 5.1.6-300.fc30.x86_64 | grep amdgpu' did not show any Raven Ridge files.
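
The check above can be wrapped so that it tests for one specific file instead of scanning grep output by eye. A minimal sketch, assuming the lsinitrd listing is piped in on stdin; listing_has_firmware is a hypothetical helper name.

```shell
# Exit 0 if the listing on stdin mentions the given firmware path, else
# exit non-zero. The match is a fixed string, so a compressed variant of
# the file name (for example with an .xz suffix) also counts as present.
# listing_has_firmware is a hypothetical helper, not part of dracut.
listing_has_firmware() {
    grep -qF -- "$1"
}
```

For example: lsinitrd -k 5.1.6-300.fc30.x86_64 | listing_has_firmware amdgpu/raven_gpu_info.bin && echo present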

Adding amdgpu to dracut.conf and rebuilding with dracut does not help either.

cat /etc/dracut.conf
# PUT YOUR CONFIG IN separate files
# in /etc/dracut.conf.d named "<name>.conf"
# SEE man dracut.conf(5) for options
#add_dracutmodules+=" amdgpu "
add_modules+=" amdgpu "
add_drivers+=" amdgpu "
omit_dracutmodules+=" dmraid "
install_items+=" /lib/firmware/edid/BenQ2710Q.bin /lib/firmware/edid/BenQ2711U.bin "


I had to manually run 'dracut --force --kver 5.1.6-300.fc30.x86_64 --install "/lib/firmware/amdgpu/raven*"' before I could boot to the login screen.

Is something broken in the toolchain, or is it something specific to my setup? What else can I check or fix to ensure the next Fedora kernel update builds the initrd with the amdgpu firmware, without a manual rebuild like the one above?

Thanks
G

Comment 5 rhbug 2019-06-23 10:49:14 UTC
I have been having to manually run dracut --regenerate-all --force --install=/lib/firmware/amdgpu/* with every kernel update.
After some digging, the problem was apparently an old dkms configuration left over from a previous ROCm install. Since I wasn't using ROCm anymore, I deleted the files from /var/lib/dkms/amdgpu, and it stopped generating bad /etc/dracut.conf.d/amdgpu-KERNEL-VERSION.conf files.

Comment 6 gobbledegeek 2019-06-24 15:58:20 UTC
Hi
You are right, yes: I had ROCm installed for a week before the problem started. But uninstalling all ROCm packages a few weeks back did not fix it. My /var/lib/dkms does not have an amdgpu folder. Instead, I have now worked around the problem by adding this line to dracut.conf:
install_items+=" /lib/firmware/amdgpu/raven* "

G

Comment 7 gobbledegeek 2019-06-24 15:58:54 UTC
Hi
You are right, yes: I had ROCm installed for a week before the problem started. But uninstalling all ROCm packages a few weeks back did not fix it. My /var/lib/dkms does not have an amdgpu folder. Instead, I worked around the problem by adding this line to dracut.conf:
install_items+=" /lib/firmware/amdgpu/raven* "

G

Comment 8 rhbug 2019-06-24 22:35:31 UTC
My observation is that if there are no amdgpu-*.conf files in the /etc/dracut.conf.d directory, dracut does "the right thing" and includes the /usr/lib/firmware/MODULENAME files.

Your workaround would just make the initrd slightly smaller by including only the raven* files instead of all AMD/ATI firmware files.

Using lsinitrd /boot/initramfs-KERNEL-VERSION helped me debug this problem immensely.

In the end, an empty /etc/dracut.conf file and empty /etc/dracut.conf.d/ fixed my issues.
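
To see which drop-in files might be involved before deleting anything, one option is to list those that set fw_dir, as the amdgpu-*.conf files in the description do. A minimal sketch; confs_setting_fw_dir is a hypothetical helper name, and whether fw_dir is the actual culprit is not established in this thread.

```shell
# Print the *.conf files in a dracut drop-in directory that assign fw_dir.
# The directory defaults to /etc/dracut.conf.d; pass another path to
# inspect a copy. Prints nothing if no file matches.
# confs_setting_fw_dir is a hypothetical helper, not part of dracut.
confs_setting_fw_dir() {
    dir=${1:-/etc/dracut.conf.d}
    grep -l '^[[:space:]]*fw_dir' "$dir"/*.conf 2>/dev/null
}
```

Any file it prints can then be reviewed by hand and compared against lsinitrd output after a rebuild.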

Comment 9 gobbledegeek 2019-07-13 08:29:47 UTC
I deleted the files in /etc/dracut.conf.d as described in the comment above and also removed the entries in dracut.conf. The system has since passed two kernel upgrades successfully. Thanks for your help.

G

Comment 10 Sylvain Réault 2019-10-28 18:10:10 UTC
I can reproduce this bug on Fedora 31.


1. Please describe the problem:

My Radeon RX 5700 XT produces a black screen because the firmware amdgpu/navi10_gpu does not load at boot.

2. What is the Version-Release number of the kernel:
kernel : 5.3.7



3. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:
It happens at every boot.



4. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Oct 28 19:50:12 zeus5 kernel: amdgpu 0000:28:00.0: Direct firmware load for amdgpu/navi10_gpu_info.bin failed with error -2
Oct 28 19:50:12 zeus5 kernel: amdgpu 0000:28:00.0: Failed to load gpu_info firmware "amdgpu/navi10_gpu_info.bin"
Oct 28 19:50:12 zeus5 kernel: amdgpu 0000:28:00.0: Fatal error during GPU init
Oct 28 19:50:12 zeus5 kernel: [drm] amdgpu: finishing device.
Oct 28 19:50:12 zeus5 kernel: ------------[ cut here ]------------
Oct 28 19:50:12 zeus5 kernel: sysfs group 'fw_version' not found for kobject '0000:28:00.0'
Oct 28 19:50:12 zeus5 kernel: WARNING: CPU: 2 PID: 490 at fs/sysfs/group.c:278 sysfs_remove_group+0x74/0x80
Oct 28 19:50:12 zeus5 kernel: Modules linked in: fjes(-) amdgpu(+) hid_logitech_hidpp(+) amd_iommu_v2 gpu_sched ttm drm_kms_helper drm igb crc32c_intel dca i2c_algo_bit nvme hid_logitech nvme_core ff_memless hid_logitech_dj wmi pinctrl_amd fuse i2c_dev
Oct 28 19:50:12 zeus5 kernel: CPU: 2 PID: 490 Comm: systemd-udevd Not tainted 5.3.7-301.fc31.x86_64 #1
Oct 28 19:50:12 zeus5 kernel: Hardware name: Micro-Star International Co., Ltd. MS-7B78/X470 GAMING PRO CARBON (MS-7B78), BIOS 2.A0 07/27/2019
Oct 28 19:50:12 zeus5 kernel: RIP: 0010:sysfs_remove_group+0x74/0x80
Oct 28 19:50:12 zeus5 kernel: Code: ff 5b 48 89 ef 5d 41 5c e9 29 bc ff ff 48 89 ef e8 41 b9 ff ff eb cc 49 8b 14 24 48 8b 33 48 c7 c7 78 ca 15 94 e8 0a 4d d4 ff <0f> 0b 5b 5d 41 5c c3 0f 1f 44 00 00 0f 1f 44 00 00 48 85 f6 74 31
Oct 28 19:50:12 zeus5 kernel: RSP: 0018:ffffba7000b079f8 EFLAGS: 00010282
Oct 28 19:50:12 zeus5 kernel: RAX: 0000000000000000 RBX: ffffffffc096abc0 RCX: 0000000000000006
Oct 28 19:50:12 zeus5 kernel: RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff91157e697900
Oct 28 19:50:12 zeus5 kernel: RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000449
Oct 28 19:50:12 zeus5 kernel: R10: 00000000000173a8 R11: 0000000000000003 R12: ffff911579f540b0
Oct 28 19:50:12 zeus5 kernel: R13: ffff91156b090018 R14: ffff91156d6ee4a0 R15: 0000000000000000
Oct 28 19:50:12 zeus5 kernel: FS:  00007f25015df940(0000) GS:ffff91157e680000(0000) knlGS:0000000000000000
Oct 28 19:50:12 zeus5 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 28 19:50:12 zeus5 kernel: CR2: 00005642af3dedc8 CR3: 00000007ef7e0000 CR4: 00000000003406e0
Oct 28 19:50:12 zeus5 kernel: Call Trace:
Oct 28 19:50:12 zeus5 kernel: amdgpu_device_fini+0x441/0x475 [amdgpu]
Oct 28 19:50:12 zeus5 kernel: amdgpu_driver_unload_kms+0x4a/0x90 [amdgpu]
Oct 28 19:50:12 zeus5 kernel: amdgpu_driver_load_kms.cold+0x8f/0xb1 [amdgpu]
Oct 28 19:50:12 zeus5 kernel: drm_dev_register+0x111/0x150 [drm]
Oct 28 19:50:12 zeus5 kernel: amdgpu_pci_probe+0xbd/0x120 [amdgpu]
Oct 28 19:50:12 zeus5 kernel: ? __pm_runtime_resume+0x58/0x80
Oct 28 19:50:12 zeus5 kernel: local_pci_probe+0x42/0x80
Oct 28 19:50:12 zeus5 kernel: pci_device_probe+0x107/0x1a0
Oct 28 19:50:12 zeus5 kernel: really_probe+0xf0/0x380
Oct 28 19:50:12 zeus5 kernel: driver_probe_device+0x59/0xd0
Oct 28 19:50:12 zeus5 kernel: device_driver_attach+0x53/0x60
Oct 28 19:50:12 zeus5 kernel: __driver_attach+0x8a/0x150
Oct 28 19:50:12 zeus5 kernel: ? device_driver_attach+0x60/0x60
Oct 28 19:50:12 zeus5 kernel: bus_for_each_dev+0x78/0xc0
Oct 28 19:50:12 zeus5 kernel: bus_add_driver+0x14a/0x1e0
Oct 28 19:50:12 zeus5 kernel: driver_register+0x6c/0xb0
Oct 28 19:50:12 zeus5 kernel: ? 0xffffffffc0ba1000
Oct 28 19:50:12 zeus5 kernel: do_one_initcall+0x46/0x1f4
Oct 28 19:50:12 zeus5 kernel: ? _cond_resched+0x15/0x30
Oct 28 19:50:12 zeus5 kernel: ? kmem_cache_alloc_trace+0x162/0x220
Oct 28 19:50:12 zeus5 kernel: ? do_init_module+0x23/0x230
Oct 28 19:50:12 zeus5 kernel: do_init_module+0x5c/0x230
Oct 28 19:50:12 zeus5 kernel: load_module+0x27b1/0x2990
Oct 28 19:50:12 zeus5 kernel: ? __do_sys_init_module+0x16e/0x1a0
Oct 28 19:50:12 zeus5 kernel: ? _cond_resched+0x15/0x30
Oct 28 19:50:12 zeus5 kernel: __do_sys_init_module+0x16e/0x1a0
Oct 28 19:50:12 zeus5 kernel: do_syscall_64+0x5f/0x1a0
Oct 28 19:50:12 zeus5 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 28 19:50:12 zeus5 kernel: RIP: 0033:0x7f250263509e
Oct 28 19:50:12 zeus5 kernel: Code: 48 8b 0d ed fd 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ba fd 0b 00 f7 d8 64 89 01 48
Oct 28 19:50:12 zeus5 kernel: RSP: 002b:00007ffdd96c0f98 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
Oct 28 19:50:12 zeus5 kernel: RAX: ffffffffffffffda RBX: 00005642ae2db490 RCX: 00007f250263509e
Oct 28 19:50:12 zeus5 kernel: RDX: 00007f250225684d RSI: 00000000008435ce RDI: 00005642aeb9b7f0
Oct 28 19:50:12 zeus5 kernel: RBP: 00005642aeb9b7f0 R08: 0000000000000006 R09: 00007ffdd96c040e
Oct 28 19:50:12 zeus5 kernel: R10: 0000000000000007 R11: 0000000000000246 R12: 00007f250225684d
Oct 28 19:50:12 zeus5 kernel: R13: 0000000000000007 R14: 00005642ae2e5bd0 R15: 00005642ae2db490
Oct 28 19:50:12 zeus5 kernel: ---[ end trace 694806b803847eca ]---

Comment 11 Ben Cotton 2020-04-30 20:50:10 UTC
This message is a reminder that Fedora 30 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 30 on 2020-05-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '30'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 30 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 12 Ben Cotton 2020-05-26 14:38:03 UTC
Fedora 30 changed to end-of-life (EOL) status on 2020-05-26. Fedora 30 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

