1. Please describe the problem:
I updated the kernel to 6.6.2-201.fc39. Since then, DaVinci Resolve crashes at startup. I am using AMDGPU and ROCm.

2. What is the Version-Release number of the kernel:
Linux fedora 6.6.2-201.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Nov 22 21:31:42 UTC 2023 x86_64 GNU/Linux

3. Did it work previously in Fedora? If so, what kernel version did the issue *first* appear? Old kernels are available for download at https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
With kernel 6.5.12-300.fc39, DaVinci Resolve works fine.

4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below:
To start DaVinci Resolve, install the Linux package from the website: https://www.blackmagicdesign.com/products/davinciresolve
Also install rocm-opencl (for AMDGPU). Then some "magic" is needed to get it to run:

    $ LD_PRELOAD="/usr/lib64/libglib-2.0.so.0 /usr/lib64/libgio-2.0.so.0 /usr/lib64/libgmodule-2.0.so.0" /opt/resolve/bin/resolve

With kernel 6.5.x it starts; with kernel 6.6 it crashes.

5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``:

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
lsmod
Module Size Used by
uinput 20480 0
rfcomm 102400 16
snd_seq_dummy 12288 0
snd_hrtimer 12288 1
nf_conntrack_netbios_ns 12288 1
nf_conntrack_broadcast 12288 1 nf_conntrack_netbios_ns
nft_fib_inet 12288 1
nft_fib_ipv4 12288 1 nft_fib_inet
nft_fib_ipv6 12288 1 nft_fib_inet
nft_fib 12288 3 nft_fib_ipv6,nft_fib_ipv4,nft_fib_inet
nft_reject_inet 12288 16
nf_reject_ipv4 16384 1 nft_reject_inet
nf_reject_ipv6 20480 1 nft_reject_inet
nft_reject 12288 1 nft_reject_inet
nft_ct 24576 8
nft_chain_nat 12288 3
nf_nat 65536 1 nft_chain_nat
nf_conntrack 200704 4 nf_nat,nft_ct,nf_conntrack_netbios_ns,nf_conntrack_broadcast
nf_defrag_ipv6 24576 1 nf_conntrack
nf_defrag_ipv4 12288 1 nf_conntrack
ip_set 65536 0
nf_tables 368640 418 nft_ct,nft_reject_inet,nft_fib_ipv6,nft_fib_ipv4,nft_chain_nat,nft_reject,nft_fib,nft_fib_inet
nfnetlink 20480 3 nf_tables,ip_set
qrtr 57344 4
bnep 36864 2
sunrpc 888832 1
binfmt_misc 28672 1
iwlmvm 696320 0
vfat 20480 1
mac80211 1572864 1 iwlmvm
fat 106496 1 vfat
intel_rapl_msr 20480 0
intel_rapl_common 40960 1 intel_rapl_msr
edac_mce_amd 53248 0
snd_hda_codec_hdmi 94208 2
kvm_amd 204800 0
snd_hda_intel 65536 5
libarc4 12288 1 mac80211
snd_intel_dspcfg 40960 1 snd_hda_intel
snd_usb_audio 462848 8
snd_intel_sdw_acpi 16384 1 snd_intel_dspcfg
kvm 1372160 1 kvm_amd
iwlwifi 471040 1 iwlmvm
snd_hda_codec 225280 2 snd_hda_codec_hdmi,snd_hda_intel
btusb 86016 0
btrtl 32768 1 btusb
snd_usbmidi_lib 49152 1 snd_usb_audio
snd_ump 36864 1 snd_usb_audio
snd_hda_core 151552 3 snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec
btintel 57344 1 btusb
snd_rawmidi 57344 2 snd_usbmidi_lib,snd_ump
btbcm 24576 1 btusb
mc 90112 1 snd_usb_audio
snd_hwdep 20480 2 snd_usb_audio,snd_hda_codec
btmtk 12288 1 btusb
snd_seq 126976 7 snd_seq_dummy
irqbypass 12288 1 kvm
snd_seq_device 16384 3 snd_seq,snd_ump,snd_rawmidi
cfg80211 1331200 3 iwlmvm,iwlwifi,mac80211
bluetooth 1060864 44 btrtl,btmtk,btintel,btbcm,bnep,btusb,rfcomm
rapl 20480 0
intel_wmi_thunderbolt 16384 0
wmi_bmof 12288 0
snd_pcm 184320 6 snd_hda_codec_hdmi,snd_hda_intel,snd_usb_audio,snd_hda_codec,snd_hda_core
pcspkr 12288 0
snd_timer 53248 3 snd_seq,snd_hrtimer,snd_pcm
k10temp 16384 0
i2c_piix4 32768 0
snd 155648 39 snd_seq,snd_seq_device,snd_hda_codec_hdmi,snd_hwdep,snd_hda_intel,snd_usb_audio,snd_usbmidi_lib,snd_hda_codec,snd_timer,snd_ump,snd_pcm,snd_rawmidi
rfkill 40960 9 iwlmvm,bluetooth,cfg80211
thunderbolt 516096 0
soundcore 16384 1 snd
joydev 24576 0
gpio_amdpt 16384 0
gpio_generic 20480 1 gpio_amdpt
loop 40960 0
zram 32768 2
dm_crypt 65536 1
hid_logitech_hidpp 77824 0
amdgpu 12435456 199
i2c_algo_bit 20480 1 amdgpu
drm_ttm_helper 12288 1 amdgpu
ttm 110592 2 amdgpu,drm_ttm_helper
drm_exec 12288 1 amdgpu
drm_suballoc_helper 12288 1 amdgpu
crct10dif_pclmul 12288 1
amdxcp 12288 1 amdgpu
crc32_pclmul 12288 0
drm_buddy 20480 1 amdgpu
crc32c_intel 16384 3
hid_logitech_dj 40960 0
polyval_clmulni 12288 0
polyval_generic 12288 1 polyval_clmulni
r8169 114688 0
gpu_sched 57344 1 amdgpu
nvme 65536 3
ghash_clmulni_intel 16384 0
drm_display_helper 229376 1 amdgpu
nvme_core 229376 4 nvme
sha512_ssse3 53248 0
sp5100_tco 20480 0
ccp 155648 1 kvm_amd
nvme_common 24576 1 nvme_core
cec 86016 1 drm_display_helper
video 77824 1 amdgpu
wmi 45056 3 video,intel_wmi_thunderbolt,wmi_bmof
ip6_tables 36864 0
ip_tables 36864 0
fuse 208896 5

7. Please attach the kernel logs. You can get the complete kernel log for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the issue occurred on a previous boot, use the journalctl ``-b`` flag.

Reproducible: Always
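
For anyone who wants to script the launch from step 4, here is a minimal Python sketch. It only reproduces the LD_PRELOAD workaround quoted in the report; the function names `build_preload_env` and `launch_resolve` are hypothetical, and the library and binary paths are copied from the reporter's command, so they may differ on other systems.

```python
import os
import subprocess

# Paths copied from the command in the report; adjust if your system differs.
PRELOAD_LIBS = [
    "/usr/lib64/libglib-2.0.so.0",
    "/usr/lib64/libgio-2.0.so.0",
    "/usr/lib64/libgmodule-2.0.so.0",
]
RESOLVE_BIN = "/opt/resolve/bin/resolve"


def build_preload_env(base_env):
    """Return a copy of base_env with LD_PRELOAD set to the system GLib libraries."""
    env = dict(base_env)
    # LD_PRELOAD accepts a space-separated list, as in the reporter's command.
    env["LD_PRELOAD"] = " ".join(PRELOAD_LIBS)
    return env


def launch_resolve():
    """Start DaVinci Resolve with the workaround applied (hypothetical helper)."""
    return subprocess.Popen([RESOLVE_BIN], env=build_preload_env(os.environ))
```

Calling `launch_resolve()` under kernel 6.5.x and again under 6.6.x should reproduce the difference described above, assuming the same package versions.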
I experience something similar - I have an AMD RX 570 GPU with Fedora 39 + ROCm 5.7.1 + Kernel 6.6.3 and am running GPU compute projects through BOINC. With all 6.6.x kernels tested so far the computation "fails": it does not throw obvious errors in the application but it never finishes computing. Booting the same system (without config changes) with kernel 6.5.12 does work just fine.

The following errors are logged with journalctl -k:

Dez 02 16:09:18 kernel: amdgpu 0000:02:00.0: amdgpu: Disabling VM faults because of PRT request!
Dez 02 17:01:53 kernel: amdgpu: Failed to reserve buffers in ttm.
Dez 02 17:01:53 kernel: amdgpu: Failed to reserve buffers in ttm.
Dez 02 17:01:53 kernel: amdgpu: Failed to reserve buffers in ttm.
Dez 02 17:01:53 kernel: ------------[ cut here ]------------
Dez 02 17:01:53 kernel: WARNING: CPU: 0 PID: 10 at drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:1518 amdgpu_amd>
Dez 02 17:01:53 kernel: Modules linked in: uinput snd_seq_dummy snd_hrtimer nf_conntrack_netbios_ns nf_conntrack_br>
Dez 02 17:01:53 kernel: drm_suballoc_helper amdxcp polyval_clmulni drm_buddy polyval_generic ghash_clmulni_intel g>
Dez 02 17:01:53 kernel: CPU: 0 PID: 10 Comm: kworker/0:1 Not tainted 6.6.3-200.fc39.x86_64 #1
Dez 02 17:01:53 kernel: Hardware name: Hewlett-Packard HP Z440 Workstation/212B, BIOS M60 v02.61 03/23/2023
Dez 02 17:01:53 kernel: Workqueue: events delayed_fput
Dez 02 17:01:53 kernel: RIP: 0010:amdgpu_amdkfd_gpuvm_destroy_cb+0x116/0x120 [amdgpu]
Dez 02 17:01:53 kernel: Code: df 5b 5d 41 5c e9 7a ad cc db 5b 5d 41 5c c3 cc cc cc cc e8 fc 5b 46 dc eb cc be 03 0>
Dez 02 17:01:53 kernel: RSP: 0000:ffffc900000b7cc0 EFLAGS: 00010206
Dez 02 17:01:53 kernel: RAX: ffff88818b509020 RBX: ffff88818b509000 RCX: ffff88818b509000
Dez 02 17:01:53 kernel: RDX: ffff88832e9a7d48 RSI: ffff888269f1b730 RDI: ffff88818b509040
Dez 02 17:01:53 kernel: RBP: ffff888269f1b000 R08: 0000000000000000 R09: 0000000080200010
Dez 02 17:01:53 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff88818b509040
Dez 02 17:01:53 kernel: R13: ffff88829ff07a00 R14: 0000000000000000 R15: ffff88862f000001
Dez 02 17:01:53 kernel: FS: 0000000000000000(0000) GS:ffff888fefa00000(0000) knlGS:0000000000000000
Dez 02 17:01:53 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dez 02 17:01:53 kernel: CR2: 0000559837911784 CR3: 0000000555518002 CR4: 00000000003706f0
Dez 02 17:01:53 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dez 02 17:01:53 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dez 02 17:01:53 kernel: Call Trace:
Dez 02 17:01:53 kernel: <TASK>
Dez 02 17:01:53 kernel: ? amdgpu_amdkfd_gpuvm_destroy_cb+0x116/0x120 [amdgpu]
Dez 02 17:01:53 kernel: ? __warn+0x81/0x130
Dez 02 17:01:53 kernel: ? amdgpu_amdkfd_gpuvm_destroy_cb+0x116/0x120 [amdgpu]
Dez 02 17:01:53 kernel: ? report_bug+0x171/0x1a0
Dez 02 17:01:53 kernel: ? handle_bug+0x3c/0x80
Dez 02 17:01:53 kernel: ? exc_invalid_op+0x17/0x70
Dez 02 17:01:53 kernel: ? asm_exc_invalid_op+0x1a/0x20
Dez 02 17:01:53 kernel: ? amdgpu_amdkfd_gpuvm_destroy_cb+0x116/0x120 [amdgpu]
Dez 02 17:01:53 kernel: amdgpu_vm_fini+0x49/0x550 [amdgpu]
Dez 02 17:01:53 kernel: amdgpu_driver_postclose_kms+0x191/0x280 [amdgpu]
Dez 02 17:01:53 kernel: drm_file_free+0x21c/0x270
Dez 02 17:01:53 kernel: drm_release+0x74/0xf0
Dez 02 17:01:53 kernel: __fput+0xf5/0x290
Dez 02 17:01:53 kernel: delayed_fput+0x23/0x30
Dez 02 17:01:53 kernel: process_one_work+0x174/0x340
Dez 02 17:01:53 kernel: worker_thread+0x27b/0x3a0
Dez 02 17:01:53 kernel: ? __pfx_worker_thread+0x10/0x10
Dez 02 17:01:53 kernel: kthread+0xe8/0x120
Dez 02 17:01:53 kernel: ? __pfx_kthread+0x10/0x10
Dez 02 17:01:53 kernel: ret_from_fork+0x34/0x50
Dez 02 17:01:53 kernel: ? __pfx_kthread+0x10/0x10
Dez 02 17:01:53 kernel: ret_from_fork_asm+0x1b/0x30
Dez 02 17:01:53 kernel: </TASK>
Dez 02 17:01:53 kernel: ---[ end trace 0000000000000000 ]---
Dez 02 17:01:53 kernel: amdgpu 0000:02:00.0: amdgpu: still active bo inside vm

I'll try to get some more data on this and report back here.
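
To check whether another machine logs the same messages, a small Python sketch along these lines could count them in a saved kernel log. The three message fragments are copied from the excerpt above; the labels and the function name `count_signatures` are invented for this sketch, and matching on plain substrings is an assumption that may miss differently worded variants.

```python
from collections import Counter

# Fragments taken verbatim from the log excerpt; the dictionary keys are
# hypothetical labels chosen for this sketch.
SIGNATURES = {
    "ttm_reserve_failed": "amdgpu: Failed to reserve buffers in ttm.",
    "kfd_gpuvm_warning": "drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c",
    "active_bo_in_vm": "amdgpu: still active bo inside vm",
}


def count_signatures(lines):
    """Count how many log lines contain each known message fragment."""
    counts = Counter()
    for line in lines:
        for label, fragment in SIGNATURES.items():
            if fragment in line:
                counts[label] += 1
    return counts
```

With a log saved via ``journalctl --no-hostname -k > dmesg.txt``, something like `count_signatures(open("dmesg.txt"))` would give the per-message counts.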
This seems to be fixed upstream in the upcoming kernel 6.7 series - so we'll just have to wait for it to be backported. See https://gitlab.freedesktop.org/drm/amd/-/issues/3007#note_2199326 as well as https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=v6.7-rc4&qt=grep&q=amdkfd
Still broken with 6.6.4 and 6.6.6.
OpenCL is now working for me since kernel 6.6.7 - although the traces are still logged:

Dez 19 05:30:31 kernel: amdgpu: Failed to reserve buffers in ttm.
Dez 19 05:30:31 kernel: amdgpu: Failed to reserve buffers in ttm.
Dez 19 05:30:31 kernel: amdgpu: Failed to reserve buffers in ttm.
Dez 19 05:30:31 kernel: amdgpu: Failed to reserve buffers in ttm.
Dez 19 05:30:31 kernel: ------------[ cut here ]------------
Dez 19 05:30:31 kernel: WARNING: CPU: 3 PID: 11999 at drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:1518 amdgpu_a>
Dez 19 05:30:31 kernel: Modules linked in: uinput snd_seq_dummy snd_hrtimer nf_conntrack_netbios_ns nf_conntrack_bro>
Dez 19 05:30:31 kernel: polyval_clmulni drm_suballoc_helper amdxcp polyval_generic drm_buddy ghash_clmulni_intel nv>
Dez 19 05:30:31 kernel: CPU: 3 PID: 11999 Comm: kworker/3:2 Tainted: G W 6.6.7-200.fc39.x86_64 #1
Dez 19 05:30:31 kernel: Hardware name: Hewlett-Packard HP Z440 Workstation/212B, BIOS M60 v02.61 03/23/2023
Dez 19 05:30:31 kernel: Workqueue: events delayed_fput
Dez 19 05:30:31 kernel: RIP: 0010:amdgpu_amdkfd_gpuvm_destroy_cb+0x116/0x120 [amdgpu]
Dez 19 05:30:31 kernel: Code: df 5b 5d 41 5c e9 ba 7b cd cf 5b 5d 41 5c c3 cc cc cc cc e8 7c 30 47 d0 eb cc be 03 00>
Dez 19 05:30:31 kernel: RSP: 0018:ffffc90009043cc0 EFLAGS: 00010287
Dez 19 05:30:31 kernel: RAX: ffff8886accbd020 RBX: ffff8886accbd000 RCX: ffff8886accbd000
Dez 19 05:30:31 kernel: RDX: ffff8883456cce48 RSI: ffff8885bf5a9730 RDI: ffff8886accbd040
Dez 19 05:30:31 kernel: RBP: ffff8885bf5a9000 R08: 0000000000000000 R09: 0000000080200013
Dez 19 05:30:31 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8886accbd040
Dez 19 05:30:31 kernel: R13: ffff8881fce8f400 R14: 0000000000000000 R15: ffff8886a4000001
Dez 19 05:30:31 kernel: FS: 0000000000000000(0000) GS:ffff888fefac0000(0000) knlGS:0000000000000000
Dez 19 05:30:31 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dez 19 05:30:31 kernel: CR2: 00007fa7eb423000 CR3: 0000000105184005 CR4: 00000000003706e0
Dez 19 05:30:31 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dez 19 05:30:31 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dez 19 05:30:31 kernel: Call Trace:
Dez 19 05:30:31 kernel: <TASK>
Dez 19 05:30:31 kernel: ? amdgpu_amdkfd_gpuvm_destroy_cb+0x116/0x120 [amdgpu]
Dez 19 05:30:31 kernel: ? __warn+0x81/0x130
Dez 19 05:30:31 kernel: ? amdgpu_amdkfd_gpuvm_destroy_cb+0x116/0x120 [amdgpu]
Dez 19 05:30:31 kernel: ? report_bug+0x171/0x1a0
Dez 19 05:30:31 kernel: ? handle_bug+0x3c/0x80
Dez 19 05:30:31 kernel: ? exc_invalid_op+0x17/0x70
Dez 19 05:30:31 kernel: ? asm_exc_invalid_op+0x1a/0x20
Dez 19 05:30:31 kernel: ? amdgpu_amdkfd_gpuvm_destroy_cb+0x116/0x120 [amdgpu]
Dez 19 05:30:31 kernel: amdgpu_vm_fini+0x49/0x550 [amdgpu]
Dez 19 05:30:31 kernel: amdgpu_driver_postclose_kms+0x191/0x280 [amdgpu]
Dez 19 05:30:31 kernel: drm_file_free+0x21c/0x270
Dez 19 05:30:31 kernel: drm_release+0x74/0xf0
Dez 19 05:30:31 kernel: __fput+0xf5/0x290
Dez 19 05:30:31 kernel: delayed_fput+0x23/0x30
Dez 19 05:30:31 kernel: process_one_work+0x174/0x340
Dez 19 05:30:31 kernel: worker_thread+0x27b/0x3a0
Dez 19 05:30:31 kernel: ? __pfx_worker_thread+0x10/0x10
Dez 19 05:30:31 kernel: kthread+0xe8/0x120
Dez 19 05:30:31 kernel: ? __pfx_kthread+0x10/0x10
Dez 19 05:30:31 kernel: ret_from_fork+0x34/0x50
Dez 19 05:30:31 kernel: ? __pfx_kthread+0x10/0x10
Dez 19 05:30:31 kernel: ret_from_fork_asm+0x1b/0x30
Dez 19 05:30:31 kernel: </TASK>
Dez 19 05:30:31 kernel: ---[ end trace 0000000000000000 ]---
Dez 19 05:30:31 kernel: amdgpu 0000:02:00.0: amdgpu: still active bo inside vm
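
As a quick way to compare a running kernel against what has been reported in this bug so far, here is a rough Python sketch. The status table only restates the comments above (different reporters, different workloads) and is not an authoritative compatibility list; `REPORTED`, `parse_kernel_release`, and `reported_status` are names invented for the sketch.

```python
import re

# Outcomes as described by commenters in this bug; nothing here was verified
# independently, and kernels not mentioned in the thread are left out.
REPORTED = {
    (6, 5, 12): "works",
    (6, 6, 2): "broken",
    (6, 6, 3): "broken",
    (6, 6, 4): "broken",
    (6, 6, 6): "broken",
    (6, 6, 7): "works (OpenCL; warning still logged)",
}


def parse_kernel_release(release):
    """Extract (major, minor, patch) from a string like '6.6.7-200.fc39.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def reported_status(release):
    """Look up what this thread says about the given kernel release."""
    return REPORTED.get(parse_kernel_release(release), "not reported in this thread")
```

On a live system the argument could come from `platform.release()` in the Python standard library, for example `reported_status(platform.release())`.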
This message is a reminder that Fedora Linux 39 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora Linux 39 on 2024-11-26. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of '39'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, change the 'version' to a later Fedora Linux version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see it. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora Linux 39 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora Linux, you are encouraged to change the 'version' to a later version prior to this bug being closed.
Fedora Linux 39 entered end-of-life (EOL) status on 2024-11-26. Fedora Linux 39 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora Linux please feel free to reopen this bug against that version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see the version field. If you are unable to reopen this bug, please file a new report against an active release. Thank you for reporting this bug and we are sorry it could not be fixed.