Bug 2252130 - Davinci Resolve crash at startup with kernel 6.6.2 + AMDGPU
Summary: Davinci Resolve crash at startup with kernel 6.6.2 + AMDGPU
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 39
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-11-29 18:02 UTC by Yannick Defais
Modified: 2024-11-27 22:11 UTC
CC List: 17 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2024-11-27 22:11:25 UTC
Type: ---
Embargoed:



Description Yannick Defais 2023-11-29 18:02:17 UTC
1. Please describe the problem:
After updating the kernel to 6.6.2-201.fc39, DaVinci Resolve crashes at startup. The system uses AMDGPU and ROCm.

2. What is the Version-Release number of the kernel:
Linux fedora 6.6.2-201.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Nov 22 21:31:42 UTC 2023 x86_64 GNU/Linux

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
With kernel 6.5.12-300.fc39, DaVinci Resolve works fine.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:
To run DaVinci Resolve, first install the Linux package from the website:
https://www.blackmagicdesign.com/products/davinciresolve

Install it, then install rocm-opencl as well (for AMDGPU).

Resolve then needs some "magic" to start:
$ LD_PRELOAD="/usr/lib64/libglib-2.0.so.0 /usr/lib64/libgio-2.0.so.0 /usr/lib64/libgmodule-2.0.so.0" /opt/resolve/bin/resolve

With kernel 6.5.x it starts; with kernel 6.6 it crashes.

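For repeated testing across kernels, the launch command above can be wrapped in a small script. This is only a minimal sketch, assuming the library and binary paths quoted in this report; `build_env` and `launch` are hypothetical helper names, not part of DaVinci Resolve.

```python
import os
import subprocess

# Paths copied from the command in this report; adjust if your install differs.
PRELOAD_LIBS = [
    "/usr/lib64/libglib-2.0.so.0",
    "/usr/lib64/libgio-2.0.so.0",
    "/usr/lib64/libgmodule-2.0.so.0",
]
RESOLVE_BIN = "/opt/resolve/bin/resolve"


def build_env(base_env=None):
    """Return a copy of the environment with LD_PRELOAD set to the system GLib libraries."""
    env = dict(os.environ if base_env is None else base_env)
    # ld.so accepts a space- or colon-separated list; the report uses spaces.
    env["LD_PRELOAD"] = " ".join(PRELOAD_LIBS)
    return env


def launch():
    """Start Resolve with the preloaded libraries and return its exit code."""
    return subprocess.run([RESOLVE_BIN], env=build_env()).returncode
```

Calling `launch()` on each kernel and noting the exit code gives a quick pass/fail check; a negative return code means the process was killed by a signal.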
5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:


6. Are you running any modules that are not shipped directly with Fedora's kernel?:
lsmod
Module                  Size  Used by
uinput                 20480  0
rfcomm                102400  16
snd_seq_dummy          12288  0
snd_hrtimer            12288  1
nf_conntrack_netbios_ns    12288  1
nf_conntrack_broadcast    12288  1 nf_conntrack_netbios_ns
nft_fib_inet           12288  1
nft_fib_ipv4           12288  1 nft_fib_inet
nft_fib_ipv6           12288  1 nft_fib_inet
nft_fib                12288  3 nft_fib_ipv6,nft_fib_ipv4,nft_fib_inet
nft_reject_inet        12288  16
nf_reject_ipv4         16384  1 nft_reject_inet
nf_reject_ipv6         20480  1 nft_reject_inet
nft_reject             12288  1 nft_reject_inet
nft_ct                 24576  8
nft_chain_nat          12288  3
nf_nat                 65536  1 nft_chain_nat
nf_conntrack          200704  4 nf_nat,nft_ct,nf_conntrack_netbios_ns,nf_conntrack_broadcast
nf_defrag_ipv6         24576  1 nf_conntrack
nf_defrag_ipv4         12288  1 nf_conntrack
ip_set                 65536  0
nf_tables             368640  418 nft_ct,nft_reject_inet,nft_fib_ipv6,nft_fib_ipv4,nft_chain_nat,nft_reject,nft_fib,nft_fib_inet
nfnetlink              20480  3 nf_tables,ip_set
qrtr                   57344  4
bnep                   36864  2
sunrpc                888832  1
binfmt_misc            28672  1
iwlmvm                696320  0
vfat                   20480  1
mac80211             1572864  1 iwlmvm
fat                   106496  1 vfat
intel_rapl_msr         20480  0
intel_rapl_common      40960  1 intel_rapl_msr
edac_mce_amd           53248  0
snd_hda_codec_hdmi     94208  2
kvm_amd               204800  0
snd_hda_intel          65536  5
libarc4                12288  1 mac80211
snd_intel_dspcfg       40960  1 snd_hda_intel
snd_usb_audio         462848  8
snd_intel_sdw_acpi     16384  1 snd_intel_dspcfg
kvm                  1372160  1 kvm_amd
iwlwifi               471040  1 iwlmvm
snd_hda_codec         225280  2 snd_hda_codec_hdmi,snd_hda_intel
btusb                  86016  0
btrtl                  32768  1 btusb
snd_usbmidi_lib        49152  1 snd_usb_audio
snd_ump                36864  1 snd_usb_audio
snd_hda_core          151552  3 snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec
btintel                57344  1 btusb
snd_rawmidi            57344  2 snd_usbmidi_lib,snd_ump
btbcm                  24576  1 btusb
mc                     90112  1 snd_usb_audio
snd_hwdep              20480  2 snd_usb_audio,snd_hda_codec
btmtk                  12288  1 btusb
snd_seq               126976  7 snd_seq_dummy
irqbypass              12288  1 kvm
snd_seq_device         16384  3 snd_seq,snd_ump,snd_rawmidi
cfg80211             1331200  3 iwlmvm,iwlwifi,mac80211
bluetooth            1060864  44 btrtl,btmtk,btintel,btbcm,bnep,btusb,rfcomm
rapl                   20480  0
intel_wmi_thunderbolt    16384  0
wmi_bmof               12288  0
snd_pcm               184320  6 snd_hda_codec_hdmi,snd_hda_intel,snd_usb_audio,snd_hda_codec,snd_hda_core
pcspkr                 12288  0
snd_timer              53248  3 snd_seq,snd_hrtimer,snd_pcm
k10temp                16384  0
i2c_piix4              32768  0
snd                   155648  39 snd_seq,snd_seq_device,snd_hda_codec_hdmi,snd_hwdep,snd_hda_intel,snd_usb_audio,snd_usbmidi_lib,snd_hda_codec,snd_timer,snd_ump,snd_pcm,snd_rawmidi
rfkill                 40960  9 iwlmvm,bluetooth,cfg80211
thunderbolt           516096  0
soundcore              16384  1 snd
joydev                 24576  0
gpio_amdpt             16384  0
gpio_generic           20480  1 gpio_amdpt
loop                   40960  0
zram                   32768  2
dm_crypt               65536  1
hid_logitech_hidpp     77824  0
amdgpu              12435456  199
i2c_algo_bit           20480  1 amdgpu
drm_ttm_helper         12288  1 amdgpu
ttm                   110592  2 amdgpu,drm_ttm_helper
drm_exec               12288  1 amdgpu
drm_suballoc_helper    12288  1 amdgpu
crct10dif_pclmul       12288  1
amdxcp                 12288  1 amdgpu
crc32_pclmul           12288  0
drm_buddy              20480  1 amdgpu
crc32c_intel           16384  3
hid_logitech_dj        40960  0
polyval_clmulni        12288  0
polyval_generic        12288  1 polyval_clmulni
r8169                 114688  0
gpu_sched              57344  1 amdgpu
nvme                   65536  3
ghash_clmulni_intel    16384  0
drm_display_helper    229376  1 amdgpu
nvme_core             229376  4 nvme
sha512_ssse3           53248  0
sp5100_tco             20480  0
ccp                   155648  1 kvm_amd
nvme_common            24576  1 nvme_core
cec                    86016  1 drm_display_helper
video                  77824  1 amdgpu
wmi                    45056  3 video,intel_wmi_thunderbolt,wmi_bmof
ip6_tables             36864  0
ip_tables              36864  0
fuse                  208896  5

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Reproducible: Always

Comment 1 Michael Oppliger 2023-12-02 16:27:18 UTC
I am seeing something similar. I have an AMD RX 570 GPU with Fedora 39, ROCm 5.7.1 and kernel 6.6.3, and I run GPU compute projects through BOINC.
With every 6.6.x kernel tested so far the computation "fails": the application throws no obvious errors, but it never finishes computing.
Booting the same system (without config changes) with kernel 6.5.12 works fine.

The following errors are logged with journalctl -k:


Dez 02 16:09:18 kernel: amdgpu 0000:02:00.0: amdgpu: Disabling VM faults because of PRT request!

Dez 02 17:01:53 kernel: amdgpu: Failed to reserve buffers in ttm.
Dez 02 17:01:53 kernel: amdgpu: Failed to reserve buffers in ttm.
Dez 02 17:01:53 kernel: amdgpu: Failed to reserve buffers in ttm.
Dez 02 17:01:53 kernel: ------------[ cut here ]------------
Dez 02 17:01:53 kernel: WARNING: CPU: 0 PID: 10 at drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:1518 amdgpu_amd>
Dez 02 17:01:53 kernel: Modules linked in: uinput snd_seq_dummy snd_hrtimer nf_conntrack_netbios_ns nf_conntrack_br>
Dez 02 17:01:53 kernel:  drm_suballoc_helper amdxcp polyval_clmulni drm_buddy polyval_generic ghash_clmulni_intel g>
Dez 02 17:01:53 kernel: CPU: 0 PID: 10 Comm: kworker/0:1 Not tainted 6.6.3-200.fc39.x86_64 #1
Dez 02 17:01:53 kernel: Hardware name: Hewlett-Packard HP Z440 Workstation/212B, BIOS M60 v02.61 03/23/2023
Dez 02 17:01:53 kernel: Workqueue: events delayed_fput
Dez 02 17:01:53 kernel: RIP: 0010:amdgpu_amdkfd_gpuvm_destroy_cb+0x116/0x120 [amdgpu]
Dez 02 17:01:53 kernel: Code: df 5b 5d 41 5c e9 7a ad cc db 5b 5d 41 5c c3 cc cc cc cc e8 fc 5b 46 dc eb cc be 03 0>
Dez 02 17:01:53 kernel: RSP: 0000:ffffc900000b7cc0 EFLAGS: 00010206
Dez 02 17:01:53 kernel: RAX: ffff88818b509020 RBX: ffff88818b509000 RCX: ffff88818b509000
Dez 02 17:01:53 kernel: RDX: ffff88832e9a7d48 RSI: ffff888269f1b730 RDI: ffff88818b509040
Dez 02 17:01:53 kernel: RBP: ffff888269f1b000 R08: 0000000000000000 R09: 0000000080200010
Dez 02 17:01:53 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff88818b509040
Dez 02 17:01:53 kernel: R13: ffff88829ff07a00 R14: 0000000000000000 R15: ffff88862f000001
Dez 02 17:01:53 kernel: FS:  0000000000000000(0000) GS:ffff888fefa00000(0000) knlGS:0000000000000000
Dez 02 17:01:53 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dez 02 17:01:53 kernel: CR2: 0000559837911784 CR3: 0000000555518002 CR4: 00000000003706f0
Dez 02 17:01:53 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dez 02 17:01:53 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dez 02 17:01:53 kernel: Call Trace:
Dez 02 17:01:53 kernel:  <TASK>
Dez 02 17:01:53 kernel:  ? amdgpu_amdkfd_gpuvm_destroy_cb+0x116/0x120 [amdgpu]
Dez 02 17:01:53 kernel:  ? __warn+0x81/0x130
Dez 02 17:01:53 kernel:  ? amdgpu_amdkfd_gpuvm_destroy_cb+0x116/0x120 [amdgpu]
Dez 02 17:01:53 kernel:  ? report_bug+0x171/0x1a0
Dez 02 17:01:53 kernel:  ? handle_bug+0x3c/0x80
Dez 02 17:01:53 kernel:  ? exc_invalid_op+0x17/0x70
Dez 02 17:01:53 kernel:  ? asm_exc_invalid_op+0x1a/0x20
Dez 02 17:01:53 kernel:  ? amdgpu_amdkfd_gpuvm_destroy_cb+0x116/0x120 [amdgpu]
Dez 02 17:01:53 kernel:  amdgpu_vm_fini+0x49/0x550 [amdgpu]
Dez 02 17:01:53 kernel:  amdgpu_driver_postclose_kms+0x191/0x280 [amdgpu]
Dez 02 17:01:53 kernel:  drm_file_free+0x21c/0x270
Dez 02 17:01:53 kernel:  drm_release+0x74/0xf0
Dez 02 17:01:53 kernel:  __fput+0xf5/0x290
Dez 02 17:01:53 kernel:  delayed_fput+0x23/0x30
Dez 02 17:01:53 kernel:  process_one_work+0x174/0x340
Dez 02 17:01:53 kernel:  worker_thread+0x27b/0x3a0
Dez 02 17:01:53 kernel:  ? __pfx_worker_thread+0x10/0x10
Dez 02 17:01:53 kernel:  kthread+0xe8/0x120
Dez 02 17:01:53 kernel:  ? __pfx_kthread+0x10/0x10
Dez 02 17:01:53 kernel:  ret_from_fork+0x34/0x50
Dez 02 17:01:53 kernel:  ? __pfx_kthread+0x10/0x10
Dez 02 17:01:53 kernel:  ret_from_fork_asm+0x1b/0x30
Dez 02 17:01:53 kernel:  </TASK>
Dez 02 17:01:53 kernel: ---[ end trace 0000000000000000 ]---
Dez 02 17:01:53 kernel: amdgpu 0000:02:00.0: amdgpu: still active bo inside vm



I'll try to get some more data on this and report back here.
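To compare logs between kernels, the warning location and the amdgpu messages can be pulled out of a saved `journalctl -k` dump. The following is a rough sketch that assumes the line format shown in the excerpt above; `summarize` and both regular expressions are illustrative names, not an existing tool.

```python
import re
from collections import Counter

# Matches the "WARNING: CPU: N PID: N at file:line" header of a kernel warning.
WARNING_RE = re.compile(r"kernel: WARNING: CPU: (\d+) PID: (\d+) at (\S+):(\d+)")
# Matches amdgpu messages with or without the PCI device prefix.
AMDGPU_RE = re.compile(r"kernel: amdgpu(?: [0-9a-f:.]+: amdgpu)?: (.+)$")


def summarize(lines):
    """Return (list of (source file, line) per warning, Counter of amdgpu messages)."""
    warnings = []
    messages = Counter()
    for line in lines:
        line = line.rstrip("\n")
        m = WARNING_RE.search(line)
        if m:
            warnings.append((m.group(3), int(m.group(4))))
            continue
        m = AMDGPU_RE.search(line)
        if m:
            messages[m.group(1)] += 1
    return warnings, messages
```

Passing the lines of a `dmesg.txt` file to `summarize` yields one entry per warning plus a count of each distinct amdgpu message. Call-trace lines are ignored because they do not start with the `amdgpu:` prefix.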

Comment 2 Michael Oppliger 2023-12-09 10:08:06 UTC
This seems to be fixed upstream in the upcoming kernel 6.7 series - so we'll just have to wait for it to be backported.

See https://gitlab.freedesktop.org/drm/amd/-/issues/3007#note_2199326 as well as https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=v6.7-rc4&qt=grep&q=amdkfd

Comment 3 Yannick Defais 2023-12-14 09:19:34 UTC
Still broken with 6.6.4 and 6.6.6.

Comment 4 Michael Oppliger 2023-12-19 04:48:56 UTC
OpenCL has been working for me since kernel 6.6.7, although the traces are still logged:


Dez 19 05:30:31 kernel: amdgpu: Failed to reserve buffers in ttm.
Dez 19 05:30:31 kernel: amdgpu: Failed to reserve buffers in ttm.
Dez 19 05:30:31 kernel: amdgpu: Failed to reserve buffers in ttm.
Dez 19 05:30:31 kernel: amdgpu: Failed to reserve buffers in ttm.
Dez 19 05:30:31 kernel: ------------[ cut here ]------------
Dez 19 05:30:31 kernel: WARNING: CPU: 3 PID: 11999 at drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:1518 amdgpu_a>
Dez 19 05:30:31 kernel: Modules linked in: uinput snd_seq_dummy snd_hrtimer nf_conntrack_netbios_ns nf_conntrack_bro>
Dez 19 05:30:31 kernel:  polyval_clmulni drm_suballoc_helper amdxcp polyval_generic drm_buddy ghash_clmulni_intel nv>
Dez 19 05:30:31 kernel: CPU: 3 PID: 11999 Comm: kworker/3:2 Tainted: G        W          6.6.7-200.fc39.x86_64 #1
Dez 19 05:30:31 kernel: Hardware name: Hewlett-Packard HP Z440 Workstation/212B, BIOS M60 v02.61 03/23/2023
Dez 19 05:30:31 kernel: Workqueue: events delayed_fput
Dez 19 05:30:31 kernel: RIP: 0010:amdgpu_amdkfd_gpuvm_destroy_cb+0x116/0x120 [amdgpu]
Dez 19 05:30:31 kernel: Code: df 5b 5d 41 5c e9 ba 7b cd cf 5b 5d 41 5c c3 cc cc cc cc e8 7c 30 47 d0 eb cc be 03 00>
Dez 19 05:30:31 kernel: RSP: 0018:ffffc90009043cc0 EFLAGS: 00010287
Dez 19 05:30:31 kernel: RAX: ffff8886accbd020 RBX: ffff8886accbd000 RCX: ffff8886accbd000
Dez 19 05:30:31 kernel: RDX: ffff8883456cce48 RSI: ffff8885bf5a9730 RDI: ffff8886accbd040
Dez 19 05:30:31 kernel: RBP: ffff8885bf5a9000 R08: 0000000000000000 R09: 0000000080200013
Dez 19 05:30:31 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8886accbd040
Dez 19 05:30:31 kernel: R13: ffff8881fce8f400 R14: 0000000000000000 R15: ffff8886a4000001
Dez 19 05:30:31 kernel: FS:  0000000000000000(0000) GS:ffff888fefac0000(0000) knlGS:0000000000000000
Dez 19 05:30:31 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dez 19 05:30:31 kernel: CR2: 00007fa7eb423000 CR3: 0000000105184005 CR4: 00000000003706e0
Dez 19 05:30:31 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dez 19 05:30:31 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dez 19 05:30:31 kernel: Call Trace:
Dez 19 05:30:31 kernel:  <TASK>
Dez 19 05:30:31 kernel:  ? amdgpu_amdkfd_gpuvm_destroy_cb+0x116/0x120 [amdgpu]
Dez 19 05:30:31 kernel:  ? __warn+0x81/0x130
Dez 19 05:30:31 kernel:  ? amdgpu_amdkfd_gpuvm_destroy_cb+0x116/0x120 [amdgpu]
Dez 19 05:30:31 kernel:  ? report_bug+0x171/0x1a0
Dez 19 05:30:31 kernel:  ? handle_bug+0x3c/0x80
Dez 19 05:30:31 kernel:  ? exc_invalid_op+0x17/0x70
Dez 19 05:30:31 kernel:  ? asm_exc_invalid_op+0x1a/0x20
Dez 19 05:30:31 kernel:  ? amdgpu_amdkfd_gpuvm_destroy_cb+0x116/0x120 [amdgpu]
Dez 19 05:30:31 kernel:  amdgpu_vm_fini+0x49/0x550 [amdgpu]
Dez 19 05:30:31 kernel:  amdgpu_driver_postclose_kms+0x191/0x280 [amdgpu]
Dez 19 05:30:31 kernel:  drm_file_free+0x21c/0x270
Dez 19 05:30:31 kernel:  drm_release+0x74/0xf0
Dez 19 05:30:31 kernel:  __fput+0xf5/0x290
Dez 19 05:30:31 kernel:  delayed_fput+0x23/0x30
Dez 19 05:30:31 kernel:  process_one_work+0x174/0x340
Dez 19 05:30:31 kernel:  worker_thread+0x27b/0x3a0
Dez 19 05:30:31 kernel:  ? __pfx_worker_thread+0x10/0x10
Dez 19 05:30:31 kernel:  kthread+0xe8/0x120
Dez 19 05:30:31 kernel:  ? __pfx_kthread+0x10/0x10
Dez 19 05:30:31 kernel:  ret_from_fork+0x34/0x50
Dez 19 05:30:31 kernel:  ? __pfx_kthread+0x10/0x10
Dez 19 05:30:31 kernel:  ret_from_fork_asm+0x1b/0x30
Dez 19 05:30:31 kernel:  </TASK>
Dez 19 05:30:31 kernel: ---[ end trace 0000000000000000 ]---
Dez 19 05:30:31 kernel: amdgpu 0000:02:00.0: amdgpu: still active bo inside vm
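For anyone checking where their own kernel falls, the data points reported so far in this bug can be collected into a small lookup. This is a sketch built only from the comments above; the `REPORTED` table and the function names are hypothetical, and a version missing from the table means only that nobody reported on it here.

```python
import re


def parse_kernel_release(release):
    """Extract (major, minor, patch) from a string such as '6.6.2-201.fc39.x86_64'."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in m.groups())


# Data points taken from the comments in this report only.
REPORTED = {
    (6, 5, 12): "works (Resolve and BOINC)",
    (6, 6, 2): "Resolve crashes at startup",
    (6, 6, 3): "BOINC OpenCL jobs never finish",
    (6, 6, 4): "Resolve still broken",
    (6, 6, 6): "Resolve still broken",
    (6, 6, 7): "OpenCL works again, warning still logged",
}


def reported_status(release):
    """Look up what this bug report says about the given kernel release."""
    return REPORTED.get(parse_kernel_release(release), "no report in this bug")
```

On a running system, `reported_status(platform.release())` (after `import platform`) gives the entry for the booted kernel.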

Comment 5 Aoife Moloney 2024-11-13 10:10:41 UTC
This message is a reminder that Fedora Linux 39 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 39 on 2024-11-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '39'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version. Note that the version field may be hidden.
Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 39 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 6 Aoife Moloney 2024-11-27 22:11:25 UTC
Fedora Linux 39 entered end-of-life (EOL) status on 2024-11-26.

Fedora Linux 39 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.

