When enabling HIP rendering with the latest rocm-hip update (https://koji.fedoraproject.org/koji/buildinfo?buildID=2217754) and the patch applied to Blender (https://koji.fedoraproject.org/koji/buildinfo?buildID=2218279), the rendering silently failed without a backtrace.

Reproducible: Always

Steps to Reproduce:
1. Install Blender and the rocm-hip component, then start the application.
2. Make sure HIP is enabled for the AMD hardware and switch rendering to GPU Compute.
3. Render a model.

Actual Results:
Silent failure

Expected Results:
Rendering should be successful

See attachment
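For triage, the reproduction steps above can also be driven without the GUI. The following is a minimal sketch, assuming Blender's documented command-line options (`-b`, `-E`, `-f`, and the `--cycles-device` argument passed after `--`). The helper name and the `.blend` file name are hypothetical and not part of the original report.

```python
import shlex


def build_render_command(blend_file, device="HIP", frame=1, blender="blender"):
    """Build a headless Cycles render command for a single frame."""
    return [
        blender,
        "-b", blend_file,                  # run in background, no UI
        "-E", "CYCLES",                    # select the Cycles render engine
        "-f", str(frame),                  # render exactly one frame
        "--", "--cycles-device", device,   # Cycles-specific device override
    ]


# Hypothetical file name; any Cycles demo scene should do.
print(shlex.join(build_render_command("bmw27_gpu.blend")))
```

Passing the resulting list to `subprocess.run` and checking the return code is one way to catch the failure in a script: a negative return code means the process was terminated by a signal, for example -11 for SIGSEGV.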
Created attachment 1972007 [details] Backtrace using sysprof
Adding Tom Rix to take a look at the crash.
Is this a Blender build --with rocm? The other recent change in Blender is that the toolchain changed to clang; has that been ruled out? I have a local Blender build going with the latest rocclr and will try to reproduce this in the morning.
Yes, according to the spec file (rawhide as an example): https://src.fedoraproject.org/rpms/blender/c/caea81a044c4574dcfae5da78c1488df9727dc03?branch=rawhide

The build still uses the gcc compiler by default, and clang for the rocclr component, according to this log: https://kojipkgs.fedoraproject.org//packages/blender/3.5.1/7.fc38/data/logs/x86_64/build.log
my system is rawhide, so bear with me, i know the problem was reported on f38.
Local building blender with and without -with rocm works now
mock of the default fails with
Error: Transaction test error:
  file /usr/lib64/libpcre.so conflicts between attempted installs of pcre-devel-8.45-1.fc38.3.x86_64 and openCOLLADA-devel-1.6.70-4.fc39.x86_64

The basic startup of blender works for both
can you suggest a file to load or a rendering stress test I could run ?
Any file using Cycles rendering, such as the demo files at https://www.blender.org/download/demo-files/#cycles
I think the issue may affect rawhide as well.
My card may be too old to run this; it falls back to CPU :(

rocminfo
  Name: gfx803
  Uuid: GPU-XX
  Marketing Name: AMD Radeon RX 550 / 550 Series
  Vendor Name: AMD
Your card is a Polaris, which is unsupported by rocclr (the minimum requirement is gfx900, the Vega series).

  Name: gfx1030
  Uuid: GPU-XX
  Marketing Name: AMD Radeon RX 6950 XT
  Vendor Name: AMD

The issue also affects APUs like the Ryzen 7 5825U.
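The support check described in this comment can be sketched in code. This assumes that the characters after `gfx` can be compared as a hexadecimal number, which is only a heuristic for ordering targets, and it takes the gfx900 minimum from this comment rather than from an official support matrix. Both function names are hypothetical.

```python
import re

MINIMUM_TARGET = "gfx900"  # minimum stated in this comment (Vega series)


def gfx_targets(rocminfo_text):
    """Return the gfx target names found in `rocminfo` output."""
    # Requires the value to start with "gfx", so "Marketing Name: AMD ..."
    # and "Vendor Name: AMD" are not picked up.
    return re.findall(r"Name:\s*(gfx[0-9a-f]+)", rocminfo_text)


def meets_minimum(target, minimum=MINIMUM_TARGET):
    """Heuristic: compare the part after 'gfx' as a hexadecimal number."""
    return int(target[3:], 16) >= int(minimum[3:], 16)


sample = "Name: gfx803 Uuid: GPU-XX Marketing Name: AMD Radeon RX 550 / 550 Series Vendor Name: AMD"
for name in gfx_targets(sample):
    print(name, "meets minimum" if meets_minimum(name) else "below minimum")
```

With the two cards quoted in this thread, gfx803 falls below the stated minimum and gfx1030 is above it.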
(In reply to Tom Rix from comment #5)
> my system is rawhide, so bear with me, i know the problem was reported on
> f38.
> Local building blender with and without -with rocm works now
> mock of the default fails with
> Error: Transaction test error:
> file /usr/lib64/libpcre.so conflicts between attempted installs of
> pcre-devel-8.45-1.fc38.3.x86_64 and openCOLLADA-devel-1.6.70-4.fc39.x86_64
>
> The basic startup of blender works for both
> can you suggest a file to load or a rendering stress test I could run ?

I forgot to mention you can use mock to build Fedora using the command:
"fedpkg --release=f38 scratch-build --target=f38-build-side-69204 --arch=x86_64 --srpm"
Detailed traceback from journalctl when running GPU Cycles rendering (in this example on a Radeon RX 6950XT):

Jun 20 16:54:51 systemd-coredump[72648]: Process 72540 (blender) of user 1000 dumped core.
    Module blender from rpm blender-3.5.1-7.fc38.x86_64
    #1 0x0000557fd8ddf657 _ZL14print_resourceRSoRKN7blender3gpu6shader16ShaderCreateInfo8ResourceEb (blender + 0x2bca657)
    #2 0x0000557fd8e2ac78 _ZNK7blender3gpu8GLShader17resources_declareB5cxx11ERKNS0_6shader16ShaderCreateInfoE (blender + 0x2c15c78)
    #3 0x0000557fd8d78125 GPU_shader_create_from_info (blender + 0x2b63125)
    #4 0x0000557fd72368bd OVERLAY_grid_cache_init (blender + 0x10218bd)
    #5 0x0000557fd723bff1 _ZL18OVERLAY_cache_initPv.lto_priv.0 (blender + 0x1026ff1)
    #6 0x0000557fd71ae2d5 drw_engines_cache_init.lto_priv.0 (blender + 0xf992d5)
    #7 0x0000557fd71e7e64 DRW_draw_render_loop_2d_ex (blender + 0xfd2e64)
    #8 0x0000557fd7f37907 image_main_region_draw.lto_priv.0 (blender + 0x1d22907)
    #9 0x0000557fd75c0ad5 ED_region_do_draw (blender + 0x13abad5)
    #10 0x0000557fd7074b65 wm_draw_update (blender + 0xe5fb65)
    #11 0x0000557fd6ad943c main (blender + 0x8c443c)
    #14 0x0000557fd6b20835 _start (blender + 0x90b835)
Jun 20 16:54:49 audit[72540]: ANOM_ABEND auid=1000 uid=1000 gid=1000 ses=3 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=72540 comm="blender" exe="/usr/bin/blender" sig=11 res=1
Jun 20 16:54:49 kernel: blender[72646]: segfault at 0 ip 0000000000000000 sp 00007f869ce66078 error 14 in blender[557fd6215000+459a000] likely on CPU 21 (core 11, socket 0)
Jun 20 16:54:49 audit[72540]: ANOM_ABEND auid=1000 uid=1000 gid=1000 ses=3 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=72540 comm="blender" exe="/usr/bin/blender" sig=11 res=1
Jun 20 16:54:49 blender.desktop[72540]: Writing: /tmp/bmw27_gpu.crash.txt
Jun 20 16:54:49 blender.desktop[72540]: Read blend: /home/luya/Documents/design stuff/blender/bmw27/bmw27_gpu.blend
Jun 20 16:54:49 blender.desktop[72540]: Read prefs: /home/luya/.config/blender/3.5/config/userpref.blend
Jun 20 16:54:43 systemd[2030]: Started app-gnome-blender-72540.scope - Application launched by gnome-shell
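When comparing several crash reports like the one above, it can help to pull the stack frames out of the journal text. The following is a rough sketch that assumes the `#N 0xADDR symbol (module + 0xOFFSET)` frame layout seen in this systemd-coredump output; the regular expression and the function name are illustrative only.

```python
import re

# Matches frames such as:
#   #3 0x0000557fd8d78125 GPU_shader_create_from_info (blender + 0x2b63125)
FRAME_RE = re.compile(
    r"#(?P<index>\d+)\s+0x(?P<address>[0-9a-f]+)\s+(?P<symbol>\S+)\s+"
    r"\((?P<module>\S+) \+ 0x(?P<offset>[0-9a-f]+)\)"
)


def parse_frames(text):
    """Extract (index, symbol, module, offset) tuples from coredump output."""
    return [
        (int(m["index"]), m["symbol"], m["module"], int(m["offset"], 16))
        for m in FRAME_RE.finditer(text)
    ]


sample = (
    "#3 0x0000557fd8d78125 GPU_shader_create_from_info (blender + 0x2b63125) "
    "#4 0x0000557fd72368bd OVERLAY_grid_cache_init (blender + 0x10218bd)"
)
for index, symbol, module, offset in parse_frames(sample):
    print(f"#{index} {symbol} in {module} at +{offset:#x}")
```

Mangled C++ symbols such as the ones in frames #1 and #2 are returned unchanged; they can be demangled separately, for example with `c++filt`.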
Same issue with Fedora 39 and ROCm 5.7.1, on:
CPU: Ryzen 7 5800x and 2700x
Motherboard: 470x and A320
Radeon 5700x and RX5700XT

nov. 06 08:50:29 zeus5.cc.local kernel: BUG: kernel NULL pointer dereference, address: 00000000000007a0
nov. 06 08:50:29 zeus5.cc.local kernel: #PF: supervisor write access in kernel mode
nov. 06 08:50:29 zeus5.cc.local kernel: #PF: error_code(0x0002) - not-present page
nov. 06 08:50:29 zeus5.cc.local kernel: PGD 215148067 P4D 215148067 PUD 215147067 PMD 0
nov. 06 08:50:29 zeus5.cc.cc.local kernel: CPU: 2 PID: 7498 Comm: blender Not tainted 6.5.10-300.fc39.x86_64 #1
nov. 06 08:50:29 zeus5.cc.local kernel: Hardware name: Micro-Star International Co., Ltd. MS-7B78/X470 GAMING PRO CARBON (MS-7B78), BIOS 2.I0 07/27/2022
nov. 06 08:50:29 zeus5.cc.local kernel: RIP: 0010:amdgpu_gmc_set_pte_pde+0x23/0x30 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: Code: 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 48 b8 00 f0 ff ff ff ff 00 00 48 21 c1 8d 04 d5 00 00 00 00 4c 09 c1 48 01 c6 <48> 89 0e 31 c0 e9 73 b3 90 da 0f 1f 00 90 90 90 90 90 90 90 90 90
nov. 06 08:50:29 zeus5.cc.local kernel: RSP: 0018:ffffb405074a7940 EFLAGS: 00010206
nov. 06 08:50:29 zeus5.cc.local kernel: RAX: 0000000000000000 RBX: 00000001c7200000 RCX: 00400001c70005f1
nov. 06 08:50:29 zeus5.cc.local kernel: RDX: 0000000000000000 RSI: 00000000000007a0 RDI: ffff8ef719c00000
nov. 06 08:50:29 zeus5.ipazeus.local kernel: RBP: ffffb405074a7aa8 R08: 00400000000005f1 R09: 0000000000200000
nov. 06 08:50:29 zeus5.cc.local kernel: R10: 00400000000005f1 R11: 0000000000000009 R12: 0000000000200000
nov. 06 08:50:29 zeus5.cc.local kernel: R13: 0000000000000004 R14: 00000000000007a0 R15: 0000000000000001
nov. 06 08:50:29 zeus5.cc.local kernel: FS: 00007f41e709a580(0000) GS:ffff8f05fea80000(0000) knlGS:0000000000000000
nov. 06 08:50:29 zeus5.cc.local kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
nov. 06 08:50:29 zeus5.cc.local kernel: CR2: 00000000000007a0 CR3: 00000002a0a64000 CR4: 0000000000750ee0
nov. 06 08:50:29 zeus5.cc.local kernel: PKRU: 55555554
nov. 06 08:50:29 zeus5.cc.local kernel: Call Trace:
nov. 06 08:50:29 zeus5.cc.local kernel: <TASK>
nov. 06 08:50:29 zeus5.cc.local kernel: ? __die+0x23/0x70
nov. 06 08:50:29 zeus5.cc.local kernel: ? page_fault_oops+0x171/0x4e0
nov. 06 08:50:29 zeus5.cc.local kernel: ? srso_alias_return_thunk+0x5/0x7f
nov. 06 08:50:29 zeus5.cc.local kernel: ? exc_page_fault+0x7f/0x180
nov. 06 08:50:29 zeus5.cc.local kernel: ? asm_exc_page_fault+0x26/0x30
nov. 06 08:50:29 zeus5.cc.local kernel: ? amdgpu_gmc_set_pte_pde+0x23/0x30 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: amdgpu_vm_cpu_update+0x92/0x110 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: amdgpu_vm_ptes_update+0x32c/0x930 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: amdgpu_vm_update_range+0x241/0x740 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: amdgpu_vm_bo_update+0x305/0x570 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: amdgpu_gem_va_ioctl+0x54f/0x590 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: drm_ioctl_kernel+0xcd/0x170
nov. 06 08:50:29 zeus5.cc.local kernel: drm_ioctl+0x26d/0x4b0
nov. 06 08:50:29 zeus5.cc.local kernel: ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: __x64_sys_ioctl+0x97/0xd0
nov. 06 08:50:29 zeus5.cc.local kernel: do_syscall_64+0x60/0x90
nov. 06 08:50:29 zeus5.cc.local kernel: ? srso_alias_return_thunk+0x5/0x7f
nov. 06 08:50:29 zeus5.cc.local kernel: ? __count_memcg_events+0x42/0x90
nov. 06 08:50:29 zeus5.cc.local kernel: ? srso_alias_return_thunk+0x5/0x7f
nov. 06 08:50:29 zeus5.cc.local kernel: ? count_memcg_events.constprop.0+0x1a/0x30
nov. 06 08:50:29 zeus5.cc.local kernel: ? srso_alias_return_thunk+0x5/0x7f
nov. 06 08:50:29 zeus5.cc.local kernel: ? handle_mm_fault+0x9e/0x350
nov. 06 08:50:29 zeus5.cc.local kernel: ? srso_alias_return_thunk+0x5/0x7f
nov. 06 08:50:29 zeus5.cc.local kernel: ? do_user_addr_fault+0x179/0x640
nov. 06 08:50:29 zeus5.cc.local kernel: ? srso_alias_return_thunk+0x5/0x7f
nov. 06 08:50:29 zeus5.cc.local kernel: ? exc_page_fault+0x7f/0x180
nov. 06 08:50:29 zeus5.v.local kernel: entry_SYSCALL_64_after_hwframe+0x6e/0xd8
nov. 06 08:50:29 zeus5.cc.local kernel: RIP: 0033:0x7f41e6d2f13d
nov. 06 08:50:29 zeus5.cc.local kernel: Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
nov. 06 08:50:29 zeus5.cc.local kernel: RSP: 002b:00007ffd67cd8a50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
nov. 06 08:50:29 zeus5.cc.local kernel: RAX: ffffffffffffffda RBX: 00007f418043b820 RCX: 00007f41e6d2f13d
nov. 06 08:50:29 zeus5.cc.local kernel: RDX: 00007ffd67cd8af0 RSI: 00000000c0286448 RDI: 000000000000000b
nov. 06 08:50:29 zeus5.cc.local kernel: RBP: 00007ffd67cd8aa0 R08: ffff80011e800000 R09: 000000000000000e
nov. 06 08:50:29 zeus5.cc.local kernel: R10: 000000000000003c R11: 0000000000000246 R12: 00007ffd67cd8af0
nov. 06 08:50:29 zeus5.cc.local kernel: R13: 00000000c0286448 R14: 000000000000000b R15: 00007f41da478c00
nov. 06 08:50:29 zeus5.cc.local kernel: </TASK>
nov. 06 08:50:29 zeus5.cc.local kernel: Modules linked in: uinput snd_seq_dummy snd_hrtimer rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache netfs cfg80211 nft_masq team_mode_roundrobin team nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink qrtr binfmt_misc dm_crypt snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_usb_audio snd_hda_codec intel_rapl_msr snd_usbmidi_lib intel_rapl_common snd_ump snd_hda_core edac_mce_amd snd_rawmidi snd_hwdep mc kvm_amd snd_seq snd_seq_device kvm snd_pcm irqbypass snd_timer rapl snd wmi_bmof mxm_wmi pcspkr soundcore vfat i2c_piix4 k10temp fat joydev gpio_amdpt gpio_generic auth_rpcgss sunrpc loop zram amdgpu hid_logitech_hidpp drm_ttm_helper ttm video drm_suballoc_helper amdxcp iommu_v2 drm_buddy crct10dif_pclmul crc32_pclmul gpu_sched crc32c_intel polyval_clmulni
nov. 06 08:50:29 zeus5.ipazeus.local kernel: polyval_generic igb drm_display_helper nvme ghash_clmulni_intel dca ccp sha512_ssse3 cec r8169 nvme_core sp5100_tco i2c_algo_bit nvme_common wmi hid_logitech_dj scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables dm_multipath nct6775 nct6775_core hwmon_vid fuse
nov. 06 08:50:29 zeus5.cc.local kernel: CR2: 00000000000007a0
nov. 06 08:50:29 zeus5.cc.local kernel: ---[ end trace 0000000000000000 ]---
nov. 06 08:50:29 zeus5.cc.local kernel: RIP: 0010:amdgpu_gmc_set_pte_pde+0x23/0x30 [amdgpu]
nov. 06 08:50:29 zeus5.cc.local kernel: Code: 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 48 b8 00 f0 ff ff ff ff 00 00 48 21 c1 8d 04 d5 00 00 00 00 4c 09 c1 48 01 c6 <48> 89 0e 31 c0 e9 73 b3 90 da 0f 1f 00 90 90 90 90 90 90 90 90 90
nov. 06 08:50:29 zeus5.cc.local kernel: RSP: 0018:ffffb405074a7940 EFLAGS: 00010206
nov. 06 08:50:29 zeus5.cc.local kernel: RAX: 0000000000000000 RBX: 00000001c7200000 RCX: 00400001c70005f1
nov. 06 08:50:29 zeus5.cc.local kernel: RDX: 0000000000000000 RSI: 00000000000007a0 RDI: ffff8ef719c00000
nov. 06 08:50:29 zeus5.cc.local kernel: RBP: ffffb405074a7aa8 R08: 00400000000005f1 R09: 0000000000200000
nov. 06 08:50:29 zeus5.cc.local kernel: R10: 00400000000005f1 R11: 0000000000000009 R12: 0000000000200000
nov. 06 08:50:29 zeus5.cc.local kernel: R13: 0000000000000004 R14: 00000000000007a0 R15: 0000000000000001
nov. 06 08:50:29 zeus5.cc.local kernel: FS: 00007f41e709a580(0000) GS:ffff8f05fea80000(0000) knlGS:0000000000000000
nov. 06 08:50:29 zeus5.cc.local kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
nov. 06 08:50:29 zeus5.cc.local kernel: CR2: 00000000000007a0 CR3: 00000002a0a64000 CR4: 0000000000750ee0
nov. 06 08:50:29 zeus5.cc.local kernel: PKRU: 55555554
nov. 06 08:50:29 zeus5.cc.local kernel: note: blender[7498] exited with irqs disabled

All applications that use OpenCL with ROCm trigger the issue, and it also locks up another physical machine.
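To compare kernel oops reports like this one across machines, the faulting function and the call-trace entries can be extracted from the journal text. This is a sketch only: it assumes the `symbol+0xOFF/0xLEN [module]` layout shown above and treats entries prefixed with `?` as unreliable, which is how the kernel marks addresses it found on the stack but could not confirm as part of the call chain. The function names are hypothetical.

```python
import re

# "RIP: 0010:amdgpu_gmc_set_pte_pde+0x23/0x30 [amdgpu]"
RIP_RE = re.compile(
    r"RIP: [0-9a-f]{4}:(?P<symbol>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(?P<module>\w+)\]"
)
# "kernel: ? __die+0x23/0x70" or "kernel: amdgpu_vm_cpu_update+0x92/0x110 [amdgpu]"
TRACE_RE = re.compile(
    r"kernel:\s+(?P<uncertain>\? )?(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def faulting_function(log):
    """Return (symbol, module) from the first kernel-module RIP line, or None."""
    m = RIP_RE.search(log)
    return (m["symbol"], m["module"]) if m else None


def reliable_frames(log, module=None):
    """Return call-trace symbols not marked '?', optionally for one module."""
    frames = []
    for m in TRACE_RE.finditer(log):
        if m["uncertain"]:
            continue
        if module is not None and m["module"] != module:
            continue
        frames.append(m["symbol"])
    return frames
```

Applied to the trace above, `reliable_frames(log, module="amdgpu")` would list the `amdgpu_vm_*` and ioctl entries leading to the fault, which is usually the part worth quoting in an upstream report.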
Duplicated by: https://bugzilla.redhat.com/show_bug.cgi?id=2249261

I'm facing the same issue on supported hardware (Radeon RX 6650, an RDNA2 GPU). The interesting part is that this hack works: https://www.reddit.com/r/Fedora/comments/11qh9j3/getting_bender_hip_to_work/ so it's not a hardware problem. It broke in F38 a couple of weeks ago; I was hoping it would be fixed in F39, but it's still broken. Blender from Rawhide also crashes. I can run some debugging if needed; just leave me a message saying what you need me to run.
I am having the same problem with F39.

  Name: gfx1031
  Uuid: GPU-XX
  Marketing Name: AMD Radeon RX 6700 XT
  Vendor Name: AMD
  Feature: KERNEL_DISPATCH
  Profile: BASE_PROFILE
  Float Round Mode: NEAR
  Max Queue Number: 128(0x80)
  Queue Min Size: 64(0x40)
  Queue Max Size: 131072(0x20000)
  Queue Type: MULTI
  Node: 2
  Device Type: GPU

kernel: 6.6.2-201.fc39.x86_64

The system crashed when it tried to render cycles with the GPU.
So basically you're using the Arch userspace in a container? Yeah, I suspect LLVM is to blame here, because Fedora isn't building the same LLVM version as what you're getting in Arch. Arch builds AMD's fork of LLVM for ROCm, which is a snapshot of the LLVM development branch, while Fedora uses a single LLVM for all components, based on the release versions. So either this was fixed in LLVM development but hasn't made it into an LLVM release, or there's something in that AMD fork that hasn't landed in upstream LLVM yet.

Using a fork of LLVM is a pretty heavy way of maintaining ROCm, so unfortunately there are some trade-offs in going with Arch's approach (Fedora's ROCm effort is pretty resource-limited right now). I've considered making a COPR to provide this as an alternative, but I haven't had the time to get that going.
Quick update, I spoke with upstream, it seems it's fixed in LLVM 18, but I need to work out what to backport into Fedora.
It looks like this bug started being apparent sometime after Linux kernel 5.15. https://projects.blender.org/blender/blender/issues/100353
> So basically you're using the Arch userspace in a container?

Yes, exactly. Because it's a container, the only differences are the userspace, dependencies, and packaging/compilation. The hardware and kernel are the same.

> Quick update, I spoke with upstream, it seems it's fixed in LLVM 18, but I need to work out what to backport into Fedora.

Thank you for your update and for finding the culprit! Can't wait for the fix.
I'm seeing this issue and getting a dump similar to the one in comment 12 with any application that needs to use OpenCL. While researching this over the last week, I found these open issues (and several duplicates): https://gitlab.freedesktop.org/drm/amd/-/issues/2991 and https://github.com/ROCm/ROCm/issues/2596. There could very well be an LLVM issue, given that I'm seeing errors that would indicate one when my system doesn't immediately freeze up.
So I spoke to upstream. They're pretty confident it's a patch that needs backporting to upstream LLVM 17. Their LLVM stable tree is here (currently LLVM 17): https://github.com/ROCm/llvm-project/tree/amd-mainline-open

They suspect it's related to SGPR spills, e.g.: https://github.com/ROCm/llvm-project/blob/amd-mainline-open/llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp

If anyone wants to try to debug, feel free. I might not have time to debug this due to the holidays.
So quick update, I spoke with upstream, it seems like in ROCm 6.1, they're going to reorg the source code relating to LLVM, so it'll make fixing LLVM issues much much easier. Their LLVM 17 development tree is here (will be used to branch 6.1 at some point): https://github.com/ROCm/llvm-project/tree/amd-mainline-open So in theory it should be as simple as doing a diff of llvm/lib/Target/AMDGPU between rocm/llvm-project (amd-mainline-open) and llvm/llvm-project (release/17.x). As I said prior, upstream thinks that a bugfix patch is missing in the upstream llvm tree, which is what Fedora uses. I'm a bit tied up right now, so this might take some time unless someone else has interest in trying to rebuild llvm from source.
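The comparison suggested above can be started with a short script once both trees are checked out locally. This is a minimal sketch using Python's standard `filecmp` module; the checkout locations are hypothetical, and the function only lists which files differ rather than identifying the missing bugfix.

```python
import filecmp
import os


def differing_files(left_root, right_root, subdir="llvm/lib/Target/AMDGPU"):
    """List files under `subdir` that differ or exist on only one side."""
    changed, only_left, only_right = [], [], []

    def walk(cmp, rel):
        # Note: dircmp compares os.stat signatures first, so two files with
        # identical size and mtime are treated as equal without being read.
        changed.extend(os.path.join(rel, n) for n in cmp.diff_files)
        only_left.extend(os.path.join(rel, n) for n in cmp.left_only)
        only_right.extend(os.path.join(rel, n) for n in cmp.right_only)
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(rel, name))

    walk(
        filecmp.dircmp(
            os.path.join(left_root, subdir), os.path.join(right_root, subdir)
        ),
        "",
    )
    return sorted(changed), sorted(only_left), sorted(only_right)
```

For example, `differing_files("rocm-llvm-project", "llvm-project")` would compare a checkout of the amd-mainline-open branch against a checkout of release/17.x, assuming those are the directory names used locally. The resulting file list can then be narrowed with `git log` or `diff` on the files of interest, such as SILowerSGPRSpills.cpp.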
Filed a bug report to LLVM with linked comment #21.
Hi, I just updated to kernel 6.6.14-200.fc39.x86_64 and Blender 4.02 Cycles HIP rendering is working well. I think this bug is fixed. I downloaded the version from the Blender website (4.02 static) because the official Fedora rpm crashed with a /lib64/libc.so.6 error.

  Name: gfx1031
  Uuid: GPU-XX
  Marketing Name: AMD Radeon RX 6700 XT
  Vendor Name: AMD

OS: Fedora 39, kernel 6.6.14-200
Fedora Linux 38 entered end-of-life (EOL) status on 2024-05-21. Fedora Linux 38 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora Linux please feel free to reopen this bug against that version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see the version field. If you are unable to reopen this bug, please file a new report against an active release. Thank you for reporting this bug and we are sorry it could not be fixed.
I know this is an old bug and F38 (and F39) are EOL, but I wanted to give an update for posterity: Blender GPU rendering on Fedora 40 with the ROCm stack on an AMD RX 6650 works beautifully now! Thank you everyone.
Sweeeet!! Glad it works.