Bug 2156595

Summary: rocm-opencl makes clinfo crash when installed in parallel with mesa-libOpenCL
Product: [Fedora] Fedora
Reporter: Dominik 'Rathann' Mierzejewski <dominik>
Component: rocm-opencl
Assignee: Jeremy Newton <alexjnewt>
Status: CLOSED ERRATA
QA Contact:
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 37
CC: alexjnewt, chplee, dkxls23, maigurs, obmun.h, vovkap97
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: rocm-opencl-5.5.1-1.fc38
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-06-03 02:44:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Dominik 'Rathann' Mierzejewski 2022-12-27 22:03:48 UTC
Description of problem:
Both clinfo and rocm-clinfo crash if mesa-libOpenCL and rocm-opencl are installed in parallel.

Version-Release number of selected component (if applicable):
clinfo-3.0.21.02.21-4.fc37.x86_64
mesa-libOpenCL-22.3.1-1.fc37.x86_64
rocm-clinfo-5.3.2-1.fc37.x86_64
rocm-opencl-5.3.2-1.fc37.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. dnf install clinfo mesa-libOpenCL rocm-clinfo rocm-opencl
2. clinfo

Actual results:
mesa: CommandLine Error: Option 'h' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
Aborted (core dumped)

Expected results:
Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.1 AMD-APP (3486.0)
  Platform Profile                                FULL_PROFILE
...
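
Triage note: a minimal sketch for narrowing down which ICD triggers the abort. It assumes the OpenCL loader is ocl-icd (the loader visible in the backtrace in comment 7) and that ICD registration files live in /etc/OpenCL/vendors. ocl-icd reads the OCL_ICD_VENDORS environment variable, and when it points at a single .icd file only that ICD is loaded. The helper names list_icds and run_per_icd are hypothetical and not part of any package.

```shell
#!/bin/sh
# List each ICD registration file and the library name on its first line.
# Assumption: the vendor directory defaults to /etc/OpenCL/vendors.
list_icds() {
    dir="${1:-/etc/OpenCL/vendors}"
    for icd in "$dir"/*.icd; do
        [ -e "$icd" ] || continue   # skip the unexpanded glob if no .icd exists
        printf '%s -> %s\n' "$(basename "$icd")" "$(head -n 1 "$icd")"
    done
}

# Run a command once per ICD with only that ICD visible to the loader,
# and report the exit status of each run.
# Assumption: the loader is ocl-icd, which honours OCL_ICD_VENDORS.
run_per_icd() {
    dir="$1"
    shift
    for icd in "$dir"/*.icd; do
        [ -e "$icd" ] || continue
        echo "== $(basename "$icd") =="
        OCL_ICD_VENDORS="$icd" "$@" >/dev/null 2>&1
        echo "exit status: $?"
    done
}
```

Example use: `list_icds` followed by `run_per_icd /etc/OpenCL/vendors clinfo`. If every ICD succeeds on its own and the abort only appears when all of them are loaded together, that points to a conflict between the implementations rather than a fault in one of them.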

Comment 1 Jacobo Cabaleiro 2023-02-01 23:46:29 UTC
Same issue still present with the more recent rocm-opencl-5.4.1-1.fc37.x86_64

Comment 2 Jeremy Newton 2023-04-20 19:51:39 UTC
*** Bug 2143687 has been marked as a duplicate of this bug. ***

Comment 3 Jeremy Newton 2023-04-20 22:14:31 UTC
This error:

> mesa: CommandLine Error: Option 'h' registered more than once!
> LLVM ERROR: inconsistency in registered CommandLine options
> Aborted (core dumped)

Is fixed in this Fedora 38 update:
https://bodhi.fedoraproject.org/updates/FEDORA-2023-05720f124e

If you already upgraded to Fedora 38, please test.

I'll see if I can backport it to Fedora 37.

Comment 4 vovkap97 2023-04-21 07:28:03 UTC
It's showing another error, but the problem is still present:

sudo dnf install mesa-libOpenCL

rocm-clinfo
: CommandLine Error: Option 'abort-on-max-devirt-iterations-reached' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
fish: Job 1, 'rocm-clinfo' terminated by signal SIGABRT (Abort)


sudo dnf rm mesa-libOpenCL

rocm-clinfo
Number of platforms:                             2
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 2.1 AMD-APP (3513.0)
  Platform Name:                                 AMD Accelerated Parallel Processing
  Platform Vendor:                               Advanced Micro Devices, Inc.
  Platform Extensions:                           cl_khr_icd cl_amd_event_callback
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 3.0 PoCL 3.1  Linux, Release, RELOC, SPIR, LLVM 16.0.0, SLEEF, FP16, DISTRO, POCL_DEBUG
  Platform Name:                                 Portable Computing Language
  Platform Vendor:                               The pocl project
  Platform Extensions:                           cl_khr_icd cl_pocl_content_size


  Platform Name:                                 AMD Accelerated Parallel Processing
Number of devices:                               2
  Device Type:                                   CL_DEVICE_TYPE_GPU
  Vendor ID:                                     1002h
  Board name:                                    AMD Radeon RX 6900 XT
...


Versions:
dnf list --installed | grep rocm

rocm-clinfo.x86_64                                   5.4.3-2.fc38                       @updates
rocm-comgr.x86_64                                    16.0-2.fc38                        @updates
rocm-comgr-debuginfo.x86_64                          5.3.0-1.fc37                       @updates-debuginfo
rocm-comgr-devel.x86_64                              16.0-2.fc38                        @updates
rocm-compilersupport-debugsource.x86_64              5.3.0-1.fc37                       @updates-debuginfo
rocm-device-libs.x86_64                              16.0-1.fc38                        @fedora
rocm-opencl.x86_64                                   5.4.3-2.fc38                       @updates
rocm-opencl-devel.x86_64                             5.4.3-2.fc38                       @updates
rocm-runtime.x86_64                                  5.4.1-3.fc38                       @fedora
rocm-runtime-devel.x86_64                            5.4.1-3.fc38                       @fedora
rocm-smi.noarch                                      4.0.0-8.fc38                       @fedora
rocminfo.x86_64                                      5.4.1-2.fc38                       @fedora

Comment 5 Jeremy Newton 2023-04-21 16:42:16 UTC
Thanks for the feedback, I'll contact upstream with this info.

Comment 6 Jeremy Newton 2023-04-21 20:38:19 UTC
So I back-ported the fix to f37 and I can't reproduce any error right now with this update:
https://bodhi.fedoraproject.org/updates/FEDORA-2023-994e29c721

It's possible the LLVM 16 upgrade in Fedora 38 causes a regression (as compared to Fedora 37's LLVM 15), or maybe there's something unique to your system that makes it not reproduce on my end. I think I might have an AMD Radeon RX 6700 accessible to me that I can test out, as the current HW on my system is from the RX 5xxx series.

I also spoke to the upstream developers and the fix that they suggested might require major packaging changes in other Fedora packages. Either way, I'll need to reproduce before I can proceed with any fix yet.
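
Triage note: "Option ... registered more than once" is the error LLVM raises when the same command-line option is registered twice in one process, which typically happens when two copies of LLVM are loaded together. A minimal sketch for checking which LLVM or Clang shared libraries an OpenCL implementation links against follows. The helper name llvm_deps is hypothetical, and the library path in the usage line is an assumption about where Fedora installs the Mesa ICD.

```shell
#!/bin/sh
# Filter `ldd` output read from stdin down to LLVM/Clang shared libraries.
# Only dynamically linked copies show up here; a copy of LLVM that is
# statically linked into a library is invisible to ldd.
llvm_deps() {
    awk '$1 ~ /^lib(LLVM|clang)/ { print $1 }' | sort -u
}
```

Example use (path is an assumption): `ldd /usr/lib64/libMesaOpenCL.so.1 | llvm_deps`. Running the same filter on each library named by the files in /etc/OpenCL/vendors shows whether the installed implementations pull in different LLVM versions.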

Comment 7 vovkap97 2023-04-22 09:14:32 UTC
I've tested with f37 in toolbox:  
toolbox create --release 37
sudo dnf install 'rocm-*'


rocm-clinfo
Number of platforms:                             1
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 2.1 AMD-APP (3513.0)
  Platform Name:                                 AMD Accelerated Parallel Processing
  Platform Vendor:                               Advanced Micro Devices, Inc.
  Platform Extensions:                           cl_khr_icd cl_amd_event_callback


  Platform Name:                                 AMD Accelerated Parallel Processing
Number of devices:                               2

sudo dnf install  mesa-libOpenCL

rocm-clinfo
Segmentation fault (core dumped)


backtrace:

Thread 1 "rocm-clinfo" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff72cc2de in clover::device::supports_ir (ir=PIPE_SHADER_IR_NATIVE,
    this=0x55555569dc10) at ../src/gallium/frontends/clover/core/device.cpp:502
#2  clover::device::device (this=this@entry=0x55555569dc10, platform=..., ldev=0x5555556a1210)
    at ../src/gallium/frontends/clover/core/device.cpp:165
#3  0x00007ffff72db5cf in clover::create<clover::device, clover::platform&, pipe_loader_device*&>
    () at ../src/gallium/frontends/clover/util/pointer.hpp:240
#4  clover::platform::platform (
    this=this@entry=0x7ffff7537100 <(anonymous namespace)::_clover_platform>)
    at ../src/gallium/frontends/clover/core/platform.cpp:41
#5  0x00007ffff729f7fd in __static_initialization_and_destruction_0 (__priority=65535,
    __initialize_p=1) at ../src/gallium/frontends/clover/api/platform.cpp:34
#6  0x00007ffff7fcccde in call_init (env=0x7fffffffe1d8, argv=0x7fffffffe1c8, argc=1,
    l=<optimized out>) at dl-init.c:70
#7  call_init (l=<optimized out>, argc=1, argv=0x7fffffffe1c8, env=0x7fffffffe1d8)
    at dl-init.c:26
#8  0x00007ffff7fccdcc in _dl_init (main_map=0x555555614f20, argc=1, argv=0x7fffffffe1c8,
    env=0x7fffffffe1d8) at dl-init.c:117
#9  0x00007ffff7ca8f14 in __GI__dl_catch_exception (exception=<optimized out>,
    operate=<optimized out>, args=<optimized out>)
    at /usr/src/debug/glibc-2.36-9.fc37.x86_64/elf/dl-error-skeleton.c:182
#10 0x00007ffff7fd3736 in dl_open_worker (a=a@entry=0x7fffffffd7c0) at dl-open.c:808
#11 0x00007ffff7ca8ebe in __GI__dl_catch_exception (exception=<optimized out>,
    operate=<optimized out>, args=<optimized out>)
    at /usr/src/debug/glibc-2.36-9.fc37.x86_64/elf/dl-error-skeleton.c:208
#12 0x00007ffff7fd3acc in _dl_open (file=0x555555613970 "libMesaOpenCL.so.1",
    mode=<optimized out>, caller_dlopen=0x7ffff7f9789f <_open_driver+303>, nsid=<optimized out>,
    argc=1, argv=0x7fffffffe1c8, env=0x7fffffffe1d8) at dl-open.c:884
#13 0x00007ffff7be123c in dlopen_doit (a=a@entry=0x7fffffffda30) at dlopen.c:56
#14 0x00007ffff7ca8ebe in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffd990,
    operate=<optimized out>, args=<optimized out>)
    at /usr/src/debug/glibc-2.36-9.fc37.x86_64/elf/dl-error-skeleton.c:208
#15 0x00007ffff7ca8f73 in __GI__dl_catch_error (objname=0x7fffffffd9e8,
    errstring=0x7fffffffd9f0, mallocedp=0x7fffffffd9e7, operate=<optimized out>,
    args=<optimized out>) at /usr/src/debug/glibc-2.36-9.fc37.x86_64/elf/dl-error-skeleton.c:227
#16 0x00007ffff7be0d0f in _dlerror_run (operate=operate@entry=0x7ffff7be11e0 <dlopen_doit>,
    args=args@entry=0x7fffffffda30) at dlerror.c:138
#17 0x00007ffff7be12f1 in dlopen_implementation (dl_caller=<optimized out>,
    mode=<optimized out>, file=<optimized out>) at dlopen.c:71
#18 ___dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:81
#19 0x00007ffff7f9789f in _load_icd (lib_path=0x555555613970 "libMesaOpenCL.so.1", num_icds=1)
    at /usr/src/debug/ocl-icd-2.3.1-2.fc37.x86_64/ocl_icd_loader.c:208
#20 _open_driver (num_icds=num_icds@entry=1,
    dir_path=dir_path@entry=0x7ffff7fac0a4 "/etc/OpenCL/vendors",
    file_path=file_path@entry=0x555555578f43 "mesa.icd")
    at /usr/src/debug/ocl-icd-2.3.1-2.fc37.x86_64/ocl_icd_loader.c:261
#21 0x00007ffff7f9ad16 in _open_drivers (dir_path=<optimized out>, dir=<optimized out>)
    at /usr/src/debug/ocl-icd-2.3.1-2.fc37.x86_64/ocl_icd_loader.c:274
#22 __initClIcd () at /usr/src/debug/ocl-icd-2.3.1-2.fc37.x86_64/ocl_icd_loader.c:767
#23 _initClIcd_real () at /usr/src/debug/ocl-icd-2.3.1-2.fc37.x86_64/ocl_icd_loader.c:824
#24 0x00007ffff7f9ce14 in _initClIcd ()
    at /usr/src/debug/ocl-icd-2.3.1-2.fc37.x86_64/ocl_icd_loader.c:853
#25 clGetPlatformIDs (num_entries=0, platforms=0x0, num_platforms=0x7fffffffdc14)
    at /usr/src/debug/ocl-icd-2.3.1-2.fc37.x86_64/ocl_icd_loader.c:1018
#26 0x000055555555e547 in cl::Platform::get (platforms=platforms@entry=0x7fffffffdd90)
    at /usr/src/debug/rocm-opencl-5.4.3-1.fc37.x86_64/tools/clinfo/../../khronos/headers/opencl2.2/CL/../CL/cl2.hpp:2474
#27 0x0000555555556f58 in main (argc=<optimized out>, argv=<optimized out>)
    at /usr/src/debug/rocm-opencl-5.4.3-1.fc37.x86_64/tools/clinfo/clinfo.cpp:75
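
Triage note: a minimal sketch for capturing a backtrace like the one above non-interactively. gdb's batch mode disables pagination and exits after the listed commands have run, so the output can be redirected straight into a file for attaching to a report. The helper name capture_bt is hypothetical.

```shell
#!/bin/sh
# Run a program under gdb in batch mode and print a backtrace if it stops.
# -batch       : no interactive prompt, pagination disabled, exit when done
# -ex run      : start the program
# -ex bt       : print the backtrace after the program stops
# --args "$@"  : treat the remaining arguments as the program and its arguments
capture_bt() {
    gdb -batch -ex run -ex bt --args "$@" 2>&1
}
```

Example use: `capture_bt rocm-clinfo > rocm-clinfo-bt.txt`. Frames only resolve to source lines, as they do above, when the matching debuginfo packages are installed.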

Comment 8 Jeremy Newton 2023-05-02 04:07:23 UTC
*** Bug 2149162 has been marked as a duplicate of this bug. ***

Comment 9 Jeremy Newton 2023-05-05 17:15:20 UTC
*** Bug 2157619 has been marked as a duplicate of this bug. ***

Comment 10 Jeremy Newton 2023-05-05 17:58:26 UTC
Some observations:
- I can't reproduce this on an up-to-date Fedora 37 system
- I can reproduce with an RX 6750 XT on Fedora 38
- I can't reproduce on Fedora 37 with a Fedora 38 toolbox with the same HW

Seems strange. I'll update this if I ever figure it out.

Comment 11 Jeremy Newton 2023-05-05 18:01:10 UTC
* Note: I can't reproduce this on any other HW, period.

Comment 12 Jeremy Newton 2023-05-31 03:00:29 UTC
I believe this update fixes the issue:
https://bodhi.fedoraproject.org/updates/FEDORA-2023-68012d0819

I can't reproduce it anymore now.

Can anyone confirm?

Comment 13 vovkap97 2023-06-01 09:14:24 UTC
It's working now, with and without mesa. Good job!

Comment 14 Fedora Update System 2023-06-02 03:40:44 UTC
FEDORA-2023-68012d0819 has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2023-68012d0819

Comment 15 Jeremy Newton 2023-06-02 03:41:34 UTC
No problem! I tagged it on the update.

Comment 16 Fedora Update System 2023-06-03 02:44:28 UTC
FEDORA-2023-68012d0819 has been pushed to the Fedora 38 stable repository.
If the problem still persists, please make note of it in this bug report.