Bug 2156595 - rocm-opencl makes clinfo crash when installed in parallel with mesa-libOpenCL
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: rocm-opencl
Version: 37
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Jeremy Newton
QA Contact:
URL:
Whiteboard:
Duplicates: 2143687 2149162 2157619
Depends On:
Blocks:
Reported: 2022-12-27 22:03 UTC by Dominik 'Rathann' Mierzejewski
Modified: 2023-06-03 02:44 UTC (History)

Fixed In Version: rocm-opencl-5.5.1-1.fc38
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-03 02:44:28 UTC
Type: Bug
Embargoed:


Description Dominik 'Rathann' Mierzejewski 2022-12-27 22:03:48 UTC
Description of problem:
Both clinfo and rocm-clinfo crash if mesa-libOpenCL and rocm-opencl are installed in parallel.

Version-Release number of selected component (if applicable):
clinfo-3.0.21.02.21-4.fc37.x86_64
mesa-libOpenCL-22.3.1-1.fc37.x86_64
rocm-clinfo-5.3.2-1.fc37.x86_64
rocm-opencl-5.3.2-1.fc37.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. dnf install clinfo mesa-libOpenCL rocm-clinfo rocm-opencl
2. clinfo

Actual results:
mesa: CommandLine Error: Option 'h' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
Aborted (core dumped)

Expected results:
Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.1 AMD-APP (3486.0)
  Platform Profile                                FULL_PROFILE
...

Comment 1 Jacobo Cabaleiro 2023-02-01 23:46:29 UTC
Same issue still present with the more recent rocm-opencl-5.4.1-1.fc37.x86_64

Comment 2 Jeremy Newton 2023-04-20 19:51:39 UTC
*** Bug 2143687 has been marked as a duplicate of this bug. ***

Comment 3 Jeremy Newton 2023-04-20 22:14:31 UTC
This error:

> mesa: CommandLine Error: Option 'h' registered more than once!
> LLVM ERROR: inconsistency in registered CommandLine options
> Aborted (core dumped)

Is fixed in this Fedora 38 update:
https://bodhi.fedoraproject.org/updates/FEDORA-2023-05720f124e

If you already upgraded to Fedora 38, please test.

I'll see if I can backport it to Fedora 37.

Comment 4 vovkap97 2023-04-21 07:28:03 UTC
It's showing another error, but the problem is still present:

sudo dnf install mesa-libOpenCL

rocm-clinfo
: CommandLine Error: Option 'abort-on-max-devirt-iterations-reached' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
fish: Job 1, 'rocm-clinfo' terminated by signal SIGABRT (Abort)


sudo dnf rm mesa-libOpenCL

rocm-clinfo
Number of platforms:                             2
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 2.1 AMD-APP (3513.0)
  Platform Name:                                 AMD Accelerated Parallel Processing
  Platform Vendor:                               Advanced Micro Devices, Inc.
  Platform Extensions:                           cl_khr_icd cl_amd_event_callback
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 3.0 PoCL 3.1  Linux, Release, RELOC, SPIR, LLVM 16.0.0, SLEEF, FP16, DISTRO, POCL_DEBUG
  Platform Name:                                 Portable Computing Language
  Platform Vendor:                               The pocl project
  Platform Extensions:                           cl_khr_icd cl_pocl_content_size


  Platform Name:                                 AMD Accelerated Parallel Processing
Number of devices:                               2
  Device Type:                                   CL_DEVICE_TYPE_GPU
  Vendor ID:                                     1002h
  Board name:                                    AMD Radeon RX 6900 XT
...


Versions:
dnf list --installed | grep rocm

rocm-clinfo.x86_64                                   5.4.3-2.fc38                       @updates
rocm-comgr.x86_64                                    16.0-2.fc38                        @updates
rocm-comgr-debuginfo.x86_64                          5.3.0-1.fc37                       @updates-debuginfo
rocm-comgr-devel.x86_64                              16.0-2.fc38                        @updates
rocm-compilersupport-debugsource.x86_64              5.3.0-1.fc37                       @updates-debuginfo
rocm-device-libs.x86_64                              16.0-1.fc38                        @fedora
rocm-opencl.x86_64                                   5.4.3-2.fc38                       @updates
rocm-opencl-devel.x86_64                             5.4.3-2.fc38                       @updates
rocm-runtime.x86_64                                  5.4.1-3.fc38                       @fedora
rocm-runtime-devel.x86_64                            5.4.1-3.fc38                       @fedora
rocm-smi.noarch                                      4.0.0-8.fc38                       @fedora
rocminfo.x86_64                                      5.4.1-2.fc38                       @fedora
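
The listing above mixes fc38 builds with two fc37 debug packages (both at 5.3.0-1.fc37). Whether that matters for the crash is not established in this report, but a quick way to spot such mixes is to group the installed packages by dist tag. The following is a minimal sketch; the function name `group_by_dist_tag` is hypothetical, and it assumes the three-column `name version repo` layout shown above.

```python
import re
from collections import defaultdict


def group_by_dist_tag(listing):
    """Group package names by the Fedora dist tag (fc37, fc38, ...) of their version.

    `listing` is text in the three-column format printed by
    `dnf list --installed`: name.arch, version-release, repository.
    """
    groups = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank lines and headers
        name, version = fields[0], fields[1]
        # The dist tag is the trailing ".fcNN" of the release string.
        match = re.search(r"\.(fc\d+)$", version)
        groups[match.group(1) if match else "unknown"].append(name)
    return dict(groups)


if __name__ == "__main__":
    sample = """\
rocm-clinfo.x86_64            5.4.3-2.fc38   @updates
rocm-comgr-debuginfo.x86_64   5.3.0-1.fc37   @updates-debuginfo
"""
    for tag, names in sorted(group_by_dist_tag(sample).items()):
        print(tag, ", ".join(names))
```

Feeding it the real output of `dnf list --installed | grep rocm` would show at a glance which packages were left behind by a release upgrade.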

Comment 5 Jeremy Newton 2023-04-21 16:42:16 UTC
Thanks for the feedback, I'll contact upstream with this info.

Comment 6 Jeremy Newton 2023-04-21 20:38:19 UTC
So I back-ported the fix to f37 and I can't reproduce any error right now with this update:
https://bodhi.fedoraproject.org/updates/FEDORA-2023-994e29c721

It's possible the LLVM 16 upgrade in Fedora 38 causes a regression (as compared to Fedora 37's LLVM 15), or maybe there's something unique to your system that makes it not reproduce on my end. I think I might have an AMD Radeon RX 6700 accessible to me that I can test out, as the current HW on my system is from the RX 5xxx series.

I also spoke to the upstream developers, and the fix that they suggested might require major packaging changes in other Fedora packages. Either way, I'll need to reproduce the problem before I can proceed with any fix.

Comment 7 vovkap97 2023-04-22 09:14:32 UTC
I've tested with f37 in toolbox:  
toolbox create --release 37
sudo dnf install 'rocm-*'


rocm-clinfo
Number of platforms:                             1
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 2.1 AMD-APP (3513.0)
  Platform Name:                                 AMD Accelerated Parallel Processing
  Platform Vendor:                               Advanced Micro Devices, Inc.
  Platform Extensions:                           cl_khr_icd cl_amd_event_callback


  Platform Name:                                 AMD Accelerated Parallel Processing
Number of devices:                               2

sudo dnf install mesa-libOpenCL

rocm-clinfo
Segmentation fault (core dumped)


backtrace:

Thread 1 "rocm-clinfo" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff72cc2de in clover::device::supports_ir (ir=PIPE_SHADER_IR_NATIVE,
    this=0x55555569dc10) at ../src/gallium/frontends/clover/core/device.cpp:502
#2  clover::device::device (this=this@entry=0x55555569dc10, platform=..., ldev=0x5555556a1210)
    at ../src/gallium/frontends/clover/core/device.cpp:165
#3  0x00007ffff72db5cf in clover::create<clover::device, clover::platform&, pipe_loader_device*&>
    () at ../src/gallium/frontends/clover/util/pointer.hpp:240
#4  clover::platform::platform (
    this=this@entry=0x7ffff7537100 <(anonymous namespace)::_clover_platform>)
    at ../src/gallium/frontends/clover/core/platform.cpp:41
#5  0x00007ffff729f7fd in __static_initialization_and_destruction_0 (__priority=65535,
    __initialize_p=1) at ../src/gallium/frontends/clover/api/platform.cpp:34
#6  0x00007ffff7fcccde in call_init (env=0x7fffffffe1d8, argv=0x7fffffffe1c8, argc=1,
    l=<optimized out>) at dl-init.c:70
#7  call_init (l=<optimized out>, argc=1, argv=0x7fffffffe1c8, env=0x7fffffffe1d8)
    at dl-init.c:26
#8  0x00007ffff7fccdcc in _dl_init (main_map=0x555555614f20, argc=1, argv=0x7fffffffe1c8,
    env=0x7fffffffe1d8) at dl-init.c:117
#9  0x00007ffff7ca8f14 in __GI__dl_catch_exception (exception=<optimized out>,
    operate=<optimized out>, args=<optimized out>)
    at /usr/src/debug/glibc-2.36-9.fc37.x86_64/elf/dl-error-skeleton.c:182
#10 0x00007ffff7fd3736 in dl_open_worker (a=a@entry=0x7fffffffd7c0) at dl-open.c:808
#11 0x00007ffff7ca8ebe in __GI__dl_catch_exception (exception=<optimized out>,
    operate=<optimized out>, args=<optimized out>)
    at /usr/src/debug/glibc-2.36-9.fc37.x86_64/elf/dl-error-skeleton.c:208
#12 0x00007ffff7fd3acc in _dl_open (file=0x555555613970 "libMesaOpenCL.so.1",
    mode=<optimized out>, caller_dlopen=0x7ffff7f9789f <_open_driver+303>, nsid=<optimized out>,
    argc=1, argv=0x7fffffffe1c8, env=0x7fffffffe1d8) at dl-open.c:884
#13 0x00007ffff7be123c in dlopen_doit (a=a@entry=0x7fffffffda30) at dlopen.c:56
#14 0x00007ffff7ca8ebe in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffd990,
    operate=<optimized out>, args=<optimized out>)
    at /usr/src/debug/glibc-2.36-9.fc37.x86_64/elf/dl-error-skeleton.c:208
#15 0x00007ffff7ca8f73 in __GI__dl_catch_error (objname=0x7fffffffd9e8,
    errstring=0x7fffffffd9f0, mallocedp=0x7fffffffd9e7, operate=<optimized out>,
    args=<optimized out>) at /usr/src/debug/glibc-2.36-9.fc37.x86_64/elf/dl-error-skeleton.c:227
#16 0x00007ffff7be0d0f in _dlerror_run (operate=operate@entry=0x7ffff7be11e0 <dlopen_doit>,
    args=args@entry=0x7fffffffda30) at dlerror.c:138
#17 0x00007ffff7be12f1 in dlopen_implementation (dl_caller=<optimized out>,
    mode=<optimized out>, file=<optimized out>) at dlopen.c:71
#18 ___dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:81
#19 0x00007ffff7f9789f in _load_icd (lib_path=0x555555613970 "libMesaOpenCL.so.1", num_icds=1)
    at /usr/src/debug/ocl-icd-2.3.1-2.fc37.x86_64/ocl_icd_loader.c:208
#20 _open_driver (num_icds=num_icds@entry=1,
    dir_path=dir_path@entry=0x7ffff7fac0a4 "/etc/OpenCL/vendors",
    file_path=file_path@entry=0x555555578f43 "mesa.icd")
    at /usr/src/debug/ocl-icd-2.3.1-2.fc37.x86_64/ocl_icd_loader.c:261
#21 0x00007ffff7f9ad16 in _open_drivers (dir_path=<optimized out>, dir=<optimized out>)
    at /usr/src/debug/ocl-icd-2.3.1-2.fc37.x86_64/ocl_icd_loader.c:274
#22 __initClIcd () at /usr/src/debug/ocl-icd-2.3.1-2.fc37.x86_64/ocl_icd_loader.c:767
#23 _initClIcd_real () at /usr/src/debug/ocl-icd-2.3.1-2.fc37.x86_64/ocl_icd_loader.c:824
#24 0x00007ffff7f9ce14 in _initClIcd ()
    at /usr/src/debug/ocl-icd-2.3.1-2.fc37.x86_64/ocl_icd_loader.c:853
#25 clGetPlatformIDs (num_entries=0, platforms=0x0, num_platforms=0x7fffffffdc14)
    at /usr/src/debug/ocl-icd-2.3.1-2.fc37.x86_64/ocl_icd_loader.c:1018
#26 0x000055555555e547 in cl::Platform::get (platforms=platforms@entry=0x7fffffffdd90)
    at /usr/src/debug/rocm-opencl-5.4.3-1.fc37.x86_64/tools/clinfo/../../khronos/headers/opencl2.2/CL/../CL/cl2.hpp:2474
#27 0x0000555555556f58 in main (argc=<optimized out>, argv=<optimized out>)
    at /usr/src/debug/rocm-opencl-5.4.3-1.fc37.x86_64/tools/clinfo/clinfo.cpp:75
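
Frames #19 to #25 show the ocl-icd loader reading `mesa.icd` from `/etc/OpenCL/vendors` and calling `dlopen` on `libMesaOpenCL.so.1`, so every ICD registered in that directory ends up loaded into the same process. A minimal sketch for listing which libraries would be loaded is given below. It assumes the usual ICD convention that each `.icd` file names its library on the first non-empty line; the function name `list_icds` is hypothetical.

```python
from pathlib import Path


def list_icds(vendors_dir="/etc/OpenCL/vendors"):
    """Return (icd_file_name, library) pairs for each *.icd file in vendors_dir.

    Assumes each .icd file names its library (a soname or a path)
    on its first non-empty line.
    """
    directory = Path(vendors_dir)
    if not directory.is_dir():
        return []
    entries = []
    for icd in sorted(directory.glob("*.icd")):
        lines = [
            ln.strip()
            for ln in icd.read_text(errors="replace").splitlines()
            if ln.strip()
        ]
        entries.append((icd.name, lines[0] if lines else ""))
    return entries


if __name__ == "__main__":
    for name, library in list_icds():
        print(f"{name}: {library}")
```

If more than one entry appears, several OpenCL implementations share one process. ocl-icd also documents an `OCL_ICD_VENDORS` environment variable for pointing the loader at a different directory or a single `.icd` file; whether that sidesteps the crash described here was not tested in this report.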

Comment 8 Jeremy Newton 2023-05-02 04:07:23 UTC
*** Bug 2149162 has been marked as a duplicate of this bug. ***

Comment 9 Jeremy Newton 2023-05-05 17:15:20 UTC
*** Bug 2157619 has been marked as a duplicate of this bug. ***

Comment 10 Jeremy Newton 2023-05-05 17:58:26 UTC
Some observations:
- I can't reproduce this on an up-to-date Fedora 37 system
- I can reproduce it with an RX 6750 XT on Fedora 38
- I can't reproduce it on Fedora 37 with a Fedora 38 toolbox on the same HW

Seems strange. I'll update this if I ever figure it out.

Comment 11 Jeremy Newton 2023-05-05 18:01:10 UTC
* Note: I can't reproduce this on other HW, period.

Comment 12 Jeremy Newton 2023-05-31 03:00:29 UTC
I believe this update fixes the issue:
https://bodhi.fedoraproject.org/updates/FEDORA-2023-68012d0819

I can't reproduce it anymore now.

Can anyone confirm?

Comment 13 vovkap97 2023-06-01 09:14:24 UTC
It's working now, with and without mesa. Good job!

Comment 14 Fedora Update System 2023-06-02 03:40:44 UTC
FEDORA-2023-68012d0819 has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2023-68012d0819

Comment 15 Jeremy Newton 2023-06-02 03:41:34 UTC
No problem! I tagged it on the update.

Comment 16 Fedora Update System 2023-06-03 02:44:28 UTC
FEDORA-2023-68012d0819 has been pushed to the Fedora 38 stable repository.
If the problem still persists, please make note of it in this bug report.

