Bug 1219646 - ocl-icd implementation is fundamentally flawed.
Summary: ocl-icd implementation is fundamentally flawed.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: ocl-icd
Version: 23
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Rob Clark
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
: 1205880 (view as bug list)
Depends On: 1223055
Blocks:
TreeView+ depends on / blocked
 
Reported: 2015-05-07 21:33 UTC by Carlos O'Donell
Modified: 2015-08-12 12:39 UTC (History)
12 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-08-12 12:39:25 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
khronos opencl icd license text (1.79 KB, text/plain)
2015-05-18 15:31 UTC, Rob Clark
no flags Details
gnome-photos deadlock (13.07 KB, text/plain)
2015-05-19 07:35 UTC, Yanko Kaneti
no flags Details
clinfo pstack (7.70 KB, text/plain)
2015-05-19 09:02 UTC, Yanko Kaneti
no flags Details

Description Carlos O'Donell 2015-05-07 21:33:24 UTC
The ocl-icd implementation registers the constructor _initClIcd in order to load a driver. This is a fundamentally flawed design.

To give you an example: vlc creates a plugin that indirectly uses a library that needs OpenCL functionality. That library is linked against libOpenCL.so.1 as it should be. When that plugin is loaded the core runtime (glibc) is in the middle of dlopen and calling library constructors. The _initClIcd function itself calls dlopen to load the OpenCL drivers. This results in a recursive call to dlopen. Once you recurse into the runtime you are only allowed to call synchronously-reentrant-safe functions or asynchronous-signal-safe functions. Nothing more complicated than that is supported. For ocl-icd to run dlopen from a constructor is a fundamentally flawed design. The pseudo-libOpenCL.so.1 needs to be redesigned to use stub functions that use late binding after all the DSOs have been loaded and dlopen is no longer running.

Without this fix in ocl-icd the vlc libavcodec_plugin will call ffmpeg which is linked against libOpenCL.so.1 and this will result in undefined behaviour as libOpenCL.so.1 invokes dlopen (recursively now) and then calls other non SR-safe or AS-safe functions.

The other alternative is to disable libOpenCL support in ffmpeg, but I'm highlighting the flaw in ocl-icd since it's this design which is broken.

Comment 1 Carlos O'Donell 2015-05-07 21:35:27 UTC
*** Bug 1205880 has been marked as a duplicate of this bug. ***

Comment 2 Yanko Kaneti 2015-05-18 13:43:52 UTC
Another deadlock casualy here on rawhide is gnome-photos->gegl03->ocl-icd
Bug 1222551

Comment 3 Fabian Deutsch 2015-05-18 13:58:13 UTC
So, I believe I see the issue.

After all there are a few more ICD implementations out there.

Carlos, do you know of any icd implementation which is handling this better?

I'm at least aware if the khronos icd implementation
https://www.khronos.org/news/permalink/opencl-installable-client-driver-icd-loader

Comment 4 Rob Clark 2015-05-18 14:25:54 UTC
fwiw, I don't think the khronos icd implementation would have this issue.  But it does have a slightly strange license.

Comment 5 Fabian Deutsch 2015-05-18 14:35:33 UTC
There is also https://code.google.com/p/freeocl/ which is gpled.

Rob, do you know if we could use khronos' icd loader?

Comment 6 Rob Clark 2015-05-18 15:31:25 UTC
Created attachment 1026753 [details]
khronos opencl icd license text

Comment 7 Rob Clark 2015-05-18 15:34:01 UTC
(In reply to Fabian Deutsch from comment #5)
> There is also https://code.google.com/p/freeocl/ which is gpled.

that appears to be an opencl implementation, rather than an icd.

(ie. the icd is what would load the N different opencl implementations available for cpu and/or whatever gpu's you have)

> Rob, do you know if we could use khronos' icd loader?

I would sort of prefer it, except I'm a bit unsure about the license terms (which I have attached).  IANAL but seems like first clause would prohibit hosting of src rpms.  Also a bit unsure about the fourth clause.

Comment 8 François Cami 2015-05-18 16:32:44 UTC
IANAL but the license terms seems incompatible with opensource in general.
freeocl seems dormant at best...
I've contacted the ocl-icd upstream devs.

Comment 9 Fabian Deutsch 2015-05-18 17:37:22 UTC
I just got a reply from u/s:

"""
Things should be fixed in the git. We are expecting a tonight or
tomorrow release.
"""

So good news, and nice reaction, let's see what git shows us tomorrow.

Comment 11 Yanko Kaneti 2015-05-19 07:21:18 UTC
This seems to have helped here with both the vlc->ffmpeg->ocl-icd and gnome-phohos->gegl03->ocl-icd cases which no longer deadlock on startup. Both seem to load libOpenCL, whether or not its actually used is another matter.

Thanks


On the other hand, in what might a related issue,
clinfo + mesa-libOpenCL (as the only installed icd) doesn't seem to work and deadlocks again somewhere in the dynamic loader.

clinfo-0.1-0.6.git20150215.94fdb47.fc23.x86_64
mesa-libOpenCL-10.6.0-0.devel.6.5a55f68.fc23.x86_64

Comment 12 Yanko Kaneti 2015-05-19 07:35:06 UTC
Created attachment 1026987 [details]
gnome-photos deadlock

Hmm perhaps I spoke too soon. While vlc works, its probably because ffmpeg actually doesn't use much of OpenCL when playing mp4 video.

gnome-photos + mesa-libOpenCL actually deadlocks again. Attached a pstack.

Comment 13 Vincent 2015-05-19 08:39:21 UTC
I a first glance at the provided backtrace, ocl-icd is calling dlopen to load Mesa ICD because clGetPlatformIDs() was called (by the plugin code). There is no way for ocl-icd to delay again the dlopen: we need to call fonctions from the available ICD to complete the  clGetPlatformIDs() call, so we need to load ICD.

Does the bug occur only with the mesa-libOpenCL implementation or do other OpenCL implementation also trigger the bug?

Looking more in details, I see the thread 8 that is executing radeon code. Can it be a problem with Mesa ICD initialization?

Would it be possible to see the DSO involved for each line in the backtrace? I can recognize ocl-icd calls and some other but not all.
  For DSO that do not have their debug symbols available, we have the address so we need the memory mapping to know (/proc/PID/maps)
  For DSO that have their debug symbols available, the backtrace give the source file name but I cannot recognize all of them. Help would be welcome.

  Regards,
    Vincent

PS: about vocabulary in OpenCL world: an ICD is a OpenCL implementation that can be loaded by an ICD loader. So, the ICD loader is (here) ocl-icd (Khronos has an other (nonfree) one). It is an libOpenCL.so.1 ELF library that will dynamically load available ICD.
  mesa-libOpenCL (and pocl, and NVidia, Intel, ... ones) are ICD, i.e. OpenCL implementation that can be loaded by an ICD loader (this allows a OpenCL program to select its OpenCL implementation at runtime, or even to use several implementation at the same time).
  Some OpenCL implementation can also be used 'standalone'. In this case, they provide directly the library (generally as libOpenCL.so) but this forbid the possibility to choose the implementation at runtime and to use several implementation.
  Until ocl-icd exists, free OpenCL implementation where required to use the standalon scheme.
  And I aggree that ocl-icd should have been called ocl-icd-loader. Its name is a bit misleading (ie it is *not* an ICD)

Comment 14 Yanko Kaneti 2015-05-19 09:02:27 UTC
Created attachment 1027044 [details]
clinfo pstack

Attaching the pstack of the clifno deadlock which I presume should be less complicated because there is no plugin code involved before the ocd calls

Comment 15 Yanko Kaneti 2015-05-19 09:26:07 UTC
To reiterate in case someone hasn't followed this from the start:

This whole ocl deadlocking problem was revealed here starting with the rawhide glibc update from 2.21.90-7 to -8.  Just downgrading glibc to -7 alongside the other current rawhide bits and ocl-icd 2.2.4  is enough to make both clinfo and gnome-photos work. 

What upstream changes affecting the loader happened in glibc between -7 and -8 I don't know, perhaps the glibc people here could enlighten us.

Comment 16 Vincent 2015-05-19 10:58:59 UTC
We see several threads waiting for exit. They are blocked into tls_get_addr_tail.

And the dlopen call is blocked by a pthread_join in the Mesa library constructor waiting for one of the previous (blocked) thread to exit.

In the glibc sources on my laptop (2.21-0experimental0 from Debian/experimental), I see this in the dl-tls.c file, near the line 765:

tls_get_addr_tail (GET_ADDR_ARGS, dtv_t *dtv, struct link_map *the_map)
{
[...]
  /* Make sure that, if a dlopen running in parallel forces the
     variable into static storage, we'll wait until the address in the
     static TLS block is set up, and use that.  If we're undecided
     yet, make sure we make the decision holding the lock as well.  */
[... code with locking ...]
https://sources.debian.net/src/glibc/2.21-0experimental0/elf/dl-tls.c/#L763

  So, there is clearly here a wanted interaction between the tls_get_addr_tail and the dlopen code.

  What is the exact version of the libc used on the system (I do not use at all Fedora/Redhat myself)?
  And, if there is no fix already in the libc, I think we will need a libc specialist to look at this bug.

  But, for me, it does not seem to be related to ocl-icd itself anymore.

  Regards,
    Vincent

Comment 17 Vincent 2015-05-19 11:08:05 UTC
My previous comments have been written before Yanko Kaneti ones. Seeing its comment, I would be very interested by the diff of the glibc in rawhide between 2.21.90-7 and -8.

I suspect that the initial problem was not the dlopen in the dlopen (neverless, in any case, this is fixed in ocl-icl and I think it is indeed better : application linked with OpenCL but not using it won't load ICD now) but the problem still here. It seems that dlopen get a lock that block pthread_exit and, if the library constructor calls pthread_join, we can have a deadlock.

Perhaps the glibc upgrade introduce this bug/regression. Perhaps, it only reveals it more frequently due to different default timings in various functions.

  Regards
    Vincent

Comment 18 Fabian Deutsch 2015-05-19 12:19:21 UTC
This commit shows what versions were used in Fedora:
http://pkgs.fedoraproject.org/cgit/glibc.git/commit/?id=85b542148e0885ccdb8920e243ec0ffb061e2300

And the following link shows the difference between the two used glibc commits:
https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=2e807f2;hp=7e9c7b9

Does this help Vincent?

Comment 19 Carlos O'Donell 2015-05-19 16:50:42 UTC
(In reply to Yanko Kaneti from comment #15)
> To reiterate in case someone hasn't followed this from the start:
> 
> This whole ocl deadlocking problem was revealed here starting with the
> rawhide glibc update from 2.21.90-7 to -8.  Just downgrading glibc to -7
> alongside the other current rawhide bits and ocl-icd 2.2.4  is enough to
> make both clinfo and gnome-photos work. 
> 
> What upstream changes affecting the loader happened in glibc between -7 and
> -8 I don't know, perhaps the glibc people here could enlighten us.

That question is a red herring. If you do something that is undefined, then the behaviour will eventually break. The behaviours expected by ocl-icd are not supported and will fail randomly depending on internal implementation detail changes in glibc.

Comment 20 Carlos O'Donell 2015-05-19 16:56:15 UTC
(In reply to Vincent from comment #16)
> We see several threads waiting for exit. They are blocked into
> tls_get_addr_tail.
> 
> And the dlopen call is blocked by a pthread_join in the Mesa library
> constructor waiting for one of the previous (blocked) thread to exit.
> 
> In the glibc sources on my laptop (2.21-0experimental0 from
> Debian/experimental), I see this in the dl-tls.c file, near the line 765:
> 
> tls_get_addr_tail (GET_ADDR_ARGS, dtv_t *dtv, struct link_map *the_map)
> {
> [...]
>   /* Make sure that, if a dlopen running in parallel forces the
>      variable into static storage, we'll wait until the address in the
>      static TLS block is set up, and use that.  If we're undecided
>      yet, make sure we make the decision holding the lock as well.  */
> [... code with locking ...]
> https://sources.debian.net/src/glibc/2.21-0experimental0/elf/dl-tls.c/#L763
> 
>   So, there is clearly here a wanted interaction between the
> tls_get_addr_tail and the dlopen code.
> 
>   What is the exact version of the libc used on the system (I do not use at
> all Fedora/Redhat myself)?
>   And, if there is no fix already in the libc, I think we will need a libc
> specialist to look at this bug.
> 
>   But, for me, it does not seem to be related to ocl-icd itself anymore.

Please file a new bug and we'll move forward with new stack traces to see where things are stuck.

Comment 21 Yanko Kaneti 2015-05-19 17:24:04 UTC
(In reply to Carlos O'Donell from comment #20)
> 
> Please file a new bug and we'll move forward with new stack traces to see
> where things are stuck.

https://bugzilla.redhat.com/show_bug.cgi?id=1223055

Comment 22 Vincent 2015-05-19 18:41:54 UTC
(In reply to Carlos O'Donell from comment #19)
> That question is a red herring. If you do something that is undefined, then
> the behaviour will eventually break. The behaviours expected by ocl-icd are
> not supported and will fail randomly depending on internal implementation
> detail changes in glibc.

For the record, ocl-icd now only calls dlopen in a normal exported function. So ocl-icd *do not do* anything not supported (unless calling dlopen is forbidden).

This is the dlopen-ed Mesa library that seems to call pthread stuff in its constructor. I do not know if this is supported but, if not, it would mean that there are severe restrictions about stuff that can be called in constructors. And I never see such information until now.

Comment 23 Fabian Deutsch 2015-05-19 18:52:13 UTC
Thanks all for this input. With the informations from comment 20 and comment 21 and comment 22 I'm moving this bug to MODIFIED, as the ocl-icd side seems to be fixed.

Comment 24 Carlos O'Donell 2015-05-19 19:22:15 UTC
(In reply to Vincent from comment #22)
> (In reply to Carlos O'Donell from comment #19)
> > That question is a red herring. If you do something that is undefined, then
> > the behaviour will eventually break. The behaviours expected by ocl-icd are
> > not supported and will fail randomly depending on internal implementation
> > detail changes in glibc.
> 
> For the record, ocl-icd now only calls dlopen in a normal exported function.
> So ocl-icd *do not do* anything not supported (unless calling dlopen is
> forbidden).

I agree. I just looked at the new ocl-icde code and it is as compliant as it can get. I'm very happy to see such a change made quickly in the upstream implementation.

> This is the dlopen-ed Mesa library that seems to call pthread stuff in its
> constructor. I do not know if this is supported but, if not, it would mean
> that there are severe restrictions about stuff that can be called in
> constructors. And I never see such information until now.

The dlopen-ed Mesa library is allowed to do pthread things, but we will have to see how the analysis goes for the new issue.

In summary I agree that ocl-icd looks fully fixed now.

Comment 25 Jan Kurik 2015-07-15 14:11:07 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 23 development cycle.
Changing version to '23'.

(As we did not run this process for some time, it could affect also pre-Fedora 23 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 23 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora23


Note You need to log in before you can comment on or make changes to this bug.