Bug 1205880

Summary: vlc dlopen hang with 2.21.90-8
Product: [Fedora] Fedora
Reporter: Yanko Kaneti <yaneti>
Component: glibc
Assignee: Carlos O'Donell <codonell>
Status: CLOSED DUPLICATE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: rawhide
CC: arjun, codonell, fweimer, jakub, law, pfrankli, spoyarek, vdanjean.ml
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2015-05-07 21:35:27 UTC
Type: Bug
Attachments:
pstack (flags: none)

Description Yanko Kaneti 2015-03-25 20:41:33 UTC
Created attachment 1006467
pstack

Description of problem:
After the update in rawhide to glibc-2.21.90-8, vlc's threads that load modules via dlopen deadlock when attempting to play some files, e.g. an .mp4 video. The thread responsible for the UI keeps working.

Downgrading glibc to 2.21.90-7 fixes the problem.

Version-Release number of selected component (if applicable):
glibc-2.21.90-8.fc23.x86_64

How reproducible:
Always

Attaching a pstack at the time of the lock

Comment 1 Carlos O'Donell 2015-04-01 20:56:42 UTC
The process backtrace doesn't provide any really useful information. It could just be that the threads have hit a pre-existing race condition.

The next step is for someone to look through the difference in the source trees between -7 and -8.

Have you seen any interesting differences that might cause the problem you're seeing?

Comment 2 Carlos O'Donell 2015-04-01 21:12:28 UTC
Actually, I noticed something I missed: the first thread has called dlopen recursively by invoking dlopen in a constructor, followed by calling non-async-signal-safe functions. The libavcodec_plugin.so needs to be rearchitected to avoid initializing OpenCL during startup. That initialization has to be done at some later time, after the plugin itself has been fully loaded.

To be clear:
- Recursive calls to dlopen at present are only allowed to call async-signal-safe functions. This is violated by having libavcodec_plugin attempt to load and initialize the OpenCL runtime in a constructor (during dlopen), which then calls dlopen itself.

More investigation is needed to determine whether this can be fixed upstream.

Comment 3 Yanko Kaneti 2015-04-06 16:44:02 UTC
FWIW this is still happening in -9 compared to -7

The libavcodec_plugin uses ffmpeg, which itself does something with OpenCL. While your suggestion of re-architecting this tangled web might be feasible, I am not the right person to investigate it.

It's interesting to me what specific change brought this about, and how it might affect other software...

Comment 4 Yanko Kaneti 2015-05-07 11:37:12 UTC
JFTR this is still happening the same way with -11

Comment 5 Carlos O'Donell 2015-05-07 21:35:27 UTC
(In reply to Yanko Kaneti from comment #3)
> FWIW this is still happening in -9 compared to -7
> 
> The libavcodec_plugin uses ffmpeg, which itself does something with OpenCL.
> While your suggestion of re-architecting this tangled web might be
> feasible, I am not the right person to investigate it.
> 
> It's interesting to me what specific change brought this about, and how it
> might affect other software...

My opinion is that this is a fundamental design flaw in ocl-icd. It must not use dlopen from a constructor that runs when it itself is being loaded. It must use late binding or not support shared linkage.
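
For illustration only, the "late binding" idea can be sketched as a small Python model. This is not ocl-icd or glibc code: `LazyLibrary` and `fake_dlopen` are hypothetical names, and the loader callable merely stands in for a real dlopen() call (in Python, something like `ctypes.CDLL(path)`). The point is that nothing is loaded when the wrapper is created, only on the first call that needs the handle.

```python
import threading


class LazyLibrary:
    """Hypothetical wrapper: defers loading until the handle is first needed."""

    def __init__(self, loader):
        self._loader = loader          # stand-in for dlopen(); an assumption
        self._handle = None
        self._lock = threading.Lock()  # guards the one-time initialization

    def handle(self):
        # Double-checked initialization: load on first use only, never
        # while the enclosing module is itself still being loaded.
        if self._handle is None:
            with self._lock:
                if self._handle is None:
                    self._handle = self._loader()
        return self._handle


load_calls = []


def fake_dlopen():
    # Hypothetical loader used for the demonstration.
    load_calls.append("opened")
    return object()


opencl = LazyLibrary(fake_dlopen)         # construction loads nothing
loads_at_construction = len(load_calls)   # still 0 at this point
first = opencl.handle()                   # first use triggers the load
second = opencl.handle()                  # later uses reuse the same handle
```

In C the same shape would typically be built around a one-time initialization primitive called from the exported functions rather than from a constructor.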

*** This bug has been marked as a duplicate of bug 1219646 ***

Comment 6 Vincent 2015-05-19 18:52:59 UTC
Just for the record (the info is already in the other bug report, of which this one is a duplicate): the bug is still there after removing the ocl-icd constructor. It appears when ocl-icd, in a regular exported function, calls dlopen() on a library that uses pthread_join() in its constructor.
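
The scenario described here can be modeled in a few lines of Python. This is a sketch under stated assumptions, not glibc's implementation: it assumes the loader holds a recursive lock while a library's constructor runs, and that the thread being joined needs that same lock at some point (for example because it calls dlopen itself). All names are hypothetical, and a timeout replaces the real indefinite wait so the model terminates.

```python
import threading

# Assumption: the loader serializes work with a recursive lock, so the
# thread already inside dlopen may re-enter it but other threads must wait.
loader_lock = threading.RLock()


def model_dlopen(constructor):
    # The lock stays held while the library's constructor runs.
    with loader_lock:
        return constructor()


def constructor_that_joins(timeout=0.2):
    result = {}

    def worker():
        # Assumption: the new thread needs the loader lock at some point.
        got = loader_lock.acquire(timeout=timeout)
        result["worker_got_lock"] = got
        if got:
            loader_lock.release()

    t = threading.Thread(target=worker)
    t.start()
    # A real pthread_join() has no timeout: if the worker waited on the
    # lock indefinitely, this join would never return.
    t.join()
    return result["worker_got_lock"]


# Same-thread re-entry succeeds because the lock is recursive.
recursive_ok = model_dlopen(lambda: model_dlopen(lambda: True))

# Inside the modeled dlopen, the worker cannot get the lock.
blocked_inside_dlopen = model_dlopen(constructor_that_joins)

# Outside of it, the same constructor code runs without contention.
works_outside_dlopen = constructor_that_joins()
```

In the model, the worker only gives up because of the timeout. Without it, the constructor would wait for the worker while the worker waits for the lock held by the constructor's caller, which matches the shape of the hang reported here.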