Created attachment 797015 [details]
vnc log

Description of problem:
After doing a default F20 graphical install on a PPC64 machine, I started vncserver. When I connect using vncviewer I only see a black screen. If I start opening and closing X clients, like firefox or gnome-terminal, from an ssh session (properly setting $DISPLAY), sometimes I can see the big "Oh No! Something has gone wrong" message just for a moment.

I'm attaching vnc and messages logs. Please let me know if you need further info to help debug this issue.

Version-Release number of selected component (if applicable):
xorg-x11-server-Xorg-1.14.2-9.fc20.ppc64
mesa-libEGL-9.2-1.20130902.fc20.ppc64
mesa-libGL-9.2-1.20130902.fc20.ppc64
mesa-dri-drivers-9.2-1.20130902.fc20.ppc64
llvm-libs-3.3-0.8.rc3.fc20.ppc64

How reproducible:
Always

Steps to Reproduce:
1. Install F20 on a PPC64 machine
2. Install tigervnc-server
3. vncserver :1
4. connect to VNC server using vncviewer

Actual results:

Expected results:

Additional info:
Created attachment 797016 [details] messages log
Proposing as PPC64 Alpha blocker, as a working graphical environment is part of the alpha criteria.
Sep 10 17:11:15 duff gnome-session[33991]: WARNING: Application 'gnome-shell.desktop' failed to register before timeout

is an important message. It means gnome-shell didn't register with gnome-session. But I don't see any other error messages. Is it possible it's deadlocked on something? Simply attaching with gdb is how I normally debug things. Just run:

gdb attach $(pidof gnome-shell)

and then, at the gdb prompt:

t a a bt

(short for "thread apply all bt").
KDE works fine on F20, so I'm moving this bug to gnome-shell, which seems more appropriate than the X server.
I've taken a look at it today and even looking through the changes in gnome-shell i can't see anything obvious. Same for gdb outputs, there was nothing really obvious there either. :( If anyone with more insight could take a peek at it that would be ace. At least for RH folks i can give access to a freshly installed Fedora 20 with all shizzle necessary. Not sure if we've updated the OSU machine yet with Fedora 20 (i suspect not). Thanks & regards, Phil
so i got access to one of phil's machines and noticed this in the backtrace:

Thread 1 (Thread 0x1ffffaf1ba50 (LWP 28040)):
#0  0x00001ffffcd81c1c in __pthread_cond_wait (cond=0x1001dac2440, mutex=0x1001dac1350) at pthread_cond_wait.c:178
#1  0x00001ffffcaf6260 in pa_cond_wait (c=<optimized out>, m=<optimized out>) at pulsecore/mutex-posix.c:139
#2  0x00001fffffc748ec in pa_threaded_mainloop_wait (m=0x1001dac14b0) at pulse/thread-mainloop.c:210
#3  0x00001ffff04b4da8 in pulse_driver_open (c=0x1001d5b53f0) at pulse.c:436
#4  0x00001fffff09046c in driver_open (c=0x1001d5b53f0) at dso.c:273
#5  0x00001fffff0850e8 in context_open_unlocked (c=0x1001d5b53f0) at common.c:293
#6  0x00001fffff0859bc in ca_context_open (c=0x1001d5b53f0) at common.c:318
#7  0x00001fffffec9030 in shell_global_init (global=0x1001d5cf840) at shell-global.c:283
#8  0x00001ffffd34c940 in g_type_create_instance (type=1100009388384) at gtype.c:1868
#9  0x00001ffffd3250f8 in g_object_new_internal (class=class@entry=0x1001dab3cd0, params=<optimized out>, params@entry=0x3fffc5828458, n_params=<optimized out>, n_params@entry=1) at gobject.c:1746
#10 0x00001ffffd3282d0 in g_object_new_valist (object_type=1100009388384, first_property_name=<optimized out>, var_args=0x3fffc58286a8 "") at gobject.c:2002
#11 0x00001fffffeca35c in _shell_global_init (first_property_name=0x100032c0 "session-mode") at shell-global.c:497
#12 0x000000001000228c in main (argc=1, argv=0x3fffc5828b78) at main.c:437
(gdb) thread 1

so the process is stalled in early initialization waiting to connect to pulseaudio.

I then ran

mv /usr/lib64/libcanberra-0.30/libcanberra-pulse.so /usr/lib64/libcanberra-0.30/libcanberra-pulse.so.save

to prevent the shell from trying to connect to pulseaudio at start up and everything started working.

So this is some issue in pulseaudio. reassigning. Could be some error path in pulseaudio is missing a pa_cond_signal call or something like that.
Thanks a lot Ray. I've reassigned the bug to pulseaudio now. For access to the Fedora 20 ppc64 instance with the issue just let me know via private email or ping on IRC. Thanks & regards, Phil
Can someone post verbose PA server logs as well?
Ok, so here's the full pulseaudio logging, following the debugging described in http://fedoraproject.org/wiki/How_to_debug_PulseAudio_problems

The first part is the startup with pulseaudio -vvvvv
The second part is the log after i do a vncserver

Any clues as to what causes the problem?

Thanks & regards, Phil
Created attachment 803919 [details] pulseaudio trace at startup of pulseaudio itself
Created attachment 803920 [details] pulseaudio trace at startup of vncserver
OK, i've also tried it with the pulseaudio stack fully downgraded to Fedora 19 and i'm getting the same error. So it seems the error comes from somewhere in between gnome-session and pulseaudio:

Installing : pulseaudio-libs-3.0-10.fc19.ppc64
Installing : pulseaudio-3.0-10.fc19.ppc64
Installing : pulseaudio-utils-3.0-10.fc19.ppc64
Installing : pulseaudio-module-x11-3.0-10.fc19.ppc64
Installing : pulseaudio-module-bluetooth-3.0-10.fc19.ppc64
Installing : pulseaudio-libs-glib2-3.0-10.fc19.ppc64
Cleanup : pulseaudio-module-x11-4.0-4.gita89ca.fc20.ppc64
Cleanup : pulseaudio-module-bluetooth-4.0-4.gita89ca.fc20.ppc64
Cleanup : pulseaudio-4.0-4.gita89ca.fc20.ppc64
Cleanup : pulseaudio-utils-4.0-4.gita89ca.fc20.ppc64
Cleanup : pulseaudio-libs-glib2-4.0-4.gita89ca.fc20.ppc64
Cleanup : pulseaudio-libs-4.0-4.gita89ca.fc20.ppc64

In other words, i went from pulseaudio-4.0-4 back to pulseaudio-3.0-10, which worked nicely in Fedora 19.

Thanks & regards, Phil
And tried the same with libcanberra, same error:

Updating : libcanberra-0.30-4.fc20.ppc64
Updating : libcanberra-gtk3-0.30-4.fc20.ppc64
Updating : libcanberra-gtk2-0.30-4.fc20.ppc64
Cleanup : libcanberra-gtk2-0.30-3.fc19.ppc64
Cleanup : libcanberra-gtk3-0.30-3.fc19.ppc64
Cleanup : libcanberra-0.30-3.fc19.ppc64

So backing down to an older libcanberra didn't help either.

Thanks & regards, Phil
And i've attached the pulseaudio log with libcanberra-pulse.so moved away to make it work. And unfortunately the logs are identical... O_O So no clue what's happening, or whether it's actually a pulseaudio problem after all: pulseaudio seems to get the same messages and behave exactly the same in both cases. Thanks & regards, Phil
Created attachment 803944 [details] pulseaudio trace at startup of vncserver without libcanberra-pulse.so
And another observation: If you attach a gdb session to the hanging gnome-shell and then CONTinue the process, it suddenly works... Love it when things start to work as soon as you run them in a debugger. :( Thanks & regards, Phil
Ah, but only if pulseaudio isn't running as a process. As soon as i have pulseaudio running again it hangs again. Thanks & regards, Phil
#include <stdio.h>
#include <stdlib.h>
#include <pulse/thread-mainloop.h>
#include <pulse/context.h>

static void callback(pa_context *c, void *data) {
    pa_threaded_mainloop *ml = data;
    printf("Wake up! state: %u\n", pa_context_get_state(c));
    pa_threaded_mainloop_signal(ml, 0);
}

int main(){
    pa_threaded_mainloop *ml = pa_threaded_mainloop_new();
    pa_context *c = pa_context_new(pa_threaded_mainloop_get_api(ml),"pulse-test");
    pa_context_set_state_callback(c, callback, ml);
    pa_context_connect(c, NULL, 0, NULL);
    pa_threaded_mainloop_lock(ml);
    pa_threaded_mainloop_start(ml);
    unsigned int s = pa_context_get_state(c);
    printf("state: %u\n", s);
    pa_threaded_mainloop_wait(ml);
    printf("OK\n");
    exit(0);
}

This code prints OK on ppc64 with glibc 2.17, but hangs with glibc 2.18.
Would it be possible for me to get access to the hardware for debugging?

It's a shot in the dark, but I'm wondering if this might be related to the introduction of priority inheritance for non-x86 arches in glibc 2.18. Easy way to check would be with this patch:

diff --git a/src/pulsecore/mutex-posix.c b/src/pulsecore/mutex-posix.c
index 74c5768..4747e45 100644
--- a/src/pulsecore/mutex-posix.c
+++ b/src/pulsecore/mutex-posix.c
@@ -42,14 +42,16 @@ struct pa_cond {
 pa_mutex* pa_mutex_new(bool recursive, bool inherit_priority) {
     pa_mutex *m;
     pthread_mutexattr_t attr;
+#if 0
     int r;
+#endif
 
     pa_assert_se(pthread_mutexattr_init(&attr) == 0);
 
     if (recursive)
         pa_assert_se(pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE) == 0);
 
-#ifdef HAVE_PTHREAD_PRIO_INHERIT
+#if 0
     if (inherit_priority) {
         r = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
         pa_assert(r == 0 || r == ENOTSUP);
@@ -58,7 +60,7 @@ pa_mutex* pa_mutex_new(bool recursive, bool inherit_priority) {
 
     m = pa_xnew(pa_mutex, 1);
 
-#ifndef HAVE_PTHREAD_PRIO_INHERIT
+#if 1
     pa_assert_se(pthread_mutex_init(&m->mutex, &attr) == 0);
 #else
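The patch above switches priority inheritance off at compile time. For A/B testing outside PulseAudio, roughly the same choice can be made at run time. The following is only a sketch under that assumption: `init_test_mutex` and its parameters are hypothetical names, not PulseAudio API, and the fallback to PTHREAD_PRIO_NONE when initialisation fails is an illustrative choice.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper (not PulseAudio code): initialise *m, optionally
 * recursive and optionally with priority inheritance (PI).
 * Returns 0 on success and stores the protocol actually used
 * (PTHREAD_PRIO_INHERIT or PTHREAD_PRIO_NONE) in *protocol_out. */
int init_test_mutex(pthread_mutex_t *m, int recursive, int inherit_priority,
                    int *protocol_out)
{
    pthread_mutexattr_t attr;
    int r;

    if ((r = pthread_mutexattr_init(&attr)) != 0)
        return r;

    if (recursive &&
        (r = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE)) != 0)
        goto out;

    if (inherit_priority) {
        /* ENOTSUP is tolerated: the attribute then keeps PTHREAD_PRIO_NONE. */
        r = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        if (r != 0 && r != ENOTSUP)
            goto out;
    }

    r = pthread_mutex_init(m, &attr);
    if (r != 0 && inherit_priority) {
        /* Assumption: the failure is due to missing PI support, so retry
         * with a plain (non-PI) mutex. */
        r = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_NONE);
        if (r == 0)
            r = pthread_mutex_init(m, &attr);
    }

    if (r == 0)
        r = pthread_mutexattr_getprotocol(&attr, protocol_out);

out:
    pthread_mutexattr_destroy(&attr);
    return r;
}
```

Calling it once with inherit_priority set to 0 and once with 1, and running the same workload against each mutex, gives the comparison the patch is meant to provide without rebuilding anything.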
Thanks Arun! I can confirm that with Arun's patch applied, both my reproducer and GNOME work fine.
Okay, so the problem seems to be related to glibc 2.18's priority inheritance implementation on PPC. Gustavo, you had a pthread_mutex_* version of the reproducer that worked. Adding this after setting the mutex type to recursive might expose the problem independent of any PA calls: pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
Here is the pure pthread reproducer including Arun's suggestion. It indeed reproduces the issue. It prints OK with glibc-2.17 but hangs with glibc-2.18.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

pthread_mutex_t mutex;
pthread_mutexattr_t attr;
pthread_cond_t cond;
pthread_t tid;

void *callback(void *arg){
    for (;;){
        sleep(2);
        pthread_mutex_lock(&mutex);
        printf("Wake up!\n");
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&mutex);
    }
}

int main(){
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&mutex, &attr);
    pthread_cond_init(&cond, NULL);
    pthread_mutex_lock(&mutex);
    pthread_create(&tid, NULL, callback, &cond);
    printf("Going to sleep\n");
    for(;;){
        pthread_cond_wait(&cond, &mutex);
        printf("OK\n");
        exit(0);
    }
}
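For automated runs, a variant of this reproducer that gives up after a deadline may be easier to handle than one that hangs. The sketch below is one possible shape, not something from this report: `pi_condvar_wakeup_works`, `waker` and the globals are hypothetical names, and it assumes that pthread_cond_timedwait does return at the deadline on an affected glibc, which has not been verified here.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t test_mutex;
static pthread_cond_t test_cond;
static int woken;  /* predicate, protected by test_mutex */

/* Sleeps briefly so the caller is already waiting, then broadcasts. */
static void *waker(void *arg)
{
    (void)arg;
    sleep(1);
    pthread_mutex_lock(&test_mutex);
    woken = 1;
    pthread_cond_broadcast(&test_cond);
    pthread_mutex_unlock(&test_mutex);
    return NULL;
}

/* Hypothetical helper: returns 1 if the waiter was woken before the
 * deadline, 0 if the wait timed out or failed, -1 if setup failed
 * (for example, no PI mutex support). timeout_sec should be well above
 * the 1 second the waker sleeps. */
int pi_condvar_wakeup_works(int timeout_sec)
{
    pthread_mutexattr_t attr;
    pthread_t tid;
    struct timespec deadline;
    int rc;

    pthread_mutexattr_init(&attr);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (rc == 0)
        rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0)
        rc = pthread_mutex_init(&test_mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    if (rc != 0)
        return -1;

    pthread_cond_init(&test_cond, NULL);
    woken = 0;

    pthread_mutex_lock(&test_mutex);
    if (pthread_create(&tid, NULL, waker, NULL) != 0) {
        pthread_mutex_unlock(&test_mutex);
        pthread_cond_destroy(&test_cond);
        pthread_mutex_destroy(&test_mutex);
        return -1;
    }

    /* Condition variables use CLOCK_REALTIME by default. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;

    /* Wait until the waker sets the predicate or the deadline passes. */
    while (!woken && rc == 0)
        rc = pthread_cond_timedwait(&test_cond, &test_mutex, &deadline);
    pthread_mutex_unlock(&test_mutex);

    pthread_join(tid, NULL);
    pthread_cond_destroy(&test_cond);
    pthread_mutex_destroy(&test_mutex);
    return rc == 0 ? 1 : 0;
}
```

A small main that calls pi_condvar_wakeup_works(5) and prints the result would stand in for the "OK" line of the reproducer. Unlike the original, the wait is guarded by a predicate, so a spurious wakeup is not counted as success.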
Thanks a lot Arun! Reassigning to glibc.
Gustavo, Thank you for the pthread reproducer. Siddhesh, Could you look into this please?
(In reply to Carlos O'Donell from comment #24)
> Siddhesh,
>
> Could you look into this please?

Sure.
Siddhesh, if you need access to a Power box running Fedora 19 please contact me internally via email or IRC and i'll get you access to a machine that exposes this issue. Thanks! Regards, Phil
(In reply to Phil Knirsch from comment #26)
> Siddhesh, if you need access to a Power box running Fedora 19 please contact
> me internally via email or IRC and i'll get you access to a machine that
> exposes this issue.

Thanks Phil, I had an s390 box on which I could reproduce this problem - it's a fairly silly error where I had used the wrong flag (PTHREAD_MUTEX_ROBUST instead of the PTHREAD_MUTEX_ROBUST_NORMAL_NP). I've posted a patch upstream:

https://sourceware.org/ml/libc-alpha/2013-10/msg00002.html

I've also pushed a scratch build for f20. The srpm upload is taking a while, so I'll share the build URL once it is done.
Perfect, thanks Siddhesh! Shall we propose this as a freeze exception for Fedora 20 Beta? That would allow us to get this bug fixed in primary and then we can include the build for our Alpha. Thanks & regards, Phil
The scratch build in f20:

http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=1433631

I've got a positive review upstream, so I'll push the patch upstream and in rawhide in a while.

(In reply to Phil Knirsch from comment #28)
> Shall we propose this as a freeze exception for Fedora 20 Beta? That would
> allow us to get this bug fixed in primary and then we can include the build
> for our Alpha.

Sure, I'll push the patch to f20 as well then, so folks can test and give feedback on bodhi.
As per https://fedoraproject.org/wiki/QA:SOP_freeze_exception_bug_process i'm proposing this bug as a Freeze Exception for Beta for Primary as it's a blocker for Alpha for Fedora 20 PPC64. The fix should be very isolated for PPC64 and S390x, see Siddhesh's comments above. Thanks & regards, Phil
(In reply to Phil Knirsch from comment #30)
> The fix should be very isolated for PPC64 and S390x, see Siddhesh's comments
> above.

... and arm, although I haven't actually verified it. The bug (and the fix) is isolated to non-x86 architectures.
This also seems to affect building ghc on ppc64 and s390x (bug 989593).
Hm, I tried building ghc with glibc-2.18-10.fc20.ppc64 but it still hangs part-way through for me with

#0 0x00001fffffec1ab0 in .pthread_cond_wait () from /lib64/libpthread.so.0
Discussed this in the 2013-10-02 Blocker Review Meeting [1]. This has been voted an AcceptedFreezeException. This is a showstopper for PPC64, and while PPC64 is not a primary architecture (PA), a tested fix would be considered past freeze.

[1] http://meetbot.fedoraproject.org/fedora-blocker-review/2013-10-02/
GNOME is still not working after a default desktop install using the latest ISO, which includes the fixed glibc:

http://ppc.koji.fedoraproject.org/stage/f20-20131002-RC4.1/

Same symptom (black screen), though it might be a different issue. This is what I see attaching gdb to the gnome-shell process:

(gdb) info threads
  Id   Target Id         Frame
  23   Thread 0x1ffff938eed0 (LWP 37187) "gdbus" 0x00000080cabaf250 in .__GI___poll () from /lib64/libc.so.6
  22   Thread 0x1ffff268eed0 (LWP 37195) "gnome-shell" 0x00000080cacd1ad0 in .pthread_cond_wait () from /lib64/libpthread.so.0
  21   Thread 0x1ffff1e8eed0 (LWP 37196) "gnome-shell" 0x00000080cacd1ad0 in .pthread_cond_wait () from /lib64/libpthread.so.0
  20   Thread 0x1ffff168eed0 (LWP 37197) "gnome-shell" 0x00000080cacd1ad0 in .pthread_cond_wait () from /lib64/libpthread.so.0
  19   Thread 0x1ffff0e8eed0 (LWP 37198) "gnome-shell" 0x00000080cacd1ad0 in .pthread_cond_wait () from /lib64/libpthread.so.0
  18   Thread 0x1ffff068eed0 (LWP 37199) "gnome-shell" 0x00000080cacd1ad0 in .pthread_cond_wait () from /lib64/libpthread.so.0
  17   Thread 0x1fffefe8eed0 (LWP 37200) "gnome-shell" 0x00000080cacd1ad0 in .pthread_cond_wait () from /lib64/libpthread.so.0
  16   Thread 0x1fffef68eed0 (LWP 37201) "gnome-shell" 0x00000080cacd1ad0 in .pthread_cond_wait () from /lib64/libpthread.so.0
  15   Thread 0x1fffeee8eed0 (LWP 37202) "gnome-shell" 0x00000080cacd1ad0 in .pthread_cond_wait () from /lib64/libpthread.so.0
  14   Thread 0x1fffee68eed0 (LWP 37203) "gnome-shell" 0x00000080cacd1ad0 in .pthread_cond_wait () from /lib64/libpthread.so.0
  13   Thread 0x1fffede8eed0 (LWP 37204) "gnome-shell" 0x00000080cacd1ad0 in .pthread_cond_wait () from /lib64/libpthread.so.0
  12   Thread 0x1fffed68eed0 (LWP 37205) "gnome-shell" 0x00000080cacd1ad0 in .pthread_cond_wait () from /lib64/libpthread.so.0
  11   Thread 0x1fffece8eed0 (LWP 37206) "gnome-shell" 0x00000080cacd1ad0 in .pthread_cond_wait () from /lib64/libpthread.so.0
  10   Thread 0x1fffec68eed0 (LWP 37207) "gnome-shell" 0x00000080cacd1ad0 in .pthread_cond_wait () from /lib64/libpthread.so.0
  9    Thread 0x1fffebe8eed0 (LWP 37208) "gnome-shell" 0x00000080cacd1ad0 in .pthread_cond_wait () from /lib64/libpthread.so.0
  8    Thread 0x1fffeb68eed0 (LWP 37209) "gnome-shell" 0x00000080cacd1ad0 in .pthread_cond_wait () from /lib64/libpthread.so.0
  7    Thread 0x1fffeae8eed0 (LWP 37210) "gnome-shell" 0x00000080cacd1ad0 in .pthread_cond_wait () from /lib64/libpthread.so.0
  6    Thread 0x1fffea68eed0 (LWP 37211) "dconf worker" 0x00000080cabaf250 in .__GI___poll () from /lib64/libc.so.6
  5    Thread 0x1fffe9e8eed0 (LWP 37212) "threaded-ml" 0x00000080cabaf250 in .__GI___poll () from /lib64/libc.so.6
  4    Thread 0x1fffe968eed0 (LWP 37213) "JS GC Helper" 0x00000080cacd1ad0 in .pthread_cond_wait () from /lib64/libpthread.so.0
  3    Thread 0x1fffe8a8eed0 (LWP 37214) "JS Sour~ Thread" 0x00000080cacd1ad0 in .pthread_cond_wait () from /lib64/libpthread.so.0
  2    Thread 0x1fffdd1deed0 (LWP 37235) "gmain" 0x00000080cabaf250 in .__GI___poll () from /lib64/libc.so.6
* 1    Thread 0x1ffffff9bdd0 (LWP 37179) "gnome-shell" 0x00000080cabaf250 in .__GI___poll () from /lib64/libc.so.6
(gdb) bt
#0  0x00000080cabaf250 in .__GI___poll () from /lib64/libc.so.6
#1  0x00000080cb00e070 in .g_poll () from /lib64/libglib-2.0.so.0
#2  0x00000080caff9e14 in 000000ca.plt_call.strncasecmp@@GLIBC_2.3+0 () from /lib64/libglib-2.0.so.0
#3  0x00000080caffa530 in .g_main_loop_run () from /lib64/libglib-2.0.so.0
#4  0x00000080cda8ed4c in .meta_run () from /lib64/libmutter.so.0
#5  0x0000000010002294 in main (argc=1, argv=0x3fffe1637fb8) at main.c:439
(gdb) t 7
[Switching to thread 7 (Thread 0x1fffeae8eed0 (LWP 37210))]
#0  0x00000080cacd1ad0 in .pthread_cond_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00000080cacd1ad0 in .pthread_cond_wait () from /lib64/libpthread.so.0
#1  0x00001ffff86ab1ac in 00000374.plt_call.__snprintf_chk@@GLIBC_2.4+0 () from /usr/lib64/dri/swrast_dri.so
#2  0x00000080caccc30c in .start_thread () from /lib64/libpthread.so.0
#3  0x00000080cabbe1d0 in .__clone () from /lib64/libc.so.6
(gdb)
It looks like it is a different bug. It seems related to gnome-bluetooth. If I remove all references to bluetooth from /usr/share/gnome-shell/js/ui/panel.js then GNOME works fine. PS: I didn't see it before while testing the fixed glibc because I had already done the change described above while investigating this issue (now I reinstalled the system). Too many different issues can cause a black screen.
OK, now we are hitting a gnome-shell bug. Here is the upstream bug: https://bugzilla.gnome.org/show_bug.cgi?id=707430 It has been fixed upstream in gnome-shell version 3.9.92, which is already available in Fedora 20. We just have to pull it into the next RC compose.
So maybe the ghc hang is something else?
Jens, bug 989593 looks like something else not related to this bug.
glibc-2.18-11.fc20 has been submitted as an update for Fedora 20. https://admin.fedoraproject.org/updates/glibc-2.18-11.fc20
Package glibc-2.18-11.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing glibc-2.18-11.fc20'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-18239/glibc-2.18-11.fc20
then log in and leave karma (feedback).
glibc-2.18-11.fc20 has been pushed to the Fedora 20 stable repository. If problems still persist, please make note of it in this bug report.