Red Hat Bugzilla – Bug 786170
gam_server deadlocks, leading to all KDE applications hanging at start
Last modified: 2015-03-03 18:04:32 EST
Created attachment 558641 [details]
Description of problem:
Occasionally, gam_server deadlocks (see attached backtrace). When this happens, all KDE applications hang at start. Killing gam_server allows the user to regain an operational desktop.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Hard to reproduce; it appears more likely to happen when launching a KDE application while the machine is out of RAM and swapping.
gam_server hangs, and so do all applications using it (i.e. any that attempt to connect to /tmp/fam-$username-).
Looking at gamin-0.1.10/server/inotify-helper.c, there is an obvious double lock:
ih_sub_foreach_worker locks inotify_lock on line 153 and calls ih_sub_cancel on line 162, which also attempts to lock inotify_lock on line 130. The backtrace indeed shows gam_server stuck at inotify-helper.c:130.
I would suggest removing the G_LOCK and G_UNLOCK calls in ih_sub_cancel, since it is only ever called from ih_sub_foreach_worker anyway. I have made a corresponding patch.
Created attachment 559148 [details]
Adding firstname.lastname@example.org to CC per http://lists.fedoraproject.org/pipermail/users/2011-December/411145.html
FYI, this piece of code is embedded in glib, should be fixed there too.
Confirming this doesn't affect glib: there, ih_sub_cancel() is not used in nested calls, so the lock is not already held. In fact, glib uses slightly newer sources from John McCutchan.
Posted my comments in upstream bug https://bugzilla.gnome.org/show_bug.cgi?id=669292
Fixed in gamin-0.1.10-12.fc18
marked a few more upstream references
Had someone in #fedora-kde IRC today seemingly get hit by this while already running gamin-0.1.10-12.fc17. Will keep an eye out and re-open if we can get more debugging done next time (instead of simply killing gam_server).
I hit this every few weeks, currently running gamin-0.1.10-12.fc17.x86_64. Today I had enough time to strace it, find that gam_server is the culprit and found this bugzilla entry.
What do you want me to do to further analyze this when I hit it next time?
Do (as root)
then run (as normal user)
gdb -p$(pidof gam_server)
and then type
and attach the backtrace.
Oops, that should be
gdb -p $(pidof gam_server)
(note the space between p and $)
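Put together as a script, the steps might look like the following. This is a sketch: the root step was elided above, and relaxing Yama ptrace restrictions is only an assumption about what it was; likewise, plain `bt` is assumed as the gdb command.

```shell
# ASSUMPTION: the elided root step relaxes Yama ptrace restrictions so a
# normal user can attach gdb to gam_server; the real command may differ.
#   (as root)  echo 0 > /proc/sys/kernel/yama/ptrace_scope

# As the normal user, attach non-interactively, print a regular
# backtrace, and detach (batch mode exits after the -ex commands run):
gdb -batch -ex bt -p "$(pidof gam_server)" > gam_server-backtrace.txt 2>&1
```

Batch mode is convenient here because the daemon is hung anyway; gdb attaches, dumps the stack, and detaches without an interactive session.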
Thanks. So just a regular backtrace.
Now I remembered how to trigger this: open Dolphin and navigate to a directory with many subdirectories (about 600 in my case). Scroll around a bit in the view or enter a subdirectory, and it hits you within a minute or so. BTW, this directory is mounted via NFSv4, if that matters.
Here is the backtrace:
#0 0x00007f506864c950 in __write_nocancel () at ../sysdeps/unix/syscall-template.S:82
#1 0x0000000000408106 in gam_client_conn_write (source=<optimized out>, fd=9, data=data@entry=0x7fff62545660, len=len@entry=59) at gam_channel.c:826
#2 0x0000000000408794 in gam_send_event (conn=conn@entry=0x14dfb60, reqno=90, event=<optimized out>, path=0x15105b0 "###DIRNAME###", len=49)
#3 0x000000000040a36d in gam_eq_flush_callback (conn=0x14dfb60, event=0x1510810, eq=<optimized out>) at gam_eq.c:118
#4 gam_eq_flush (eq=0x14dd130, conn=conn@entry=0x14dfb60) at gam_eq.c:137
#5 0x00000000004081ea in gam_connection_eq_flush (data=0x14dfb60, data@entry=<error reading variable: value has been optimized out>) at gam_connection.c:174
#6 0x00007f50689673bb in g_timeout_dispatch (source=source@entry=0x1511a40, callback=<optimized out>, user_data=<optimized out>) at gmain.c:3882
#7 0x00007f5068966825 in g_main_dispatch (context=0x14dd4d0) at gmain.c:2539
#8 g_main_context_dispatch (context=context@entry=0x14dd4d0) at gmain.c:3075
#9 0x00007f5068966b58 in g_main_context_iterate (context=0x14dd4d0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3146
#10 0x00007f5068966f52 in g_main_loop_run (loop=0x14df160) at gmain.c:3340
#11 0x00000000004035e6 in main (argc=<optimized out>, argv=<optimized out>) at gam_server.c:647
I edited the entry ###DIRNAME###; it contained the name of one of my directories. ###DIRNAME### was not the directory I was in, but one up in the hierarchy and then two below. I don't know what Dolphin was telling gam_server to do there; I never entered that ###DIRNAME### or the one above it in this whole KDE session.
Version is gamin-0.1.10-12.fc17.x86_64:
Signature : RSA/SHA256, Wed 08 Feb 2012 05:46:09 PM CET, Key ID 50e94c991aca3465
Build Date : Wed 08 Feb 2012 05:27:13 PM CET
Build Host : x86-11.phx2.fedoraproject.org
Next one: I tried it in another big directory hierarchy, and I can reproduce it within minutes now.
#0 0x00007fb88cfd6950 in __write_nocancel () at ../sysdeps/unix/syscall-template.S:82
#1 0x0000000000408106 in gam_client_conn_write (source=<optimized out>, fd=20, data=data@entry=0x7fff0bf90350, len=len@entry=51) at gam_channel.c:826
#2 0x0000000000408b6d in gam_send_ack (conn=conn@entry=0xf3e9d0, reqno=1009, path=path@entry=0xff1230 "##OTHERDIR##", len=len@entry=41) at gam_connection.c:686
#3 0x0000000000409020 in gam_connection_request (req=0xf3ee02, conn=0xf3e9d0) at gam_connection.c:411
#4 gam_connection_data (conn=conn@entry=0xf3e9d0, len=<optimized out>) at gam_connection.c:509
#5 0x0000000000407adf in gam_client_conn_read (source=0xf8f980, condition=<optimized out>, info=0xf3e9d0) at gam_channel.c:283
#6 0x00007fb88d2f0825 in g_main_dispatch (context=0xf2f4d0) at gmain.c:2539
#7 g_main_context_dispatch (context=context@entry=0xf2f4d0) at gmain.c:3075
#8 0x00007fb88d2f0b58 in g_main_context_iterate (context=0xf2f4d0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3146
#9 0x00007fb88d2f0f52 in g_main_loop_run (loop=0xf31160) at gmain.c:3340
#10 0x00000000004035e6 in main (argc=<optimized out>, argv=<optimized out>) at gam_server.c:647
##OTHERDIR## again is a directory that I did not open; this time it is below the current one.
This is not the deadlock reported here, so it's better to file a new bug. While I'm at it: when gam_server is hanging, what is file descriptor fd=20? Look at
ls -l /proc/$(pidof gam_server)/fd
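The entries in that directory are symlinks, so the fd in question can be resolved directly. A sketch (fd 20 is taken from frame #1 of the second backtrace; for a FAM client connection the link typically reads "socket:[<inode>]", and the inode must then be matched by hand against /proc/net/unix to find the peer):

```shell
pid=$(pidof gam_server)

# list every open file descriptor of the hung daemon
ls -l "/proc/$pid/fd"

# resolve the one the backtrace is blocked on (fd=20 in frame #1);
# for a unix socket this prints something like "socket:[12345]"
readlink "/proc/$pid/fd/20"
```

Matching the socket inode in /proc/net/unix (or with `ss -xp`) then identifies which KDE client the write in gam_client_conn_write is blocked on.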
Ok, I filed a new bug for this: #917848