Red Hat Bugzilla – Full Text Bug Listing
|Summary:||gam_server deadlocks, leading to all KDE applications hanging at start|
|Product:||[Fedora] Fedora||Reporter:||Sandro Mani <manisandro>|
|Component:||gamin||Assignee:||Tomáš Bžatek <tbzatek>|
|Status:||CLOSED RAWHIDE||QA Contact:||Fedora Extras Quality Assurance <extras-qa>|
|Version:||rawhide||CC:||gerd, rdieter, tbzatek, tsmetana|
|Fixed In Version:||gamin-0.1.10-12.fc18||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2012-02-08 11:28:32 EST||Type:||---|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
Description Sandro Mani 2012-01-31 11:06:18 EST
Created attachment 558641 [details] Backtrace Description of problem: Occasionally, gam_server deadlocks (see attached backtrace). When this happens, all KDE applications hang at start. Killing gam_server allows the user to regain an operational desktop. Version-Release number of selected component (if applicable): gamin-0.1.10-11.fc17.x86_64 How reproducible: Unfrequently Steps to Reproduce: Hard, it appears to be more likely to happen when attempting to launch a KDE application when the machine is out of RAM and swapping. Actual results: gam_server hangs, an so do all applications using it (i.e. attempt to connect to /tmp/fam-$username-) Additional info: Backtrace attached
Comment 1 Sandro Mani 2012-02-02 17:14:12 EST
Looking at gamin-0.1.10/server/inotify-helper.c, there is an obvious double lock: ih_sub_foreach_worker locks inotify_lock on line 153, and calls ih_sub_cancel on line 162, which also attempts to lock inotify_lock on line 130. The backtrace indeed shows that gam_server is stuck ad inotify-helper.c::130. I would suggest removing the G_LOCK and G_UNLOCK in ih_sub_cancel, since it is only called by ih_sub_foreach_worker anyway. I have made a corresponding patch.
Comment 3 Sandro Mani 2012-02-02 17:16:23 EST
Comment 4 Tomáš Bžatek 2012-02-03 08:17:12 EST
FYI, this piece of code is embedded in glib, should be fixed there too. /me checks...
Comment 5 Tomáš Bžatek 2012-02-08 11:28:32 EST
Confirming this doesn't affect glib, ih_sub_cancel() is not used in nested calls and there should not be any lock held. In fact, glib uses slightly newer sources from John McCutchan. Posted my comments in upstream bug https://bugzilla.gnome.org/show_bug.cgi?id=669292 Fixed in gamin-0.1.10-12.fc18
Comment 6 Rex Dieter 2012-06-29 16:37:49 EDT
marked a few more upstream references Had someone in #fedora-kde irc today seemingly get hit by this already running gamin-0.1.10-12.fc17 , will keep an eye out and re-open if we can get more debugging done next time (instead of simply killing gam_server)
Comment 7 Gerd v. Egidy 2013-03-03 15:57:31 EST
I hit this every few weeks, currently running gamin-0.1.10-12.fc17.x86_64. Today I had enough time to strace it, find that gam_server is the culprit and found this bugzilla entry. What do you want me to do to further analyze this when I hit it next time?
Comment 8 Sandro Mani 2013-03-03 16:03:12 EST
Do (as root) debuginfo-install gamin then run (as normal user) gdb -p$(pidof gam_server) and then type backtrace and attach the backtrace.
Comment 9 Sandro Mani 2013-03-03 16:04:12 EST
Ups, should be gdb -p $(pidof gam_server) (note the space between p and $)
Comment 10 Gerd v. Egidy 2013-03-03 17:39:39 EST
Thanks. So just a regular backtrace. Now I remembered how to trigger this: open dolphin and navigate to a directory with many subdirs (about 600 in my case). Scroll a bit around in the view or enter a subdir. And it hits you within a minute or so. BTW, this dir is mounted via NFS4, if that matters. Here is the backtrace: #0 0x00007f506864c950 in __write_nocancel () at ../sysdeps/unix/syscall-template.S:82 #1 0x0000000000408106 in gam_client_conn_write (source=<optimized out>, fd=9, data=data@entry=0x7fff62545660, len=len@entry=59) at gam_channel.c:826 #2 0x0000000000408794 in gam_send_event (conn=conn@entry=0x14dfb60, reqno=90, event=<optimized out>, path=0x15105b0 "###DIRNAME###", len=49) at gam_connection.c:609 #3 0x000000000040a36d in gam_eq_flush_callback (conn=0x14dfb60, event=0x1510810, eq=<optimized out>) at gam_eq.c:118 #4 gam_eq_flush (eq=0x14dd130, conn=conn@entry=0x14dfb60) at gam_eq.c:137 #5 0x00000000004081ea in gam_connection_eq_flush (data=0x14dfb60, data@entry=<error reading variable: value has been optimized out>) at gam_connection.c:174 #6 0x00007f50689673bb in g_timeout_dispatch (source=source@entry=0x1511a40, callback=<optimized out>, user_data=<optimized out>) at gmain.c:3882 #7 0x00007f5068966825 in g_main_dispatch (context=0x14dd4d0) at gmain.c:2539 #8 g_main_context_dispatch (context=context@entry=0x14dd4d0) at gmain.c:3075 #9 0x00007f5068966b58 in g_main_context_iterate (context=0x14dd4d0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3146 #10 0x00007f5068966f52 in g_main_loop_run (loop=0x14df160) at gmain.c:3340 #11 0x00000000004035e6 in main (argc=<optimized out>, argv=<optimized out>) at gam_server.c:647 I edited the entry ###DIRNAME###, there was the name of one of my directories in it. ###DIRNAME### was not the current dir I was in, but one up in hierarchy and then two below. Don't know what dolphin was telling gam to do there, I never entered that ###DIRNAME### or the one above in this whole KDE session. Version is gamin-0.1.10-12.fc17.x86_64, Signature : RSA/SHA256, Wed 08 Feb 2012 05:46:09 PM CET, Key ID 50e94c991aca3465 Build Date : Wed 08 Feb 2012 05:27:13 PM CET Build Host : x86-11.phx2.fedoraproject.org
Comment 11 Gerd v. Egidy 2013-03-03 17:50:01 EST
Next one, tried it in another big dir hierarchy. Can reproduce it within minutes now. #0 0x00007fb88cfd6950 in __write_nocancel () at ../sysdeps/unix/syscall-template.S:82 #1 0x0000000000408106 in gam_client_conn_write (source=<optimized out>, fd=20, data=data@entry=0x7fff0bf90350, len=len@entry=51) at gam_channel.c:826 #2 0x0000000000408b6d in gam_send_ack (conn=conn@entry=0xf3e9d0, reqno=1009, path=path@entry=0xff1230 "##OTHERDIR##", len=len@entry=41) at gam_connection.c:686 #3 0x0000000000409020 in gam_connection_request (req=0xf3ee02, conn=0xf3e9d0) at gam_connection.c:411 #4 gam_connection_data (conn=conn@entry=0xf3e9d0, len=<optimized out>) at gam_connection.c:509 #5 0x0000000000407adf in gam_client_conn_read (source=0xf8f980, condition=<optimized out>, info=0xf3e9d0) at gam_channel.c:283 #6 0x00007fb88d2f0825 in g_main_dispatch (context=0xf2f4d0) at gmain.c:2539 #7 g_main_context_dispatch (context=context@entry=0xf2f4d0) at gmain.c:3075 #8 0x00007fb88d2f0b58 in g_main_context_iterate (context=0xf2f4d0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3146 #9 0x00007fb88d2f0f52 in g_main_loop_run (loop=0xf31160) at gmain.c:3340 #10 0x00000000004035e6 in main (argc=<optimized out>, argv=<optimized out>) at gam_server.c:647 ##OTHERDIR## again is a dir that I did not open. This time it is below the current one.
Comment 12 Sandro Mani 2013-03-03 17:54:34 EST
This is not the deadlock reported here, so it's better to file a new bug. While I'm at it, when gam_server is hanging, who is the file descriptor fd=20? Look at ls -l /proc/$(pidof gam_server)/fd
Comment 13 Gerd v. Egidy 2013-03-04 17:29:30 EST
Ok, I filed a new bug for this: #917848