Bug 786170
Summary: | gam_server deadlocks, leading to all KDE applications hanging at start | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Sandro Mani <manisandro> | ||||||
Component: | gamin | Assignee: | Tomáš Bžatek <tbzatek> | ||||||
Status: | CLOSED RAWHIDE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||
Severity: | unspecified | Docs Contact: | |||||||
Priority: | unspecified | ||||||||
Version: | rawhide | CC: | gerd, rdieter, tbzatek, tsmetana | ||||||
Target Milestone: | --- | ||||||||
Target Release: | --- | ||||||||
Hardware: | Unspecified | ||||||||
OS: | Unspecified | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | gamin-0.1.10-12.fc18 | Doc Type: | Bug Fix | ||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2012-02-08 16:28:32 UTC | Type: | --- | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Attachments: |
|
Looking at gamin-0.1.10/server/inotify-helper.c, there is an obvious double lock: ih_sub_foreach_worker locks inotify_lock on line 153, and calls ih_sub_cancel on line 162, which also attempts to lock inotify_lock on line 130. The backtrace indeed shows that gam_server is stuck ad inotify-helper.c::130. I would suggest removing the G_LOCK and G_UNLOCK in ih_sub_cancel, since it is only called by ih_sub_foreach_worker anyway. I have made a corresponding patch. Created attachment 559148 [details]
Patch
Adding rdieter.edu to CC per http://lists.fedoraproject.org/pipermail/users/2011-December/411145.html FYI, this piece of code is embedded in glib, should be fixed there too. /me checks... Confirming this doesn't affect glib, ih_sub_cancel() is not used in nested calls and there should not be any lock held. In fact, glib uses slightly newer sources from John McCutchan. Posted my comments in upstream bug https://bugzilla.gnome.org/show_bug.cgi?id=669292 Fixed in gamin-0.1.10-12.fc18 marked a few more upstream references Had someone in #fedora-kde irc today seemingly get hit by this already running gamin-0.1.10-12.fc17 , will keep an eye out and re-open if we can get more debugging done next time (instead of simply killing gam_server) I hit this every few weeks, currently running gamin-0.1.10-12.fc17.x86_64. Today I had enough time to strace it, find that gam_server is the culprit and found this bugzilla entry. What do you want me to do to further analyze this when I hit it next time? Do (as root) debuginfo-install gamin then run (as normal user) gdb -p$(pidof gam_server) and then type backtrace and attach the backtrace. Ups, should be gdb -p $(pidof gam_server) (note the space between p and $) Thanks. So just a regular backtrace. Now I remembered how to trigger this: open dolphin and navigate to a directory with many subdirs (about 600 in my case). Scroll a bit around in the view or enter a subdir. And it hits you within a minute or so. BTW, this dir is mounted via NFS4, if that matters. Here is the backtrace: #0 0x00007f506864c950 in __write_nocancel () at ../sysdeps/unix/syscall-template.S:82 #1 0x0000000000408106 in gam_client_conn_write (source=<optimized out>, fd=9, data=data@entry=0x7fff62545660, len=len@entry=59) at gam_channel.c:826 #2 0x0000000000408794 in gam_send_event (conn=conn@entry=0x14dfb60, reqno=90, event=<optimized out>, path=0x15105b0 "###DIRNAME###", len=49) at gam_connection.c:609 #3 0x000000000040a36d in gam_eq_flush_callback (conn=0x14dfb60, event=0x1510810, eq=<optimized out>) at gam_eq.c:118 #4 gam_eq_flush (eq=0x14dd130, conn=conn@entry=0x14dfb60) at gam_eq.c:137 #5 0x00000000004081ea in gam_connection_eq_flush (data=0x14dfb60, data@entry=<error reading variable: value has been optimized out>) at gam_connection.c:174 #6 0x00007f50689673bb in g_timeout_dispatch (source=source@entry=0x1511a40, callback=<optimized out>, user_data=<optimized out>) at gmain.c:3882 #7 0x00007f5068966825 in g_main_dispatch (context=0x14dd4d0) at gmain.c:2539 #8 g_main_context_dispatch (context=context@entry=0x14dd4d0) at gmain.c:3075 #9 0x00007f5068966b58 in g_main_context_iterate (context=0x14dd4d0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3146 #10 0x00007f5068966f52 in g_main_loop_run (loop=0x14df160) at gmain.c:3340 #11 0x00000000004035e6 in main (argc=<optimized out>, argv=<optimized out>) at gam_server.c:647 I edited the entry ###DIRNAME###, there was the name of one of my directories in it. ###DIRNAME### was not the current dir I was in, but one up in hierarchy and then two below. Don't know what dolphin was telling gam to do there, I never entered that ###DIRNAME### or the one above in this whole KDE session. Version is gamin-0.1.10-12.fc17.x86_64, Signature : RSA/SHA256, Wed 08 Feb 2012 05:46:09 PM CET, Key ID 50e94c991aca3465 Build Date : Wed 08 Feb 2012 05:27:13 PM CET Build Host : x86-11.phx2.fedoraproject.org Next one, tried it in another big dir hierarchy. Can reproduce it within minutes now. #0 0x00007fb88cfd6950 in __write_nocancel () at ../sysdeps/unix/syscall-template.S:82 #1 0x0000000000408106 in gam_client_conn_write (source=<optimized out>, fd=20, data=data@entry=0x7fff0bf90350, len=len@entry=51) at gam_channel.c:826 #2 0x0000000000408b6d in gam_send_ack (conn=conn@entry=0xf3e9d0, reqno=1009, path=path@entry=0xff1230 "##OTHERDIR##", len=len@entry=41) at gam_connection.c:686 #3 0x0000000000409020 in gam_connection_request (req=0xf3ee02, conn=0xf3e9d0) at gam_connection.c:411 #4 gam_connection_data (conn=conn@entry=0xf3e9d0, len=<optimized out>) at gam_connection.c:509 #5 0x0000000000407adf in gam_client_conn_read (source=0xf8f980, condition=<optimized out>, info=0xf3e9d0) at gam_channel.c:283 #6 0x00007fb88d2f0825 in g_main_dispatch (context=0xf2f4d0) at gmain.c:2539 #7 g_main_context_dispatch (context=context@entry=0xf2f4d0) at gmain.c:3075 #8 0x00007fb88d2f0b58 in g_main_context_iterate (context=0xf2f4d0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3146 #9 0x00007fb88d2f0f52 in g_main_loop_run (loop=0xf31160) at gmain.c:3340 #10 0x00000000004035e6 in main (argc=<optimized out>, argv=<optimized out>) at gam_server.c:647 ##OTHERDIR## again is a dir that I did not open. This time it is below the current one. This is not the deadlock reported here, so it's better to file a new bug. While I'm at it, when gam_server is hanging, who is the file descriptor fd=20? Look at ls -l /proc/$(pidof gam_server)/fd Ok, I filed a new bug for this: #917848 |
Created attachment 558641 [details] Backtrace Description of problem: Occasionally, gam_server deadlocks (see attached backtrace). When this happens, all KDE applications hang at start. Killing gam_server allows the user to regain an operational desktop. Version-Release number of selected component (if applicable): gamin-0.1.10-11.fc17.x86_64 How reproducible: Unfrequently Steps to Reproduce: Hard, it appears to be more likely to happen when attempting to launch a KDE application when the machine is out of RAM and swapping. Actual results: gam_server hangs, an so do all applications using it (i.e. attempt to connect to /tmp/fam-$username-) Additional info: Backtrace attached