Bug 786170

Summary: gam_server deadlocks, leading to all KDE applications hanging at start
Product: [Fedora] Fedora Reporter: Sandro Mani <manisandro>
Component: gaminAssignee: Tomáš Bžatek <tbzatek>
Status: CLOSED RAWHIDE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rawhideCC: gerd, rdieter, tbzatek, tsmetana
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: gamin-0.1.10-12.fc18 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-02-08 11:28:32 EST Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Attachments:
Description Flags
Backtrace
none
Patch none

Description Sandro Mani 2012-01-31 11:06:18 EST
Created attachment 558641 [details]
Backtrace

Description of problem:
Occasionally, gam_server deadlocks (see attached backtrace). When this happens, all KDE applications hang at start. Killing gam_server allows the user to regain an operational desktop.


Version-Release number of selected component (if applicable):
gamin-0.1.10-11.fc17.x86_64

How reproducible:
Unfrequently

Steps to Reproduce:
Hard, it appears to be more likely to happen when attempting to launch a KDE application when the machine is out of RAM and swapping.
  
Actual results:
gam_server hangs, an so do all applications using it (i.e. attempt to connect to /tmp/fam-$username-)

Additional info:
Backtrace attached
Comment 1 Sandro Mani 2012-02-02 17:14:12 EST
Looking at gamin-0.1.10/server/inotify-helper.c, there is an obvious double lock:
ih_sub_foreach_worker locks inotify_lock on line 153, and calls ih_sub_cancel on line 162, which also attempts to lock inotify_lock on line 130. The backtrace indeed shows that gam_server is stuck ad inotify-helper.c::130.

I would suggest removing the G_LOCK and G_UNLOCK in ih_sub_cancel, since it is only called by ih_sub_foreach_worker anyway. I have made a corresponding patch.
Comment 2 Sandro Mani 2012-02-02 17:14:56 EST
Created attachment 559148 [details]
Patch
Comment 4 Tomáš Bžatek 2012-02-03 08:17:12 EST
FYI, this piece of code is embedded in glib, should be fixed there too.
/me checks...
Comment 5 Tomáš Bžatek 2012-02-08 11:28:32 EST
Confirming this doesn't affect glib, ih_sub_cancel() is not used in nested calls and there should not be any lock held. In fact, glib uses slightly newer sources from John McCutchan.

Posted my comments in upstream bug https://bugzilla.gnome.org/show_bug.cgi?id=669292

Fixed in gamin-0.1.10-12.fc18
Comment 6 Rex Dieter 2012-06-29 16:37:49 EDT
marked a few more upstream references

Had someone in #fedora-kde irc today seemingly get hit by this already running gamin-0.1.10-12.fc17 , will keep an eye out and re-open if we can get more debugging done next time (instead of simply killing gam_server)
Comment 7 Gerd v. Egidy 2013-03-03 15:57:31 EST
I hit this every few weeks, currently running gamin-0.1.10-12.fc17.x86_64. Today I had enough time to strace it, find that gam_server is the culprit and found this bugzilla entry.

What do you want me to do to further analyze this when I hit it next time?
Comment 8 Sandro Mani 2013-03-03 16:03:12 EST
Do (as root)

debuginfo-install gamin

then run (as normal user)

gdb -p$(pidof gam_server)

and then type

backtrace

and attach the backtrace.
Comment 9 Sandro Mani 2013-03-03 16:04:12 EST
Ups, should be

gdb -p $(pidof gam_server)

(note the space between p and $)
Comment 10 Gerd v. Egidy 2013-03-03 17:39:39 EST
Thanks. So just a regular backtrace.

Now I remembered how to trigger this: open dolphin and navigate to a directory with many subdirs (about 600 in my case). Scroll a bit around in the view or enter a subdir. And it hits you within a minute or so. BTW, this dir is mounted via NFS4, if that matters.

Here is the backtrace:

#0  0x00007f506864c950 in __write_nocancel () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000000000408106 in gam_client_conn_write (source=<optimized out>, fd=9, data=data@entry=0x7fff62545660, len=len@entry=59) at gam_channel.c:826
#2  0x0000000000408794 in gam_send_event (conn=conn@entry=0x14dfb60, reqno=90, event=<optimized out>, path=0x15105b0 "###DIRNAME###", len=49)
    at gam_connection.c:609
#3  0x000000000040a36d in gam_eq_flush_callback (conn=0x14dfb60, event=0x1510810, eq=<optimized out>) at gam_eq.c:118
#4  gam_eq_flush (eq=0x14dd130, conn=conn@entry=0x14dfb60) at gam_eq.c:137
#5  0x00000000004081ea in gam_connection_eq_flush (data=0x14dfb60, data@entry=<error reading variable: value has been optimized out>) at gam_connection.c:174
#6  0x00007f50689673bb in g_timeout_dispatch (source=source@entry=0x1511a40, callback=<optimized out>, user_data=<optimized out>) at gmain.c:3882
#7  0x00007f5068966825 in g_main_dispatch (context=0x14dd4d0) at gmain.c:2539
#8  g_main_context_dispatch (context=context@entry=0x14dd4d0) at gmain.c:3075
#9  0x00007f5068966b58 in g_main_context_iterate (context=0x14dd4d0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3146
#10 0x00007f5068966f52 in g_main_loop_run (loop=0x14df160) at gmain.c:3340
#11 0x00000000004035e6 in main (argc=<optimized out>, argv=<optimized out>) at gam_server.c:647

I edited the entry ###DIRNAME###, there was the name of one of my directories in it. ###DIRNAME### was not the current dir I was in, but one up in hierarchy and then two below. Don't know what dolphin was telling gam to do there, I never entered that ###DIRNAME### or the one above in this whole KDE session.

Version is gamin-0.1.10-12.fc17.x86_64,
Signature   : RSA/SHA256, Wed 08 Feb 2012 05:46:09 PM CET, Key ID 50e94c991aca3465
Build Date  : Wed 08 Feb 2012 05:27:13 PM CET
Build Host  : x86-11.phx2.fedoraproject.org
Comment 11 Gerd v. Egidy 2013-03-03 17:50:01 EST
Next one, tried it in another big dir hierarchy. Can reproduce it within minutes now.

#0  0x00007fb88cfd6950 in __write_nocancel () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000000000408106 in gam_client_conn_write (source=<optimized out>, fd=20, data=data@entry=0x7fff0bf90350, len=len@entry=51) at gam_channel.c:826
#2  0x0000000000408b6d in gam_send_ack (conn=conn@entry=0xf3e9d0, reqno=1009, path=path@entry=0xff1230 "##OTHERDIR##", len=len@entry=41) at gam_connection.c:686
#3  0x0000000000409020 in gam_connection_request (req=0xf3ee02, conn=0xf3e9d0) at gam_connection.c:411
#4  gam_connection_data (conn=conn@entry=0xf3e9d0, len=<optimized out>) at gam_connection.c:509
#5  0x0000000000407adf in gam_client_conn_read (source=0xf8f980, condition=<optimized out>, info=0xf3e9d0) at gam_channel.c:283
#6  0x00007fb88d2f0825 in g_main_dispatch (context=0xf2f4d0) at gmain.c:2539
#7  g_main_context_dispatch (context=context@entry=0xf2f4d0) at gmain.c:3075
#8  0x00007fb88d2f0b58 in g_main_context_iterate (context=0xf2f4d0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3146
#9  0x00007fb88d2f0f52 in g_main_loop_run (loop=0xf31160) at gmain.c:3340
#10 0x00000000004035e6 in main (argc=<optimized out>, argv=<optimized out>) at gam_server.c:647

##OTHERDIR## again is a dir that I did not open. This time it is below the current one.
Comment 12 Sandro Mani 2013-03-03 17:54:34 EST
This is not the deadlock reported here, so it's better to file a new bug. While I'm at it, when gam_server is hanging, who is the file descriptor fd=20? Look at 

ls -l /proc/$(pidof gam_server)/fd
Comment 13 Gerd v. Egidy 2013-03-04 17:29:30 EST
Ok, I filed a new bug for this: #917848